Unicode::LineBreak(3) User Contributed Perl Documentation Unicode::LineBreak(3)

NAME
       Unicode::LineBreak - UAX	#14 Unicode Line Breaking Algorithm

SYNOPSIS
	   use Unicode::LineBreak;
	   $lb = Unicode::LineBreak->new();
	   $broken = $lb->break($string);

DESCRIPTION
       Unicode::LineBreak performs the Line Breaking Algorithm described in
       Unicode Standard Annex #14 [UAX #14].  The East_Asian_Width informative
       property defined by Annex #11 [UAX #11] is also taken into account when
       determining breaking positions.
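
       As a minimal sketch of typical use, the following wraps a short
       string.  The sample text and the ColMax value of 20 are chosen here
       only for illustration; the options are described under "Options".

```perl
use strict;
use warnings;
use Unicode::LineBreak;

# Illustrative input; any Unicode string may be given.
my $text = "The quick brown fox jumps over the lazy dog";

# Allow at most 20 columns per line (see the ColMax option).
my $lb = Unicode::LineBreak->new(ColMax => 20);

# Scalar context: a single string with newlines inserted at the breaks.
my $broken = $lb->break($text);

# List context: the lines of the result, stringified here for convenience.
my @lines = map { "$_" } $lb->break($text);

print $broken, "\n";
```

       Trailing SPACEs are not counted against ColMax, so a line may carry
       a SPACE beyond the limit before its newline.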

   Terminology
       The following terms are used for convenience.

       A mandatory break is an obligatory line break defined by the core
       rules and performed regardless of surrounding characters.  An
       arbitrary break is a line break that the core rules allow and that the
       user may choose to perform.  Arbitrary breaks include the direct
       breaks and indirect breaks defined by [UAX #14].

       Alphabetic characters are characters that usually do not allow line
       breaks between pairs of them, unless other characters provide break
       opportunities.  Ideographic characters are characters that usually
       allow line breaks both before and after themselves.  [UAX #14]
       classifies most alphabetic characters as AL and most ideographic
       characters as ID (these terms are inaccurate from the point of view of
       grammatology).  In several scripts, breaking positions cannot be
       determined from individual characters, so a dictionary-based heuristic
       is used.

       The number of columns of a string is not always equal to the number of
       characters it contains: each character is either wide, narrow or
       nonspacing, occupying 2, 1 or 0 columns, respectively.  Some
       characters may be either wide or narrow depending on the context in
       which they are used.  Customization may give characters other widths.
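
       The difference between characters and columns can be checked with
       the "columns" method of Unicode::GCString.  This is only a sketch; the
       sample strings are chosen for illustration.

```perl
use strict;
use warnings;
use Unicode::GCString;

# Three narrow letters: 3 characters, 3 columns.
my $narrow = Unicode::GCString->new("abc");

# Two CJK ideographs (U+65E5, U+672C): 2 characters, 4 columns.
my $wide = Unicode::GCString->new("\x{65E5}\x{672C}");

# "e" followed by the nonspacing U+0301 COMBINING ACUTE ACCENT:
# 2 characters, but the accent occupies no column of its own.
my $combined = Unicode::GCString->new("e\x{0301}");

printf "%d %d %d\n",
    $narrow->columns, $wide->columns, $combined->columns;
```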

PUBLIC INTERFACE
   Line	Breaking
       new ([KEY => VALUE, ...])
           Constructor.  For KEY => VALUE pairs, see "Options".

       break (STRING)
           Instance method.  Breaks the Unicode string STRING and returns the
           result.  In array context, returns an array of the lines contained
           in the result.

       break_partial (STRING)
           Instance method.  Same as break() but accepts input incrementally.
           Pass "undef" as the STRING argument to signal that the input is
           complete.
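
           A sketch of incremental use follows.  The chunk boundaries and
           the ColMax value are arbitrary choices made for illustration.

```perl
use strict;
use warnings;
use Unicode::LineBreak;

my $lb = Unicode::LineBreak->new(ColMax => 20);

# Input arriving in pieces, e.g. read from a stream block by block.
my @chunks = ("The quick brown ", "fox jumps over ", "the lazy dog");

my $out = '';
for my $chunk (@chunks) {
    # Collect whatever output is available so far (it may be empty).
    my $part = $lb->break_partial($chunk);
    $out .= $part if defined $part;
}

# undef signals the end of input, so that the remainder is returned.
my $rest = $lb->break_partial(undef);
$out .= $rest if defined $rest;

print $out, "\n";
```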

       config (KEY)
       config (KEY => VALUE, ...)
           Instance method.  Gets or updates configuration.  For KEY => VALUE
           pairs, see "Options".

       copy
           Copy constructor.  Creates a copy of the object instance.
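
           For instance, config() and copy() might be combined as in the
           sketch below, which assumes that a copy is configured
           independently of the object it was made from.  The values are
           illustrative.

```perl
use strict;
use warnings;
use Unicode::LineBreak;

my $lb = Unicode::LineBreak->new(ColMax => 40);

# Read a single option back.
my $colmax = $lb->config('ColMax');

# Make a copy and change the copy only.
my $narrow = $lb->copy;
$narrow->config(ColMax => 20);

printf "original: %d, copy: %d\n",
    $lb->config('ColMax'), $narrow->config('ColMax');
```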

   Getting Information
       breakingRule (BEFORESTR,	AFTERSTR)
           Instance method.  Gets the possible line breaking behavior between
           strings BEFORESTR and AFTERSTR.  See "Constants" for the returned
           value.

           Note: This method gives only an approximate description of line
           breaking behavior.  Use break() and related methods to wrap actual
           text.
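
           The sketch below queries a few character pairs.  It assumes that
           the constants can be called by their fully qualified names in the
           Unicode::LineBreak package and that the results follow the pair
           table of [UAX #14]; the sample characters are illustrative.

```perl
use strict;
use warnings;
use Unicode::LineBreak;

my $lb = Unicode::LineBreak->new();

# Two Latin letters (class AL): a break is expected only if SPACEs
# intervene, i.e. an indirect break.
my $latin = $lb->breakingRule("a", "b");

# Two hiragana letters U+3042 and U+3044 (class ID): a direct break
# is expected to be allowed.
my $kana = $lb->breakingRule("\x{3042}", "\x{3044}");

# A letter followed by "!" (class EX): a break is expected to be
# prohibited.
my $excl = $lb->breakingRule("a", "!");

print "latin: indirect\n"  if $latin == Unicode::LineBreak::INDIRECT();
print "kana: direct\n"     if $kana  == Unicode::LineBreak::DIRECT();
print "excl: prohibited\n" if $excl  == Unicode::LineBreak::PROHIBITED();
```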

       context ([Charset => CHARSET], [Language	=> LANGUAGE])
           Function.  Gets the language/region context used by character set
           CHARSET or language LANGUAGE.

   Options
       The "new" and "config" methods accept the following pairs.  Some of
       them affect the number of columns ([E]), grapheme cluster segmentation
       ([G]) (see also Unicode::GCString) or line breaking behavior ([L]).

       BreakIndent => "YES" | "NO"
           [L] Always allows a break after SPACEs at the beginning of a line,
           a.k.a. indent.  [UAX #14] does not take such usage of SPACE into
           account.  Default is "YES".

           Note: This option was introduced in release 1.011.

       CharMax => NUMBER
           [L] Maximum possible number of characters in one line, not counting
           trailing SPACEs and the newline sequence.  Note that the number of
           characters generally does not represent the length of a line.
           Default is 998.  0 means unlimited (as of release 2012.01).

       ColMin => NUMBER
           [L] Minimum number of columns that an arbitrarily broken line may
           include, not counting trailing SPACEs and the newline sequence.
           Default is 0.

       ColMax => NUMBER
           [L] Maximum number of columns a line may include, not counting
           trailing SPACEs and the newline sequence.  In other words, the
           maximum length of a line.  Default is 76.

       See also the "Urgent" option and "User-Defined Breaking Behaviors".

       ComplexBreaking => "YES"	| "NO"
           [L] Performs heuristic breaking on South East Asian complex
           context.  Default is "YES" if word segmentation for South East
           Asian writing systems is enabled.

       Context => CONTEXT
           [E][L] Specifies the language/region context.  Currently available
           contexts are "EASTASIAN" and "NONEASTASIAN".  Default context is
           "NONEASTASIAN".

           In "EASTASIAN" context, characters whose East_Asian_Width property
           is ambiguous (A) are treated as "wide", and characters of Line
           Breaking Class AI are treated as ideographic (ID).

           In "NONEASTASIAN" context, characters whose East_Asian_Width
           property is ambiguous (A) are treated as "narrow", and characters
           of Line Breaking Class AI are treated as alphabetic (AL).
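
           The effect on widths can be observed as in this sketch, which
           measures U+03B1 GREEK SMALL LETTER ALPHA (an ambiguous-width
           character, chosen for illustration) by passing the
           Unicode::LineBreak object to the Unicode::GCString constructor.

```perl
use strict;
use warnings;
use Unicode::LineBreak;
use Unicode::GCString;

# U+03B1 has East_Asian_Width ambiguous (A).
my $alpha = "\x{03B1}";

my $ea  = Unicode::LineBreak->new(Context => 'EASTASIAN');
my $nea = Unicode::LineBreak->new(Context => 'NONEASTASIAN');

# Measure the same character under each context.
my $ea_cols  = Unicode::GCString->new($alpha, $ea)->columns;
my $nea_cols = Unicode::GCString->new($alpha, $nea)->columns;

print "EASTASIAN: $ea_cols, NONEASTASIAN: $nea_cols\n";
```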

       EAWidth => "[" ORD "=>" PROPERTY	"]"
       EAWidth => "undef"
           [E] Tailors classification of the East_Asian_Width property.  ORD
           is the UCS scalar value of a character or an array reference of
           such values.  PROPERTY is one of the East_Asian_Width property
           values and extended values (see "Constants").  This option may be
           specified multiple times.  If "undef" is specified, all tailorings
           assigned so far are canceled.

           By default, no tailorings are applied.  See also "Tailoring
           Character Properties".
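
           A sketch of tailoring by explicit code points follows.  The
           choice of characters is illustrative, and the constant is called
           by its fully qualified name on the assumption that it is defined
           in the Unicode::LineBreak package.

```perl
use strict;
use warnings;
use Unicode::LineBreak;
use Unicode::GCString;

# In EASTASIAN context ambiguous-width characters are wide by default.
# Here U+03B1 and U+03B2 are tailored to neutral (N), hence narrow.
my $lb = Unicode::LineBreak->new(
    Context => 'EASTASIAN',
    EAWidth => [ [0x03B1, 0x03B2] => Unicode::LineBreak::EA_N() ],
);

my $alpha = Unicode::GCString->new("\x{03B1}", $lb)->columns;
my $beta  = Unicode::GCString->new("\x{03B2}", $lb)->columns;

# U+03B3 is not tailored and keeps its ambiguous (here: wide) width.
my $gamma = Unicode::GCString->new("\x{03B3}", $lb)->columns;

print "alpha: $alpha, beta: $beta, gamma: $gamma\n";
```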

       Format => METHOD
           [L] Specifies the method to format broken lines.

           "SIMPLE"
               Default method.  Only inserts a newline at arbitrary breaking
               positions.

           "NEWLINE"
               Inserts newline sequences, or replaces existing ones, with
               the one specified by the "Newline" option, and removes SPACEs
               preceding newline sequences or end-of-text.  Then appends a
               newline at the end of text if there is none.

           "TRIM"
               Inserts a newline at arbitrary breaking positions.  Removes
               SPACEs preceding newline sequences.

           "undef"
               Does nothing, not even inserting newlines.

	   Subroutine reference
	       See "Formatting Lines".

       HangulAsAL => "YES" | "NO"
	   [L] Treat hangul syllables and conjoining jamos as alphabetic
	   characters (AL).  Default is	"NO".

       LBClass => "[" ORD "=>" CLASS "]"
       LBClass => "undef"
           [G][L] Tailors classification of the line breaking property.  ORD
           is the UCS scalar value of a character or an array reference of
           such values.  CLASS is one of the line breaking classes (see
           "Constants").  This option may be specified multiple times.  If
           "undef" is specified, all tailorings assigned so far are canceled.

           By default, no tailorings are applied.  See also "Tailoring
           Character Properties".
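
           As a sketch, the following tailors U+3063 HIRAGANA LETTER SMALL TU
           (a non-starter by default) to ID and compares the result of
           breakingRule() before and after.  The character is chosen for
           illustration, and the constants are called by fully qualified
           names on the assumption that they are defined in the
           Unicode::LineBreak package.

```perl
use strict;
use warnings;
use Unicode::LineBreak;

my $default  = Unicode::LineBreak->new();
my $tailored = Unicode::LineBreak->new(
    LBClass => [ 0x3063 => Unicode::LineBreak::LB_ID() ],
);

# Behavior between U+3042 HIRAGANA LETTER A and the small TU.
my $before = $default->breakingRule("\x{3042}", "\x{3063}");
my $after  = $tailored->breakingRule("\x{3042}", "\x{3063}");

# With the tailoring, a direct break before the small letter is
# expected to become possible.
print "direct break allowed after tailoring\n"
    if $before != Unicode::LineBreak::DIRECT()
    and $after == Unicode::LineBreak::DIRECT();
```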

       LegacyCM	=> "YES" | "NO"
           [G][L] Treats a combining character preceded by a SPACE as an
           isolated combining character (ID).  As of Unicode 5.0, such use of
           SPACE is not recommended.  Default is "YES".

       Newline => STRING
	   [L] Unicode string to be used for newline sequence.	Default	is
	   "\n".
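
           The "Format" and "Newline" options may be combined as in the
           sketch below, which folds a sample string with CRLF line endings.
           The text and the ColMax value are illustrative.

```perl
use strict;
use warnings;
use Unicode::LineBreak;

my $lb = Unicode::LineBreak->new(
    ColMax  => 20,
    Format  => 'NEWLINE',   # normalize newlines, drop trailing SPACEs
    Newline => "\r\n",      # newline sequence to insert
);

my $out = $lb->break("The quick brown fox jumps over the lazy dog");

# Show the result with the line endings made visible.
(my $visible = $out) =~ s/\r\n/<CRLF>\n/g;
print $visible;
```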

       Prep => METHOD
           [L] Adds user-defined line breaking behavior(s).  This option may
           be specified multiple times.  The following methods are available.

	   "NONBREAKURI"
               Does not break URIs.

	   "BREAKURI"
	       Break URIs according to a rule suitable for printed materials.
	       For more	details	see [CMOS], sections 6.17 and 17.11.

	   "[" REGEX, SUBREF "]"
               Sequences matching the regular expression REGEX are broken by
               the subroutine referenced by SUBREF.  For more details see
               "User-Defined Breaking Behaviors".

	   "undef"
	       Cancel all methods assigned before.
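
           For example, the sketch below keeps a URL in one piece although
           it is longer than ColMax.  The URL and the surrounding text are
           made up for illustration.

```perl
use strict;
use warnings;
use Unicode::LineBreak;

# Hypothetical URL, longer than the line limit used below.
my $url  = "http://www.example.com/a/rather/long/path/to/some/page.html";
my $text = "See $url for details.";

my $lb  = Unicode::LineBreak->new(ColMax => 30, Prep => 'NONBREAKURI');
my $out = $lb->break($text);

print $out, "\n";
```

           Because "Urgent" is left at its default here, the over-long line
           containing the URL is not broken forcibly.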

       Sizing => METHOD
           [L] Specifies the method to calculate the size of a string.  The
           following options are available.

           "UAX11"
               Default method.  Sizes are computed from the columns of each
               character according to the built-in character database.

	   "undef"
	       Number of grapheme clusters (see	Unicode::GCString) contained
	       in the string.

	   Subroutine reference
	       See "Calculating	String Size".

	   See also "ColMax", "ColMin" and "EAWidth" options.

       Urgent => METHOD
           [L] Specifies the method to handle over-long lines.  The following
           options are available.

           "CROAK"
               Prints an error message and dies.

           "FORCE"
               Forcibly breaks the over-long fragment.

           "undef"
               Default method.  Does not break the over-long fragment.

	   Subroutine reference
	       See "User-Defined Breaking Behaviors".
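
           A sketch of the "FORCE" method follows: a word without any break
           opportunity is cut so that no line exceeds ColMax.  The word and
           the limit are illustrative.

```perl
use strict;
use warnings;
use Unicode::LineBreak;

# 35 letters with no break opportunity between them.
my $word = "a" x 35;

my $lb = Unicode::LineBreak->new(ColMax => 10, Urgent => 'FORCE');

# List context returns the lines; stringify them for convenience.
my @lines = map { "$_" } $lb->break($word);

print scalar(@lines), " lines\n";
```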

       ViramaAsJoiner => "YES" | "NO"
           [G] A virama sign ("halant" in Hindi, "coeng" in Khmer) and the
           letter following it are not separated.  Default is "YES".  Note:
           This option was introduced in release 2012.001_29.  In earlier
           releases, the behavior was fixed to "NO".  The "default" grapheme
           cluster defined by [UAX #29] does not include this feature.

   Constants
       "EA_Na",	"EA_N",	"EA_A",	"EA_W",	"EA_H",	"EA_F"
	   Index values	to specify six East_Asian_Width	property values
	   defined by [UAX #11]: narrow	(Na), neutral (N), ambiguous (A), wide
	   (W),	halfwidth (H) and fullwidth (F).

       "EA_Z"
	   Index value to specify nonspacing characters.

           Note: This "nonspacing" value is an extension of this module, not
           a part of [UAX #11].

       "LB_BK",	"LB_CR", "LB_LF", "LB_NL", "LB_SP", "LB_OP", "LB_CL", "LB_CP",
       "LB_QU",	"LB_GL", "LB_NS", "LB_EX", "LB_SY", "LB_IS", "LB_PR", "LB_PO",
       "LB_NU",	"LB_AL", "LB_HL", "LB_ID", "LB_IN", "LB_HY", "LB_BA", "LB_BB",
       "LB_B2",	"LB_CB", "LB_ZW", "LB_CM", "LB_WJ", "LB_H2", "LB_H3", "LB_JL",
       "LB_JV",	"LB_JT", "LB_SG", "LB_AI", "LB_CJ", "LB_SA", "LB_XX", "LB_RI"
	   Index values	to specify 40 line breaking property values (classes)
	   defined by [UAX #14].

           Note: Property value CP was introduced in Unicode 5.2.0.  Property
           values HL and CJ were introduced in Unicode 6.1.0.  Property value
           RI was introduced in Unicode 6.2.0.

       "MANDATORY", "DIRECT", "INDIRECT", "PROHIBITED"
           Four values to specify line breaking behaviors, respectively:
           mandatory break; both direct break and indirect break are allowed;
           indirect break is allowed but direct break is prohibited; break is
           prohibited.

       "Unicode::LineBreak::SouthEastAsian::supported"
           Flag to determine whether word segmentation for South East Asian
           writing systems is enabled.  If this feature is enabled, it is set
           to a non-empty string.  Otherwise, it is "undef".

           N.B.: The current release supports the Thai script of the modern
           Thai language only.

       "UNICODE_VERSION"
           A string specifying the version of the Unicode Standard this
           module refers to.

CUSTOMIZATION
   Formatting Lines
       If you specify a subroutine reference as the value of the "Format"
       option, it should accept three arguments:

	   $MODIFIED = &subroutine(SELF, EVENT,	STR);

       SELF is a Unicode::LineBreak object, EVENT is a string identifying the
       context in which the subroutine was called, and STR is a fragment of
       the Unicode string preceding or following the breaking position.

           EVENT |When Fired             |Value of STR
           -----------------------------------------------------------------
           "sot" |Beginning of text      |Fragment of first line
           "sop" |After mandatory break  |Fragment of next line
           "sol" |After arbitrary break  |Fragment of the continued line
           ""    |Just before any break  |Complete line without trailing
                 |                       |SPACEs
           "eol" |Arbitrary break        |SPACEs preceding breaking position
           "eop" |Mandatory break        |Newline and its preceding SPACEs
           "eot" |End of text            |SPACEs (and newline) at end of
                 |                       |text
           -----------------------------------------------------------------

       The subroutine should return the modified text fragment, or may return
       "undef" to express that no modification occurred.  Note that a
       modification in the context of "sot", "sop" or "sol" may affect the
       decision of subsequent breaking positions, while a modification in the
       other contexts will not.

       Note: String arguments are actually sequences of	grapheme clusters.
       See Unicode::GCString.

       For example, the following code folds lines, removing trailing spaces:

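           use Unicode::LineBreak;

           sub fmt {
               # At any end-of-line event ("eol", "eop" or "eot"), replace
               # the trailing SPACEs (and newline) with a bare newline.
               if ($_[1] =~ /^eo/) {
                   return "\n";
               }
               # Leave all other fragments unmodified.
               return undef;
           }
           my $lb = Unicode::LineBreak->new(Format => \&fmt);
           $output = $lb->break($text);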

   User-Defined	Breaking Behaviors
       When a line produced by an arbitrary break would exceed the limit set
       by CharMax, ColMax or ColMin, an urgent break may be performed on the
       following string.  If you specify a subroutine reference as the value
       of the "Urgent" option, it should accept two arguments:

	   @BROKEN = &subroutine(SELF, STR);

       SELF is a Unicode::LineBreak object and STR is a	Unicode	string to be
       broken.

       The subroutine should return an array of the fragments into which STR
       is broken.

       Note: String argument is	actually a sequence of grapheme	clusters.  See
       Unicode::GCString.

       For example, the following code inserts hyphens into the names of
       certain chemical substances (such as Titin) so that they may be folded:

	   sub hyphenize {
	       return map {$_ =~ s/yl$/yl-/; $_} split /(\w+?yl(?=\w))/, $_[1];
	   }
	   my $lb = Unicode::LineBreak->new(Urgent => \&hyphenize);
	   $output = $lb->break("Methionylthreonylthreonylglutaminylarginyl...");

       If you specify a [REGEX, SUBREF] array reference as a value of the
       "Prep" option, the subroutine should accept two arguments:

	   @BROKEN = &subroutine(SELF, STR);

       SELF is a Unicode::LineBreak object and STR is a	Unicode	string matched
       with REGEX.

       The subroutine should return an array of the fragments into which STR
       is broken.

       For example, the following code breaks HTTP URLs using the [CMOS] rule:

	   my $url = qr{http://[\x21-\x7E]+}i;
	   sub breakurl	{
	       my $self	= shift;
	       my $str = shift;
	       return split m{(?<=[/]) (?=[^/])	|
			      (?<=[^-.]) (?=[-~.,_?\#%=&]) |
			      (?<=[=&])	(?=.)}x, $str;
	   }
	   my $lb = Unicode::LineBreak->new(Prep => [$url, \&breakurl]);
	   $output = $lb->break($string);

       Preserving State

       A Unicode::LineBreak object can behave as a hash reference.  Any items
       stored in it are preserved throughout its life.

       For example, the following code separates paragraphs with empty lines:

	   sub paraformat {
	       my $self	= shift;
	       my $action = shift;
	       my $str = shift;

	       if ($action eq 'sot' or $action eq 'sop') {
		   $self->{'line'} = '';
	       } elsif ($action	eq '') {
		   $self->{'line'} = $str;
	       } elsif ($action	eq 'eol') {
		   return "\n";
	       } elsif ($action	eq 'eop') {
		   if (length $self->{'line'}) {
		       return "\n\n";
		   } else {
		       return "\n";
		   }
	       } elsif ($action	eq 'eot') {
		   return "\n";
	       }
	       return undef;
	   }
	   my $lb = Unicode::LineBreak->new(Format => \&paraformat);
	   $output = $lb->break($string);

   Calculating String Size
       If you specify a subroutine reference as the value of the "Sizing"
       option, it will be called with five arguments:

	   $COLS = &subroutine(SELF, LEN, PRE, SPC, STR);

       SELF is a Unicode::LineBreak object, LEN is the size of the preceding
       string, PRE is the preceding Unicode string, SPC is additional SPACEs
       and STR is the Unicode string to be processed.

       The subroutine should return the calculated number of columns of
       "PRE.SPC.STR".  The number of columns need not be an integer: the unit
       may be chosen freely, but it should be the same as that of the
       "ColMin" and "ColMax" options.

       Note: String arguments are actually sequences of	grapheme clusters.
       See Unicode::GCString.

       For example, the following code processes lines with tab stops every
       eight columns:

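           use Unicode::LineBreak;

           sub tabbedsizing {
               my ($self, $cols, $pre, $spc, $str) = @_;

               # Walk through the leading SPACE-class characters one by one.
               my $spcstr = $spc.$str;
               while ($spcstr->lbc == Unicode::LineBreak::LB_SP()) {
                   my $c = $spcstr->item(0);
                   if ($c eq "\t") {
                       # Advance to the next tab stop (every eight columns).
                       $cols += 8 - $cols % 8;
                   } else {
                       $cols += $c->columns;
                   }
                   $spcstr = $spcstr->substr(1);
               }
               # Add the width of the remaining part.
               $cols += $spcstr->columns;
               return $cols;
           }
           my $lb = Unicode::LineBreak->new(
               LBClass => [ord("\t") => Unicode::LineBreak::LB_SP()],
               Sizing => \&tabbedsizing);
           $output = $lb->break($string);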

   Tailoring Character Properties
       Character properties may	be tailored by "LBClass" and "EAWidth"
       options.	 Some constants	are defined for	convenience of tailoring.

       Line Breaking Properties

       Non-starters of Kana-like Characters

       By default, several hiragana, katakana and characters corresponding to
       kana are treated as non-starters (NS or CJ).  When the following
       pair(s) are specified as the value of the "LBClass" option, these
       characters are treated as normal ideographic characters (ID).

       "KANA_NONSTARTERS() => LB_ID"
           All of the characters below.

       "IDEOGRAPHIC_ITERATION_MARKS() => LB_ID"
	   Ideographic iteration marks.	 U+3005	IDEOGRAPHIC ITERATION MARK,
	   U+303B VERTICAL IDEOGRAPHIC ITERATION MARK, U+309D HIRAGANA
	   ITERATION MARK, U+309E HIRAGANA VOICED ITERATION MARK, U+30FD
	   KATAKANA ITERATION MARK and U+30FE KATAKANA VOICED ITERATION	MARK.

	   N.B.	Some of	them are neither hiragana nor katakana.

       "KANA_SMALL_LETTERS() =>	LB_ID"
       "KANA_PROLONGED_SOUND_MARKS() =>	LB_ID"
	   Hiragana or katakana	small letters: Hiragana	small letters U+3041
	   A, U+3043 I,	U+3045 U, U+3047 E, U+3049 O, U+3063 TU, U+3083	YA,
	   U+3085 YU, U+3087 YO, U+308E	WA, U+3095 KA, U+3096 KE.  Katakana
	   small letters U+30A1	A, U+30A3 I, U+30A5 U, U+30A7 E, U+30A9	O,
	   U+30C3 TU, U+30E3 YA, U+30E5	YU, U+30E7 YO, U+30EE WA, U+30F5 KA,
	   U+30F6 KE.  Katakana	phonetic extensions U+31F0 KU -	U+31FF RO.
	   Halfwidth katakana small letters U+FF67 A - U+FF6F TU.

	   Hiragana or katakana	prolonged sound	marks: U+30FC KATAKANA-
	   HIRAGANA PROLONGED SOUND MARK and U+FF70 HALFWIDTH KATAKANA-
	   HIRAGANA PROLONGED SOUND MARK.

           N.B. These letters are optionally treated either as non-starters
           or as normal ideographic characters.  See [JIS X 4051] 6.1.1,
           [JLREQ] 3.1.7 or [UAX #14].

	   N.B.	U+3095,	U+3096,	U+30F5,	U+30F6 are considered to be neither
	   hiragana nor	katakana.

       "MASU_MARK() => LB_ID"
	   U+303C MASU MARK.

           N.B. Although this character is not kana, it is usually regarded
           as an abbreviation of the sequence of hiragana ます or katakana
           マス, MA and SU.

	   N.B.	This character is classified as	non-starter (NS) by [UAX #14]
	   and as the class corresponding to ID	by [JIS	X 4051]	and [JLREQ].

       Ambiguous Quotation Marks

       By default, some punctuation marks are classified as ambiguous
       quotation marks (QU).

       "BACKWARD_QUOTES() => LB_OP, FORWARD_QUOTES() =>	LB_CL"
           Some languages (Dutch, English, Italian, Portuguese, Spanish,
           Turkish and most East Asian) use rotated-9-style punctuations (‘ “)
           as opening and 9-style punctuations (’ ”) as closing quotation
           marks.

       "FORWARD_QUOTES() => LB_OP, BACKWARD_QUOTES() =>	LB_CL"
           Some others (Czech, German and Slovak) use 9-style punctuations
           (’ ”) as opening and rotated-9-style punctuations (‘ “) as closing
           quotation marks.

       "BACKWARD_GUILLEMETS() => LB_OP,	FORWARD_GUILLEMETS() =>	LB_CL"
           French, Greek, Russian etc. use left-pointing guillemets (« ‹) as
           opening and right-pointing guillemets (» ›) as closing quotation
           marks.

       "FORWARD_GUILLEMETS() =>	LB_OP, BACKWARD_GUILLEMETS() =>	LB_CL"
           German and Slovak use right-pointing guillemets (» ›) as opening
           and left-pointing guillemets (« ‹) as closing quotation marks.

       Danish, Finnish, Norwegian and Swedish use 9-style or right-pointing
       punctuations (’ ” » ›) as both opening and closing quotation marks.

       IDEOGRAPHIC SPACE

       "IDEOGRAPHIC_SPACE() => LB_BA"
	   U+3000 IDEOGRAPHIC SPACE won't be placed at beginning of line.
	   This	is default behavior.

       "IDEOGRAPHIC_SPACE() => LB_ID"
           IDEOGRAPHIC SPACE can be placed at beginning of line.  This was
           the default behavior in Unicode 6.2 and earlier.

       "IDEOGRAPHIC_SPACE() => LB_SP"
	   IDEOGRAPHIC SPACE won't be placed at	beginning of line, and will
	   protrude from end of	line.

       East_Asian_Width	Properties

       Some particular letters of the Latin, Greek and Cyrillic scripts have
       the ambiguous (A) East_Asian_Width property.  Thus, these characters
       are treated as wide in "EASTASIAN" context.  When "EAWidth => [
       AMBIGUOUS_*() => EA_N ]" is specified, those characters are always
       treated as narrow.

       "AMBIGUOUS_ALPHABETICS()	=> EA_N"
           Treat all of the characters below as East_Asian_Width neutral (N).

       "AMBIGUOUS_CYRILLIC() =>	EA_N"
       "AMBIGUOUS_GREEK() => EA_N"
       "AMBIGUOUS_LATIN() => EA_N"
           Treat letters of the Cyrillic, Greek and Latin scripts that have
           ambiguous (A) width as neutral (N).

       On the other hand, although several characters were occasionally
       rendered as wide characters by a number of implementations for East
       Asian character sets, they are given the narrow (Na) East_Asian_Width
       property just because they have fullwidth (F) compatibility
       characters.  When "EAWidth" is specified as below, those characters
       are treated as ambiguous, that is, wide in "EASTASIAN" context.

       "QUESTIONABLE_NARROW_SIGNS() => EA_A"
	   U+00A2 CENT SIGN, U+00A3 POUND SIGN,	U+00A5 YEN SIGN	(or yuan
	   sign), U+00A6 BROKEN	BAR, U+00AC NOT	SIGN, U+00AF MACRON.

   Configuration File
       Built-in defaults of option parameters for the "new" and "config"
       methods can be overridden by a configuration file,
       Unicode/LineBreak/Defaults.pm.  For more details read
       Unicode/LineBreak/Defaults.pm.sample.

BUGS
       Please report bugs or buggy behaviors to the developer.

       CPAN Request Tracker:
       <http://rt.cpan.org/Public/Dist/Display.html?Name=Unicode-LineBreak>.

VERSION
       Consult the $VERSION variable.

   Incompatible	Changes
       Release 2012.06
	   o   eawidth() method	was deprecated.	 "columns" in
	       Unicode::GCString may be	used instead.

	   o   lbclass() method	was deprecated.	 Use "lbc" in
	       Unicode::GCString or "lbcext" in	Unicode::GCString.

   Conformance to Standards
       Character properties this module	is based on are	defined	by Unicode
       Standard	version	8.0.0.

       This module is intended to implement UAX14-C2.

IMPLEMENTATION NOTES
       o   Some	ideographic characters may be treated either as	NS or as ID by
	   choice.

       o   Hangul syllables and	conjoining jamos may be	treated	as either ID
	   or AL by choice.

       o   Characters assigned to AI may be resolved to	either AL or ID	by
	   choice.

       o   Character(s)	assigned to CB are not resolved.

       o   Characters assigned to CJ are always	resolved to NS.	 More flexible
	   tailoring mechanism is provided.

       o   When word segmentation for South East Asian writing systems is not
           supported, characters assigned to SA are resolved to AL, except
           that characters that have Grapheme_Cluster_Break property value
           Extend or SpacingMark are resolved to CM.

       o   Characters assigned to SG or	XX are resolved	to AL.

       o   Code points of the following UCS ranges are given fixed property
           values even if they have not been assigned any characters.

	       Ranges		  | UAX	#14    | UAX #11    | Description
	       -------------------------------------------------------------
	       U+20A0..U+20CF	  | PR [*1]    | N [*2]	    | Currency symbols
	       U+3400..U+4DBF	  | ID	       | W	    | CJK ideographs
	       U+4E00..U+9FFF	  | ID	       | W	    | CJK ideographs
	       U+D800..U+DFFF	  | AL (SG)    | N	    | Surrogates
	       U+E000..U+F8FF	  | AL (XX)    | F or N	(A) | Private use
	       U+F900..U+FAFF	  | ID	       | W	    | CJK ideographs
	       U+20000..U+2FFFD	  | ID	       | W	    | CJK ideographs
	       U+30000..U+3FFFD	  | ID	       | W	    | Old hanzi
	       U+F0000..U+FFFFD	  | AL (XX)    | F or N	(A) | Private use
	       U+100000..U+10FFFD | AL (XX)    | F or N	(A) | Private use
	       Other unassigned	  | AL (XX)    | N	    | Unassigned,
				  |	       |	    | reserved or
				  |	       |	    | noncharacters
	       -------------------------------------------------------------
	       [*1] Except U+20A7 PESETA SIGN (PO),
		 U+20B6	LIVRE TOURNOIS SIGN (PO), U+20BB NORDIC	MARK SIGN (PO)
		 and U+20BE LARI SIGN (PO).
	       [*2] Except U+20A9 WON SIGN (H) and U+20AC EURO SIGN
		 (F or N (A)).

       o   Characters belonging	to General Category Mn,	Me, Cc,	Cf, Zl or Zp
	   are treated as nonspacing by	this module.

REFERENCES
       [CMOS]
	   The Chicago Manual of Style,	15th edition.  University of Chicago
	   Press, 2003.

       [JIS X 4051]
           JIS X 4051:2004 日本語文書の組版方法 (Formatting Rules for
           Japanese Documents).  Japanese Standards Association, 2004.

       [JLREQ]
	   Anan, Yasuhiro et al.  Requirements for Japanese Text Layout, W3C
	   Working Group Note 3	April 2012.
	   <http://www.w3.org/TR/2012/NOTE-jlreq-20120403/>.

       [UAX #11]
	   A. Freytag (ed.) (2008-2009).  Unicode Standard Annex #11: East
	   Asian Width,	Revisions 17-19.  <http://unicode.org/reports/tr11/>.

       [UAX #14]
	   A. Freytag and A. Heninger (eds.) (2008-2015).  Unicode Standard
	   Annex #14: Unicode Line Breaking Algorithm, Revisions 22-35.
	   <http://unicode.org/reports/tr14/>.

       [UAX #29]
	   Mark	Davis (ed.) (2009-2013).  Unicode Standard Annex #29: Unicode
	   Text	Segmentation, Revisions	15-23.
	   <http://www.unicode.org/reports/tr29/>.

SEE ALSO
       Text::LineFold, Text::Wrap, Unicode::GCString.

AUTHOR
       Copyright (C) 2009-2018 Hatuka*nezumi - IKEDA Soji
       <hatuka(at)nezumi.nu>.

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.32.0			  2018-03-29		 Unicode::LineBreak(3)
