Multibyte(3)	      User Contributed Perl Documentation	  Multibyte(3)

NAME
       String::Multibyte - manipulation	of multibyte character strings

SYNOPSIS
	   use String::Multibyte;

	   $utf8 = String::Multibyte->new('UTF8');
	   $utf8_len = $utf8->length($utf8_str);

DESCRIPTION
       This module provides some functions which emulate the corresponding
       "CORE" functions	for locale-independent manipulation of multiple-byte
       character strings.

       Why is this module locale-independent?  Because it considers only
       the byte sequence structure of charsets and is not aware of any
       locale.  Locale-dependent methods such as "uc()" and "lc()" are not
       supported at all.

   Definition of Multibyte Charsets
       The definition files are located under the directory where
       String::Multibyte is installed.  E.g. if String::Multibyte is
       "perl/site/lib/String/Multibyte.pm", install String::Multibyte::Foo as
       "perl/site/lib/String/Multibyte/Foo.pm".

       The definition file must return a hashref with the following keys.

       "charset"
           The value for the key 'charset' is a string naming the charset.
           In most cases omitting 'charset' matters very little, but if you
           give a name, make sure it does not conflict with another charset.

       "regexp"
           The value for the key 'regexp', REQUIRED, is a regular expression
           that matches a single character of the charset in question.  (You
           may use "qr//" if available.)

           If 'regexp' is omitted, calling any method croaks.

       "nextchar"
           The value for the key 'nextchar' must be a coderef that returns the
           character following the specified character.  If the 'nextchar'
           coderef is omitted, the "mkrange()" and "strtr()" methods do not
           treat the hyphen as a metacharacter for character ranges.

       "cmpchar"
           The value for the key 'cmpchar' must be a coderef that compares the
           two specified characters.  If the 'cmpchar' coderef is omitted,
           the "mkrange()" and "strtr()" methods do not support reverse
           character ranges.

       "hyphen"
	   The value for the key 'hyphen' is a character to stand for a
	   character range. The	default	is '-'.

       "escape"
	   The value for the key 'escape' is an	escape character for a
	   "hyphen" character. The default is '\\'.  The 'escape' character is
	   valid only before a "hyphen"	or another 'escape' (e.g. '\\\\-]'
	   means '\\' to ']'; '\\\\\-]'	means '\\', '-', and ']').  If an
	   'escape' character is followed by any character other than 'escape'
	   or 'hyphen',	it is parsed literally.
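
       The keys above can be pictured together in one hashref.  The
       following is only a sketch for a hypothetical single-byte charset
       called "MyBytes"; the name, the use of "undef" for "no next
       character" and the three-way result of 'cmpchar' are assumptions of
       this example, not statements about the bundled definition files.

```perl
use strict;
use warnings;

# Hypothetical definition for a single-byte charset named "MyBytes".
# In a definition file this hashref would be the value the file returns.
my $definition = {
    charset  => 'MyBytes',
    regexp   => '[\x00-\xFF]',          # one byte is one character
    # Assumption: undef signals that there is no next character.
    nextchar => sub {
        my $ch = shift;
        return $ch eq "\xFF" ? undef : chr(ord($ch) + 1);
    },
    # Assumption: a three-way comparison, like the "cmp" operator.
    cmpchar  => sub { $_[0] cmp $_[1] },
    hyphen   => '-',
    escape   => '\\',
};

print $definition->{nextchar}->('A'), "\n";       # prints "B"
print $definition->{cmpchar}->('A', 'B'), "\n";   # prints "-1"
```

       A hashref of this shape can also be passed directly to the
       constructor instead of being stored in a .pm file.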

   Constructor
       "$mbcs = String::Multibyte->new(CHARSET)"
       "$mbcs = String::Multibyte->new(CHARSET, VERBOSE)"
           "CHARSET" is the charset name; strictly speaking, the file name of
           the definition file (without the suffix .pm).  The constructor
           returns an instance that tells the methods in which charset the
           specified strings should be handled.

	   "CHARSET" may be a hashref; this is how to define a charset without
	   any .pm file.

	       # see perlfaq6  :-)
	       my $martian  = String::Multibyte->new({
		   charset => "martian",
		   regexp => '[A-Z][A-Z]|[^A-Z]',
	       });

           If a true value is specified as "VERBOSE", each method called
           (except "islegal") checks its arguments and carps if any of them
           is not legally encoded.

           Otherwise no such check is carried out.  This saves a bit of time
           but is unsafe, though you can call the "islegal" method yourself
           if necessary.

   Check Whether the String is Legal
       "$mbcs->islegal(LIST)"
           Returns a boolean indicating whether all the strings in "LIST"
           are legally encoded in the charset concerned.  Returns false if
           even one element of "LIST" is illegal.
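
           One way to picture the check, assuming that "legal" means "made
           up of nothing but characters matched by the 'regexp'": the
           helper "is_legal" below is hypothetical and is not the module's
           own code.

```perl
use strict;
use warnings;

# Hypothetical helper: a string is taken to be legal when it is nothing
# but a sequence of characters of the charset.
sub is_legal {
    my ($regexp, @strings) = @_;
    for my $str (@strings) {
        return '' unless $str =~ /\A(?:$regexp)*\z/;
    }
    return 1;
}

# The "martian" charset: two capitals, or one non-capital.
my $martian = '[A-Z][A-Z]|[^A-Z]';

print is_legal($martian, 'ABcd', 'xyZZ') ? "legal\n" : "illegal\n";  # legal
print is_legal($martian, 'ABcd', 'Abc')  ? "legal\n" : "illegal\n";  # illegal
```

           In the second call "Abc" is illegal because a lone capital is
           not a character of this charset, so the whole list is rejected.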

   Length
       "$mbcs->length(STRING)"
	   Returns the length in characters of the specified string.

   Reverse
       "$mbcs->strrev(STRING)"
	   Returns a reversed string in	characters.
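
           Length and reversal both operate on whole characters rather
           than bytes.  A sketch of that idea in plain Perl, with
           hypothetical helper names and assuming a legally encoded
           string:

```perl
use strict;
use warnings;

# Hypothetical helpers, not the module's code.  They assume the string is
# legally encoded, so that /$regexp/g visits every character in turn.
sub mb_chars  { my ($regexp, $str) = @_; return $str =~ /$regexp/g }
sub mb_length { my @chars = mb_chars(@_); return scalar @chars }
sub mb_strrev { return join '', reverse mb_chars(@_) }

my $martian = '[A-Z][A-Z]|[^A-Z]';   # two capitals, or one non-capital

print mb_length($martian, 'ABcDE'), "\n";   # prints 3 ("AB", "c", "DE")
print mb_strrev($martian, 'ABcDE'), "\n";   # prints "DEcAB"
print scalar reverse('ABcDE'), "\n";        # prints "EDcBA" (bytewise)
```

           The last line shows what "CORE::reverse" does to the same
           bytes: it splits the two-byte characters apart.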

   Search
       "$mbcs->index(STRING, SUBSTR)"
       "$mbcs->index(STRING, SUBSTR, POSITION)"
	   Returns the position	of the first occurrence	of "SUBSTR" in
	   "STRING" at or after	"POSITION".  If	"POSITION" is omitted, starts
	   searching from the beginning	of the string.

	   If the substring is not found, returns "-1".

       "$mbcs->rindex(STRING, SUBSTR)"
       "$mbcs->rindex(STRING, SUBSTR, POSITION)"
           Returns the position of the last occurrence of "SUBSTR" in
           "STRING".  If "POSITION" is specified, returns the last occurrence
           at or before that position.

	   If the substring is not found, returns "-1".
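
           Positions are counted in characters, and a match has to start
           on a character boundary.  The sketch below illustrates both
           points for "index"; "mb_index" is a hypothetical helper, not
           the module's code, and it assumes legally encoded arguments.

```perl
use strict;
use warnings;

# Hypothetical helper: search for SUBSTR only at character boundaries and
# report the position in characters.
sub mb_index {
    my ($regexp, $str, $substr, $pos) = @_;
    my @chars  = $str    =~ /$regexp/g;
    my @needle = $substr =~ /$regexp/g;
    $pos = 0 if !defined $pos || $pos < 0;
    for my $i ($pos .. @chars - @needle) {
        my $window = join '', @chars[$i .. $i + $#needle];
        return $i if $window eq $substr;
    }
    return -1;
}

my $martian = '[A-Z][A-Z]|[^A-Z]';   # two capitals, or one non-capital

print mb_index($martian, 'ABcDE', 'DE'), "\n";    # prints 2
print index('ABcDE', 'DE'), "\n";                 # prints 3 (bytes)
print mb_index($martian, 'xAABBy', 'AB'), "\n";   # prints -1
print index('xAABBy', 'AB'), "\n";                # prints 2 (bytes)
```

           In "xAABBy" the characters are "x", "AA", "BB" and "y", so the
           character "AB" never occurs even though the bytes do.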

       "$mbcs->strspn(STRING, SEARCHLIST)"
           Returns the position of the first occurrence of any character not
           contained in the search list.

	     $mbcs->strspn("+0.12345*12", "+-.0123456789");
	     # returns 8.

	   If the specified string does	not contain any	character in the
	   search list,	returns	0.

           If the string consists only of characters in the search list, the
           returned value equals the length of the string.

	   "SEARCHLIST"	can be an "ARRAYREF".  e.g. if a charset treats	"CRLF"
	   as a	single character, "\r\n" is a one-element list of only "\r\n".
	   A two-element list of "\r" and "\n" can be given as "["\r", "\n"]"
	   (of course "\n\r" is	also ok	since the character order of
	   "SEARCHLIST"	doesn't	matter in "strspn").

       "$mbcs->strcspn(STRING, SEARCHLIST)"
           Returns the position of the first occurrence of any character
           contained in the search list.

	   If the specified string does	not contain any	character in the
	   search list,	the returned value equals the length of	the string.
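
           "strspn" and "strcspn" are complements of each other.  The
           following sketch shows the idea with hypothetical helpers (not
           the module's code); it takes "SEARCHLIST" as a plain string of
           characters and does not handle the "ARRAYREF" form.

```perl
use strict;
use warnings;

# Hypothetical helper: count leading characters that are (or are not)
# members of the search list, depending on $want_member.
sub mb_span {
    my ($regexp, $str, $searchlist, $want_member) = @_;
    my %in_list = map { $_ => 1 } $searchlist =~ /$regexp/g;
    my $pos = 0;
    for my $ch ($str =~ /$regexp/g) {
        my $member = exists $in_list{$ch} ? 1 : 0;
        last if $member != $want_member;
        $pos++;
    }
    return $pos;
}
sub mb_strspn  { return mb_span(@_[0 .. 2], 1) }
sub mb_strcspn { return mb_span(@_[0 .. 2], 0) }

my $bytes = '[\x00-\xFF]';   # one byte is one character

print mb_strspn($bytes,  '+0.12345*12', '+-.0123456789'), "\n";  # prints 8
print mb_strcspn($bytes, '+0.12345*12', '*/'), "\n";             # prints 8
```

           Both calls stop at the "*", which is the first character
           outside the first list and the first character inside the
           second.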

	   "SEARCHLIST"	can be an "ARRAYREF".  e.g. if a charset treats	"CRLF"
	   as a	single character, "\r\n" is a one-element list of only "\r\n".
	   A two-element list of "\r" and "\n" can be given as "["\r", "\n"]"
	   (of course "\n\r" is	also ok	since the character order of
	   "SEARCHLIST"	doesn't	matter in "strcspn").

   Substring
       "$mbcs->substr(STRING or	SCALAR REF, OFFSET)"
       "$mbcs->substr(STRING or	SCALAR REF, OFFSET, LENGTH)"
       "$mbcs->substr(SCALAR, OFFSET, LENGTH, REPLACEMENT)"
	   It works like "CORE::substr", but using character semantics of
	   multibyte charset encoding.

	   If the "REPLACEMENT"	as the fourth argument is specified, replaces
	   parts of the	"SCALAR" and returns what was there before.

           If a reference to a scalar variable is passed as the first
           argument, an lvalue reference is returned, which you can assign
           through.

	       ${ $mbcs->substr(\$str,$off,$len) } = $replace;

		   works like

	       CORE::substr($str,$off,$len) = $replace;

           The returned lvalue is not multibyte-aware, so successive
           assignments may lead to odd results.
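
           The difference between character semantics and
           "CORE::substr" can be sketched as follows.  "mb_substr" is a
           hypothetical helper, not the module's code, and it handles only
           non-negative "OFFSET" and "LENGTH" on a legally encoded string.

```perl
use strict;
use warnings;

# Hypothetical helper: OFFSET and LENGTH count characters, not bytes.
sub mb_substr {
    my ($regexp, $str, $offset, $length) = @_;
    my @chars = $str =~ /$regexp/g;
    my $last  = defined $length ? $offset + $length - 1 : $#chars;
    $last = $#chars if $last > $#chars;
    return join '', @chars[$offset .. $last];
}

my $martian = '[A-Z][A-Z]|[^A-Z]';   # two capitals, or one non-capital

print mb_substr($martian, 'ABcDEf', 1, 2), "\n";   # prints "cDE"
print substr('ABcDEf', 1, 2), "\n";                # prints "Bc" (bytes)
print mb_substr($martian, 'ABcDEf', 2), "\n";      # prints "DEf"
```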

   Split
       "$mbcs->strsplit(SEPARATOR, STRING)"
       "$mbcs->strsplit(SEPARATOR, STRING, LIMIT)"
	   This	function emulates "CORE::split", but splits on the "SEPARATOR"
	   string, not by a pattern.

           If not in list context, returns only the number of fields found
           and does not split into the @_ array.

           If an empty string is specified as "SEPARATOR", splits the
           specified string into characters.

	     $bytes->strsplit('', 'This	is perl.', 7);
	     # ('T', 'h', 'i', 's', ' ', 'i',  's perl.')
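
           The effect of "LIMIT" with an empty separator can be sketched
           in plain Perl.  "mb_split_chars" is a hypothetical helper, not
           the module's code, and covers only the empty-separator case.

```perl
use strict;
use warnings;

# Hypothetical helper: split into characters, letting the last field
# take the remainder once LIMIT fields have been reached.
sub mb_split_chars {
    my ($regexp, $str, $limit) = @_;
    my @fields;
    while (length $str) {
        if (defined $limit && $limit > 0 && @fields == $limit - 1) {
            push @fields, $str;             # the last field takes the rest
            last;
        }
        $str =~ s/\A($regexp)// or last;    # stop at an illegal sequence
        push @fields, $1;
    }
    return @fields;                         # a count in scalar context
}

my @fields = mb_split_chars('[\x00-\xFF]', 'This is perl.', 7);
print join('|', @fields), "\n";             # prints "T|h|i|s| |i|s perl."
```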

   Character Range
       "$mbcs->mkrange(CHARLIST, ALLOW_REVERSE)"
           Returns the character list (not in list context, as a concatenated
           string) obtained by parsing the specified character range.

           The result depends on the character order of the charset
           concerned.  For the character order of each charset, see its
           definition file.

           If the character order is undefined in the definition file, returns
           a string identical to the specified string.

           A character range is specified with a hyphen ('-'; strictly
           speaking, "$obj->{hyphen}").

           The backslashed combinations '\-' and '\\' (strictly speaking,
           "$obj->{escape}$obj->{hyphen}" and "$obj->{escape}$obj->{escape}")
           are used instead of the characters '-' and '\', respectively.  A
           hyphen at the beginning or the end of the range is also evaluated
           as the hyphen itself.

	   For example,	"$mbcs->mkrange('+\-0-9A-F')" returns "('+', '-', '0',
	   '1',	'2', '3', '4', '5', '6', '7', '8', '9',	'A', 'B', 'C', 'D',
	   'E',	'F')" and "scalar $mbcs->mkrange('A-P')" returns
	   'ABCDEFGHIJKLMNOP'.

           If a true value is specified as the second argument, reverse
           character ranges such as '9-0' and 'Z-A' are allowed.

	     $bytes = String::Multibyte->new('Bytes');
	     $bytes->mkrange('p-e-r-l',	1); # ponmlkjihgfefghijklmnopqrqponml
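
           The parsing rules above, including chained ranges such as
           'p-e-r-l', can be sketched for a single-byte charset whose
           character order is the byte order.  "expand_range" is a
           hypothetical helper, not the module's code; dying on a reverse
           range that is not allowed is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: the hyphen is '-' and the escape is '\'.
sub expand_range {
    my ($spec, $allow_reverse) = @_;
    my @in = split //, $spec;
    my @out;
    while (@in) {
        my $ch = shift @in;
        if ($ch eq '\\' && @in && $in[0] =~ /\A[\\-]\z/) {
            push @out, shift @in;            # escaped '\' or '-' is literal
        }
        elsif ($ch eq '-' && @out && @in) {  # hyphen between two characters
            my ($from, $to) = (ord($out[-1]), ord(shift @in));
            if ($from <= $to) {
                push @out, map { chr($_) } ($from + 1 .. $to);
            }
            elsif ($allow_reverse) {
                push @out, map { chr($_) } reverse($to .. $from - 1);
            }
            else {
                die "reverse range not allowed\n";   # assumption
            }
        }
        else {
            push @out, $ch;                  # includes leading/trailing '-'
        }
    }
    return wantarray ? @out : join '', @out;
}

print scalar expand_range('A-P'), "\n";          # ABCDEFGHIJKLMNOP
print scalar expand_range('+\-0-9A-F'), "\n";    # +-0123456789ABCDEF
print scalar expand_range('p-e-r-l', 1), "\n";   # ponmlkjihgfefghijklmnopqrqponml
```

           Each range continues from the last character already produced,
           which is why the end of one range can serve as the start of the
           next.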

   Transliteration
       "$mbcs->strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST)"
       "$mbcs->strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST,
       MODIFIER)"
	   Transliterates all occurrences of the characters found in the
	   search list with the	corresponding character	in the replacement
	   list.

           If a reference to a scalar variable is specified as the first
           argument, returns the number of characters replaced or deleted;
           otherwise, returns the transliterated string and leaves the
           specified string unaffected.

           If the 'h' modifier is specified, returns a hash of histogram in
           list context, or a reference to the hash of histogram in scalar
           context.

	   SEARCHLIST and REPLACEMENTLIST

	   Character ranges (internally	utilizing "mkrange()") are supported.

           If the "REPLACEMENTLIST" is empty (specified as '', not "undef",
           because the use of an uninitialized value causes a warning under
           the -w option), the "SEARCHLIST" is replicated.

           If the replacement list is shorter than the search list, the final
           character in the replacement list is replicated until it is long
           enough (but this works differently when the 'd' modifier is used).

	   "SEARCHLIST"	and "REPLACEMENTLIST" can be an	"ARRAYREF".  e.g. if a
	   charset treats "\r\n" ("CRLF") as a single character, "\r\n"	is a
	   one-element list of only "\r\n".  A two-element list	of "\r"	and
	   "\n"	should be given	as "["\r", "\n"]". Of course "\n\r" is also ok
	   but the character order is different; cf. "strtr($str, ["\r",
	   "\n"], ["\n", "\r"])" that swaps "\n" and "\r".

           Each element of an "ARRAYREF" can include character ranges (the
           modifiers "R" and "r" affect their evaluation as usual).

           "["A-C", "h-z"]" is evaluated like "A-Ch-z" if the charset does not
           include the grapheme "Ch".  The former prevents "C" and "h" from
           being evaluated as "Ch" even if the charset includes the grapheme
           "Ch".

	   MODIFIER

	       c   Complement the SEARCHLIST.
	       d   Delete found	but unreplaced characters.
	       s   Squash duplicate replaced characters.
	       h   Return a hash (or a hashref)	of histogram.
               R   Do not use character ranges.
               r   Allow reverse character ranges.
	       o   Caches the conversion table internally.

           If the 'R' modifier is specified, '-' is evaluated not as a
           metacharacter but as the hyphen itself, as in "tr'''".  Compare:

	     $mbcs->strtr("90 -	32 = 58", "0-9", "A-J");
	       # output: "JA - DC = FI"

	     $mbcs->strtr("90 -	32 = 58", "0-9", "A-J",	"R");
	       # output: "JA - 32 = 58"
	       # cf. ($str = "90 - 32 =	58") =~	tr'0-9'A-J';
	       # '0' to	'A', '-' to '-', and '9' to 'J'.

           If the 'r' modifier is specified, reverse character ranges are
           allowed.  E.g.

	      $mbcs->strtr($str, "0-9",	"9-0", "r")

		is equivalent to

	      $mbcs->strtr($str, "0123456789", "9876543210")

	   Caching the conversion table

           If the 'o' modifier is specified, the conversion table is cached
           internally.  E.g.

	     foreach (@source_strings) {
	       print $mbcs->strtr($_, $from_list, $to_list, 'o');
	     }

	   will	be almost as efficient as this:

	     $trans = $mbcs->trclosure($from_list, $to_list);

	     foreach (@source_strings) {
	       print &$trans($_);
	     }

	   You can use whichever you like.

	   Without 'o',

	     foreach (@source_strings) {
	       print $mbcs->strtr($_, $from_list, $to_list);
	     }

           will be very slow since the conversion table is rebuilt every time
           the method is called.
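
           Why building the table once pays off can be sketched with a
           closure in plain Perl.  "make_translator" is a hypothetical
           helper, not the module's code; it takes both lists as already
           expanded strings of single-byte characters, ignores modifiers,
           and lets the first mapping win when a character is repeated, as
           Perl's own "tr" does.

```perl
use strict;
use warnings;

# Hypothetical helper: build the conversion table once, then return a
# closure that only looks characters up.
sub make_translator {
    my ($from, $to) = @_;
    my @from = split //, $from;
    my @to   = split //, $to;
    @to = @from unless @to;                 # empty list: replicate SEARCHLIST
    push @to, $to[-1] while @to < @from;    # pad with the final character
    my %table;
    for my $i (0 .. $#from) {
        $table{ $from[$i] } = $to[$i] unless exists $table{ $from[$i] };
    }
    return sub {
        my $str = shift;
        return join '',
            map { exists $table{$_} ? $table{$_} : $_ } split //, $str;
    };
}

my $trans = make_translator('0123456789', 'ABCDEFGHIJ');
print $trans->('90 - 32 = 58'), "\n";                # prints "JA - DC = FI"
print make_translator('abc', 'x')->('aabbcc'), "\n"; # prints "xxxxxx"
```

           The loop over many strings then pays only for the lookups, not
           for rebuilding the table on every call.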

   Generation of the Closure to	Transliterate
       "$mbcs->trclosure(SEARCHLIST, REPLACEMENTLIST)"
       "$mbcs->trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER)"
           Returns a closure to transliterate the specified string.  The
           return value is only a code reference, not a blessed object.  Using
           this code ref saves time, as you need not specify the arguments
           every time.

	     my	$trans = $mbcs->trclosure($from_list, $to_list);
	     print &$trans ($string); #	ok to perl 5.003
	     print $trans->($string); #	perl 5.004 or better

           The functionality of the closure made by "trclosure()" is
           equivalent to that of "strtr()".  In fact, "strtr()" calls
           "trclosure()" internally and uses the returned closure.

           "SEARCHLIST" and "REPLACEMENTLIST" can be an "ARRAYREF", as with
           "strtr()".

CAVEAT
       $[  This module assumes that $[ is always equal to 0, never 1.

       Grapheme	manipulation
           Since v. 1.01, manipulation of sequences of graphemes is
           supported.

	   In a	grapheme-aware manipulation, notice that the beginning and the
	   end of a string always lie on a grapheme boundary.

	   E.g.	imagine	a grapheme set where a grapheme	comprises either a
	   leading latin capital letter	followed by one	or more	latin small
           letters, or a single byte.  Such a set can be defined as below.

	      $gra = String::Multibyte->new({
		    regexp => '[A-Z][a-z]*|[\x00-\xFF]',
		 });

           Think about "$gra->index("Perl", "Pe")".  As "Perl" and "Pe" are
           each a single grapheme, they are not equal to each other.  So the
           result must be "-1" (meaning no match).
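
           The segmentation behind this result can be checked with the
           regexp alone.  The sketch below uses plain Perl pattern
           matching rather than the module, so it shows only how the
           strings divide into graphemes.

```perl
use strict;
use warnings;

my $regexp = '[A-Z][a-z]*|[\x00-\xFF]';   # the grapheme set defined above

my @perl = 'Perl' =~ /$regexp/g;   # ("Perl"): one grapheme
my @pe   = 'Pe'   =~ /$regexp/g;   # ("Pe"): one grapheme
my @low  = 'perl' =~ /$regexp/g;   # ("p", "e", "r", "l")

print scalar(@perl), " ", scalar(@pe), " ", scalar(@low), "\n";  # prints "1 1 4"
print $perl[0] eq $pe[0] ? "equal\n" : "different\n";            # prints "different"
```

           Since "Perl" is one indivisible grapheme, no grapheme boundary
           falls after "Pe", and a grapheme-wise search cannot match there.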

AUTHOR
       SADAHIRO	Tomoyuki <SADAHIRO@cpan.org>

       Copyright(C) 2001-2015, SADAHIRO	Tomoyuki. Japan. All rights reserved.

       This module is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

SEE ALSO
       perl(1).

perl v5.32.1			  2015-12-06			  Multibyte(3)
