String(3)	      User Contributed Perl Documentation	     String(3)

NAME
       Unicode::String - String	of Unicode characters (UTF-16BE)

SYNOPSIS
	use Unicode::String qw(utf8 latin1 utf16be);

	$u = utf8("string");
	$u = latin1("string");
	$u = utf16be("\0s\0t\0r\0i\0n\0g");

	print $u->utf32be;   # 4 byte characters
	print $u->utf16le;   # 2 byte characters + surrogates
	print $u->utf8;	     # 1-4 byte	characters

DESCRIPTION
       A "Unicode::String" object represents a sequence	of Unicode characters.
       Methods are provided to convert between various external	formats
       (encodings) and "Unicode::String" objects, and methods are provided for
       common string manipulations.

       The functions utf32be(), utf32le(), utf16be(), utf16le(), utf8(),
       utf7(), latin1(), uhex() and uchr() can be imported from the
       "Unicode::String" module.  Each works as a constructor that
       initializes a string from data in the corresponding encoding.

       The "Unicode::String" objects overload various operators, which means
       that they in most cases can be treated like plain strings.
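
       A minimal sketch of the imported constructors, assuming the module is
       installed from CPAN (the variable names are illustrative only).  It
       builds the same two characters from two encodings and prints one of
       the objects through the overloaded stringification:

```perl
use strict;
use warnings;
use Unicode::String qw(latin1 utf8);

# The same two characters (U+00B5, U+006D) supplied in two encodings.
my $from_latin1 = latin1("\xB5m");      # ISO-8859-1 bytes
my $from_utf8   = utf8("\xC2\xB5m");    # UTF-8 bytes

# Compare the two objects through a common external encoding.
my $same = $from_latin1->utf16be eq $from_utf8->utf16be;
print $same ? "same characters\n" : "different characters\n";

# Printing the object stringifies it (as UTF-8 unless changed).
print $from_latin1, "\n";
```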

       Internally a "Unicode::String" object is represented by a string of
       2-byte numbers in network byte order (big-endian).  This
       representation is not visible through the provided API, but it is
       useful to know when predicting the efficiency of the provided methods.

   METHODS
   Class methods
       The following class methods are available:

       Unicode::String->stringify_as
       Unicode::String->stringify_as( $enc )
	   This	method is used to specify which	encoding will be used when
	   "Unicode::String" objects are implicitly converted to and from
	   plain strings.

	   If an argument is provided, it sets the current encoding.  The
	   argument should be one of the following: "ucs4", "utf32",
	   "utf32be", "utf32le", "ucs2", "utf16", "utf16be", "utf16le",
	   "utf8", "utf7", "latin1" or "hex".  The default is "utf8".

	   The stringify_as() method returns a reference to the	current
	   encoding function.
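
       The effect of stringify_as() can be sketched as follows.  This is an
       illustration under the assumption that the setting applies to all
       "Unicode::String" objects in the program, which is why the example
       restores the default afterwards; the variable names are made up.

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $us = latin1("\xB5m");

# With the default setting, implicit conversion produces UTF-8 bytes.
my $as_utf8 = "$us";

# Switch implicit conversion to ISO-8859-1 and stringify again.
Unicode::String->stringify_as("latin1");
my $as_latin1 = "$us";

# Restore the default so that later code is unaffected.
Unicode::String->stringify_as("utf8");

printf "utf8: %d bytes, latin1: %d bytes\n",
    length($as_utf8), length($as_latin1);
```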

       $us = Unicode::String->new
       $us = Unicode::String->new( $initial_value )
	   This is the object constructor.  Without an argument, it creates
	   an empty "Unicode::String" object.  If an $initial_value argument
	   is given, it is decoded according to the specified stringify_as()
	   encoding, UTF-8 by default.

	   In general it is recommended	to import and use one of the encoding
	   specific constructor	functions instead of invoking this method.

   Encoding methods
       These methods get or set	the value of the "Unicode::String" object by
       passing strings in the corresponding encoding.  If a new	value is
       passed as argument it will set the value	of the "Unicode::String", and
       the previous value is returned.	If no argument is passed then the
       current value is	returned.

       To illustrate the encodings, each description shows how the
       two-character sample string "µm" (micrometer) is encoded.

       $us->utf32be
       $us->utf32be( $newval )
	   The string passed should be in the UTF-32 encoding with bytes in
	   big endian order.  The sample "µm" is "\0\0\0\xB5\0\0\0m" in
	   this encoding.

	   Alternative names for this method are utf32() and ucs4().

       $us->utf32le
       $us->utf32le( $newval )
	   The string passed should be in the UTF-32 encoding with bytes in
	   little endian order.  The sample "µm" is
	   "\xB5\0\0\0m\0\0\0" in this encoding.

       $us->utf16be
       $us->utf16be( $newval )
	   The string passed should be in the UTF-16 encoding with bytes in
	   big endian order.  The sample "µm" is "\0\xB5\0m" in this
	   encoding.

	   Alternative names for this method are utf16() and ucs2().

	   If the string passed	to utf16be() starts with the Unicode byte
	   order mark in little	endian order, the result is as if utf16le()
	   was called instead.

       $us->utf16le
       $us->utf16le( $newval )
	   The string passed should be in the UTF-16 encoding with bytes in
	   little endian order.  The sample "µm" is "\xB5\0m\0" in this
	   encoding.  This is the encoding used by the Microsoft Windows
	   API.

	   If the string passed to utf16le() starts with the Unicode byte
	   order mark in big endian order, the result is as if utf16be() was
	   called instead.

       $us->utf8
       $us->utf8( $newval )
	   The string passed should be in the UTF-8 encoding.  The sample
	   "µm" is "\xC2\xB5m" in this encoding.

       $us->utf7
       $us->utf7( $newval )
	   The string passed should be in the UTF-7 encoding.  The sample
	   "µm" is "+ALU-m" in this encoding.

	   The UTF-7 encoding uses only plain US-ASCII characters.  This
	   makes it safe for transport through protocols that strip the
	   eighth bit.  Characters outside the US-ASCII range are
	   base64-encoded, and '+' is used as an escape character.  The UTF-7
	   encoding is described in RFC 1642.

	   If the (global) variable
	   $Unicode::String::UTF7_OPTIONAL_DIRECT_CHARS is TRUE, which it is
	   by default, a wider range of characters is encoded as themselves.
	   The characters affected by this are:

	      !	" # $ %	& * ; <	= > @ [	] ^ _ `	{ | }

       $us->latin1
       $us->latin1( $newval )
	   The string passed should be in the ISO-8859-1 encoding.  The
	   sample "µm" is "\xB5m" in this encoding.

	   Characters outside the "\x00" .. "\xFF" range are simply removed
	   from	the return value of the	latin1() method.  If you want more
	   control over	the mapping from Unicode to ISO-8859-1,	use the
	   "Unicode::Map8" class.  This	is also	the way	to deal	with other
	   8-bit character sets.
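
       The dropping of characters outside the ISO-8859-1 range can be
       sketched like this.  The example uses the uhex() constructor
       described under FUNCTIONS; the chosen code points and variable names
       are illustrative.

```perl
use strict;
use warnings;
use Unicode::String qw(uhex);

# U+00B5 (micro sign), U+263A (outside ISO-8859-1), U+006D ("m").
my $us = uhex("U+00b5 U+263a U+006d");

# latin1() has no representation for U+263A, so it is left out.
my $latin1 = $us->latin1;

printf "%d characters in, %d bytes out\n", $us->length, length($latin1);
```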

       $us->hex
       $us->hex( $newval )
	   The string passed should be plain ASCII where each Unicode
	   character is represented by a "U+XXXX" string, with the strings
	   separated by single space characters.  The "U+" prefix is optional
	   when setting the value.  The sample "µm" is "U+00b5 U+006d" in
	   this encoding.
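
       As a sketch of the get and set behaviour described at the start of
       this section, the following calls each encoding method on the sample
       string and prints the result with non-printable bytes escaped.  The
       loop, the %encoded hash and the escaping are illustrative helpers,
       not part of the module.

```perl
use strict;
use warnings;
use Unicode::String qw(utf8);

my $us = utf8("\xC2\xB5m");    # U+00B5 U+006D, given as UTF-8 bytes

# Call each getter (no argument) and show the bytes it returns.
my %encoded;
for my $enc (qw(utf32be utf32le utf16be utf16le utf8 utf7 latin1 hex)) {
    my $value = $us->$enc;
    $encoded{$enc} = $value;
    (my $shown = $value) =~ s/([^\x20-\x7E])/sprintf("\\x%02X", ord $1)/ge;
    printf "%-8s %s\n", $enc, $shown;
}

# Passing an argument sets a new value and returns the previous one.
my $previous = $us->latin1("abc");
```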

   String Operations
       The following methods are available:

       $us->as_string
	   Converts a "Unicode::String"	to a plain string according to the
	   setting of stringify_as().  The default stringify_as() encoding is
	   "utf8".

       $us->as_num
	   Converts a "Unicode::String"	to a number.  Currently	only the
	   digits in the range 0x30 .. 0x39 are	recognized.  The plan is to
	   eventually support all Unicode digit	characters.

       $us->as_bool
	   Converts a "Unicode::String" to a boolean value.  Only the empty
	   string is FALSE.  A string consisting of only the character U+0030
	   is considered TRUE, even though Perl considers "0" to be FALSE.
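
       A short sketch of the numeric and boolean conversions, using
       illustrative variable names.  It contrasts a string holding only
       U+0030 with an empty object:

```perl
use strict;
use warnings;
use Unicode::String qw(utf8);

# ASCII digits are converted to a number.
my $number = utf8("42")->as_num;

# A string holding only "0" is TRUE for Unicode::String ...
my $zero_is_true = utf8("0")->as_bool ? 1 : 0;

# ... and only the empty string is FALSE.
my $empty_is_true = Unicode::String->new->as_bool ? 1 : 0;

print "number=$number zero=$zero_is_true empty=$empty_is_true\n";
```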

       $us->repeat( $count )
	   Returns a new "Unicode::String" where the content of	$us is
	   repeated $count times.  This	operation is also overloaded as:

	     $us x $count

       $us->concat( $other_string )
	   Concatenates the string $us and the string $other_string.  If
	   $other_string is not a "Unicode::String" object, then it is first
	   passed to the Unicode::String->new constructor function.  This
	   operation is also overloaded as:

	     $us . $other_string

       $us->append( $other_string )
	   Appends the string $other_string to the value of $us.  If
	   $other_string is not a "Unicode::String" object, then it is first
	   passed to the Unicode::String->new constructor function.  This
	   operation is also overloaded as:

	     $us .= $other_string
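
       The overloaded forms of repeat(), concat() and append() can be
       sketched together.  The variable names are illustrative, and the
       plain string "cd" is assumed to be decoded as UTF-8, the default
       stringify_as() encoding.

```perl
use strict;
use warnings;
use Unicode::String qw(utf8);

my $us = utf8("ab");

my $tripled = $us x 3;             # same as $us->repeat(3)
my $joined  = $us . utf8("cd");    # same as $us->concat(utf8("cd"))

# Appending a plain string: it is first passed to Unicode::String->new.
my $grown = utf8("ab");
$grown .= "cd";

print $tripled->utf8, " ", $joined->utf8, " ", $grown->utf8, "\n";
```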

       $us->copy
	   Returns a copy of the current "Unicode::String" object.  This
	   operation is	overloaded as the assignment operator.

       $us->length
	   Returns the length of the "Unicode::String" in 2-byte units.  A
	   surrogate pair therefore still counts as 2.

       $us->byteswap
	   This	method will swap the bytes in the internal representation of
	   the "Unicode::String" object.

	   Unicode reserves the character U+FEFF as a byte order mark.  This
	   works because the byte-swapped value, U+FFFE, is guaranteed never
	   to be a valid character.  For strings that have the byte order
	   mark as the first character, the following code guarantees that
	   the byte order comes out right:

	      $ustr->byteswap if $ustr->ord == 0xFFFE;

       $us->unpack
	   Returns a list of integers, each representing a UCS-2 character
	   code.

       $us->pack( @uchr	)
	   Sets the value of $us to a sequence of UCS-2 characters with the
	   character codes given as parameters.
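
       A minimal round trip through pack() and unpack(), with illustrative
       variable names.  It rebuilds the sample string used in the encoding
       descriptions from its two character codes:

```perl
use strict;
use warnings;
use Unicode::String;

# Start from an empty object and set it from two character codes.
my $us = Unicode::String->new;
$us->pack(0x00B5, 0x006D);

# unpack() gives the codes back as a list of integers.
my @codes = $us->unpack;

printf "U+%04X ", $_ for @codes;
print "\n";
```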

       $us->ord
	   Returns the character code of the first character in $us.  The
	   ord() method deals with surrogate pairs, which gives a result
	   range of 0x0 .. 0x10FFFF.  If the $us string is empty, undef is
	   returned.

       $us->chr( $code )
	   Sets the value of $us to a string containing the character
	   assigned code $code.  The argument $code must be an integer in the
	   range 0x0 .. 0x10FFFF.  If the code is greater than 0xFFFF, a
	   surrogate pair is created.
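
       How surrogate pairs show up in length(), ord() and unpack() can be
       sketched as follows.  The example uses the uchr() constructor
       function from FUNCTIONS; the code point U+1F600 is only an arbitrary
       value above 0xFFFF.

```perl
use strict;
use warnings;
use Unicode::String qw(uchr);

# A code above 0xFFFF is stored as a surrogate pair.
my $us = uchr(0x1F600);

my $units = $us->length;    # counts 2-byte units, not characters
my $code  = $us->ord;       # scalar context: code of the first character
my @pair  = $us->unpack;    # the two surrogate values

printf "units=%d code=U+%X pair=%04X %04X\n", $units, $code, @pair;
```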

       $us->name
	   In scalar context returns the official Unicode name of the first
	   character in	$us.  In array context returns the name	of all
	   characters in $us.  Also see	Unicode::CharName.

       $us->substr( $offset )
       $us->substr( $offset, $length )
       $us->substr( $offset, $length, $subst )
	   Returns a sub-string of $us.  Works similarly to the builtin
	   substr() function.

       $us->index( $other )
       $us->index( $other, $pos	)
	   Locates the position	of $other within $us, possibly starting	the
	   search at position $pos.
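
       A small sketch combining index() and substr(), with illustrative
       variable names.  It assumes that positions are counted from 0, as
       with the builtin functions, and that a plain string passed to
       index() is decoded as UTF-8.

```perl
use strict;
use warnings;
use Unicode::String qw(utf8);

my $us = utf8("hello world");

# Find where "world" starts, then extract five characters from there.
my $pos  = $us->index("world");
my $word = $us->substr($pos, 5);

print "found at $pos: ", $word->utf8, "\n";
```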

       $us->chop
	   Chops off the last character	of $us and returns it (as a
	   "Unicode::String" object).

FUNCTIONS
       The following functions are provided.  None of these are	exported by
       default.

       byteswap2( $str,	... )
	   This function swaps the two bytes of each 2-byte unit in the
	   strings passed as arguments.  If the function is called in void
	   context, it modifies its arguments in place.  Otherwise, the
	   swapped strings are returned.

       byteswap4( $str,	... )
	   The byteswap4 function works like byteswap2, but reverses the
	   byte order within each 4-byte unit.
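
       The two calling styles can be sketched like this.  The byte strings
       and variable names are illustrative; the list assignments are used so
       that the functions are clearly not called in void context.

```perl
use strict;
use warnings;
use Unicode::String qw(byteswap2 byteswap4);

# Non-void context: the swapped strings are returned.
my ($swapped2) = byteswap2("\x00A\x00B");
my ($swapped4) = byteswap4("\x00\x00\x00A");

# Void context: the argument itself is modified.
my $buffer = "\xFE\xFF\x00A";
byteswap2($buffer);

print join(" ", map { unpack("H*", $_) } $swapped2, $swapped4, $buffer), "\n";
```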

       latin1( $str )
       utf7( $str )
       utf8( $str )
       utf16le(	$str )
       utf16be(	$str )
       utf32le(	$str )
       utf32be(	$str )
	   Constructor functions for the various Unicode encodings.  These
	   return new "Unicode::String"	objects.  The provided argument	should
	   be encoded correspondingly.

       uhex( $str )
	   Constructs a new "Unicode::String" object from a string of hex
	   values.  See the hex() method above for a description of the
	   format.

       uchr( $num )
	   Constructs a new one-character "Unicode::String" object from a
	   Unicode character code.  This works similarly to Perl's builtin
	   chr() function.

SEE ALSO
       Unicode::CharName, Unicode::Map8

       <http://www.unicode.org/>

       perlunicode

COPYRIGHT
       Copyright 1997-2000,2005	Gisle Aas.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.32.1			  2021-11-04			     String(3)
