unicode(3)		   Erlang Module Definition		    unicode(3)

NAME
       unicode - Functions for converting Unicode characters.

DESCRIPTION
       This module contains functions for converting between different charac-
       ter representations. It converts	between	 ISO  Latin-1  characters  and
       Unicode	characters,  but it can	also convert between different Unicode
       encodings (like UTF-8, UTF-16, and UTF-32).

       In binaries, the default Unicode encoding in Erlang is UTF-8, which is
       also the format in which built-in functions and libraries in OTP expect
       to find binary Unicode data. In lists, Unicode data is encoded as
       integers, each integer being the Unicode code point of one character.
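
       As a minimal sketch of the two representations described above, the
       following module converts one character between a code point list and
       a UTF-8 binary. The module and function names are illustrative only
       and are not part of the unicode module.

```erlang
-module(unicode_repr_example).
-export([demo/0]).

%% Illustrative only: shows U+00E5 as a list and as a UTF-8 binary.
demo() ->
    %% In a list, the character is a single integer (its code point).
    List = [16#E5],
    %% In a binary, the same character is UTF-8 encoded (two bytes here).
    Bin = unicode:characters_to_binary(List),
    %% Converting back yields the code point list again.
    RoundTrip = unicode:characters_to_list(Bin),
    {Bin, RoundTrip}.
```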

       Unicode encodings other than integers representing code points or UTF-8
       in binaries are referred to as "external encodings". The ISO Latin-1
       encoding, in both binaries and lists, is referred to as latin1-encoding.

       It is recommended to only use external encodings	for communication with
       external	 entities  where this is required. When	working	inside the Er-
       lang/OTP	environment, it	is recommended to keep binaries	in UTF-8  when
       representing Unicode characters.	ISO Latin-1 encoding is	supported both
       for backward compatibility and for communication	with external entities
       not supporting Unicode character	sets.

       Programs should always operate on a normalized form and compare
       canonical-equivalent Unicode characters as equal. All characters should
       thus be normalized to one form once, at the system borders. One of the
       following functions can convert characters to their normalized forms:
       characters_to_nfc_list/1, characters_to_nfc_binary/1,
       characters_to_nfd_list/1, or characters_to_nfd_binary/1. For general
       text, characters_to_nfc_list/1 or characters_to_nfc_binary/1 is
       preferred, and for identifiers one of the compatibility normalization
       functions, such as characters_to_nfkc_list/1, is preferred for security
       reasons. The normalization functions were introduced in OTP 20.
       Additional information on normalization can be found in the Unicode
       FAQ.
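
       The advice to compare canonical-equivalent characters as equal can be
       sketched as follows. The helper equal_nfc/2 and its module are
       hypothetical names chosen for this illustration, and the sketch
       assumes OTP 20 or later and valid chardata() input.

```erlang
-module(nfc_compare_example).
-export([equal_nfc/2]).

%% Hypothetical helper: compare two chardata() values after normalizing
%% both to NFC. If either input is invalid, the error tuples returned by
%% the normalization functions are compared as ordinary terms.
equal_nfc(A, B) ->
    unicode:characters_to_nfc_list(A) =:= unicode:characters_to_nfc_list(B).
```

       A precomposed character and its decomposed sequence differ as raw
       lists, for example [16#E5] and [$a, 778], but compare as equal once
       both are normalized.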

DATA TYPES
       encoding() =
	   latin1 |
	   unicode |
	   utf8	|
	   utf16 |
	   {utf16, endian()} |
	   utf32 |
	   {utf32, endian()}

       endian()	= big |	little

       unicode_binary()	= binary()

	      A	binary() with characters encoded in the	UTF-8 coding standard.

       chardata() = charlist() | unicode_binary()

       charlist() =
	   maybe_improper_list(char() |	unicode_binary() | charlist(),
			       unicode_binary()	| [])

       external_unicode_binary() = binary()

	      A	binary() with characters coded in a user-specified Unicode en-
	      coding other than	UTF-8 (that is,	UTF-16 or UTF-32).

       external_chardata() =
	   external_charlist() | external_unicode_binary()

       external_charlist() =
	   maybe_improper_list(char() |
			       external_unicode_binary() |
			       external_charlist(),
			       external_unicode_binary() | [])

       latin1_binary() = binary()

	      A	binary() with characters coded in ISO Latin-1.

       latin1_char() = byte()

	      An integer() representing	a valid	ISO Latin-1 character (0-255).

       latin1_chardata() = latin1_charlist() | latin1_binary()

	      Same as iodata().

       latin1_charlist() =
	   maybe_improper_list(latin1_char() |
			       latin1_binary() |
			       latin1_charlist(),
			       latin1_binary() | [])

	      Same as iolist().
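
       To make the chardata() type more concrete, the sketch below builds a
       deep, mixed value and flattens it in both directions. The module name
       is hypothetical.

```erlang
-module(chardata_example).
-export([demo/0]).

%% Illustrative only: a chardata() value mixing strings, a UTF-8 binary,
%% a code point above 255, and an improper list with a binary tail.
demo() ->
    Data = ["ab", <<"cd">>, [16#3B1], [$e | <<"f">>]],
    Bin = unicode:characters_to_binary(Data),
    List = unicode:characters_to_list(Data),
    {Bin, List}.
```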

EXPORTS
       bom_to_encoding(Bin) -> {Encoding, Length}

	      Types:

		 Bin = binary()
                    A binary() such that byte_size(Bin) >= 4.
		 Encoding =
		     latin1 | utf8 | {utf16, endian()} | {utf32, endian()}
		 Length	= integer() >= 0
		 endian() = big	| little

	      Checks for a UTF Byte Order Mark (BOM) in	the beginning of a bi-
	      nary. If the supplied binary Bin begins with a valid BOM for ei-
	      ther UTF-8, UTF-16, or UTF-32, the function returns the encoding
	      identified along with the	BOM length in bytes.

	      If no BOM	is found, the function returns {latin1,0}.
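
              One possible use is to strip a leading BOM and decode the rest
              of the binary accordingly. This is a sketch only: the function
              decode_with_bom/1 is a hypothetical helper, and treating input
              without a BOM as ISO Latin-1 simply follows the {latin1,0}
              return value, which may not suit every application.

```erlang
-module(bom_decode_example).
-export([decode_with_bom/1]).

%% Hypothetical helper: detect a BOM, skip it, and decode the remainder
%% using the detected encoding. Without a BOM, Length is 0 and the data
%% is decoded as latin1.
decode_with_bom(Bin) when is_binary(Bin) ->
    {Encoding, Length} = unicode:bom_to_encoding(Bin),
    <<_Bom:Length/binary, Rest/binary>> = Bin,
    unicode:characters_to_list(Rest, Encoding).
```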

       characters_to_binary(Data) -> Result

	      Types:

		 Data =	latin1_chardata() | chardata() | external_chardata()
		 Result	=
		     binary() |
		     {error, binary(), RestData} |
		     {incomplete, binary(), binary()}
                 RestData =
                     latin1_chardata() | chardata() | external_chardata()

	      Same as characters_to_binary(Data, unicode, unicode).

       characters_to_binary(Data, InEncoding) -> Result

	      Types:

		 Data =	latin1_chardata() | chardata() | external_chardata()
		 InEncoding = encoding()
		 Result	=
		     binary() |
		     {error, binary(), RestData} |
		     {incomplete, binary(), binary()}
                 RestData =
                     latin1_chardata() | chardata() | external_chardata()

	      Same as characters_to_binary(Data, InEncoding, unicode).

       characters_to_binary(Data, InEncoding, OutEncoding) -> Result

	      Types:

		 Data =	latin1_chardata() | chardata() | external_chardata()
		 InEncoding = OutEncoding = encoding()
		 Result	=
		     binary() |
		     {error, binary(), RestData} |
		     {incomplete, binary(), binary()}
                 RestData =
                     latin1_chardata() | chardata() | external_chardata()

              Behaves like characters_to_list/2, but produces a binary instead
              of a Unicode list.

              InEncoding defines how input is to be interpreted if binaries
              are present in Data.

	      OutEncoding defines in what format output	is to be generated.

	      Options:

		unicode:
		  An alias for utf8, as	this is	 the  preferred	 encoding  for
		  Unicode characters in	binaries.

		utf16:
		  An alias for {utf16,big}.

		utf32:
		  An alias for {utf32,big}.

	      The atoms	big and	little denote big- or little-endian encoding.

	      Errors  and exceptions occur as in characters_to_list/2, but the
	      second element in	tuple error or incomplete is  a	 binary()  and
	      not a list().
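
              The effect of OutEncoding can be seen by encoding the same two
              characters in several ways. The module below is a hypothetical
              illustration; only unicode:characters_to_binary/3 comes from
              this manual page.

```erlang
-module(out_encoding_example).
-export([demo/0]).

%% Illustrative only: encode "a" and U+00E5 with different OutEncoding
%% values. The input is a list, so InEncoding does not affect it.
demo() ->
    Chars = [$a, 16#E5],
    Utf8 = unicode:characters_to_binary(Chars, unicode, utf8),
    %% utf16 without an endian() is an alias for {utf16, big}.
    Utf16Big = unicode:characters_to_binary(Chars, unicode, utf16),
    Utf16Little = unicode:characters_to_binary(Chars, unicode, {utf16, little}),
    Latin1 = unicode:characters_to_binary(Chars, unicode, latin1),
    {Utf8, Utf16Big, Utf16Little, Latin1}.
```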

       characters_to_list(Data)	-> Result

	      Types:

		 Data =	latin1_chardata() | chardata() | external_chardata()
		 Result	=
		     list() |
		     {error, list(), RestData} |
		     {incomplete, list(), binary()}
                 RestData =
                     latin1_chardata() | chardata() | external_chardata()

	      Same as characters_to_list(Data, unicode).

       characters_to_list(Data,	InEncoding) -> Result

	      Types:

		 Data =	latin1_chardata() | chardata() | external_chardata()
		 InEncoding = encoding()
		 Result	=
		     list() |
		     {error, list(), RestData} |
		     {incomplete, list(), binary()}
                 RestData =
                     latin1_chardata() | chardata() | external_chardata()

	      Converts	a  possibly  deep list of integers and binaries	into a
	      list of integers representing Unicode characters.	 The  binaries
	      in  the  input can have characters encoded as one	of the follow-
	      ing:

                * ISO Latin-1 (0-255, one character per byte). In this case,
                  parameter InEncoding is to be specified as latin1.

		* One  of  the	UTF-encodings, which is	specified as parameter
		  InEncoding.

              Note that integers in the list always represent code points,
              regardless of the InEncoding passed. If InEncoding latin1 is
              passed, only code points < 256 are allowed; otherwise, all
              valid Unicode code points are allowed.

	      If  InEncoding  is latin1, parameter Data	corresponds to the io-
	      data() type, but for unicode, parameter Data can	contain	 inte-
	      gers  >  255  (Unicode characters	beyond the ISO Latin-1 range),
	      which makes it invalid as	iodata().
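
              The difference between the two interpretations can be sketched
              as follows. InEncoding changes how bytes in binaries are read,
              while integers in lists are always code points. The module name
              is hypothetical.

```erlang
-module(in_encoding_example).
-export([demo/0]).

%% Illustrative only: the same two bytes read as latin1 and as utf8,
%% and a code point above 255 passed with each InEncoding.
demo() ->
    Bytes = <<16#C3, 16#A5>>,
    %% As latin1, every byte is one character.
    AsLatin1 = unicode:characters_to_list(Bytes, latin1),
    %% As utf8, the two bytes form the single character U+00E5.
    AsUtf8 = unicode:characters_to_list(Bytes, utf8),
    %% A code point above 255 in a list is rejected for latin1 ...
    Rejected = unicode:characters_to_list([16#3B1], latin1),
    %% ... but accepted for unicode.
    Accepted = unicode:characters_to_list([16#3B1], unicode),
    {AsLatin1, AsUtf8, Rejected, Accepted}.
```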

	      The purpose of the function is mainly to convert combinations of
	      Unicode  characters into a pure Unicode string in	list represen-
	      tation for further processing. For writing the data to an	exter-
	      nal entity, the reverse function characters_to_binary/3 comes in
	      handy.

	      Option unicode is	an alias for utf8, as this  is	the  preferred
	      encoding	for  Unicode characters	in binaries. utf16 is an alias
	      for {utf16,big} and utf32	is an alias for	{utf32,big}. The atoms
	      big and little denote big- or little-endian encoding.

	      If  the data cannot be converted,	either because of illegal Uni-
	      code/ISO Latin-1 characters in the list, or because  of  invalid
	      UTF  encoding  in	 any binaries, an error	tuple is returned. The
	      error tuple contains the tag  error,  a  list  representing  the
	      characters that could be converted before	the error occurred and
	      a	representation of the characters including and after  the  of-
	      fending integer/bytes. The last part is mostly for debugging, as
	      it still constitutes a possibly deep or mixed list, or both, not
	      necessarily  of  the  same depth as the original data. The error
	      occurs when traversing the list and whatever is left  to	decode
	      is returned "as is".

	      However,	if  the	input Data is a	pure binary, the third part of
	      the error	tuple is guaranteed to be a binary as well.

	      Errors occur for the following reasons:

		* Integers out of range.

		  If InEncoding	is latin1, an error occurs whenever an integer
		  > 255	is found in the	lists.

		  If InEncoding	is of a	Unicode	type, an error occurs whenever
		  either of the	following is found:

		  * An integer > 16#10FFFF (the	maximum	Unicode	character)

		  * An integer in the range 16#D800 to 16#DFFF (invalid	 range
		    reserved for UTF-16	surrogate pairs)

		* Incorrect UTF	encoding.

		  If  InEncoding is one	of the UTF types, the bytes in any bi-
		  naries must be valid in that encoding.

		  Errors can occur for various reasons,	including the  follow-
		  ing:

		  * "Pure"  decoding  errors (like the upper bits of the bytes
		    being wrong).

                  * The bytes decode to a number that is too large.

		  * The	bytes are decoded to a code point in the invalid  Uni-
		    code range.

		  * Encoding  is "overlong", meaning that a number should have
		    been encoded in fewer bytes.

                  The case of a truncated UTF sequence is handled specially,
                  see the paragraph about incomplete binaries below.

		  If  InEncoding  is latin1, binaries are always valid as long
		  as they contain whole	bytes, as each	byte  falls  into  the
		  valid	ISO Latin-1 range.
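
              A few of the error cases listed above can be reproduced with
              the sketch below. The module name is hypothetical, and the
              exact third element of each error tuple is left unspecified in
              the comments, because it is described above as mostly useful
              for debugging.

```erlang
-module(error_tuple_example).
-export([demo/0]).

%% Illustrative only: each call returns {error, Converted, Rest}, where
%% Converted holds the characters decoded before the problem was found.
demo() ->
    %% An integer in the UTF-16 surrogate range.
    Surrogate = unicode:characters_to_list([$a, 16#D800]),
    %% An integer above the maximum code point 16#10FFFF.
    TooLarge = unicode:characters_to_list([$a, 16#110000]),
    %% An overlong two-byte UTF-8 sequence in a pure binary.
    Overlong = unicode:characters_to_list(<<$a, 16#C0, 16#80>>),
    {Surrogate, TooLarge, Overlong}.
```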

	      A	 special  type	of error is when no actual invalid integers or
	      bytes are	found, but a trailing binary()	consists  of  too  few
	      bytes  to	 decode	 the  last  character. This error can occur if
	      bytes are	read from a file in chunks or  if  binaries  in	 other
	      ways  are	 split	on non-UTF character boundaries. An incomplete
	      tuple is then returned instead of	the error tuple.  It  consists
	      of  the same parts as the	error tuple, but the tag is incomplete
	      instead of error and the last element is always guaranteed to be
	      a	 binary	 consisting  of	the first part of a (so	far) valid UTF
	      character.

	      If one UTF character is split over two consecutive  binaries  in
	      the  Data,  the conversion succeeds. This	means that a character
	      can be decoded from a range of binaries as  long	as  the	 whole
	      range is specified as input without errors occurring.

	      Example:

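              decode_data(Data) ->
                 case unicode:characters_to_list(Data, unicode) of
                    {incomplete, Encoded, Rest} ->
                          %% Rest holds the start of a truncated character;
                          %% prepend it to the next chunk and continue.
                          %% get_some_more_data/0 is a placeholder.
                          More = get_some_more_data(),
                          Encoded ++ decode_data([Rest, More]);
                    {error, Encoded, Rest} ->
                          %% handle_error/2 is a placeholder.
                          handle_error(Encoded, Rest);
                    List ->
                          List
                 end.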

	      However,	bit  strings that are not whole	bytes are not allowed,
	      so a UTF character must be split along 8-bit boundaries to  ever
	      be decoded.
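
              The incomplete tuple and the split-character case described
              above can be checked with a short sketch. The module name is
              hypothetical. The bytes 16#C3 and 16#A5 together form the
              UTF-8 encoding of U+00E5.

```erlang
-module(incomplete_example).
-export([demo/0]).

%% Illustrative only: a truncated character at the end of the input, and
%% the same character split over two consecutive binaries.
demo() ->
    %% The trailing byte starts a two-byte character that is cut off.
    Truncated = unicode:characters_to_list(<<$a, 16#C3>>),
    %% Supplying the missing byte in the next binary makes it decodable.
    Whole = unicode:characters_to_list([<<$a, 16#C3>>, <<16#A5>>]),
    {Truncated, Whole}.
```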

	      A	badarg exception is thrown for the following cases:

		* Any parameters are of	the wrong type.

		* The list structure is	invalid	(a number as tail).

		* The binaries do not contain whole bytes (bit strings).
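
              The three badarg cases can be probed with a small wrapper. The
              function is_badarg/1 is a hypothetical helper written for this
              illustration; it only reports whether the call raised a badarg
              error.

```erlang
-module(badarg_example).
-export([is_badarg/1]).

%% Hypothetical helper: true if unicode:characters_to_list/1 raises
%% badarg for Data, false if it returns normally (including error and
%% incomplete tuples, which are ordinary return values).
is_badarg(Data) ->
    try unicode:characters_to_list(Data) of
        _Result -> false
    catch
        error:badarg -> true
    end.
```

              For example, an atom (wrong type), the improper list [$a | $b]
              (a number as tail), and the bit string <<1:7>> are each
              expected to give true, while "abc" gives false.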

       characters_to_nfc_list(CD :: chardata())	->
				 [char()] | {error, [char()], chardata()}

	      Converts	a possibly deep	list of	characters and binaries	into a
	      Normalized Form of canonical equivalent Composed characters  ac-
	      cording to the Unicode standard.

	      Any binaries in the input	must be	encoded	with utf8 encoding.

	      The result is a list of characters.

              3> unicode:characters_to_nfc_list([<<"abc..a">>,[778],$a,[776],$o,[776]]).
              "abc..åäö"

       characters_to_nfc_binary(CD :: chardata()) ->
				   unicode_binary() |
				   {error, unicode_binary(), chardata()}

	      Converts	a possibly deep	list of	characters and binaries	into a
	      Normalized Form of canonical equivalent Composed characters  ac-
	      cording to the Unicode standard.

	      Any binaries in the input	must be	encoded	with utf8 encoding.

              The result is a UTF-8 encoded binary.

              4> unicode:characters_to_nfc_binary([<<"abc..a">>,[778],$a,[776],$o,[776]]).
              <<"abc..åäö"/utf8>>

       characters_to_nfd_list(CD :: chardata())	->
				 [char()] | {error, [char()], chardata()}

	      Converts	a possibly deep	list of	characters and binaries	into a
	      Normalized Form of canonical  equivalent	Decomposed  characters
	      according	to the Unicode standard.

	      Any binaries in the input	must be	encoded	with utf8 encoding.

	      The result is a list of characters.

              1> unicode:characters_to_nfd_list("abc..åäö").
	      [97,98,99,46,46,97,778,97,776,111,776]

       characters_to_nfd_binary(CD :: chardata()) ->
				   unicode_binary() |
				   {error, unicode_binary(), chardata()}

	      Converts	a possibly deep	list of	characters and binaries	into a
	      Normalized Form of canonical  equivalent	Decomposed  characters
	      according	to the Unicode standard.

	      Any binaries in the input	must be	encoded	with utf8 encoding.

              The result is a UTF-8 encoded binary.

              2> unicode:characters_to_nfd_binary("abc..åäö").
	      <<97,98,99,46,46,97,204,138,97,204,136,111,204,136>>

       characters_to_nfkc_list(CD :: chardata()) ->
				  [char()] |
				  {error, [char()], chardata()}

	      Converts	a possibly deep	list of	characters and binaries	into a
	      Normalized Form of compatibly equivalent Composed	characters ac-
	      cording to the Unicode standard.

	      Any binaries in the input	must be	encoded	with utf8 encoding.

	      The result is a list of characters.

              3> unicode:characters_to_nfkc_list([<<"abc..a">>,[778],$a,[776],$o,[776],[65299,65298]]).
              "abc..åäö32"

       characters_to_nfkc_binary(CD :: chardata()) ->
				    unicode_binary() |
				    {error, unicode_binary(), chardata()}

	      Converts	a possibly deep	list of	characters and binaries	into a
	      Normalized Form of compatibly equivalent Composed	characters ac-
	      cording to the Unicode standard.

	      Any binaries in the input	must be	encoded	with utf8 encoding.

              The result is a UTF-8 encoded binary.

              4> unicode:characters_to_nfkc_binary([<<"abc..a">>,[778],$a,[776],$o,[776],[65299,65298]]).
              <<"abc..åäö32"/utf8>>

       characters_to_nfkd_list(CD :: chardata()) ->
				  [char()] |
				  {error, [char()], chardata()}

	      Converts	a possibly deep	list of	characters and binaries	into a
	      Normalized Form of compatibly equivalent	Decomposed  characters
	      according	to the Unicode standard.

	      Any binaries in the input	must be	encoded	with utf8 encoding.

	      The result is a list of characters.

              1> unicode:characters_to_nfkd_list(["abc..åäö",[65299,65298]]).
	      [97,98,99,46,46,97,778,97,776,111,776,51,50]

       characters_to_nfkd_binary(CD :: chardata()) ->
				    unicode_binary() |
				    {error, unicode_binary(), chardata()}

	      Converts	a possibly deep	list of	characters and binaries	into a
	      Normalized Form of compatibly equivalent	Decomposed  characters
	      according	to the Unicode standard.

	      Any binaries in the input	must be	encoded	with utf8 encoding.

              The result is a UTF-8 encoded binary.

              2> unicode:characters_to_nfkd_binary(["abc..åäö",[65299,65298]]).
	      <<97,98,99,46,46,97,204,138,97,204,136,111,204,136,51,50>>

       encoding_to_bom(InEncoding) -> Bin

	      Types:

		 Bin = binary()
                    A binary() such that byte_size(Bin) >= 4.
		 InEncoding = encoding()

	      Creates  a  UTF  Byte Order Mark (BOM) as	a binary from the sup-
	      plied InEncoding.	The BOM	is, if supported at all,  expected  to
	      be placed	first in UTF encoded files or messages.

              The function returns <<>> for latin1 encoding, as there is no
              BOM for ISO Latin-1.

	      Notice that the BOM for UTF-8 is seldom used, and	it  is	really
	      not  a byte order	mark. There are	obviously no byte order	issues
	      with UTF-8, so the BOM is	only there to differentiate UTF-8  en-
	      coding from other	UTF formats.
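
              A typical use is to prefix encoded output with the matching
              BOM. The sketch below assumes the input is valid chardata(), so
              that characters_to_binary/3 returns a binary and not an error
              or incomplete tuple. The function with_bom/2 is a hypothetical
              helper.

```erlang
-module(bom_encode_example).
-export([with_bom/2]).

%% Hypothetical helper: encode Chars in Encoding and prepend the BOM for
%% that encoding. For latin1 the BOM is the empty binary. Assumes Chars
%% is valid, so the conversion result is a binary.
with_bom(Chars, Encoding) ->
    Bom = unicode:encoding_to_bom(Encoding),
    Body = unicode:characters_to_binary(Chars, unicode, Encoding),
    <<Bom/binary, Body/binary>>.
```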

Ericsson AB			  stdlib 3.8			    unicode(3)
