UTF(7)									UTF(7)

       UTF, Unicode, ASCII, rune - character set and format

       The  Plan  9  character set and representation are based	on the Unicode
       Standard	and on the ISO multibyte UTF-8 encoding	 (Universal  Character
       Set  Transformation  Format, 8 bits wide).  The Unicode Standard	repre-
       sents its characters in 16 bits;	UTF-8 represents  such	values	in  an
       8-bit  byte stream.  Throughout this manual, UTF-8 is shortened to UTF.

       In Plan 9, a rune is a 16-bit quantity representing a  Unicode  charac-
       ter.  Internally, programs may store characters as runes.  However, any
       external	manifestation of textual  information,	in  files  or  at  the
       interface  between  programs,  uses  a machine-independent, byte-stream
       encoding	called UTF.

       UTF is designed so that the 7-bit ASCII set (hexadecimal values 00 to
       7F) appears only as itself in the encoding.  Runes with values above 7F
       appear as sequences of two or more bytes with values only from 80 to
       FF.

       The  UTF	 encoding  of the Unicode Standard is backward compatible with
       ASCII: programs presented only with ASCII work on Plan 9	 even  if  not
       written	to  deal with UTF, as do programs that deal with uninterpreted
       byte streams.  However, programs	that perform  semantic	processing  on
       ASCII  graphic  characters  must	 convert from UTF to runes in order to
       work properly with non-ASCII input.  See	rune(3).
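
       The distinction above can be illustrated with a minimal sketch.  It
       assumes Python's built-in "utf-8" codec as a stand-in for UTF (the
       two agree for the characters used here); the helper names
       split_fields and count_bytes_and_runes are hypothetical and are not
       part of any Plan 9 library.  Splitting on an ASCII delimiter is safe
       on the raw bytes, but counting characters requires conversion to
       runes.

```python
def split_fields(data: bytes, delim: bytes = b"/") -> list:
    """Split a UTF byte stream on an ASCII delimiter without decoding it.

    This is safe because bytes 00-7F never occur inside a multibyte sequence.
    """
    return data.split(delim)


def count_bytes_and_runes(data: bytes) -> tuple:
    """Return (byte count, rune count) for a UTF-encoded byte string."""
    return len(data), len(data.decode("utf-8"))


# Hypothetical sample text containing two non-ASCII characters.
sample = "na\u00efve/caf\u00e9".encode("utf-8")

print(split_fields(sample))           # [b'na\xc3\xafve', b'caf\xc3\xa9']
print(count_bytes_and_runes(sample))  # (12, 10)
```

       The byte and rune counts differ, which is why a program that treats
       each byte as one character gives wrong results on non-ASCII input.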

       Letting numbers be binary, a rune x is converted	 to  a	multibyte  UTF
       sequence	as follows:

       01.   x in [00000000.0bbbbbbb] → 0bbbbbbb
       10.   x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
       11.   x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb

       Conversion 01 provides a	one-byte sequence that spans the ASCII charac-
       ter set in a compatible way.  Conversions 10 and	11  represent  higher-
       valued  characters as sequences of two or three bytes with the high bit
       set.  Plan 9 does not support the 4, 5, and 6 byte  sequences  proposed
       by X-Open.  When	there are multiple ways	to encode a value, for example
       rune 0, the shortest encoding is	used.
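
       The three conversions can be sketched in a few lines.  This is an
       illustration only, not the Plan 9 implementation (see rune(3) for
       the real interface); the function name rune_to_utf is hypothetical,
       and the range check is an assumption that runes are 16-bit values as
       described above.

```python
def rune_to_utf(x: int) -> bytes:
    """Encode one 16-bit rune, always choosing the shortest conversion."""
    if not 0 <= x <= 0xFFFF:
        # Assumption: a rune is a 16-bit quantity, as this page describes.
        raise ValueError("rune must fit in 16 bits")
    if x <= 0x7F:
        # Conversion 01: 0bbbbbbb
        return bytes([x])
    if x <= 0x7FF:
        # Conversion 10: 110bbbbb, 10bbbbbb
        return bytes([0xC0 | (x >> 6), 0x80 | (x & 0x3F)])
    # Conversion 11: 1110bbbb, 10bbbbbb, 10bbbbbb
    return bytes([
        0xE0 | (x >> 12),
        0x80 | ((x >> 6) & 0x3F),
        0x80 | (x & 0x3F),
    ])


print(rune_to_utf(0x41).hex())    # 41
print(rune_to_utf(0xE9).hex())    # c3a9
print(rune_to_utf(0x20AC).hex())  # e282ac
```

       Testing the ranges from smallest to largest is what guarantees the
       shortest encoding: rune 0 matches conversion 01 first and is emitted
       as the single byte 00.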

       In the inverse mapping, any sequence except those  described  above  is
       incorrect and is	converted to rune hexadecimal 0080.
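
       The inverse mapping can be sketched the same way.  This is a
       hypothetical decoder, not the Plan 9 code: the name utf_to_rune is
       invented for illustration, and consuming exactly one byte after an
       incorrect sequence is an assumption, since this page specifies only
       the resulting rune (hexadecimal 0080), not how far to advance.

```python
from typing import Tuple

RUNE_ERROR = 0x0080  # rune produced for any incorrect sequence


def utf_to_rune(buf: bytes) -> Tuple[int, int]:
    """Decode the first rune in buf; return (rune, bytes consumed)."""
    if not buf:
        raise ValueError("empty input")
    c0 = buf[0]
    if c0 < 0x80:
        # Conversion 01: a single ASCII byte.
        return c0, 1
    if len(buf) >= 2 and (buf[1] & 0xC0) == 0x80:
        c1 = buf[1] & 0x3F
        if (c0 & 0xE0) == 0xC0:
            # Conversion 10; reject values that fit in one byte.
            x = ((c0 & 0x1F) << 6) | c1
            if x > 0x7F:
                return x, 2
        elif (c0 & 0xF0) == 0xE0 and len(buf) >= 3 and (buf[2] & 0xC0) == 0x80:
            # Conversion 11; reject values that fit in two bytes.
            x = ((c0 & 0x0F) << 12) | (c1 << 6) | (buf[2] & 0x3F)
            if x > 0x7FF:
                return x, 3
    # Anything else is incorrect. Assumption: skip one byte and resynchronize.
    return RUNE_ERROR, 1


print(utf_to_rune(b"\xe2\x82\xac"))  # (8364, 3)
print(utf_to_rune(b"\xc0\x80"))      # (128, 1)
```

       In this sketch a caller can tell a correctly encoded rune 0080
       (bytes C2 80, two bytes consumed) from the error result (one byte
       consumed) by the returned length.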

       ascii(1), tcs(1), rune(3), The Unicode Standard.