UTF(7)		       Miscellaneous Information Manual			UTF(7)

       UTF, Unicode, ASCII, rune - character set and format

       The  Plan  9  character set and representation are based	on the Unicode
       Standard	and on the ISO multibyte UTF-8 encoding	 (Universal  Character
       Set  Transformation  Format, 8 bits wide).  The Unicode Standard	repre-
       sents its characters in 16 bits;	UTF-8 represents  such	values	in  an
       8-bit byte stream.  Throughout this manual, UTF-8 is shortened to UTF.

       In  Plan	 9, a rune is a	16-bit quantity	representing a Unicode charac-
       ter.  Internally, programs may store characters as runes.  However, any
       external	 manifestation	of textual information,	in files or at the in-
       terface between programs, uses a	machine-independent,  byte-stream  en-
       coding called UTF.

       UTF is designed so that the 7-bit ASCII set (values hexadecimal 00 to
       7F) appears only as itself in the encoding.  Runes with values above 7F
       appear as sequences of two or more bytes with values only from 80 to
       FF.

       The UTF encoding	of the Unicode Standard	is  backward  compatible  with
       ASCII:  programs	 presented  only with ASCII work on Plan 9 even	if not
       written to deal with UTF, as do programs	that deal  with	 uninterpreted
       byte  streams.	However,  programs that	perform	semantic processing on
       ASCII graphic characters	must convert from UTF to  runes	 in  order  to
       work properly with non-ASCII input.  See	rune(3).
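
       As a small illustration of why byte-oriented code is not enough for
       semantic processing, the sketch below counts runes rather than bytes.
       It assumes well-formed UTF input and relies on the fact that every
       non-leading byte of a multibyte sequence has the form 10bbbbbb.  The
       name utf_rune_count is hypothetical and is not one of the routines
       documented in rune(3).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of rune(3)): count the runes in a
 * NUL-terminated, well-formed UTF string by counting every byte that
 * is not a continuation byte (10bbbbbb).
 */
size_t utf_rune_count(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;    /* ASCII byte or first byte of a multibyte sequence */
    }
    return n;
}
```

       For a string holding the three ASCII letters "caf" followed by the
       two-byte encoding of rune 00E9, a byte count gives 5 while this
       function gives 4.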

       Letting numbers be binary, a rune x is converted	to a multibyte UTF se-
       quence as follows:

       01.   x in [00000000.0bbbbbbb] -> 0bbbbbbb
       10.   x in [00000bbb.bbbbbbbb] -> 110bbbbb, 10bbbbbb
       11.   x in [bbbbbbbb.bbbbbbbb] -> 1110bbbb, 10bbbbbb, 10bbbbbb

       Conversion 01 provides a	one-byte sequence that spans the ASCII charac-
       ter  set	 in a compatible way.  Conversions 10 and 11 represent higher-
       valued characters as sequences of two or	three bytes with the high  bit
       set.   Plan  9 does not support the 4, 5, and 6 byte sequences proposed
       by X-Open.  When	there are multiple ways	to encode a value, for example
       rune 0, the shortest encoding is	used.
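
       The three conversions can be sketched in C as follows.  This is a
       minimal illustration under the 16-bit rune assumption stated in this
       page, not the Plan 9 implementation; the names Rune16 and rune_to_utf
       are hypothetical.  Selecting the conversion by the value of the rune,
       smallest range first, is what yields the shortest encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type name for a 16-bit rune; see rune(3) for the real API. */
typedef uint16_t Rune16;

/*
 * Encode one rune into buf, which must have room for 3 bytes.
 * Returns the number of bytes written (1, 2, or 3).
 */
size_t rune_to_utf(unsigned char *buf, Rune16 x)
{
    assert(buf != NULL);
    if (x <= 0x7F) {            /* conversion 01: 0bbbbbbb */
        buf[0] = (unsigned char)x;
        return 1;
    }
    if (x <= 0x7FF) {           /* conversion 10: 110bbbbb, 10bbbbbb */
        buf[0] = (unsigned char)(0xC0 | (x >> 6));
        buf[1] = (unsigned char)(0x80 | (x & 0x3F));
        return 2;
    }
    /* conversion 11: 1110bbbb, 10bbbbbb, 10bbbbbb */
    buf[0] = (unsigned char)(0xE0 | (x >> 12));
    buf[1] = (unsigned char)(0x80 | ((x >> 6) & 0x3F));
    buf[2] = (unsigned char)(0x80 | (x & 0x3F));
    return 3;
}
```

       For example, rune 0041 is written as the single byte 41, rune 00E9 as
       the two bytes C3 A9, and rune 20AC as the three bytes E2 82 AC.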

       In  the	inverse	 mapping, any sequence except those described above is
       incorrect and is	converted to rune hexadecimal 0080.
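
       A decoder following this rule might look like the sketch below.  It is
       an illustration only: the names utf_to_rune and RUNE_ERROR are
       hypothetical, and consuming exactly one byte after a bad sequence is
       an assumption made here, not something this page specifies.  Overlong
       forms, such as a two-byte encoding of rune 0, are treated as incorrect
       because only the shortest encoding is valid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { RUNE_ERROR = 0x80 };     /* hypothetical name for the error rune */

/*
 * Decode one rune from the n bytes at s (n > 0) into *r.
 * Returns the number of bytes consumed.  Any incorrect sequence
 * yields RUNE_ERROR and consumes one byte (an assumption).
 */
size_t utf_to_rune(uint16_t *r, const unsigned char *s, size_t n)
{
    assert(r != NULL && s != NULL && n > 0);
    unsigned c = s[0];

    if (c < 0x80) {                             /* 0bbbbbbb */
        *r = (uint16_t)c;
        return 1;
    }
    if ((c & 0xE0) == 0xC0) {                   /* 110bbbbb, 10bbbbbb */
        if (n >= 2 && (s[1] & 0xC0) == 0x80) {
            unsigned x = ((c & 0x1F) << 6) | (s[1] & 0x3F);
            if (x >= 0x80) {                    /* reject overlong forms */
                *r = (uint16_t)x;
                return 2;
            }
        }
    } else if ((c & 0xF0) == 0xE0) {            /* 1110bbbb, 10bbbbbb, 10bbbbbb */
        if (n >= 3 && (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
            unsigned x = ((c & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F);
            if (x >= 0x800) {                   /* reject overlong forms */
                *r = (uint16_t)x;
                return 3;
            }
        }
    }
    *r = RUNE_ERROR;                            /* anything else is incorrect */
    return 1;
}
```

       Note that a correctly encoded rune 0080 (bytes C2 80) decodes to the
       same value as the error rune, so a caller that needs to tell them
       apart has to compare the returned length as well.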

       ascii(1), tcs(1), rune(3), The Unicode Standard.


