unidesc(1)		    General Commands Manual		    unidesc(1)

       unidesc - Describe the contents of a Unicode text file

       unidesc ([option	flags])	(<file name>)

       If no input file name is supplied, unidesc reads from the standard input.

       unidesc describes the content of	a Unicode text file by	reporting  the
       character  ranges  to which different portions of the text belong.  The
       ranges reported include both official Unicode ranges and	the  construc-
       ted  language  ranges  within the Private Use Areas registered with the
       Conscript Unicode Registry.
       For each	range of characters, unidesc prints the	character or byte off-
       set of the beginning of the range, the character	or byte	offset of  the
       end of the range, and the name of the range. Offsets start from 0.

       Since the ASCII digits, punctuation, and	whitespace characters are fre-
       quently used by other writing systems, by default these characters  are
       treated	as  neutral, that is, as not belonging exclusively to any par-
       ticular character range.	 These characters are treated as belonging  to
       the range of whatever characters	precede	them.

       If  the	input  begins with neutral characters, they are	treated	as be-
       longing to the range of whatever	characters follow them.	 If  the  file
       consists	 entirely  of  neutral	characters, the	range is identified as
       Neutral followed	by Basic Latin in square brackets.
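
       The neutral-character rules above can be illustrated with a short
       Python sketch. It is not unidesc's own code: the three-entry RANGES
       table, the function names, the inclusive end offsets, and the exact
       label used for all-neutral input are assumptions made for this
       example.

```python
import string

# Hypothetical, heavily abbreviated range table (assumption for this sketch).
RANGES = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
]


def is_neutral(ch):
    """ASCII digits, punctuation, and whitespace are treated as neutral."""
    return ch.isascii() and (
        ch.isdigit() or ch.isspace() or ch in string.punctuation
    )


def range_name(ch):
    """Look up the range containing ch in the sample table."""
    cp = ord(ch)
    for lo, hi, name in RANGES:
        if lo <= cp <= hi:
            return name
    return "Unknown"


def describe(text):
    """Return (start, end, name) tuples using 0-based character offsets.

    The end offset is inclusive here; that is an assumption of the sketch.
    """
    names = [None if is_neutral(ch) else range_name(ch) for ch in text]

    # Neutral characters join the range of whatever precedes them.
    for i in range(1, len(names)):
        if names[i] is None:
            names[i] = names[i - 1]

    # Any None left is a leading neutral run; it joins what follows.
    first = next((n for n in names if n is not None), None)
    if first is None:
        # Input is empty or consists entirely of neutral characters.
        return [(0, len(text) - 1, "Neutral [Basic Latin]")] if text else []
    names = [first if n is None else n for n in names]

    # Collapse consecutive characters with the same range into one span.
    spans = []
    start = 0
    for i in range(1, len(names) + 1):
        if i == len(names) or names[i] != names[start]:
            spans.append((start, i - 1, names[start]))
            start = i
    return spans
```

       In this sketch the leading "12 " of the string "12 abc, " followed by
       three Greek letters and "!" is absorbed into Basic Latin, the ", "
       after "abc" stays with Basic Latin, and the final "!" is absorbed into
       the Greek range.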

       A magic number identifying the Unicode encoding is not part of the Uni-
       code  standard,	so  pure  Unicode files	do not contain a magic number.
       However,	informal conventions have arisen for  this  purpose.   If  the
       command	line  flag  -m	is given, unidesc will attempt to identify the
       Unicode subtype by examining the	first few bytes	of the input.  If  the
       input is	identified as one of the two acceptable	types, UTF-8 or	native
       order UTF-32, it	will then proceed to describe the contents of the  in-
       put.  Otherwise,	it will	report what it has learned and exit. Note that
       if the file does	contain	a magic	number,	you  must  use	the  -m	 flag.
       Without	this flag unidesc assumes that the input consists of pure Uni-
       code with the character data beginning immediately.  It will  therefore
       be thrown off by	the magic number.
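
       As an illustration of examining the first few bytes, the sketch below
       assumes that the informal convention in question is the byte order
       mark (BOM). The SIGNATURES table and the function names sniff and
       is_acceptable are hypothetical and are not taken from unidesc; the BOM
       constants come from Python's standard codecs module.

```python
import codecs

# UTF-32 is listed before UTF-16 because the UTF-32 little-endian mark
# begins with the same two bytes as the UTF-16 little-endian mark.
SIGNATURES = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF32_LE, "UTF-32 little-endian"),
    (codecs.BOM_UTF32_BE, "UTF-32 big-endian"),
    (codecs.BOM_UTF16_LE, "UTF-16 little-endian"),
    (codecs.BOM_UTF16_BE, "UTF-16 big-endian"),
]


def sniff(head):
    """Return (subtype name, mark length) for the leading bytes of the input.

    Returns (None, 0) when no known mark is present.
    """
    for bom, name in SIGNATURES:
        if head.startswith(bom):
            return name, len(bom)
    return None, 0


def is_acceptable(head):
    """True for the two input types the text names: UTF-8 or native order UTF-32.

    codecs.BOM_UTF32 is the mark in the byte order of the running machine.
    """
    return head.startswith(codecs.BOM_UTF8) or head.startswith(codecs.BOM_UTF32)
```

       The returned mark length tells the caller how many bytes to skip
       before the character data begins.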

       By  default, input is expected to be UTF-8. Native order	UTF-32 is also
       acceptable.  UTF-32 may be specified via	the command line flag  -u  or,
       if the command line flag	-m is given, via the magic number.
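
       "Native order" means the byte order of the machine on which the
       program runs. A minimal Python sketch of decoding such input follows;
       the function name is hypothetical, and this is not how unidesc itself
       reads its input.

```python
import sys


def decode_native_utf32(data):
    """Decode bytes as UTF-32 in the byte order of the running machine."""
    codec = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"
    return data.decode(codec)
```

       Because the byte order is stated explicitly, Python does not strip a
       leading byte order mark here; it is returned as the character U+FEFF
       at the start of the text, which would then be counted as part of the
       character data.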

       -b     Give file	offsets	in bytes rather	than characters.

       -d     Treat  the  ASCII	 digits	 as belonging exclusively to the Basic
	      Latin range.

       -h     Print usage information.

       -L     List the Unicode ranges alphabetically.

       -l     List the Unicode ranges by codepoint.

       -m     Check the	file's magic number to determine the Unicode subtype.

       -p     Treat ASCII punctuation as belonging exclusively	to  the	 Basic
	      Latin range.

       -r     Instead of listing ranges	as they	are encountered, just list the
	      ranges detected after all	input has been read.

       -u     Input is native order UTF-32.

       -v     Print version information.

       -w     Treat ASCII whitespace as	belonging  exclusively	to  the	 Basic
	      Latin range.
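
       The difference between the default character offsets and the byte
       offsets selected by -b matters for UTF-8 input, where one character
       occupies one to four bytes. The Python sketch below shows the mapping;
       the function name is an assumption made for this example.

```python
def utf8_byte_offsets(text):
    """Map each 0-based character offset to its 0-based UTF-8 byte offset."""
    offsets = []
    pos = 0
    for ch in text:
        offsets.append(pos)
        # Each character advances the byte position by its encoded length.
        pos += len(ch.encode("utf-8"))
    return offsets
```

       For the three characters "a", Greek beta, "c", the character offsets
       are 0, 1, 2, while the byte offsets are 0, 1, 3, since beta takes two
       bytes in UTF-8.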


       Unicode Standard, version 5.0

       Bill Poser

       GNU General Public License

				  June,	2007			    unidesc(1)

