ucto(1)			    General Commands Manual		       ucto(1)

       ucto - Unicode Tokenizer

       ucto [options] [input-file] [output-file]

       ucto tokenizes text files: it separates words from punctuation,
       splits sentences	(and optionally	paragraphs), and finds paired  quotes.
       Ucto is preconfigured with tokenisation rules for several languages.

       -c configfile
	      read settings from a file

       -d value
	      set debug	mode to	'value'

       -e value
	      set input	encoding. (default UTF8)

       -N value
	      set UTF8 output normalization. (default NFC)

	      Disable filtering of special characters (default YES). These
	      special characters can be specified in the [FILTER] block of the
	      configuration file.

	      OBSOLETE. Use --filter=NO instead.

       -L language
	      Automatically  selects  a	 configuration	file by	language code.
	      The language code	is generally a	three-letter  iso-639-3	 code.
	      For  example,  'fra' will	select the file	tokconfig-fra from the
	      installation directory.
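
	      The mapping from language code to configuration file described
	      above can be sketched in Python. This is only an illustration:
	      the helper names config_name and build_command, and the file
	      names input.txt and output.txt, are hypothetical. Only the -L
	      option and the positional arguments from the synopsis are used.

```python
import shlex


def config_name(language: str) -> str:
    """Return the configuration file name that -L is described as selecting."""
    return f"tokconfig-{language}"


def build_command(language: str, input_file: str, output_file: str) -> list[str]:
    """Build an argument list for ucto using only the documented -L option."""
    return ["ucto", "-L", language, input_file, output_file]


# Hypothetical file names, used only for illustration.
cmd = build_command("fra", "input.txt", "output.txt")
print(shlex.join(cmd))     # the command line as it would be typed in a shell
print(config_name("fra"))  # the configuration file looked up for 'fra'
```

	      The argument list can be passed to subprocess.run on a system
	      where ucto is installed.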

	      Try to detect all the specified languages. The default language
	      will be 'lang1' (only useful for FoLiA output).

	      Convert to all lowercase

	      Convert to all uppercase

	      Emit one sentence	per line on output

	      Assume one sentence per line on input

	      Map all occurrences of tokens with class1,...,classn to their
	      generic names, e.g. --normalize=DATE will map all dates to the
	      word {{DATE}}. Very useful to normalize tokens like URLs,
	      dates, e-mail addresses and so on.
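
	      The effect of class normalization can be sketched as follows.
	      This is a minimal illustration and not ucto's implementation:
	      the function normalize, the (text, class) pair representation,
	      and the class names other than DATE are assumptions.

```python
def normalize(tokens, classes):
    """Replace each token whose class is in `classes` by its generic name.

    `tokens` is a list of (text, token_class) pairs; this representation is
    an assumption made for the sketch.
    """
    wanted = set(classes)
    return [
        "{{" + token_class + "}}" if token_class in wanted else text
        for text, token_class in tokens
    ]


# Hand-written sample tokens; the class names WORD and PUNCTUATION are illustrative.
tokens = [
    ("Meeting", "WORD"),
    ("on", "WORD"),
    ("2018-11-13", "DATE"),
    (".", "PUNCTUATION"),
]
print(" ".join(normalize(tokens, ["DATE"])))
```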

	      Add additional tokens to the [TOKENS] block of the default
	      language. The file should contain one TOKEN per line.

	      Don't tokenize, but perform input	decoding and simple token role

	      Remove most of the punctuation from the output (not from
	      abbreviations and embedded punctuation like John's).

	      Disable Paragraph	Detection

	      Enable  Quote  Detection.	 (this is experimental and may lead to
	      unexpected results)

       -s <string>
	      Set End-of-sentence marker. (Default <utt>)
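
	      A downstream script can use the end-of-sentence marker to
	      recover sentence boundaries. The sketch below assumes that the
	      marker appears as a separate whitespace-delimited token after
	      each sentence. The function split_sentences and the sample
	      string are hypothetical and are not actual ucto output.

```python
def split_sentences(tokenized: str, marker: str = "<utt>") -> list[list[str]]:
    """Split whitespace-separated tokens into sentences at each marker."""
    sentences: list[list[str]] = []
    current: list[str] = []
    for token in tokenized.split():
        if token == marker:
            # The marker closes the current sentence; skip empty sentences.
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(token)
    if current:
        # Keep trailing tokens that are not followed by a marker.
        sentences.append(current)
    return sentences


# Hand-written sample in the assumed format.
sample = "Hello , world . <utt> How are you ? <utt>"
for sentence in split_sentences(sample):
    print(" ".join(sentence))
```

	      If a different marker is set with -s, pass the same string as
	      the marker argument.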

	      Show version information

	      set Verbose mode

	      Read a FoLiA XML document, tokenize it, and output the modified
	      document (this disables usage of most other options: -nPQvs).
	      For files with an '.xml' extension, -F is the default.

	      When tokenizing a	FoLiA XML document, search for text  nodes  of
	      class 'cls'.  The	default	is "current".

	      When  tokenizing a FoLiA XML document, output the	tokenized text
	      in text nodes with 'cls'.	 The default is	"current".  It is rec-
	      ommended to have different classes for input and output.

	      Use 'cls' for input and output of text from FoLiA. Equivalent to
	      both --inputclass='cls' and --outputclass='cls'.

	      This option is obsolete and NOT recommended. Please use the
	      separate --inputclass and --outputclass options.

	      Output FoLiA XML (this disables usage of most other options:
	      -nPQvs).

       --id <DocId>
	      Use the specified	Document ID for	the FoLiA XML

       -x <DocId> (obsolete)
	      Output FoLiA XML,	use the	specified Document ID. (this  disables
	      usage of most other options: -nPQvs).

	      Obsolete. Use -X and --id instead.


       Maarten van Gompel

       Ko van der Sloot

				  2018 nov 13			       ucto(1)