ucto(1)			    General Commands Manual		       ucto(1)

NAME
       ucto - Unicode Tokenizer

SYNOPSIS
       ucto [options] [input-file] [output-file]

DESCRIPTION
       ucto tokenizes text files: it separates words from punctuation, splits
       sentences (and optionally paragraphs), and finds paired quotes. It is
       preconfigured with tokenization rules for several languages.

OPTIONS
       -c configfile
	      Read settings from a file

       -d value
	      Set debug mode to 'value'

       -e value
	      Set the input encoding (default: UTF8)

       -N value
	      Set the UTF8 output normalization (default: NFC)

       --filter=[YES|NO]
	      Enable or disable filtering of special characters (default: YES).
	      These special characters can be specified in the [FILTER] block
	      of the configuration file.

       -f
	      Obsolete. Use --filter=NO instead.

       -L language
	      Automatically select a configuration file by language code.
	      The language code is generally a three-letter ISO 639-3 code.
	      For example, 'fra' selects the file tokconfig-fra from the
	      installation directory.

       --detectlanguages=<lang1,lang2,..,langn>
	      Try to detect all the specified languages. The default language
	      is 'lang1'. (Only useful for FoLiA output.)

       -l
	      Convert to all lowercase

       -u
	      Convert to all uppercase

       -n
	      Emit one sentence	per line on output

       -m
	      Assume one sentence per line on input

       --normalize=class1,class2,..,classn
	      Map all occurrences of tokens with class1,...,classn to their
	      generic names. For example, --normalize=DATE maps all dates to
	      the word {{DATE}}. Useful for normalizing tokens such as URLs,
	      dates, and e-mail addresses.

       --add-tokens="file"
	      Add tokens to the [TOKENS] block of the default language. The
	      file should contain one token per line.

       --passthru
	      Don't tokenize, but perform input	decoding and simple token role
	      detection

       --filterpunct
	      Remove most of the punctuation from the output (but not from
	      abbreviations or embedded punctuation, as in John's)

       -P
	      Disable Paragraph	Detection

       -Q
	      Enable Quote Detection. This is experimental and may lead to
	      unexpected results.

       -s <string>
	      Set the end-of-sentence marker (default: <utt>)

       -V
	      Show version information

       -v
	      Set verbose mode

       -F
	      Read a FoLiA XML document, tokenize it, and output the modified
	      document. This disables most other options: -nPQvs. For files
	      with an '.xml' extension, -F is the default.

       --inputclass="cls"
	      When tokenizing a	FoLiA XML document, search for text  nodes  of
	      class 'cls'.  The	default	is "current".

       --outputclass="cls"
	      When tokenizing a FoLiA XML document, output the tokenized text
	      in text nodes of class 'cls'. The default is "current". It is
	      recommended to use different classes for input and output.

       --textclass="cls" (obsolete)
	      Use 'cls' for input and output of text from FoLiA. Equivalent to
	      both --inputclass='cls' and --outputclass='cls'.

	      This option is obsolete and NOT recommended. Please use the
	      separate --inputclass= and --outputclass= options.

       -X
	      Output FoLiA XML. This disables most other options: -nPQvs.

       --id <DocId>
	      Use the specified	Document ID for	the FoLiA XML

       -x <DocId> (obsolete)
	      Output FoLiA XML, using the specified Document ID. This disables
	      most other options: -nPQvs.

	      This option is obsolete. Use -X and --id instead.
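
EXAMPLES
       The options above can be combined when ucto is driven from a script.
       The following Python sketch is illustrative only and is not part of
       ucto. The helper names build_ucto_command and split_sentences and the
       file names are hypothetical. The sketch uses only the -L, -n and
       --normalize= options as documented above. It also assumes that the
       default output is a stream of whitespace-separated tokens in which the
       end-of-sentence marker (<utt> unless changed with -s) appears as a
       token of its own after each sentence.

```python
from typing import List, Optional, Sequence


def build_ucto_command(
    input_file: str,
    output_file: Optional[str] = None,
    language: Optional[str] = None,
    sentence_per_line: bool = False,
    normalize: Sequence[str] = (),
) -> List[str]:
    """Assemble an argument list for ucto (hypothetical helper)."""
    cmd = ["ucto"]
    if language is not None:
        # -L selects a configuration file by language code, e.g. 'fra'.
        cmd += ["-L", language]
    if sentence_per_line:
        # -n emits one sentence per line on output.
        cmd.append("-n")
    if normalize:
        # --normalize=class1,..,classn maps tokens to generic names.
        cmd.append("--normalize=" + ",".join(normalize))
    cmd.append(input_file)
    if output_file is not None:
        cmd.append(output_file)
    return cmd


def split_sentences(tokenized: str, marker: str = "<utt>") -> List[List[str]]:
    """Group tokens into sentences at each end-of-sentence marker.

    Assumes whitespace-separated tokens with the marker as its own token.
    """
    sentences: List[List[str]] = []
    current: List[str] = []
    for token in tokenized.split():
        if token == marker:
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(token)
    if current:
        # Keep trailing tokens that were not followed by a marker.
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    # Print the command line; pass the list to subprocess.run() to execute it.
    print(" ".join(build_ucto_command("input.txt", "output.txt",
                                      language="fra", normalize=["DATE"])))
    print(split_sentences("Hello , world . <utt> Bye . <utt>"))
```

       Building the command as an argument list rather than a single string
       avoids shell quoting problems with file names. If -s is used to set a
       different marker, pass the same string as the marker argument of
       split_sentences.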

BUGS
       likely

AUTHORS
       Maarten van Gompel proycon@anaproy.nl

       Ko van der Sloot	Timbl@uvt.nl

				  2018 nov 13			       ucto(1)
