Skip site navigation (1)Skip section navigation (2)

FreeBSD Manual Pages


home | help
fst-compiler(1)			 fst-compiler		       fst-compiler(1)

       fst-compiler fst-compiler-utf8 -	Two compilers for SFST programs

       fst-compiler grammar-file [ output-file ]
       fst-compiler-utf8 grammar-file [	output-file ]

       -c     Store  the  transducer  in  compact format which is used by fst-

       -l     Store the	transducer in lowmem format.

       -s     Switch surface and analysis layer	of the transducer. You have to
	      use  this	switch in order	to use fst-infl	(fst-infl2, fst-infl3)
	      for generation rather than analysis.

       fst-compiler is a compiler for  finite-state  transducer	 programs.  It
       generates  a  minimized	finite state transducer	which can be used with
       fst-mor,	fst-infl, fst-print, fst-compare, fst-parse, and  fst-lattice.
       The  compact  transducer	 representation	which is generated with	the -c
       flag, is	supported by fst-infl2,	fst-train, and fst-match.  The memory-
       efficient  transducer  representation  which  is	 generated with	the -l
       flag, is	only supported by fst-infl3.

       The first program argument is the name of a  file  which	 contains  the
       transducer  program.  The  programming language is described below. The
       second argument is the name of the file to which	the  resulting	trans-
       ducer  will be written in binary	form. If a second argument is missing,
       the output will be written to stdout.

       fst-compiler-utf8 differs from fst-compiler only	in the	character  en-
       coding.	 fst-compiler-utf8  supports UTF8 encoding of the source files
       whereas fst-compiler is to be  used  for	 8-Bit	character  codes  like
       latin1  which are an extension of the ASCII code. Information about the
       encoding	is stored in the transducer files and used by the  other  SFST

       A transducer program consists of	an (optional) sequence of alphabet and
       variable	definitions followed by	a single transducer  expression	 which
       defines the result transducer.


       An  alphabet  definition	consists of the	keyword	ALPHABET followed by =
       and some	transducer expression e.g.

	   ALPHABET = [a-z]:[A-Z]

       This command redefines the alphabet as the set of symbol	 pairs	occur-
       ring on the transitions of the transducer. Occurrences of two-level op-
       erators,	negation operators and unquoted	periods	always have to be pre-
       ceded by	an alphabet definition.


       There  are  two different types of variables.  Symbol set variables are
       enclosed	by hash	signs (#) and take symbol  sequences  (see  below)  as

	   #UC#	= A-Z
	   #LC#	= a-z

       Transducer  variables  are enclosed by dollar signs and take transducer
       expressions as values:

	   $MAP$ = [a-z]:[A-Z]+
	   $MAP$ = [#LC#]:[#UC#]+

       Variables whose name starts with	the symbol `=' are  special  agreement
       variables.  If  an agreement variable occurs more than once in a	trans-
       ducer expression, it will always	have the same value. Consider the fol-
       lowing transducer program:

	   $=1$	= [abc]
	   $=1$	X $=1$

       The  result  transducer	recognizes the strings aXa, bXb, and cXc. Only
       acyclic transducers (i.e. transducers with a finite set of string  map-
       pings) can be assigned to agreement variables.


       A symbol	is either

       - a single character like A s 5,

       - a quoted character like \* or \_,

       - a multi-character symbol like <X> or <ab.c5> (which is	always
	 enclosed in angle brackets) or

       - a backslash followed by a number which	is the numeric code of the
	 designated character

       - the null symbol <>.

       Symbol sequence

       A  symbol sequence is a sequence	of characters, multi-character symbols
       and character ranges, e.g. a-z \. <x>.

       symbol range

       A symbol	range is either

       - a single symbol

       - a symbol sequence enclosed in square brackets like [A-Za-z] or

       - a symbol sequence starting with ^ and	enclosed  in  square  brackets
       like [^A-Za-z] (designating the complement of [a-zA-Z]) or

       - the period (which represents any symbol from the alphabet)

       Transducer expressions

       A transducer expression (TE) is recursively defined as follows:

       - A pair	of two symbol ranges separated by a colon is a TE.


       - A single symbol range like [a-z] is a TE.
	 It is a short form for	[a-z]:[a-z].

       - Two symbol sequences enclosed in braces and separated by a colon are
	a TE. {a[bc]}:{def} is equivalent to a:d b:e <>:f | a:d	c:e <>:f.

       - X Y is	a TE if	X and Y	are TEs.
	 (Blanks are ignored unless they are quoted.)

       - (X) is	a TE if	X is a TE.

       -  X  op	 is a TE is X is a TE and op is	either * (Kleene's star	opera-
       tor), +
	(Kleene's plus operator), or ? (optionality operator)

       - op X is a TE is X is a	TE and op is either ! (negation	operator), ^
	(target	language extraction operator), _ (source  language  extraction
	operator), or ^_ (source and target switch operator).

       - X op Y	is a TE	is X and Y are TEs and op is either & (conjunction
	operator),  |  (disjunction operator), || (composition operator), or -
	(subtraction operator)

       - L x op	y R is a TE if L and R are TEs,	x and y	are symbol ranges and
	op is either =>	(two-level restriction), <= (two-level	coercion),  or
	<=> (two-level restriction and coercion).

       - X op L__R is a	TE if X, L and R are TEs and op	is either ^-> (upward
	replacement),  _->  (downward replacement), /->	(leftward replacement)
	or \-> (rightward replacement).	Furthermore, L and R must  define  au-
	tomata (i.e. which map their strings onto themselves). These operators
	correspond to Karttunen's replace operators. If	the arrow is  followed
	by a question mark (?),	the replacement	becomes	optional.

       - X << l	is a TE	if X is	a TE, and l is either of the form
	a  or the form a:b where a and b are single characters or symbols. The
	result is a transducer where l was freely inserted into	X. The	trans-
	ducer ab << c for instance is equivalent to c*ac*bc*.

       - X op Y	L1__R2,	... , LN__RN is	a TE if	X,Y, L1	through	LN and R1
	through	 RN  are  TEs,	and  op	is either => (general restriction), <=
	(general coercion), ^=>	(general surface  restriction),	 ^<=  (general
	surface	 coercion),  ^<=>  (general surface restriction	and coercion),
	_=> (general deep restriction),	 _<=  (general	deep  coercion),  _<=>
	(general  deep restriction and coercion). (These operators were	imple-
	mented following a suggestion by Anssi Yli-Jyra.)

       - "fname" is a TE. The compiler reads the file named fname and turns
	it into	a transducer of	the form line1|line2|line3|... where linex  is
	the  x-th  line	of the file. All characters other than : and \ are in-
	terpreted literally (i.e. not as operators). This TE is	typically used
	e.g. to	read morpheme list from	a file.

       - "<fname>" is a	TE. The	compiler reads a pre-compiled transducer from
	the file named fname. This

       Further Features

       Comments	 start with the	symbol % and extend up to the end of the line.
       Blanks are ignored unless they are quoted. Expressions terminate	at the
       end  of	a line unless the end of line is preceded by a backslash.  The

       #include	"fname"

       can be used to insert source code from a	file named fname.  The command

       RE >> "fname"

       stores the regular expression RE	in the file fname.  The	command

       #use hopcroft

       tells the compiler to use the Hopcroft minimisation algorithm from  now
       on, and

       #use default

       switsches back to the default minimisation algorithm (Brzozowski).  The

       Here is an example of a simple transducer program.  Assuming  that  the
       file "adj-stems"	contains the two lines


       this  transducer	 will correctly	analyze	the adjective forms easy, eas-
       ier, easiest and	late, later, and latest.

       ALPHABET	= [a-zA-Z] y:i e:<> <ADJ>:<>

       $R$ = y<=>i (<ADJ>:<> e)

       $R2$ = e<=><> (<ADJ>:<> e)

       $R$ = $R$ & $R2$

       $Stems$ = "adj-stems"

       $S$ = $Stems$ <ADJ> (<pos>:<>|<cmp>:{er}|<sup>:{est})

       $S$ || $R$

       fst-compiler returns 0 unless some error	occurs.

       The compiler gets the operator precedence wrong in  case	 of  two-level
       rules and interprets the	expression "ab c<=>d ef" as "a(b c<=>d (ef))".
       Therefore, you should always surround the  left	context	 of  two-level
       rules with parenthesis: (ab) c<=>d (ef)

       fst-mor,	 fst-infl,  fst-infl2, fst-infl3, fst-print, fst-compact, fst-
       parse, fst-compare, fst-compact,	fst-lowmem, fst-lattice, fst-train

       Helmut Schmid, Institute	for Computational Linguistics,	University  of
       Stuttgart,  Email:, This software is	avail-
       able under the GNU Public License.

				 December 2004		       fst-compiler(1)


Want to link to this manual page? Use this URL:

home | help