Skip site navigation (1)Skip section navigation (2)

FreeBSD Manual Pages

  
 
  

home | help
fst-compiler(1)			 fst-compiler		       fst-compiler(1)

NAME
       fst-compiler fst-compiler-utf8 -	Two compilers for SFST programs

SYNOPSIS
       fst-compiler grammar-file [ output-file ]
       fst-compiler-utf8 grammar-file [	output-file ]

OPTIONS
       -c     Store  the  transducer  in  compact format which is used by fst-
	      infl2.

       -l     Store the	transducer in lowmem format.

       -s     Switch surface and analysis layer	of the transducer. You have to
	      use  this	switch in order	to use fst-infl	(fst-infl2, fst-infl3)
	      for generation rather than analysis.

DESCRIPTION
       fst-compiler is a compiler for  finite-state  transducer	 programs.  It
       generates  a  minimized	finite state transducer	which can be used with
       fst-mor,	fst-infl, fst-print, fst-compare, fst-parse, and  fst-lattice.
       The  compact  transducer	 representation	which is generated with	the -c
       flag, is	supported by fst-infl2,	fst-train, and fst-match.  The memory-
       efficient  transducer  representation  which  is	 generated with	the -l
       flag, is	only supported by fst-infl3.

       The first program argument is the name of a  file  which	 contains  the
       transducer  program.  The  programming language is described below. The
       second argument is the name of the file to which	the  resulting	trans-
       ducer  will be written in binary	form. If a second argument is missing,
       the output will be written to stdout.

       fst-compiler-utf8 differs from fst-compiler only	in the	character  en-
       coding.	 fst-compiler-utf8  supports UTF8 encoding of the source files
       whereas fst-compiler is to be  used  for	 8-Bit	character  codes  like
       latin1  which are an extension of the ASCII code. Information about the
       encoding	is stored in the transducer files and used by the  other  SFST
       programs.

FILE FORMATS
       A transducer program consists of	an (optional) sequence of alphabet and
       variable	definitions followed by	a single transducer  expression	 which
       defines the result transducer.

       Alphabet

       An  alphabet  definition	consists of the	keyword	ALPHABET followed by =
       and some	transducer expression e.g.

	   ALPHABET = [a-z]:[A-Z]

       This command redefines the alphabet as the set of symbol	 pairs	occur-
       ring on the transitions of the transducer. Occurrences of two-level op-
       erators,	negation operators and unquoted	periods	always have to be pre-
       ceded by	an alphabet definition.

       Variables

       There  are  two different types of variables.  Symbol set variables are
       enclosed	by hash	signs (#) and take symbol  sequences  (see  below)  as
       values:

	   #UC#	= A-Z
	   #LC#	= a-z

       Transducer  variables  are enclosed by dollar signs and take transducer
       expressions as values:

	   $MAP$ = [a-z]:[A-Z]+
	   $MAP$ = [#LC#]:[#UC#]+

       Variables whose name starts with	the symbol `=' are  special  agreement
       variables.  If  an agreement variable occurs more than once in a	trans-
       ducer expression, it will always	have the same value. Consider the fol-
       lowing transducer program:

	   $=1$	= [abc]
	   $=1$	X $=1$

       The  result  transducer	recognizes the strings aXa, bXb, and cXc. Only
       acyclic transducers (i.e. transducers with a finite set of string  map-
       pings) can be assigned to agreement variables.

       Symbols

       A symbol	is either

       - a single character like A s 5,

       - a quoted character like \* or \_,

       - a multi-character symbol like <X> or <ab.c5> (which is	always
	 enclosed in angle brackets) or

       - a backslash followed by a number which	is the numeric code of the
	 designated character

       - the null symbol <>.

       Symbol sequence

       A  symbol sequence is a sequence	of characters, multi-character symbols
       and character ranges, e.g. a-z \. <x>.

       symbol range

       A symbol	range is either

       - a single symbol

       - a symbol sequence enclosed in square brackets like [A-Za-z] or

       - a symbol sequence starting with ^ and	enclosed  in  square  brackets
       like [^A-Za-z] (designating the complement of [a-zA-Z]) or

       - the period (which represents any symbol from the alphabet)

       Transducer expressions

       A transducer expression (TE) is recursively defined as follows:

       - A pair	of two symbol ranges separated by a colon is a TE.

	 [a-z]:[a-Z]

       - A single symbol range like [a-z] is a TE.
	 It is a short form for	[a-z]:[a-z].

       - Two symbol sequences enclosed in braces and separated by a colon are
	a TE. {a[bc]}:{def} is equivalent to a:d b:e <>:f | a:d	c:e <>:f.

       - X Y is	a TE if	X and Y	are TEs.
	 (Blanks are ignored unless they are quoted.)

       - (X) is	a TE if	X is a TE.

       -  X  op	 is a TE is X is a TE and op is	either * (Kleene's star	opera-
       tor), +
	(Kleene's plus operator), or ? (optionality operator)

       - op X is a TE is X is a	TE and op is either ! (negation	operator), ^
	(target	language extraction operator), _ (source  language  extraction
	operator), or ^_ (source and target switch operator).

       - X op Y	is a TE	is X and Y are TEs and op is either & (conjunction
	operator),  |  (disjunction operator), || (composition operator), or -
	(subtraction operator)

       - L x op	y R is a TE if L and R are TEs,	x and y	are symbol ranges and
	op is either =>	(two-level restriction), <= (two-level	coercion),  or
	<=> (two-level restriction and coercion).

       - X op L__R is a	TE if X, L and R are TEs and op	is either ^-> (upward
	replacement),  _->  (downward replacement), /->	(leftward replacement)
	or \-> (rightward replacement).	Furthermore, L and R must  define  au-
	tomata (i.e. which map their strings onto themselves). These operators
	correspond to Karttunen's replace operators. If	the arrow is  followed
	by a question mark (?),	the replacement	becomes	optional.

       - X << l	is a TE	if X is	a TE, and l is either of the form
	a  or the form a:b where a and b are single characters or symbols. The
	result is a transducer where l was freely inserted into	X. The	trans-
	ducer ab << c for instance is equivalent to c*ac*bc*.

       - X op Y	L1__R2,	... , LN__RN is	a TE if	X,Y, L1	through	LN and R1
	through	 RN  are  TEs,	and  op	is either => (general restriction), <=
	(general coercion), ^=>	(general surface  restriction),	 ^<=  (general
	surface	 coercion),  ^<=>  (general surface restriction	and coercion),
	_=> (general deep restriction),	 _<=  (general	deep  coercion),  _<=>
	(general  deep restriction and coercion). (These operators were	imple-
	mented following a suggestion by Anssi Yli-Jyra.)

       - "fname" is a TE. The compiler reads the file named fname and turns
	it into	a transducer of	the form line1|line2|line3|... where linex  is
	the  x-th  line	of the file. All characters other than : and \ are in-
	terpreted literally (i.e. not as operators). This TE is	typically used
	e.g. to	read morpheme list from	a file.

       - "<fname>" is a	TE. The	compiler reads a pre-compiled transducer from
	the file named fname. This

       Further Features

       Comments	 start with the	symbol % and extend up to the end of the line.
       Blanks are ignored unless they are quoted. Expressions terminate	at the
       end  of	a line unless the end of line is preceded by a backslash.  The
       command

       #include	"fname"

       can be used to insert source code from a	file named fname.  The command

       RE >> "fname"

       stores the regular expression RE	in the file fname.  The	command

       #use hopcroft

       tells the compiler to use the Hopcroft minimisation algorithm from  now
       on, and

       #use default

       switsches back to the default minimisation algorithm (Brzozowski).  The
       command

EXAMPLE
       Here is an example of a simple transducer program.  Assuming  that  the
       file "adj-stems"	contains the two lines

	  easy
	  late
	  big

       this  transducer	 will correctly	analyze	the adjective forms easy, eas-
       ier, easiest and	late, later, and latest.

       ALPHABET	= [a-zA-Z] y:i e:<> <ADJ>:<>

       $R$ = y<=>i (<ADJ>:<> e)

       $R2$ = e<=><> (<ADJ>:<> e)

       $R$ = $R$ & $R2$

       $Stems$ = "adj-stems"

       $S$ = $Stems$ <ADJ> (<pos>:<>|<cmp>:{er}|<sup>:{est})

       $S$ || $R$

EXIT STATUS
       fst-compiler returns 0 unless some error	occurs.

BUGS
       The compiler gets the operator precedence wrong in  case	 of  two-level
       rules and interprets the	expression "ab c<=>d ef" as "a(b c<=>d (ef))".
       Therefore, you should always surround the  left	context	 of  two-level
       rules with parenthesis: (ab) c<=>d (ef)

SEE ALSO
       fst-mor,	 fst-infl,  fst-infl2, fst-infl3, fst-print, fst-compact, fst-
       parse, fst-compare, fst-compact,	fst-lowmem, fst-lattice, fst-train

AUTHOR
       Helmut Schmid, Institute	for Computational Linguistics,	University  of
       Stuttgart,  Email: schmid@ims.uni-stuttgart.de, This software is	avail-
       able under the GNU Public License.

				 December 2004		       fst-compiler(1)

NAME | SYNOPSIS | OPTIONS | DESCRIPTION | FILE FORMATS | EXAMPLE | EXIT STATUS | BUGS | SEE ALSO | AUTHOR

Want to link to this manual page? Use this URL:
<https://www.freebsd.org/cgi/man.cgi?query=fst-compiler&sektion=1&manpath=FreeBSD+12.1-RELEASE+and+Ports>

home | help