txt(3)				 AFNIX Module				txt(3)

NAME
       txt - standard text processing module

STANDARD TEXT PROCESSING MODULE
       The Standard Text Processing module is an original implementation of an
       object collection dedicated to text processing. Although text scanning
       is the most common operation performed in the field of text processing,
       the module also provides specialized objects to store and index text
       data. Text sorting and transliteration are also part of this module.

       Scanning	concepts
       Text scanning is the ability to extract lexical elements, or lexemes,
       from a stream. A scanner, or lexical analyzer, is the principal object
       used to perform this task. A scanner is built by adding special objects
       that act as pattern matchers. When a pattern is matched, a special
       object called a lexeme is returned.

       Pattern object
       A Pattern object is a special object that acts as a model for the
       string to match. There are several ways to build a pattern. The
       simplest is to build it from a regular expression. Another type of
       pattern is the balanced pattern. In its first form, a pattern object is
       created with a regular expression object.

       # create	a pattern object
       const pat (afnix:txt:Pattern "$d+")

       In this example,	the pattern object is built to detect integer objects.

       pat:check "123" # true
       pat:match "123" # 123

       The check method returns true if the input string matches the pattern.
       The match method returns the string that matches the pattern. Since the
       pattern object can also operate with a stream object, the match method
       is the appropriate way to match a particular string. As usual, a
       predicate is available for the pattern object.

       afnix:txt:pattern-p pat # true

       Another form of pattern object is the balanced pattern. A balanced
       pattern is determined by a starting string and an ending string. There
       are two types of balanced patterns: the single balanced pattern and the
       recursive balanced pattern. The single balanced pattern is appropriate
       for lexical elements that are delimited by a character. For example,
       the classical C-string is a single balanced pattern with the double
       quote character.

       # create	a balanced pattern
       const pat (afnix:txt:Pattern "ELEMENT" "<" ">")
       pat:check "<xml>" # true
       pat:match "<xml>" # xml

       In the case of the C-string, the pattern might be more appropriately
       defined with an additional escape character. Such a character is used
       by the pattern matcher to grab characters that might otherwise be read
       as part of the pattern definition.

       # create	a balanced pattern
       const pat (afnix:txt:Pattern "STRING" "'" '\\')
       pat:check "'hello'" # true
       pat:match "'hello'" # "hello"

       In this form, a balanced pattern with an escape character is created.
       The same string is used as both the starting and the ending string.
       Another constructor that takes two strings can be used if the starting
       and ending strings are different. The last pattern form is the
       recursive balanced form. In this form, a starting and an ending string
       are used to delimit the pattern. However, in this mode, the starting
       and ending strings may be nested. In order to have an exact match, the
       number of starting strings must equal the number of ending strings. For
       example, the C-comment pattern can be viewed as a recursive balanced
       pattern.

       # create	a c-comment pattern
       const pat (afnix:txt:Pattern "STRING" "/*" "*/" )
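
       The escape and recursive modes described above can be modeled outside
       AFNIX. The following Python sketch is an illustration only, not the
       module's implementation: the match_balanced function, its parameters,
       and the choice to drop the escape character from the result are all
       assumptions made for the example.

```python
def match_balanced(text, start, end, escape=None, recursive=False):
    """Return the text between the delimiters, or None if there is no match.

    Hypothetical helper: it models the behaviour described in the manual,
    not the AFNIX implementation.
    """
    if not text.startswith(start):
        return None
    pos = len(start)
    depth = 1      # number of start strings still waiting for an end string
    body = []
    while pos < len(text):
        # An escaped character is copied verbatim and never read as a
        # delimiter (assumption: the escape character itself is dropped).
        if escape is not None and text[pos] == escape and pos + 1 < len(text):
            body.append(text[pos + 1])
            pos += 2
            continue
        # A nested start string only counts in recursive mode.
        if recursive and start != end and text.startswith(start, pos):
            depth += 1
            body.append(start)
            pos += len(start)
            continue
        if text.startswith(end, pos):
            depth -= 1
            pos += len(end)
            if depth == 0:
                # Exact match only: nothing may follow the last end string.
                return "".join(body) if pos == len(text) else None
            body.append(end)
            continue
        body.append(text[pos])
        pos += 1
    return None    # input ended before the delimiters balanced


single = match_balanced("<xml>", "<", ">")
escaped = match_balanced('"a\\"b"', '"', '"', escape="\\")
nested = match_balanced("(a(b)c)", "(", ")", recursive=True)
flat = match_balanced("(a(b)c)", "(", ")")
```

       Under these assumptions, the nested call returns "a(b)c", while the
       same input without the recursive flag yields no match because the
       first ")" closes the pattern before the end of the input.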

       Lexeme object
       The Lexeme object is the object built by a scanner that contains the
       matched string. A lexeme is therefore a tagged string. Additionally, a
       lexeme can carry extra information such as a source name and index.

       # create	an empty lexeme
       const lexm (afnix:txt:Lexeme)
       afnix:txt:lexeme-p lexm # true

       The default lexeme is created without any value. A value can be set
       with the set-value method and retrieved with the get-value method.

       lexm:set-value "hello"
       lexm:get-value #	hello

       The set-tag and get-tag methods are similar and operate with an
       integer. The source name and index are likewise set and retrieved with
       the corresponding methods.

       # check for the source
       lexm:set-source "world"
       lexm:get-source # world
       # check for the source index
       lexm:set-index 2000
       lexm:get-index #	2000

       Text scanning
       Text scanning is the ability to extract lexical elements, or lexemes,
       from an input stream. Generally, the lexemes are the results of a
       matching operation which is defined by a pattern object. As a result,
       a scanner is defined by the scanner object itself plus one or several
       pattern objects.

       Scanner construction
       By default, a scanner is created without pattern objects. The length
       method returns the number of pattern objects. As usual, a predicate is
       associated with the scanner object.

       # the default scanner
       const scanner (afnix:txt:Scanner)
       afnix:txt:scanner-p scanner # true
       # the length method
       scanner:length # 0

       The scanner construction proceeds by adding pattern objects. Each
       pattern can be created independently and later added to the scanner.
       For example, a scanner that reads real, integer, and string values can
       be defined as follows:

       # create the scanner patterns
       const REAL    (afnix:txt:Pattern "REAL"    [$d+.$d*])
       const STRING  (afnix:txt:Pattern "STRING"  "\"" '\\')
       const INTEGER (afnix:txt:Pattern "INTEGER" [$d+|"0x"$x+])
       # add the patterns to the scanner
       scanner:add INTEGER REAL STRING

       The order in which patterns are added defines the priority with which
       a token is recognized. Binding each pattern to a symbol is optional,
       since a pattern can also be created directly in the call that adds it.
       The named style used here makes the scanner definition easier to read.
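
       The effect of insertion order on priority can be sketched in Python.
       This is a model under stated assumptions, not the AFNIX scanner: the
       Scanner and Lexeme classes below are hypothetical, Python regular
       expressions stand in for AFNIX ones, and the tag is simply taken to be
       the position of the pattern in the scanner.

```python
import re
from typing import NamedTuple, Optional


class Lexeme(NamedTuple):
    tag: int     # index of the pattern that produced the match (assumption)
    value: str   # matched text


class Scanner:
    """Hypothetical scanner: patterns are tried in the order they were added."""

    def __init__(self):
        self._patterns = []

    def add(self, *regexes):
        for regex in regexes:
            self._patterns.append(re.compile(regex))

    def __len__(self):
        return len(self._patterns)

    def check(self, text) -> Optional[Lexeme]:
        # The first pattern that matches the whole string wins.
        for tag, pattern in enumerate(self._patterns):
            if pattern.fullmatch(text):
                return Lexeme(tag, text)
        return None


INTEGER = r"\d+|0x[0-9a-fA-F]+"
REAL = r"\d+\.\d*"
WORD = r"\S+"

specific_first = Scanner()
specific_first.add(INTEGER, REAL, WORD)

generic_first = Scanner()
generic_first.add(WORD, INTEGER, REAL)
```

       With these definitions, specific_first.check("42") is tagged as an
       INTEGER (tag 0), whereas generic_first.check("42") is claimed by WORD
       because that pattern was added first.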

       Using the scanner
       Once constructed, the scanner can be used as is. An input stream is
       generally the most convenient source to operate on. If the scanner
       reaches the end of the stream or cannot recognize a lexeme, the nil
       object is returned. With a loop, it is easy to get all lexemes.

       while (trans valid (is:valid-p))	{
	 # try to get the lexeme
	 trans lexm (scanner:scan is)
	 # check for nil lexeme	and print the value
	 if (not (nil-p	lexm)) (println	(lexm:get-value))
	 # update the valid flag
	 valid:= (and (is:valid-p) (not	(nil-p lexm)))
       }

       In this loop, it is first necessary to check for the end of the
       stream. This is done with the help of the special loop construct that
       initializes the valid symbol. As soon as the lexeme is built, it can
       be used. The lexeme holds the value as well as its tag.

       Text sorting
       Sorting is one of the primary functions implemented inside the text
       processing module. There are three sorting functions available in the
       module.

       Ascending and descending	order sorting
       The sort-ascent function operates with a vector object and sorts the
       elements in ascending order. Any kind of object can be sorted as long
       as it supports a comparison method. The elements are sorted in place
       by using a quick sort algorithm.

       # create	an unsorted vector
       const v-i (Vector 7 5 3 4 1 8 0 9 2 6)
       # sort the vector in place
       afnix:txt:sort-ascent v-i
       # print the vector
       for (e) (v-i) (println e)

       The sort-descent function is similar to the sort-ascent function,
       except that the objects are sorted in descending order.

       Lexical sorting
       The sort-lexical function operates with a vector object and sorts the
       elements in ascending order using a lexicographic ordering relation.
       Objects in the vector must be literal objects; otherwise an exception
       is raised.
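
       The difference between a value ordering and a lexicographic ordering
       can be illustrated with a short Python sketch. It is only a model: the
       sort_lexical function is hypothetical, it assumes that the
       lexicographic relation compares the text form of each element, and it
       uses a small set of Python types as a stand-in for literal objects.

```python
def sort_lexical(items):
    """Sort a list in place by the text form of its elements.

    Hypothetical model of a lexical sort; not the AFNIX implementation.
    """
    literal_types = (bool, int, float, str)  # assumed stand-in for "literal"
    for item in items:
        if not isinstance(item, literal_types):
            raise TypeError(f"not a literal: {item!r}")
    items.sort(key=str)


values = [10, 9, 2, 33, 1]
by_value = sorted(values)   # value order, for comparison
sort_lexical(values)        # text order: "1" < "10" < "2" < "33" < "9"
```

       Under this assumption the value ordering gives 1, 2, 9, 10, 33, while
       the lexical ordering gives 1, 10, 2, 33, 9.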

       Transliteration
       Transliteration is the process of changing characters by mapping each
       one to another. The transliteration process operates with a character
       source  and  produces a target character	with the help of a mapping ta-
       ble. The	transliteration	process	is not necessarily reversible as often
       indicated in the	literature.

       Literate	object
       The Literate object is a transliteration object that is bound by
       default to the identity mapping. As usual, a predicate is associated
       with the object.

       # create	a transliterate	object
       const tl	(afnix:txt:Literate)
       # check the object
       afnix:txt:literate-p tl # true

       The transliteration process can also operate with an escape character
       in order to map a two-character sequence into a single character, as
       is usually found in programming languages.

       # create	a transliterate	object by escape
       const tl (afnix:txt:Literate '\\')

       Transliteration configuration
       The set-map method configures the transliteration mapping table, while
       the set-escape-map method configures the escape mapping table. The
       mapping is done by setting the source character and the target
       character. For instance, to map the tabulation character to a white
       space, the mapping table is set as follows:

       tl:set-map '\t' ' '

       The escape mapping table operates the same way. It should be noted
       that the mapping algorithm first translates the input character,
       possibly yielding an escape character, and only then does the escape
       mapping take place. Note also that the set-escape method can be used
       to set the escape character.

       tl:set-map '\t' ' '

       Transliteration process
       The transliteration process is done either with a string or with an
       input stream. In the first case, the translate method operates with a
       string and returns a translated string. In the second case, the read
       method returns one translated character at a time from the stream.

       # set the mapping characters
       tl:set-map 'h' 'w'
       tl:set-map 'e' 'o'
       tl:set-map 'l' 'r'
       tl:set-map 'o' 'd'
       # translate a string
       tl:translate "helo" # word
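
       The two-stage mapping described above (regular table first, escape
       table second) can be modeled with the following Python sketch. It is
       not the AFNIX implementation: the Literate class here is a
       hypothetical stand-in, and it assumes that the character following an
       escape is looked up directly in the escape table and that a dangling
       escape at the end of the input is kept unchanged.

```python
class Literate:
    """Hypothetical model of a transliteration object."""

    def __init__(self, escape=None):
        self.escape = escape
        self.table = {}          # missing entries map to themselves
        self.escape_table = {}   # used for the character after an escape

    def set_map(self, source, target):
        self.table[source] = target

    def set_escape_map(self, source, target):
        self.escape_table[source] = target

    def translate(self, text):
        out = []
        chars = iter(text)
        for ch in chars:
            # Stage 1: regular mapping, identity by default.
            mapped = self.table.get(ch, ch)
            if self.escape is not None and mapped == self.escape:
                # Stage 2: consume one more character and map it through
                # the escape table.
                following = next(chars, None)
                if following is None:
                    out.append(mapped)  # dangling escape (assumption)
                else:
                    out.append(self.escape_table.get(following, following))
            else:
                out.append(mapped)
        return "".join(out)


tl = Literate()
for source, target in zip("helo", "word"):
    tl.set_map(source, target)
plain = tl.translate("helo")

esc = Literate("\\")
esc.set_map("\t", " ")
esc.set_escape_map("n", "\n")
escaped = esc.translate("a\tb\\nc")
```

       In this model, plain holds the string "word", and escaped holds "a b"
       followed by a newline and "c": the tabulation goes through the regular
       table, and the backslash-n pair goes through the escape table.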

STANDARD TEXT PROCESSING REFERENCE
       Pattern
       The Pattern class is a pattern matching class based on either a
       regular expression or a balanced string. In regex mode, the pattern is
       defined with a regex, and a match is said to occur when the regex
       matches. In balanced string mode, the pattern is defined with a start
       string and an end string. The balanced mode can be single or
       recursive. Additionally, an escape character can be associated with
       the class. A name and a tag are also bound to the pattern object as a
       means to ease its integration within a scanner.

       Predicate

	      pattern-p

       Inheritance

	      Object

       Constructors

	      Pattern (none)
	      The Pattern constructor creates an empty pattern.

	      Pattern (String|Regex)
	      The Pattern constructor creates a pattern object associated with
	      a regular expression. The argument can be either a string or a
	      regular expression object. If the argument is a string, it is
	      converted into a regular expression object.

	      Pattern (String String)
	      The Pattern constructor creates a balanced pattern. The first
	      argument is the start balanced string. The second argument is
	      the end balanced string.

	      Pattern (String String Character)
	      The Pattern constructor creates a balanced pattern with an
	      escape character. The first argument is the start balanced
	      string. The second argument is the end balanced string. The
	      third argument is the escape character.

	      Pattern (String String Boolean)
	      The Pattern constructor creates a recursive balanced pattern.
	      The first argument is the start balanced string. The second
	      argument is the end balanced string.

       Constants

	      REGEX
	      The REGEX constant indicates that the pattern is a regular
	      expression.

	      BALANCED
	      The BALANCED constant indicates that the pattern is a balanced
	      pattern.

	      RECURSIVE
	      The RECURSIVE constant indicates that the pattern is a recursive
	      balanced pattern.

       Methods

	      check -> Boolean (String)
	      The check method checks the pattern against the input string.
	      The method returns true if the verification is successful, and
	      false otherwise.

	      match -> String (String|InputStream)
	      The match method attempts to match an input string or an input
	      stream. If a match occurs, the matching string is returned. If
	      the input is a string, the end of the string is used as the end
	      condition. If the input is a stream, the end of the stream is
	      used as the end condition.

	      set-tag -> none (Integer)
	      The set-tag method sets the pattern tag. The tag can be further
	      used inside a scanner.

	      get-tag -> Integer (none)
	      The get-tag method returns the pattern tag.

	      set-name -> none (String)
	      The set-name method sets the pattern name. The name is the
	      symbol identifier for that pattern.

	      get-name -> String (none)
	      The get-name method returns the pattern name.

	      set-regex -> none (String|Regex)
	      The set-regex method sets the pattern regex either with a string
	      or with a regex object. If the method completes successfully,
	      the pattern type is switched to the REGEX type.

	      set-escape -> none (Character)
	      The set-escape method sets the pattern escape character. The
	      escape character is used only in balanced mode.

	      get-escape -> Character (none)
	      The get-escape method returns the escape character.

	      set-balanced -> none (String|String String)
	      The set-balanced method sets the pattern balanced string. With
	      one argument, the same balanced string is used for starting and
	      ending. With two arguments, the first argument is the starting
	      string and the second is the ending string.

       Lexeme
       The Lexeme class is a literal object that is designed to hold the
       result of a pattern match. A lexeme consists of a string (i.e. the
       lexeme value), a tag, and optionally a source name (i.e. file name)
       and a source index (line number).

       Predicate

	      lexeme-p

       Inheritance

	      Literal

       Constructors

	      Lexeme (none)
	      The Lexeme constructor creates an empty lexeme.

	      Lexeme (String)
	      The Lexeme constructor creates a lexeme by value. The string
	      argument is the lexeme value.

       Methods

	      set-tag -> none (Integer)
	      The set-tag method sets the lexeme tag. The tag can be further
	      used inside a scanner.

	      get-tag -> Integer (none)
	      The get-tag method returns the lexeme tag.

	      set-value -> none (String)
	      The set-value method sets the lexeme value. The lexeme value is
	      generally the result of a matching operation.

	      get-value -> String (none)
	      The get-value method returns the lexeme value.

	      set-index -> none (Integer)
	      The set-index method sets the lexeme source index. The lexeme
	      source index can be, for instance, the source line number.

	      get-index -> Integer (none)
	      The get-index method returns the lexeme source index.

	      set-source -> none (String)
	      The set-source method sets the lexeme source name. The lexeme
	      source name can be, for instance, the source file name.

	      get-source -> String (none)
	      The get-source method returns the lexeme source name.

       Scanner
       The Scanner class is a text scanner, or lexical analyzer, that
       operates on an input stream and can match one or several patterns. The
       scanner is built by adding patterns to the scanner object. With an
       input stream, the scanner object attempts to build a buffer that
       matches at least one pattern. When such a match occurs, a lexeme is
       built. When building a lexeme, the pattern tag is used to mark the
       lexeme.

       Predicate

	      scanner-p

       Inheritance

	      Object

       Constructors

	      Scanner (none)
	      The Scanner constructor creates an empty scanner.

       Methods

	      add -> none (Pattern*)
	      The add method adds zero or more pattern objects to the scanner.
	      The priority of each pattern is determined by the order in which
	      the patterns are added.

	      length -> Integer (none)
	      The length method returns the number of pattern objects in this
	      scanner.

	      get -> Pattern (Integer)
	      The get method returns a pattern object by index.

	      check -> Lexeme (String)
	      The check method checks that a string is matched by the scanner
	      and returns the associated lexeme.

	      scan -> Lexeme (InputStream)
	      The scan method scans an input stream until a pattern is
	      matched. When a match occurs, the associated lexeme is returned.

       Literate
       The Literate class is a transliteration mapping class. Transliteration
       is the process of changing characters by mapping each one to another.
       The transliteration process operates with a character source and
       produces a target character with the help of a mapping table. This
       transliteration object can also operate with an escape table. In the
       presence of an escape character, an escape mapping table is used
       instead of the regular one.

       Predicate

	      literate-p

       Inheritance

	      Object

       Constructors

	      Literate (none)
	      The Literate constructor creates a default transliteration
	      object.

	      Literate (Character)
	      The Literate constructor creates a default transliteration
	      object with an escape character. The argument is the escape
	      character.

       Methods

	      read -> Character	(InputStream)
	      The read method reads a character from the input stream and
	      translates it with the help of the mapping table. A second
	      character might be consumed from the stream if the first
	      character is an escape character.

	      getu -> Character (InputStream)
	      The getu method reads a Unicode character from the input stream
	      and translates it with the help of the mapping table. A second
	      character might be consumed from the stream if the first
	      character is an escape character.

	      reset -> none (none)
	      The reset method resets all the mapping tables and installs the
	      default identity mapping.

	      set-map -> none (Character Character)
	      The set-map method sets an entry of the mapping table by using a
	      source and a target character. The first argument is the source
	      character. The second argument is the target character.

	      get-map -> Character (Character)
	      The get-map method returns the character mapped to a source
	      character. The source character is the argument.

	      translate -> String (String)
	      The translate method translates a string by transliteration and
	      returns a new string.

	      set-escape -> none (Character)
	      The set-escape method sets the escape character.

	      get-escape -> Character (none)
	      The get-escape method returns the escape character.

	      set-escape-map -> none (Character Character)
	      The set-escape-map method sets an entry of the escape mapping
	      table by using a source and a target character. The first
	      argument is the source character. The second argument is the
	      target character.

	      get-escape-map -> Character (Character)
	      The get-escape-map method returns the character mapped to a
	      source character in the escape mapping table. The source
	      character is the argument.

       Functions

	      sort-ascent -> none (Vector)
	      The sort-ascent function sorts the vector argument in ascending
	      order. The vector is sorted in place.

	      sort-descent -> none (Vector)
	      The sort-descent function sorts the vector argument in
	      descending order. The vector is sorted in place.

	      sort-lexical -> none (Vector)
	      The sort-lexical function sorts the vector argument in
	      lexicographic order. The vector is sorted in place.

AFNIX				  2017-04-29				txt(3)
