
FreeBSD Manual Pages


dtsrlangfiles(special file)			   dtsrlangfiles(special file)

       dtsrlangfiles --	Describes the formats of DtSearch language files


       The parsing of text into words in a particular language often requires
       comparison with lists of specific words in that language.  These lists
       are maintained in external language files, which are used by both the
       offline database build programs and the online search API.  Language
       files that are mandatory for a particular database must be located in
       the same directory as the other database files.

       The base file names of language files identify the language or
       database to which they apply.  The initialization functions look first
       for database-specific language files, using the 1- to 8-character
       database name as the base file name.  If none is found, the functions
       look for generic files by language base name.  Required language files
       are provided for supported languages with generic base names.  A
       developer may edit the generic language file and rename it to apply to
       a particular database.
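
       The lookup order described above can be sketched as follows.  This is
       a minimal illustration, not the DtSearch implementation: the function
       name find_language_file and its parameters are hypothetical, and the
       generic base name passed in (for example "eng") is an assumption.

```python
from pathlib import Path
from typing import Optional


def find_language_file(db_dir: str, db_name: str,
                       lang_base: str, ext: str) -> Optional[Path]:
    """Return the language file to load, or None if neither file exists.

    Hypothetical helper: the database-specific file (named after the
    database) is tried first, then the generic file (named after the
    language base name). Both are expected in the database directory.
    """
    for base in (db_name, lang_base):
        candidate = Path(db_dir) / f"{base}{ext}"
        if candidate.is_file():
            return candidate
    return None
```

       A caller would treat a None result as an error for mandatory files
       and as "feature not used" for optional ones.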

       Different  types	 of language files are distinguished by	different file
       name extensions.

   Stop	Lists (.stp)
       The file	name extension <.stp> is used to identify  stop	 lists.	  Stop
       lists  are  used	 to prevent indexing frequently	occurring but semanti-
       cally unimportant words in a language.  Examples	include	common	prepo-
       sitions,	 indefinite  articles,	and  nonlinguistic  character strings.
       Stop lists are mandatory	for supported European languages.  If a	 data-
       base  specific  stop  list file is not found, the generic language file
       must be available in the	same directory as the other database files.

       Database	specific stop lists are	optional for Japanese.

   Include Lists (.inc)
       The file	name extension <.inc>  is  used	 to  identify  include	lists.
       Words  found  in	an include list	file are forcibly indexed even if they
       would otherwise be discarded. Include lists take	precedence  over  stop
       lists.  Include list files are always optional; no generic language de-
       faults are provided.
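
       The precedence of include lists over stop lists can be pictured with
       a small sketch.  The function name should_index is hypothetical, and
       the sketch models only the stop-list reason for discarding a word;
       any other reasons DtSearch may have for discarding words are not
       represented.

```python
def should_index(word, stop_words, include_words):
    """Decide whether a word is indexed (illustrative only).

    A word on the include list is always indexed, even if it also
    appears on the stop list. Otherwise it is indexed only if it is
    not a stop word.
    """
    if word in include_words:
        return True
    return word not in stop_words
```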

   Kanji Compounds List	(.knj)
       The file	name extension <.knj> is used to identify indexable  lists  of
       compound	 kanji words (that is, substrings of kanji characters that are
       indexed both as individual words	of one character, and  as  a  compound
       word). Currently	they apply only	to databases for the specific Japanese
       Language	DtSrLaJPN2.

       The kanji compounds file	is optional. If	no database specific knj  file
       is  found,  the	Japanese  language initialization function will	try to
       open the	generic	jpn.knj	file.  If the generic file is also not	found,
       kanji compounding will not be performed.
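
       As a rough illustration of what kanji compounding means for indexing,
       the sketch below emits every character of a kanji substring as a
       one-character word and, in addition, every listed compound that
       occurs in it.  The function name kanji_tokens is hypothetical, and
       the matching strategy is an assumption; the actual algorithm used for
       DtSrLaJPN2 is not described on this page.

```python
def kanji_tokens(run, compounds):
    """Return index tokens for one substring of kanji characters.

    Illustrative only. Each character is emitted as its own word.
    Each compound from the list that occurs in the substring is
    emitted once as an additional word. With an empty compound list
    (no .knj file found), only single characters are produced.
    """
    tokens = list(run)
    for compound in compounds:
        if len(compound) > 1 and compound in run:
            tokens.append(compound)
    return tokens
```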

   Language Files Format
       Each line of a language file represents one word.  The word must begin
       in column one and ends at the first ASCII whitespace character or the
       ASCII linefeed character (\n, 0x0A) that terminates the line.  Any
       other text on the line after the first word token is discarded as a
       comment.  Lines that begin with '#', '$', '*', or '!' in column one are
       discarded in their entirety as comments.  Blank lines (that is, those
       that contain only the terminating linefeed) are also discarded.
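
       The format rules above translate into a short parser.  This is a
       sketch under stated assumptions, not the loader used by DtSearch: the
       name parse_language_lines is hypothetical, and lines that start with
       whitespace (which the format does not define) are simply skipped.

```python
COMMENT_PREFIXES = ("#", "$", "*", "!")
ASCII_WHITESPACE = " \t\n\r\v\f"


def parse_language_lines(lines):
    """Extract one word per line from language file lines.

    Comment lines and blank lines are dropped. The word runs from
    column one up to the first ASCII whitespace character; anything
    after it on the line is treated as a comment and ignored.
    """
    words = []
    for line in lines:
        if line.startswith(COMMENT_PREFIXES):
            continue
        end = len(line)
        for i, ch in enumerate(line):
            if ch in ASCII_WHITESPACE:
                end = i
                break
        word = line[:end]
        if word:  # empty for blank lines and lines starting with whitespace
            words.append(word)
    return words
```

       Because the function accepts any iterable of lines, an open file
       object can be passed to it directly.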

       The  word lists in language files are loaded into memory	at initializa-
       tion and	thereafter referenced internally.  The most efficient process-
       ing  occurs  when the files are maintained in frequency order (that is,
       when the	most frequently	occurring words	in the language	are the	 first
       words  in  the file).  Alternatively, if	the frequency of occurrence of
       the words is not	known, it is recommended that the word	order  in  the
       file be randomized.
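
       A word list can be prepared in either recommended order with a helper
       like the one below.  The name order_words, the frequency mapping, and
       the seed parameter are all hypothetical; the sketch only arranges the
       words and makes no claim about how DtSearch searches the list
       internally.

```python
import random


def order_words(words, frequency=None, seed=None):
    """Return words in frequency order, or shuffled if no counts exist.

    frequency maps a word to its occurrence count (assumed input).
    Words missing from the mapping are treated as having count 0.
    """
    if frequency:
        # Most frequent words first.
        return sorted(words, key=lambda w: frequency.get(w, 0), reverse=True)
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```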

   English Suffixes File
       Stemming	of English words is accomplished with the Paice	stemming algo-
       rithm. This heuristic algorithm removes common suffixes in a  recurrent
       manner, and conflates words into	a representation of their etymological
       root. The suffixes are maintained in eng.sfx and	loaded into memory  at
       initialization.	The  suffixes  file  is	mandatory for English language
       databases and is	not editable; a	copy of	it must	be found in  the  same
       directory as every English language database.

       dtsrcreate(1), dtsrindex(1), DtSrAPI(3),	dtsrdbfiles(4),	DtSearch(5)

						   dtsrlangfiles(special file)

