DtSearch(special file)					DtSearch(special file)

NAME
       DtSearch	-- Introduces the DtSearch text	search and retrieval system

DESCRIPTION
       DtSearch	 is  a	general	 purpose text search and retrieval system that
       serves as the text search engine	for the	DtInfo browser in  the	Common
       Desktop Environment (CDE). DtSearch utilizes a full text	inverted index
       of natural language words and stems. Both queries  and  documents  have
       been  internationalized	for CDE	single-	and multi-byte languages, with
       provision for the definition of custom languages.  Queries  are	simple
       text  strings  that  can	optionally include full	boolean	specifications
       with a simple intuitive syntax. Results of searches can be ranked  sta-
       tistically.  Document retrievals	can include information	for highlight-
       ing query words in retrieved documents.

       DtSearch	consists of two	major functional areas.	 The first is a	set of
       offline build tools that:

	  o  Create searchable databases.

	  o  Index  the	user's text files and load the resultant search	infor-
	     mation into the databases.

	  o  Maintain the databases.

       The second functional area is an	online search API. It provides a  sim-
       ple  interface  to  the search engine to	facilitate user-written	search
       and retrieval programs. The API consists	of a set of functions compiled
       into  the library libDtSearch, with function prototypes,	constant defi-
       nitions,	and data structures defined in Search.h. DtSearch  includes  a
       sample browser source program, dtsrtest.c, to demonstrate API usage.

       Information  and	 error	messages  in  both functional areas, including
       those appended to the online API	MessageList, are generated from	a sin-
       gle DtSearch Message catalog, dtsearch.cat. The source for this catalog
       is dtsearch.msg.

       Each DtSearch database is associated with a single full	text  inverted
       index.  In addition, each database can be partitioned into logical sub-
       sets of documents called	"keytypes" by a	naming convention of the data-
       base  keys. The search engine can open multiple databases and users can
       specify any combination of databases and	keytypes for each query,  thus
       providing  a  two  tier	search	capability.  Users can further qualify
       searches	by restricting the search return list by date ranges and maxi-
       mum number of documents returned.
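
       The two tier selection described above can be sketched as a filter
       over a list of hits. This is only an illustration, not the DtSearch
       API: the hit dictionaries, the select_hits name, and the rule that a
       keytype is the first character of the database key are all assumptions
       made for the example (this page says only that keytypes follow a
       naming convention of the keys).

```python
from datetime import date


def select_hits(hits, selection, earliest=None, latest=None, max_hits=None):
    """Filter hits by database, keytype, date range, and a maximum count.

    hits      -- list of dicts with "db", "key", and "date" entries
                 (a hypothetical stand-in for a results list)
    selection -- dict mapping a database name to the set of keytypes wanted
    """
    chosen = []
    for hit in hits:
        keytypes = selection.get(hit["db"])
        # Assumption: the keytype is the first character of the key.
        if keytypes is None or hit["key"][:1] not in keytypes:
            continue
        if earliest is not None and hit["date"] < earliest:
            continue
        if latest is not None and hit["date"] > latest:
            continue
        chosen.append(hit)
    # Truncate last, so the limit applies to the qualified list.
    return chosen if max_hits is None else chosen[:max_hits]
```

       For example, a selection of {"manuals": {"M", "N"}} keeps only hits
       from a database named manuals whose keys begin with M or N.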

       DtSearch is written in ANSI Standard and POSIX compliant C. The
       DtSearch online search API is not reentrant (not "thread-safe") and
       must therefore be directly linked into the user-written search
       program. Linking the DtSearch API increases the size of a browser
       search program by 100K to 200K bytes, depending on which functions
       are called.

GENERAL	SPECIFICATIONS AND CONVENTIONS
   Database Names
       Databases  consist  of  a set of	binary and ASCII files whose names are
       the 1- to 8-character ASCII database name specified to  the  dtsrcreate
       command,	 a period (.), and a 1-	to 3-character ASCII file name suffix.
       Executing dtsrcreate will create	and  initialize	 these	files.	 After
       creation, databases are always identified by the	1- to 8-character name
       string used in dtsrcreate. The database names dtsearch and austext  are
       reserved	and may	not be specified.
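
       The naming rules above can be checked before dtsrcreate is run. The
       following sketch is not part of DtSearch. It assumes, more strictly
       than this page requires, that names contain only ASCII letters and
       digits and that the reserved names are compared without regard to
       case. The function names and the "abc" suffix in the usage note are
       hypothetical.

```python
RESERVED_NAMES = {"dtsearch", "austext"}


def is_valid_database_name(name):
    """Return True if name looks like a usable 1- to 8-character name."""
    if not 1 <= len(name) <= 8:
        return False
    # Assumption: restrict names to ASCII letters and digits.
    if not (name.isascii() and name.isalnum()):
        return False
    # Assumption: reserved names are matched case-insensitively.
    return name.lower() not in RESERVED_NAMES


def database_file_name(name, suffix):
    """Build "name.suffix" from a database name and a 1- to 3-character suffix."""
    if not is_valid_database_name(name):
        raise ValueError("invalid database name: %r" % name)
    if not (1 <= len(suffix) <= 3 and suffix.isascii()):
        raise ValueError("invalid file name suffix: %r" % suffix)
    return name + "." + suffix
```

       With these definitions, database_file_name("mydocs", "abc") yields
       "mydocs.abc", while the name dtsearch is rejected as reserved.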

   DtSearch Languages
       Each database is	associated with	a single natural language. Unlike con-
       ventional locales, a DtSearch language includes code  set  presumptions
       and,  most  importantly,	linguistic parsing and stemming	rules to iden-
       tify indexable terms in a text stream. A	DtSearch language is specified
       when a database is created. Developers can also define custom languages
       with special code sets and linguistic rules. See LANGUAGE PARSING AND
       STEMMING later in this man page for details.

   Database Types
       The  API	 can be	used simply as a search	engine,	referring to documents
       only through the	inverted indexes. Alternatively,  a  database  can  be
       configured  to  store  actual  document	text in	compressed format in a
       repository efficiently accessible to the	engine.	The configuration  op-
       tions  that  indicate  these  alternatives  are referred	to as database
       types and are specified to dtsrcreate at	database creation time.

   Abstracts
       A field called the "abstract" is	included in the	fzk file for each doc-
       ument  loaded  into a database, and is included on the Results list for
       each document in	a successful search. When documents are	not stored  in
       a  repository,  the  abstract  typically	specifies a file name, URL, or
       other reference useful to the browser. It can also include summary  in-
       formation viewable by users to help them	select documents for retrieval
       and display.

   Offline Build Tools
       dtsrcreate creates and initializes new databases	or reinitializes  pre-
       existing	databases. Textual data	is loaded into databases by the	execu-
       tion of two programs. dtsrload creates a	 database  object  record  for
       each  text document, and	dtsrindex creates the full text	inverted index
       of words	and stems for each object record.  Based  on  unique  database
       keys  for  each object, dtsrload	and dtsrindex can also serve as	update
       programs	for preexisting	databases.

       The input to the	load and index programs	is a canonical text file  with
       a .fzk file name	suffix.	The format of fzk files	is sufficiently	simple
       that they can be	generated manually. In addition, DtSearch  includes  a
       utility	program, dtsrhan, which	can generate a correctly formatted fzk
       file for	some kinds of text documents.

       Several other utilities provided in the distribution package are
       suitable for extracting summary database information, including
       dtsrdbrec and dtsrkdump.

   Argument Conventions
       Optional	command	line arguments are specified with a dash (-) and typi-
       cally  a	 single	character argument identifier. Some required arguments
       also use	the dash convention. Unless specifically indicated  otherwise,
       dash  arguments	may  be	 specified in any order. Where values are used
       with dash arguments, they must be directly  appended  to	 the  argument
       without white space.

       Optional	 arguments precede required arguments. Non-dash	required argu-
       ments must usually be specified in the order  indicated	by  the	 usage
       statement.
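
       As an illustration of these conventions, the sketch below splits an
       argument vector into dash arguments, whose values are directly
       appended, and the non-dash arguments that follow them in order. It is
       not taken from any DtSearch program. The split_arguments name and the
       -x and -q flags in the usage note are hypothetical.

```python
def split_arguments(argv):
    """Split argv into (options, positionals).

    Leading dash arguments are collected in a dict keyed by their
    single-character identifier; any value must be attached directly
    ("-x5", not "-x 5"). Everything after the last leading dash
    argument is returned unchanged and in order.
    """
    options = {}
    index = 0
    while index < len(argv):
        arg = argv[index]
        if not (arg.startswith("-") and len(arg) > 1):
            break
        options[arg[1]] = arg[2:]  # empty string when no value is attached
        index += 1
    return options, argv[index:]
```

       For example, split_arguments(["-x5", "-q", "mydb", "input.fzk"])
       returns ({"x": "5", "q": ""}, ["mydb", "input.fzk"]).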

LANGUAGE PARSING AND STEMMING
   Parsing and Stemming
       Word  parsing  is fundamental to	DtSearch operations at both index time
       and query time. Linguistic  parsing  algorithms	filter	incoming  text
       strings	into  sequences	of word	tokens for each	natural	language.  De-
       pending on the language,	word tokens may	also be	 processed  into  stem
       tokens.	At index time each linguistic token, or	term, in a document is
       stored in the inverted index. At	search time  queries  are  parsed  for
       linguistic terms	and used to access the documents that contain them.
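
       The relationship between index time and search time can be sketched
       with a toy inverted index. The code is a simplification for
       illustration only: it splits on whitespace instead of applying a
       language parser, it stores no stems, and the function names are
       hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Index time: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index, query):
    """Search time: parse the query the same way and intersect the postings."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

       The important property is that queries and documents pass through the
       same parsing step, so that a query term can match an indexed term.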

       Each  database  is  assigned  its own DtSearch language identified by a
       language	number at database creation time. A language number determines
       the  parsing  and  stemming  algorithms to be applied to	the database's
       text and	queries. Internal DtSearch algorithms are  supplied  for  sup-
       ported  languages including several European languages and Japanese. In
       addition	a user exit mechanism permits developers to provide their  own
       custom language algorithms for a	database.

   Language Files
       Language	 algorithms  often  use	 various  word lists. Typically, these
       lists are stored	in language files for easy maintenance,	with the  type
       of  list	 identified  by	 the  file  name extension. Language files are
       opened and read into internal tables when the offline programs initial-
       ize  or	when the DtSearchInit online API function is called. Some lan-
       guage files are required	and initialization will	return fatal errors if
       they are missing. Some language files are optional and associated
       algorithms will be silently bypassed if they are missing. Files for
       supported languages may be edited to provide database-specific
       enhancements. At open time, database-specific files supersede generic
       language files.

   General European Parsing Rules
       The currently supported European	languages are

       0       English,	ASCII character	set
       1       English,	ISO Latin-1 character set
       2       Spanish,	ISO Latin-1 character set
       3       French, ISO Latin-1 character set
       4       Italian,	ISO Latin-1 character set
       5       German, ISO Latin-1 character set

       If not otherwise specified, dtsrcreate will initialize databases with
       language	number 0. Note that all	supported  European  languages	use  a
       single-byte  encoding  method, with the ASCII code set as a proper sub-
       set.

       Parsed text, including both queries and indexed text in	documents,  is
       case insensitive	in supported European languages.

       In supported European languages parsing is accomplished with the	Teskey
       algorithm, which	partitions a character set into	 characters  that  are
       always parts of words (concordable), characters that are	never parts of
       words (nonconcordable), and characters that may be parts	of  words  de-
       pending	on  context  (optionally concordable). Typically, alphanumeric
       characters are concordable. Whitespace and most punctuation are
       nonconcordable. Slashes are examples of characters that may or may not
       separate words depending on context. The essence of the parsing algorithm
       is  "optionally concordable characters preceding	concordable characters
       are concordable;	otherwise, they	are nonconcordable". For example, UNIX
       directory names of the form /usr/local/bin would	be considered just one
       word, but slashes in isolation would be discarded as nonconcordable.

       The parsing algorithm does a table lookup to determine the concordabil-
       ity  of	characters.  The  tables are arrays of the characters for each
       code page supported by the algorithm. Currently	7-bit  ASCII  and  ISO
       Latin-1 are supported.
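
       A minimal sketch of this rule, using a table lookup for 7-bit ASCII,
       follows. It is not the DtSearch implementation. It assumes that the
       slash is the only optionally concordable character, that such a
       character counts as concordable only when the very next character is
       itself concordable, and that characters outside the table are
       nonconcordable. All names are hypothetical.

```python
CONCORDABLE, OPTIONAL, NONCONCORDABLE = "C", "O", "N"

# Assumption: only the slash is optionally concordable in this sketch.
OPTIONAL_CHARS = "/"


def build_ascii_table():
    """Return a 128-entry table giving the class of each 7-bit ASCII code."""
    table = []
    for code in range(128):
        ch = chr(code)
        if ch.isalnum():
            table.append(CONCORDABLE)
        elif ch in OPTIONAL_CHARS:
            table.append(OPTIONAL)
        else:
            table.append(NONCONCORDABLE)
    return table


ASCII_TABLE = build_ascii_table()


def char_class(ch):
    """Look up the class of one character; non-ASCII is nonconcordable here."""
    code = ord(ch)
    return ASCII_TABLE[code] if code < 128 else NONCONCORDABLE


def parse_words(text):
    """Split text into lower-cased word tokens using the three classes."""
    words, current = [], []
    for pos, ch in enumerate(text):
        kind = char_class(ch)
        if kind == OPTIONAL:
            # Resolve by context: concordable only before a concordable char.
            follower = text[pos + 1:pos + 2]
            if follower and char_class(follower) == CONCORDABLE:
                kind = CONCORDABLE
            else:
                kind = NONCONCORDABLE
        if kind == CONCORDABLE:
            current.append(ch.lower())
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

       Under these assumptions, parse_words("cd /usr/local/bin / Read/Write")
       returns ["cd", "/usr/local/bin", "read/write"]: the directory name
       survives as one word and the isolated slash is discarded.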

   Words Not Indexed
       Several	additional parsing rules are applied to	prevent	indexing mean-
       ingless terms. These terms include common prepositions, indefinite  ar-
       ticles,	and  nonlinguistic  text  strings such as formatting tags, se-
       quences of hexadecimal dump characters, list identifiers, etc.

       Tokens whose lengths are	less than a minimum word size or greater  than
       a maximum word size are discarded. The default minimum and maximum word
       sizes can be overridden in dtsrcreate.

       Similarly, words found in the "stop list" file for the database are
       discarded. Stop lists are external, editable language files. Each
       supported European language is provided with a default stop list.

       Words found in an "include list"	file are forcibly indexed even if they
       would otherwise be discarded. Include list database files are optional;
       no defaults are provided.
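
       The interaction of word sizes, stop lists, and include lists can be
       sketched as a single predicate. This is an illustration rather than
       the DtSearch code: the should_index name is hypothetical, the size
       limits are passed in because the defaults are not stated here, and
       the sketch assumes that an include list entry overrides both the size
       limits and the stop list.

```python
def should_index(token, min_size, max_size,
                 stop_list=frozenset(), include_list=frozenset()):
    """Return True if a parsed token would be kept for indexing."""
    word = token.lower()
    # Assumption: include list entries win over every discard rule.
    if word in include_list:
        return True
    if not min_size <= len(word) <= max_size:
        return False
    return word not in stop_list
```

       For example, with a minimum size of 2, the token "x" is discarded
       unless it appears in the include list.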

   Stemming
       When specified for a language, individual parsed	words  will  be	 "con-
       flated"	or  mapped  into their "stem" form, a new word that represents
       the etymological	root of	the original word. A default null stemming al-
       gorithm	is  used  for languages	that are not otherwise provided	with a
       supported stemmer. The null stemmer returns the original	 word  as  its
       own  stem.  Both	 words and stems are stored in the inverted index. API
       searches	can be specified for either words or stems, but	the two	search
       methods	are distinguished only when real stems have been stored	in the
       inverted	index.

       In the  supported  European  languages  stemming	 can  be  accomplished
       heuristically  or  by dictionary	lookup.	The heuristic algorithms typi-
       cally remove affixes in a language-dependent way. Affix lists are  usu-
       ally stored in language files. Currently	stemming is supported for Eng-
       lish languages 0	and 1, Spanish language	2, French language 3,  Italian
       language	4, and German language 5.
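
       The difference between the null stemmer and a heuristic stemmer can
       be sketched as follows. The suffix list, the minimum stem length, and
       the function names are hypothetical. The supported stemmers apply
       language-dependent rules that this page does not spell out, so the
       heuristic below is only a stand-in.

```python
def null_stem(word):
    """Null stemmer: every word is its own stem."""
    return word


def strip_suffix_stem(word, suffixes, min_stem=3):
    """Remove the longest matching suffix, keeping at least min_stem characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if suffix and word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word
```

       With the illustrative suffix list ("ing", "ed", "es", "s"), the words
       "searching", "searched", and "searches" all conflate to "search",
       whereas null_stem leaves each of them unchanged.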

   Japanese
       Two  Japanese  DtSearch languages (numbers 6 and	7) are supported. Both
       use the same packed, Extended UNIX Code (EUC) character	set.  The  two
       languages  differ  only	in  the	technique used to parse	compound kanji
       words. All validly encoded text for supported Japanese languages	incor-
       porates	ASCII  encoding	as a proper, single-byte subset. The supported
       Japanese	languages use the null stemmer.

   Kanji Compounds
       Individual kanji	characters are parsed as single	 words.	 In  addition,
       for  language  number 6 all possible kanji substrings (pairs, triplets,
       etc.)  found in any contiguous string of	kanjis will be parsed as  com-
       pound kanji words, up to	a maximum word size of 6 kanji characters. For
       language	number 7, only kanji substrings	listed in the jpn.knj language
       file  may be treated as compound	kanji words. At	offline	index time all
       possible	individual kanjis and  kanji  compounds	 for  a	 language  are
       stored in the inverted index. At	online search time kanji substrings in
       the query are treated as	single query terms and are not compounded fur-
       ther.
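
       The two compounding techniques can be sketched for a single
       contiguous run of kanji. The kanji_terms name and its parameters are
       hypothetical. Passing no compound list models language number 6, and
       passing a set of permitted compounds models the role that the jpn.knj
       file plays for language number 7.

```python
def kanji_terms(run, max_size=6, compound_list=None):
    """Return the index terms produced by one contiguous run of kanji.

    Every individual kanji is a term. Substrings of two or more kanji,
    up to max_size characters, are added as compounds: all of them when
    compound_list is None, otherwise only those found in compound_list.
    """
    terms = list(run)
    for size in range(2, min(max_size, len(run)) + 1):
        for start in range(len(run) - size + 1):
            candidate = run[start:start + size]
            if compound_list is None or candidate in compound_list:
                terms.append(candidate)
    return terms
```

       A run of three kanji therefore yields six terms under the first
       technique (three singles, two pairs, one triplet), and three singles
       plus any listed compounds under the second.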

   Japanese Code Sets
       The supported packed EUC	character set consists of four separate	multi-
       byte Code Sets. Code Set	0 can be either	7-bit ASCII or	7-bit  JIS-Ro-
       man.  The first and only	byte of	a character in Code Set	0 is less than
       0x80. Substrings	of Code	Set 0 in supported Japanese  text  are	parsed
       into  individual	 words	with  the  European  language parser described
       above. Minimum and maximum word sizes, stop lists,  and	include	 lists
       will be used as in European languages if	provided with a	Japanese data-
       base.

       Code Set	1 is JIS X 0208-1990. The two-byte characters in  Code	Set  1
       always  begin with a byte greater than 0xA0 and less than 0xFF. Symbols
       and line drawing elements are not indexed. Hiragana strings are
       discarded as equivalent to stop list words. Contiguous substrings of
       katakana, Roman, Greek, or Cyrillic are parsed as single words.
       Individual kanji characters are treated as single words with additional
       kanji compounding depending on language	number,	 as  described	above.
       Characters  from	 unassigned  kuten  rows  are  treated as user-defined
       kanji.

       Code Set	2 is halfwidth katakana. The two-byte characters in Code Set 2
       always  begin  with the unique byte 0x8E. Contiguous strings are	parsed
       as single words.

       Code Set	3 is JIS X 0212-1990. The three-byte characters	in Code	Set  3
       always begin with the unique byte 0x8F. Parsing is similar to Code Set
       1: symbols and the like are discarded, contiguous strings of related
       foreign characters are parsed as single words, and individual kanji and
       unassigned characters are treated as single words, with additional kanji
       compounding depending on language.
       Kuten  row  5  is  treated  as  katakana; undefined rows	are treated as
       kanji.
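
       The lead byte rules above are enough to split a packed EUC byte
       string into characters tagged with their code set. The sketch below
       does only that. It does not classify characters by kuten row, and the
       split_packed_euc name is hypothetical.

```python
def split_packed_euc(data):
    """Split packed EUC bytes into (code_set, character_bytes) pairs."""
    chars = []
    pos = 0
    while pos < len(data):
        lead = data[pos]
        if lead < 0x80:
            code_set, width = 0, 1      # ASCII or JIS-Roman
        elif lead == 0x8E:
            code_set, width = 2, 2      # halfwidth katakana
        elif lead == 0x8F:
            code_set, width = 3, 3      # JIS X 0212-1990
        elif 0xA0 < lead < 0xFF:
            code_set, width = 1, 2      # JIS X 0208-1990
        else:
            raise ValueError("invalid lead byte 0x%02X at offset %d" % (lead, pos))
        char = data[pos:pos + width]
        if len(char) < width:
            raise ValueError("truncated character at offset %d" % pos)
        chars.append((code_set, char))
        pos += width
    return chars
```

       For example, the bytes 41 A4 A2 8E B1 8F B0 A1 split into one
       character from each of Code Sets 0, 1, 2, and 3, in that order.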

   Custom Languages
       All language dependent data structures and functions are	referenced  by
       fields  in  the	main internal DtSearch structure for databases (DBLK).
       The same	structure is used for offline build programs as	well as	online
       API  search  functions.	Language processing is initialized database by
       database	by an internal language	loader function	which stores values in
       DBLK  fields. A database	whose language number is not supported is pre-
       sumed to	be associated with a  custom  language.	 A  special  function,
       load_custom_language,  is called	to initialize language fields for cus-
       tom languages. The default load_custom_language merely returns an error
       code.   However,	 developers can	link in	their own load_custom_language
       function, which will be called to initialize the	DBLK fields needed  to
       parse  and  stem	 one or	more custom languages. Values required for the
       language	fields of a DBLK are specified in DtSrAPI(3).
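
       The user exit can be pictured as a replaceable hook. The sketch below
       is an analogy in Python, not the C interface: the real mechanism
       links a developer-supplied load_custom_language function into the
       program, and the real DBLK fields are those specified in DtSrAPI(3).
       The dictionary keys, the builtin_parser stand-in, and the True/False
       return convention are all hypothetical.

```python
# Language numbers 0 through 7 are the supported ones listed on this page.
SUPPORTED_LANGUAGES = range(0, 8)


def builtin_parser(text):
    """Stand-in for a built-in parser; real parsers are language specific."""
    return text.lower().split()


def load_custom_language(dblk):
    """Default hook: no custom language is available, so report failure."""
    return False


def load_language(dblk, custom_loader=load_custom_language):
    """Fill in the language fields of one database block.

    dblk is a plain dict standing in for a DBLK. Returns True on success.
    """
    if dblk["language"] in SUPPORTED_LANGUAGES:
        dblk["parser"] = builtin_parser
        return True
    # Unsupported numbers are presumed to be custom languages.
    return custom_loader(dblk)
```

       Passing a different custom_loader plays the part of linking in a
       replacement load_custom_language. With the default hook, a database
       with an unsupported language number fails to initialize.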

SEE ALSO
       dtsrcreate(1), dtsrdbrec(1), dtsrhan(1),	dtsrindex(1), dtsrload(1), dt-
       srkdump(1),  huffcode(1),  DtSrAPI(3), dtsrfzkfiles(4), dtsrocffile(4),
       dtsrhanfile(4), dtsrlangfiles(4), dtsrdbfiles(4)

							DtSearch(special file)
