dtsrindex(user cmd)					   dtsrindex(user cmd)

       dtsrindex -- Load inverted index for document objects

       dtsrindex -ddbname [-tetxstr] [-h0 | -hhashsz] [-rrecdots] [-bbatchsz]
       [-ccachesz] [-iinbufsz] file

       dtsrindex is the second of a pair of programs that load a database with
       document data from an input fzk file. dtsrload loads document header
       information and optionally the documents	themselves.  dtsrindex	parses
       words  from document text and loads them	into the inverted index	files.
       Word parsing is performed in  the  specified  language  and  linguistic
       codeset	of  the	database. The inverted index contains the search terms
       used for	subsequent online queries.

       An fzk file can be generated by dtsrhan, manually with a text editor, or
       by a special application	program	created	for the	purpose. Typically the
       same fzk	file is	used for dtsrload and dtsrindex. However,  it  is  not
       required	and there are situations where it may not be desirable.	If the
       same fzk	file is	not used by both programs, the one used	for  dtsrindex
       must  represent the same	objects	in the same order. Only	the unique key
       line and	the text portions of the file are used by this	program.  (See
       dtsrfzkfiles(4) for information about DtSearch fzk files).

       A document's unique key in the fzk file must already exist in the
       database (that is, dtsrload must be executed before dtsrindex). If any
       words are already indexed for the unique document key, indicating that
       dtsrload "updated" the document, then the newly parsed words from the
       current fzk file will totally replace the previously indexed words.

       When  duplicate	record	ids are	encountered in a single	fzk file, only
       the first occurrence of the document is indexed into the	database;  the
       second one is discarded. Since this is exactly the same discard order
       as dtsrload, the	same fzk file can be used for both programs. Duplicate
       record ids are maintained during	execution with a hash table.
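
       The first-occurrence rule can be illustrated with a short shell sketch.
       It is not how dtsrindex is implemented. The function name
       first_occurrences is hypothetical, and the input is assumed to be one
       record id per line rather than a real fzk file.

```shell
# Print each record id the first time it appears and drop later repeats.
# awk's associative array "seen" plays the role of the duplicate-id hash table.
first_occurrences() {
    awk '!seen[$0]++'
}
```

       Given the ids doc1, doc2, doc1, doc3 on standard input, the function
       prints doc1, doc2, doc3: the repeated doc1 is dropped, matching the
       discard order described above.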

       dtsrindex  performs two passes. In the first pass, dtsrindex constructs
       an inverted index in memory of all the words it	parses	from  the  fzk
       file.  Since the	index is built in memory, it is	possible to run	out of
       memory for very large fzk files.	 For this reason very large fzk	 files
       are  processed  in batches. Execution time in the first pass depends on
       the size	of the fzk file.

       In the second pass, dtsrindex merges the	information in the memory  in-
       dex into	the database's disk inverted index. Execution time in the sec-
       ond pass	depends	on both	the size of the	 incoming  fzk	file  and  the
       overall size of the database.

       If  dtsrindex  is  interrupted  in the first pass, it can be reexecuted
       without database	damage.	However	if it is  interrupted  in  the	second
       pass, the database will be corrupted. Database backups are always
       recommended.

	      To prevent database corruption, execute dtsrindex	only after all
	      users  of	 a  preexisting	database have exited their search pro-
	      grams. For a single fzk file, dtsrload must be executed  immedi-
	      ately  before  dtsrindex	so that	dtsrindex can map the words it
	      indexes to the correct internal database addresses.  Only	 after
	      both programs successfully complete execution may	users again be
	      allowed to perform online	searches of the	database.
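
       The ordering requirement can be captured in a small wrapper. This is a
       sketch only: the function name load_then_index is hypothetical, and it
       assumes dtsrload accepts the same -ddbname option and file operand as
       dtsrindex (see dtsrload(1)). It does not check whether users are still
       searching the database.

```shell
# Run dtsrload and then dtsrindex on the same fzk file for one database.
# Indexing is skipped if dtsrload exits with any nonzero status. That is a
# conservative assumption; see dtsrload(1) for the meaning of its statuses.
load_then_index() {
    db=$1
    fzk=$2
    dtsrload "-d$db" "$fzk" || return $?
    dtsrindex "-d$db" "$fzk"
}
```

       A call such as "load_then_index mydb batch1" returns the exit status of
       dtsrindex, or the status of dtsrload if the load step failed.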

       The following options are available:


	      If an option takes a value, the value must be directly  appended
	      to the option name without white space.

       -ddbname	 Specifies  the	1 to 8 ASCII character name of the database to
		 be updated.  If an optional directory path is	not  prepended
		 to  the  database  name,  dtsrindex  will attempt to open the
		 database from the current working directory. File name	exten-
		 sions for database files are automatically appended.

       -tetxstr  Specifies the end of document text delimiter string. The
                 default document separator in an fzk file is an ASCII form
                 feed character followed by an ASCII line feed ("\f\n"). For
                 certain multibyte languages it may be more convenient to
                 specify a non-ASCII string as the document delimiter.

       -h0	 Instructs  dtsrindex  to  not check for duplicate record ids.
		 This option should not	be specified unless it is certain that
		 there are no duplicate	ids in the fzk file.

       -hhashsz	 Sets  the  duplicate record id	hash table size	to hashsz. The
		 default is 3000.  dtsrindex will execute more efficiently  if
		 the  specified	 table size is larger than the number of docu-
		 ments in the fzk file.

       -rrecdots Instructs dtsrindex to	print a	progress character  to	stdout
		 for  every recdots documents processed	during the first pass.
		 The default is	20.

       -bbatchsz Sets the batch	size to	batchsz. The  default  is  10000.  The
		 batch size is the maximum number of records processed in Pass
		 1 before copying the in memory	 index	to  disk  in  Pass  2.
		 Larger	 batch	sizes  significantly improve execution time in
		 Pass 2, but require exponentially larger amounts  of  memory.
		 The default batch size	has been optimized for moderately fast
		 machines with large amounts of	memory.

       -ccachesz Sets the number of 1024 byte cache pages used by the DtSearch
                 Database Management System to cachesz. The default is 64. The
                 cache size affects memory paging performance for word
                 b-trees. cachesz should be greater than or equal to 16, in
                 even powers of 2. The default is usually sufficient.

       -iinbufsz Sets the size of the input line buffer to inbufsz. The
                 default is 1024 bytes. This buffer is used only for reading
                 the four ASCII header lines for each document in an fzk file.
                 (The text portion of each document is parsed on the fly a
                 word at a time.) Increasing inbufsz may be appropriate for
                 very large abstracts, but the default is sufficient in most
                 cases.
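
       Because -hhashsz works best when the table is larger than the number of
       documents, it can help to estimate that number first. The sketch below
       is one way to do so. The function name count_fzk_docs is hypothetical.
       It assumes the default delimiter, that form feeds appear nowhere else
       in the file, and that every document, including the last, ends with
       the delimiter.

```shell
# Count documents in an fzk file by counting lines that contain an ASCII
# form feed, the first character of the default "\f\n" delimiter.
count_fzk_docs() {
    ff=$(printf '\f')
    grep -c "$ff" "$1"
}
```

       The result can then be used to pick a value for -h, for example
       dtsrindex -dmydb -h$(( $(count_fzk_docs batch1.fzk) * 2 )) batch1. The
       factor of two is an arbitrary illustration, not a documented
       recommendation.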

       The required input file name (file) identifies the file to be processed
       by dtsrindex. It	can optionally include a path prefix, either from root
       or relative to the current working directory. If	a file name  extension
       is not specified, dtsrindex assumes a default extension of .fzk.




       The return values are as	follows:

       0	 dtsrindex completed successfully.

       1	 dtsrindex  successfully  recovered from an error. This	occurs
		 when one or more documents were discarded because of  a  par-
		 tially	 invalid  fzk  file  format,  duplicate	record ids, or
		 empty record text.

       >1	 dtsrindex encountered a fatal error.
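
       A calling script can branch on these return values. The following is a
       minimal sketch; the function name dtsrindex_status_message and the
       message wording are hypothetical.

```shell
# Translate a dtsrindex exit status into a short description.
# 0 is success, 1 means documents were discarded, anything else is fatal.
dtsrindex_status_message() {
    case "$1" in
        0) echo "completed successfully" ;;
        1) echo "completed, but one or more documents were discarded" ;;
        *) echo "fatal error" ;;
    esac
}
```

       Typical use is to run dtsrindex and pass $? to the function
       immediately afterwards, before any other command overwrites it.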

       dtsrindex reads the specified fzk file and opens	all the	 database  and
       related language	files for the specified	database name.

       dtsrindex updates the following database	files:








       Index all words in the fzk file named batch1.fzk	in the current working
       directory into database mydb.

       dtsrindex -dmydb	batch1

       Load database mydb with the documents specified in the fzk file
       /u/dtsearch/jpndocs.1. Three ASCII plus signs at the bottom of each
       document signal the end of document text and the beginning of the next
       fzk file

       dtsrindex -dmydb	-t+++ /u/dtsearch/jpndocs.1
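
       Several options can be combined, each with its value appended directly
       to the option letter. The sketch below only assembles and prints such
       a command line; it does not run dtsrindex. The function name
       dtsrindex_cmdline and any sample values are hypothetical.

```shell
# Print a dtsrindex command line with hash size, batch size, and progress
# interval set. Each value is appended directly to its option letter.
dtsrindex_cmdline() {
    # $1 database name, $2 hashsz, $3 batchsz, $4 recdots, $5 fzk file
    printf 'dtsrindex -d%s -h%s -b%s -r%s %s\n' "$1" "$2" "$3" "$4" "$5"
}
```

       For instance, "dtsrindex_cmdline mydb 50000 5000 100 batch1" prints
       dtsrindex -dmydb -h50000 -b5000 -r100 batch1.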

       dtsrload(1), dtsrhan(1),	dtsrfzkfiles(4), DtSearch(5)

							   dtsrindex(user cmd)

