huffcode(user cmd)					    huffcode(user cmd)

       huffcode	-- Create optimized DtSearch compression/decompression tables

       huffcode	[-llit_thresh  | -l- ]	[-o] huffile  [textfile]

       huffcode	creates	optimized DtSearch compression/decompression tables.

       Documents  stored  in  a	DtSearch database text repository can be first
       compressed using	a Huffman text compression  algorithm.	The  algorithm
       provides	 optimal  compression only with	preanalysis of the statistical
       distribution of bytes in	the database corpus.  huffcode analyses	a text
       corpus  and generates DtSearch compression and decompression tables. It
       is provided as a	convenience utility for	database developers  who  want
       to optimize offline storage requirements. Compression is not used in
       databases created without the ability to store text in a DtSearch
       repository.

       huffcode	 reads a text file as input and	writes out ophuf.huf (compres-
       sion or "encode"	table) and ophuf.c (decompression or "decode"  table).
       ophuf.huf  is  an external ASCII file that also retains the statistical
       information on how it was generated.  huffcode can be executed  repeat-
       edly  against different text samples, continually accumulating results.
       In the case of a	small or static	text corpus, the entire	corpus can  be
       fed into huffcode for optimal Huffman compression. In large or dynamic
       databases the typical practice is to feed huffcode representative text
       samples.

       The  Huffman  code  tables  are created once for each API instance (not
       once per	database) before any documents are loaded. The only program to
       read  the  encode  table,  an external file, is dtsrload. The ophuf.huf
       file generated by huffcode should be used instead of the	 provided  de-
       fault  file  prior to the first run of dtsrload for any databases to be
       accessed	by a particular	API instance. The decode table,	 a  C  module,
       should  be  compiled  and linked	into the application code ahead	of the
       API library to override the default decode module in the	 library.  Huf
       files and decode	modules	are not	user editable.

       It  is  imperative  that	the encode and decode tables reflect identical
       byte statistics to prevent decode errors. The first line	 of  ophuf.huf
       includes	 a long	integer	value named HCTREE_ID. Each execution of huff-
       code generates a	new, unique hctree_id integer. dtsrload	loads this in-
       teger  into  the	database configuration and status record when it loads
       the first document into a new database. Thereafter, each	 execution  of
       dtsrload	for that database confirms that	the same hctree_id is used for
       each document compression. It will abort	 if  the  ophuf.huf  hctree_id
       does not	match the value	for a database from previous executions.

       hctree_id  is  also  stored as a	variable in the	decode module ophuf.c.
       DtSearchInit will not open any database listed in the  ocf  file	 whose
       hctree_id,  as  stored in its configuration and status record, does not
       match the value in the decode module. The dtsrdbrec utility will	 print
       the hctree_id value for any database.
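
       The hctree_id rule described above can be pictured with a short
       sketch. It is only an illustration of the comparison, not the code
       used by dtsrload or DtSearchInit; the function and exception names
       are hypothetical, and a stored id of None stands in for a new
       database that has no documents yet.

```python
from typing import Optional


class HctreeMismatch(Exception):
    """Hypothetical error: table id and database id disagree."""


def reconcile_hctree_id(table_id: int, stored_id: Optional[int]) -> int:
    """Return the id a database should hold after a load attempt.

    table_id  -- id carried by the encode table (or decode module)
    stored_id -- id already recorded for the database, or None if the
                 database is new and nothing has been loaded yet
    """
    if stored_id is None:
        # First document in a new database: record the table's id.
        return table_id
    if stored_id != table_id:
        # Tables built from different byte statistics: refuse to go on.
        raise HctreeMismatch(
            f"table hctree_id {table_id} != database hctree_id {stored_id}"
        )
    return stored_id


# A new database adopts the id; a later load with the same id is accepted.
db_id = reconcile_hctree_id(1234567, None)
db_id = reconcile_hctree_id(1234567, db_id)
```

       Regenerating the tables with huffcode produces a new id, so a load
       against an existing database would fail this check.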

       The following options are available:


	      If  an option takes a value, the value must be directly appended
	      to the option name without white space.

       -llit_thresh
		 Sets the literal character's minimum threshold to the integer
		 specified by lit_thresh.

		 This  Huffman	algorithm implements a pseudo-character	called
		 the literal character.	It  represents	all  characters	 whose
		 frequency is so low that no Huffman translation will be
		 attempted. This reduces the maximum length of the coded bit
		 string	when there are lots of zero- or	low-frequency bytes in
		 the text corpus. For example, pure ASCII text files only  oc-
		 casionally have byte values less than 32 (control characters)
		 and rarely greater than 127 (high order bit turned  on).  The
		 lit_thresh  value specifies the literal character's threshold
		 count.	After counting is completed, any character in the  en-
		 code  table  occurring	 with  frequency less than or equal to
		 lit_thresh will be coded with the literal character.

		 If this option	and the	-l- option are omitted,	the default is
		 -l0,  meaning	that literal coding is provided	only for bytes
		 that never occur (counts of zero).

       -l-	 Disables literal character encoding. Disabling literal
		 character encoding in corpora with unbalanced byte frequency
		 distributions will lead to extremely long bit string codes.
		 Most natural language text corpora are represented by highly
		 unbalanced frequency distributions, so this option is not
		 recommended for most DtSearch applications.

		 If  this  option and the -llit_thresh option are omitted, the
		 default is -l0, meaning that literal coding is	provided  only
		 for bytes that	never occur (counts of zero).

       -o	 Suppresses the	overwrite prompt. It preauthorizes erasure and
		 reinitialization of the decode	module.

       textfile	 Specifies an optional input file of text that is  representa-
		 tive  of  the	entire text corpus of the databases. It	should
		 contain bytes in the same relative  abundances	 as  occur  in
		 documents  in	the  entire corpus. Since huffcode can be exe-
		 cuted repeatedly with different  document  textfiles,	it  is
		 possible  to  analyze the entire actual corpus	if it is small
		 enough	or static.

		 If textfile is	not specified, the  byte  frequencies  in  the
		 currently loaded tables are not changed, and the Huffman
		 codes are recomputed with the existing	frequencies.  This  is
		 useful	 for  examining	the relative merits of using different
		 literal character thresholds.
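
       The effect of the literal character threshold on code length can be
       sketched as follows. This is a generic Huffman construction written
       for illustration, not huffcode's implementation. The LITERAL marker
       and the function name are hypothetical, and giving the literal
       character a weight equal to the summed counts of the bytes it
       replaces is an assumption. Passing lit_thresh=None mimics -l-.

```python
import heapq
from itertools import count

LITERAL = "LIT"  # hypothetical marker for the literal pseudo-character


def code_lengths(counts, lit_thresh=0):
    """Return {symbol: code length in bits} for a Huffman code.

    Bytes whose count is <= lit_thresh are folded into one literal
    pseudo-character. lit_thresh=None keeps every byte as its own symbol.
    """
    if lit_thresh is None:
        weights = dict(counts)
    else:
        weights = {s: n for s, n in counts.items() if n > lit_thresh}
        # Assumption: the literal is weighted by the bytes it replaces.
        weights[LITERAL] = sum(n for n in counts.values() if n <= lit_thresh)

    tie = count()  # unique tie-breaker so symbol lists are never compared
    heap = [(w, next(tie), [s]) for s, w in weights.items()]
    heapq.heapify(heap)
    lengths = {s: 0 for s in weights}
    while len(heap) > 1:
        w1, _, group1 = heapq.heappop(heap)
        w2, _, group2 = heapq.heappop(heap)
        for s in group1 + group2:
            lengths[s] += 1  # every merge adds one bit to these symbols
        heapq.heappush(heap, (w1 + w2, next(tie), group1 + group2))
    return lengths


# Byte counts for a tiny sample; most of the 256 byte values never occur.
counts = {b: 0 for b in range(256)}
for b in b"abracadabra":
    counts[b] += 1

with_literal = code_lengths(counts, lit_thresh=0)        # like -l0
without_literal = code_lengths(counts, lit_thresh=None)  # like -l-
print(max(with_literal.values()), max(without_literal.values()))
```

       In this sketch the longest code without a literal character is
       noticeably longer, because every unused byte value still needs its
       own leaf in the code tree.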

       The required input file name (huffile) is the base file name of the en-
       code  table,  excluding the .huf	extension. dtsrload expects huffile to
       be ophuf. Similarly, the	decode module will be named huffile.c.

       At the beginning	of each	new execution, huffcode	tries to open the  en-
       code table file and continue byte frequency counting from the last run.
       If the huf file represented by huffile  does  not  exist,  the  table's
       counts are initialized to zeroes. The decode module is recomputed fresh
       each run, whether it existed before or not.
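
       The accumulation of byte counts across runs can be sketched in a few
       lines. This is an illustration only: the function name is
       hypothetical, the counts are held in memory, and nothing here reads
       or writes the actual huf file format.

```python
from collections import Counter
from typing import Optional


def accumulate_counts(previous: Optional[Counter], sample: bytes) -> Counter:
    """Add the byte frequencies of `sample` to counts from earlier runs.

    previous -- counts carried over from the last run, or None when no
                earlier table exists (all counts then start at zero)
    """
    counts = Counter({b: 0 for b in range(256)})  # start from zeroes
    if previous is not None:
        counts.update(previous)  # continue counting from the last run
    counts.update(sample)        # iterating bytes yields integer values
    return counts


run1 = accumulate_counts(None, b"abracadabra")  # no earlier table
run2 = accumulate_counts(run1, b"banana")       # second text sample
print(run2[ord("a")], run2[ord("b")], run2[ord("z")])  # prints: 8 3 0
```

       Only the counts carry over between runs; the codes themselves would
       be rebuilt from the updated counts each time.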




       The return values are as	follows:

       0	 huffcode completed successfully.

       nonzero	 huffcode encountered an error.

       huffcode	reads the specified huffile. It	also reads textfile if	it  is
       specified.  It writes to	huffile.huf and	huffile.c.

       Read  ophuf.huf if it exists and	initialize the internal	byte count ta-
       ble with	its byte frequency counts. If ophuf.huf	does  not  exist,  the
       internal	 byte  counts will be initialized to zeros. The	encoding table
       in the original huf file	will be	discarded. The text file foo.txt  will
       be  read	and its	individual byte	frequencies added to the internal byte
       count table. Then, ophuf.huf will be  written  out,  with  an  encoding
       scheme  based  on the current byte counts, and with a literal character
       encoding	all bytes that have zero frequency.  Finally,  if  the	decode
       module  ophuf.c already exists, a prompt	requesting permission to over-
       write it	will be	output to stdout and, if an  affirmative  response  is
       read  from stdin, a new version corresponding to	the new	ophuf.huf will
       be written out.

       huffcode	ophuf foo.txt

       Read myappl.huf and initialize the internal byte	count table  with  its
       byte  frequency	counts.	 Since	no textfile argument is	specified, the
       only possible action is to build	different coding tables	using existing
       frequency  counts in myappl.huf.	The new	tables will be based on	a lit-
       eral character implementation where only	bytes that occur more than 200
       times  will  be given an	encoding; all other bytes will be encoded with
       the literal character. After the new encoding tables are generated,
       myappl.huf will be written out. The decode module myappl.c will also be
       written out without prompting, whether or not it already exists.

       huffcode	-l200 -o myappl

       dtsrcreate(1), dtsrdbrec(1), dtsrload(1), DtSrAPI(3), DtSearch(5)

							    huffcode(user cmd)

