mifluz(3)		   Library Functions Manual		     mifluz(3)

       mifluz -	C++ library to use and manage inverted indexes

       #include	<mifluz.h>

	  Configuration* config	= WordContext::Initialize();

	  WordList* words = new	WordList(*config);


	  delete words;


       The  purpose of mifluz is to provide a C++ library to build and query a
       full text inverted index. It is dynamically updatable, scalable (up  to
       1Tb  indexes),  uses  a controlled amount of memory, shares index files
       and memory cache	among processes	or threads and compresses index	 files
       to  50%	of the raw data. The structure of the index is configurable at
       runtime and allows inclusion  of	 relevance  ranking  information.  The
       query  functions	 do  not  require  loading  all	 the  occurrences of a
       searched	term.  They consume very few resources and many	 searches  can
       be run in parallel.

       The file management library used in mifluz is a modified Berkeley DB
       version 3.1.14.


       Configuration
              reads the configuration file and manages it in memory.

       WordContext
              read configuration and set up mifluz context.

       WordCursor
              abstract class to search and retrieve entries in a WordList
              object.

       WordCursorOne
              search and retrieve entries in a WordListOne object.

       WordDBInfo
              inverted index usage environment.

       WordDict
              manage and use an inverted index dictionary.

       WordKey
              inverted index key.

       WordKeyInfo
              information on the key structure of the inverted index.

       WordList
              abstract class to manage and use an inverted index file.

       WordListOne
              manage and use an inverted index file.

       WordMonitor
              monitoring classes activity.

       WordRecord
              inverted index record.

       WordRecordInfo
              information on the record structure of the inverted index.

       WordReference
              inverted index occurrence.

       WordType
              defines a word in terms of allowed characters, length etc.


       htdb_dump
              dump the content of an inverted index in Berkeley DB fashion.

       htdb_stat
              displays statistics for Berkeley DB environments.

       htdb_load
              load the content of an inverted index in Berkeley DB fashion.

       mifluzdict
              dump the dictionary of an inverted index.

       mifluzdump
              dump the content of an inverted index.

       mifluzload
              load the content of an inverted index.

       mifluzsearch
              search the content of an inverted index.

       The format of the configuration file read by WordContext::Initialize
       is:
       keyword: value
       Comments may be added on lines starting with a #. The default
       configuration file is read from the file pointed to by the
       MIFLUZ_CONFIG environment variable, or ~/.mifluz, or /etc/mifluz.conf,
       in this order. If no configuration file is available, built-in
       defaults are used. Here is an example configuration file:
       wordlist_extend:	true
       wordlist_cache_size: 10485760
       wordlist_page_size: 32768
       wordlist_compress: 1
       wordlist_wordrecord_description:	NONE
       wordlist_wordkey_description: Word/DocID	32/Flags 8/Location 16
       wordlist_monitor: true
       wordlist_monitor_period:	30
       wordlist_monitor_output:	monitor.out,rrd
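
       To illustrate the keyword: value format, here is a minimal sketch of
       a reader for such a file. It is not the mifluz Configuration class:
       the names trim_blanks and parse_mifluz_config are hypothetical, and
       tolerating leading blanks and skipping blank or malformed lines are
       assumptions of this sketch.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper, not part of mifluz: strip leading/trailing blanks.
static std::string trim_blanks(const std::string& s) {
    const char* blanks = " \t\r\n";
    const std::string::size_type first = s.find_first_not_of(blanks);
    if (first == std::string::npos) return "";
    const std::string::size_type last = s.find_last_not_of(blanks);
    return s.substr(first, last - first + 1);
}

// Hypothetical reader for "keyword: value" lines.
// Lines whose first non-blank character is '#' are comments.
// Assumption: blank lines and lines without a ':' are silently skipped.
std::map<std::string, std::string> parse_mifluz_config(std::istream& in) {
    std::map<std::string, std::string> settings;
    std::string raw;
    while (std::getline(in, raw)) {
        const std::string line = trim_blanks(raw);
        if (line.empty() || line[0] == '#') continue;
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;
        const std::string key = trim_blanks(line.substr(0, colon));
        if (key.empty()) continue;
        // Everything after the first ':' is the value.
        settings[key] = trim_blanks(line.substr(colon + 1));
    }
    return settings;
}
```

       Feeding the example file above through a std::istringstream or a
       std::ifstream would yield one map entry per keyword, for instance
       wordlist_page_size mapped to the string 32768.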

       wordlist_allow_numbers {true|false} (default false)
              A digit is considered a valid character within a word if this
              configuration parameter is set to true; otherwise it is an error
              to insert a word containing digits. See the Normalize method
              for more information.

       wordlist_cache_inserts {true|false} (default false)
              If true, all Insert calls are cached in memory. When the
              WordList object is closed or a different access method is
              called, the cached entries are flushed to the inverted index.

       wordlist_cache_max <bytes> (default 0)
              Maximum cumulative size of the cache files generated when doing
              bulk insertion with the BatchStart() function. When this limit
              is reached, the cache files are all merged into the inverted
              index. The value 0 means infinite size allowed. See WordList(3)
              for the rationale behind cache file handling.

       wordlist_cache_size <bytes> (default 500K)
              Berkeley DB cache size (see Berkeley DB documentation). The
              cache makes a huge difference in performance. It must be at
              least 2% of the expected total data size. Note that if
              compression is activated, the data size is eight times larger
              than the actual file size. In this case the cache must be
              scaled to 2% of the data size, not 2% of the file size. See
              Cache tuning in the mifluz guide for more hints. See
              WordList(3) for the rationale behind cache file handling.
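
              The sizing rule above can be written out as a short
              calculation. This is only a sketch: min_cache_size is a
              hypothetical helper, not a mifluz function, and rounding the
              result up is an assumption.

```cpp
#include <cstdint>

// Hypothetical helper, not part of mifluz.
// Returns the smallest wordlist_cache_size, in bytes, that satisfies the
// 2% rule for an index file of file_size_bytes.
std::uint64_t min_cache_size(std::uint64_t file_size_bytes, bool compressed) {
    // With compression activated, the data size is taken to be eight
    // times the size of the file on disk, as stated in the text.
    const std::uint64_t data_size =
        compressed ? file_size_bytes * 8 : file_size_bytes;
    // 2% of the data size, rounded up (rounding is an assumption).
    return (data_size * 2 + 99) / 100;
}
```

              For a compressed index file of 1,000,000,000 bytes the data
              size is 8,000,000,000 bytes, so the cache should be at least
              160,000,000 bytes rather than the 20,000,000 bytes that 2% of
              the file size would suggest.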

       wordlist_compress {true|false} (default false)
	      Activate compression of the index. The resulting index is	 eight
	      times smaller than the uncompressed index.

       wordlist_env_dir	<directory> (default .)
	      Only  valid  if wordlist_env_share set to	true.  Specify the di-
	      rectory in which the sharable environment	will be	 created.  All
	      inverted	indexes	specified with a non-absolute pathname will be
	      created relative to this directory.

       wordlist_env_share {true|false} (default false)
              If true, a sharable environment is opened, or created if none
              exists.

       wordlist_env_skip {true|false} (default false)
              If true, no environment is created at all. This must never be
              used if a WordList object is created. It may be useful if only
              WordKey objects are used, for instance.

       wordlist_extend {true|false} (default false)
              If true, maintain a reference count of unique words. The
              Noccurrence method gives access to this count.

       wordlist_locale <locale> (default C)
              Set the locale of the program to locale. See setlocale(3) for
              more information.

       wordlist_lowercase {true|false} (default true)
              If a word contains upper case letters, it is converted to
              lower case if this configuration parameter is true; otherwise
              it is left untouched.

       wordlist_maximum_word_length <number> (default 25)
              The maximum length of a word. See the Normalize method for more
              information.

       wordlist_minimum_word_length <number> (default 3)
              The minimum length of a word. See the Normalize method for more
              information.

       wordlist_monitor	{true|false} (default false)
	      If  true	create a WordMonitor instance to gather	statistics and
	      build reports.

       wordlist_monitor_output <file>[,{rrd|readable}] (default stderr)
              Print reports to file instead of the default stderr. If type is
              set to rrd, the output is fit for the benchmark-report script.
              Otherwise it is a (hardly :-) readable string.

       wordlist_monitor_period <sec> (default 0)
	      If the value sec is a positive integer, set a timer to print re-
	      ports  every sec seconds.	The timer is set using the ALRM	signal
	      and will fail if the calling application already has  a  handler
	      on that signal.

       wordlist_page_size <bytes> (default 8192)
	      Berkeley DB page size (see Berkeley DB documentation)

       wordlist_truncate {true|false} (default true)
              If a word is too long according to
              wordlist_maximum_word_length, it is truncated if this
              configuration parameter is true; otherwise it is considered an
              invalid word.

       wordlist_valid_punctuation [characters] (default	none)
	      A	 list  of  punctuation	characters  that may appear in a word.
	      These characters will be removed from the	word before  insertion
	      in the index.
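
              Taken together with wordlist_allow_numbers, wordlist_lowercase,
              wordlist_truncate and the two word length parameters, the
              effect on a word can be sketched as follows. This is not the
              Normalize method: WordRules and normalize_word are
              hypothetical names, the order of the checks is an assumption,
              and characters not covered by any parameter are kept as-is.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical settings mirroring the parameters described above,
// initialised to their documented defaults.
struct WordRules {
    bool allow_numbers = false;
    bool lowercase = true;
    std::size_t maximum_word_length = 25;
    std::size_t minimum_word_length = 3;
    bool truncate = true;
    std::string valid_punctuation;  // default none
};

// Hypothetical function: returns true and stores the normalized word in
// `out` if `word` is acceptable under `rules`, false otherwise.
bool normalize_word(const WordRules& rules, const std::string& word,
                    std::string& out) {
    std::string w;
    for (std::size_t i = 0; i < word.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(word[i]);
        // Valid punctuation may appear in a word but is removed.
        if (rules.valid_punctuation.find(word[i]) != std::string::npos) continue;
        // Digits are an error unless explicitly allowed.
        if (std::isdigit(c) && !rules.allow_numbers) return false;
        w += rules.lowercase ? static_cast<char>(std::tolower(c)) : word[i];
    }
    // Too long: truncate if allowed, otherwise the word is invalid.
    if (w.size() > rules.maximum_word_length) {
        if (!rules.truncate) return false;
        w.resize(rules.maximum_word_length);
    }
    // Too short: the word is invalid.
    if (w.size() < rules.minimum_word_length) return false;
    out = w;
    return true;
}
```

              With the defaults, Hello is accepted as hello, while ab (too
              short) and abc123 (contains digits) are rejected.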

       wordlist_verbose	<number> (default 0)
	      Set the verbosity	level of the WordList class.

	      1	walk logic

	      2	walk logic details

	      3	walk logic lots	of details

       wordlist_wordkey_description <desc> (no default)
	      Describe	the  structure of the inverted index key.  In the fol-
	      lowing explanation of the	_desc_ format, mandatory words are  in
	      bold and values that must	be replaced in italic.

	      Word bits/name bits [/...]

              name is an alphanumerical symbolic name for the key field and
              bits is the number of bits required to store this field. Note
              that all values are stored in unsigned integers (unsigned
              int). Example:
	      Word 8/Document 16/Location 8
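
              As an illustration of the format, the following sketch splits
              such a description into fields and adds up the bits. It is
              not the WordKeyInfo implementation: KeyField,
              parse_key_description and total_key_bits are hypothetical
              names, and treating a field without a bit count (as in
              Word/DocID 32/Flags 8/Location 16) as 0 bits is an
              assumption.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical type, not the mifluz WordKeyInfo API.
struct KeyField {
    std::string name;
    int bits;  // 0 when the description gives no bit count for the field
};

// Split a description on '/' and read each part as "name [bits]".
std::vector<KeyField> parse_key_description(const std::string& desc) {
    std::vector<KeyField> fields;
    std::istringstream parts(desc);
    std::string part;
    while (std::getline(parts, part, '/')) {
        std::istringstream tokens(part);
        KeyField field;
        field.bits = 0;
        if (!(tokens >> field.name)) {
            throw std::invalid_argument("empty field in key description");
        }
        int bits = 0;
        if (tokens >> bits) {  // the bit count is optional in this sketch
            if (bits < 0) throw std::invalid_argument("negative bit count");
            field.bits = bits;
        }
        fields.push_back(field);
    }
    return fields;
}

// Sum of the bit counts of all fields.
int total_key_bits(const std::vector<KeyField>& fields) {
    int total = 0;
    for (std::size_t i = 0; i < fields.size(); ++i) total += fields[i].bits;
    return total;
}
```

              For the example above this gives three fields (Word, Document,
              Location) and a total of 32 bits.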

       wordlist_wordkey_document [field ...] (default none)
              A white space separated list of field numbers that define a
              document. The field number list must not contain gaps. For
              instance 1 2 3 is valid but 1 3 4 is not valid. This
              configuration parameter is not used by the mifluz library but
              may be used by a query application to define the semantics of
              a document. In response to a query, the application will
              return a list of results in which only distinct documents will
              be shown.
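
              Since the library itself does not use this parameter, a query
              application may want to check the no-gaps rule on its own. A
              minimal sketch, in which is_gapless_field_list is a
              hypothetical helper and accepting the numbers in any order
              while rejecting empty lists, duplicates and non-numeric
              tokens are assumptions:

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, not part of mifluz.
// Returns true if the white space separated field numbers in `value`
// form a run without gaps, e.g. "1 2 3" but not "1 3 4".
bool is_gapless_field_list(const std::string& value) {
    std::istringstream in(value);
    std::vector<int> fields;
    int n;
    while (in >> n) fields.push_back(n);
    // Reject an empty list, and reject input where reading stopped on a
    // non-numeric token instead of at the end of the string.
    if (fields.empty() || !in.eof()) return false;
    std::sort(fields.begin(), fields.end());
    for (std::size_t i = 1; i < fields.size(); ++i) {
        if (fields[i] != fields[i - 1] + 1) return false;
    }
    return true;
}
```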

       wordlist_wordkey_location field (default	none)
	      A	single field number that contains the position of a word in  a
	      given document.  This configuration parameter is not used	by the
	      mifluz library but may be	used by	a query	application.

       wordlist_wordrecord_description {NONE|DATA|STR} (no default)
	      NONE: the	record is empty

	      DATA: the	record contains	an integer (unsigned int)

	      STR: the record contains a string	(String)

       MIFLUZ_CONFIG: file name of the configuration file read by
       WordContext(3). Defaults to ~/.mifluz or /usr/etc/mifluz.conf.

       Loic Dachary

       The Ht://Dig group

       htdb_dump(1), htdb_stat(1), htdb_load(1), mifluzdump(1), mifluzload(1),
       mifluzsearch(1), mifluzdict(1), WordContext(3), WordList(3),
       WordDict(3), WordListOne(3), WordKey(3), WordKeyInfo(3), WordType(3),
       WordDBInfo(3), WordRecordInfo(3), WordRecord(3), WordReference(3),
       WordCursor(3), WordCursorOne(3), WordMonitor(3), Configuration(3)

				     local			     mifluz(3)

