DBACL(1)							      DBACL(1)

       dbacl - a digramic Bayesian classifier for text recognition.

       dbacl [-01dvnirmwMNDXW] [-T type	] -l category [-h size]	[-H gsize] [-x
	      decim] [-q quality] [-w max_order] [-e deftok] [-o  online]  [-L
	      measure] [-z ftresh] [-O ronline]...  [-g	regex]...  [FILE]...

       dbacl  [-vnimNRXYP]  [-h	 size]	[-T type] -c category [-c category]...
	      [-f keep]...  [FILE]...

       dbacl -V

       dbacl is	a Bayesian text	 and  email  classifier.  When	using  the  -l
       switch, it learns a body of text and produces a file named category
       which summarizes	the text. When using the -c  switch,  it  compares  an
       input  text  stream  with any number of category	files, and outputs the
       name of the closest  match,  or	optionally  various  numerical	scores
       explained below.

       Whereas	this manual page is intended as	a reference, there are several
       tutorials and documents you can read to	get  specialized  information.
       Specific	 documentation	about  the design of dbacl and the statistical
       models that it uses can be found	in  For a basic overview  of
       text  classification  using dbacl, see tutorial.html. A companion tuto-
       rial geared towards email filtering is email.html. If you have  trouble
       getting dbacl to	classify reliably, read	is_it_working.html.  The USAGE
       section of this manual page also	has some examples.





       dbacl uses a maximum entropy (minimum divergence) language  model  con-
       structed	 with  respect to a digramic reference measure (unknown	tokens
       are predicted from digrams, i.e.	pairs of letters).  Practically,  this
       means  that  a category is constructed from tokens in the training set,
       while previously	unseen tokens  can  be	predicted  automatically  from
       their  letters.	A token	here is	either a word (fragment) or a combina-
       tion of words (fragments),  selected  according	to  various  switches.
       Learning	roughly	works by tweaking token	probabilities until the	train-
       ing data	is least surprising.

       The normal shell	exit conventions aren't	followed (sorry!). When	 using
       the -l command form, dbacl returns zero on success, nonzero if an error
       occurs. When using the -c form, dbacl returns a positive	integer	corre-
       sponding	 to  the  category  with the highest posterior probability. In
       case of a tie, the first	most probable category is chosen. If an	 error
       occurs, dbacl returns zero.
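
       Because the -c form reports the winning category through its exit
       status, a script has to translate that number back into a name. The
       following is a minimal sketch of one way to do so; pick_category is a
       hypothetical helper (not part of dbacl), and twain and shake stand for
       category files you have already learned.

```shell
# pick_category STATUS CATEGORY...
# Prints the CATEGORY whose 1-based position equals STATUS.
# Returns 1 when STATUS is 0 (dbacl error) or out of range.
# Hypothetical helper, not part of dbacl.
pick_category() {
    status=$1
    shift
    if [ "$status" -lt 1 ] || [ "$status" -gt "$#" ]; then
        return 1
    fi
    shift $((status - 1))
    printf '%s\n' "$1"
}

# Possible usage (requires dbacl and the category files):
#   echo "to be or not to be" | dbacl -c twain -c shake
#   pick_category $? twain shake || echo "classification failed" >&2
```

       The categories must be listed in the same order as on the dbacl
       command line, since the exit status is only a position in that list.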

       When  using the -l command form,	dbacl learns a category	when given one
       or more FILE names, which should	contain	readable  ASCII	 text.	If  no
       FILE  is	 given,	dbacl learns from STDIN. If FILE is a directory, it is
       opened and all its files	are read,  but	not  its  subdirectories.  The
       result  is  saved  in  the  binary  file	named category,	and completely
       replaces	any previous contents. As a convenience,  if  the  environment
       variable	DBACL_PATH contains a directory, then that is prepended	to the
       file path, unless category starts with a	'/' or a '.'.
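
       The DBACL_PATH rule can be pictured with a small shell function. This
       is only a sketch of the rule as described above, not dbacl's own code;
       resolve_category is a hypothetical name, and the sketch assumes that
       any non-empty DBACL_PATH counts as "containing a directory".

```shell
# resolve_category NAME
# Prints the path that the rule described above would produce for NAME.
# Hypothetical helper that mimics the documented rule, not dbacl itself.
resolve_category() {
    case $1 in
        /*|.*)
            # Names starting with '/' or '.' are used as given.
            printf '%s\n' "$1"
            ;;
        *)
            if [ -n "${DBACL_PATH:-}" ]; then
                printf '%s/%s\n' "$DBACL_PATH" "$1"
            else
                printf '%s\n' "$1"
            fi
            ;;
    esac
}
```

       For example, with DBACL_PATH set to /var/db/dbacl (an invented
       directory), the name twain would resolve to /var/db/dbacl/twain, while
       ./twain would be left alone.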

       The input text for learning is assumed to be unstructured plain text by
       default.	 This  is  not suitable	for learning email, because email con-
       tains various transport encodings and formatting	instructions which can
       reduce classification effectiveness. You	must use the -T	switch in that
       case so that dbacl knows	it should perform decoding  and	 filtering  of
       MIME and HTML as appropriate. Appropriate switch values are "-T email"
       for RFC2822 email input, "-T html" for HTML input, "-T xml" for generic
       XML style input, and "-T text" for the default plain text format. There
       are other values	of the -T switch that also allow fine  tuning  of  the
       decoding	capabilities.

       When  using  the	 -c  command form, dbacl attempts to classify the text
       found in	FILE, or STDIN if no FILE is  given.  Each  possible  category
       must  be	 given separately, and should be the file name of a previously
       learned text corpus. As a convenience, if the variable DBACL_PATH  con-
       tains  a	 directory,  it	 is  prepended to each file path which doesn't
       start with a '/'	or a '.'. The visible  output  of  the	classification
       depends	on  the	 combination  of  extra	switches used. If no switch is
       used, then no output is shown on	STDOUT.	However, dbacl always produces
       an exit code which can be tested.

       To see an output	for a classification, you must use at least one	of the
       -v,-U,-n,-N,-D,-d switches. Sometimes, they can be used in  combination
       to  produce a natural variation of their	individual outputs. Sometimes,
       dbacl also produces warnings on STDERR if applicable.

       The -v switch outputs the name of  the  best  category  among  all  the
       choices given.

       The  -U switch outputs the name of the best category followed by	a con-
       fidence percentage. Normally, this is the switch	that you want to  use.
       A  percentage  of  100% means that dbacl	is sure	of its choice, while a
       percentage of 0%	means that some	other category is equally likely. This
       is  not the model probability, but measures how unambiguous the classi-
       fication	is, and	can be used to tag unsure classifications (e.g.	if the
       confidence is 25% or less).

       The  -N	switch	prints	each category name followed by its (posterior)
       probability, expressed as a percentage. The percentages always  sum  to
       100%.  This is intuitive, but only valuable if the document being clas-
       sified contains a handful of tokens (ten	or less). In the  common  case
       with  many more tokens, the probabilities are always extremely close to
       100% and	0%.

       The -n switch prints each category name followed	by the negative	 loga-
       rithm  of  its  probability. This is equivalent to using	the -N switch,
       but much	more useful. The smallest number gives the  best  category.  A
       more  convenient	 form is to use	both -n	and -v which prints each cate-
       gory name followed by the cross entropy and the number of  tokens  ana-
       lyzed.  The  cross  entropy  measures (in bits) the average compression
       rate which is achievable, under the given category model, per token  of
       input  text.  If	 you  use all three of -n,-v,-X	then an	extra value is
       output for each category, representing a	kind of	p-value	for each cate-
       gory  score.  This  indicates  how typical the score is compared	to the
       training	documents, but only works if the -X  switch  was  used	during
       learning, and only for some types of models (e.g. email).  These	p-val-
       ues are uniformly distributed and independent (if  the  categories  are
       independent),  so  can  be  combined using Fisher's chi squared test to
       obtain composite	p-values for groupings of categories.
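
       The relation between the -n scores and the -N percentages can be
       sketched with awk. The sketch assumes that -n prints "name score"
       pairs on one line and that the scores are base 2 logarithms (the
       cross entropy is stated in bits); both are assumptions, and
       scores_to_percent is a hypothetical helper, not part of dbacl.

```shell
# scores_to_percent
# Reads lines of "name score name score ..." and prints "name percent%" pairs.
# Assumes each score is -log2(probability) up to a shared constant.
scores_to_percent() {
    awk '{
        # Subtract the smallest score first so the best category gets weight 1.
        min = $2 + 0
        for (i = 4; i <= NF; i += 2) if ($i + 0 < min) min = $i + 0
        total = 0
        for (i = 2; i <= NF; i += 2) {
            weight[i] = exp(-($i - min) * log(2))   # 2^-(score - min)
            total += weight[i]
        }
        out = ""
        for (i = 2; i <= NF; i += 2)
            out = out sprintf("%s%s %.2f%%", (i > 2 ? " " : ""), $(i - 1), 100 * weight[i] / total)
        print out
    }'
}

# Possible usage (requires dbacl and the category files):
#   echo "to be or not to be" | dbacl -n -c twain -c shake | scores_to_percent
```

       A gap of one bit between two categories already gives a 2:1 split,
       which is why long documents end up near 100% and 0%.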

       The -v and -X switches together print each category name followed by a
       detailed decomposition of the category score, factored into
       ( divergence rate + Shannon entropy rate ) * token count @ p-value.
       Again, this only works in some types of models.

       The -v and -U switches print each category name followed by a
       decomposition of the category score into
       ( divergence rate + Shannon entropy rate # score variance ) * token count.

       The -D switch prints out	the input text as modified internally by dbacl
       prior to	tokenization. For example, if a	MIME encoded email document is
       classified,  then  this	prints	the decoded text that will be actually
       tokenized and classified. This switch is	mainly useful for debugging.

       The -d switch dumps tokens and scores while they	are being read.	It  is
       useful  for  debugging,	or if you want to create graphical representa-
       tions of	the classification. A detailed explanation of  the  output  is
       beyond  the scope of this manual	page, but is straightforward if	you've
       read  Possible	variations include -d together with -n or  -N.

       Classification can be done with one or several categories in principle.
       When two	or more	categories are used, the Bayesian posterior  probabil-
       ity is used, given the input text, with a uniform prior distribution on
       categories. For other choices  of  prior,  see  the  companion  utility
       bayesol(1).  When a single category is used, classification can be done
       by comparing the score with a threshold. In practice, however, much
       better results are obtained with several categories.

       Learning	 and  classifying  cannot be mixed on the same command invoca-
       tion, however there are no locking issues and separate dbacl  processes
       can  operate  simultaneously  with obvious results, because file	opera-
       tions are designed to be	atomic.

       Finally,	note that dbacl	does not manage	your document corpora or  your
       computed	 categories.   In  particular,	dbacl cannot add or subtract a
       document	from a category	file directly.	If you want to learn  a	 cate-
       gory incrementally, the standard	way is to keep adding to your document
       corpus, and learn the whole corpus each time.  By  keeping  control  of
       your  archives,	you can	never lose the information in your categories,
       and you can easily experiment with different switches or	 tokenizations
       or sets of training documents if	you like.
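
       The standard incremental method (grow the corpus, then relearn all of
       it) might be scripted as follows. This is a sketch under assumptions:
       add_and_relearn is a hypothetical helper, and the corpus is assumed to
       be a flat directory holding one file per document.

```shell
# add_and_relearn CATEGORY CORPUS_DIR NEW_DOCUMENT
# Copies NEW_DOCUMENT into CORPUS_DIR, then relearns CATEGORY from the whole
# directory. Hypothetical helper; only "dbacl -l" comes from dbacl itself.
add_and_relearn() {
    category=$1
    corpus_dir=$2
    new_doc=$3
    mkdir -p "$corpus_dir" || return 1
    cp "$new_doc" "$corpus_dir/" || return 1
    # A directory argument makes dbacl read every file in it
    # (but not its subdirectories).
    dbacl -l "$category" "$corpus_dir"
}

# Possible usage (requires dbacl):
#   add_and_relearn twain ./corpus/twain new_chapter.txt
```

       A document whose file name already exists in the corpus directory
       would overwrite the older copy, so distinct names are needed.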

       If  the standard	incremental learning method is too slow, the -o	switch
       can help. This creates a	data file named	online which contains all  the
       document	 statistics that have been learned. When you use the -l	and -o
       switches	together, dbacl	merges the online data	file  (if  it  exists)
       with  the  new document(s) to be	learned, and recreates an updated ver-
       sion of online.	This is	equivalent to adding the new documents to  the
       corpus  and relearning the whole	corpus,	but faster. However, documents
       cannot be removed if you	change your mind.  This	 is  a	limitation  of
       dbacl  which  cannot  be	changed	for mathematical reasons. You can work
       around this by making backups of	the online data	file. It is also  pos-
       sible  to  merge	 one or	more extra online data files simultaneously by
       using the -O switch one or more times.
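
       Since documents cannot be removed from an online data file, the backup
       workaround mentioned above could look like the sketch below.
       learn_online is a hypothetical helper, the ".bak" suffix is an
       arbitrary choice, and the sketch assumes DBACL_PATH is unset so that
       file names are used exactly as given.

```shell
# learn_online CATEGORY ONLINE_FILE DOCUMENT...
# Saves a copy of ONLINE_FILE (if present) before merging new documents.
# Hypothetical helper; only "dbacl -l ... -o ..." comes from dbacl itself.
learn_online() {
    category=$1
    online=$2
    shift 2
    # Keep the previous state, because learned documents cannot be unlearned.
    if [ -f "$online" ]; then
        cp -p "$online" "$online.bak" || return 1
    fi
    dbacl -l "$category" -o "$online" "$@"
}

# Possible usage (requires dbacl):
#   learn_online twain twain.online new_chapter.txt
# To undo the last step, restore twain.online.bak and rerun the previous
# learning command.
```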

       By default, dbacl classifies the input text as a whole, i.e. it only
       outputs a single result even if you specify several input files. If you
       want to classify	multiple input files you can either call dbacl repeat-
       edly  (which is fast when you use the -m	switch), or use	the -F switch,
       which prints each input FILE followed by	 the  result  for  that	 FILE.
       Alternatively, you can classify each line of the	input individually, by
       using the -f option, which prints only those lines which	match  one  or
       more  models  identified	 by  keep  (use	the category name or number to
       refer to	a category). This last switch is useful	if you want to	filter
       out  some  lines,  but note that	if the lines are short,	then the error
       rate can	be high.
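
       Calling dbacl repeatedly, once per input file, can be sketched as a
       simple loop. label_files is a hypothetical helper restricted to two
       categories for brevity; it relies only on the -v and -c switches
       described above.

```shell
# label_files CATEGORY_A CATEGORY_B FILE...
# Prints one "FILE CATEGORY" line per input file.
# Hypothetical helper, not part of dbacl.
label_files() {
    cat_a=$1
    cat_b=$2
    shift 2
    for f in "$@"; do
        # -v prints the name of the best category for this single file.
        printf '%s %s\n' "$f" "$(dbacl -v -c "$cat_a" -c "$cat_b" "$f")"
    done
}

# Possible usage (requires dbacl and the category files):
#   label_files twain shake chapter1.txt chapter2.txt
```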

       The -e,-w,-g,-j switches	are used for  selecting	 an  appropriate  tok-
       enization  scheme. A token is a word or word fragment or	combination of
       words or	fragments. The shape of	tokens is important because  it	 forms
       the  basis of the language models used by dbacl.	 The -e	switch selects
       a predefined tokenization scheme, which is speedy but limited.  The  -w
       switch specifies	composite tokens derived from the -e switch. For exam-
       ple, "-e	alnum -w 2" means that	tokens	should	be  alphanumeric  word
       fragments combined into overlapping pairs (bigrams). When the -j switch
       is used, tokens keep their original capitalization; otherwise all
       tokens are converted to lowercase, which reduces the number of
       possible tokens and therefore memory consumption.

       If  the	-g  switch is used, you	can completely specify what the	tokens
       should look like	using a	regular	expression. Several -g switches	can be
       used  to	construct complex tokenization schemes,	and parentheses	within
       each expression can be used to select fragments and combine  them  into
       n-grams.	 The  cost  of	such flexibility is reduced classification and
       learning	speed. When experimenting with tokenization schemes, try using
       the -d or -D switches while learning or classifying, as they will print
       the tokens explicitly so	you can	see what text fragments	are picked  up
       or missed out. For regular expression syntax, see regex(7).

       The  -h	and  -H	 switches  regulate  how much memory dbacl may use for
       learning. Text classification can use a lot of memory, and  by  default
       dbacl  limits  itself even at the expense of learning accuracy. In many
       cases if	a limit	is reached, a  warning	message	 will  be  printed  on
       STDERR with some	advice.

       When  relearning	the same category several times, a significant speedup
       can be obtained by using	the -1 switch, as this allows  the  previously
       learned probabilities to	be read	from the category and reused.

       Note that classification accuracy depends foremost on the amount and
       quality of the training samples, and only then on the amount of tweaking.

       When using the -l command form, dbacl returns  zero  on	success.  When
       using  the  -c form, dbacl returns a positive integer (1,2,3...)	corre-
       sponding	to the category	with the  highest  posterior  probability.  In
       case  of	a tie, the first most probable category	is chosen. If an error
       occurs, dbacl returns zero.

       -0     When  learning,  prevents	 weight	 preloading.  Normally,	 dbacl
	      checks  if the category file already exists, and if so, tries to
	      use the existing weights as a starting point. This can  dramati-
	      cally  speed  up learning.  If the -0 (zero) switch is set, then
	      dbacl behaves as if no category file  already  exists.  This  is
	      mainly  useful  for  testing.   This  switch  is	now enabled by
	      default, to protect against weight drift which can reduce	 accu-
	      racy  over many learning iterations. Use -1 to force preloading.

       -1     Force weight preloading if the category file already exists. See
	      discussion of the	-0 switch.

       -a     Append  scores.  Every  input  line is written to	STDOUT and the
	      dbacl scores are appended. This  is  useful  for	postprocessing
	      with  bayesol(1).	  For ease of processing, every	original input
	      line is indented by a single space (to distinguish them from the
	      appended	scores),  and the line with the	scores (if -n is used)
	      is prefixed with the string "scores ". If	a second copy of dbacl
	      needs  to	 read this output later, it should be invoked with the
	      -A switch.

       -d     Dump the model parameters	to STDOUT. In conjunction with the  -l
	      option,  this  produces  a human-readable	summary	of the maximum
	      entropy model. In	conjunction with the -c	option,	 displays  the
	      contribution  of	each  token to the final score.	Suppresses all
	      other normal output.

       -e     Select character class for default (not  regex-based)  tokeniza-
	      tion.  By	default, tokens	are alphabetic strings only. This cor-
	      responds to the case when	deftok is "alpha". Possible values for
	      deftok  are  "alpha", "alnum", "graph", "char", "cef" and	"adp".
	      The last two are custom tokenizers intended for email  messages.
	      See  also	 isalpha(3).   The  "char"  tokenizer  picks up	single
	      printable	characters rather than bigger tokens, and is  intended
	      for testing only.

       -f     Filter  each  line  of  input separately,	passing	to STDOUT only
	      lines which match	the category identified	as keep.  This	option
	      should  be used repeatedly for each category which must be kept.
	      keep can be either the category file name, or a positive integer
	      representing  the	required category in the same order it appears
	      on the command line.

	      Output lines are flushed as soon as they	are  written.  If  the
	      input  file  is  a  pipe or character device, then an attempt is
	      made to use line buffering mode, otherwise  the  more  efficient
	      block buffering is used.

       -g     Learn only features described by the extended regular expression
	      regex.  This overrides the default feature selection method (see
	      -w  option) and learns, for each line of input, only tokens con-
	      structed from the	 concatenation	of  strings  which  match  the
	      tagged subexpressions within the supplied	regex.	All substrings
	      which match regex	within a suffix	of each	input line are treated
	      as features, even	if they	overlap	on the input line.

	      As  an  optional convenience, regex can include the suffix ||xyz
	      which indicates which  parenthesized  subexpressions  should  be
	      tagged. In this case, xyz	should consist exclusively of digits 1
	      to 9, numbering exactly those  subexpressions  which  should  be
	      tagged.  Alternatively,  if  no  parentheses exist within	regex,
	      then it is assumed that the whole	expression must	be captured.

       -h     Set the size of the hash table to	2^size	elements.  When	 using
	      the  -l  option,	this  refers  to  the total number of features
	      allowed in the maximum entropy model being learned.  When	 using
              the -c option together with the -M switch and multinomial type
	      categories, this refers to the maximum number of features	 taken
	      into account during classification.  Without the -M switch, this
	      option has no effect.

       -i     Fully internationalized mode. Forces the use of wide  characters
	      internally,  which  is  necessary	in some	locales. This incurs a
	      noticeable performance penalty.

       -j     Make features case sensitive. Normally, all  features  are  con-
	      verted  to  lower	 case during processing, which reduces storage
	      requirements  and	 improves  statistical	estimates  for	 small
	      datasets.	 With this option, the original	capitalization is used
	      for each feature.	This can improve classification	accuracy.

       -m     Aggressively maps	categories into	memory and locks them into RAM
	      to  prevent  swapping, if	possible. This is useful when speed is
	      paramount	and memory is plentiful, for example when testing  the
	      classifier on large datasets.

	      Locking  may  require  relaxing user limits with ulimit(1).  Ask
	      your system administrator.  Beware  when	using  the  -m	switch
	      together	with  the  -o  switch,	as only	one dbacl process must
	      learn or classify	at a time to prevent file  corruption.	If  no
	      learning	takes  place,  then  the  -m switch for	classifying is
	      always safe to use. See also the discussion for the -o switch.

       -n     Print scores for each category.  Each score is  the  product  of
	      two  numbers,  the cross entropy and the complexity of the input
	      text under each model. Multiplied	together, they	represent  the
	      log probability that the input resembles the model. To see these
	      numbers separately, use also the -v option. In conjunction  with
	      the  -f  option,	stops  filtering  but  prints  each input line
	      prepended	with a list of scores for that line.

       -q     Select quality of	learning, where	quality	can be 1,2,3,4.	Higher
	      values  take  longer to learn, and should	be slightly more accu-
	      rate. The	default	quality	is 1  if  the  category	 file  doesn't
	      exist or weights cannot be preloaded, and	2 otherwise.

       -o     When  learning, reads/writes partial token counts	so they	can be
	      reused. Normally,	category files are learned  from  exactly  the
	      input data given,	and don't contain extraneous information. When
	      this option is in	effect,	some extra information is saved	in the
	      file  online,  after all input was read. This information	can be
	      reread the next time that	learning occurs, to continue where the
	      previous	dataset	 left off. If online doesn't exist, it is cre-
	      ated. If online exists, it is read before	learning, and  updated
	      afterwards.  The file is approximately 3 times bigger (at	least)
	      than the learned category.

	      In dbacl,	file updates are atomic, but if	using the  -o  switch,
	      two  or  more processes should not learn simultaneously, as only
	      one process will write a lasting category	and memory  dump.  The
	      -m  switch can also speed	up online learning, but	beware of pos-
	      sible corruption.	 Only one process should read or write a file.
	      This  option is intended primarily for controlled	test runs. See
	      also the -O (big-oh) switch.

       -r     Learn the	digramic reference model only. Skips the  learning  of
	      extra features in	the text corpus.

       -v     Verbose  mode.  When learning, print out details of the computa-
	      tion, when classifying, print out	the name of the	most  probable
	      category.	  In conjunction with the -n option, prints the	scores
	      as an explicit product of	the cross entropy and the  complexity.

       -w     Select  default features to be n-grams up	to max_order.  This is
	      incompatible with	the -g option, which always takes  precedence.
	      If  no -w	or -g options are given, dbacl assumes -w 1. Note that
	      n-grams for n greater than 1 do  not  straddle  line  breaks  by
	      default.	The -S switch enables line straddling.

       -x     Set  decimation probability to 1 - 2^(-decim).  To reduce	memory
	      requirements when	learning, some inputs  are  randomly  skipped,
	      and  only	a few are added	to the model.  Exact behaviour depends
	      on the applicable	-T option (default is -T  "text").   When  the
	      type  is not "email" (eg "text"),	then individual	input features
	      are added	with probability 2^(-decim). When the type is "email",
	      then  full input messages	are added with probability 2^(-decim).
	      Within each such message,	all features are used.

       -z     When learning, only take into account features whose  occurrence
	      count  is	strictly greater than ftreshold. By default, ftreshold
	      is zero, so all features in the training corpus are used.	A neg-
	      ative value of ftreshold causes dbacl to subtract	from the maxi-
	      mum observed feature count, and to use that if it	 is  positive.
	      For  example,  -z	1 means	dbacl only learns features which occur
	      at least twice in	the corpus, and	-z -5 means dbacl only	learns
              the feature(s) whose occurrence count is within 4 of the global
              maximum.

       -A     Expect indented  input  and  scores.  With  this	switch,	 dbacl
	      expects  input  lines to be indented by a	single space character
	      (which is	then skipped).	Lines starting with any	other  charac-
	      ter are ignored. This is the counterpart to the -a switch	above.
	      When used	together with the -a switch, dbacl outputs the skipped
	      lines  as	they are, and reinserts	the space at the front of each
	      processed	input line.

       -D     Print debug output. Do not use normally, but can be very useful
              for displaying the list of features picked up while learning.

       -F     For  each	 FILE  of  input,  print the FILE name followed	by the
	      classification result  (normally	dbacl  only  prints  a	single
	      result even if multiple files are	listed as input).

       -H     Allow  hash  table  to  grow up to a maximum of 2^gsize elements
	      during learning. Initial size is given by	-h option.

       -L     Select the digramic reference measure for	character transitions.
	      The  measure  can	 be one	of "uniform", "dirichlet" or "maxent".
	      Default is "uniform".

       -M     Force multinomial	calculations. When learning, forces the	 model
	      features to be treated multinomially. When classifying, corrects
	      entropy scores to	reflect	multinomial probabilities (only	appli-
	      cable  to	 multinomial  type  models,  if	present).  Scores will
	      always be	lower, because the ordering of features	is lost.

       -N     Print posterior probabilities for	each category.	 This  assumes
	      the  supplied  categories	 form an exhaustive list of possibili-
	      ties.  In	conjunction with the -f	option,	 stops	filtering  but
	      prints each input	line prepended with a summary of the posterior
	      distribution for that line.

       -O     This switch causes the online data  file	named  ronline	to  be
	      merged  during  learning.	The ronline file must be created using
	      the -o (little-oh) switch.  Several -O data files	can be	merged
	      simultaneously.  This  is	 intended to be	a read only version of
	      -o, to allow piecing together of several sets of preparsed data.
	      See the description of the -o switch.

       -R     Include  an  extra category for purely random text. The category
	      is called	"random".  Only	makes sense when using the -c  option.

       -P     Correct the category scores to include estimated prior probabil-
	      ities. The prior probability estimate for	each category is  pro-
	      portional	 to  the  number of documents or, if that doesn't make
	      sense, the number	of unique features. This can help  with	 "bal-
	      ancing"  when  one  category is learned from much	more data than
	      another. If all categories are learned  from  approximately  the
	      same  amount  of data (or	maybe within a factor of 2), then this
	      option should have little	qualitative effect.

       -S     Enable line straddling. This is  useful  together	 with  the  -w
	      option  to  allow	 n-grams for n > 1 to ignore line breaks, so a
	      complex token can	continue past the end of the line. This	is not
	      recommended for email.

       -T     Specify  nonstandard text	format.	By default, dbacl assumes that
	      the input	text is	a purely ASCII text file. This corresponds  to
	      the case when type is "text".

	      There  are several types and subtypes which can be used to clean
	      the input	text of	extraneous tokens before  actual  learning  or
	      classifying  takes place.	Each (sub)type you wish	to use must be
	      indicated	with a separate	-T option on  the  command  line,  and
	      automatically implies the	corresponding type.

	      The  "text"  type	 is for	unstructured plain text. No cleanup is
	      performed. This is the default if	no types are given on the com-
	      mand line.

	      The "email" type is for mbox format input	files or single	RFC822
	      emails.  Headers are recognized and most are skipped. To include
	      extra  RFC822  standard  headers (except for trace headers), use
	      the "email:headers" subtype.  To include trace headers, use  the
	      "email:theaders"	subtype.  To include all headers in the	email,
	      use the "email:xheaders" subtype.	To skip	 all  headers,	except
	      the  subject,  use "email:noheaders". To scan binary attachments
	      for strings, use the "email:atts"	subtype.

	      When the "email" type is in effect, HTML markup is automatically
	      removed  from text attachments except text/plain attachments. To
	      also  remove  HTML  markup  from	plain  text  attachments,  use
	      "email:noplain".	To  prevent  HTML  markup  removal in all text
	      attachments, use "email:plain".

	      The "html" type is for removing HTML markup (between <html>  and
	      </html>  tags)  and  surrounding	text. Note that	if the "email"
	      type is enabled, then "html" is automatically enabled  for  com-
	      patible message attachments only.

	      The  "xml"  type	is  like "html", but doesn't honour <html> and
	      </html>, and doesn't interpret tags  (so	this  should  be  more
	      properly	called	"angle	markup"	removal, and has nothing to do
	      with actual XML semantics).

	      When "html" is enabled, most markup  attributes  are  lost  (for
	      values  of  'most'  close	 to  'all').  The "html:links" subtype
	      forces link urls to be parsed and	learned, which would otherwise
	      be ignored. The "html:alt" subtype forces	parsing	of alternative
	      text  in	ALT   attributes   and	 various   other   tags.   The
	      "html:scripts"  subtype forces parsing of	scripts, "html:styles"
	      forces parsing of	styles,	"html:forms" forces  parsing  of  form
	      values, while "html:comments" forces parsing of HTML comments.

       -U     Print  (U)nambiguity.   When  used  in  conjunction  with	the -v
	      switch, prints scores followed by	their empirical	standard devi-
	      ations.  When  used alone, prints	the best category, followed by
	      an estimated probability that this category choice is  unambigu-
	      ous. More	precisely, the probability measures lack of overlap of
	      CLT confidence intervals for each	category score	(If  there  is
	      overlap, then there is ambiguity).

	      This estimated probability can be	used as	an "unsure" flag, e.g.
	      if the estimated probability is  lower  than  50%.  Formally,  a
	      score of 0% means	another	category is equally likely to apply to
	      the input, and a score of	100% means no other category is	likely
	      to  apply	 to  the  input.  Note that this type of confidence is
	      unrelated	to the -X switch. Also,	the  probability  estimate  is
	      usually low if the document is short, or if the message contains
	      many tokens that have never been seen before  (only  applies  to
	      uniform digramic measure).

       -V     Print the	program	version	number and exit.

       -W     Like -w, but prevents features from straddling newlines. See the
	      description of -w.

       -X     Print the	confidence in the score	calculated for each  category,
	      when  used together with the -n or -N switch. Prepares the model
	      for confidence scores, when used with the	-l switch.  The	confi-
	      dence  is	 an  estimate of the typicality	of the score, assuming
	      the null hypothesis that the given  category  is	correct.  When
	      used  with  the  -v  switch  alone,  factorizes the score	as the
	      empirical	divergence plus	the  shannon  entropy,	multiplied  by
	      complexity, in that order. The -X	switch is not supported	in all
	      possible models, and displays a percentage of "0.0" if it	 can't
	      be calculated. Note that for unknown documents, it is quite com-
	      mon to have confidences close to zero.

       -Y     Print the	cumulative media counts.  Some	tokenizers  include  a
	      medium variable with each	token: for example, in email classifi-
	      cation the word "the" can	appear in the subject or the body of a
	      message,	but  the  subject is counted as	a separate medium from
	      the body.	This allows the	token frequencies to be	kept separate,
	      even  though the word is the same. Currently, up to 16 different
	      media are	supported (0-15), with	the  following	interpretation
	      for email:

	       0   unused.
	       1   default medium.
	       2   mail	body or	attachment in HTML format.
	       3   mail	body or	attachment in plain text format.
	       4   mail	header unknown.
	       5   User-Agent, Comments, Keywords, Note
	       6   X-MS*, Categor*, Priority, Importance, Thread-*
	       7   X-*
	       8   List-*
	       9   MIME-Version, Content-*
	       10  Subject
	       11  To
	       12  Sender, Sent, BCC, CC, From
	       13  Resent-*, Original-*
	       14  Message-ID, References, In-Reply-To
	       15  Received, Return-Path, Return-Receipt-To, Reply-To

	      The -Y switch prints the number of tokens	observed in each sepa-
	      rate medium, in order from 0 to 15.

       To create two category files in the current directory  from  two	 ASCII
       text  files  named  Mark_Twain.txt  and William_Shakespeare.txt respec-
       tively, type:

       % dbacl -l twain	Mark_Twain.txt
       % dbacl -l shake	William_Shakespeare.txt

       Now you can classify input text,	for example:

       % echo "howdy" |	dbacl -v -c twain -c shake
       % echo "to be or	not to be" | dbacl -v -c twain -c shake

       Note that at least the -v option is necessary; otherwise dbacl does
       not print anything. The return value is 1 in the first case and 2 in
       the second.

       % echo "to be or	not to be" | dbacl -v -N -c twain -c shake
       twain 22.63% shake 77.37%
       % echo "to be or	not to be" | dbacl -v -n -c twain -c shake
       twain  7.04 * 6.0 shake	6.74 * 6.0

       These invocations are equivalent. The numbers 6.74 and  7.04  represent
       how  close the average token is to each category, and 6.0 is the	number
       of tokens observed. If you want to  print  a  simple  confidence	 value
       together	with the best category,	replace	-v with	-U.

       % echo "to be or	not to be" | dbacl -U -c twain -c shake
       shake # 34%

       Note  that the true probability of category shake versus	category twain
       is 77.37%, but the calculation is somewhat ambiguous, and  34%  is  the
       confidence out of 100% that the calculation is qualitatively correct.
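
       The relationship between the two kinds of output can be sketched as
       follows. This is only an illustration: it assumes that each score
       (cross entropy times complexity) is a negative base 2 logarithm of a
       probability, and scores_to_percent is a hypothetical helper which is
       not part of dbacl.

```shell
# Hypothetical helper: convert one line of `dbacl -v -n` style output
# ("name crossentropy * complexity ...") into approximate percentages.
# Assumption: score = -log2(probability); this is not confirmed here.
scores_to_percent() {
    awk '{
        n = 0
        for (i = 1; i + 3 <= NF; i += 4) {
            name[n] = $i
            score[n] = $(i + 1) * $(i + 3)      # cross entropy * complexity
            if (n == 0 || score[n] < best) best = score[n]
            n++
        }
        total = 0
        for (k = 0; k < n; k++) {
            # 2^-(score - best); subtracting best avoids underflow
            w[k] = exp(-(score[k] - best) * log(2))
            total += w[k]
        }
        line = ""
        for (k = 0; k < n; k++)
            line = line sprintf("%s%s %.1f%%", (k ? " " : ""), name[k], 100 * w[k] / total)
        print line
    }'
}

echo "twain  7.04 * 6.0 shake  6.74 * 6.0" | scores_to_percent
```

       Because the printed scores are rounded, the result is close to, but
       not identical with, the percentages printed by -N.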

       Suppose	a  file	 document.txt contains English text lines interspersed
       with noise lines. To filter out the noise lines from the	English	lines,
       assuming you have an existing category, say shake, type:

       % dbacl -c shake	-f shake -R document.txt > document.txt_eng
       % dbacl -c shake	-f random -R document.txt > document.txt_rnd

       Note  that  the	quality	of the results will vary depending on how well
       the categories shake and random represent each input line. It is
       sometimes useful to see the posterior probabilities for each line
       without filtering:

       % dbacl -c shake	-f shake -RN document.txt > document.txt_probs

       You can now postprocess the posterior probabilities for	each  line  of
       text  with  another script, to replicate	an arbitrary Bayesian decision
       rule of your choice.

       In the special case of exactly two  categories,	the  optimal  Bayesian
       decision	 procedure can be implemented for documents as follows:	let p1
       be the prior probability	that the input text  is	 classified  as	 cate-
       gory1.  Consequently, the prior probability of classifying as category2
       is 1 - p1.  Let u12 be the cost of  misclassifying  a  category1	 input
       text as belonging to category2 and vice versa for u21.  We assume there
       is no cost for  classifying  correctly.	 Then  the  following  command
       implements the optimal Bayesian decision:

       % dbacl -N -c category1 -c category2 | awk '{ if($2 * p1 * u12 > $4 *
              (1 - p1) * u21) { print $1; } else { print $3; } }'
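
       In the command above, p1, u12 and u21 stand for numbers that you must
       supply; awk treats them as zero if they are left unset. A minimal
       sketch which passes them with awk -v follows. The function name
       bayes_decide and the numeric values are illustrative assumptions, and
       the input is assumed to consist of two category names, each followed
       by its posterior probability as a percentage.

```shell
# Hypothetical wrapper around the decision rule.
#   $1 = p1  (prior probability of the first category)
#   $2 = u12 (cost of classifying a first-category text as the second)
#   $3 = u21 (cost of classifying a second-category text as the first)
# stdin: one line of the form "name1 P1% name2 P2%"
bayes_decide() {
    awk -v p1="$1" -v u12="$2" -v u21="$3" '{
        # adding 0 forces a numeric reading, dropping the trailing "%"
        if (($2 + 0) * p1 * u12 > ($4 + 0) * (1 - p1) * u21)
            print $1
        else
            print $3
    }'
}

# Equal priors and equal costs: the more probable category wins.
echo "twain 22.63% shake 77.37%" | bayes_decide 0.5 1 1    # prints shake
# Missing a twain text is ten times as costly: the decision flips.
echo "twain 22.63% shake 77.37%" | bayes_decide 0.5 10 1   # prints twain
```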

       dbacl can also be used in conjunction with procmail(1) to  implement  a
       simple  Bayesian	email classification system. Assume that incoming mail
       should be automatically delivered to one	of three mail folders  located
       in  $MAILDIR and	named work, personal, and spam.	 Initially, these must
       be created and filled with appropriate  sample  emails.	 A  crontab(1)
       file can	be used	to learn the three categories once a day, e.g.

       5  0 * *	* dbacl	-T email -l $CATS/work $MAILDIR/work
       10 0 * *	* dbacl	-T email -l $CATS/personal $MAILDIR/personal
       15 0 * *	* dbacl	-T email -l $CATS/spam $MAILDIR/spam

       To  automatically  deliver  each	 incoming  email  into the appropriate
       folder, the following procmailrc(5) recipe fragment could be used:


       # run the spam classifier
       :0 c
       YAY=| dbacl -vT email -c	$CATS/work -c $CATS/personal -c	$CATS/spam

       # send to the appropriate mailbox
       :0:
       * ? test -n "$YAY"
       $MAILDIR/$YAY


       Sometimes, dbacl	will send the email to	the  wrong  mailbox.  In  that
       case, the misclassified message should be removed from its wrong	desti-
       nation and placed in the	correct	mailbox.  The error will be  corrected
       the  next  time	your messages are learned.  If it is left in the wrong
       category, dbacl will learn the wrong corpus statistics.

       The default text	features (tokens) read by dbacl	are purely  alphabetic
       strings,	 which minimizes memory	requirements but can be	unrealistic in
       some cases. To construct	models based on	alphanumeric tokens,  use  the
       -e  switch.  The	 example below also uses the optional -D switch, which
       prints a	list of	actual tokens found in the document:

       % dbacl -e alnum	-D -l twain Mark_Twain.txt | less

       It is also possible to override the default  feature  selection	method
       used  to	 learn the category model by means of regular expressions. For
       example,	the following duplicates the default feature selection	method
       in the C	locale,	while being much slower:

       % dbacl -l twain -g '^([[:alpha:]]+)' -g '[^[:alpha:]]([[:alpha:]]+)' \
              Mark_Twain.txt

       The category twain which	is obtained depends only on single  alphabetic
       words  in  the text file	Mark_Twain.txt (and computed digram statistics
       for prediction).	 For a second example, the following command builds  a
       smoothed Markovian (word bigram) model which depends on pairs of
       consecutive words within each line (but pairs cannot straddle a line
       break):

       % dbacl -l twain2 -g '(^|[^[:alpha:]])([[:alpha:]]+)||2'	-g

       More generally, line-based n-gram models of all orders (up to 7) can
       be built in a similar way. To construct paragraph-based models, you
       should reformat the input corpora with awk(1) or sed(1) to obtain one
       paragraph per line. Line	size is	limited	by available memory, but  note
       that regex performance will degrade quickly for long lines.
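
       One possible way to obtain one paragraph per line, assuming that
       paragraphs are separated by blank lines, uses the paragraph mode of
       awk(1). The function name paragraphs_to_lines is hypothetical.

```shell
# Hypothetical helper: join each blank-line separated paragraph into a
# single output line. Reads the named files, or stdin if none are given.
paragraphs_to_lines() {
    # RS = "" switches awk to paragraph mode (records end at blank lines)
    awk 'BEGIN { RS = "" } { gsub(/[ \t]*\n[ \t]*/, " "); print }' "$@"
}

printf 'first line\nsecond line\n\nnext paragraph\n' | paragraphs_to_lines
```

       The output can then be given to dbacl -l in place of the original
       corpus.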

       The  underlying assumption of statistical learning is that a relatively
       small number of training	documents can represent	a much larger  set  of
       input  documents.  Thus	in  the	long run, learning can grind to	a halt
       without serious impact on classification	accuracy. While	 not  true  in
       reality,	 this assumption is surprisingly accurate for problems such as
       email filtering.	 In practice, this means that a	well chosen corpus  on
       the  order  of ten thousand documents is	sufficient for highly accurate
       results for years.  Continual  learning	after  such  a	critical  mass
       results in diminishing returns.	Of course, when	real world input docu-
       ment patterns change dramatically, the predictive power of  the	models
       can  be	lost.  At  the other end, a few	hundred	documents already give
       acceptable results in most cases.

       dbacl is	heavily	optimized for the case of frequent classifications but
       infrequent  batch  learning.  This  is  the  long run optimum described
       above. Under ideal conditions, dbacl can	classify a hundred emails  per
       second on low end hardware (500 MHz Pentium III). Learning speed is not
       very much slower, but takes effectively much longer for large  document
       collections for various reasons.	 When using the	-m switch, data	struc-
       tures are aggressively mapped into memory if possible,  reducing	 over-
       heads for both I/O and memory allocations.

       dbacl  throws  away its input as	soon as	possible, and has no limits on
       the input document size.	Both classification  and  learning  speed  are
       directly	 proportional to the number of tokens in the input, but	learn-
       ing also	needs a	nonlinear optimization step which takes	 time  propor-
       tional  to the number of	unique tokens discovered.  At time of writing,
       dbacl is	one of the fastest open	source mail filters given its  optimal
       usage scenario, but uses	more memory for	learning than other filters.

       When  saving category files, dbacl first	writes out a temporary file in
       the same	location, and renames it afterwards. If	 a  problem  or	 crash
       occurs  during  learning,  the  old  category  file  is	therefore left
       untouched. This ensures that categories can never be corrupted, no mat-
       ter  how	 many  processes  try to simultaneously	learn or classify, and
       means that valid categories are available for classification at any
       time.
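
       The write-then-rename idea can be illustrated in shell. This is a
       sketch of the general technique, not of dbacl's own code; atomic_write
       is a hypothetical name and the temporary file name is an assumption.

```shell
# Hypothetical helper: replace the file $1 with the data read from stdin.
# The data goes to a temporary file in the same directory first, so a
# reader sees either the complete old file or the complete new one.
atomic_write() {
    tmp="$1.tmp.$$"
    if cat > "$tmp"; then
        mv -f "$tmp" "$1"      # rename within one directory is atomic
    else
        rm -f "$tmp"           # writing failed: the old file is untouched
        return 1
    fi
}

# Usage sketch (some_command is a placeholder):
#   some_command | atomic_write category_file
```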

       When  using  the	 -m switch, file contents are memory mapped for	speedy
       reading and writing. This, together with	the  -o	 switch,  is  intended
       mainly for testing purposes, when tens of thousands of messages must be
       learned and scored in a laboratory to measure dbacl's accuracy. Because
       no  file	 locking is attempted for performance reasons, corruptions are
       possible, unless	you make sure that only	one  dbacl  process  reads  or
       writes  any  file  at  any given	time. This is the only case (-m	and -o
       together) when corruption is possible.

       When classifying	a document, dbacl loads	all indicated categories  into
       RAM,  so	 the total memory needed is approximately the sum of the cate-
       gory file sizes plus a fixed small overhead.   The  input  document  is
       consumed	 while	being  read, so	its size doesn't matter, but very long
       lines can take up space.	 When using the	-m switch, the categories  are
       read using mmap(2) as available.

       When  learning,	dbacl keeps a large structure in memory	which contains
       many objects which won't	be saved into the output category. The size of
       this structure is proportional to the number of unique tokens read, but
       not the size of the input documents, since  they	 are  discarded	 while
       being  read.  As	a rough	guide, this structure is 4x-5x the size	of the
       final category file that	is produced.

       To prevent unchecked memory growth, dbacl allocates by default a	 fixed
       smallish amount of memory for tokens. When this space is used up,
       further tokens are discarded, which skews the learned category and
       makes it less usable as more tokens are dropped. A warning is printed
       on STDERR in such a case.

       The -h switch lets you fix the initial size of the token space in
       powers of 2, i.e. "-h 17" means 2^17 = 131072 possible tokens. If you type
       "dbacl -V", you can see the number of bytes needed for each token  when
       either  learning	 or  classifying.  Multiply this number	by the maximum
       number of possible tokens to estimate the memory	needed	for  learning.
       The  -H	switch	lets  dbacl  grow its tables automatically if and when
       needed, up to a maximum specified. So if	you type  "-H  21",  then  the
       initial	size  will  be doubled repeatedly if necessary,	up to approxi-
       mately two million unique tokens.
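
       The arithmetic can be sketched as follows. token_memory_bytes is a
       hypothetical helper, and the 20 bytes per token used in the example is
       a made-up figure; substitute the number reported by "dbacl -V".

```shell
# Hypothetical helper: estimate the token memory needed for learning.
#   $1 = exponent given to -h or -H
#   $2 = bytes per token, as reported by "dbacl -V"
token_memory_bytes() {
    echo $(( (1 << $1) * $2 ))
}

token_memory_bytes 17 20    # 131072 tokens * 20 bytes, prints 2621440
```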

       When learning with the -X switch, a handful of input documents are also
       kept in RAM throughout.

       DBACL_PATH
              When this variable is set, its value is prepended to every
              category filename which doesn't start with a '/' or a '.'.

       INT    If this signal is	caught,	dbacl simply exits without  doing  any
	      cleanup  or  other  operations. This signal can often be sent by
	      pressing Ctrl-C on the keyboard. See stty(1).

       HUP, QUIT, TERM
	      If one of	these signals is caught, dbacl stops reading input and
	      continues	 its operation as if no	more input was available. This
	      is a way of quitting gracefully, but note	that in	learning mode,
	      a	 category  file	will be	written	based on the incomplete	input.
              The QUIT signal can often be sent by pressing Ctrl-\ on the
              keyboard. See stty(1).

       USR1   If  this	signal is caught, dbacl	reloads	the current categories
	      at the earliest feasible opportunity. This is not	normally  use-
	      ful  at  all,  but  might	be in special cases, such as if	the -f
	      switch is	invoked	together with input from a long	running	 pipe.

       dbacl generated category	files are in binary format, and	may or may not
       be portable to systems using a different	byte order architecture	 (this
       depends	on  how	 dbacl was compiled). The -V switch prints out whether
       categories are portable,	or else	you can	just experiment.

       dbacl does not recognize	functionally equivalent	 regular  expressions,
       and in this case	duplicate features will	be counted several times.

       With  every  learned  category, the command line	options	that were used
       are saved.  When	classifying, make sure that  every  relevant  category
       was  learned  with the same set of options (regexes are allowed to dif-
       fer), otherwise behaviour is undefined. There is	no need	to repeat  all
       the switches when classifying.

       If you get many digitization warnings, then you are trying to learn too
       much data at once, or your model	is too complex.	 dbacl is compiled  to
       save  memory by digitizing final	weights, but you can disable digitiza-
       tion by editing dbacl.h and recompiling.

       dbacl offers several built-in tokenizers	(see -e	switch)	with  more  to
       come in future versions,	as the author invents them.  While the default
       tokenizer may evolve, no	tokenizer should ever be removed, so that  you
       can  always  simulate previous dbacl behaviour subject to bug fixes and
       architectural changes.

       The confidence estimates obtained through the -X switch are
       underestimates, i.e. they are more conservative than they should be.

       "Ya know, some day scientists are gonna invent something	that will out-
       smart a rabbit."	(Robot Rabbit, 1953)

       The source code for the latest version of this program is available  at
       the following locations:

       Laird A. Breyer

       awk(1),	bayesol(1),  crontab(1),  hmine(1),  hypex(1),	less(1), mail-
       cross(1),  mailfoot(1),	mailinspect(1),	  mailtoe(1),	procmailex(5),
       regex(7), stty(1), sed(1)

Version	1.14.1	      Bayesian Text Classification Tools	      DBACL(1)

