FreeBSD Manual Pages


MAILTOE(1)							    MAILTOE(1)

NAME
       mailtoe - a train-on-error simulator for use with dbacl.

SYNOPSIS
       mailtoe command [ command_arguments ]

DESCRIPTION
       mailtoe automates the task of testing email filtering and
       classification programs such as dbacl(1). Given a set of categorized
       documents, mailtoe initiates test runs to estimate the classification
       errors and thereby permit fine tuning of the parameters of the
       classifier.

       Train-on-error (TOE) is a learning method which is sometimes  advocated
       for  email classifiers. Given an	incoming email stream, the method con-
       sists in	reusing	a fixed	set of category	databases until	the first mis-
       classification  occurs.	At  that point,	the offending email is used to
       relearn the relevant category, until  the  next	misclassification.  In
       this  way, categories are only updated when errors occur. This directly
       models the way that some	email classifiers are used in practice.

       TOE's error rates depend directly on the order in which emails are
       seen. A small change in ordering, as might happen due to networking
       delays, can have a large impact on the number of misclassifications.
       Consequently, mailtoe does not give meaningful results unless the
       sample emails are chosen carefully. However, as this method is
       commonly used by spam filters, it is still worth computing to foster
       comparisons. Other methods (see mailcross(1), mailfoot(1)) attempt to
       capture the behaviour of classification errors in other ways.

       To improve and stabilize the error rate calculation, mailtoe performs
       the TOE simulations several times on slightly reordered email streams,
       and averages the results. The reorderings occur by multiplexing the
       emails from each category mailbox in random order. Thus if there are
       three categories, the first email classified is chosen at random from
       the emails at the front of the three sample streams. The second email
       is also chosen at random among the three categories, from the front of
       the streams after the first email was removed. Simulation stops when
       all sample streams are exhausted.
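
       The multiplexing described above can be illustrated with a short
       shell sketch. It is not mailtoe's actual implementation: the function
       name multiplex and the SEED variable are made up for this example,
       each input line stands in for one email, and picking uniformly among
       the streams that are not yet exhausted is an assumption about how the
       random choice is made.

```shell
#!/bin/sh
# multiplex FILE...: interleave the lines of several files in random order
# while keeping each file's internal order (one line stands in for one email).
# SEED is a hypothetical knob that makes a run repeatable.
multiplex() {
    awk -v seed="${SEED:-1}" '
        BEGIN {
            srand(seed)
            n = ARGC - 1
            for (i = 1; i <= n; i++) { file[i] = ARGV[i]; live[i] = 1 }
            alive = n
            while (alive > 0) {
                # pick one of the non-exhausted streams uniformly at random
                k = int(rand() * alive) + 1
                for (i = 1; i <= n; i++) {
                    if (live[i] && --k == 0) break
                }
                # take the email at the front of that stream, if any is left
                if ((getline line < file[i]) > 0) {
                    print line
                } else {
                    live[i] = 0
                    alive--
                }
            }
        }' "$@"
}
```

       Calling multiplex spam.txt work.txt play.txt with a different SEED
       each time gives a different interleaving, but the lines of each file
       always appear in their original relative order.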

       mailtoe uses the environment variable MAILTOE_FILTER when executing,
       which permits the simulation of arbitrary filters, provided these
       satisfy the compatibility conditions stated in the ENVIRONMENT section
       below.

       For  convenience,  mailtoe implements a testsuite framework with	prede-
       fined wrappers for several open source classifiers.  This  permits  the
       direct  comparison  of  dbacl(1)	with competing classifiers on the same
       set of email samples. See the USAGE section below.

       During preparation, mailtoe builds a subdirectory  named	 mailtoe.d  in
       the  current  working directory.	 All needed calculations are performed
       inside this subdirectory.

       mailtoe returns 0 on success, 1 if a problem occurred.

       prepare size
	      Prepares a subdirectory named mailtoe.d in the  current  working
	      directory,  and  populates  it with empty	subdirectories for ex-
	      actly size subsets.

       add category [ FILE ]...
	      Takes a set of emails from either	FILE if	specified,  or	STDIN,
	      and  associates  them  with  category.   The  ordering of	emails
	      within FILE is preserved,	and subsequent FILEs are  appended  to
	      the  first  in each category.  This command can be repeated sev-
	      eral times, but should be	executed at least once.

       clean  Deletes the directory mailtoe.d and all its contents.

       run    Multiplexes randomly from the email streams added earlier, and
              relearns categories only when a misclassification occurs. The
              simulation is repeated size times.

       summarize
              Prints average error rates for the simulations.

       plot [ ps | logscale ]...
              Plots the number of errors over simulation time. The "ps"
              option, if present, writes the plot to a postscript file in the
              directory mailtoe.d/plots, instead of showing it on-screen. The
              "logscale" option, if present, causes the plot to use a log
              scale on both axes.

       review truecat predcat
	      Scans the	last run statistics  and  extracts  all	 the  messages
	      which  belong  to	category truecat but have been classified into
	      category predcat.	 The extracted messages	are copied to the  di-
	      rectory mailtoe.d/review for perusal.

       testsuite list
              Shows a list of available filters/wrapper scripts which can be
              selected.

       testsuite select	[ FILTER ]...
	      Prepares the filter(s) named FILTER to be	used  for  simulation.
	      The  filter  name	is the name of a wrapper script	located	in the
	      directory	/usr/local/share/dbacl/testsuite.  Each	filter	has  a
	      rigid  interface	documented  below, and the act of selecting it
	      copies it	to the mailtoe.d/filters directory. Only  filters  lo-
	      cated there are used in the simulations.

       testsuite deselect [ FILTER ]...
	      Removes the named	filter(s) from the directory mailtoe.d/filters
	      so that they are not used	in the simulation.

       testsuite run [ plots ]
	      Invokes every selected filter on the datasets added  previously,
	      and calculates misclassification rates. If the "plots" option is
	      present, each filter simulation is plotted as a postscript  file
	      in the directory mailtoe.d/plots.

       testsuite status
	      Describes	the scheduled simulations.

       testsuite summarize
              Shows the simulation results for all filters. Only makes sense
              after the testsuite run command.

USAGE
       The normal usage pattern is the following: first, separate your email
       collection into several categories (manually or otherwise). Each
       category should be associated with one or more folders, and no folder
       should contain more than one category. Next, decide how many runs to
       use, say 10. More runs give more reliable error rate estimates, but
       take more time. Now you can type

       % mailtoe prepare 10

       Next, for every category, you must add  every  folder  associated  with
       this  category. Suppose you have	three categories named spam, work, and
       play, which are associated with the mbox	 files	spam.mbox,  work.mbox,
       and play.mbox respectively. You would type

       % mailtoe add spam spam.mbox
       % mailtoe add work work.mbox
       % mailtoe add play play.mbox

       You should aim for a similar number of emails in each category, as the
       random multiplexing will be unbalanced otherwise. The ordering of the
       email messages in each *.mbox file is important, and is preserved
       during each simulation. If you repeatedly add to the same category,
       the later mailboxes will be appended to the first, preserving the
       implied ordering.

       You can now perform as many TOE simulations as desired. The multiplexed
       emails  are classified and learned one at a time, by executing the com-
       mand given in the environment variable MAILTOE_FILTER. If  not  set,  a
       default value is	used.

       % mailtoe run
       % mailtoe summarize
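
       The steps above can be collected into a small driver script. This is
       only a sketch: the names toe_workflow, run_cmd and DRY_RUN are
       hypothetical, and it assumes that each category is stored in a single
       file named <category>.mbox, as in the example above. By default it
       only prints the mailtoe commands it would run; with DRY_RUN=0 it
       executes them and stops at the first command that fails.

```shell
#!/bin/sh
# run_cmd: print the command (default) or execute it when DRY_RUN=0.
run_cmd() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# toe_workflow RUNS CATEGORY...: prepare, add one mbox per category,
# run the simulations and print the average error rates.
toe_workflow() {
    runs=$1
    shift
    run_cmd mailtoe prepare "$runs" || return 1
    for category in "$@"; do
        # assumption: the category's emails live in <category>.mbox
        run_cmd mailtoe add "$category" "$category.mbox" || return 1
    done
    run_cmd mailtoe run || return 1
    run_cmd mailtoe summarize
}
```

       For example, toe_workflow 10 spam work play prints the same sequence
       of commands as shown above, one per line.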

       The testsuite commands are designed to simplify the above steps and al-
       low comparison of a wide	range of email classifiers, including but  not
       limited	to  dbacl.  Classifiers	are supported through wrapper scripts,
       which are located in the	/usr/local/share/dbacl/testsuite directory.

       The first stage when using the testsuite	is deciding which  classifiers
       to compare.  You	can view a list	of available wrappers by typing:

       % mailtoe testsuite list

       Note  that  the	wrapper	 scripts are NOT the actual email classifiers,
       which must be installed separately by your system administrator or oth-
       erwise.	Once this is done, you can select one or more wrappers for the
       simulation by typing, for example:

       % mailtoe testsuite select dbaclA ifile

       If some of the selected classifiers cannot be found on the system, they
       are not selected. Note also that	some wrappers can have hard-coded cat-
       egory names, e.g. if the	classifier only	 supports  binary  classifica-
       tion. Heed the warning messages.

       It  remains  only  to  run the simulation. Beware, this can take	a long
       time (several hours depending on	the classifier).

       % mailtoe testsuite run
       % mailtoe testsuite summarize

       Once you	are all	done, you can delete the working files,	log files etc.
       by typing

       % mailtoe clean

       mailtoe testsuite takes care of learning and classifying your prepared
       email corpora for each selected classifier. Since classifiers have
       widely varying interfaces, this is only possible by wrapping those
       interfaces individually into a standard form which can be used by
       mailtoe.

       Each  wrapper script is a command line tool which accepts a single com-
       mand followed by	zero or	more optional arguments, in the	standard form:

       wrapper command [argument]...

       Each wrapper script also	makes use of STDIN and STDOUT in  a  well  de-
       fined way. If no	behaviour is described,	then no	output or input	should
       be used.	 The possible commands are described below:

       filter In this case, a single email is expected on STDIN, and a list of
	      category filenames is expected in	$2, $3,	etc. The script	writes
	      the category name	corresponding to the input email on STDOUT. No
	      trailing newline is required or expected.

       learn  In this case, a standard mbox stream is expected on STDIN, while
	      a	suitable category file name is expected	in $2.	No  output  is
	      written to STDOUT.

       clean  In  this	case, a	directory is expected in $2, which is examined
	      for old database information. If any old	databases  are	found,
	      they are purged or reset.	No output is written to	STDOUT.

              In this case, a single line of text is written to STDOUT,
              describing the filter's functionality. The line should be kept
              short to prevent line wrapping on a terminal.

              In this case, a directory is expected in $2. The wrapper script
              first checks for the existence of its associated classifier,
              and other prerequisites. If the check is successful, then the
              wrapper is cloned into the supplied directory. A courtesy
              notification should be given on STDOUT to express success or
              failure. It is also permissible to give longer descriptions or
              caveats.

       toe    In this case, a list of categories is expected in	$3,  $4,  etc.
	      Every possible category must be listed. Preceding	this list, the
	      true category is given in	$2.

       foot   Used by mailfoot(1).
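
       The filter, learn and toe commands listed above can be sketched as a
       shell function. This is a toy illustration, not one of the wrappers
       shipped with dbacl: the "classifier" is a made-up word-overlap count,
       the category file format (a plain list of words) is hypothetical, the
       category name is assumed to be the base name of its file, and the
       names wrapper, classify and words are invented for the example. The
       other commands are not handled. The toe branch follows the behaviour
       described for MAILTOE_FILTER: print the predicted category and, on a
       mismatch, relearn the true category.

```shell
#!/bin/sh
# Toy stand-in for a real classifier: each category file is a plain list of
# words (a hypothetical format), and an email goes to the category whose
# list shares the most distinct words with it.

words() {       # STDIN -> one lowercase word per line
    tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]' | sed '/^$/d'
}

classify() {    # classify FILE...; email on STDIN; prints best category name
    email=$(mktemp) || return 1
    words | sort -u > "$email"
    best=$1
    best_score=-1
    for c in "$@"; do
        score=0
        if [ -f "$c" ]; then
            score=$(sort -u "$c" | comm -12 - "$email" | wc -l | tr -d ' ')
        fi
        if [ "$score" -gt "$best_score" ]; then
            best=$c
            best_score=$score
        fi
    done
    rm -f "$email"
    printf '%s' "$(basename "$best")"   # no trailing newline
}

wrapper() {
    cmd=$1
    shift
    case $cmd in
        filter) # email on STDIN, category file names as arguments
            classify "$@" ;;
        learn)  # mbox stream on STDIN, category file name as argument
            words >> "$1" ;;
        toe)    # true category name first, then every category file name
            truecat=$1
            shift
            msg=$(mktemp) || return 1
            cat > "$msg"
            guess=$(classify "$@" < "$msg")
            if [ "$guess" != "$truecat" ]; then
                # train on error: relearn only the true category
                for c in "$@"; do
                    if [ "$(basename "$c")" = "$truecat" ]; then
                        words < "$msg" >> "$c"
                    fi
                done
            fi
            rm -f "$msg"
            printf '%s' "$guess" ;;
        *)
            echo "unsupported command: $cmd" >&2
            return 1 ;;
    esac
}
```

       A real wrapper would replace words and classify with calls to the
       classifier it wraps; only the command dispatch and the use of STDIN,
       STDOUT and the positional arguments are meant to carry over.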

ENVIRONMENT
       Right after loading, mailtoe reads the hidden file .mailtoerc in the
       $HOME directory, if it exists, so this would be a good place to define
       custom values for environment variables.

       MAILTOE_FILTER
              This variable contains a shell command to be executed
              repeatedly during the running stage. The command should accept
              an email message on STDIN and output a resulting category name.
              On the command line, it should also accept first the true
              category name, then a list of all possible category file names.
              If the output category does not match the true category, then
              the relevant categories are assumed to have been silently
              updated/relearned. If MAILTOE_FILTER is undefined, mailtoe uses
              a default value.

       TEMPDIR
              This directory is exported for the benefit of wrapper scripts.
              Scripts which need to create temporary files should place them
              at the location given in TEMPDIR.
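
       A wrapper script might create its scratch files along the following
       lines. This is a minimal sketch: the function name scratch_file and
       the wrapper.XXXXXX template are made up, the fallback to /tmp when
       TEMPDIR is unset is an assumption for running the script outside
       mailtoe, and mktemp(1) is widely available but not specified by POSIX.

```shell
#!/bin/sh
# scratch_file: create an empty temporary file under TEMPDIR and print its
# name; falls back to /tmp when TEMPDIR is not set (an assumption).
scratch_file() {
    mktemp "${TEMPDIR:-/tmp}/wrapper.XXXXXX"
}
```

       A caller would typically write tmp=$(scratch_file) || exit 1 and
       remove the file again when done, for example with
       trap 'rm -f "$tmp"' EXIT.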

       The subdirectory	mailtoe.d can grow quite large.	 It  contains  a  full
       copy  of	the training corpora, as well as learning files	for size times
       all the added categories, and various log files.

       While TOE simulations for dbacl(1) can be used to compare with other
       classifiers, TOE should not be used with dbacl(1) for real world
       classifications. This is because, unlike many other filters, dbacl(1)
       learns evidence weights in a nonlinear way, and does not preserve
       relative weights between tokens, even if those tokens aren't seen in
       new emails.

       Because the ordering of emails within the added mailboxes matters,  the
       estimated error rates are not well defined or even meaningful in	an ob-
       jective sense.  However,	if the sample emails represent an actual snap-
       shot  of	 a  user's  incoming  email, then the error rates are somewhat
       meaningful. The simulations can then be interpreted as alternate	reali-
       ties where a given classifier would have	intercepted the	incoming mail.

       The  source code	for the	latest version of this program is available at
       the following locations:

AUTHOR
       Laird A. Breyer

SEE ALSO
       bayesol(1), dbacl(1), mailinspect(1), mailcross(1), mailfoot(1)

Version	1.14.1	      Bayesian Text Classification Tools	    MAILTOE(1)

