ANNOYANCE-FILTER(1)	    General Commands Manual	   ANNOYANCE-FILTER(1)

       annoyance-filter	- automatically	detect junk mail

       annoyance-filter	[ options ]

       annoyance-filter	 uses Bayesian statistics to determine the probability
       an E-mail message is junk based on an analysis of its contents compared
       to collections of known junk and	legitimate E-mail.

       The current version of this program is always posted at:
       Please  visit  this page	for news about the program and to download the
       latest version.

       The project is hosted on	SourceForge,  where  you  will	find  the  CVS
       source code repository and release archives:

       annoyance-filter	 has a multitude of options which permit it to be used
       in many different  ways,	 but  the  most	 common	 application  involves
       training	 the  program  with collections	of legitimate and junk mail in
       order to	create a dictionary which indicates the	probability that words
       identify	 a message as junk or non-junk (legitimate).  Training must be
       done before the program is used to classify incoming mail, but need  be
       done   subsequently   only   when   adding  messages  to	 the  training
       collections.  As	long as	the overall content  of	 the  mail,  junk  and
       legitimate,  which you receive remains pretty much the same, there's no
       need to retrain,	but the	 ability  to  do  so  allows  the  program  to
       automatically  adapt to evolving	message	content, which is particularly
       characteristic of junk mail.

       Suppose you have	a collection of	legitimate mail	(in other words,  mail
       you  wish to read) in a file named m-good and a collection of junk mail
       (that which you don't wish to read) in file m-junk.  These  collections
       may  be in ``Unix mail folder'' format, which is	simply the text	of one
       or more E-mail messages concatenated together in	a single text file, or
       may  be the names of directories	containing files, each of which	may be
       a single	E-mail message or a Unix mail folder.  In either  case,	 if  a
       message	file  is  compressed  with  gzip,  it  will  be	 automatically
       uncompressed on the fly.	 Directories of	 messages  may	not,  however,
       contain other directories of messages.

       To   train   annoyance-filter  with  these  collections	and  create  a
       dictionary, use a command like:

	annoyance-filter --mail	m-good --junk m-junk --prune --write dict.bin

       where dict.bin is the name of the dictionary file you wish to create.

       Now that	the dictionary has been	created, you can use it	on  subsequent
       runs  to	 compute  the  probability  a  message is junk and classify it
       accordingly.  Suppose you have an E-mail	message	in the file  mail.txt.
       To compute its junk probability and display it on standard output, use
       the command:

		  annoyance-filter --read dict.bin --test mail.txt

       To integrate annoyance-filter into a mail  processing  system  such  as
       procmail,  you'll  usually  want	 to  run  it  as  a filter which reads
       incoming	 messages  from	 standard  input  (piped  there	 by  the  mail
       processing system), classifies them and adds annotations	to the message
       header indicating the classification,  then  writes  the	 message  with
       header  annotations to standard output.	The mail processing system may
       then examine the	header annotations and route the message  accordingly.
       To  filter  a  message,	again  assuming	 the dictionary	created	by the
       training	run is in the file dict.bin, use the command:

	      annoyance-filter --read dict.bin --transcript - --test -

       Here the	--transcript option is used to request the  input  message  be
       copied  to  an  output file, in this case standard output, specified by
       ``-'', with the message read from standard input, the ``-'' argument to
       the --test option.

       Options	are  specified	on  the	 command line.	Options	are treated as
       commands--most instruct the program to perform  some  specific  action;
       consequently,  the  order  in  which they are specified is significant;
       they are	processed left to right. Long options  beginning  with	``--''
       may  be	abbreviated  to	 any unambiguous prefix; single-letter options
       introduced by a single ``-'' without arguments may be aggregated.

       --annotate options
		 Add the annotations requested by the characters in options to
		 the  transcript  generated by the --transcript	option.	 Upper
		 and lower case	options	are  treated  identically.   Available
		 annotations are:
			     d	      Decoder diagnostics
			     p	      Parser warnings and error	messages
			     w	      Most significant words and their probabilities

       --autoprune n
	      As the dictionary is being built by appending mail to it with
	      the  --mail  and --junk options, unique words will automatically
	      be pruned	from it	whenever the dictionary	exceeds	 approximately
	      n	  bytes.   This	 is  particularly  handy  when	loading	 large
	      collections of messages with --phrasemax set greater  than  one,
	      as  a  very  large  number  of  unique  phrases  may clutter the
	      dictionary being built and exceed	the memory  capacity  of  your
	      computer.	  You  could  split  the mail collection into multiple
	      parts and	explicitly --prune after each part, but	--autoprune is
	      much more	convenient.

       --biasmail n
	      The  frequency of	words appearing	in legitimate mail is inflated
	      by the floating point factor  n,	which  defaults	 to  2.	  This
	      biases  the  classification  of  messages	 in  favour of ``false
	      negatives''--junk	mail deemed  legitimate,  while	 reducing  the
	      probability  of ``false positives'' (legitimate mail erroneously
	      classified as junk, which	is bad).  The higher  the  setting  of
	      --biasmail,  the	greater	 the bias in favour of false negatives
	      will be.
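	      The formula is not given here, so the following is only a
	      sketch of how such a bias can work. It assumes a Graham-style
	      estimate in which each word count is divided by the number of
	      messages in its collection before the two are compared. The
	      function name word_junk_prob and its argument order are
	      invented for illustration and are not part of annoyance-filter.

```shell
# word_junk_prob JUNK_HITS MAIL_HITS JUNK_MSGS MAIL_MSGS BIAS
# Hypothetical helper (an assumption, not annoyance-filter's own code):
# estimates the probability that a word indicates junk, inflating the
# word's frequency in legitimate mail by BIAS, the role --biasmail plays.
word_junk_prob() {
    LC_ALL=C awk -v jh="$1" -v mh="$2" -v jm="$3" -v mm="$4" -v bias="$5" 'BEGIN {
        junk_freq = jh / jm           # occurrences per junk message
        mail_freq = bias * mh / mm    # biased occurrences per legitimate message
        printf "%.4f\n", junk_freq / (junk_freq + mail_freq)
    }'
}

word_junk_prob 10 10 100 100 1    # no bias
word_junk_prob 10 10 100 100 2    # the default bias of 2
```

	      With equal counts in both collections, raising the bias from
	      1 to 2 pulls the estimate below one half, which is the shift
	      toward false negatives described above.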

       --binword n
	      Binary  character	  streams   (for   example,   attachments   of
	      application-specific  files,  including  the  executable code of
	      worm and virus attachments) are scanned and contiguous sequences
	      of  alphanumeric	ASCII  characters  n  characters or longer are
	      added to the list	of words in  the  message.   The  dollar  sign
	      (``$'')  is  considered  an  alphanumeric	 character  for	 these
	      purposes,	and words may have embedded hyphens  and  apostrophes,
	      but may not begin	or end with those characters.  If --binword is
	      set  to  zero,  scanning	of  binary  attachments	 is   disabled
	      entirely.	 The default setting is	5 characters.

       --bsdfolder
	      The next --mail or --junk folder will be parsed using ``classic
	      BSD'' rules for identifying the start of individual messages  in
	      the  folder.   In	 BSD-style  folders, the text ``From ''	as the
	      leftmost characters of a line always denotes the start of	a  new
	      message:	any  appearance	 of  this text in any other context is
	      always quoted, often by prefixing	a  ``>''  character.   In  the
	      default  Unix folder syntax, ``From '' only marks	the start of a
	      new message if it	appears	following one  or  more	 blank	lines.
	      Note  that you must specify --bsdfolder before each folder to be
	      read with	BSD rules; it is not a modal setting.
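	      The difference between the two rules can be seen by counting
	      message starts both ways. This is a rough sketch of the rules
	      as described above, not the program's actual parser; the
	      function names count_bsd and count_unix are invented, and the
	      top of the file is assumed to count as following a blank line.

```shell
# Both helpers read a mail folder on standard input and print the
# number of message starts they find. They are illustrative only.

# BSD rule: every line beginning with "From " starts a new message.
count_bsd() {
    awk '/^From / { n++ } END { print n + 0 }'
}

# Default Unix rule: "From " starts a message only when the previous
# line was blank (or when it is the first line of the folder).
count_unix() {
    awk 'BEGIN { blank = 1 }
         /^From / && blank { n++ }
         { blank = ($0 == "") }
         END { print n + 0 }'
}
```

	      A folder whose body text contains an unquoted line starting
	      with ``From '' directly after other text yields a larger count
	      under the BSD rule than under the Unix rule.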

       --classify fname
	      Classify mail in fname. If its junk probability equals or
	      exceeds the junk threshold (see --threshjunk), ``JUNK'' is
	      written to standard
	      output and the program exits with	status code 3. If the  message
	      scores   less   than   or	 equal	to  the	 mail  threshold  (see
	      --threshmail), ``MAIL'' is written to standard  output  and  the
	      program  exits  with  status  0.	 If  the message's score falls
	      between the two thresholds, its content is deemed	indeterminate;
	      ``INDT''	is  written  to	 standard output and the program exits
	      with a  status  of  4.   The  output  can	 be  used  to  set  an
	      environment  variable  in	Procmail to control the	disposition of
	      the message.  If	fname  is  ``-''  the  message	is  read  from
	      standard input.
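	      A delivery script can branch on the exit status instead of
	      parsing the output. The sketch below maps the status codes
	      listed above to folder names; the function name
	      route_for_status and the folder names are placeholders chosen
	      for illustration.

```shell
# route_for_status STATUS
# Hypothetical helper: translates the exit status of a --classify run
# into a destination folder name. Any other status (1 for an error,
# 2 for a command line syntax error) is reported as "error".
route_for_status() {
    case "$1" in
        0) echo "inbox" ;;     # legitimate mail
        3) echo "junk" ;;      # junk mail
        4) echo "review" ;;    # indeterminate
        *) echo "error"; return 1 ;;
    esac
}
```

	      It might be used after a command such as ``annoyance-filter
	      --read dict.bin --classify mail.txt'' by passing $? as the
	      argument.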

	      Clear  appearances  of  words  in	junk mail from database.  Used
	      when preparing a database	of legitimate mail.

	      Clear appearances	of words in  legitimate	 mail  from  database.
	      Used when	preparing a database of	junk mail.

	      Print copyright information.

       --csvread fname
	      Import  a	 dictionary  from  a  comma-separated value (CSV) file
	      fname.  Records are assumed to  be  in  the  format  written  by
	      --csvwrite  but  need  not  be  sorted  in any particular	order.
	      Words are	added to those already in memory.

       --csvwrite fname
	      Export the dictionary to the comma-separated value (CSV) file
	      fname. Such files can be loaded into spreadsheet or database
	      programs for further processing. Words are sorted first in
	      ascending order of the probability that they denote junk mail,
	      then lexically.

       --fread,	-r fname
	      Load a fast dictionary (previously  created  with	 the  --fwrite
	      option) from file	fname.

       --fwrite	fname
	      Write  a dictionary to the file fname in fast dictionary format.
	      Fast dictionaries	are written in a binary	format	which  is  not
	      portable	across	machines with different	byte order conventions
	      and  cannot  be  added  incrementally  to	 assemble   a	larger
	      dictionary,  but	can  be	loaded in a small fraction of the time
	      required by the format created by	the --write command.  Using  a
	      fast  dictionary	for  routine  classification  of incoming mail
	      drastically reduces the time consumed in loading the  dictionary
	      for each message.

       --help, -u
	      Print how-to-call	information including a	list of	options.

       --junk, -j fname
	      Add  the	mail  in  folder fname to the dictionary as junk mail.
	      These folders may	be compressed by a utility the host system can
	      uncompress;   specify  the  complete  file  name	including  the
	      extension	denoting its form of compression.  If fname  is	 ``-''
	      the mail folder is read from standard input.

       --list List the dictionary on standard output.

       --mail, -m fname
	      Add  the	mail  in  folder fname to the dictionary as legitimate
	      mail.  These folders may be compressed by	 a  utility  the  host
	      system  can uncompress; specify the complete file	name including
	      the extension denoting its form of  compression.	 If  fname  is
	      ``-'' the	mail folder is read from standard input.

       --newword n
	      The  probability	that a word seen in mail which does not	appear
	      in the dictionary	(or appeared too few  times  to	 assign	 it  a
	      probability with acceptable confidence) is indicative of junk is
	      set to n.	 The default is	0.2--the odds are that novel words are
	      more likely to appear in legitimate mail than in junk.

       --pdiag fname
	      Write  a	diagnostic  file to the	specified fname	containing the
	      actual lines the parser processed	(after decoding	of MIME	 parts
	      and exclusion of data deemed unparseable).  Use this option when
	      you suspect problems in decoding or pre-parser filtering.

       --phraselimit n
	      Limit  the  length  of  phrases  assembled  according   to   the
	      --phrasemin  and	--phrasemax  options  to  n  characters.  This
	      permits ignoring ``phrases'' consisting of gibberish  from  mail
	      headers  and un-decoded content.	In most	cases these items will
	      be discarded by a	--prune	in any case, but skipping them as they
	      are  generated  keeps  the dictionary from bloating in the first
	      place.  The default value	is 48 characters.

       --phrasemin n
	      Calculate probabilities of phrases consisting of a minimum of n
	      words. The default of 1 calculates probabilities for single
	      words.

       --phrasemax n
	      Calculate	probabilities of phrases consisting of a maximum of  n
	      words.   The  default  of	 1 calculates probabilities for	single
	      words.  If you set this too large, the dictionary	may grow to an
	      absurd size.

       --plot fname
	      After loading the dictionary, create a plot in fname.png of the
	      histogram of words, binned by their probability of appearance in
	      junk mail. In order to generate the histogram, the Gnuplot and
	      Netpbm utilities must be installed on the system; if they are
	      absent, the --plot option will not be available.

       --pop3port n
	      The  POP3	 proxy	server	activated by a subsequent --pop3server
	      option will listen for connections on port n.  If	no  --pop3port
	      is  specified,  the  server  will	 listen	on the default port of
	      9110.  On	most systems, you'll have to run the program  as  root
	      if  you  wish the	proxy server to	listen on a port numbered 1023
	      or less.

       --pop3server server[:port]
	      Activate a POP3 proxy server which relays requests made on the
	      previously specified --pop3port, or the default of 9110 if no
	      port is specified, to the specified server, which may be given
	      either as an IP address in ``dotted quad'' notation or as a
	      fully-qualified domain name like pop.someisp.tld. The port on
	      which the server listens for POP3 connections may be specified
	      after the server, prefixed by a colon (``:''); if no port is
	      specified, the IANA assigned POP3 port 110 will be used. The
	      POP3 proxy server will pass each
	      message received on behalf of a requestor	through	the classifier
	      and return the annotated transcript to the  requestor,  who  may
	      then  filter  it	based  on  the	classification appended	to the
	      message header. You must load a dictionary before	activating the
	      POP3  proxy server, and the --pop3server option must be the last
	      on the command line.  The	server continues to  run  and  service
	      requests until manually terminated.
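	      The server[:port] argument form can be checked in a wrapper
	      script before the proxy is started. This is a minimal sketch
	      of that parsing, assuming a host name or dotted-quad address;
	      the function name split_server_port is invented, and IPv6
	      literals are not handled.

```shell
# split_server_port SERVER[:PORT]
# Hypothetical helper: prints "server port", falling back to the
# IANA assigned POP3 port 110 when no port is given.
split_server_port() {
    case "$1" in
        *:*) echo "${1%%:*} ${1##*:}" ;;   # text before and after the colon
        *)   echo "$1 110" ;;
    esac
}

split_server_port pop.someisp.tld
split_server_port pop.someisp.tld:995
```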

	      Write a trace of POP3 proxy server operations to standard	error.
	      Each trace message (apart	from the dump of the  body  of	multi-
	      line replies to clients) is prefixed with	the label ``POP3: ''.

       --prune
	      After loading the dictionary from --mail and --junk folders,
	      this option discards words which appear so infrequently that
	      their probability cannot be reliably estimated. One usually
	      prunes the dictionary with --prune before using --write to
	      save it for subsequent runs.

	      Include a	token-by-token trace in	the --pdiag output file.  This
	      helps when  adjusting  the  parser's  criteria  for  recognising
	      tokens.	Setting	 this option without also specifying a --pdiag
	      file will	have no	effect other than  perhaps  to	exercise  your
	      fingers typing it	on the command line.

       --read, -r fname
	      Load  a  dictionary (previously created with the --write option)
	      from file	fname.

       --sigwords n
	      The probability that a message is	junk will be computed based on
	      the  individual  probabilities  of  the  n  words	 with extremal
	      probabilities; that is, probabilities most indicative of junk or
	      mail.  The default is 15,	but there's no obvious optimal setting
	      for this parameter; it depends in	part on	the average length  of
	      messages you receive.
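	      How the n extremal probabilities are combined is not spelled
	      out here. As an illustration only, the sketch below assumes
	      the usual naive Bayes combination, the product of the
	      probabilities divided by that product plus the product of
	      their complements, applied to the n values farthest from 0.5.
	      The function name combine_extremal is invented.

```shell
# combine_extremal N
# Hypothetical helper (an assumed formula, not taken from the program):
# reads word probabilities, one per line and each strictly between 0
# and 1, keeps the N farthest from 0.5, and combines them into a
# single junk probability.
combine_extremal() {
    LC_ALL=C awk '{ d = $1 - 0.5; if (d < 0) d = -d; printf "%.6f %s\n", d, $1 }' |
        LC_ALL=C sort -rn |
        head -n "$1" |
        LC_ALL=C awk 'BEGIN { junk = 1; mail = 1 }
             { junk *= $2; mail *= 1 - $2 }
             END { printf "%.4f\n", junk / (junk + mail) }'
}

printf '0.99\n0.5\n0.01\n0.9\n' | combine_extremal 3
```

	      In this example the neutral value 0.5 is never selected, and
	      the two strongest words cancel each other, leaving the third
	      to decide the result.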

       --sloppyheaders
	      To evade filtering programs, some junk mail is sent with MIME
	      part headers which violate the  standard	but  which  most  mail
	      clients  accept  anyway.	This option causes such	messages to be
	      parsed as	a browser would, at the	cost of	standards  compliance.
	      If  --sloppyheaders  is  used,  it should	be specified both when
	      building the dictionary and when testing messages.

	      After loading the	dictionary from	 --mail	 and  --junk  folders,
	      print  statistics	 of  the distribution of junk probabilities of
	      words in the dictionary.	The statistics are written to standard

       --test, -t fname
	      Test  mail  in  fname  and write the estimated probability it is
	      junk to standard output unless the --transcript option  is  also
	      specified	 with  standard	 output	(``-'')	as the destination, in
	      which case the inclusion of the probability  and	classification
	      in  the  transcript  is  adjudged	 sufficient.  If the --verbose
	      option is	specified, the individual probabilities	of the	``most
	      interesting''  words  in	the  message  will also	be output.  If
	      fname is ``-'' the message is read from standard input.

       --threshjunk n
	      Set the threshold	for classifying	 a  message  as	 junk  to  the
	      floating	point  probability  value n.  The default threshold is
	      0.9; messages scored above --threshjunk are deemed junk.

       --threshmail n
	      Set the threshold	for classifying	a message as  legitimate  mail
	      to   the	floating  point	 probability  value  n.	  The  default
	      threshold	is 0.9,	with messages scored below --threshmail	deemed
	      legitimate.    Note  that	 you  may  leave  a  gap  between  the
	      --threshmail and --threshjunk values (although it	makes no sense
	      to  set  --threshmail  higher).	Mail  scored  between  the two
	      thresholds will then be judged of	uncertain status.
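	      The interaction of the two thresholds can be sketched as
	      follows. The comparison order (junk first, using ``equals or
	      exceeds'', then mail, using ``less than or equal'') follows
	      the description of --classify; the function name
	      classify_score and its argument order are invented for
	      illustration.

```shell
# classify_score SCORE [MAIL_THRESHOLD [JUNK_THRESHOLD]]
# Hypothetical helper: prints JUNK, MAIL, or INDT for a junk
# probability, with both thresholds defaulting to 0.9.
classify_score() {
    LC_ALL=C awk -v p="$1" -v tm="${2:-0.9}" -v tj="${3:-0.9}" 'BEGIN {
        if (p + 0 >= tj + 0)      print "JUNK"
        else if (p + 0 <= tm + 0) print "MAIL"
        else                      print "INDT"
    }'
}

classify_score 0.95 0.1 0.9
classify_score 0.50 0.1 0.9
classify_score 0.05 0.1 0.9
```

	      With the default thresholds there is no gap, so no score is
	      reported as indeterminate.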

       --transcript fname
	      Write an annotated transcript of the  original  message  to  the
	      specified	 fname.	  If fname is ``-'', the transcript is written
	      to standard output. At the end of the message header, the
	      program adds an X-Annoyance-Filter-Junk-Probability header item
	      giving the computed probability and an
	      X-Annoyance-Filter-Classification item giving the
	      classification of the message according to the --threshmail
	      and --threshjunk settings; the classification is given as
	      ``Mail'', ``Junk'', or ``Indeterminate''.
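	      A script that receives the transcript can read the
	      classification back out of the header. The sketch below
	      assumes the item is written on a single line in the usual
	      ``Name: value'' form; the function name header_classification
	      is invented for illustration.

```shell
# header_classification
# Hypothetical helper: reads an annotated transcript on standard input
# and prints the value of the X-Annoyance-Filter-Classification header
# item. It stops at the first blank line, so text in the message body
# that imitates the header is never consulted.
header_classification() {
    awk '/^$/ { exit }
         tolower($1) == "x-annoyance-filter-classification:" { print $2; exit }'
}
```

	      For example, the output of ``annoyance-filter --read dict.bin
	      --transcript - --test -'' could be piped through this function
	      to obtain ``Mail'', ``Junk'', or ``Indeterminate''.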

       --verbose, -v
	      Print diagnostic information as the program performs various
	      operations.

	      Print program version information.

       --write fname
	      Write a dictionary to the	file fname.  The dictionary is written
	      in  a  binary format which may be	loaded on subsequent runs with
	      the --read option.  Binary dictionary files are  portable	 among
	      machines with different architectures and	byte order.

       The  program  exits  with a status of 0 when processing is successfully
       completed, 1 when an error (I/O or file access in most  cases)  occurs,
       and  2  to  indicate  a	command	 line syntax error.  If	the --classify
       option is specified, an exit status of 0	identifies the message	tested
       as  legitimate  mail, 3 marks it	as junk, and a status of 4 is returned
       for messages which cannot be confidently classified as either mail or
       junk.

       Files  are read or written as requested by options on the command line;
       all options which read or write files take a fname argument which gives
       the   file   name.    The   --classify,	--junk,	 --mail,  --test,  and
       --transcript  options  interpret	 an  argument  of  ``-''  as  denoting
       standard	input or output.

       On systems which	provide	the required services and utilities, arguments
       to the --junk and --mail	options	may be compressed files	or the name of
       a  directory  containing	 one or	more messages which will be read as if
       logically concatenated.  Messages in the directory may be compressed or
       uncompressed.

       Error  messages	and  diagnostic	 output	 generated  when the --verbose
       option is specified are written to standard error.

       Millions, doubtless.  This is a program which must cope	with  whatever
       garbage	is fed to it from mail folders,	trying to make the best	of it.
       When it messes up, your efforts in identifying the message which	caused
       the  problem  and submitting a verbatim copy of it with your bug	report
       are much	appreciated.

       Please report bugs to and include annoyance-filter in
       the Subject line.  Thanks in advance.

				     John Walker

       This software is	in the public domain. Permission to use, copy, modify,
       and distribute this software and	its documentation for any purpose  and
       without	fee is hereby granted, without any conditions or restrictions.
       This software is provided ``as is'' without express or implied
       warranty.

       gnuplot(1), gs(1), gzip(1), netpbm(1), procmail(1), xpdf(1)

       annoyance-filter is written using the Literate Programming
       methodology; the user manual,
       program,	 and  internal	documentation  are developed together, closely
       interlinked.  Whenever the program is modified,	the  documentation  is
       automatically updated, reducing the risk	of divergence between what the
       manual says and what the	program	does.

       This man	page is	intended as a reference	for the	command	 line  options
       and  most  common  applications	of  the	 program.   For	 comprehensive
       documentation, including	details	of how to  integrate  annoyance-filter
       with  the procmail mail processing system, please refer to the complete
       documentation published in PDF format, available	on the Web at:

       If you have downloaded the annoyance-filter  source  distribution,  the
       corresponding  version  of  annoyance-filter.pdf	 is  included  in  the
       archive.  You can read PDF files with Acrobat Reader (a free download)
       or the xpdf or Ghostscript (gs) utilities.

4th Berkeley Distribution	  4 AUG	2004		   ANNOYANCE-FILTER(1)

