index(1)		    General Commands Manual		      index(1)

NAME
       index - SWISH++ indexer

SYNOPSIS
       index [ options ] directory...  file...

DESCRIPTION
       index is the SWISH++ file indexer.  It indexes the specified files and
       the files in the specified directories; files in subdirectories of
       specified directories are also indexed by default (unless either the -r
       or --no-recurse option or the RecurseSubdirs variable is given).  A file
       is indexed either only if its filename matches one of the patterns in
       the set specified with either the -e or --pattern option or the
       IncludeFile variable (unless standard input is used; see next
       paragraph) or if it is not in the set specified with either the -E or
       --no-pattern option or the ExcludeFile variable.

       If there	is a single filename of	`-', the list of directories and files
       to index	is instead taken from standard input (one per line).  In  this
       case, filename patterns of files	to index need not be specified explic-
       itly: all files,	regardless of whether they  match  a  pattern  (unless
       they  are in the	set not	to index specified with	either the -E or --no-
       pattern option or the ExcludeFile variable), are	indexed,  i.e.,	 index
       assumes	you  know  what	you're doing when specifying filenames in this
       manner.

       In any case, care must be taken not to specify files or	subdirectories
       in  directories	that  are also specified: since	directories are	recur-
       sively indexed by default (unless either	the -r or --no-recurse	option
       or  the RecurseSubdirs variable is given), explicitly specifying	a sub-
       directory or file in a directory	that is	also specified will result  in
       those files being indexed more than once.

   Character Mapping
       Characters  in  the  ISO	 8859-1	 (Latin	1) character set are mapped to
       their closest ASCII equivalent before further examination and indexing.
       (Individual indexing modules may	also do	their own character mapping.)
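
       As a rough illustration of this kind of mapping, the following Python
       sketch strips accents by Unicode decomposition.  It is an approximation
       only: SWISH++ uses its own mapping, which is not reproduced here, and
       the function name to_ascii_approx is made up for this example.

```python
import unicodedata


def to_ascii_approx(text: str) -> str:
    """Approximate Latin 1 to ASCII mapping (illustrative, not SWISH++'s table).

    Accented letters are decomposed into a base letter plus combining marks,
    and the combining marks are then dropped.  Characters that have no such
    decomposition (for example the German sharp s) are left unchanged.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

       With this sketch, a word such as ``résumé'' would be examined and
       indexed as ``resume''.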

   Word	Determination
       Stop words, words that occur too	frequently or have no information con-
       tent, are not indexed.  (There is a default built-in set	of a few  hun-
       dred such English words.)  Additionally,	several	heuristics are used to
       determine which words should not	be indexed.

       First, a	word is	checked	to see if it looks like	an acronym.  A word is
       considered  an  acronym	only if	it starts with a capital letter	and is
       composed	exclusively of capital letters,	digits,	and  punctuation  sym-
       bols,  e.g.,  ``AT&T.''	If a word looks	like an	acronym, it is indexed
       and no further checks are done.

       Second, there are several other checks that are applied.	 A word	is not
       indexed if it:

       1.  Is shorter than Word_Min_Size letters.  (Default is 4.)

       2.  Contains fewer than Word_Min_Vowels vowels.  (Default is 1.)

       3.  Contains  more than Word_Max_Consec_Same of the same	character con-
	   secutively (not including digits).  (Default	is 2.)

       4.  Contains more than  Word_Max_Consec_Consonants  consecutive	conso-
	   nants.  (Default is 5.)

       5.  Contains more than Word_Max_Consec_Vowels consecutive vowels.  (De-
	   fault is 4.)

       6.  Contains more than Word_Max_Consec_Puncts  consecutive  punctuation
	   characters.	(Default is 1.)
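
       The acronym test and the six checks above can be sketched in Python as
       follows.  This is a reading of the description, not the actual SWISH++
       code: the function names are hypothetical, the stop-word lookup is left
       out, word length is taken to be the number of characters, and treating
       only a, e, i, o, and u as vowels is an assumption.

```python
import string

# Defaults quoted from the list above.
WORD_MIN_SIZE = 4
WORD_MIN_VOWELS = 1
WORD_MAX_CONSEC_SAME = 2
WORD_MAX_CONSEC_CONSONANTS = 5
WORD_MAX_CONSEC_VOWELS = 4
WORD_MAX_CONSEC_PUNCTS = 1

VOWELS = frozenset("aeiou")  # assumption: 'y' is counted as a consonant


def looks_like_acronym(word):
    """Starts with a capital; made up only of capitals, digits, punctuation."""
    if not word or not word[0].isupper():
        return False
    return all(c.isupper() or c.isdigit() or c in string.punctuation
               for c in word)


def longest_run(word, matches):
    """Length of the longest run of consecutive characters accepted by matches."""
    best = run = 0
    for c in word:
        run = run + 1 if matches(c) else 0
        best = max(best, run)
    return best


def longest_same_char_run(word):
    """Length of the longest run of one repeated character, ignoring digits."""
    best = run = 0
    previous = None
    for c in word:
        if c.isdigit():
            run, previous = 0, None
            continue
        run = run + 1 if c == previous else 1
        previous = c
        best = max(best, run)
    return best


def is_indexable(word):
    """Return True if the word passes the acronym test or all six checks."""
    if looks_like_acronym(word):
        return True  # acronyms are indexed without further checks
    w = word.lower()
    if len(w) < WORD_MIN_SIZE:
        return False
    if sum(c in VOWELS for c in w) < WORD_MIN_VOWELS:
        return False
    if longest_same_char_run(w) > WORD_MAX_CONSEC_SAME:
        return False
    if longest_run(w, lambda c: c.isalpha() and c not in VOWELS) \
            > WORD_MAX_CONSEC_CONSONANTS:
        return False
    if longest_run(w, lambda c: c in VOWELS) > WORD_MAX_CONSEC_VOWELS:
        return False
    if longest_run(w, lambda c: c in string.punctuation) \
            > WORD_MAX_CONSEC_PUNCTS:
        return False
    return True
```

       Under these assumptions, ``AT&T'' is accepted by the acronym test even
       though it would otherwise fail the punctuation-free reading of a word,
       while ``the'' is rejected for being too short.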

   Filters
       Via the FilterFile configuration	file variable, files matching particu-
       lar patterns can	be filtered prior to indexing.	Via the	 FilterAttach-
       ment  configuration  file variable, e-mail attachments whose MIME types
       match particular	patterns can be	filtered prior to indexing.  (See FIL-
       TERS in swish++.conf(4).)

   Incremental Indexing
       In  order  to add words from new	documents to an	existing index,	either
       the entire set of documents can be reindexed or the new documents alone
       can  be incrementally indexed.  In many cases, reindexing everything is
       sufficient since	index is really	fast.  For a very large	document  set,
       however,	this may use too many resources.

       There is a pitfall with incremental indexing, however: if any of the
       -f, --word-files, -p, or --word-percent options or the WordFilesMax or
       WordPercentMax variables are used, then words that are too frequent are
       discarded.  If new documents containing very few of those words are
       added, then those words might no longer be too frequent.  However,
       there is no way to get them back since they were discarded.
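
       The arithmetic behind this pitfall can be sketched in Python.  The
       function below is hypothetical, and the exact comparisons index uses
       are assumptions: the percentage test is written with >= only because
       the advice to specify 101 suggests that a word occurring in every file
       is discarded at the default of 100.

```python
def is_too_frequent(files_with_word, total_files,
                    word_files_max=None, word_percent_max=100):
    """Illustrative test for a word being 'too frequent' (assumed semantics).

    word_files_max stands in for -f / WordFilesMax (None means no limit) and
    word_percent_max for -p / WordPercentMax.
    """
    if total_files <= 0:
        return False
    if word_files_max is not None and files_with_word > word_files_max:
        return True
    # Integer form of: files_with_word / total_files * 100 >= word_percent_max
    return files_with_word * 100 >= word_percent_max * total_files
```

       For example, a word occurring in 9 of 10 files is too frequent at a
       limit of 80 percent and is discarded.  After 10 more files without the
       word are added, it would occur in only 9 of 20 files (45 percent), but
       by then it is no longer in the index.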

       The way around this problem is not to discard any words by specifying
       101 for the -p or --word-percent option or the WordPercentMax variable.
       However, because no words are discarded, the size of the index file
       will be larger, perhaps significantly so.

       In practice, the loss of such words may not matter much, especially if
       the new documents are very similar to the old ones, so that words that
       were too frequent in the old set would also be too frequent in the new
       set.

       Another way around this problem is to do	periodic full indexing.
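
       A script that drives incremental indexing might assemble the command
       line as in the following Python sketch.  Only options documented on
       this page are used; the helper names and the example paths are
       hypothetical.

```python
def incremental_index_argv(index_file, paths, patterns=(), keep_all_words=True):
    """Build an argument list for an incremental run of index(1).

    index_file     -- the existing index (--index-file)
    paths          -- the new files or directories to add
    patterns       -- module:pattern strings for --pattern, e.g. "html:*.*htm*"
    keep_all_words -- pass --word-percent=101 so that no words are discarded
    """
    argv = ["index", "--incremental", f"--index-file={index_file}"]
    if keep_all_words:
        argv.append("--word-percent=101")
    argv.extend(f"--pattern={pattern}" for pattern in patterns)
    argv.extend(paths)
    return argv


def new_index_path(index_file):
    """Path of the index written by --incremental: the old path plus '.new'."""
    return index_file + ".new"
```

       The resulting list could be handed to subprocess.run(); the existing
       index is left untouched, so a caller that wants the new index to take
       over would still have to move the ``.new'' file into place itself.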

INDEXING MODULES
       index  is  written  in a	modular	fashion	where different	types of files
       have different indexing modules.	 Currently, there are 7	modules:  Text
       (plain text), HTML (HTML	and XHTML), ID3	(ID3 tags found	in MP3 files),
       LaTeX, Mail (RFC	822 and	Usenet News), Manual  (Unix  manual  pages  in
       nroff(1)	with man(7) macros), and RTF (Rich Text	Format).

   Text	Module
       This  module  simply indexes plain text files performing	character map-
       ping and	word determination as has already been described.

   HTML	and XHTML Module
       Additional processing is	done for HTML and XHTML	files.	The additional
       processing is:

       1.  Character and numeric (decimal and hexadecimal) entity references
           are converted to their ASCII character equivalents before further
           examination and indexing.  For example, ``r&eacute;sum&eacute;''
           becomes ``resume'' before indexing.

       2.  If a	matched	set of <TITLE> ... </TITLE> tags is found  within  the
	   first  TitleLines  lines of the file	(default is 12), then the text
	   between the tags is stored in  the  generated  index	 file  as  the
	   file's  title rather	than the file's	name.  (Every non-space	white-
	   space character in the title	is converted to	a space;  leading  and
	   trailing spaces are removed.)

       3.  If  an HTML or XHTML	element	contains a CLASS attribute whose value
	   is among the	set of class names specified as	 those	not  to	 index
	   (via	 one  or more of either	the -C or --no-class option or the Ex-
	   cludeClass variable), then all the text up to the tag that ends the
	   element will	not be indexed.

	   For	an  element  that has an optional end tag, ``the tag that ends
	   the element'' is either the element's end tag or a tag  of  another
	   element  that implicitly ends it; for an element that does not have
	   an end tag, ``the tag that ends  the	 element''  is	the  element's
	   start tag.  (See the	EXAMPLES.)

	   All	elements from the HTML 4.0 specification (including deprecated
	   elements), Ruby elements, plus  common,  browser-specific  elements
	   are	recognized; unrecognized elements are ignored.	(See the -H or
	   --dump-html option.)

       4.  If an HTML or XHTML element contains	a TITLE	 attribute,  then  the
	   words specified as the value	of the TITLE attribute are indexed.

       5.  If  an  AREA, IMG, or INPUT element contains	an ALT attribute, then
	   the words specified as the value of the ALT attribute are indexed.

       6.  If a	META element contains both a NAME and CONTENT attribute,  then
	   the	words  specified as the	value of the CONTENT attribute are in-
	   dexed associated with the meta name specified as the	value  of  the
	   NAME	attribute.

	   (However,  if either	the -A or --no-assoc-meta options or the Asso-
	   ciateMeta variable is specified, then the words  specified  as  the
	   value  of  the CONTENT attribute are	still indexed, but not associ-
	   ated	with the meta name.)

	   (See	also the -m, --meta, -M, and  --no-meta	 options  or  the  In-
	   cludeMeta  or  ExcludeMeta  variables.)   Meta  names  can later be
	   queried against specifically	using search(1).

       7.  If a	TABLE element contains a SUMMARY  attribute,  then  the	 words
	   specified as	the value of the SUMMARY attribute are indexed.

       8.  If  an  OBJECT element contains a STANDBY attribute,	then the words
	   specified as	the value of the STANDBY attribute are indexed.

       9.  All other HTML or XHTML tags	and comments (anything between < and >
	   characters) are discarded.
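
       The title handling in step 2 can be sketched in Python as follows.
       This is an illustration under simplifying assumptions (a regular
       expression rather than a real HTML parser), and the names are
       hypothetical.

```python
import re

TITLE_LINES = 12  # default of TitleLines / --title-lines

_TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title\s*>",
                       re.IGNORECASE | re.DOTALL)


def extract_title(html_text, title_lines=TITLE_LINES):
    """Return the normalized title, or None if no matched pair of tags is
    found within the first title_lines lines."""
    head = "\n".join(html_text.splitlines()[:title_lines])
    match = _TITLE_RE.search(head)
    if match is None:
        return None
    # Every whitespace character other than a space becomes a space ...
    title = re.sub(r"[^\S ]", " ", match.group(1))
    # ... and leading and trailing spaces are removed.
    return title.strip(" ")
```

       When the function returns None, the file's name would be used as its
       title instead.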

       In compliance with the HTML specification, any one of no	quotes,	single
       quotes, or double quotes	may be used to contain	attribute  values  and
       attributes can appear in	any order.  Values containing whitespace, how-
       ever, must be quoted.  The specification	is vague as to whether	white-
       space surrounding the = is legal, but index allows it.
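
       The quoting rules above can be illustrated with a small Python sketch
       that parses the attribute part of a start tag.  It is not the parser
       used by index; the regular expression and the function name are
       assumptions made for this example.

```python
import re

_ATTR_RE = re.compile(
    r"""([A-Za-z_:][-\w:.]*)      # attribute name
        (?:\s*=\s*                # optional value; whitespace around = allowed
           (?:"([^"]*)"           # double-quoted value
             |'([^']*)'           # single-quoted value
             |([^\s"'>]+)         # unquoted value (must not contain whitespace)
           )
        )?""",
    re.VERBOSE,
)


def parse_attributes(tag_body):
    """Parse attributes from the text that follows the tag name.

    Attribute names are lowercased; an attribute without a value maps to "".
    """
    attributes = {}
    for match in _ATTR_RE.finditer(tag_body):
        name = match.group(1).lower()
        value = next((g for g in match.groups()[1:] if g is not None), "")
        attributes[name] = value
    return attributes
```

       Because the result is a dictionary, the order in which attributes
       appear in the tag does not matter to the caller.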

   ID3 Module
       ID3 tags	are used to store audio	meta information for MP3 files (gener-
       ally).  Since audio files contain mostly	binary information,  only  the
       ID3 tag text fields are indexed.	 ID3 tag versions 1.x and 2.x (through
       2.4) are	supported (except for encrypted	frames).  If a	file  contains
       both  1.x  and  2.x  tags, only the 2.x tag is indexed.	The processing
       done for	files containing an ID3	tag is:

       1.  If a	title field is found, then the value of	the title is stored in
	   the generated index file as the file's title	rather than the	file's
	   name.  (Every non-space whitespace character	in the title  is  con-
	   verted to a space; leading and trailing spaces are removed.)

       2.  Words  that are the value of	fields are indexed associated with the
	   field name as a meta	name.  (However, if either the -A or  --no-as-
	   soc-meta  options  or the AssociateMeta variable is specified, then
	   the words specified as the value of the field  are  still  indexed,
	   but not associated with the field.)

	   (See	 also  the  -m,	 --meta,  -M, and --no-meta options or the In-
	   cludeMeta or	ExcludeMeta  variables.)   Meta	 names	can  later  be
	   queried against specifically	using search(1).

	   For	ID3v1.x,  the  recommended  fields  to	be indexed are:	album,
	   artist, comments, genre, and	title.

	   For ID3v2.2,	the recommended	text fields (with reassignments) to be
	   indexed  are:  com=comments,	 tal=album,  tcm=composer,  tco=genre,
	   tcr=copyright, ten=encoder, txt=lyricist,  tt1=content,  tt2=title,
	   tt3=subtitle, ipl=musicians,	tot=original-title, tol=original-lyri-
	   cist, toa=original-artist, tp1=artist, tp2=performers,  tp3=conduc-
	   tor,	tpb=publisher, txx=user, slt=lyrics, and ult=lyrics.

	   For ID3v2.4,	the recommended	text fields (with reassignments) to be
	   indexed are:	comm=comments, talb=album, tcom=composer,  tcon=genre,
	   tcop=copyright, tenc=encoder, text=lyricist,	tipl=people, tit1=con-
	   tent,   tit2=title,	 tit3=subtitle,	  tmcl=musicians,   tmoo=mood,
	   toal=original-title,	 toly=original-lyricist, tope=original-artist,
	   town=owner, tpe1=artist, tpe2=performers, tpe3=conductor, tpub=pub-
	   lisher,  tsst=set-subtitle, txxx=user, user=terms, sylt=lyrics, and
	   uslt=lyrics.

	   ID3v2.3 is the same	as  2.4	 except	 replace  tmcl=musicians  with
	   ipls=musicians.

	   All	text  fields  (with reassignments) for all versions of ID3 can
	   (and	should)	be specified concurrently so it	need not be  known  in
	   advance which version(s) of ID3 MP3 files are encoded with.

       3.  For ID3v2.x,	text fields that are compressed	are uncompressed prior
	   to indexing.

       4.  For ID3v2.x,	Unicode	text that is encoded in	either UTF-8 or	UTF-16
	   (either big-	or little-endian) is decoded prior to indexing.
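
       Since the reassignment lists are long, a script may prefer to generate
       the corresponding --meta options.  The Python sketch below does so for
       the ID3v1.x fields and a subset of the ID3v2.4 reassignments listed
       above; the helper name is hypothetical.

```python
# Fields recommended above for ID3v1.x.
ID3V1_FIELDS = ["album", "artist", "comments", "genre", "title"]

# A subset of the ID3v2.4 reassignments listed above (frame name -> meta name).
ID3V24_REASSIGNMENTS = {
    "comm": "comments",
    "talb": "album",
    "tcom": "composer",
    "tcon": "genre",
    "tit2": "title",
    "tpe1": "artist",
}


def meta_options(fields, reassignments):
    """Build --meta=m and --meta=m=n options for index(1)."""
    options = [f"--meta={name}" for name in fields]
    options += [f"--meta={old}={new}" for old, new in reassignments.items()]
    return options
```

       Passing both the version 1.x fields and the version 2.x reassignments
       in one run follows the advice above to specify all versions
       concurrently.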

   LaTeX Module
       Additional  processing is done for LaTeX	files.	If a \title command is
       found within the	first TitleLines lines of the file  (default  is  12),
       then  the  value	 of the	title is stored	in the generated index file as
       the file's title	rather than the	file's name.  (Every non-space	white-
       space  character	 in  the  title	 is  converted to a space; leading and
       trailing	spaces are removed.)

   Mail	Module
       Additional processing is	done for mail and news files.  The  additional
       processing is:

       1.  If  a  Subject header is found within the first TitleLines lines of
	   the file (default is	12), then the value of the subject  is	stored
	   in  the  generated  index  file as the file's title rather than the
	   file's name.	 (Every	non-space whitespace character in the title is
	   converted to	a space; leading and trailing spaces are removed.)

       2.  Words  that	are  the value of a header are indexed associated with
	   the header name as a	meta name.  (However,  if  either  the	-A  or
	   --no-assoc-meta options or the AssociateMeta	variable is specified,
	   then	the words specified as the value of the	header are  still  in-
	   dexed, but not associated with the header.)

	   (See	 also  the  -m,	 --meta,  -M, and --no-meta options or the In-
	   cludeMeta or	ExcludeMeta  variables.)   Meta	 names	can  later  be
	   queried against specifically	using search(1).

	   The	recommended headers to be indexed are: Bcc, Cc,	Comments, Con-
	   tent-Description, From, Keywords, Newsgroups,  Resent-To,  Subject,
	   and To.

       3.  MIME	attachments are	indexed.

       4.  Text	 that  is  in  the  text/enriched content type is converted to
	   plain text prior to indexing.

       5.  Text	that is	encoded	as either quoted-printable or base-64  is  de-
	   coded prior to indexing.

       6.  Unicode text	that is	encoded	in either the UTF-7 or UTF-8 character
	   set is decoded prior	to indexing.

       7.  Text	in vCards is indexed such that the values  of  types  (fields)
	   are	associated  with the types as meta names.  (However, if	either
	   the -A or --no-assoc-meta options or	the AssociateMeta variable  is
	   specified, then the words specified as the value of types are still
	   indexed, but	not associated with the	types.)

	   The recommended vCard types (with reassignments) to be indexed are:
	   adr=address,	 categories,  class, label=address, fn=name, nickname,
	   note, org, role, and	title.

       Indexing mail and news files is effective only when there is exactly
       one message per file.  While Usenet news files are usually this way,
       mail files are not.  Mail files, e.g., mailboxes, usually contain
       multiple messages.  Such files need to be split up into files of
       individual messages prior to indexing since there's no point in
       indexing a single mailbox: every search result would return a rank of
       100 for the same file.  For this purpose, the splitmail(1) utility is
       included in the SWISH++ distribution.
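
       To show what such splitting involves, here is a Python sketch that
       uses the standard mailbox module to write each message of an mbox file
       to its own file.  It is not splitmail(1) and makes no claim about how
       that utility works or names its output; the function name and the
       numbered file names are choices made for this example.

```python
import mailbox
from pathlib import Path


def split_mbox(mbox_path, out_dir):
    """Write every message in an mbox file to its own file in out_dir.

    Returns the list of files written, in mailbox order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        for number, message in enumerate(box, start=1):
            target = out / f"{number:06d}.txt"
            # as_bytes() omits the mbox "From " separator line by default.
            target.write_bytes(message.as_bytes())
            written.append(target)
    finally:
        box.close()
    return written
```

       The output directory can then be given to index in the usual way, so
       that each search result refers to a single message.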

   Manual Module
       Additional  processing  is  done	for Unix manual	page files.  The addi-
       tional processing is:

       1.  If a	NAME section heading macro (.SH) is found within the first Ti-
	   tleLines  lines  of	the file (default is 12), then the contents of
	   the next line are stored in the generated index file	as the	file's
	   title  rather  than	the  file's name.  (Every non-space whitespace
	   character in	the title is converted to a space; leading and	trail-
	   ing	spaces	as  well  as backslash sequences, such as \f2, are re-
	   moved.)

       2.  Words that are in a section are indexed associated with the name of
	   the section as a meta name.	(However, if either the	-A or --no-as-
	   soc-meta options or the AssociateMeta variable is  specified,  then
	   the	words  in a section are	still indexed, but not associated with
	   the section heading.)

	   Spaces in multi-word	section	 headings  are	converted  to  dashes,
	   e.g.,  ``see	also'' becomes ``see-also'' as a meta name.  (See also
	   the -m, --meta, -M, and --no-meta options or	the IncludeMeta	or Ex-
	   cludeMeta  variables.)   Meta  names	 can  later be queried against
	   specifically	using search(1).

	   The recommended sections to be indexed are: AUTHOR, BUGS,  CAVEATS,
	   DESCRIPTION,	 DIAGNOSTICS, ENVIRONMENT, ERRORS, EXAMPLES, EXIT-STA-
	   TUS,	FILES, HISTORY,	NAME, NOTES, OPTIONS, RETURN-VALUE,  SEE-ALSO,
	   SYNOPSIS, and WARNINGS.
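
       The conversion of a section heading to a meta name can be sketched in
       Python as follows.  The function name is hypothetical, and collapsing
       runs of whitespace into a single dash is an assumption.

```python
def section_meta_name(heading):
    """Turn a section heading into a meta name by joining its words with dashes."""
    return "-".join(heading.split())
```

       For instance, the heading ``see also'' yields ``see-also'', which is
       the name to use when querying that section with search(1).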

   RTF Module
       This  module  simply indexes rich text format files without all format-
       ting commands.

OPTIONS
       Options begin with either a `-' for short options or a ``--'' for  long
       options.	 Either	a `-' or ``--''	by itself explicitly ends the options;
       either short or long options may	be used.  Long option names may	be ab-
       breviated so long as the	abbreviation is	unambiguous.

       For a short option that takes an	argument, the argument is either taken
       to be the remaining characters of the same option, if any, or, if  not,
       is taken	from the next option unless said option	begins with a `-'.

       Short  options  that take no arguments can be grouped (but the last op-
       tion in the group can take an argument),	e.g., -lrv4 is	equivalent  to
       -l -r -v4.

       For  a long option that takes an	argument, the argument is either taken
       to be the characters after a `=', if any, or, if	not, is	taken from the
       next option unless said option begins with a `-'.
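
       The grouping rule for short options can be illustrated with a Python
       sketch.  The set of argument-taking letters is read off the option list
       below; the function is hypothetical and is not how index itself parses
       its command line.

```python
# Short options that take an argument, per the option list on this page.
ARG_OPTIONS = frozenset("cCeEfFgimMpstTvW")


def expand_short_group(token):
    """Split a group of short options such as '-lrv4' into separate options.

    Letters are consumed one at a time; the first letter that takes an
    argument claims the rest of the token as its argument (if any).
    """
    if len(token) < 2 or not token.startswith("-") or token.startswith("--"):
        raise ValueError(f"not a short option group: {token!r}")
    letters = token[1:]
    expanded = []
    for position, letter in enumerate(letters):
        if letter in ARG_OPTIONS:
            expanded.append("-" + letters[position:])
            break
        expanded.append("-" + letter)
    return expanded
```

       If the argument-taking option ends the group with nothing after it,
       its argument is taken from the next command-line word, as described
       above.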

       -?
       --help		   Print the usage (``help'') message and exit.

       -A
       --no-assoc-meta	   Do  not  associate words with meta names during in-
			   dexing nor store such associations in the generated
			   index  file.	  This	sacrifices  meta names for de-
			   creased memory usage	and index file size.

       -cf
       --config-file=f	   The name of the  configuration  file,  f,  to  use.
			   (Default is swish++.conf in the current directory.)
			   A configuration file	is not required:  if  none  is
			   specified  and  the default does not	exist, none is
			   used; however, if one is specified and it does  not
			   exist, then this is an error.

       -Cc
       --no-class=c	   For	HTML  or XHTML files only, a class name, c, of
			   an HTML or XHTML element whose text is  not	to  be
			   indexed.   Multiple -C or --no-class	options	may be
			   specified.

       -em:p[,p...]
       --pattern=m:p[,p...]
			   A module name, m, and a filename pattern (or	set of
			   patterns  separated	by commas), p, of files	to in-
			   dex.	 Case is irrelevant for	the module  name,  but
			   significant	for  the  patterns.   Multiple	-e  or
			   --pattern options may be specified.

       -Ep[,p...]
       --no-pattern=p[,p...]
			   A filename pattern (or set of patterns separated by
			   commas), p, of files	not to index.  Case is signif-
			   icant.  Multiple -E or --no-pattern options may  be
			   specified.

       -fn
       --word-files=n	   The maximum number of files,	n, a word may occur in
			   before it is	discarded as being too frequent.  (De-
			   fault is infinity.)

       -Fn
       --files-reserve=n   Reserve  space  for	this  number  of  files, n, to
			   start.  More	space will be allocated	as  necessary,
			   but with a slight performance penalty.  (Default is
			   1000.)

       -gn
       --files-grow=n	   Grow	the space for the reserved number of files, n,
			   when	incrementally indexing.	 The number can	either
			   be an absolute number  of  files  or	 a  percentage
			   (when  followed by a	percent	sign %).  Just as with
			   the -F option, more space will be allocated as nec-
			   essary,  but	 with  a  slight  performance penalty.
			   (Default is 100.)

       -H
       --dump-html	   Dump	the built-in set of recognized HTML and	 XHTML
			   elements to standard	output and exit.

       -if
       --index-file=f	   The	name  of  the generated	index file, f (for new
			   indexes; default is swish++.index  in  the  current
			   directory)  or the old index	file when doing	incre-
			   mental indexing.

       -I
       --incremental	   Incrementally add the indexed files and words to an
			   existing index.  The	existing index is not touched;
			   instead, a new index	is  created  having  the  same
			   pathname  of	 the  existing index with ``.new'' ap-
			   pended.

       -l
       --follow-links	   Follow symbolic links during	indexing.  (Default is
			   not	to follow them.)  This option is not available
			   under Microsoft Windows since  it  doesn't  support
			   symbolic links.

       -mm[=n]
       --meta=m[=n]	   The value of	a meta name, m,	for which words	are to
			   be associated when indexed.	 Case  is  irrelevant.
			   Multiple -m or --meta options may be	specified.

			   A  meta  name  can be reassigned when followed by a
			   new name, n,	meaning	that the name n	and not	 m  is
			   stored  in the generated index file so that queries
			   would use the new name rather than the original.

			   By default, words associated	with  all  meta	 names
			   are indexed.	 Specifying at least one meta name via
			   this	option changes that so that only the words as-
			   sociated with a member of the set of	meta names ex-
			   plicitly specified via one or more -m or --meta op-
			   tions are indexed.

       -Mm
       --no-meta=m	   The	value  of  a meta name,	m, for which words are
			   not to be indexed.  Case is	irrelevant.   Multiple
			   -M or --no-meta options may be specified.

       -pn
       --word-percent=n	   The	maximum	percentage, n, of files	a word may oc-
			   cur in before it is discarded  as  being  too  fre-
			   quent.   (Default is	100.)  If you want to keep all
			   words regardless, specify 101.

       -P
       --no-pos-data	   Do not store	word positions in memory during	index-
			   ing	nor  in	 the generated index file needed to do
			   ``near'' searches  later  during  searching.	  This
			   sacrifices  ``near''	searching for decreased	memory
			   usage and index file	size (approximately 50%).

       -r
       --no-recurse	   Do not recursively index the	files  in  subdirecto-
			   ries, that is: when a directory is encountered, all
			   the files in	that directory are indexed (modulo the
			   filename  patterns  specified  via  either  the -e,
			   --pattern, -E, or --no-pattern options or  the  In-
			   cludeFile or	ExcludeFile variables) but subdirecto-
			   ries	encountered  are  ignored  and	therefore  the
			   files  contained in them are	not indexed.  This op-
			   tion	is most	useful when specifying the directories
			   and files to	index via standard input.  (Default is
			   to index the	files in subdirectories	recursively.)

       -sf
       --stop-file=f	   The name of a file, f, containing the set of	 stop-
			   words  to  use instead of the built-in set.	White-
			   space, including blank lines, and characters	start-
			   ing	with  #	 and continuing	to the end of the line
			   (comments) are ignored.

       -S
       --dump-stop	   Dump	the built-in set  of  stop-words  to  standard
			   output and exit.

       -tn
       --title-lines=n	   The maximum number of lines,	n, into	a file to look
			   at for a file's title.  (Default  is	 12.)	Larger
			   numbers slow	indexing.

       -Td
       --temp-dir=d	   The	path of	the directory, d, to use for temporary
			   files.  The directory must exist.  (Default is /tmp
			   for Unix or /temp for Windows.)

			   If  your  OS	mounts swap space on /tmp, as indexing
			   progresses and more files get created in /tmp,  you
			   will	 have  less  swap  space,  indexing  will  get
			   slower, and you may run out of memory.  If this  is
			   the	case, you should specify a directory on	a real
			   filesystem, i.e., one on a physical disk.

       -vn
       --verbosity=n	   The verbosity level,	n, for printing	additional in-
			   formation  to standard output during	indexing.  The
			   verbosity levels, 0-4, are:

			   0   No  output  is  generated  except  for  errors.
			       (This is	the default.)
			   1   Only  run  statistics  (elapsed time, number of
			       files, word count) are printed.
			   2   Directories are printed as indexing progresses.
			   3   Directories and files are printed with a	 word-
			       count for each file.
			   4   Same  as	 3  but	also prints all	files that are
			       not indexed and why.

       -V
       --version	   Print the version number  of	 SWISH++  to  standard
			   output and exit.

       -Wn
       --word-threshold=n  The	word count past	which partial indices are gen-
			   erated and merged since all the words are  too  big
			   to  fit into	memory at the same time.  If you index
			   and your machine begins to  swap  like  mad,	 lower
			   this	 value.	  Only	the  super-user	 can specify a
			   value larger	than the compiled-in default.

CONFIGURATION FILE
       The following variables can be set in a configuration file.   Variables
       and command-line	options	can be mixed, the latter taking	priority.

	    AssociateMeta	Same as	-A or --no-assoc-meta
	    ExcludeClass	Same as	-C or --no-class
	    ExcludeFile		Same as	-E or --no-pattern
	    ExcludeMeta		Same as	-M or --no-meta
	    FilesGrow		Same as	-g or --files-grow
	    FilesReserve	Same as	-F or --files-reserve
	    FilterAttachment	(See FILTERS in	swish++.conf(4).)
	    FilterFile		(See FILTERS in	swish++.conf(4).)
	    FollowLinks		Same as	-l or --follow-links
	    IncludeFile		Same as	-e or --pattern
	    IncludeMeta		Same as	-m or --meta
	    Incremental		Same as	-I or --incremental
	    IndexFile		Same as	-i or --index-file
	    RecurseSubdirs	Same as	-r or --no-recurse
	    StopWordFile	Same as	-s or --stop-file
	    StoreWordPositions	Same as	-P or --no-pos-data
	    TempDirectory	Same as	-T or --temp-dir
	    TitleLines		Same as	-t or --title-lines
	    Verbosity		Same as	-v or --verbosity
	    WordFilesMax	Same as	-f or --word-files
	    WordPercentMax	Same as	-p or --word-percent
	    WordsNear		Same as	-n or --near
	    WordThreshold	Same as	-W or --word-threshold

EXAMPLES
   Unix	Command-Lines
       All these examples assume that you change your working directory to
       your web server's document root prior to indexing.

       To index	all HTML and text files	on a web server:

	    index -v3 -e 'html:*.*htm*'	-e 'text:*.txt'	.

       To index	all files not under directories	named CVS:

	    find . -name CVS -prune -o -type f -a -print | index -e 'html:*.*htm*' -
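
       Where find(1) is not available, the same file list can be produced by
       a short script.  The following Python sketch is one way to do it; the
       function name is hypothetical.

```python
import os


def files_outside(root, pruned_name="CVS"):
    """Yield regular files under root, skipping anything named pruned_name.

    Directories named pruned_name are not descended into, and symbolic
    links are not reported, roughly matching
    'find . -name CVS -prune -o -type f -a -print'.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending further.
        dirnames[:] = sorted(d for d in dirnames if d != pruned_name)
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if name != pruned_name and os.path.isfile(path) \
                    and not os.path.islink(path):
                yield path
```

       Writing the yielded paths one per line to the standard input of
       index, with `-' given as the single filename, has the same effect as
       the pipeline above.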

   Windows Command-Lines
       When using the Windows command interpreter, single quotes around	 file-
       name patterns don't work; you must use double quotes:

	    index -v3 -e "html:*.*htm*"	-e "text:*.txt"	.

       This  is	a problem with Windows,	not SWISH++.  (Double quotes will also
       work under Unix.)

   Using CLASS Attributes to Index HTML	Selectively
       In an HTML or XHTML document, there may be sections that	should not  be
       indexed.	  For  example,	if every page of a web site contains a naviga-
       tion menu such as:

	    <SELECT NAME="menu">
	      <OPTION>Home
	      <OPTION>Automotive
	      <OPTION>Clothing
	      <OPTION>Hardware
	    </SELECT>

       or a common header and footer, then, ordinarily,	those words  would  be
       indexed for every page and therefore be discarded because they would be
       too frequent.  However, via either the -C or --no-class option  or  the
       ExcludeClass  variable,	one  or	 more class names can be specified and
       then HTML or XHTML elements belonging to	one of those classes will  not
       have the	text up	to the tag that	ends them indexed.  Given a class name
       of, say,	no_index, the above menu can be	changed	to:

	    <SELECT NAME="menu"	CLASS="no_index">

       and then	everything up to the </SELECT> tag will	not be indexed.

       For an HTML element that	has an optional	end tag	(such as the <P>  ele-
       ment),  the  text up to the tag that ends it will not be	indexed, which
       is either the element's own end tag or a	tag of some other element that
       implicitly ends it.  For	example, in:

	    <P CLASS="no_index">
	    This was the poem that Alice read:
	    <BLOCKQUOTE>
	      <B>Jabberwocky</B><BR>
	      `Twas brillig, and the slithy toves<BR>
	      Did gyre and gimble in the wabe;<BR>
	      All mimsy	were the borogoves,<BR>
	      And the mome raths outgrabe.
	    </BLOCKQUOTE>

       the  <BLOCKQUOTE> tag implicitly	ends the <P> element (as do all	block-
       level elements) so the only text	that is	not indexed above  is:	``This
       was the poem that Alice read.''

       For  an	HTML  or XHTML element that does not have an end tag, only the
       text within the start tag will not be indexed.  For example, in:

	    <IMG SRC="home.gif"	ALT="Home" CLASS="no_index">

       the word	``Home'' will not be indexed even though it  ordinarily	 would
       have been if the	CLASS attribute	were not there.
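
       The effect of excluding a class can be approximated with Python's
       standard HTML parser.  The sketch below is deliberately simplified and
       is not the SWISH++ implementation: it handles explicit end tags and
       elements without end tags, but not implicit end tags such as the <P>
       ended by <BLOCKQUOTE> above.  All names are hypothetical.

```python
from html.parser import HTMLParser

# Elements that have no end tag (a partial list, enough for this sketch).
VOID_ELEMENTS = {"area", "br", "hr", "img", "input", "link", "meta"}


class ClassFilteringExtractor(HTMLParser):
    """Collect words, skipping elements whose CLASS is in the excluded set."""

    def __init__(self, excluded_classes):
        super().__init__(convert_charrefs=True)
        self.excluded = set(excluded_classes)
        self.skip_tag = None   # name of the excluded element currently open
        self.skip_depth = 0    # nesting depth of same-named elements inside it
        self.words = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.skip_depth += 1
            return
        classes = (attributes.get("class") or "").split()
        if self.excluded.intersection(classes):
            # For an element without an end tag, only the start tag is skipped.
            if tag not in VOID_ELEMENTS:
                self.skip_tag, self.skip_depth = tag, 1
            return
        if tag in ("area", "img", "input") and attributes.get("alt"):
            self.words.extend(attributes["alt"].split())

    def handle_endtag(self, tag):
        if self.skip_tag is not None and tag == self.skip_tag:
            self.skip_depth -= 1
            if self.skip_depth == 0:
                self.skip_tag = None

    def handle_data(self, data):
        if self.skip_tag is None:
            self.words.extend(data.split())


def indexable_words(html_text, excluded_classes):
    """Return the words that remain after excluding the given classes."""
    parser = ClassFilteringExtractor(excluded_classes)
    parser.feed(html_text)
    parser.close()
    return parser.words
```

       Given the menu and image examples above with a class name of no_index,
       neither the menu entries nor the word ``Home'' from the ALT attribute
       would be among the returned words.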

   Filters
       (See Filters under EXAMPLES in swish++.conf(4).)

EXIT STATUS
       Exits with one of the values given below:

	    0	 Success.
	    1	 Error in configuration	file.
	    2	 Error in command-line options.
	    10	 Unable	to open	temporary file.
	    11	 Unable	to write index file.
	    12	 Unable	to write temporary file.
	    13	 Root-only operation attempted.
	    30	 Unable	to read	stop-word file.
	    40	 Unable	to read	index file.
	    127	 Internal error.
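
       A wrapper script can turn these values into messages.  The table in
       the following Python sketch is copied from the list above; the
       function name is hypothetical.

```python
# Exit status values of index(1), as listed above.
EXIT_STATUS = {
    0: "Success.",
    1: "Error in configuration file.",
    2: "Error in command-line options.",
    10: "Unable to open temporary file.",
    11: "Unable to write index file.",
    12: "Unable to write temporary file.",
    13: "Root-only operation attempted.",
    30: "Unable to read stop-word file.",
    40: "Unable to read index file.",
    127: "Internal error.",
}


def describe_exit_status(code):
    """Return the documented meaning of an exit status, if there is one."""
    return EXIT_STATUS.get(code, f"Undocumented exit status {code}.")
```

       The code would typically come from the returncode attribute of the
       object returned by subprocess.run().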

CAVEATS
       1.  Generated index files are machine-dependent (size of	data types and
	   byte	order).

       2.  The word-determination heuristics employed are heavily  geared  for
	   English.  Using SWISH++ as-is to index and search files in non-Eng-
	   lish	languages is not recommended.

       3.  Unless otherwise noted above, the character encoding	always used is
	   ISO	8859-1	(Latin	1).  Character encodings that are specified in
	   HTML	or XHTML files are ignored.

       4.  An e-mail message can have both an encoding and a non-ASCII or non-
	   ISO-8859-1 charset simultaneously, e.g., base64-encoded UTF-8.  (In
	   practice, this particular case  should  never  happen  since	 UTF-7
	   should be used instead; but you get the idea.)

           However, handling both an encoding and such a charset simultaneously
           is problematic; hence, an e-mail message or attachment can have
           either an encoding or a non-ASCII or non-ISO-8859-1 character set,
           but not both.  If it has both, the encoding takes precedence.

FILES
       swish++.conf	 default configuration file name
       swish++.index	 default index file name

ENVIRONMENT
       TMPDIR	 If set, the default path of the directory to use  for	tempo-
		 rary files.  The directory must exist.	 This is superseded by
		 either	the -T or --temp-dir option or the TempDirectory vari-
		 able.
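
       The order of precedence for the temporary directory can be summarized
       by the following Python sketch, which assumes a Unix system.  The
       function and parameter names are hypothetical.

```python
import os


def resolve_temp_dir(option=None, config_value=None, environ=None):
    """Pick the temporary directory in the documented order of precedence.

    option       -- value of -T or --temp-dir, if given
    config_value -- value of the TempDirectory variable, if set
    environ      -- environment mapping (defaults to os.environ)
    """
    environ = os.environ if environ is None else environ
    if option:            # command-line options take priority over variables
        return option
    if config_value:      # the variable supersedes TMPDIR
        return config_value
    return environ.get("TMPDIR") or "/tmp"  # /tmp is the Unix default
```

       In every case, the directory that is chosen must already exist.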

SEE ALSO
       extract(1),     find(1),	    nroff(1),	  search(1),	 splitmail(1),
       swish++.conf(4),	glob(7), man(7).

       Tim Berners-Lee.	 ``The text/enriched MIME Content-type,'' Request  for
       Comments	 1563,	Network	Working	Group of the Internet Engineering Task
       Force, January 1994.

       David H.	Crocker.  ``Standard for the Format of ARPA Internet Text Mes-
       sages,''	 Request  for Comments 822, Department of Electrical Engineer-
       ing, University of Delaware, August 1982.

       Frank Dawson and	Tim Howes.  ``vCard MIME Directory Profile,''  Request
       for  Comments  2426,  Network Working Group of the Internet Engineering
       Task Force, September 1998.

       Ned Freed and Nathaniel S. Borenstein.	``Multipurpose	Internet  Mail
       Extensions  (MIME)  Part	 One: Format of	Internet Message Bodies,'' Re-
       quest for Comments 2045,	RFC 822	Extensions Working Group of the	Inter-
       net Engineering Task Force, November 1996.

       David  Goldsmith	 and  Mark Davis.  ``UTF-7, a mail-safe	transformation
       format of Unicode,'' Request for	Comments 2152, Network	Working	 Group
       of the Internet Engineering Task	Force, May 1997.

       International Standards Organization.  ISO 8859-1: Information Process-
       ing -- 8-bit single-byte	coded graphic character	sets --	Part 1:	 Latin
       alphabet	No. 1, 1987.

       --.   ISO  8879:	 Information  Processing -- Text and Office Systems --
       Standard	Generalized Markup Language (SGML), 1986.

       --.  ISO/IEC 9945-2: Information	Technology -- Portable Operating  Sys-
       tem Interface (POSIX) --	Part 2:	Shell and Utilities, 1993.

       Leslie  Lamport.	  LaTeX: A Document Preparation	System,	2nd ed., Addi-
       son-Wesley, Reading, MA,	1994.

       Martin Nilsson.	ID3 tag	version	2, March 1998.

       --.  ID3	tag version 2.3.0, February 1999.

       --.  ID3	tag version 2.4.0 - Main Structure, November 2002.

       --.  ID3	tag version 2.4.0 - Native Frames, November 2002.

       Steven Pemberton, et al.	 XHTML 1.0: The	 Extensible  HyperText	Markup
       Language, World Wide Web	Consortium, January 2000.

       Dave Raggett, Arnaud Le Hors, and Ian Jacobs.  ``On SGML and HTML: SGML
       constructs used in HTML: Entities,'' HTML 4.0 Specification, §3.2.3,
       World Wide Web Consortium, April 1998.

       --.  ``The global structure of an HTML document: The document head: The
       title attribute,'' HTML 4.0 Specification, §7.4.3, World Wide Web
       Consortium, April 1998.

       --.  ``The global structure of an HTML document: The document head:
       Meta data,'' HTML 4.0 Specification, §7.4.4, World Wide Web Consortium,
       April 1998.

       --.  ``The global structure of an HTML document: The document body:
       Element identifiers: the id and class attributes,'' HTML 4.0
       Specification, §7.5.2, World Wide Web Consortium, April 1998.

       --.  ``Tables: Elements for constructing tables: The TABLE element,''
       HTML 4.0 Specification, §11.2.1, World Wide Web Consortium, April 1998.

       --.  ``Objects, Images, and Applets: Generic inclusion: the OBJECT
       element,'' HTML 4.0 Specification, §13.3, World Wide Web Consortium,
       April 1998.

       --.  ``Objects, Images, and Applets: How to specify alternate text,''
       HTML 4.0 Specification, §13.8, World Wide Web Consortium, April 1998.

       --.  ``Index of Elements,'' HTML	4.0 Specification, World Wide Web Con-
       sortium,	April 1998.

       Marcin Sawicki, et al.  Ruby Annotation,	 World	Wide  Web  Consortium,
       April 2001.

       The Unicode Consortium.  ``Encoding Forms,'' The Unicode Standard 3.0,
       §2.3, Addison-Wesley, 2000.

       Francois	Yergeau.  ``UTF-8, a transformation format of ISO 10646,'' Re-
       quest  for  Comments  2279, Network Working Group of the	Internet Engi-
       neering Task Force, January 1998.

AUTHOR
       Paul J. Lucas <pauljlucas@mac.com>

SWISH++				March 25, 2004			      index(1)
