SWISH-FAQ(1)		     SWISH-E Documentation		  SWISH-FAQ(1)

NAME
       The Swish-e FAQ - Answers to Common Questions

OVERVIEW
       List of commonly	asked and answered questions.  Please review this doc-
       ument before asking questions on	the Swish-e discussion list.

       General Questions

       What is Swish-e?

       Swish-e is the Simple Web Indexing System for Humans - Enhanced.  With
       it, you can quickly and easily index directories of files or remote
       web sites and search the generated indexes for words and phrases.

       So, is Swish-e a	search engine?

       Well, yes.  Probably the	most common use	of Swish-e is to provide a
       search engine for web sites.  The Swish-e distribution includes CGI
       scripts that can	be used	with it	to add a search	engine for your	web
       site.  The CGI scripts can be found in the example directory of the
       distribution package.  See the README file for information about	the
       scripts.

       But Swish-e can also be used to index all sorts of data,	such as	email
       messages, data stored in	a relational database management system, XML
       documents, or documents such as Word and	PDF documents -- or any	combi-
       nation of those sources at the same time.  Searches can be limited to
       fields or MetaNames within a document, or limited to areas within an
       HTML document (e.g. body, title).  Programs other than CGI applications
       can use Swish-e,	as well.

       Should I	upgrade	if I'm already running a previous version of Swish-e?

       A large number of bug fixes, feature additions, and logic corrections
       were made in version 2.2.  In addition, indexing	speed has been drasti-
       cally improved (reports of indexing times changing from four hours to 5
       minutes), and major parts of the	indexing and search parsers have been
       rewritten.  There are better debugging options, enhanced output
       formats, more document meta data (e.g. last modified date, document
       summary), options for indexing from external data sources, and faster
       spidering, just to name a few changes.  (See the CHANGES file for more
       information.)

       Since so	much effort has	gone into version 2.2, support for previous
       versions	will probably be limited.

       Are there binary	distributions available	for Swish-e on platform	foo?

       Foo?  Well, yes there are some binary distributions available.  Please
       see the Swish-e web site	for a list at http://swish-e.org/.

       In general, it is recommended that you build Swish-e from source, if
       possible.

       Do I need to reindex my site each time I	upgrade	to a new Swish-e ver-
       sion?

       At times	it might not strictly be necessary, but	since you don't	really
       know if anything	in the index has changed, it is	a good rule to rein-
       dex.

       What's the advantage of using the libxml2 library for parsing HTML?

       Swish-e may be linked with libxml2, a library for working with HTML
       and XML documents, and can then use libxml2 to parse both formats.

       The libxml2 parser is a better parser than Swish-e's built-in HTML
       parser.	It offers more features, and it	does a much better job at ex-
       tracting	out the	text from a web	page.  In addition, you	can use	the
       "ParserWarningLevel" configuration setting to find structural errors in
       your documents that could (and would with Swish-e's HTML	parser)	cause
       documents to be indexed incorrectly.

       Libxml2 is not required,	but is strongly	recommended for	parsing	HTML
       documents.  It's	also recommended for parsing XML, as it	offers many
       more features than the internal Expat xml.c parser.

       The internal HTML parser will receive only limited support, and it
       does have a number of bugs.  For example, HTML entities may not always
       be correctly converted, and entities in properties are not converted
       at all.  The internal parser tends to get confused by invalid HTML,
       where the libxml2 parser is confused less often.  Document structure
       is also better detected with the libxml2 parser.

       If you are using the Perl module (the Perl interface to the Swish-e C
       library) you may wish to build two versions of Swish-e, one with the
       libxml2 library linked into the binary and one without, and build the
       Perl module against the library without the libxml2 code.  This is to
       save space in the library.  Hopefully, the library will someday soon
       be split into indexing and searching code (volunteers welcome).

       Does Swish-e include a CGI interface?

       Yes.  Kind of.

       There are two example CGI scripts included, swish.cgi and search.cgi.
       Both are installed at $prefix/lib/swish-e.

       Both require a bit of work to set up and use.  Swish.cgi is probably
       what most people will want to use, as it contains more features.
       Search.cgi is for those who want to start with a small script and
       customize it to fit their needs.

       An example of using swish.cgi is given in the INSTALL man page and in
       the swish.cgi documentation.  As is often the case, it will be easier
       to use if you first read the documentation.

       Please use caution about	CGI scripts found on the Internet for use with
       Swish-e.	 Some are not secure.

       The included example CGI scripts were designed with security in mind.
       Regardless, you are encouraged to have your local Perl expert review
       them (and all other CGI scripts you use) before placing them into
       production.  This is just a good policy to follow.

       How secure is Swish-e?

       We know of no security issues with using Swish-e.  Careful attention
       has been paid to common security problems, such as buffer overruns,
       when programming Swish-e.

       The most	likely security	issue with Swish-e is when it is run via a
       poorly written CGI interface.  This is not limited to CGI scripts writ-
       ten in Perl, as it's just as easy to write an insecure CGI script in C,
       Java, PHP, or Python.  A	good source of information is included with
       the Perl	distribution.  Type "perldoc perlsec" at your local prompt for
       more information.  Another must-read document is	located	at
       "http://www.w3.org/Security/faq/wwwsf4.html".

       Note that there are many	free yet insecure and poorly written CGI
       scripts available -- even some designed for use with Swish-e.  Please
       carefully review	any CGI	script you use.	 Free is not such a good price
       when you	get your server	hacked...

       Should I	run Swish-e as the superuser (root)?

       No.  Never.

       What files does Swish-e write?

       Swish writes the	index file, of course.	This is	specified with the
       "IndexFile" configuration directive or by the "-f" command line switch.

       The index file is actually a collection of files, but all start with
       the file	name specified with the	"IndexFile" directive or the "-f" com-
       mand line switch.

       For example, the	file ending in .prop contains the document properties.

       When creating the index files Swish-e appends the extension .temp to
       the index file names.  When indexing is complete	Swish-e	renames	the
       .temp files to the index	files specified	by "IndexFile" or "-f".	 This
       is done so that existing	indexes	remain untouched until it completes
       indexing.

       Swish-e also writes temporary files in some cases during indexing
       (e.g. "-S http", "-S prog" with filters, when merging, and when using
       "-e").  Temporary files are created with the mkstemp(3) function (with
       0600 permissions on unix-like operating systems).

       The temporary files are created in the directory specified by the
       environment variables "TMPDIR" and "TMP", in that order.  If neither
       is set then swish uses the configuration setting "TmpDir".  Otherwise,
       the temporary files are created in the current directory.
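       For example, to place the temporary files on a different file system
       during an indexing run that uses "-e", you might run (the directory
       shown is hypothetical):

	   TMPDIR=/var/tmp swish-e -c swish.conf -e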

       Can I index PDF and MS-Word documents?

       Yes, you	can use	a Filter to convert documents while indexing, or you
       can use a program that "feeds" documents	to Swish-e that	have already
       been converted.	See "Indexing" below.

       Can I index documents on	a web server?

       Yes, Swish-e provides two ways to index (spider)	documents on a web
       server.	See "Spidering"	below.

       Swish-e can retrieve documents from a file system or from a remote web
       server.	It can also execute a program that returns documents back to
       it.  This program can retrieve documents from a database, filter
       compressed document files, convert PDF files, extract data from mail
       archives, or spider remote web sites.

       Can I implement keywords	in my documents?

       Yes, Swish-e can	associate words	with MetaNames while indexing, and you
       can limit your searches to these	MetaNames while	searching.

       In your HTML files you can put keywords in HTML META tags or in XML
       blocks.

       META tags can have two formats in your source documents:

	   <META NAME="DC.subject" CONTENT="digital libraries">

       And in XML format (can also be used in HTML documents when using
       libxml2):

	   <meta2>
	       Some Content
	   </meta2>

       Then, to inform Swish-e about the existence of these meta names, add a
       line like this to your configuration file:

	   MetaNames DC.subject	meta1 meta2

       When searching you can now limit some or all search terms to that
       MetaName.  For example, you can look for documents that contain the
       word apple and also have either fruit or cooking in the DC.subject
       meta tag.
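       Such a search might look like this (a sketch, using the meta name
       query syntax shown with the "creator" examples later in this
       document):

	   swish-e -w 'apple and DC.subject=(fruit or cooking)'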

       What are	document properties?

       A document property is typically	data that describes the	document.  For
       example,	properties might include a document's path name, its last mod-
       ified date, its title, or its size.  Swish-e stores a document's	prop-
       erties in the index file, and they can be reported back in search re-
       sults.

       Swish-e also uses properties for	sorting.  You may sort your results by
       one or more properties, in ascending or descending order.

       Properties can also be defined within your documents.  HTML and XML
       files can specify tags (see previous question) as properties.  The con-
       tents of	these tags can then be returned	with search results.  These
       user-defined properties can also	be used	for sorting search results.

       For example, if you had the following in	your documents

	  <meta	name="creator" content="accounting department">

       and "creator" is	defined	as a property (see "PropertyNames" in SWISH-
       CONFIG) Swish-e can return "accounting department" with the result for
       that document.

	   swish-e -w foo -p creator

       Or for sorting:

	   swish-e -w foo -s creator

       What's the difference between MetaNames and PropertyNames?

       MetaNames allows	keywords searches in your documents.  That is, you can
       use MetaNames to	restrict searches to just parts	of your	documents.

       PropertyNames, on the other hand, define	text that can be returned with
       results,	and can	be used	for sorting.

       Both use	meta tags found	in your	documents (as shown in the above two
       questions) to define the	text you wish to use as	a property or meta
       name.

       You may define a	tag as both a property and a meta name.	 For example:

	  <meta	name="creator" content="accounting department">

       placed in your documents	and then using configuration settings of:

	   PropertyNames creator
	   MetaNames creator

       will allow you to limit your searches to	documents created by account-
       ing:

	   swish-e -w 'foo and creator=(accounting)'

       That will find all documents with the word "foo"	that also have a cre-
       ator meta tag that contains the word "accounting".  This	is using
       MetaNames.

       And you can also	say:

	   swish-e -w foo -p creator

       which will return all documents with the	word "foo", but	the results
       will also include the contents of the "creator" meta tag	along with re-
       sults.  This is using properties.

       You can use properties and meta names at	the same time, too:

	   swish-e -w creator=(accounting or marketing)	-p creator -s creator

       That searches only in the "creator" meta name for either of the words
       "accounting" or "marketing", prints out the contents of the "creator"
       property, and sorts the results by the "creator" property.

       (See also the "-x" output format	switch in SWISH-RUN.)

       Can Swish-e index multi-byte characters?

       No.  This will require much work to change.  But, Swish-e works with
       eight-bit characters, so many character sets can be used.  Note that
       it does call the ANSI-C tolower() function, which depends on the
       current locale setting.  See locale(7) for more information.

       Indexing

       How do I	pass Swish-e a list of files to	index?

       Currently, there	is not a configuration directive to include a file
       that contains a list of files to	index.	But, there is a	directive to
       include another configuration file.

	   IncludeConfigFile /path/to/other/config

       And in "/path/to/other/config" you can say:

	   IndexDir file1 file2	file3 file4 file5 ...
	   IndexDir file20 file21 file22

       You may also specify more than one configuration	file on	the command
       line:

	   ./swish-e -c	config_one config_two config_three

       Another option is to create a directory with symbolic links of the
       files to	index, and index just that directory.
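       A minimal sketch of the symbolic link approach (the paths are
       hypothetical):

	   mkdir /tmp/to_index
	   ln -s /home/user/docs/report.html /tmp/to_index/
	   ln -s /home/user/docs/notes.html /tmp/to_index/
	   ./swish-e -i /tmp/to_index

       Note that you may also need to enable the "FollowSymLinks" directive
       in your configuration file so that the links are followed.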

       How does	Swish-e	know which parser to use?

       Swish can parse HTML, XML, and text documents.  The parser is set by
       associating a file extension with a parser by the "IndexContents" di-
       rective.	 You may set the default parser	with the "DefaultContents" di-
       rective.	 If a document is not assigned a parser	it will	default	to the
       HTML parser (HTML2 if built with	libxml2).
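       For example, a configuration using the libxml2 parsers might associate
       extensions like this (a sketch; the extension lists are illustrative):

	   IndexContents HTML2 .htm .html
	   IndexContents XML2 .xml
	   IndexContents TXT2 .txt .log
	   DefaultContents HTML2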

       You may use Filters or an external program to convert documents to
       HTML, XML, or text.

       Can I reindex and search	at the same time?

       Yes.  Starting with version 2.2 Swish-e indexes to temporary files, and
       then renames the	files when indexing is complete.  On most systems re-
       names are atomic.  But, since Swish-e generates more than one file
       during indexing, there is a very short window while the various files
       are renamed during which the index is out of sync.

       Settings	in src/config.h	control	some options related to	temporary
       files, and their	use during indexing.

       Can I index phrases?

       Phrases are indexed automatically.  To search for a phrase simply place
       double quotes around the	phrase.

       For example:

	   swish-e -w 'free and	"fast search engine"'

       How can I prevent phrases from matching across sentences?

       Use the BumpPositionCounterCharacters configuration directive.
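       For example (a sketch; treating these characters as sentence
       boundaries is an assumption about your documents):

	   BumpPositionCounterCharacters .?!

       When one of these characters is seen, the word position counter is
       bumped, so a quoted phrase will no longer match across it.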

       Swish-e isn't indexing a	certain	word or	phrase.

       There are a number of configuration parameters that control what
       Swish-e considers a "word", and there is a debugging feature to help
       pinpoint any indexing problems.

       Configuration file directives (SWISH-CONFIG) "WordCharacters", "Begin-
       Characters", "EndCharacters", "IgnoreFirstChar",	and "IgnoreLastChar"
       are the main settings that Swish-e uses to define a "word".  See	SWISH-
       CONFIG and SWISH-RUN for	details.

       Swish-e also uses compile-time defaults for many	settings.  These are
       located in src/config.h file.

       The command line arguments "-k", "-v" and "-T" are useful when
       debugging these problems.  Using "-T INDEXED_WORDS" while indexing
       will display each word as it is indexed.  You should specify one file
       when using this feature since it can generate a lot of output.

	    ./swish-e -c my.conf -i problem.file -T INDEXED_WORDS

       You may also wish to index a single file that contains words that are
       not indexed as you expect, and use -T to output debugging information
       about the index.  A useful command might be:

	   ./swish-e -f	index.swish-e -T INDEX_FULL

       Once you	see how	Swish-e	is parsing and indexing	your words, you	can
       adjust the configuration	settings mentioned above to control what words
       are indexed.

       Another useful command might be:

	    ./swish-e -c my.conf -i problem.file -T PARSED_WORDS INDEXED_WORDS

       This will show white-spaced words parsed	from the document
       (PARSED_WORDS), and how those words are split up	into separate words
       for indexing (INDEXED_WORDS).

       How do I	keep Swish-e from indexing numbers?

       Swish-e indexes words as	defined	by the "WordCharacters"	setting, as
       described above.	 So to avoid indexing numbers you simply remove	digits
       from the	"WordCharacters" setting.
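       For example, a "WordCharacters" setting without digits might look like
       this (a sketch; adjust the character list to suit your documents):

	   WordCharacters abcdefghijklmnopqrstuvwxyz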

       There are also some settings in src/config.h that control what "words"
       are indexed.  You can configure swish to	never index words that are all
       digits, vowels, or consonants, or that contain more than	some consecu-
       tive number of digits, vowels, or consonants.  In general, you won't
       need to change these settings.

       Also, there's an	experimental feature called "IgnoreNumberChars"	which
       allows you to define a set of characters	that describe a	number.	 If a
       word is made up of only those characters	it will	not be indexed.
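       For example (a sketch; which characters make up a "number" in your
       documents is an assumption):

	   IgnoreNumberChars 0123456789.,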

       Swish-e crashes and burns on a certain file. What can I do?

       This shouldn't happen.  If it does please post to the Swish-e discus-
       sion list the details so	it can be reproduced by	the developers.

       In the meantime, you can use a "FileRules" directive to exclude the
       particular file name, pathname, or title.  If there are serious
       problems indexing certain types of files, they may not contain valid
       text (they may be binary files, for instance).  You can use
       "NoContents" to exclude that type of file.
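       For example (a sketch; the names and patterns are hypothetical):

	   FileRules filename is badfile.bin
	   FileRules pathname contains /tmp/
	   NoContents .gif .png .jpeg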

       Swish-e will issue a warning if an embedded null character is found in
       a document.  This warning is an indication that you are trying to
       index binary data.  If you need to index binary files, try to find a
       program that will extract the text (e.g. strings(1), catdoc(1),
       pdftotext(1)).

       How do I prevent indexing of some documents?

       When using the file system to index your	files you can use the
       "FileRules" directive.  Other than "FileRules title", "FileRules" only
       works with the file system ("-S fs") indexing method, not with "-S
       prog" or	"-S http".

       If you are spidering a site you have control over, use a robots.txt
       file in your document root.  This is a standard way to exclude files
       from search engines, and is fully supported by Swish-e.  See
       http://www.robotstxt.org/

       If spidering a website with the included spider.pl program, add any
       necessary tests to the spider's configuration file.  Type "perldoc
       spider.pl" in the "prog-bin" directory for details, or see the spider
       documentation on the Swish-e website.  Look for the section on
       callback functions.

       If using	the libxml2 library for	parsing	HTML (which you	probably are),
       you may also use	the Meta Robots	Exclusion in your documents:

	   <meta name="robots" content="noindex">

       See the obeyRobotsNoIndex directive.

       How do I	prevent	indexing parts of a document?

       To prevent Swish-e from indexing a common header, footer, or
       navigation bar when you are using libxml2 for parsing HTML, you may
       place a fake HTML tag around the text you wish to ignore and use the
       "IgnoreMetaTags" directive.  This will generate an error message if
       "ParserWarningLevel" is set, as the fake tag is invalid HTML.

       "IgnoreMetaTags"	works with XML documents (and HTML documents when us-
       ing libxml2 as the parser), but not with	documents parsed by the	text
       (TXT) parser.

       If you are using the libxml2 parser (HTML2 and XML2) then you can use
       the following comments in your documents to prevent indexing:

	      <!-- SwishCommand	noindex	-->
	      <!-- SwishCommand	index -->

       and/or these may	be used	also:

	      <!-- noindex -->
	      <!-- index -->

       How do I modify the path or URL of the indexed documents?

       Use the "ReplaceRules" configuration directive to rewrite path names
       and URLs.  If you are using the "-S prog" input method you may set the
       path to any string.
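       For example, to rewrite file system paths into URLs (a sketch; the
       paths shown are hypothetical):

	   ReplaceRules replace "/usr/local/apache/htdocs" "http://www.example.com"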

       How can I index data from a database?

       Use the "prog" document source method of	indexing.  Write a program to
       extract out the data from your database,	and format it as XML, HTML, or
       text.  See the examples in the "prog-bin" directory, and	the next ques-
       tion.
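       Here is a minimal sketch of such a program in Perl using the DBI
       module.  It writes each database row as a small HTML document in the
       header format that "-S prog" expects (the same format used by the
       catfilter.pl example later in this document).  The data source, table,
       and column names are all hypothetical:

	   #!/usr/local/bin/perl -w
	   use strict;
	   use DBI;

	   my $dbh = DBI->connect( 'dbi:mysql:mydb', 'user', 'password',
	       { RaiseError => 1 } );

	   my $sth = $dbh->prepare( 'SELECT id, title, body FROM articles' );
	   $sth->execute;

	   while ( my ( $id, $title, $body ) = $sth->fetchrow_array ) {
	       # Wrap each row in minimal HTML so the HTML parser
	       # can pick up the title.
	       my $doc = "<html><head><title>$title</title></head>"
		       . "<body>$body</body></html>";

	       # A header block, a blank line, then the document itself.
	       print "Content-Length: ", length($doc), "\n";
	       print "Path-Name: article_$id\n";
	       print "\n";
	       print $doc;
	   }

       Index the output with a command such as:

	   ./swish-e -S prog -i ./mydbprog.pl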

       How do I	index my PDF, Word, and	compressed documents?

       Swish-e can internally only parse HTML, XML and TXT (text) files	by de-
       fault, but can make use of filters that will convert other types	of
       files such as MS	Word documents,	PDF, or	gzipped	files into one of the
       file types that Swish-e understands.

       Please see SWISH-CONFIG and the examples	in the filters and filter-bin
       directory for more information.

       See the next question to	learn about the	filtering options with
       Swish-e.

       How do I	filter documents?

       The term "filter" in Swish-e means the conversion of a document of one
       type (one that Swish-e cannot index directly) into a type that Swish-e
       can index, namely HTML, plain text, or XML.  To add to the confusion,
       there are a number of ways to accomplish this in Swish-e.  So here's a
       bit of background.

       The FileFilter directive was added to swish first.  This feature
       allows you to specify a program to run for documents that match a
       given file extension.  For example, to filter PDF files (files that
       end in .pdf) you can specify the configuration setting:

	   FileFilter .pdf pdftotext   "'%p' -"

       which says to run the program "pdftotext" passing it the	pathname of
       the file	(%p) and a dash	(which tells pdftotext to output to stdout).
       Then for each .pdf file Swish-e runs this program and reads the
       filtered document from the program's output.

       This has the advantage that it is easy to set up -- a single line in
       the config file is all that is needed to add the filter to Swish-e.
       it also has a number of problems.  For example, if you use a Perl
       script to do your filtering it can be very slow since the filter	script
       must be run (and	thus compiled) for each	processed document.  This is
       exacerbated when using the -S http method, since that method also runs
       a Perl script for every URL fetched.  Also, when using the -S prog
       method of input (reading input from a program), using FileFilter means
       that Swish-e must first read the file in from the external program and
       then write it out to a temporary file before running the filter.

       With -S prog it makes much more sense to	filter the document in the
       program that is fetching	the documents than to have swish-e read	the
       file into memory, write it to a temporary file and then run an external
       program.

       The Swish-e distribution	contains a couple of example -S	prog programs.
       spider.pl is a reasonably full-featured web spider that offers many
       more options than the -S	http method.  And it is	much faster than run-
       ning -S http, too.

       The spider has a	perl configuration file, which means you can add pro-
       gramming	logic right into the configuration file	without	editing	the
       spider program.	One bit	of logic that is provided in the spider's con-
       figuration file is a "call-back"	function that allows you to filter the
       content.	 In other words, before	the spider passes a fetched web	docu-
       ment to swish for indexing the spider can call a	simple subroutine in
       the spider's configuration file passing the document and	its content
       type.  The subroutine can then look at the content type and decide if
       the document needs to be	filtered.

       For example, when processing a document of type "application/msword"
       the call-back subroutine might call the doc2txt.pm perl module, and a
       document of type "application/pdf" could use the pdf2html.pm module.
       The prog-bin/SwishSpiderConfig.pl file shows this usage.

       This system works reasonably well, but also means that more work	is re-
       quired to setup the filters.  First, you	must explicitly	check for spe-
       cific content types and then call the appropriate Perl module, and sec-
       ond, you	have to	know how each module must be called and	how each re-
       turns the possibly modified content.

       In comes	SWISH::Filter.

       To make things easier the SWISH::Filter Perl module was created.	 The
       idea of this module is that there is one	interface used to filter all
       types of	documents.  So instead of checking for specific	types of con-
       tent you	just pass the content type and the document to the SWISH::Fil-
       ter module and it returns a new content type and	document if it was
       filtered.  The filters that do the actual work are designed with	a
       standard interface and work like filter "plug-ins".  Adding a new
       filter means just downloading it to a directory; no changes are needed
       to the spider's configuration file.  Download a filter for Postscript
       and the next time you run indexing your Postscript files will be
       indexed.

       Since the filters are standardized, hopefully when you have the need to
       filter documents	of a specific type there will already be a filter
       ready for your use.

       Now, note that the perl modules may or may not do the actual
       conversion of a document.  For example, the PDF conversion module
       calls the pdfinfo and pdftotext programs.  Those programs (part of the
       Xpdf package) must be installed separately from the filters.

       The SwishSpiderConfig.pl example spider configuration file shows how to
       use the SWISH::Filter module for	filtering.  This file is installed at
       $prefix/share/doc/swish-e/examples/prog-bin, where $prefix is normally
       /usr/local on unix-type machines.

       The SWISH::Filter method	of filtering can also be used with the -S http
       method of indexing.  By default the swishspider program (the Perl
       helper script that fetches documents from the web) will attempt to use
       the SWISH::Filter module if it can be found in Perl's library path.
       This path is set	automatically for spider.pl but	not for	swishspider
       (because	it would slow down a method that's already slow	and spider.pl
       is recommended over the -S http method).

       Therefore, all that's required to use this system with -S http is set-
       ting the	@INC array to point to the filter directory.

       For example, if the swish-e distribution	was unpacked into ~/swish-e:

	  PERL5LIB=~/swish-e/filters swish-e -c	conf -S	http

       will allow the -S http method to	make use of the	SWISH::Filter module.

       Note that if you	are not	using the SWISH::Filter	module you may wish to
       edit the	swishspider program and	disable	the use	of the SWISH::Filter
       module using this setting:

	   use constant	USE_FILTERS  =>	0;  # disable SWISH::Filter

       This prevents the program from attempting to use	the SWISH::Filter mod-
       ule for every non-text URL that is fetched.  Of course, if you are con-
       cerned with indexing speed you should be	using the -S prog method with
       spider.pl instead of -S http.

       If you are not spidering, but you still want to make use	of the
       SWISH::Filter module for	filtering you can use the DirTree.pl program
       (in $prefix/lib/swish-e).  This is a simple program that	traverses the
       file system and uses SWISH::Filter for filtering.

       Here are two examples of how to run a filter program, one using Swish-e's
       "FileFilter" directive, another using a "prog" input method program.
       See the SwishSpiderConfig.pl file for an	example	of using the
       SWISH::Filter module.

       These filters simply use the program "/bin/cat" as a filter and only
       index .html files.

       First, using the	"FileFilter" method, here's the	entire configuration
       file (swish.conf):

	   IndexDir .
	   IndexOnly .html
	   FileFilter .html "/bin/cat"	 "'%p'"

       and index with the command

	   swish-e -c swish.conf -v 1

       Now, the same thing using the "-S prog" document source input method
       and a Perl program called catfilter.pl.  You can see that it's much
       more work than using the "FileFilter" method above, but it provides a
       place to do additional processing.  In this example, the "prog" method
       is only slightly faster.  But if you needed a perl script to run as a
       FileFilter then "prog" will be significantly faster.

	   #!/usr/local/bin/perl -w
	   use strict;
	   use File::Find;  # for recursing a directory	tree

	   $/ =	undef;
	   find(
	       { wanted	=> \&wanted, no_chdir => 1, },
	       '.',
	   );

	   sub wanted {
	       return if -d;
	       return unless /\.html$/;

	       my $mtime  = (stat)[9];

	       my $child = open( FH, '-|' );
	       die "Failed to fork $!" unless defined $child;
	       exec '/bin/cat',	$_ unless $child;

	       my $content = <FH>;
	       my $size	= length $content;

	       print <<EOF;
	   Content-Length: $size
	   Last-Mtime: $mtime
	   Path-Name: $_

	   EOF

	       # The filehandle was already slurped above ($/ is undef),
	       # so print the saved content rather than reading again.
	       print $content;
	       close FH;
	   }

       And index with the command:

	   swish-e -S prog -i ./catfilter.pl -v	1

       This example will probably not work under Windows due to	the '-|' open.
       A simple	piped open may work just as well:

       That is,	replace:

	   my $child = open( FH, '-|' );
	   die "Failed to fork $!" unless defined $child;
	   exec	'/bin/cat', $_ unless $child;

       with this:

	   open( FH, "/bin/cat $_ |" ) or die $!;

       Perl will try to	avoid running the command through the shell if meta
       characters are not passed to the	open.  See "perldoc -f open" for more
       information.

       Eh, but I just want to know how to index	PDF documents!

       See the examples in the conf directory and the comments in the
       SwishSpiderConfig.pl file.

       See the previous	question for the details on filtering.	The method you
       decide to use will depend on how	fast you want to index,	and your com-
       fort level with using Perl modules.

       Regardless of the filtering method you use, you will need to install
       the Xpdf package, available from http://www.foolabs.com/xpdf/.

       I'm using Windows and can't get Filters or the prog input method	to
       work!

       Both the	"-S prog" input	method and filters use the "popen()" system
       call to run the external	program.  If your external program is, for ex-
       ample, a	perl script, you have to tell Swish-e to run perl, instead of
       the script.  Swish-e will convert forward slashes to backslashes	when
       running under Windows.

       For example, you	would need to specify the path to perl as (assuming
       this is where perl is on	your system):

	   IndexDir e:/perl/bin/perl.exe

       Or run a	filter like:

	   FileFilter .foo e:/perl/bin/perl.exe	'myscript.pl "%p"'

       It's often easier to just install Linux.

       How do I	index non-English words?

       Swish-e indexes 8-bit characters	only.  This is the ISO 8859-1 Latin-1
       character set, and includes many	non-English letters (and symbols).  As
       long as they are	listed in "WordCharacters" they	will be	indexed.

       Actually, you probably can index	any 8-bit character set, as long as
       you don't mix character sets in the same	index and don't	use libxml2
       for parsing (see	below).

       The "TranslateCharacters" directive (SWISH-CONFIG) can translate	char-
       acters while indexing and searching.  You may specify the mapping of
       one character to	another	character with the "TranslateCharacters" di-
       rective.

       "TranslateCharacters :ascii7:" is a predefined set of characters	that
       will translate eight-bit	characters to ascii7 characters.  Using	the
       ":ascii7:" rule will, for example, translate "Aac" to "aac".  This
       means: searching	"Celik", "celik" or "celik" will all match the same
       word.
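       For example, to fold specific characters yourself rather than use the
       predefined set (the character lists below are illustrative; the two
       lists map position for position):

```
# map each character in the first list to the character
# at the same position in the second list
TranslateCharacters äöü aou
```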

       Note: When using	libxml2	for parsing, parsed documents are converted
       internally (within libxml2) to UTF-8.  This is converted	to ISO 8859-1
       Latin-1 when indexing.  In cases where a string cannot be converted
       from UTF-8 to ISO 8859-1 (because it contains non-8859-1 characters),
       the string will be sent to Swish-e in UTF-8 encoding.  This will
       result in some words being indexed incorrectly.  Setting
       "ParserWarningLevel" to 1 or more will display warnings when the
       UTF-8 to 8859-1 conversion fails.

       Can I add/remove	files from an index?

       Try building swish-e with the "--enable-incremental" option.

       The rest	of this	FAQ applies to the default swish-e format.

       Swish-e currently has no	way to add or remove items from	its index.
       But, Swish-e indexes so quickly that it's often possible	to reindex the
       entire document set when	a file needs to	be added, modified or removed.
       If you are spidering a remote site, consider caching the documents
       locally (compressed).

       Incremental additions can be handled in a couple	of ways, depending on
       your situation.	It's probably easiest to create	one main index every
       night (or every week), and then create an index of just the new files
       between main indexing jobs and use the "-f" option to pass both indexes
       to Swish-e while	searching.

       You can merge the indexes into one index	(instead of using -f), but
       it's not	clear that this	has any	advantage over searching multiple in-
       dexes.

       How does	one create the incremental index?

       One method is by	using the "-N" switch to pass a	file path to Swish-e
       when indexing.  It will only index files	that have a last modification
       date "newer" than the file supplied with	the "-N" switch.

       This option has the disadvantage that Swish-e must process every file
       in every directory as if it were going to be indexed (the "-N" test is
       done last, right before indexing of the file contents begins and after
       all other tests on the file have been completed) -- all that just to
       find a few new files.

       Also, if	you use	the Swish-e index file as the file passed to "-N"
       there may be files that were added after	indexing was started, but be-
       fore the	index file was written.	 This could result in a	file not being
       added to	the index.

       Another option is to maintain a parallel	directory tree that contains
       symlinks	pointing to the	main files.  When a new	file is	added (or
       changed)	to the main directory tree you create a	symlink	to the real
       file in the parallel directory tree.  Then just index the symlink di-
       rectory to generate the incremental index.

       This option has the disadvantage	that you need to have a	central	pro-
       gram that creates the new files that can	also create the	symlinks.
       But, indexing is	quite fast since Swish-e only has to look at the files
       that need to be indexed.	 When you run full indexing you	simply unlink
       (delete)	all the	symlinks.
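       The bookkeeping such a central program would do can be sketched as
       follows (Python for illustration; the directory layout and function
       name are hypothetical):

```python
import os

def link_into_incremental(src_path, main_root, incr_root):
    """Mirror a new or changed file into the parallel symlink tree."""
    rel = os.path.relpath(src_path, main_root)
    link = os.path.join(incr_root, rel)
    os.makedirs(os.path.dirname(link), exist_ok=True)
    if os.path.lexists(link):
        os.remove(link)          # a changed file: refresh its link
    os.symlink(os.path.abspath(src_path), link)
    return link
```

       After a full reindex, removing every symlink under the incremental
       root corresponds to the "unlink all the symlinks" step described
       above.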

       Both of these methods have issues: files could end up in both
       indexes, or be left out of an index entirely.  Use of file locks while
       indexing and hash lookups during searches can help prevent these
       problems.

       I run out of memory trying to index my files.

       It's true that indexing can take	up a lot of memory!  Swish-e is	ex-
       tremely fast at indexing, but that comes	at the cost of memory.

       The best answer is to install more memory.

       Another option is to use the "-e" switch.  This will require less
       memory, but indexing will take longer, as not all data will be stored
       in memory while indexing.  How much less memory and how much more time
       depend on the documents you are indexing and the hardware you are
       using.

       Here's an example of indexing all .html files in	/usr/doc on Linux.
       This first example is without "-e" and used about 84M of	memory:

	   270279 unique words indexed.
	   23841 files indexed.	 177640166 total bytes.
	   Elapsed time: 00:04:45 CPU time: 00:03:19

       This is with "-e", and used about 26M of memory:

	   270279 unique words indexed.
	   23841 files indexed.	 177640166 total bytes.
	   Elapsed time: 00:06:43 CPU time: 00:04:12

       You can also build a number of smaller indexes and then merge them
       together with "-M".  Using "-e" while merging will save memory.

       Finally, if you do build a number of smaller indexes, you can specify
       more than one index when searching by using the "-f" switch.  Sorting
       large result sets by a property will be slower when specifying
       multiple index files while searching.

       "too many open files" when indexing with	-e option

       Some platforms report "too many open files" when	using the -e economy
       option.	The -e feature uses many temporary files (something like 377)
       plus the	index files and	this may exceed	your system's limits.

       Depending on your platform you may need to set "ulimit" or "unlimit".

       For example, under Linux	bash shell:

	 $ ulimit -n 1024

       Or under	an old Sparc

	 % unlimit openfiles

       My system admin says Swish-e uses too much of the CPU!

       That's a	good thing!  That expensive CPU	is supposed to be busy.

       Indexing takes a lot of work -- to make indexing fast, much of the
       work is done in memory, which reduces the amount of time Swish-e
       spends waiting on I/O.  But there are two things you can try:

       The "-e"	option will run	Swish-e	in economy mode, which uses the	disk
       to store	data while indexing.  This makes Swish-e run somewhat slower,
       but also	uses less memory.  Since it is writing to disk more often it
       will be spending	more time waiting on I/O and less time in CPU.	Maybe.

       The other thing is to simply lower the priority of the job using	the
       nice(1) command:

	   /bin/nice -15 swish-e -c search.conf

       If you are concerned about searching time, make sure you are using
       the -b and -m switches to return only a page of results at a time.
       If you know that your result sets will be large, that you wish to
       return results one page at a time, and that many pages of the same
       query will often be requested, it may be wise to fetch all the
       results on the first request and then cache them to a temporary
       file.  The Perl module File::Cache makes this very simple to
       accomplish.
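       The caching idea might be sketched like this (Python for
       illustration; the Perl File::Cache module mentioned above provides
       the same pattern, and run_search stands in for however you invoke
       Swish-e and parse its results):

```python
import hashlib
import json
import os
import tempfile

CACHE_DIR = os.path.join(tempfile.gettempdir(), "search-cache")

def cached_results(query, run_search):
    """Return results for a query, running the search only on a cache miss."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.md5(query.encode("utf-8")).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".json")
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    results = run_search(query)        # first request: do the real work
    with open(path, "w") as fh:
        json.dump(results, fh)
    return results
```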

       Spidering

       How can I index documents on a web server?

       If possible, use the file system method "-S fs" of indexing to index
       documents in your web area of the file system.  This avoids the
       overhead of spidering a web server and is much faster.  ("-S fs" is
       the default method if "-S" is not specified.)

       If this is impossible (the web server is	not local, or documents	are
       dynamically generated), Swish-e provides	two methods of spidering.
       First, it includes the http method of indexing "-S http". A number of
       special configuration directives	are available that control spidering
       (see "Directives	for the	HTTP Access Method Only" in SWISH-CONFIG).  A
       perl helper script (swishspider)	is included in the src directory to
       assist with spidering web servers. There	are example configurations for
       spidering in the	conf directory.

       As of Swish-e 2.2, there's a general purpose "prog" document source
       where a program can feed	documents to it	for indexing.  A number	of ex-
       ample programs can be found in the "prog-bin" directory,	including a
       program to spider web servers.  The provided spider.pl program is full-
       featured	and is easily customized.

       The advantage of the "prog" document source feature over the "http"
       method is that the program is only executed one time, whereas the
       swishspider program used in the "http" method is executed once for
       every document read from the web server.  The forking of Swish-e and
       compiling of the perl script can be quite expensive, time-wise.

       The other advantage of the "spider.pl" program is that it's simple
       and efficient to add filtering (such as for PDF or MS Word docs) right
       into spider.pl's configuration.  It also includes features such as MD5
       checks to prevent duplicate indexing, and options to skip spidering
       some files, or to index a file without spidering it for links.  And
       since it's a perl program, there's no limit on the features you can
       add.

       Why does	swish report "./swishspider: not found"?

       Does the	file swishspider exist where the error message displays?  If
       not, either set the configuration option	SpiderDirectory	to point to
       the directory where the swishspider program is found, or	place the
       swishspider program in the current directory when running swish-e.

       If you are running Windows, make	sure "perl" is in your path.  Try typ-
       ing perl	from a command prompt.

       If you are not running Windows, make sure that the shebang line (the
       first line of the swishspider program, the one starting with #!)
       points to the correct location of perl.  Typically this will be
       /usr/bin/perl or /usr/local/bin/perl.  Also, make sure that you have
       read and execute permissions on swishspider.

       The swishspider perl script is only used	with the -S http method	of in-
       dexing.

       I'm using the spider.pl program to spider my web	site, but some large
       files are not indexed.

       The "spider.pl" program has a default limit of 5MB file size.  This can
       be changed with the "max_size" parameter	setting.  See "perldoc spi-
       der.pl" for more	information.

       I still don't think all my web pages are	being indexed.

       The spider.pl program has a number of debugging switches	and can	be
       quite verbose in	telling	you what's happening, and why.	See "perldoc
       spider.pl" for instructions.

       Swish is	not spidering Javascript links!

       Swish cannot follow links generated by Javascript, as they are gener-
       ated by the browser and are not part of the document.

       How do I	spider other websites and combine it with my own (filesystem)
       index?

       You can either merge "-M" two indexes into a single index, or use "-f"
       to specify more than one	index while searching.

       You will	have better results with the "-f" method.

       Searching

       How do I	limit searches to just parts of	the index?

       If you can identify "parts" of your index by the	path name you have two
       options.

       The first option is to index the document path.  Add this to your
       configuration:

	   MetaNames swishdocpath

       Now you can search for words or phrases in the path name:

	   swish-e -w 'foo AND swishdocpath=(sales)'

       So that will only find documents with the word "foo" and where the
       file's path contains "sales".  That might not work as well as you
       would like, though, as both of these paths will match:

	   /web/sales/products/index.html
	   /web/accounting/private/sales_we_messed_up.html

       This can	be solved by searching with a phrase (assuming "/" is not a
       WordCharacter):

	   swish-e -w 'foo AND swishdocpath=("/web/sales/")'
	   swish-e -w 'foo AND swishdocpath=("web sales")'  (same thing)

       The second option is a bit more powerful.  With the "ExtractPath" di-
       rective you can use a regular expression	to extract out a sub-set of
       the path	and save it as a separate meta name:

	   MetaNames department
	   ExtractPath department regex	!^/web/([^/]+).+$!$1/

       This says: match a path that starts with "/web/", extract everything
       after that up to, but not including, the next "/" and save it in $1,
       matching everything from that "/" onward.  The entire matched string
       is then replaced with $1, and that gets indexed as the meta name
       "department".

       Now you can search like:

	   swish-e -w 'foo AND department=sales'

       and be sure that you will only match the documents under the
       /web/sales/ path.  Note that you can map completely different areas of
       your file system to the same metaname:

	   # flag the marketing	specific pages
	   ExtractPath department regex	!^/web/(marketing|sales)/.+$!marketing/
	   ExtractPath department regex	!^/internal/marketing/.+$!marketing/

	   # flag the technical	departments pages
	   ExtractPath department regex	!^/web/(tech|bugs)/.+$!tech/

       Finally,	if you have something more complicated,	use "-S	prog" and
       write a perl program or use a filter to set a meta tag when processing
       each file.

       How is ranking calculated?

       The "swishrank" property	value is calculated based on which Ranking
       Scheme (or algorithm) you have selected.	In this	discussion, any	time
       the word	fancy is used, you should consult the actual code for more de-
       tails. It is open source, after all.

       Things you can do to affect ranking:

       MetaNamesRank
	   You may configure your index	to bias	certain	metaname values	more
	   or less than	others.	 See the "MetaNamesRank" configuration option
	   in SWISH-CONFIG.

       IgnoreTotalWordCountWhenRanking
	   Set to 1 (default) or 0 in your config file.	See SWISH-CONFIG.
	   NOTE: You must set this to 0	to use the IDF Ranking Scheme.

       structure
	   Each	term's position	in each	HTML document is given a structure
	   value based on the context in which the word	appears. The structure
	   value is used to artificially inflate the frequency of each term in
	   that	particular document.  These structural values are defined in
	   config.h:

	    #define RANK_TITLE		   7
	    #define RANK_HEADER		   5
	    #define RANK_META		   3
	    #define RANK_COMMENTS	   1
	    #define RANK_EMPHASIZED	   0

	   For example,	if the word "foo" appears in the title of a document,
	   the Scheme will treat that document as if "foo" appeared 7 addi-
	   tional times.

       All Schemes share the following characteristics:

       AND searches
	   The rank value is averaged for all AND'd terms. Terms within	a set
	   of parentheses () are averaged as a single term (this is an ac-
	   knowledged weakness and is on the TODO list).

       OR searches
	   The rank value is summed and	then doubled for each pair of OR'd
	   terms. This results in higher ranks for documents that have multi-
	   ple OR'd terms.

       scaled rank
	   After a document's raw rank score is	calculated, a final rank score
	   is calculated using a fancy "log()" function. All the documents are
	   then	scaled against a base score of 1000.  The top-ranked document
	   will	therefore always have a	"swishrank" value of 1000.

       Here is a brief overview	of how the different Schemes work. The number
       in parentheses after the	name is	the value to invoke that scheme	with
       "swish-e	-R" or "RankScheme()".

       Default (0)
	   The default ranking scheme considers	the number of times a term ap-
	   pears in a document (frequency), the	MetaNamesRank and the struc-
	   ture	value. The rank	might be summarized as:

	    DocRank = Sum of ( structure + metabias )

	   Consider this output	with the DEBUG_RANK variable set at compile
	   time:

	    Ranking Scheme: 0
	    Word entry 0 at position 6 has struct 7
	    Word entry 1 at position 64	has struct 41
	    Word entry 2 at position 71	has struct 9
	    Word entry 3 at position 132 has struct 9
	    Word entry 4 at position 154 has struct 9
	    Word entry 5 at position 423 has struct 73
	    Word entry 6 at position 541 has struct 73
	    Word entry 7 at position 662 has struct 73
	    File num: 1104.  Raw Rank: 21.  Frequency: 8 scaled	rank: 30445
	     Structure tally:
	     struct 0x7	= count	of 1 ( HEAD TITLE FILE ) x rank	map of 8 = 8

	     struct 0x9	= count	of 3 ( BODY FILE ) x rank map of 1 = 3

	     struct 0x29 = count of 1 (	HEADING	BODY FILE ) x rank map of 6 = 6

	     struct 0x49 = count of 3 (	EM BODY	FILE ) x rank map of 1 = 3

	   Every word instance starts with a base score	of 1.  Then for	each
	   instance of your word, a running sum	is taken of the	structural
	   value of that word position plus any	bias you've configured.	 In
	   the example above, the raw rank is "1 + 8 + 3 + 6 + 3 = 21".

	   Consider this line:

	     struct 0x7	= count	of 1 ( HEAD TITLE FILE ) x rank	map of 8 = 8

	   That means there was one instance of our word in the title of the
	   file.  Its context was in the <head> tagset, inside the <title>.
	   The <title> is the most specific structure, so it gets the
	   RANK_TITLE score: 7.  The base rank of 1 plus the structure score
	   of 7 equals 8.  If there had been two instances of this word in
	   the title, then the score would have been "8 + 8 = 16".
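	   The raw-rank arithmetic in that dump can be verified directly
	   (values copied from the structure tally above):

```python
# the raw rank: a base score of 1 plus each structure-tally line's total
base = 1
tally = [
    (1, 8),  # struct 0x7:  HEAD TITLE FILE
    (3, 1),  # struct 0x9:  BODY FILE
    (1, 6),  # struct 0x29: HEADING BODY FILE
    (3, 1),  # struct 0x49: EM BODY FILE
]
raw_rank = base + sum(count * weight for count, weight in tally)
print(raw_rank)  # 21
```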

       IDF (1)
	   IDF is short	for Inverse Document Frequency.	That's fancy ranking
	   lingo for taking into account the total frequency of	a term across
	   the entire index, in	addition to the	term's frequency in a single
	   document. IDF ranking also uses the relative	density	of a word in a
	   document to judge its relevancy. Words that appear more often in a
	   doc make that doc's rank higher, and	longer docs are	not weighted
	   higher than shorter docs.

	   The IDF Scheme might	be summarized as:

	     DocRank = Sum of (	density	* idf *	( structure + metabias ) )

	   Consider this output	from DEBUG_RANK:

	    Ranking Scheme: 1
	    File num: 1104  Word Score:	1  Frequency: 8	 Total files: 1451
	    Total word freq: 108   IDF:	2564
	    Total words: 1145877   Indexed words in this doc: 562
	    Average words: 789	 Density: 1120	  Word Weight: 28716
	    Word entry 0 at position 6 has struct 7
	    Word entry 1 at position 64	has struct 41
	    Word entry 2 at position 71	has struct 9
	    Word entry 3 at position 132 has struct 9
	    Word entry 4 at position 154 has struct 9
	    Word entry 5 at position 423 has struct 73
	    Word entry 6 at position 541 has struct 73
	    Word entry 7 at position 662 has struct 73
	    Rank after IDF weighting: 574321
	    scaled rank: 132609
	     Structure tally:
	     struct 0x7	= count	of  1 (	HEAD TITLE FILE	) x rank map of	8 = 8

	     struct 0x9	= count	of  3 (	BODY FILE ) x rank map of 1 = 3

	     struct 0x29 = count of  1 ( HEADING BODY FILE ) x rank map	of 6 = 6

	     struct 0x49 = count of  3 ( EM BODY FILE )	x rank map of 1	= 3

	   It is similar to the	default	Scheme,	but notice how the total num-
	   ber of files	in the index and the total word	frequency (as opposed
	   to the document frequency) are both part of the equation.

       Ranking is a complicated	subject. SWISH-E allows	for more Ranking
       Schemes to be developed and experimented	with, using the	-R option
       (from the swish-e command) and the RankScheme (see the API documenta-
       tion). Experiment and share your	findings via the discussion list.

       How can I limit searches	to the title, body, or comment?

       Use the "-t" switch.

       I can't limit searches to title/body/comment.

       Or, I can't search with meta names, all the names are indexed as
       "plain".

       Check whether #define INDEXTAGS is set to 1 in the config.h file.  If
       it is, change it to 0, recompile, and index again.  When INDEXTAGS is
       1, ALL the tags are indexed as plain text; that is, you index "title",
       "h1", and so on, AND they lose their indexing meaning.  If INDEXTAGS
       is set to 0, you will still index meta tags and comments, unless you
       have indicated otherwise in the user config file with the
       IndexComments directive.

       Also, check for the "UndefinedMetaTags" setting in your configuration
       file.

       I've tried running the included CGI script and I get an "Internal
       Server Error"

       Debugging CGI scripts is beyond the scope of this document.  "Internal
       Server Error" basically means "check the web server's log for an error
       message", as it can indicate a bad shebang (#!) line, a missing perl
       module, an FTP transfer error, or simply an error in the program.  The
       CGI script swish.cgi in the example directory contains some debugging
       suggestions.  Type "perldoc swish.cgi" for information.

       There are also many, many CGI FAQs available on the Internet.  A
       quick web search should offer help.  As a last resort you might ask
       your webadmin for help...

       When I try to view the swish.cgi	page I see the contents	of the Perl
       program.

       Your web	server is not configured to run	the program as a CGI script.
       This problem is described in "perldoc swish.cgi".

       How do I	make Swish-e highlight words in	search results?

       Short answer:

       Use the supplied	swish.cgi or search.cgi	scripts	located	in the example
       directory.

       Long answer:

       Swish-e itself can't, because it doesn't have access to the source
       documents when returning results.  But a front-end program of your
       creation can highlight terms.  Your program can open the source
       documents and use regular expressions to replace search terms with
       highlighted or bolded words.

       But that will fail with all but the simplest source documents.  For
       HTML documents, for example, you must parse the document into words
       and tags (and comments).  A word you wish to highlight may span
       multiple HTML tags, or be a word in a URL where you wish to highlight
       the entire link text.

       Perl modules such as HTML::Parser and XML::Parser make word
       extraction possible.  Next, you need to consider that Swish-e uses
       settings such as WordCharacters, BeginCharacters, EndCharacters,
       IgnoreFirstChar, and IgnoreLastChar to define a "word".  That is, you
       can't assume that a string of characters with white space on each side
       is a word.

       Then things like	TranslateCharacters, and HTML Entities may transform a
       source word into	something else,	as far as Swish-e is concerned.	 Fi-
       nally, searches can be limited by metanames, so you may need to limit
       your highlighting to only parts of the source document.	Throw phrase
       searches	and stopwords into the equation	and you	can see	that it's not
       a trivial problem to solve.

       All hope is not lost, though, as Swish-e does provide some help.  With
       the "-H" option it will return in the headers the current index's (or
       indexes') settings for WordCharacters (and others) needed to parse
       your source documents the same way they were parsed during indexing,
       and it will return a "Parsed Words:" header that shows how the query
       was parsed internally.  If you use fuzzy indexing (word stemming,
       soundex, or metaphone) then you will also need to stem each word in
       your document before comparing with the "Parsed Words:" returned by
       Swish-e.

       The Swish-e stemming code is available either by	using the Swish-e Perl
       module (SWISH::API) or the C library (included with the swish-e distri-
       bution),	or by using the	SWISH::Stemmer module available	on CPAN.  Also
       on CPAN is the module Text::DoubleMetaphone.  Using SWISH::API probably
       provides	the best stemming support.

       Do filters affect performance during search?

       No.  Filters (FileFilter	or via "prog" method) are only used for	build-
       ing the search index database.  During search requests there will be no
       filter calls.

       I have read the FAQ but I still have questions about using Swish-e.

       The Swish-e discussion list is the place	to go.	http://swish-e.org/.
       Please do not email developers directly.	 The list is the best place to
       ask questions.

       Before you post please read QUESTIONS AND TROUBLESHOOTING located in
       the INSTALL page.  You should also search the Swish-e discussion	list
       archive which can be found on the swish-e web site.

       In short, be sure to include the following when asking for help:

       * The swish-e version (./swish-e	-V)
       * What you are indexing (and perhaps a sample), and the number of files
       * Your Swish-e configuration file
       * Any error messages that Swish-e is reporting

Document Info
       $Id: SWISH-FAQ.pod 2147 2008-07-21 02:48:55Z karpet $


2.4.7				  2009-04-04			  SWISH-FAQ(1)
