CRAWL(1)		  BSD General Commands Manual		      CRAWL(1)

     crawl -- a	small and efficient HTTP crawler

     crawl [-u urlincl]	[-e urlexcl] [-i imgincl] [-I imgexcl] [-d imgdir]
	   [-m depth] [-c state] [-t timeout] [-A agent] [-R] [-E external]
	   [url	...]

     The crawl utility starts a	depth-first traversal of the web at the	speci-
     fied URLs.	 It stores all JPEG images that	match the configured con-

     The options are as	follows:

     -v	level	  The verbosity	level of crawl in regards to printing informa-
		  tion about URL processing.  The default is 1.

     -u urlincl   A regex(3) expression that a URL must match in order to be
                  included in the traversal.

     -e	urlexcl	  A regex(3) expression	that determines	which URLs will	be ex-
		  cluded from the traversal.

     -i	imgincl	  A regex(3) expression	that all image URLs have to match in
		  order	to be stored on	disk.

     -I	imgexcl	  A regex(3) expression	that determines	the images that	will
		  not be stored.

     -d	imagedir  Specifies the	directory under	which the images will be

     -m	depth	  Specifies the	maximum	depth of the traversal.	 A 0 means
		  that only the	URLs specified on the command line will	be re-
		  trieved. A -1	stands for unlimited traversal and should be
		  used with caution.

     -c state     Continues a traversal that was interrupted previously.  The
                  remaining URLs will be read from the file state.

     -t timeout   Specifies the time in seconds that must pass between
                  successive accesses to a single host.  The parameter is a
                  floating-point number.  The default is five seconds.

     -A	agent	  Specifies the	agent string that will be included in all HTTP

     -R		  Specifies that the crawler should ignore the robots.txt

     -E	external  Specifies an external	filter program that can	refine which
		  URLs are to be included in the traversal.  The filter	pro-
		  gram reads the URLs on stdin and outputs a single character
		  on stdout.  An output	of `y' indicates that the URL may be
		  included, `n'	means that the URL should be excluded.
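     The filter protocol described for -E can be illustrated with a short
     Python sketch.  This is not part of crawl; the host allow-list, the
     names verdict and run_filter, and the assumption that crawl writes one
     URL per line and expects exactly one character back per URL are all
     illustrative guesses based on the description above.

```python
import io
from urllib.parse import urlsplit

# Hypothetical policy: only URLs on these hosts may be traversed.
ALLOWED_HOSTS = {"www.example.org"}


def verdict(url: str) -> str:
    """Return 'y' if the URL may be included, 'n' if it should be excluded."""
    host = urlsplit(url.strip()).hostname or ""
    return "y" if host in ALLOWED_HOSTS else "n"


def run_filter(inp, out) -> None:
    """Read URLs from inp and write one verdict character per URL to out."""
    # Assumption: one URL per line on the input stream.
    for line in inp:
        out.write(verdict(line))
        out.flush()  # reply immediately so the caller is not left waiting


# Example run against in-memory streams instead of a live crawl process.
out = io.StringIO()
run_filter(io.StringIO("http://www.example.org/a.jpg\nhttp://other.example.net/\n"), out)
print(out.getvalue())  # prints "yn"
```

     A real filter script would call run_filter(sys.stdin, sys.stdout)
     instead of the in-memory example.  Flushing after every verdict matters
     because a buffered reply could leave the crawler waiting for output.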

     The source code of existing web crawlers tends to be very complicated.
     crawl is a very simple design with pretty simple source code.

     A configuration file can be used instead of the command line arguments.
     The configuration file contains the MIME-type that	is being used.	To
     download other objects besides images the MIME-type needs to be adjusted
     accordingly.  For more information, see crawl.conf.

     crawl -m 0

     Searches for images in the index page of the web consortium without
     following any other links.
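
     The depth limit used in this example follows the rules given for -m: 0
     restricts retrieval to the URLs named on the command line, and -1
     removes the limit.  A minimal Python sketch of that rule, assuming that
     command-line URLs are at depth 0 and each followed link adds one (the
     function name may_follow is hypothetical and not taken from the crawl
     source):

```python
def may_follow(link_depth: int, max_depth: int) -> bool:
    """Decide whether a URL at link_depth is within the -m depth limit.

    link_depth is 0 for URLs given on the command line, 1 for links found
    on those pages, and so on (an assumed numbering).
    """
    if max_depth == -1:
        return True  # unlimited traversal
    return link_depth <= max_depth


# With -m 0, only the command-line URLs themselves are retrieved.
print(may_follow(0, 0))   # prints "True"
print(may_follow(1, 0))   # prints "False"
print(may_follow(7, -1))  # prints "True"
```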

     This product includes software developed by Ericsson Radio	Systems.

     This product includes software developed by the University	of California,
     Berkeley and its contributors.

     The crawl utility has been	developed by Niels Provos.

BSD				 May 29, 2001				   BSD

