WWW::Robot(3)	      User Contributed Perl Documentation	 WWW::Robot(3)

NAME
       WWW::Robot - configurable web traversal engine (for web robots &
       agents)

SYNOPSIS
	   use WWW::Robot;

	   $robot = new	WWW::Robot(
	       'NAME'	  => 'MyRobot',
	       'VERSION'  => '1.000',
	       'EMAIL'	  => 'fred@foobar.com'
	   );

	   # ... configure the robot's operation ...

	   $robot->run(	'http://www.foobar.com/' );

DESCRIPTION
       This module implements a	configurable web traversal engine, for a robot
       or other	web agent.  Given an initial web page (URL), the Robot will
       get the contents	of that	page, and extract all links on the page,
       adding them to a	list of	URLs to	visit.

       Features	of the Robot module include:

       o   Follows the Robot Exclusion Protocol.

       o   Supports the	META element proposed extensions to the	Protocol.

       o   Implements many of the Guidelines for Robot Writers.

       o   Configurable.

       o   Builds on standard Perl 5 modules for WWW, HTTP, HTML, etc.

       A particular application (robot instance) configures the engine using
       hooks, which are Perl functions invoked by the Robot engine at
       specific points in its control loop.

       The robot engine	obeys the Robot	Exclusion protocol, as well as a
       proposed	addition.  See "SEE ALSO" for references to documents
       describing the Robot Exclusion protocol and web robots.

QUESTIONS
       This section contains a number of open questions. I'm interested in
       hearing what people think, and what you've done when faced with
       similar questions.

       o   What	style of API is	preferable for setting attributes? Maybe
	   something like the following:

	       $robot->verbose(1);
	       $traversal = $robot->traversal();

	   I.e.	a method for setting and getting each attribute, depending on
	   whether you passed an argument?

       o   Should the robot module support a standard logging mechanism?  For
           example, a LOGFILE attribute, which is set to either a filename or
           a filehandle reference.  This would need a useful file format.

       o   Should the module also support an ERRLOG attribute, with all
	   warnings and	error messages sent there?

       o   At the moment the robot prints warnings and error messages to
           stderr, as well as returning error status.  Should this behaviour
           be configurable, i.e. should there be a way to turn off warnings?

       The basic architecture of the Robot is as follows:

	   Hook: restore-state
	   Get Next URL
	       Hook: invoke-on-all-url
	       Hook: follow-url-test
	       Hook: invoke-on-followed-url
	       Get contents of URL
	       Hook: invoke-on-contents
	       Skip if not HTML
	       Foreach link on page:
		   Hook: invoke-on-link
		   Hook: add-url-test
		   Add link to robot's queue
	   Continue? Hook: continue-test
	   Hook: save-state
	   Hook: generate-report

       Each of the hook	procedures and functions is described below.  A	robot
       must provide a "follow-url-test"	hook, and at least one of the
       following:

       o   "invoke-on-all-url"

       o   "invoke-on-followed-url"

       o   "invoke-on-contents"

       o   "invoke-on-link"

CONSTRUCTOR
	  $robot = new WWW::Robot( <attribute-value-pairs> );

       Create a	new robot engine instance.  If the constructor fails for any
       reason, a warning message will be printed, and "undef" will be
       returned.

       Having created a new robot, you should configure it using the methods
       described below.  Certain attributes of the Robot can be set during
       creation; they can also be (re)set after creation, using the
       "setAttribute()" method.

       The attributes of the Robot are described below,	in the Robot
       Attributes section.

METHODS
   run
	   $robot->run(	@url_list );

       Invokes the robot, initially traversing the root	URLs provided in
       @url_list, and any which	have been provided with	the "addUrl()" method
       before invoking "run()".	 If you	have not correctly configured the
       robot, the method will return "undef".

       The initial set of URLs can either be passed as arguments to the	run()
       method, or with the addUrl() method before you invoke run().  Each URL
       can be specified	either as a string, or as a URI::URL object.

       Before invoking this method, you	should have provided at	least some of
       the hook	functions.  See	the example given in the EXAMPLES section
       below.

       By default the run() method will	iterate	until there are	no more	URLs
       in the queue.  You can override this behavior by	providing a
       "continue-test" hook function, which checks for the termination
       conditions.  This particular hook function, and use of hook functions
       in general, are described below.
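       For example, the following two invocations are equivalent ways of
       seeding the robot:

           $robot->run( 'http://www.foobar.com/' );

           # or, equivalently:
           $robot->addUrl( 'http://www.foobar.com/' );
           $robot->run();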

   setAttribute
	 $robot->setAttribute( ... attribute-value-pairs ... );

       Change the value	of one or more robot attributes.  Attributes are
       identified using	a string, and take scalar values.  For example,	to
       specify the name	of your	robot, you set the "NAME" attribute:

	  $robot->setAttribute(	'NAME' => 'WebStud' );

       The supported attributes	for the	Robot module are listed	below, in the
       ROBOT ATTRIBUTES	section.

   getAttribute
	 $value	= $robot->getAttribute(	'attribute-name' );

       Queries a Robot for the value of	an attribute.  For example, to query
       the version number of your robot, you would get the "VERSION"
       attribute:

	  $version = $robot->getAttribute( 'VERSION' );

       The supported attributes	for the	Robot module are listed	below, in the
       ROBOT ATTRIBUTES	section.

   getAgent
	 $agent	= $robot->getAgent();

       Returns the agent that is being used by the robot.
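       Since the agent is an LWP::RobotUA object by default (a subclass of
       LWP::UserAgent), you can use it to adjust transport-level settings.
       A small sketch:

           $agent = $robot->getAgent();
           $agent->timeout( 30 );    # give up on a request after 30 seconds
           $agent->delay( 0.5 );     # wait half a minute between requests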

   addUrl
	 $robot->addUrl( $url1,	..., $urlN );

       Used to add one or more URLs to the queue for the robot.	 Each URL can
       be passed as a simple string, or	as a URI::URL object.

       Returns True (non-zero) if all URLs were	successfully added, False
       (zero) if at least one of the URLs could	not be added.

   unshiftUrl
	 $robot->unshiftUrl( $url1, ..., $urlN );

       Used to add one or more URLs to the front of the robot's queue, so
       that they will be visited before any URLs already queued.  Each URL
       can be passed as a simple string, or as a URI::URL object.

       Returns True (non-zero) if all URLs were	successfully added, False
       (zero) if at least one of the URLs could	not be added.

   listUrls
	 $robot->listUrls( );

       Returns a list of the URLs currently in the robot's queue of URLs to
       be traversed.
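       For example, assuming the queue behaves as described above:

           $robot->addUrl( 'http://www.foobar.com/a.html' );      # appended
           $robot->unshiftUrl( 'http://www.foobar.com/b.html' );  # prepended
           foreach $url ($robot->listUrls()) {
               print "queued: $url\n";
           }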

   addHook
	 $robot->addHook( $hook_name, \&hook_function );

	 sub hook_function { ... }

       Register	a hook function	which should be	invoked	by the robot at	a
       specific	point in the control flow. There are a number of hook points
       in the robot, which are identified by a string.	For a list of hook
       points, see the SUPPORTED HOOKS section below.

       If you provide more than	one function for a particular hook, then the
       hook functions will be invoked in the order they	were added.  I.e. the
       first hook function called will be the first hook function you added.

   proxy, no_proxy, env_proxy
       These are convenience methods for setting proxy information on the
       user agent being used to make the requests.

	   $robot->proxy( protocol, proxy );

       Used to specify a proxy for the given scheme.  The protocol argument
       can be a	reference to a list of protocols.

	   $robot->no_proxy(domain1, ... domainN);

       Specifies that proxies should not be used for the specified domains or
       hosts.

	   $robot->env_proxy();

       Load proxy settings from	protocol_proxy environment variables:
       "ftp_proxy", "http_proxy", "no_proxy", etc.

ROBOT ATTRIBUTES
       This section lists the attributes used to configure a Robot object.
       Attributes are set using	the "setAttribute()" method, and queried using
       the "getAttribute()" method.

       Some of the attributes must be set before you start the Robot (with the
       "run()" method).	 These are marked as mandatory in the list below.

       NAME
	   The name of the Robot.  This	should be a sequence of	alphanumeric
	   characters, and is used to identify your Robot.  This is used to
	   set the "User-Agent"	field of HTTP requests,	and so will appear in
	   server logs.

	   mandatory

       VERSION
	   The version number of your Robot.  This should be a floating	point
	   number, in the format N.NNN.

	   mandatory

       EMAIL
	   A valid email address which can be used to contact the Robot's
	   owner, for example by someone who wishes to complain	about the
	   behavior of your robot.

	   mandatory

       VERBOSE
	   A boolean flag which	specifies whether the Robot should display
	   verbose status information as it runs.

	   Default: 0 (false)

       TRAVERSAL
	   Specifies what traversal style should be adopted by the Robot.
	   Valid values	are depth and breadth.

	   Default: depth

       IGNORE_TEXT
	   Specifies whether the HTML structure	passed to the invoke-on-
	   contents hook function should include the textual content of	the
	   page, or just the HTML elements.

	   Default: 1 (true)

       IGNORE_UNKNOWN
	   Specifies whether the HTML structure passed to the invoke-on-
	   contents hook function should ignore unknown HTML elements.

	   Default: 1 (true)

       CHECK_MIME_TYPES
	   If the robot cannot easily determine the MIME type of a link from
	   its URL, this attribute tells it to issue a HEAD request to check
	   the MIME type directly, before adding the link.

	   Default: 1 (true)

       USERAGENT
	   Allows the caller to	specify	its own	user agent to make the HTTP
	   requests.

	   Default: LWP::RobotUA object	created	by the robot

       ACCEPT_LANGUAGE
	   Optionally allows the caller	to specify the list of languages that
	   the robot accepts. This is added as an "Accept-Language" header
	   field in the	HTTP request. Takes an array reference.

       DELAY
	   Optionally set the delay between requests for the user agent, in
	   minutes. The	default	for this is 1 (see LWP::RobotUA).
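       A configuration sketch combining several of the attributes above (the
       values are illustrative):

           $robot->setAttribute(
               'VERBOSE'         => 1,            # report progress while running
               'TRAVERSAL'       => 'breadth',    # breadth-first traversal
               'IGNORE_TEXT'     => 0,            # keep text content in the parse tree
               'ACCEPT_LANGUAGE' => ['en'],       # sent as an Accept-Language header
               'DELAY'           => 2,            # two minutes between requests
           );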

SUPPORTED HOOKS
       This section lists the hooks which are supported	by the WWW::Robot
       module.	The first two arguments	passed to a hook function are always
       the Robot object	followed by the	name of	the hook being invoked.	I.e.
       the start of a hook function should look	something like:

	   sub my_hook_function
	   {
	       my $robot = shift;
	       my $hook  = shift;
	       # ... other, hook-specific, arguments ...
	   }

       Wherever	a hook function	is passed a $url argument, this	will be	a
       URI::URL	object,	with the URL fully specified.  I.e. even if the	URL
       was seen	in a relative link, it will be passed as an absolute URL.

   restore-state
	  sub hook { my($robot,	$hook_name) = @_; }

       This hook is invoked just before	entering the main iterative loop of
       the robot.  The intention is that the hook will be used to restore
       state, if such an operation is required.

       This can	be helpful if the robot	is running in an incremental mode,
       where state is saved between each run of	the robot.

   invoke-on-all-url
	  sub hook { my($robot,	$hook_name, $url) = @_;	}

       This hook is invoked on all URLs seen by the robot, regardless of
       whether the URL is actually traversed.  In addition to the standard
       $robot and $hook arguments, the third argument is $url, the URL
       currently being considered by the robot.

       For a given URL,	the hook function will be invoked at most once,
       regardless of how many times the	URL is seen by the Robot.  If you are
       interested in seeing the	URL every time,	you can	use the	invoke-on-link
       hook.
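       For example, a hook which records every distinct URL the robot sees
       (%seen is an illustrative variable of your own):

           $robot->addHook( 'invoke-on-all-url', \&record_url );

           sub record_url {
               my($robot, $hook, $url) = @_;
               $seen{ $url->as_string }++;    # called at most once per URL
           }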

   follow-url-test
	  sub hook { my($robot,	$hook_name, $url) = @_;	return $boolean; }

       This hook is invoked to determine whether the robot should traverse the
       given URL.  If the hook function	returns	0 (zero), then the robot will
       do nothing further with the URL.	 If the	hook function returns non-
       zero, then the robot will get the contents of the URL, invoke further
       hooks, and extract links	if the contents	are HTML.

   invoke-on-followed-url
	  sub hook { my($robot,	$hook_name, $url) = @_;	}

       This hook is invoked on URLs which are about to be traversed by the
       robot; i.e. URLs	which have passed the follow-url-test hook.

   invoke-on-get-error
	  sub hook { my($robot,	$hook_name, $url, $response) = @_; }

       This hook is invoked if the Robot ever fails to get the contents	of a
       URL.  The $response argument is an object of type HTTP::Response.
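       For example, a hook which logs failed requests, using the code() and
       message() methods of HTTP::Response:

           sub get_error_hook {
               my($robot, $hook, $url, $response) = @_;
               warn "GET $url failed: ", $response->code,
                    ' ', $response->message, "\n";
           }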

   invoke-on-contents
	  sub hook { my($robot,	$hook, $url, $response,	$structure) = @_; }

       This hook function is invoked for all URLs for which the	contents are
       successfully retrieved.

       The $url	argument is a URI::URL object for the URL currently being
       processed by the	Robot engine.

       The $response argument is an HTTP::Response object, the result of the
       GET request on the URL.

       The $structure argument is an HTML::Element object which	is the root of
       a tree structure	constructed from the contents of the URL.  You can set
       the "IGNORE_TEXT" attribute to specify whether the structure passed
       includes	the textual content of the page, or just the HTML elements.
       You can set the "IGNORE_UNKNOWN" attribute to specify whether the
       structure passed includes unknown HTML elements.

   invoke-on-link
	  sub hook { my($robot,	$hook_name, $from_url, $to_url)	= @_; }

       This hook function is invoked for all links seen	as the robot
       traverses.  When	the robot is parsing a page ($from_url)	for links, for
       every link seen the invoke-on-link hook is invoked with the URL of the
       source page, and	the destination	URL.  The destination URL is in
       canonical form.
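       For example, a hook which builds a simple link graph as a hash of
       array references (%links is an illustrative variable of your own):

           sub link_hook {
               my($robot, $hook, $from_url, $to_url) = @_;
               push @{ $links{ $from_url->as_string } }, $to_url->as_string;
           }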

   add-url-test
	  sub hook { my($robot, $hook_name, $url) = @_; return $boolean; }

       This hook function is invoked for all links seen	as the robot
       traverses.  If the hook function	returns	non-zero, then the robot will
       add the URL given by $url to its	list of	URLs to	be traversed.
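       For example, a hook which only queues URLs on the host you started
       from ($root_host is an illustrative variable you would set yourself;
       this assumes an http-style URL with a host component):

           sub add_url_hook {
               my($robot, $hook, $url) = @_;
               return $url->host eq $root_host;    # stay on the original server
           }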

   continue-test
	  sub hook { my($robot)	= @_; }

       This hook is invoked at the end of the robot's main iterative loop.  If
       the hook	function returns non-zero, then	the robot will continue
       execution with the next URL.  If	the hook function returns zero,	then
       the Robot will terminate	the main loop, and close down after invoking
       the following two hooks.

       If no "continue-test" hook function is provided,	then the robot will
       always loop.
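       For example, a hook which stops the robot after a fixed number of
       pages ($page_count is an illustrative counter you would increment in
       another hook, such as invoke-on-contents):

           sub continue_hook {
               my($robot) = @_;
               return $page_count < 100;    # stop after 100 pages
           }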

   save-state
	  sub hook { my($robot)	= @_; }

       This hook is used to save any state information required	by the robot
       application.
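       A sketch of a matching save-state / restore-state pair, using the
       standard Storable module to persist a hash of visited URLs (%visited
       and the file name are illustrative):

           use Storable qw(store retrieve);

           sub save_hook {
               my($robot) = @_;
               store( \%visited, 'robot-state.dat' );
           }

           sub restore_hook {
               my($robot) = @_;
               %visited = %{ retrieve('robot-state.dat') } if -e 'robot-state.dat';
           }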

   generate-report
	  sub hook { my($robot)	= @_; }

       This hook is used to generate a report for the run of the robot,	if
       such is desired.

   modified-since
       If you provide this hook	function, it will be invoked for each URL
       before the robot	actually requests it.  The function can	return a time
       to use with the If-Modified-Since HTTP header.  This can	be used	by a
       robot to	only process those pages which have changed since the last
       visit.

       Your hook function should be declared as	follows:

	   sub modified_since_hook
	   {
	       my $robot = shift;	 # instance of Robot module
	       my $hook	 = shift;	 # name	of hook	invoked
	       my $url	 = shift;	 # URI::URL for	the url	in question

	       # ... calculate time ...
	       return $time;
	   }

       If your function returns anything other than "undef", then an If-
       Modified-Since: field will be added to the request header.

   invoke-after-get
       This hook function is invoked immediately after the robot makes each
       GET request.  This means your hook function will see every type of
       response, not just successful GETs.  In addition to the standard
       $robot and $hook arguments, the hook function is passed the $url we
       tried to GET, and the $response which resulted.

       If you provide a modified-since hook, you should also provide an
       invoke-after-get hook function, and look for status code 304 (or
       RC_NOT_MODIFIED if you are using HTTP::Status, which you should be
       :-):

	   sub after_get_hook
	   {
	       my($robot, $hook, $url, $response) = @_;

	       if ($response->code == RC_NOT_MODIFIED)
	       {
		   # page unchanged since our last visit; nothing to do
	       }
	   }

EXAMPLES
       This section illustrates	use of the Robot module, with code snippets
       from several sample Robot applications.	The code here is not intended
       to show the right way to	code a web robot, but just illustrates the API
       for using the Robot.

   Validating Robot
       This is a simple robot which you could use to validate your web site.
       The robot uses weblint to check the contents of URLs of type
       text/html.

	  #!/usr/bin/perl
	  require 5.002;
	  use WWW::Robot;

	  $rootDocument	= $ARGV[0];

	  $robot = new WWW::Robot('NAME'     =>	 'Validator',
				  'VERSION'  =>	 1.000,
				  'EMAIL'    =>	 'fred@foobar.com');

	  $robot->addHook('follow-url-test', \&follow_test);
	  $robot->addHook('invoke-on-contents',	\&validate_contents);

	  $robot->run($rootDocument);

	  #-------------------------------------------------------
	  sub follow_test {
	     my($robot,	$hook, $url) = @_;

	     return 0 unless $url->scheme eq 'http';
	     return 0 if $url =~ /\.(gif|jpg|png|xbm|au|wav|mpg)$/;

	     #---- we're only interested in pages on our site ----
	     return $url =~ /^$rootDocument/;
	  }

	  #-------------------------------------------------------
	  sub validate_contents	{
	     my($robot,	$hook, $url, $response,	$structure) = @_;

	     return unless $response->content_type eq 'text/html';

	     # some validation on $structure ...

	  }

       If you are behind a firewall, then you will have	to add something like
       the following, just before calling the "run()" method:

	  $robot->proxy(['ftp',	'http',	'wais',	'gopher'],
			'http://firewall:8080/');

MODULE DEPENDENCIES
       The Robot.pm module builds on a lot of existing Net, WWW	and other Perl
       modules.	 Some of the modules are part of the core Perl distribution,
       and the latest versions of all modules are available from the
       Comprehensive Perl Archive Network (CPAN).  The modules used are:

       HTTP::Request
	   This	module is used to construct HTTP requests, when	retrieving the
	   contents of a URL, or using the HEAD	request	to see if a URL
	   exists.

       HTML::LinkExtor
	   This	is used	to extract the URLs from the links on a	page.

       HTML::TreeBuilder
	   This	module builds a	tree data structure from the contents of an
	   HTML	page.  This is also used to check for page-specific Robot
	   exclusion commands, using the META element.

       URI::URL
	   This	module implements a class for URL objects, providing
	   resolution of relative URLs,	and access to the different components
	   of a	URL.

       LWP::RobotUA
	   This	is a wrapper around the	LWP::UserAgent class.  A UserAgent is
	   used	to connect to servers over the network,	and make requests.
	   The RobotUA module provides transparent compliance with the Robot
	   Exclusion Protocol.

       HTTP::Status
	   This	has definitions	for HTTP response codes, so you	can say
	   RC_NOT_MODIFIED instead of 304.

       All of these modules are	available as part of the libwww-perl5
       distribution, which is also available from CPAN.

SEE ALSO
       The SAS Group Home Page
	   http://www.cre.canon.co.uk/sas.html

	   This	is the home page of the	Group at Canon Research	Centre Europe
	   who are responsible for Robot.pm.

       Robot Exclusion Protocol
	   http://info.webcrawler.com/mak/projects/robots/norobots.html

	   This	is a de	facto standard which defines how a `well behaved'
	   Robot client	should interact	with web servers and web pages.

       Guidelines for Robot Writers
	   http://info.webcrawler.com/mak/projects/robots/guidelines.html

	   Guidelines and suggestions for those	who are	(considering)
	   developing a	web robot.

       Weblint Home Page
	   http://www.cre.canon.co.uk/~neilb/weblint/

	   Weblint is a Perl script which is used to check HTML for syntax
	   errors and stylistic problems, in the same way lint is used to
	   check C.

       Comprehensive Perl Archive Network (CPAN)
	   http://www.perl.com/perl/CPAN/

	   This	is a well-organized collection of Perl resources, such as
	   modules, documents, and scripts.  CPAN is mirrored at FTP sites
	   around the world.

VERSION
       This documentation describes version 0.021 of the Robot module.	The
       module requires at least	version	5.002 of Perl.

AUTHOR
	   Neil	Bowers <neilb@cre.canon.co.uk>
	   Ave Wrigley <wrigley@cre.canon.co.uk>

	   Web Department, Canon Research Centre Europe

COPYRIGHT
	Copyright (C) 1997, Canon Research Centre Europe.
	Copyright (C) 2006,2007	Konstantin Matyukhin.

       This module is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.32.0			  2009-08-07			 WWW::Robot(3)
