Web::Scraper(3)	      User Contributed Perl Documentation      Web::Scraper(3)

NAME
       Web::Scraper - Web Scraping Toolkit using HTML and CSS Selectors	or
       XPath expressions

SYNOPSIS
	 use URI;
	 use Web::Scraper;
	 use Encode;

	 # First, create your scraper block
	 my $authors = scraper {
	     # Parse all TDs inside 'table[width="100%"]', store them into
	     # an array 'authors'.  We embed other scrapers for each TD.
	     process 'table[width="100%"] td', "authors[]" => scraper {
	       # And, in each TD,
	       # get the URI of	"a" element
	       process "a", uri	=> '@href';
	       # get text inside "small" element
	       process "small",	fullname => 'TEXT';
	     };
	 };

	 my $res = $authors->scrape( URI->new("http://search.cpan.org/author/?A") );

	 # iterate the array 'authors'
	 for my	$author	(@{$res->{authors}}) {
	     # output is like:
	     # Andy Adler      http://search.cpan.org/~aadler/
	     # Aaron K Dancygier       http://search.cpan.org/~aakd/
	     # Aamer Akhter    http://search.cpan.org/~aakhter/
	     print Encode::encode("utf8", "$author->{fullname}\t$author->{uri}\n");
	 }

       The resulting structure would resemble this (visually):

	 {
	   authors => [
	     { fullname => $fullname, uri => $uri },
	     { fullname => $fullname, uri => $uri },
	   ]
	 }

DESCRIPTION
       Web::Scraper is a web scraper toolkit, inspired by Ruby's equivalent
       Scrapi. It provides a DSL-ish interface for traversing HTML documents
       and returning a neatly arranged Perl data structure.

       The scraper and process blocks provide a way to define which segments
       of a document to extract.  They understand CSS selectors as well as
       XPath expressions.

METHODS
   scraper
	 $scraper = scraper { ... };

       Creates a new Web::Scraper object by wrapping the DSL code that will be
       fired when the scrape method is called.

   scrape
	 $res =	$scraper->scrape(URI->new($uri));
	 $res =	$scraper->scrape($html_content);
	 $res =	$scraper->scrape(\$html_content);
	 $res =	$scraper->scrape($http_response);
	 $res =	$scraper->scrape($html_element);

       Retrieves the HTML from a URI, HTTP::Response, HTML::Tree or text
       string, creates a DOM object, then fires the scraper callback code to
       retrieve the data structure.

       If you pass a URI or HTTP::Response object, Web::Scraper automatically
       guesses the encoding of the content by looking at the Content-Type
       header and META tags. Otherwise you need to decode the HTML to Unicode
       yourself before passing it to the scrape method.

       You can optionally pass the base URL when you pass the HTML content as
       a string instead of a URI or HTTP::Response object.

	 $res =	$scraper->scrape($html_content,	"http://example.com/foo");

       This way	Web::Scraper can resolve the relative links found in the
       document.
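
       For instance, a sketch of resolving relative links (the HTML fragment
       and URL below are made up for illustration, not from this manual):

	 ```perl
	 use Web::Scraper;

	 # Collect every href; relative hrefs are resolved against the
	 # base URL passed as the second argument to scrape.
	 my $links = scraper {
	     process 'a', 'links[]' => '@href';
	 };

	 my $res = $links->scrape('<a href="/foo">foo</a>', "http://example.com/");
	 # $res->{links}[0] should stringify to "http://example.com/foo"
	 ```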

   process
	 scraper {
	     process "tag.class", key => 'TEXT';
	     process '//tag[contains(@foo, "bar")]', key2 => '@attr';
	     process '//comment()', 'comments[]' => 'TEXT';
	 };

       process is the method that finds matching elements in the HTML with a
       CSS selector or XPath expression, then extracts text or attributes
       into the result stash.

       If the first argument begins with "//" or "id(" it is treated as an
       XPath expression; otherwise it is treated as a CSS selector.

	 # <span class="date">2008/12/21</span>
	 # date	=> "2008/12/21"
	 process ".date", date => 'TEXT';

	 # <div	class="body"><a	href="http://example.com/">foo</a></div>
	 # link	=> URI->new("http://example.com/")
	 process ".body	> a", link => '@href';

	 # <div	class="body"><!-- HTML Comment here --><a href="http://example.com/">foo</a></div>
	 # comment => "	HTML Comment here "
	 #
	 # NOTE: comment nodes are accessible only when
	 # HTML::TreeBuilder::XPath (version >= 0.14) and/or
	 # HTML::TreeBuilder::LibXML (version >= 0.13) is installed
	 process "//div[contains(@class, 'body')]/comment()", comment => 'TEXT';

	 # <div	class="body"><a	href="http://example.com/">foo</a></div>
	 # link	=> URI->new("http://example.com/"), text => "foo"
	 process ".body	> a", link => '@href', text => 'TEXT';

	 # <ul><li>foo</li><li>bar</li></ul>
	 # list	=> [ "foo", "bar" ]
	 process "li", "list[]"	=> "TEXT";

	 # <ul><li id="1">foo</li><li id="2">bar</li></ul>
	 # list	=> [ { id => "1", text => "foo"	}, { id	=> "2",	text =>	"bar" }	];
	 process "li", "list[]"	=> { id	=> '@id', text => "TEXT" };

   process_first
       "process_first" is the same as "process"	but stops when the first
       matching	result is found.

	 # <span class="date">2008/12/21</span>
	 # <span class="date">2008/12/22</span>
	 # date	=> "2008/12/21"
	 process_first ".date",	date =>	'TEXT';

   result
       "result"	allows to return not the default value after processing	but a
       single value specified by a key or a hash reference built from several
       keys.

	 process 'a', 'want[]' => 'TEXT';
	 result	'want';
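
       A complete sketch using "result" (the inline HTML is a made-up
       fragment):

	 ```perl
	 use Web::Scraper;

	 # Collect all link texts, then return just the 'want' array
	 # instead of the enclosing result hash.
	 my $scraper = scraper {
	     process 'a', 'want[]' => 'TEXT';
	     result 'want';
	 };

	 my $links = $scraper->scrape(\'<a>foo</a> <a>bar</a>');
	 # $links is expected to be the array reference itself,
	 # e.g. [ "foo", "bar" ]
	 ```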

EXAMPLES
       There are many examples in the "eg/" dir	packaged in this distribution.
       It is recommended to look through these.

NESTED SCRAPERS
       Scrapers can be nested, allowing you to run another scraper on each of
       the already-matched elements.

	 # <ul>
	 # <li class="foo"><a href="foo1">bar1</a></li>
	 # <li class="bar"><a href="foo2">bar2</a></li>
	 # <li class="foo"><a href="foo3">bar3</a></li>
	 # </ul>
	 # friends => [ {href => 'foo1'}, {href => 'foo2'}, {href => 'foo3'} ];
	 process 'li', 'friends[]' => scraper {
	     process 'a', href => '@href';
	 };

FILTERS
       Filters are applied to the result after processing. They	can be
       declared	as anonymous subroutines or as class names.

	 process $exp, $key => [ 'TEXT', sub { s/foo/bar/ } ];
	 process $exp, $key => [ 'TEXT', 'Something' ];
	 process $exp, $key => [ 'TEXT', '+MyApp::Filter::Foo' ];

       Filters can be stacked:

	 process $exp, $key => [ '@href', 'Foo', '+MyApp::Filter::Bar',	\&baz ];

       More about filters can be found in the Web::Scraper::Filter
       documentation.
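
       A class-based filter is a Web::Scraper::Filter subclass whose "filter"
       method receives each extracted value; a minimal sketch (the package
       name is made up):

	 ```perl
	 package MyApp::Filter::Foo;
	 use base qw( Web::Scraper::Filter );

	 # The return value replaces the extracted value in the stash.
	 sub filter {
	     my ($self, $value) = @_;
	     $value =~ s/foo/bar/g;
	     return $value;
	 }

	 1;
	 ```

       It would then be referenced by name in a process declaration, e.g. as
       '+MyApp::Filter::Foo' above.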

XML backends
       By default HTML::TreeBuilder::XPath is used; this can be replaced with
       an XML::LibXML backend by using the Web::Scraper::LibXML module.

	 use Web::Scraper::LibXML;

	 # same	as Web::Scraper
	 my $scraper = scraper { ... };

AUTHOR
       Tatsuhiko Miyagawa <miyagawa@bulknews.net>

LICENSE
       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

SEE ALSO
       <http://blog.labnotes.org/category/scrapi/>

       HTML::TreeBuilder::XPath

perl v5.32.0			  2014-10-20		       Web::Scraper(3)
