HTML::ExtractContent(3)  User Contributed Perl Documentation  HTML::ExtractContent(3)

NAME
       HTML::ExtractContent - An HTML content extractor	with scoring
       heuristics

SYNOPSIS
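	use HTML::ExtractContent;
	use LWP::UserAgent;

	# Fetch the page; decoded_content returns a decoded character string.
	my $agent = LWP::UserAgent->new;
	my $res	= $agent->get('http://www.example.com/');
	die $res->status_line unless $res->is_success;

	# Extract the main content and print it as plain text.
	my $extractor =	HTML::ExtractContent->new;
	$extractor->extract($res->decoded_content);
	print $extractor->as_text;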

DESCRIPTION
       HTML::ExtractContent is a module for extracting content from HTML with
       scoring heuristics. It guesses which block of HTML looks like content
       by scoring each block on the number of punctuation marks and the
       length of its non-tag text. It also guesses whether the content ends
       in the current block or continues into the next block.
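
       The idea of scoring blocks can be illustrated with a toy sketch. This
       is not the module's actual algorithm: the function toy_score, its
       weighting, and the sample blocks are all hypothetical, chosen only to
       show how punctuation count and non-tag text length can separate
       article text from navigation.

```perl
use strict;
use warnings;

# Toy score for one block of HTML (hypothetical, not the module's formula):
# the length of the text left after removing tags, weighted by the number
# of punctuation marks it contains.
sub toy_score {
    my ($block) = @_;
    (my $text = $block) =~ s/<[^>]*>//g;     # drop tags, keep the text
    $text =~ s/\s+/ /g;                      # collapse runs of whitespace
    my $punct = () = $text =~ /[.,!?;:]/g;   # count punctuation marks
    return length($text) * (1 + $punct);
}

my @blocks = (
    '<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>',
    '<p>This is the article body. It has sentences, commas, and full stops.</p>',
);

# Pick the block with the highest toy score.
my ($best) = sort { toy_score($b) <=> toy_score($a) } @blocks;
print "$best\n";
```

       With these two sample blocks the paragraph outscores the navigation
       list, because the list has little text and no punctuation.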

METHODS
       new
	    $extractor = HTML::ExtractContent->new;

	   Creates a new HTML::ExtractContent instance.

       extract
	    $extractor->extract($html);

	   Extracts content from $html.	 $html must have its UTF-8 flag	on.
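
	   When the HTML does not come from decoded_content as in the
	   SYNOPSIS, for example when it is read as raw bytes from a file, it
	   can be decoded first. The following is a minimal sketch using the
	   core Encode module; the sample byte string is made up, and it
	   assumes the input is UTF-8 encoded.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 encoded bytes, as they might arrive from a file or socket.
my $bytes = "<p>caf\xC3\xA9 \xE2\x98\x95</p>";

# decode() turns the bytes into a Perl character string.
my $html = decode('UTF-8', $bytes);

# utf8::is_utf8() reports whether the UTF-8 flag is set on a scalar.
print utf8::is_utf8($bytes) ? "bytes: flag on\n" : "bytes: flag off\n";
print utf8::is_utf8($html)  ? "html: flag on\n"  : "html: flag off\n";
```

	   The decoded $html is the kind of string that extract expects.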

       as_text
	    $extractor->extract($html)->as_text;

	   Returns the extracted content as plain text. All tags are removed.

       as_html
	    $extractor->extract($html)->as_html;

	   Returns the extracted content as HTML text.  Note that the returned
	   text is neither fully tagged nor valid HTML.  It doesn't contain
	   tags such as <html>, and it may have block tags that are not closed,
	   or closed but not opened.  This method is intended for cases where,
	   for example, you need to analyse link tags in the text.
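
	   One way to analyse link tags in such a fragment is a tolerant
	   parser such as HTML::LinkExtor from the HTML-Parser distribution,
	   which is assumed to be installed. In this sketch the $fragment
	   string is a made-up stand-in for the value returned by as_html,
	   including an unclosed <div>.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical stand-in for $extractor->as_html: note the unclosed <div>.
my $fragment = '<div><p>See <a href="http://www.example.com/a">first</a>'
             . ' and <a href="/b">second</a>.</p>';

# Without a callback, HTML::LinkExtor collects links for later retrieval.
my $parser = HTML::LinkExtor->new;
$parser->parse($fragment);
$parser->eof;

# Each link is an array ref: [ $tag, $attr_name => $url, ... ].
my @hrefs;
for my $link ($parser->links) {
    my ($tag, %attr) = @$link;
    push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
}

print "$_\n" for @hrefs;
```

	   Relative URLs such as /b are left as they are here; passing a base
	   URL as the second argument to HTML::LinkExtor->new makes the parser
	   return absolute URLs instead.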

ACKNOWLEDGMENT
       Hiromichi Kishi contributed to the development of this module as a
       pair-programming partner.

       Implementation of this module is	based on the Ruby module
       ExtractContent by Nakatani Shuyo.

AUTHOR
       INA Lintaro <tarao at cpan.org>

COPYRIGHT
       Copyright (C) 2008 INA Lintaro /	Hatena.	All rights reserved.

   Copyright of	the original implementation
       Copyright (c) 2007/2008 Nakatani	Shuyo /	Cybozu Labs Inc. All rights
       reserved.

LICENCE
       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

SEE ALSO
       <http://rubyforge.org/projects/extractcontent/>

perl v5.24.1			  2015-03-10	       HTML::ExtractContent(3)