HTML::Parser::Simple(3)  User Contributed Perl Documentation  HTML::Parser::Simple(3)

NAME
       HTML::Parser::Simple - Parse nice HTML files without needing a compiler

Synopsis
	       #!/usr/bin/env perl

	       use strict;
	       use warnings;

	       use HTML::Parser::Simple;

	       # -------------------------

	       # Method	1:

	       my($p) =	HTML::Parser::Simple ->	new
	       (
		       input_file  => 'data/s.1.html',
		       output_file => 'data/s.2.html',
	       );

	       $p -> parse_file;

	       # Method	2:

	       my($p) =	HTML::Parser::Simple ->	new;

	       $p -> parse_file('data/s.1.html', 'data/s.2.html');

	       # Method	3:

	       my($p) =	HTML::Parser::Simple ->	new;

	       print $p	-> parse('<html>...</html>') ->	traverse($p -> root) ->	result;

       Of course, these	can be abbreviated by using method chaining. E.g.
       Method 2	could be:

	       HTML::Parser::Simple -> new -> parse_file('data/s.1.html', 'data/s.2.html');

       See scripts/parse.html.pl and scripts/parse.xhtml.pl.

Description
       "HTML::Parser::Simple" is a pure	Perl module.

       It parses HTML V	4 files, and generates a tree of nodes,	with 1 node
       per HTML	tag.

       The data	associated with	each node is documented	in the "FAQ".

       See also	HTML::Parser::Simple::Attributes and
       HTML::Parser::Simple::Reporter.

Distributions
       This module is available	as a Unix-style	distro (*.tgz).

       See <http://savage.net.au/Perl-modules.html> for	details.

       See <http://savage.net.au/Perl-modules/html/installing-a-module.html>
       for help	on unpacking and installing.

Constructor and	initialization
       new(...)	returns	an object of type "HTML::Parser::Simple".

       This is the class constructor.

       Usage: "HTML::Parser::Simple -> new".

       This method takes a hash	of options.

       Call "new()" as "new(option_1 =>	value_1, option_2 => value_2, ...)".

       Available options (each one of which is also a method):

       o input_file => $a_file_name
	   This	takes the file name, including the path, of the	input file.

	   Default: '' (the empty string).

       o output_file =>	$a_file_name
	   This	takes the file name, including the path, of the	output file.

	   Default: '' (the empty string).

       o verbose => $Boolean
	   This	takes either a 0 or a 1.

	   Write more or less progress messages.

	   Default: 0.

       o xhtml => $Boolean
	   This	takes either a 0 or a 1.

	   0 means do not accept an XML	declaration, such as <?xml
	   version="1.0" encoding="UTF-8"?> at the start of the	input file,
	   and some other XHTML	features, explained next.

	   1 means accept XHTML	input.

	   Default: 0.

	   The only XHTML changes to this code,	so far,	are:

	   o Accept the	XML declaration
	       E.g.: <?xml version="1.0" standalone='yes'?>.

	   o Accept attribute names containing the ':' char
	       E.g.: <html xmlns="http://www.w3.org/1999/xhtml"	xml:lang="en"
	       lang="en">.
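
       As a minimal sketch of the options above (the values are arbitrary and
       chosen only for illustration), the following sets options via new() and
       then reads or changes them through the methods of the same names:

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use HTML::Parser::Simple;

# Options are passed to new() as key => value pairs.
my($p) = HTML::Parser::Simple -> new(verbose => 0, xhtml => 1);

# Each option is also a method, so it can be read back...
print 'xhtml:   ', $p -> xhtml, "\n";

# ... or changed after the object has been created.
$p -> xhtml(0);
$p -> verbose(1);

print 'xhtml:   ', $p -> xhtml, "\n";
print 'verbose: ', $p -> verbose, "\n";
```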

Methods
   block()
       Returns a hashref where the keys	are the	names of block-level HTML
       tags.

       The corresponding values	in the hashref are just	1.

       Typical keys: address, form, p, table, tr.

       Note: Some keys,	e.g. tr, are also returned by "self_close()".

   current_node()
       Returns the Tree::Simple	object which the parser	calls the current
       node.

   depth()
       Returns the nesting depth of the	current	tag.

       The method is just here in case you need	it.

   empty()
       Returns a hashref where the keys	are the	names of HTML tags of type
       empty.

       The corresponding values	in the hashref are just	1.

       Typical keys: area, base, input,	wbr.

   inline()
       Returns a hashref where the keys	are the	names of HTML tags of type
       inline.

       The corresponding values	in the hashref are just	1.

       Typical keys: a,	em, img, textarea.

   input_file($in_file_name)
       Gets or sets the input file name used by
       "parse_file($input_file_name, $output_file_name)".

       Note: The parameters passed in to "parse_file($input_file_name,
       $output_file_name)" take precedence over the input_file and
       output_file parameters passed in to "new()", and over the internal
       values set with "input_file($in_file_name)" and
       "output_file($out_file_name)".

       'input_file' is a parameter to "new()". See "Constructor and
       initialization" for details.

   log($msg)
       Print $msg to STDERR if "new()" was called as "new(verbose => 1)", or
       if "$p -> verbose(1)" was called.

       Otherwise, print	nothing.

   new()
       This is the constructor.	See "Constructor and initialization" for
       details.

   node_type()
       Returns the type of the most recently created node: global, head, or
       body.

       See the first question in the "FAQ" for details.

   output_file($out_file_name)
       Gets or sets the output file name used by
       "parse_file($input_file_name, $output_file_name)".

       Note: The parameters passed in to "parse_file($input_file_name,
       $output_file_name)" take precedence over the input_file and
       output_file parameters passed in to "new()", and over the internal
       values set with "input_file($in_file_name)" and
       "output_file($out_file_name)".

       'output_file' is a parameter to "new()". See "Constructor and
       initialization" for details.

   parse($html)
       Returns the invocant. Thus "$p -> parse"	returns	$p. This allows	for
       method chaining.	See the	"Synopsis".

       Parses the string of HTML in $html, and builds a	tree of	nodes.

       After calling "$p -> parse($html)", you must call "$p ->	traverse($p ->
       root)" before calling "$p -> result".

       Alternately, use	"$p -> parse_file", which calls	all these methods for
       you.

       Note: "parse()" may be called directly or via "parse_file()".

   parse_file($input_file_name,	$output_file_name)
       Returns the invocant. Thus "$p -> parse_file" returns $p. This allows
       for method chaining. See	the "Synopsis".

       Parses the HTML in the input file, and writes the result	to the output
       file.

       "parse_file()" calls "parse($html)" and "traverse($node)", using	"$p ->
       root" for $node.

       Note: The parameters passed in to "parse_file($input_file_name,
       $output_file_name)" take precedence over the input_file and
       output_file parameters passed in	to "new()", and	over the internal
       values set with "input_file($in_file_name)" and
       "output_file($out_file_name)".

       Lastly, the parameters passed in	to "parse_file($input_file_name,
       $output_file_name)" are used to update the internal values set with the
       input_file and output_file parameters passed in to "new()", or set with
       calls to	"input_file($in_file_name)" and	"output_file($out_file_name)".
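
       The precedence rule can be sketched as follows. The scratch directory,
       the file names and the sample HTML are assumptions made only so the
       example is self-contained; the 'ignored' names given to new() are
       placeholders which are never opened.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use File::Spec;
use File::Temp;
use HTML::Parser::Simple;

# Scratch files, so the sketch does not depend on data/s.1.html.
my($dir)      = File::Temp -> newdir;
my($in_name)  = File::Spec -> catfile($dir, 'in.html');
my($out_name) = File::Spec -> catfile($dir, 'out.html');

open(my $fh, '>', $in_name) || die "Cannot open $in_name: $!";
print $fh '<html><head><title>T</title></head><body><p>Hi</p></body></html>';
close $fh;

# These constructor values are overridden by the call to parse_file().
my($p) = HTML::Parser::Simple -> new
(
	input_file  => 'ignored.in.html',
	output_file => 'ignored.out.html',
);

$p -> parse_file($in_name, $out_name);

# The parameters to parse_file() have replaced the internal values.
print 'input_file:  ', $p -> input_file, "\n";
print 'output_file: ', $p -> output_file, "\n";
```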

   result()
       Returns the string which	is the result of the parse.

       See scripts/parse.html.pl.

   root()
       Returns the Tree::Simple	object which the parser	calls the root of the
       tree of nodes.

   self_close()
       Returns a hashref where the keys	are the	names of HTML tags of type
       self close.

       The corresponding values	in the hashref are just	1.

       Typical keys: dd, dt, p,	tr.

       Note: Some keys,	e.g. tr, are also returned by "block()".
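
       Since "block()", "empty()", "inline()" and "self_close()" all return
       hashrefs keyed by tag name, a tag can be classified with plain hash
       lookups. This is only a sketch; categories() is a hypothetical helper,
       not part of the module.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use HTML::Parser::Simple;

# Hypothetical helper: list every category a tag name belongs to.
sub categories
{
	my($parser, $tag) = @_;
	my(@found);

	for my $method (qw/block empty inline self_close/)
	{
		# Each method returns a hashref whose values are just 1.
		push @found, $method if ($parser -> $method() -> {lc $tag});
	}

	return @found;
}

my($p) = HTML::Parser::Simple -> new;

for my $tag (qw/tr img em/)
{
	print "$tag: ", join(', ', categories($p, $tag) ), "\n";
}
```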

   tagged_attribute()
       Returns a string	to be used as a	regexp,	to capture tags	and their
       optional	attributes.

       It does not return qr/$s/; it just returns $s.

       This regexp takes one of	two forms, depending on	the state of the xhtml
       option. See "xhtml($Boolean)".

       The regexp has four (4) sets of capturing parentheses:

       o 1 for the whole tag and attribute and trailing	/ combination
	   E.g.: <(....)>

       o 1 for the tag itself
	   E.g.: <(img)...>

       o 1 for the optional attributes of the tag
	   E.g.: <img (src="/graph.svg"	alt="A graph")>

       o 1 for the optional trailing / of the tag
	   E.g.: <img ... (/)>
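
       A minimal sketch of using the returned string follows. It assumes the
       4 sets of capturing parentheses are numbered in the order listed above;
       that order is an assumption, so check it against your installed
       version before relying on it.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use HTML::Parser::Simple;

my($p)     = HTML::Parser::Simple -> new;
my($s)     = $p -> tagged_attribute; # A plain string, not a qr// object.
my($regex) = qr/$s/;
my($tag)   = '<img src="/graph.svg" alt="A graph" />';

if (my @capture = $tag =~ $regex)
{
	# Assumed order of the captures. Unmatched groups default to ''.
	my(@label) = ('Whole', 'Tag', 'Attributes', 'Trailing /');

	printf "%-10s: '%s'\n", $label[$_], $capture[$_] // '' for 0 .. $#label;
}
else
{
	print "No match\n";
}
```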

   traverse($node)
       Returns the invocant. Thus "$p -> traverse" returns $p. This allows for
       method chaining.	 See the "Synopsis".

       Traverses the tree of nodes, starting at	$node.

       You normally call this as "$p ->	traverse($p -> root)", to ensure all
       nodes are visited.

       See the "Synopsis" for sample code.

       Or, see scripts/traverse.file.pl, which uses
       HTML::Parser::Simple::Reporter, and calls "traverse($node)" via
       "traverse_file($input_file_name)" in HTML::Parser::Simple::Reporter.

   verbose($Boolean)
       Gets or sets the	verbose	parameter.

       'verbose' is a parameter to "new()". See "Constructor and
       initialization" for details.

   xhtml($Boolean)
       Gets or sets the	xhtml parameter.

       If you call this	after object creation, the trigger feature of Moos is
       used to call "tagged_attribute()" so as to correctly set	the regexp
       which recognises	xhtml.

       'xhtml' is a parameter to "new()". See "Constructor and initialization"
       for details.

FAQ
   What	is the format of the data stored in each node of the tree?
       The data	of each	node is	a hash ref. The	keys/values of this hash ref
       are:

       o attributes
	   This	is the string of HTML attributes associated with the HTML tag.

	   Attributes are stored in lower-case.

	   So, <table align = 'center' summary = 'Body'> will have an
	   attributes string of	" align	= 'center' summary = 'body'".

	   Note	the leading space.

       o content
	   This	is an arrayref of bits and pieces of content.

	   Consider this fragment of HTML:

	   <p>I	did <i>not</i> say I <i>liked</i> debugging.</p>

	   When	parsing	'I did ', the number of	child nodes (of	<p>) is	0,
	   since <i> has not yet been detected.

	   So, 'I did '	is stored in the 0th element of	the arrayref belonging
	   to <p>.

	   Likewise, 'not' is stored in	the 0th	element	of the arrayref
	   belonging to	the node <i>.

	   Next, ' say I ' is stored in	the 1st	element	of the arrayref
	   belonging to	<p>, because it	follows	the 1st	child node (<i>).

	   Likewise, ' debugging' is stored in the 2nd element of the arrayref
	   belonging to	<p>.

	   This	way, the input string can be reproduced	by successively
	   outputting the elements of the arrayref of content interspersed
	   with the contents of the child nodes (processed recursively).

	   Note: If you	are processing this tree, never	forget that there can
	   be content after the	last child node	has been closed, but before
	   the current node is closed.

	   Note: The DOCTYPE declaration is stored as the 0th element of the
	   content of the root node.

       o depth
	   The nesting depth of	the tag	within the document.

	   The root is at depth 0, '<html>' is at depth 1, '<head>' and
	   '<body>' are at depth 2, and so on.

	   It is just there in case you	need it.

       o name
	   This is the name of the HTML tag. So, the tag '<html>' has the name
	   'html'.

	   Tag names are stored	in lower-case.

	   The root of the tree	is called 'root', and holds the	DOCTYPE, if
	   any,	as content.

	   The root has	the node 'html'	as the only child, of course.

       o node_type
	   This	holds 'global' before '<head>' and between '</head>' and
	   '<body>', and after '</body>'.

	   It holds 'head' for all nodes from '<head>' to '</head>', and holds
	   'body' from '<body>'	to '</body>'.

	   It is just there in case you	need it.
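
       The way content is interleaved with child nodes, as described under
       'content' above, can be sketched without the module at all. Here a node
       is a plain hashref with a hypothetical 'children' key standing in for
       the Tree::Simple object; node() and render() are illustrative helpers
       only.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical stand-in for a tree node: a hashref holding the name,
# the arrayref of content, and an arrayref of child nodes.
sub node
{
	my($name, $content, @children) = @_;

	return {name => $name, content => $content, children => [@children]};
}

# <p>I did <i>not</i> say I <i>liked</i> debugging.</p>
my($p) = node
(
	'p',
	['I did ', ' say I ', ' debugging.'],
	node('i', ['not']),
	node('i', ['liked']),
);

sub render
{
	my($node)     = @_;
	my(@content)  = @{$$node{content} };
	my(@children) = @{$$node{children} };
	my($html)     = "<$$node{name}>";

	for my $i (0 .. $#children)
	{
		# Element $i of the content precedes child $i.
		$html .= $content[$i] if (defined $content[$i]);
		$html .= render($children[$i]);
	}

	# Content after the last child, but before the node is closed.
	my($last) = $content[scalar @children];
	$html     .= $last if (defined $last);

	return "$html</$$node{name}>";
}

print render($p), "\n"; # Prints the original fragment.
```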

   How are tags	and attributes handled?
       Tags are	stored in lower-case, in a tree	managed	by Tree::Simple.

       Attributes are stored in	the same case as in the	original HTML.

       The root of the tree is returned by "root()".

   How are HTML	comments handled?
       They are	treated	as content. This includes the prefix '<!--' and	the
       suffix '-->'.

   How is DOCTYPE handled?
       It is treated as	content	belonging to the root of the tree.

   How is the XML declaration handled?
       It is treated as	content	belonging to the root of the tree.

   Does	this module handle all HTML pages?
       No, never.

   Which versions of HTML does this module handle?
       Up to V 4.

   What	do I do	if this	module does not	handle my HTML page?
       Make yourself a nice cup	of tea,	and then fix your page.

   Does	this validate the HTML input?
       No.

       For example, if you feed	in a HTML page without the title tag, this
       module does not care.

   How do I view the output HTML?
       There are various ways.

       o See scripts/parse.html.pl
       o By installing HTML::Revelation, of course!
	   Sample output:

	   <http://savage.net.au/Perl-modules/html/CreateTable.html>.

   How do I test this module (or my file)?
       Preferably, see the previous question, or...

       Suggested steps:

       Note: There are quite a few files involved. Proceed with	caution.

       o Select	a HTML file to test
	   Call	this input.html.

       o Run input.html	thru reveal.pl
	   Reveal.pl ships with	HTML::Revelation.

	   Call	the output file	output.1.html.

       o Run input.html	thru parse.html.pl
	   parse.html.pl ships with HTML::Parser::Simple.

	   Call	the output file	parsed.html.

       o Run parsed.html thru reveal.pl
	   Call	the output file	output.2.html.

       o Compare output.1.html and output.2.html
	   If they match, or even if they don't	match, you're finished.

   Will	you implement a	'quirks' mode to handle	my special HTML	file?
       No, never.

       Help with quirks: <http://www.quirksmode.org/sitemap.html>.

   Is there anything I should be aware of?
       Yes. If your HTML file is not nice, the interpretation of tag nesting
       will not	match your preconceptions.

       In such cases, do not seek to fix the code. Instead, fix	your (faulty)
       preconceptions, and fix your HTML file.

       The 'a' tag, for	example, is defined to be an inline tag, but the 'div'
       tag is a	block-level tag.

       I do not	define 'a' to be inline, others	do, e.g.
       <http://www.w3.org/TR/html401/> and hence HTML::Tagset.

       Inline means:

	       <a href = "#NAME"><div class = 'global_toc_text'>NAME</div></a>

       will not	be parsed as an	'a' containing a 'div'.

       The 'a' tag will	be closed before the 'div' is opened. So, the result
       will look like:

	       <a href = "#NAME"></a><div class	= 'global_toc_text'>NAME</div>

       To achieve what was presumably intended,	use 'span':

	       <a href = "#NAME"><span class = 'global_toc_text'>NAME</span></a>

       Some people (*cough* *cough*) have had to redo their entire websites
       due to this very	problem.

       Of course, this is just one of a	vast set of possible problems.

       You have	been warned.
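
       To see this behaviour for yourself, a sketch such as the following
       parses both fragments and prints the results for comparison. The
       surrounding page in $template is an assumption; only the 2 fragments
       come from the text above.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use HTML::Parser::Simple;

# A minimal page (assumed) into which each fragment is inserted.
my($template) = '<html><head><title>T</title></head><body>%s</body></html>';
my(%fragment) =
(
	div  => q|<a href = "#NAME"><div class = 'global_toc_text'>NAME</div></a>|,
	span => q|<a href = "#NAME"><span class = 'global_toc_text'>NAME</span></a>|,
);
my(%result);

for my $key (sort keys %fragment)
{
	# A new parser per fragment, so no state is shared between runs.
	my($p)    = HTML::Parser::Simple -> new;
	my($html) = sprintf($template, $fragment{$key});

	$result{$key} = $p -> parse($html) -> traverse($p -> root) -> result;

	print "$key:\n$result{$key}\n";
}
```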

   Why did you use Tree::Simple	but not	Tree or	Tree::Fast or Tree::DAG_Node?
       During testing, Tree::Fast crashed, so I	replaced it with Tree and
       everything worked. Spooky.

       Late news: Tree does not	cope with an arrayref stored in	the metadata,
       so I have switched to Tree::DAG_Node.

       Stop press: As an experiment I switched to Tree::Simple.	Since it also
       works I will just keep using it.

   Why is this module not called HTML::Parser::PurePerl?
       o The API
	   That	name sounds like a pure	Perl version of	the same API as	used
	   by HTML::Parser.

	   But the 2 APIs are not, and are not meant to	be, compatible.

       o The tie-in
	   Some	people might falsely assume HTML::Parser can automatically
	   fall	back to	HTML::Parser::PurePerl in the absence of a compiler.

   How do I output my own stuff	while traversing the tree?
       o The sophisticated way
	   As always with OO code, sub-class! In this case, you	write a	new
	   version of the traverse() method.

	   See HTML::Parser::Simple::Reporter, for example. It overrides
	   "traverse($node)".

       o The crude way
	   Alternately,	implement another method in your sub-class, e.g.
	   process(), which recurses like traverse(). Then call	parse()	and
	   process().
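
       The crude way might be sketched like this. My::Outline and the body of
       process() are hypothetical; the sketch assumes only that each node is a
       Tree::Simple object whose value is the hashref described in the first
       question of this FAQ.

```perl
#!/usr/bin/env perl

package My::Outline; # Hypothetical sub-class name.

use strict;
use warnings;

use parent 'HTML::Parser::Simple';

# Recurse like traverse(), collecting 1 indented line per node.
sub process
{
	my($self, $node, $lines) = @_;
	$lines     ||= [];
	my($value) = $node -> getNodeValue; # Hashref: name, depth, content, etc.

	push @$lines, ('  ' x $$value{depth}) . $$value{name};

	$self -> process($_, $lines) for $node -> getAllChildren;

	return $lines;
}

package main;

my($outline) = My::Outline -> new;

$outline -> parse('<html><head><title>T</title></head><body><p>Hi <i>there</i></p></body></html>');

print "$_\n" for @{$outline -> process($outline -> root)};
```

       Here process() returns an arrayref rather than printing directly, which
       keeps the recursion separate from the output.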

   How is the source formatted?
       I edit with UltraEdit. That means, in general, leading 4-space tabs.

       All vertical alignment within lines is done manually with spaces.

       Perl::Critic is off the agenda.

   Why did you choose Moos?
       For the 2012 Google Code-in, I had a quick look at 122 class-building
       classes,	and decided Moos was suitable, given it	is pure-Perl and has
       the trigger feature I needed.

       See
       <http://savage.net.au/Module-reviews/html/gci.2012.class.builder.modules.html>.

Credits
       This Perl HTML parser has been converted	from a JavaScript one written
       by John Resig.

       <http://ejohn.org/files/htmlparser.js>.

       Well done John!

       Note also the comments published	here:

       <http://groups.google.com/group/envjs/browse_thread/thread/edd9033b9273fa58>.

Repository
       <https://github.com/ronsavage/HTML-Parser-Simple>

Support
       Email the author, or log	a bug on RT:

       <https://rt.cpan.org/Public/Dist/Display.html?Name=HTML::Parser::Simple>.

Author
       "HTML::Parser::Simple" was written by Ron Savage <ron@savage.net.au> in
       2009.

       Home page: <http://savage.net.au/index.html>.

Copyright
       Australian copyright (c)	2009 Ron Savage.

	       All Programs of mine are	'OSI Certified Open Source Software';
	       you can redistribute them and/or	modify them under the terms of
	       The Artistic License, a copy of which is	available at:
	       http://www.opensource.org/licenses/index.html

perl v5.32.1			  2015-01-25	       HTML::Parser::Simple(3)
