HTML::TableParser(3)  User Contributed Perl Documentation HTML::TableParser(3)

NAME
       HTML::TableParser - Extract data	from an	HTML table

SYNOPSIS
	 use HTML::TableParser;

	 @reqs = (
		  {
		   id => 1.1,			 # id for embedded table
		   hdr => \&header,		 # function callback
		   row => \&row,		 # function callback
		   start => \&start,		 # function callback
		   end => \&end,		 # function callback
		   udata => { Snack => 'Food' }, # arbitrary user data
		  },
		  {
		   id => 1,			 # table id
		   cols	=> [ 'Object Type',
			     qr/object/	],	 # column name matches
		   obj => $obj,			 # method callbacks
		  },
		 );

	 # create parser object
	 $p = HTML::TableParser->new( \@reqs,
			  { Decode => 1, Trim => 1, Chomp => 1 } );
	 $p->parse_file( 'foo.html' );

	 # function callbacks
	 sub start {
	   my (	$id, $line, $udata ) = @_;
	   #...
	 }

	 sub end {
	   my (	$id, $line, $udata ) = @_;
	   #...
	 }

	 sub header {
	   my (	$id, $line, $cols, $udata ) = @_;
	   #...
	 }

	 sub row  {
	   my (	$id, $line, $cols, $udata ) = @_;
	   #...
	 }

DESCRIPTION
       HTML::TableParser uses HTML::Parser to extract data from	an HTML	table.
       The data	is returned via	a series of user defined callback functions or
       methods.  Specific tables may be selected either by matching a unique
       table id	or by matching against the column names.  Multiple (even
       nested) tables may be parsed in a document in one pass.

   Table Identification
       Each table is given a unique id,	relative to its	parent,	based upon its
       order and nesting. The first top	level table has	id 1, the second 2,
       etc.  The first table nested in table 1 has id 1.1, the second 1.2,
       etc.  The first table nested in table 1.1 has id	1.1.1, etc.  These, as
       well as the tables' column names, may be	used to	identify which tables
       to parse.
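
       The numbering scheme can be illustrated with a short, self-contained
       sketch.  It does not use HTML::TableParser itself: the helper
       table_ids() and its list of 'open'/'close' events (standing in for
       <table> and </table> tags in document order) are hypothetical and
       exist only for this illustration.

```perl
use strict;
use warnings;

# Assign ids to tables from a flat list of 'open'/'close' events.
# (Hypothetical helper, not part of HTML::TableParser.)
sub table_ids {
    my @events = @_;
    my @path;           # ids of the currently open tables, outermost first
    my @count = (0);    # $count[$depth]: tables seen so far under the current parent
    my @ids;
    for my $ev (@events) {
        if ( $ev eq 'open' ) {
            my $n = ++$count[ scalar @path ];
            push @path, $n;
            push @ids, join '.', @path;
            $count[ scalar @path ] = 0;    # nested tables start again from 1
        }
        else {
            pop @path;
        }
    }
    return @ids;
}

my @ids = table_ids(qw( open open open close close open close close open close ));
print "@ids\n";    # prints: 1 1.1 1.1.1 1.2 2
```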

   Data	Extraction
       As the parser traverses a selected table, it will pass data to user
       provided	callback functions or methods after it has digested particular
       structures in the table.	 All functions are passed the table id (as
       described above), the line number in the	HTML source where the table
       was found, and a	reference to any table specific	user provided data.

       Table Start
	       The start callback is invoked when a matched table has been
	       found.

       Table End
	       The end callback	is invoked after a matched table has been
	       parsed.

       Header  The hdr callback	is invoked after the table header has been
	       read in.	 Some tables do	not use	the <th> tag to	indicate a
	       header, so this function	may not	be called.  It is passed the
	       column names.

       Row     The row callback	is invoked after a row in the table has	been
	       read.  It is passed the column data.

       Warn    The warn	callback is invoked when a non-fatal error occurs
	       during parsing.	Fatal errors croak.

       New     This is the class method	to call	to create a new	object when
	       HTML::TableParser is supposed to	create new objects upon	table
	       start.

   Callback API
       Callbacks may be functions, methods, or a mixture of both.  If methods
       are used, an object (or a class from which to create one) must be
       supplied in the table request passed to the constructor; see
       "Specifying the data callbacks".

       The callbacks are invoked as follows:

	 start(	$tbl_id, $line_no, $udata );

	 end( $tbl_id, $line_no, $udata	);

	 hdr( $tbl_id, $line_no, \@col_names, $udata );

	 row( $tbl_id, $line_no, \@data, $udata	);

	 warn( $tbl_id,	$line_no, $message, $udata );

	 new( $tbl_id, $udata );

   Data	Cleanup
       There are several cleanup operations that may be	performed
       automatically:

       Chomp   chomp() the data.

       Decode  Run the data through HTML::Entities::decode.

       DecodeNBSP
               Normally HTML::Entities::decode changes a non-breaking space
               into a character which doesn't seem to be matched by Perl's
               whitespace regexp.  Setting this attribute changes the HTML
               "nbsp" character to a plain ol' blank.

       Trim    Remove leading and trailing white space.
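
       As a rough illustration of what these operations do to one cell, here
       is a pure Perl sketch.  The helper clean_cell() is hypothetical, and
       the order in which it applies the operations is an assumption rather
       than a description of the module's internals.  It also assumes the
       text has already been entity-decoded, so that "&nbsp;" has become the
       character \xA0.

```perl
use strict;
use warnings;

# Hypothetical helper mimicking Chomp, DecodeNBSP and Trim on decoded text.
sub clean_cell {
    my ( $text, %opt ) = @_;
    chomp $text             if $opt{Chomp};         # drop one trailing newline
    $text =~ tr/\xA0/ /     if $opt{DecodeNBSP};    # non-breaking space -> blank
    $text =~ s/^\s+|\s+$//g if $opt{Trim};          # strip surrounding white space
    return $text;
}

my $raw   = "\xA0 42 km/s \n";
my $clean = clean_cell( $raw, Chomp => 1, DecodeNBSP => 1, Trim => 1 );
print "[$clean]\n";    # prints: [42 km/s]
```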

   Data	Organization
       Column names are	derived	from cells delimited by	the <th> and </th>
       tags. Some tables have header cells which span one or more columns or
       rows to make things look	nice.  HTML::TableParser determines the	actual
       number of columns used and provides column names	for each column,
       repeating names for spanned columns and concatenating spanned rows and
       columns.	 For example,  if the table header looks like this:

	+----+--------+----------+-------------+-------------------+
	|    |	      |	Eq J2000 |	       | Velocity/Redshift |
	| No | Object |----------| Object Type |-------------------|
	|    |	      |	RA | Dec |	       | km/s |	 z  | Qual |
	+----+--------+----------+-------------+-------------------+

       The columns will	be:

	 No
	 Object
	 Eq J2000 RA
	 Eq J2000 Dec
	 Object	Type
	 Velocity/Redshift km/s
	 Velocity/Redshift z
	 Velocity/Redshift Qual

       Row data	are derived from cells delimited by the	<td> and </td> tags.
       Cells which span	more than one column or	row are	handled	correctly,
       i.e. the	values are duplicated in the appropriate places.
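
       The way spanned header cells turn into column names can be sketched
       in plain Perl.  This is not the module's implementation: the helper
       column_names() and its input format (one arrayref per header row,
       each cell given as [ text, colspan, rowspan ]) are assumptions made
       for this illustration.  Fed the header shown above, it reproduces the
       listed column names.

```perl
use strict;
use warnings;

# Hypothetical helper: derive one name per column from spanned header cells.
sub column_names {
    my @rows = @_;
    my @grid;    # $grid[$r][$c] holds text, or '' for a slot reserved by a rowspan
    for my $r ( 0 .. $#rows ) {
        my $c = 0;
        for my $cell ( @{ $rows[$r] } ) {
            my ( $text, $colspan, $rowspan ) = @$cell;
            $c++ while defined $grid[$r][$c];    # skip slots taken from rows above
            for my $dr ( 0 .. $rowspan - 1 ) {
                for my $dc ( 0 .. $colspan - 1 ) {
                    # repeat the text across spanned columns, but only once per column
                    $grid[ $r + $dr ][ $c + $dc ] = $dr == 0 ? $text : '';
                }
            }
            $c += $colspan;
        }
    }
    my $ncols = 0;
    for my $row (@grid) { $ncols = @$row if @$row > $ncols }
    return map {
        my $c = $_;
        join ' ', grep { defined($_) && length($_) } map { $_->[$c] } @grid;
    } 0 .. $ncols - 1;
}

my @names = column_names(
    [ [ 'No', 1, 2 ], [ 'Object', 1, 2 ], [ 'Eq J2000', 2, 1 ],
      [ 'Object Type', 1, 2 ], [ 'Velocity/Redshift', 3, 1 ] ],
    [ [ 'RA', 1, 1 ], [ 'Dec', 1, 1 ], [ 'km/s', 1, 1 ], [ 'z', 1, 1 ], [ 'Qual', 1, 1 ] ],
);
print "$_\n" for @names;
```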

METHODS
       new
		  $p = HTML::TableParser->new( \@reqs, \%attr );

               This is the class constructor.  It is passed a reference to an
               array of table requests as well as a reference to a hash of
               attributes which specify defaults for common operations.  Table
               requests are documented in "Table Requests".

	       The %attr hash provides default values for some of the table
	       request attributes, namely the data cleanup operations (
	       "Chomp",	"Decode", "Trim" ), and	the multi match	attribute
	       "MultiMatch", i.e.,

		 $p = HTML::TableParser->new( \@reqs, {	Chomp => 1 } );

	       will set	Chomp on for all of the	table requests,	unless
	       overridden by them.  The	data cleanup operations	are documented
	       above; "MultiMatch" is documented in "Table Requests".

	       Decode defaults to on; all of the others	default	to off.

       parse_file
	       This is the same	function as in HTML::Parser.

       parse   This is the same	function as in HTML::Parser.
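
       A minimal end-to-end sketch of parsing an in-memory document follows.
       The HTML and the variable names are made up for the example.  It
       assumes that parse() and eof() behave as documented for HTML::Parser,
       and it copies the arrays handed to the callbacks as a precaution in
       case the parser reuses them.

```perl
use strict;
use warnings;
use HTML::TableParser;

my $html = <<'HTML';
<table>
  <tr><th>Name</th><th>Count</th></tr>
  <tr><td> apples </td><td> 3 </td></tr>
  <tr><td> pears </td><td> 5 </td></tr>
</table>
HTML

my ( @header, @rows );

my @reqs = (
    {
        id  => 1,    # first top level table
        hdr => sub { my ( $id, $line, $cols, $udata ) = @_; @header = @$cols },
        row => sub { my ( $id, $line, $cols, $udata ) = @_; push @rows, [@$cols] },
    },
);

my $p = HTML::TableParser->new( \@reqs, { Trim => 1 } );
$p->parse($html);
$p->eof;    # tell HTML::Parser that the document is complete

print join( "\t", @header ), "\n";
print join( "\t", @$_ ), "\n" for @rows;
```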

Table Requests
       A table request is a hash used by HTML::TableParser to determine	which
       tables are to be	parsed,	the callbacks to be invoked, and any data
       cleanup.	 There may be multiple requests	processed by one call to the
       parser; each table is associated	with a single request (even if several
       requests	match the table).

       A single request may match several tables; however, unless the
       MultiMatch attribute is specified for that request, it will be used for
       the first matching table only.

       A table request which matches a table id	of "DEFAULT" will be used as a
       catch-all request, and will match all tables not	matched	by other
       requests.  Please note that tables are compared to the requests in the
       order that the latter are passed to the new() method; place the DEFAULT
       request last for proper behavior.

   Identifying tables to parse
       HTML::TableParser needs to be told which	tables to parse.  This can be
       done by matching	table ids or column names, or a	combination of both.
       The table request hash elements dedicated to this are:

       id      This indicates a	match on table id.  It can take	one of these
	       forms:

	       exact match
			 id => $match
			 id => '1.2'

		       Here $match is a	scalar which is	compared directly to
		       the table id.

	       regular expression
			 id => $re
			 id => qr/1\.\d+\.2/

		       $re is a	regular	expression, which must be constructed
		       with the	"qr//" operator.

	       subroutine
			 id => \&my_match_subroutine
                         id => sub { my ( $id, $oids ) = @_ ;
                                  $oids->[0] > 3 && $oids->[1] < 2 }

		       Here "id" is assigned a coderef to a subroutine which
		       returns true if the table matches, false	if not.	 The
		       subroutine is passed two	arguments: the table id	as a
		       scalar string ( e.g. 1.2.3) and the table id as an
		       arrayref	(e.g. "$oids = [ 1, 2, 3]").

	       "id" may	be passed an array containing any combination of the
	       above:

		 id => [ '1.2',	qr/1\.\d+\.2/, sub { ... } ]

	       Elements	in the array may be preceded by	a modifier indicating
	       the action to be	taken if the table matches on that element.
	       The modifiers and their meanings	are:

	       "-"     If the id matches, it is	explicitly excluded from being
		       processed by this request.

	       "--"    If the id matches, it is	skipped	by all requests.

	       "+"     If the id matches, it will be processed by this
		       request.	 This is the default action.

	       An example:

		 id => [ '-', '1.2', 'DEFAULT' ]

	       indicates that this request should be used for all tables,
	       except for table	1.2.

		 id => [ '--', '1.2' ]

               Here table 1.2 is just plain skipped altogether.

       cols    This indicates a	match on column	names.	It can take one	of
	       these forms:

	       exact match
			 cols => $match
			 cols => 'Snacks01'

		       Here $match is a	scalar which is	compared directly to
		       the column names.  If any column	matches, the table is
		       processed.

	       regular expression
			 cols => $re
			 cols => qr/Snacks\d+/

		       $re is a	regular	expression, which must be constructed
		       with the	"qr//" operator.  Again, a successful match
		       against any column name causes the table	to be
		       processed.

	       subroutine
			 cols => \&my_match_subroutine
			 cols => sub { my ( $id, $oids,	$cols )	= @_ ;
				       ... }

		       Here "cols" is assigned a coderef to a subroutine which
		       returns true if the table matches, false	if not.	 The
		       subroutine is passed three arguments: the table id as a
		       scalar string ( e.g. 1.2.3), the	table id as an
		       arrayref	(e.g. "$oids = [ 1, 2, 3]"), and the column
		       names, as an arrayref (e.g. "$cols = [ 'col1', 'col2'
		       ]").  This option gives the calling routine the ability
		       to make arbitrary selections based upon table id	and
		       columns.

	       "cols" may be passed an arrayref	containing any combination of
	       the above:

		 cols => [ 'Snacks01', qr/Snacks\d+/, sub { ...	} ]

	       Elements	in the array may be preceded by	a modifier indicating
	       the action to be	taken if the table matches on that element.
	       They are	the same as the	table id modifiers mentioned above.

       colre   This is deprecated, and is present for backwards compatibility
               only.  It is an arrayref containing the regular expressions to
               match, or a scalar containing a single regular expression.

       More than one of	these may be used for a	single table request. A
       request may match more than one table.  By default a request is used
       only once (even the "DEFAULT" id	match!). Set the "MultiMatch"
       attribute to enable multiple matches per	request.
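
       The three matcher forms (scalar, "qr//" regular expression, and
       subroutine) can be sketched for table ids in a few lines of plain
       Perl.  The helper id_matches() is hypothetical and ignores the "-",
       "--" and "+" modifiers; it is meant only to show what each form is
       handed, not how the module implements matching.

```perl
use strict;
use warnings;

# Hypothetical helper: does one matcher accept this table id?
sub id_matches {
    my ( $matcher, $id ) = @_;
    my @oids = split /\./, $id;    # '1.2.3' -> ( 1, 2, 3 )
    return ( $matcher->( $id, \@oids ) ? 1 : 0 ) if ref($matcher) eq 'CODE';
    return ( $id =~ $matcher ? 1 : 0 )           if ref($matcher) eq 'Regexp';
    return $id eq $matcher ? 1 : 0;              # exact match
}

my $deep = sub { my ( $id, $oids ) = @_; $oids->[0] > 3 && $oids->[1] < 2 };

print id_matches( '1.2',           '1.2' ),   "\n";    # prints: 1
print id_matches( qr/1\.\d+\.2/,   '1.3.2' ), "\n";    # prints: 1
print id_matches( qr/1\.\d+\.2/,   '1.2' ),   "\n";    # prints: 0
print id_matches( $deep,           '4.1' ),   "\n";    # prints: 1
```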

       When attempting to match	a table, the following steps are taken:

       1.      The table id is compared	to the requests	which contain an id
	       match.  The first such match is used (in	the order given	in the
	       passed array).

       2.      If no explicit id match is found, column	name matches are
	       attempted.  The first such match	is used	(in the	order given in
               the passed array).

       3.      If no column name match is found	(or there were none
	       requested), the first request which matches an id of "DEFAULT"
	       is used.
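
       The three steps above can be sketched as follows.  This is a
       simplified model, not the module's code: select_request() and
       matches() are hypothetical helpers, the "name" key exists only to
       label the requests in the output, and modifiers, subroutine matchers
       and "MultiMatch" are left out.

```perl
use strict;
use warnings;

# Hypothetical helper: compare one value against a scalar or qr// matcher.
sub matches {
    my ( $matcher, $value ) = @_;
    return ref($matcher) eq 'Regexp'
        ? ( $value =~ $matcher ? 1 : 0 )
        : ( $value eq $matcher ? 1 : 0 );
}

sub select_request {
    my ( $reqs, $id, $cols ) = @_;

    # Step 1: explicit id matches, in the order the requests were given.
    for my $req (@$reqs) {
        next if !defined( $req->{id} ) || $req->{id} eq 'DEFAULT';
        return $req if matches( $req->{id}, $id );
    }

    # Step 2: column name matches; any matching column selects the request.
    for my $req (@$reqs) {
        next if !defined( $req->{cols} );
        return $req if grep { matches( $req->{cols}, $_ ) } @$cols;
    }

    # Step 3: fall back to the first DEFAULT request, if there is one.
    for my $req (@$reqs) {
        return $req if defined( $req->{id} ) && $req->{id} eq 'DEFAULT';
    }
    return;
}

my @reqs = (
    { name => 'by_cols',  cols => qr/object/i },
    { name => 'by_id',    id   => '1.1' },
    { name => 'fallback', id   => 'DEFAULT' },
);

for my $case ( [ '1.1', ['Object Type'] ], [ '2', [ 'No', 'Object' ] ], [ '3', ['Flux'] ] ) {
    my $req = select_request( \@reqs, @$case );
    print "$case->[0] => $req->{name}\n";
}
# prints:
# 1.1 => by_id
# 2 => by_cols
# 3 => fallback
```

       Note that table 1.1 goes to the id request even though the column
       request is listed first and would also match, because id matches are
       tried before column name matches.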

   Specifying the data callbacks
       Callback	functions are specified	with the callback attributes "start",
       "end", "hdr", "row", and	"warn".	 They should be	set to code
       references, i.e.

	 %table_req = (	..., start => \&start_func, end	=> \&end_func )

       To use methods, specify the object with the "obj" key, and the method
       names via the callback attributes, which	should be set to strings.  If
       you don't specify method	names they will	default	to (you	guessed	it)
       "start",	"end", "hdr", "row", and "warn".

	 $obj =	SomeClass->new();
	 # ...
	 %table_req_1 =	( ..., obj => $obj );
	 %table_req_2 =	( ..., obj => $obj, start => 'start',
				    end	=> 'end' );

       You can also have HTML::TableParser create a new	object for you for
       each table by specifying	the "class" attribute.	By default the
       constructor is assumed to be the	class new() method; if not, specify it
       using the "new" attribute:

	 use MyClass;
	 %table_req = (	..., class => 'MyClass', new =>	'mynew'	);

       To use a	function instead of a method for a particular callback,	set
       the callback attribute to a code	reference:

	 %table_req = (	..., obj => $obj, end => \&end_func );

       You don't have to provide all the callbacks.  You should	not use	both
       "obj" and "class" in the	same table request.

       HTML::TableParser automatically determines if your object or class has
       one of the required methods.  If	you wish it not	to use a particular
       method, set it equal to "undef".	 For example

	 %table_req = (	..., obj => $obj, end => undef )

       indicates the object's end method should	not be called, even if it
       exists.

       You can specify arbitrary data to be passed to the callback functions
       via the "udata" attribute:

	 %table_req = (	..., udata => \%hash_of_my_special_stuff )
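
       Putting method callbacks and "udata" together, a small sketch might
       look like this.  The class RowCollector, its fields and the "label"
       key are invented for the example.  The sketch assumes that method
       callbacks receive the object followed by the arguments listed under
       "Callback API".  Only hdr and row methods are defined, so no start or
       end callbacks are invoked.

```perl
use strict;
use warnings;
use HTML::TableParser;

# Hypothetical collector class used as the "obj" of a table request.
package RowCollector;

sub new { my $class = shift; return bless { header => [], rows => [] }, $class }

sub hdr {
    my ( $self, $id, $line, $cols, $udata ) = @_;
    $self->{header} = [@$cols];
}

sub row {
    my ( $self, $id, $line, $cols, $udata ) = @_;
    push @{ $self->{rows} }, { label => $udata->{label}, data => [@$cols] };
}

package main;

my $obj = RowCollector->new;
my $p   = HTML::TableParser->new(
    [ { id => 1, obj => $obj, udata => { label => 'fruit' } } ],
    { Trim => 1 },
);
$p->parse('<table><tr><th>Name</th></tr><tr><td>apple</td></tr></table>');
$p->eof;

printf "%s: %s\n", $_->{label}, join( ', ', @{ $_->{data} } ) for @{ $obj->{rows} };
```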

   Specifying Data cleanup operations
       Data cleanup operations may be specified	uniquely for each table. The
       available keys are "Chomp", "Decode", "Trim".  They should be set to a
       non-zero	value if the operation is to be	performed.

   Other Attributes
       The "MultiMatch"	key is used when a request is capable of handling
       multiple	tables in the document.	 Ordinarily, a request will process a
       single table only (even "DEFAULT" requests).  Set it to a non-zero
       value to	allow the request to handle more than one table.
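
       For instance, a single catch-all request can be allowed to handle
       every table in a document.  The following sketch uses made-up HTML
       and records the id of each table for which the start callback fires.

```perl
use strict;
use warnings;
use HTML::TableParser;

my @seen;    # ids of the tables handled by the request

my $p = HTML::TableParser->new(
    [
        {
            id         => 'DEFAULT',
            MultiMatch => 1,           # without this, only the first table is handled
            start      => sub { my ( $id, $line, $udata ) = @_; push @seen, $id },
        },
    ],
    { Trim => 1 },
);

$p->parse( '<table><tr><td>a</td></tr></table>'
         . '<table><tr><td>b</td></tr></table>' );
$p->eof;

print "tables handled: @seen\n";
```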

LICENSE
       This software is	released under the GNU General Public License.	You
       may find	a copy at

	  http://www.fsf.org/copyleft/gpl.html

AUTHOR
       Diab Jerius (djerius@cpan.org)

SEE ALSO
       HTML::Parser, HTML::TableExtract.

perl v5.24.1			  2017-04-20		  HTML::TableParser(3)
