HTML::FormatExternal(3)  User Contributed Perl Documentation  HTML::FormatExternal(3)

NAME
       HTML::FormatExternal - HTML to text formatting using external programs

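DESCRIPTION
       This is a collection of formatter modules which turn HTML into plain
       text by dumping it through the respective external programs.

           HTML::FormatText::Elinks
           HTML::FormatText::Html2text
           HTML::FormatText::Links
           HTML::FormatText::Netrik
           HTML::FormatText::Lynx
           HTML::FormatText::Vilistextum
           HTML::FormatText::W3m
           HTML::FormatText::Zen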

       The module interfaces are compatible with "HTML::Formatter" modules
       such as "HTML::FormatText", but the external programs do	all the	work.

       Common formatting options are used where	possible, such as "leftmargin"
       and "rightmargin".  So just by switching	the class you can use a
       different program (or the plain "HTML::FormatText") according to
       personal	preference, or strengths and weaknesses, or what you've	got.

       There's nothing particularly difficult about piping through these
       programs, but a unified interface hides details like how	to set margins
       and how to force	input or output	charsets.

FUNCTIONS
       Each of the classes above provides the following functions.  The "XXX"
       in the class names here is a placeholder for any of "Elinks", "Lynx",
       etc., as above.

       See examples/ in the HTML-FormatExternal sources for a complete
       sample program.

   Formatter Compatible	Functions
       "$text = HTML::FormatText::XXX->format_file ($filename, key=>value, ...)"
       "$text = HTML::FormatText::XXX->format_string ($html_string, key=>value, ...)"
	   Run the formatter program over a file or string with	the given
	   options and return the formatted result as a	string.	 See "OPTIONS"
	   below for possible key/value	options.  For example,

	       $text = HTML::FormatText::Lynx->format_file ('/my/file.html');

	       $text = HTML::FormatText::W3m->format_string
                 ('<html><body> <p> Hello world! </p> </body></html>');

	   "format_file()" ensures any $filename is interpreted	as a filename
	   (by escaping	as necessary against however the programs interpret
	   command line	arguments).

       "$formatter = HTML::FormatText::XXX->new	(key=>value, ...)"
	   Create a formatter object with the given options.  In the current
	   implementation an object doesn't do much more than remember the
	   options for future use.

	       $formatter = HTML::FormatText::Elinks->new(rightmargin => 60);

       "$text =	$formatter->format ($tree_or_string)"
	   Run the $formatter program on a "HTML::TreeBuilder" tree or a
           string, using the options in $formatter, and return the result as a
           string.

	   A TreeBuilder argument (ie. a "HTML::Element") is accepted for
	   compatibility with "HTML::Formatter".  The tree is simply turned
	   into	a string with "$tree->as_HTML" to pass to the program, so if
	   you've got a	string already then give that instead of a tree.

	   "HTML::Element" itself has a	"format()" method (see "format"	in
	   HTML::Element) which	runs a given $formatter.  A
	   "HTML::FormatExternal" object can be	used for $formatter.

	       $text = $tree->format($formatter);

	       # which dispatches to
	       $text = $formatter->format($tree);

   Extra Functions
       The following are extra methods not available in the plain
       "HTML::FormatText".

       "HTML::FormatText::XXX->program_version ()"
       "HTML::FormatText::XXX->program_full_version ()"
       "$formatter->program_version ()"
       "$formatter->program_full_version ()"
	   Return the version number of	the formatter program as reported by
	   its "--version" or similar option.  If the formatter	program	is not
	   available then return "undef".

	   "program_version()" is the bare version number, perhaps with	"beta"
	   or similar indication.  "program_full_version()" is the entire
           version output, which may include build options, copyright notice,
           etc.

	       $str = HTML::FormatText::Lynx->program_version();
	       # eg. "2.8.7dev.10"

	       $str = HTML::FormatText::W3m->program_full_version();
	       # eg. "w3m version w3m/0.5.2, options lang=en,m17n,image,..."

	   The version number of the respective	Perl module itself is
	   available in	the usual way (see "VERSION" in	UNIVERSAL).

	       $modulever = HTML::FormatText::Netrik->VERSION;
               $modulever = $formatter->VERSION;
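
       The version methods also give a way to check which formatter programs
       are installed.  The following is a minimal sketch, not part of the
       modules: "pick_formatter()" and the preference order are made up for
       illustration, and it relies only on the behaviour described above that
       "program_version()" returns "undef" when the program is not available.

```perl
use strict;
use warnings;

# Candidate classes in order of preference.  The order is an arbitrary
# choice for this sketch.
my @candidates = map { "HTML::FormatText::$_" } qw(W3m Lynx Elinks Links);

# Hypothetical helper: return the first class whose module loads and
# whose program reports a version, or undef if none is usable.
sub pick_formatter {
    my @classes = @_;
    foreach my $class (@classes) {
        (my $file = "$class.pm") =~ s{::}{/}g;    # Foo::Bar -> Foo/Bar.pm
        next unless eval { require $file; 1 };    # module not installed
        my $version = eval { $class->program_version };
        return $class if defined $version;        # program is available
    }
    return;
}

my $class = pick_formatter(@candidates);
if (defined $class) {
    print "using $class, program version ", $class->program_version, "\n";
    print $class->format_string
      ('<html><body><p>Hello world!</p></body></html>');
} else {
    print "no formatter program available\n";
}
```

       Because the classes share one interface, the rest of a program can
       call "format_string()" or "format_file()" on whichever class was
       picked.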

CHARSETS
       File or byte string input is by default interpreted by the programs in
       their usual ways.  This should mean HTML	Latin-1	but user
       configurations might override that and some programs recognise a
       "<meta>"	charset	declaration or a Unicode BOM.  The "input_charset"
       option below can	force the input	charset.

       A Perl wide-character input string is encoded and passed to the program
       in whatever way it best understands.  Usually this is UTF-8 but in some
       cases it	is entitized instead.  The "input_charset" option can force
       the input charset to use	if for some reason UTF-8 is not	best.

       The output string is either bytes or wide chars.	 By default output is
       the same	as input, so wide char string input gives wide output and byte
       input string or file input gives	byte output.  The "output_wide"	option
       can force the output type (and is the way to get wide chars back from
       file or byte string input).

       Byte output is whatever the program produces.  Its default might	be the
       locale charset or other user configuration which	suits direct display
       to the user's terminal.	The "output_charset" option can	force the
       output to be certain or to be ready for further processing.

       Wide char output	is done	by choosing the	best output charset the
       program can do and decoding its output.	Usually	this means UTF-8 but
       some of the programs may	only have less.	 The "output_charset" option
       can force the charset used and decoded.	If it's	something less than
       UTF-8 then some programs	might for example give ASCII art
       approximations of otherwise unrepresentable characters.

       Byte input is usual for HTML downloaded from an HTTP server or taken
       from a MIME email, where the headers give the charset to pass as
       "input_charset".
       Byte output is good to go straight out to a tty or back to more MIME
       etc.  The input and output charsets could differ	if a server gives
       something other than what you want for final output.

       Wide chars are most convenient for crunching text within	Perl.  The
       default wide input giving wide output is designed to be transparent for
       this.

       For reference, if a "HTML::Element" tree	contains wide char strings
       then its	usual "as_HTML()" method, which	is used	by "format()" above,
       produces	wide char HTML so the formatters here give wide	char text.
       Actually	"as_HTML()" produces all ASCII because its default behaviour
       is to entitize anything "unsafe", but it's still	a wide char string so
       the formatted output text is wide.
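
       The byte and wide char combinations above can be sketched as follows.
       This is an illustration under assumptions, not a definitive recipe: the
       HTML bytes and the "ISO-8859-1" header charset are made up, and whether
       a given program accepts that charset name is also an assumption.  The
       formatter calls are skipped if the module or the "w3m" program is
       missing.

```perl
use strict;
use warnings;
use Encode ();

# Bytes as they might arrive from an HTTP response whose headers say
# charset=ISO-8859-1 (both the bytes and the charset are invented here).
my $html_bytes = "<html><body><p>caf\xE9</p></body></html>";
my $charset    = 'ISO-8859-1';

# Decode to Perl wide chars using the charset from the headers.
my $html_wide = Encode::decode($charset, $html_bytes);

my $class = 'HTML::FormatText::W3m';
if (eval { require HTML::FormatText::W3m; 1 }
    && defined($class->program_version)) {

    # Wide char input gives wide char output by default.
    my $text_wide = $class->format_string($html_wide);

    # Byte input with its charset named, and wide output forced.
    my $also_wide = $class->format_string
      ($html_bytes, input_charset => $charset, output_wide => 1);

    # Byte input and byte output, asking for UTF-8 bytes.
    my $text_utf8 = $class->format_string
      ($html_bytes, input_charset => $charset, output_charset => 'UTF-8');

    # Wide chars need an encoding layer when printed.
    binmode(STDOUT, ':encoding(UTF-8)');
    print $text_wide, $also_wide;
}
```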

OPTIONS
       The following options can be given to the constructor or to the
       formatting methods.  The	defaults are whatever the respective programs
       do.  The	programs generally read	their config files when	dumping	so the
       defaults	and formatting details may follow the user's personal
       preferences.  Usually this is a good thing.

       "leftmargin => INTEGER"
       "rightmargin => INTEGER"
	   The column numbers for the left and right hand ends of the text.
	   "leftmargin"	0 means	no padding on the left.	 "rightmargin" is the
	   text	width, so for instance 60 would	mean the longest line is 60
	   characters (inclusive of any	"leftmargin").	These options are
	   compatible with "HTML::FormatText".

	   "rightmargin" is not	necessarily a hard limit.  Some	of the
	   programs will exceed	it in a	HTML literal "<pre>", or a run of
	   "&nbsp;" or similar.

       "input_charset => STRING"
	   Force the HTML input	to be interpreted as bytes of the given
	   charset, irrespective of locale, user configuration,	"<meta>" in
	   the HTML, etc.

       "output_charset => STRING"
	   Force the text output to be encoded as the given charset.  The
           default varies among the programs, but usually defaults to the
           locale charset.

       "output_wide => 0,1,"as_input""
	   Select output string	as wide	characters rather than bytes.  The
	   default is "as_input" which means a wide char input string results
	   in a	wide char output string	and a byte input or file input is byte
	   output.  See	"CHARSETS" above for how wide characters work.

	   Bytes or wide chars output can be forced by 0 or 1 respectively.
	   For example to get wide char	output when formatting a file,

	       $wide_char_text = HTML::FormatText::W3m->format_file
				  ('/my/file.html', output_wide	=> 1);

       "base =>	STRING"
	   Set the base	URL for	any relative links within the HTML (similar to
	   "HTML::FormatText::WithLinks").  Usually this should	be the
	   location the	HTML was downloaded from.

	   If the document contains its	own "<base>" setting then currently
	   the document	takes precedence.  Only	Lynx and Elinks	display
           absolutized link targets and the option has no effect on the other
           programs.
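
       As a sketch of the margin options, the following formats a paragraph
       with "leftmargin" and "rightmargin" and then measures the result, since
       "rightmargin" is not necessarily a hard limit.  "longest_line()" is a
       hypothetical helper written for this example, and the margin values are
       arbitrary.  The formatting step is skipped if the module or the "lynx"
       program is missing.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: length of the longest line in $text, 0 if empty.
sub longest_line {
    my ($text) = @_;
    return max(0, map { length } split /\n/, $text);
}

# A paragraph long enough to need wrapping.
my $html = '<html><body><p>'
  . join(' ', ('word') x 40)
  . '</p></body></html>';

my %options = (leftmargin => 4, rightmargin => 60);

my $class = 'HTML::FormatText::Lynx';
if (eval { require HTML::FormatText::Lynx; 1 }
    && defined($class->program_version)) {
    my $text = $class->format_string($html, %options);
    print $text;

    my $width = longest_line($text);
    print "longest line is $width characters\n";
    warn "rightmargin exceeded\n" if $width > $options{rightmargin};
}
```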

TAINT MODE
       The formatter modules can be used under "perl -T" taint mode.  They run
       external	programs so it's necessary to untaint $ENV{PATH} in the	usual
       way per "Cleaning Up Your Path" in perlsec.

       The formatted text strings returned are always tainted, on the basis
       that they use or	include	data from outside the Perl program.  The
       "program_version()" and "program_full_version()" strings are tainted
       too.
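
       A minimal sketch of the taint mode steps follows.  The PATH directories
       and the untainting pattern are assumptions to adapt, not
       recommendations, and the script is meant to be run as "perl -T".
       Without "-T" it still runs, but nothing is reported as tainted.  The
       formatting step is skipped if the module or the "lynx" program is
       missing.

```perl
use strict;
use warnings;
use Scalar::Util qw(tainted);

# A fixed PATH of trusted directories.  These directories are an
# assumption; list wherever the formatter programs are installed.
$ENV{PATH} = '/bin:/usr/bin:/usr/local/bin';

# Remove other variables a shell may act on, as perlsec suggests.
delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};

my $class = 'HTML::FormatText::Lynx';
if (eval { require HTML::FormatText::Lynx; 1 }
    && defined($class->program_version)) {
    my $text = $class->format_string
      ('<html><body><p>Hello world!</p></body></html>');

    # Under -T the returned text is expected to be tainted.
    print 'output is ', (tainted($text) ? 'tainted' : 'not tainted'), "\n";

    # Untaint through a regexp capture, accepting only printable ASCII
    # and newlines here.  Use a pattern suited to the data you trust.
    if ($text =~ /\A([\x20-\x7E\n]*)\z/) {
        my $clean = $1;
        print $clean;
    }
}
```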

BUGS
       "leftmargin" is implemented by adding spaces to the program output.
       For byte output this is ASCII spaces and that will be badly wrong
       for unusual output like UTF-16 which is not a byte superset of ASCII.
       For wide	char output the	margin is applied after	decoding to wide chars
       so is correct.  It'd be better to ask the programs to do	the margin but
       their options for that are poor.

       There's nothing done with errors	or warning messages from the programs.
       Generally they make a best effort on doubtful HTML, but fatal errors
       like bad	options	or missing libraries ought to be somehow trapped.

       "elinks"	(from Aug 2008 onwards)	and "netrik" can produce ANSI escapes
       for colours, underline, etc, and	"html2text" and	"lynx" can produce tty
       style backspace overstriking.  This might be good for text destined for
       a tty or	further	crunching.  Perhaps an "ansi" or "tty" option could
       enable this, where possible, but	for now	it's deliberately turned off
       in those	programs to keep the default as	plain text.

SEE ALSO
       HTML::FormatText::Elinks, HTML::FormatText::Html2text,
       HTML::FormatText::Links,	HTML::FormatText::Netrik,
       HTML::FormatText::Lynx, HTML::FormatText::Vilistextum,
       HTML::FormatText::W3m, HTML::FormatText::Zen

       HTML::FormatText, HTML::FormatText::WithLinks,


LICENSE
       Copyright 2008, 2009, 2010, 2011, 2012, 2013, 2015 Kevin Ryde

       HTML-FormatExternal is free software; you can redistribute it and/or
       modify it under the terms of the	GNU General Public License as
       published by the	Free Software Foundation; either version 3, or (at
       your option) any	later version.

       HTML-FormatExternal is distributed in the hope that it will be useful,
       but WITHOUT ANY WARRANTY; without even the implied warranty of
       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
       General Public License for more details.

       You should have received	a copy of the GNU General Public License along
       with HTML-FormatExternal.  If not, see <>.

perl v5.32.1			  2015-08-06	       HTML::FormatExternal(3)

