HTML::StripScripts(3)  User Contributed Perl Documentation  HTML::StripScripts(3)

NAME
       HTML::StripScripts - Strip scripting constructs out of HTML

SYNOPSIS
	 use HTML::StripScripts;

	 my $hss = HTML::StripScripts->new({ Context =>	'Inline' });

	 $hss->input_start_document;

	 $hss->input_start('<i>');
	 $hss->input_text('hello, world!');
	 $hss->input_end('</i>');

	 $hss->input_end_document;

	 print $hss->filtered_document;

DESCRIPTION
       This module strips scripting constructs out of HTML, leaving as much
       non-scripting markup in place as	possible.  This	allows web
       applications to display HTML originating	from an	untrusted source
       without introducing XSS (cross site scripting) vulnerabilities.

       You will	probably use HTML::StripScripts::Parser	rather than using this
       module directly.

       The process is based on whitelists of tags, attributes and attribute
       values.	This approach is the most secure against disguised scripting
       constructs hidden in malicious HTML documents.

       As well as removing scripting constructs, this module ensures that
       there is	a matching end for each	start tag, and that the	tags are
       properly	nested.
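
       A minimal sketch of this balancing behaviour, assuming the default
       whitelists: two start tags are fed with no end tags, and the filter
       is expected to close both when the document ends.

```perl
use strict;
use warnings;
use HTML::StripScripts;

my $hss = HTML::StripScripts->new({ Context => 'Inline' });

$hss->input_start_document;
$hss->input_start('<b>');
$hss->input_start('<i>');
$hss->input_text('unclosed');
# No input_end() calls here: the filter is left to close <i> and <b>.
$hss->input_end_document;

my $balanced = $hss->filtered_document;
print "$balanced\n";
```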

       Previously, in order to customise the output, you needed	to subclass
       "HTML::StripScripts" and	override methods.  Now,	most customisation can
       be done through the "Rules" option provided to "new()". (See
       examples/declaration/ and examples/tags/	for cases where	subclassing is
       necessary.)

       The HTML	document must be parsed	into start tags, end tags and text
       before it can be	filtered by this module.  Use either
       HTML::StripScripts::Parser or HTML::StripScripts::Regex instead if you
       want to input an	unparsed HTML document.

       See examples/direct/ for	an example of how to feed tokens directly to
	HTML::StripScripts.

CONSTRUCTORS
       new ( CONFIG )
	   Creates a new "HTML::StripScripts" filter object, bound to a
	   particular filtering	policy.	 If present, the CONFIG	parameter must
	   be a	hashref.  The following	keys are recognized (unrecognized keys
	   will	be silently ignored).

               $s = HTML::StripScripts->new({
		   Context	   => 'Document|Flow|Inline|NoTags',
		   BanList	   => [qw( br img )] | {br => '1', img => '1'},
		   BanAllBut	   => [qw(p div	span)],
		   AllowSrc	   => 0|1,
		   AllowHref	   => 0|1,
		   AllowRelURL	   => 0|1,
		   AllowMailto	   => 0|1,
		   EscapeFiltered  => 0|1,
		   Rules	   => {	See below for details },
	       });

	   "Context"
	       A string	specifying the context in which	the filtered document
	       will be used.  This influences the set of tags that will	be
	       allowed.

	       If present, the "Context" value must be one of:

	       "Document"
		   If "Context"	is "Document" then the filter will allow a
		   full	HTML document, including the "HTML" tag	and "HEAD" and
		   "BODY" sections.

	       "Flow"
		   If "Context"	is "Flow" then most of the cosmetic tags that
		   one would expect to find in a document body are allowed,
		   including lists and tables but not including	forms.

	       "Inline"
		   If "Context"	is "Inline" then only inline tags such as "B"
		   and "FONT" are allowed.

	       "NoTags"
		   If "Context"	is "NoTags" then no tags are allowed.

	       The default "Context" value is "Flow".
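
               As a rough illustration, the sketch below feeds the same
               three tokens through a "Flow" filter and an "Inline" filter.
               The helper name filter_paragraph is made up for this example.

```perl
use strict;
use warnings;
use HTML::StripScripts;

# Feed <p>hi</p> through a filter built for the given context.
sub filter_paragraph {
    my ($context) = @_;
    my $hss = HTML::StripScripts->new({ Context => $context });
    $hss->input_start_document;
    $hss->input_start('<p>');
    $hss->input_text('hi');
    $hss->input_end('</p>');
    $hss->input_end_document;
    return $hss->filtered_document;
}

my $flow   = filter_paragraph('Flow');      # <p> is a body-level tag
my $inline = filter_paragraph('Inline');    # <p> is not an inline tag

print "Flow:   $flow\n";
print "Inline: $inline\n";
```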

	   "BanList"
	       If present, this	option must be an arrayref or a	hashref.  Any
	       tag that	would normally be allowed (because it presents no XSS
	       hazard) will be blocked if the lowercase	name of	the tag	is in
	       this list.

	       For example, in a guestbook application where "HR" tags are
	       used to separate	posts, you may wish to prevent posts from
	       including "HR" tags, even though	"HR" is	not an XSS risk.
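
               A small sketch of the guestbook case, assuming the post has
               already been split into tokens:

```perl
use strict;
use warnings;
use HTML::StripScripts;

# "hr" is harmless, but this application reserves it as a post separator.
my $hss = HTML::StripScripts->new({ BanList => ['hr'] });

$hss->input_start_document;
$hss->input_text('first line');
$hss->input_start('<hr>');
$hss->input_text('second line');
$hss->input_end_document;

my $post = $hss->filtered_document;
print "$post\n";
```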

	   "BanAllBut"
               If present, this option must be a reference to an array holding a
	       list of lowercase tag names.  This has the effect of adding all
	       but the listed tags to the ban list, so that only those tags
	       listed will be allowed.

	   "AllowSrc"
	       By default, the filter won't allow constructs that cause	the
	       browser to fetch	things automatically, such as "SRC" attributes
	       in "IMG"	tags.  If this option is present and true then those
	       constructs will be allowed.

	   "AllowHref"
	       By default, the filter won't allow constructs that cause	the
	       browser to fetch	things if the user clicks on something,	such
	       as the "HREF" attribute in "A" tags.  Set this option to	a true
	       value to	allow this type	of construct.
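
               The effect can be seen by filtering the same link twice, once
               with and once without the option. This is only a sketch: the
               helper name filter_link and the example.com URL are
               placeholders.

```perl
use strict;
use warnings;
use HTML::StripScripts;

# Filter one link with the given extra configuration options.
sub filter_link {
    my (%config) = @_;
    my $hss = HTML::StripScripts->new({%config});
    $hss->input_start_document;
    $hss->input_start('<a href="http://example.com/">');
    $hss->input_text('link');
    $hss->input_end('</a>');
    $hss->input_end_document;
    return $hss->filtered_document;
}

my $without = filter_link();                  # href is not allowed through
my $with    = filter_link( AllowHref => 1 );  # href is validated and kept

print "default:   $without\n";
print "AllowHref: $with\n";
```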

	   "AllowRelURL"
	       By default, the filter won't allow relative URLs	such as
	       "../foo.html" in	"SRC" and "HREF" attribute values.  Set	this
	       option to a true	value to allow them. "AllowHref" and / or
	       "AllowSrc" also need to be set to true for this to have any
	       effect.

	   "AllowMailto"
	       By default, "mailto:" links are not allowed. If "AllowMailto"
	       is set to a true	value, then this construct will	be allowed.
	       This can	be enabled separately from AllowHref.

	   "EscapeFiltered"
	       By default, any filtered	tags are outputted as
	       "<!--filtered-->". If "EscapeFiltered" is set to	a true value,
	       then the	filtered tags are converted to HTML entities.

	       For instance:

		 <br>  -->  &lt;br&gt;

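               A short sketch of the same thing in code. "br" is banned here
               only so that there is something to filter.

```perl
use strict;
use warnings;
use HTML::StripScripts;

my $hss = HTML::StripScripts->new({
    BanList        => ['br'],
    EscapeFiltered => 1,
});

$hss->input_start_document;
$hss->input_text('one');
$hss->input_start('<br>');    # rejected, so it is escaped rather than replaced
$hss->input_text('two');
$hss->input_end_document;

my $escaped = $hss->filtered_document;
print "$escaped\n";
```
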
	   "Rules"
	       The "Rules" option provides a very flexible way of customising
	       the filter.

               The focus is safety-first, so it is applied after all of the
               previous validation.  This means that all malicious data should
               already have been cleared before any rule is consulted.

	       Rules can be specified for tags and for attributes. Any tag or
	       attribute not explicitly	listed will be handled by the default
	       "*" rules.

	       The following is	a synopsis of all of the options that you can
	       use to configure	rules.	Below, an example is broken into
	       sections	and explained.

		Rules => {

		    tag	=> 0 | 1 | sub { tag_callback }
			   | {
			       attr	 => 0 |	1 | 'regex' | qr/regex/	| sub {	attr_callback},
			       '*'	 => 0 |	1 | 'regex' | qr/regex/	| sub {	attr_callback},
			       required	 => [qw(attrname attrname)],
			       tag	 => sub	{ tag_callback }
			     },

		   '*' => 0 | 1	| sub {	tag_callback }
			  | {
			      attr => 0	| 1 | 'regex' |	qr/regex/ | sub	{ attr_callback},
			      '*'  => 0	| 1 | 'regex' |	qr/regex/ | sub	{ attr_callback},
			      tag  => sub { tag_callback }
			    }

		   }

	       EXAMPLE:

		   Rules => {

		       ##########################
		       ##### EXPLICIT RULES #####
		       ##########################

		       ## Allow	<br> tags, reject <img>	tags
		       br	   => 1,
		       img	   => 0,

		       ## Send all <div> tags to a sub
		       div	   => sub { tag_callback },

		       ## Allow	<blockquote> tags,and allow the	'cite' attribute
                       ## All other attributes are handled by the default '*'
		       blockquote  => {
			   cite	   => 1,
		       },

		       ## Allow	<a> tags, and
		       a  => {

			   ## Allow the	'title'	attribute
			   title     =>	1,

			   ## Allow the	'href' attribute if it matches the regex
                           href      => '^http://yourdomain.com',
                           ## or, equivalently, as a compiled regex:
                           ## href   => qr{^http://yourdomain.com},

			   ## 'style' attributes are handled by	a sub
			   style     =>	sub { attr_callback },

			   ## All other	attributes are rejected
			   '*'	     =>	0,

			   ## Additionally, the	<a> tag	should be handled by this sub
			   tag	     =>	sub { tag_callback},

			   ## If the <a> tag doesn't have these	attributes, filter the tag
			   required  =>	[qw(href title)],

		       },

		       ##########################
		       ##### DEFAULT RULES #####
		       ##########################

		       ## The default '*' rule - accepts all the same options as above.
		       ## If a tag or attribute	is not mentioned above,	then the default
		       ## rule is applied:

		       ## Reject all tags
		       '*'	   => 0,

		       ## Allow	all tags and all attributes
		       '*'	   => 1,

		       ## Send all tags	to the sub
		       '*'	   => sub { tag_callback },

		       ## Allow	all tags, reject all attributes
		       '*'	   => {	'*'  =>	0 },

		       ## Allow	all tags, and
		       '*' => {

			   ## Allow the	'title'	attribute
			   title   => 1,

			   ## Allow the	'href' attribute if it matches the regex
                           href    => '^http://yourdomain.com',
                           ## or, equivalently, as a compiled regex:
                           ## href => qr{^http://yourdomain.com},

			   ## 'style' attributes are handled by	a sub
			   style   => sub { attr_callback },

			   ## All other	attributes are rejected
			   '*'	   => 0,

			   ## Additionally, all	tags should be handled by this sub
			   tag	   => sub { tag_callback},

		       },
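
               The fragments above are schematic. A runnable sketch of one
               of them follows, restricting "href" on "<a>" tags to a single
               site and rejecting every other attribute. The host names and
               the helper name filter_anchor are placeholders, and
               "AllowHref" is switched on so that the earlier validation
               lets "href" reach the rule at all.

```perl
use strict;
use warnings;
use HTML::StripScripts;

my %config = (
    AllowHref => 1,
    Rules     => {
        a => {
            href => qr{^http://example\.com/},   # only links to this site
            '*'  => 0,                           # no other attributes
        },
    },
);

# Filter a single <a> start tag wrapped around some text.
sub filter_anchor {
    my ($start_tag) = @_;
    my $hss = HTML::StripScripts->new({%config});
    $hss->input_start_document;
    $hss->input_start($start_tag);
    $hss->input_text('link');
    $hss->input_end('</a>');
    $hss->input_end_document;
    return $hss->filtered_document;
}

my $own   = filter_anchor('<a href="http://example.com/page" title="t">');
my $other = filter_anchor('<a href="http://elsewhere.test/">');

print "$own\n$other\n";
```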

	       Tag Callbacks
		       sub tag_callback	{
			   my ($filter,$element) = (@_);

			   $element = {
			       tag	=> 'tag',
			       content	=> 'inner_html',
			       attr	=> {
				   attr_name =>	'attr_value',
			       }
			   };
			   return 0 | 1;
		       }

		   A tag callback accepts two parameters, the $filter object
                   and the $element.  It should return 0 to completely
		   ignore the tag and its content (which includes any nested
		   HTML	tags), or 1 to accept and output the tag.

		   The $element	is a hash ref containing the keys:

	       "tag"
		   This	is the tagname in lowercase, eg	"a", "br", "img". If
		   you set the tag value to an empty string, then the tag will
		   not be outputted, but the tag contents will.

	       "content"
		   This	is the equivalent of DOM's innerHTML. It contains the
		   text	content	and any	HTML tags contained within this
		   element. You	can change the content or set it to an empty
		   string so that it is	not outputted.

	       "attr"
		   "attr" contains a hashref containing	the attribute names
                   and values.

	       If for instance,	you wanted to replace "<b>" tags with "<span>"
	       tags, you could do this:

		   sub b_callback {
		       my ($filter,$element)   = @_;
		       $element->{tag}	       = 'span';
		       $element->{attr}{style} = 'font-weight:bold';
		       return 1;
		   }
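
               To show how such a callback might be wired in, here is a
               sketch that registers b_callback as the rule for "b" tags.
               Passing the coderef directly as the tag rule follows the
               "tag => sub { tag_callback }" form in the synopsis above.

```perl
use strict;
use warnings;
use HTML::StripScripts;

# Rewrite <b> as <span style="font-weight:bold">.
sub b_callback {
    my ( $filter, $element ) = @_;
    $element->{tag}         = 'span';
    $element->{attr}{style} = 'font-weight:bold';
    return 1;    # accept and output the (rewritten) tag
}

my $hss = HTML::StripScripts->new({
    Context => 'Inline',
    Rules   => { b => \&b_callback },
});

$hss->input_start_document;
$hss->input_start('<b>');
$hss->input_text('hello');
$hss->input_end('</b>');
$hss->input_end_document;

my $rewritten = $hss->filtered_document;
print "$rewritten\n";
```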

	   Attribute Callbacks
		   sub attr_callback {
		       my ( $filter, $tag, $attr_name, $attr_val ) = @_;
		       return undef | '' | 'value';
		   }

	       Attribute callbacks accept four parameters, the $filter object,
	       the $tag	name, the $attr_name and the $attr_value.

	       It should return	either "undef" to reject the attribute,	or the
	       value to	be used. An empty string keeps the attribute, but
	       without a value.
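
               As an illustration only, the callback below keeps links to one
               placeholder host, rewrites them to https, and rejects
               everything else. The names upgrade_href and filter_anchor are
               made up for this sketch.

```perl
use strict;
use warnings;
use HTML::StripScripts;

# Attribute callback: return undef to reject, or the value to use.
sub upgrade_href {
    my ( $filter, $tag, $attr_name, $attr_val ) = @_;
    return undef unless $attr_val =~ m{^https?://example\.com/};
    $attr_val =~ s{^http://}{https://};
    return $attr_val;
}

sub filter_anchor {
    my ($start_tag) = @_;
    my $hss = HTML::StripScripts->new({
        AllowHref => 1,
        Rules     => { a => { href => \&upgrade_href } },
    });
    $hss->input_start_document;
    $hss->input_start($start_tag);
    $hss->input_text('link');
    $hss->input_end('</a>');
    $hss->input_end_document;
    return $hss->filtered_document;
}

my $upgraded = filter_anchor('<a href="http://example.com/page">');
my $rejected = filter_anchor('<a href="http://elsewhere.test/">');

print "$upgraded\n$rejected\n";
```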

	   "BanList" vs	"BanAllBut" vs "Rules"
               It is not necessary to use "BanList" or "BanAllBut", since
               everything can be done via "Rules".  However, it may be
               simpler to write:

		   BanAllBut =>	[qw(p div span)]

	       The logic works as follows:

		  * If BanAllBut exists, then ban everything but the tags in the list
		  * Add	to the ban list	any elements in	BanList
		  * Any	tags mentioned explicitly in Rules (eg a => 0, br => 1)
		    are	added or removed from the BanList
		  * A default rule of {	'*' => 0 } would ban all tags except
		    those mentioned in Rules
		  * A default rule of {	'*' => 1 } would allow all tags	except
		    those disallowed in	the ban	list, or by explicit rules

METHODS
       This class provides the following methods:

       hss_init	()
	   This	method is called by new() and does the actual initialisation
	   work	for the	new HTML::StripScripts object.

       input_start_document ()
	   This	method initializes the filter, and must	be called once before
	   starting on each HTML document to be	filtered.

       input_start ( TEXT )
	   Handles a start tag from the	input document.	 TEXT must be the full
	   text	of the tag, including angle-brackets.

       input_end ( TEXT	)
	   Handles an end tag from the input document.	TEXT must be the full
	   text	of the end tag,	including angle-brackets.

       input_text ( TEXT )
	   Handles some	non-tag	text from the input document.

       input_process ( TEXT )
	   Handles a processing	instruction from the input document.

       input_comment ( TEXT )
	   Handles an HTML comment from	the input document.

       input_declaration ( TEXT	)
           Handles a declaration from the input document.

       input_end_document ()
	   Call	this method to signal the end of the input document.

       filtered_document ()
	   Returns the filtered	document as a string.

SUBCLASSING
       The only	reason for subclassing this module now is to add to the	list
       of accepted tags, attributes and	styles (See "WHITELIST INITIALIZATION
       METHODS").  Everything else can be achieved with	"Rules".

       The "HTML::StripScripts"	class is subclassable.	Filter objects are
       plain hashes and	"HTML::StripScripts" reserves only hash	keys that
       start with "_hss".  The filter configuration can	be set up by invoking
       the hss_init() method, which takes the same arguments as	new().

OUTPUT METHODS
       The filter outputs a stream of start tags, end tags, text, comments,
       declarations and	processing instructions, via the following "output_*"
       methods.	 Subclasses may	override these to intercept the	filter output.

       The default implementations of the "output_*" methods pass the text on
       to the output() method.	The default implementation of the output()
       method appends the text to a string, which can be fetched with the
       filtered_document() method once processing is complete.

       If the output() method or the individual	"output_*" methods are
       overridden in a subclass, then filtered_document() will not work	in
       that subclass.

       output_start_document ()
	   This	method gets called once	at the start of	each HTML document
	   passed through the filter.  The default implementation does
	   nothing.

       output_end_document ()
	   This	method gets called once	at the end of each HTML	document
	   passed through the filter.  The default implementation does
	   nothing.

       output_start ( TEXT )
	   This	method is used to output a filtered start tag.

       output_end ( TEXT )
	   This	method is used to output a filtered end	tag.

       output_text ( TEXT )
	   This	method is used to output some filtered non-tag text.

       output_declaration ( TEXT )
	   This	method is used to output a filtered declaration.

       output_comment (	TEXT )
	   This	method is used to output a filtered HTML comment.

       output_process (	TEXT )
	   This	method is used to output a filtered processing instruction.

       output (	TEXT )
	   This	method is invoked by all of the	default	"output_*" methods.
	   The default implementation appends the text to the string that the
	   filtered_document() method will return.

       output_stack_entry ( TEXT )
	   This	method is invoked when a tag plus all text and nested HTML
	   content within the tag has been processed. It adds the tag plus its
	   content to the content for its parent tag.

REJECT METHODS
       When the	filter encounters something in the input document which	it
       cannot transform	into an	acceptable construct, it invokes one of	the
       following "reject_*" methods to put something in	the output document to
       take the	place of the unacceptable construct.

       The TEXT	parameter is the full text of the unacceptable construct.

       The default implementations of these methods output an HTML comment
       containing the text "filtered". If "EscapeFiltered" is set to true,
       then the	rejected text is HTML escaped instead.

       Subclasses may override these methods, but should exercise caution.
       The TEXT	parameter is unfiltered	input and may contain malicious
       constructs.

       reject_start ( TEXT )
       reject_end ( TEXT )
       reject_text ( TEXT )
       reject_declaration ( TEXT )
       reject_comment (	TEXT )
       reject_process (	TEXT )
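
       A sketch of a subclass that drops rejected tags silently instead of
       leaving a comment behind. The package name My::QuietFilter is
       hypothetical. The overrides never touch TEXT, which keeps the
       unfiltered input out of the output.

```perl
use strict;
use warnings;
use HTML::StripScripts;

package My::QuietFilter;
use parent -norequire, 'HTML::StripScripts';

# Emit nothing at all for rejected start and end tags.
sub reject_start { return }
sub reject_end   { return }

package main;

my $hss = My::QuietFilter->new({ Context => 'Inline' });

$hss->input_start_document;
$hss->input_start('<p>');    # not an inline tag, so it is rejected
$hss->input_text('hi');
$hss->input_end('</p>');
$hss->input_end_document;

my $quiet = $hss->filtered_document;
print "$quiet\n";
```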

WHITELIST INITIALIZATION METHODS
       The filter refers to various whitelists to determine which constructs
       are acceptable.	To modify these	whitelists, subclasses can override
       the following methods.

       Each method is called once at object initialization time, and must
       return a	reference to a nested data structure.  These references	are
       installed into the object, and used whenever the	filter needs to	refer
       to a whitelist.

       The default implementations of these methods can	be invoked as class
       methods.

       See examples/tags/ and examples/declaration/ for	examples of how	to
       override	these methods.

       init_context_whitelist ()
	   Returns a reference to the "Context"	whitelist, which determines
	   which tags may appear at each point in the document,	and which
	   other tags may be nested within them.

	   It is a hash, and the keys are context names, such as "Flow"	and
	   "Inline".

	   The values in the hash are hashrefs.	 The keys in these subhashes
	   are lowercase tag names, and	the values are context names,
	   specifying the context that the tag provides	to any other tags
	   nested within it.

	   The special context "EMPTY" as a value in a subhash indicates that
	   nothing can be nested within	that tag.

       init_attrib_whitelist ()
	   Returns a reference to the "Attrib" whitelist, which	determines
	   which attributes each tag can have and the values that those
	   attributes can take.

	   It is a hash, and the keys are lowercase tag	names.

	   The values in the hash are hashrefs.	 The keys in these subhashes
	   are lowercase attribute names, and the values are attribute value
	   class names,	which are short	strings	describing the type of values
	   that	the attribute can take,	such as	"color"	or "number".

       init_attval_whitelist ()
	   Returns a reference to the "AttVal" whitelist, which	is a hash that
	   maps	attribute value	class names from the "Attrib" whitelist	to
	   coderefs to subs to validate	(and optionally	transform) a
	   particular attribute	value.

	   The filter calls the	attribute value	validation subs	with the
	   following parameters:

	   "filter"
	       A reference to the filter object.

	   "tagname"
	       The lowercase name of the tag in	which the attribute appears.

	   "attrname"
	       The name	of the attribute.

	   "attrval"
	       The attribute value found in the	input document,	in canonical
	       form (see "CANONICAL FORM").

	   The validation sub can return undef to indicate that	the attribute
	   should be removed from the tag, or it can return the	new value for
	   the attribute, in canonical form.
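
           The following sketch ties the "Attrib" and "AttVal" whitelists
           together. It assumes a hypothetical subclass My::NoteFilter that
           lets "<span>" carry a made-up "note" attribute, validated by a
           new value class that is also called "note". The default
           whitelists are copied rather than modified in place, on the
           assumption that the references returned by the base class may be
           shared.

```perl
use strict;
use warnings;
use HTML::StripScripts;

package My::NoteFilter;
use parent -norequire, 'HTML::StripScripts';

# Copy the default Attrib whitelist and add span/note => "note" class.
sub init_attrib_whitelist {
    my %attrib = %{ HTML::StripScripts->init_attrib_whitelist };
    $attrib{span} = { %{ $attrib{span} || {} }, note => 'note' };
    return \%attrib;
}

# Copy the default AttVal whitelist and add a validator for the new class.
sub init_attval_whitelist {
    my %attval = %{ HTML::StripScripts->init_attval_whitelist };
    $attval{note} = sub {
        my ( $filter, $tagname, $attrname, $attrval ) = @_;
        return $attrval =~ /^([a-z]{1,20})$/ ? $1 : undef;
    };
    return \%attval;
}

package main;

sub filter_span {
    my ($class) = @_;
    my $hss = $class->new({ Context => 'Inline' });
    $hss->input_start_document;
    $hss->input_start('<span note="todo">');
    $hss->input_text('hi');
    $hss->input_end('</span>');
    $hss->input_end_document;
    return $hss->filtered_document;
}

my $custom = filter_span('My::NoteFilter');
my $stock  = filter_span('HTML::StripScripts');

print "$custom\n$stock\n";
```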

       init_style_whitelist ()
	   Returns a reference to the "Style" whitelist, which determines
	   which CSS style directives are permitted in "style" tag attributes.
	   The keys are	value names such as "color" and	"background-color",
	   and the values are class names to be	used as	keys into the "AttVal"
	   whitelist.

       init_deinter_whitelist ()
	   Returns a reference to the "DeInter"	whitelist, which determines
	   which inline	tags the filter	should attempt to automatically	de-
	   interleave if they are encountered interleaved.  For	example, the
	   filter will transform:

	     <b>hello <i>world</b> !</i>

	   Into:

	     <b>hello <i>world</i></b><i> !</i>

	   because both	"b" and	"i" appear as keys in the "DeInter" whitelist.
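
           The same transformation, driven token by token. This is a sketch
           that assumes the default "DeInter" whitelist.

```perl
use strict;
use warnings;
use HTML::StripScripts;

my $hss = HTML::StripScripts->new({ Context => 'Inline' });

$hss->input_start_document;
$hss->input_start('<b>');
$hss->input_text('hello ');
$hss->input_start('<i>');
$hss->input_text('world');
$hss->input_end('</b>');     # closes <b> while <i> is still open
$hss->input_text(' !');
$hss->input_end('</i>');
$hss->input_end_document;

my $untangled = $hss->filtered_document;
print "$untangled\n";
```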

CHARACTER DATA PROCESSING
       These methods transform attribute values	and non-tag text from the
       input document into canonical form (see "CANONICAL FORM"), and
       transform text in canonical form	into a suitable	form for the output
       document.

       text_to_canonical_form (	TEXT )
	   This	method is used to reduce non-tag text from the input document
	   to canonical	form before passing it to the filter_text() method.

	   The default implementation unescapes	all entities that map to
	   "US-ASCII" characters other than ampersand, and replaces any
	   ampersands that don't form part of valid entities with "&amp;".

       quoted_to_canonical_form	( VALUE	)
	   This	method is used to reduce attribute values quoted with
	   doublequotes	or singlequotes	to canonical form before passing it to
	   the handler subs in the "AttVal" whitelist.

	   The default behavior	is the same as that of
	   "text_to_canonical_form()", plus it converts	any CR,	LF or TAB
	   characters to spaces.

       unquoted_to_canonical_form ( VALUE )
	   This	method is used to reduce attribute values without quotes to
	   canonical form before passing it to the handler subs	in the
	   "AttVal" whitelist.

	   The default implementation simply replaces all ampersands with
	   "&amp;", since that corresponds with	the way	most browsers treat
	   entities in unquoted	values.

       canonical_form_to_text (	TEXT )
	   This	method is used to convert the text in canonical	form returned
	   by the filter_text()	method to a form suitable for inclusion	in the
	   output document.

	   The default implementation runs anything that doesn't look like a
	   valid entity	through	the escape_html_metachars() method.
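
           A sketch of the round trip through canonical form, using the
           default implementations. The input mixes an entity that maps to
           an ASCII character with a bare ampersand.

```perl
use strict;
use warnings;
use HTML::StripScripts;

my $hss = HTML::StripScripts->new({ Context => 'Inline' });

$hss->input_start_document;
# "&lt;" is unescaped on the way in and re-escaped on the way out;
# the bare "&" in "AT&T" is not part of an entity, so it becomes "&amp;".
$hss->input_text('1 &lt; 2 at AT&T');
$hss->input_end_document;

my $text = $hss->filtered_document;
print "$text\n";
```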

       canonical_form_to_attval	( ATTVAL )
	   This	method is used to convert the text in canonical	form returned
	   by the "AttVal" handler subs	to a form suitable for inclusion in
	   doublequotes	in the output tag.

	   The default implementation converts CR, LF and TAB characters to a
	   single space, and runs anything that	doesn't	look like a valid
	   entity through the escape_html_metachars() method.

       validate_href_attribute ( TEXT )
	   If the "AllowHref" filter configuration option is set, then this
	   method is used to validate "href" type attribute values.  TEXT is
	   the attribute value in canonical form.  Returns a possibly modified
	   attribute value (in canonical form) or "undef" to reject the
	   attribute.

	   The default implementation allows only absolute "http" and "https"
	   URLs, permits port numbers and query	strings, and imposes
	   reasonable length limits.

	   It does not URI escape the query string, and	it does	not guarantee
           properly formatted URIs; it only tries to produce safe URIs. You can
	   always use an attribute callback (see "Attribute Callbacks")	to
	   provide stricter handling.
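
           Overriding the method in a subclass is another option. The
           sketch below, with the hypothetical package name My::HttpsOnly,
           runs the default check first and then additionally insists on
           https.

```perl
use strict;
use warnings;
use HTML::StripScripts;

package My::HttpsOnly;
use parent -norequire, 'HTML::StripScripts';

sub validate_href_attribute {
    my ( $self, $text ) = @_;
    # Let the default implementation decide what is safe, then narrow it.
    my $checked = $self->SUPER::validate_href_attribute($text);
    return undef unless defined $checked && $checked =~ m{^https://};
    return $checked;
}

package main;

sub filter_anchor {
    my ($start_tag) = @_;
    my $hss = My::HttpsOnly->new({ AllowHref => 1 });
    $hss->input_start_document;
    $hss->input_start($start_tag);
    $hss->input_text('link');
    $hss->input_end('</a>');
    $hss->input_end_document;
    return $hss->filtered_document;
}

my $secure = filter_anchor('<a href="https://example.com/">');
my $plain  = filter_anchor('<a href="http://example.com/">');

print "$secure\n$plain\n";
```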

       validate_mailto ( TEXT )
	   If the "AllowMailto"	filter configuration option is set, then this
	   method is used to validate "href" type attribute values which begin
	   with	"mailto:".  TEXT is the	attribute value	in canonical form.
	   Returns a possibly modified attribute value (in canonical form) or
	   "undef" to reject the attribute.

	   This	uses a lightweight regex and does not guarantee	that email
	   addresses are properly formatted. You can always use	an attribute
	   callback (see "Attribute Callbacks")	to provide stricter handling.

       validate_src_attribute (	TEXT )
	   If the "AllowSrc" filter configuration option is set, then this
	   method is used to validate "src" type attribute values.  TEXT is
	   the attribute value in canonical form.  Returns a possibly modified
	   attribute value (in canonical form) or "undef" to reject the
	   attribute.

	   The default implementation behaves as validate_href_attribute().

OTHER METHODS TO OVERRIDE
       As well as the output, reject, init and cdata methods listed above, it
       might make sense	for subclasses to override the following methods:

       filter_text ( TEXT )
	   This	method will be invoked to filter blocks	of non-tag text	in the
	   input document.  Both input and output are in canonical form, see
	   "CANONICAL FORM".

	   The default implementation does no filtering.
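
           A sketch of a subclass that does filter text. The package name
           My::PoliteFilter and the word being masked are invented for the
           example. Because the text is in canonical form, the substitution
           is written so that it cannot disturb entities.

```perl
use strict;
use warnings;
use HTML::StripScripts;

package My::PoliteFilter;
use parent -norequire, 'HTML::StripScripts';

# Receives and returns text in canonical form.
sub filter_text {
    my ( $self, $text ) = @_;
    $text =~ s/\bdarn\b/d***/g;
    return $text;
}

package main;

my $hss = My::PoliteFilter->new({ Context => 'Inline' });

$hss->input_start_document;
$hss->input_text('well, darn it');
$hss->input_end_document;

my $polite = $hss->filtered_document;
print "$polite\n";
```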

       escape_html_metachars ( TEXT )
	   This	method is used to escape all HTML metacharacters in TEXT.  The
	   return value	must be	a copy of TEXT with metacharacters escaped.

	   The default implementation escapes a	minimal	set of metacharacters
	   for security	against	XSS vulnerabilities.  The set of characters to
	   escape is a compromise between the need for security	and the	need
	   to ensure that the filter will work for documents in	as many
	   different character sets as possible.

	   Subclasses which make strong	assumptions about the document
	   character set will be able to escape	much more aggressively.

       strip_nonprintable ( TEXT )
	   Returns a copy of TEXT with runs of nonprintable characters
	   replaced with spaces	or some	other harmless string.	Avoids
	   replacing anything with the empty string, as	that can lead to other
	   security issues.

	   The default implementation strips out only NULL characters, in
	   order to avoid scrambling text for as many different	character sets
	   as possible.

	   Subclasses which make some sort of assumption about the character
	   set in use will be able to have a much wider	definition of a
	   nonprintable	character, and hence a more secure
	   strip_nonprintable()	implementation.

ATTRIBUTE VALUE	HANDLER	SUBS
       References to the following subs	appear in the "AttVal" whitelist
       returned	by the init_attval_whitelist() method.

       _hss_attval_style( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
           Attribute value handler for the "style" attribute.

       _hss_attval_size	( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
           Attribute value handler for attributes whose values are some sort
	   of size or length.

       _hss_attval_number ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
           Attribute value handler for attributes whose values are a simple
	   integer.

       _hss_attval_color ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
	   Attribute value handler for color attributes.

       _hss_attval_text	( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
	   Attribute value handler for text attributes.

       _hss_attval_word	( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
           Attribute value handler for attributes whose values must consist of
	   a single short word,	with minus characters permitted.

       _hss_attval_wordlist ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
           Attribute value handler for attributes whose values must consist of
	   one or more words, separated	by spaces and/or commas.

       _hss_attval_wordlistq ( FILTER, TAGNAME,	ATTRNAME, ATTRVAL )
           Attribute value handler for attributes whose values must consist of
	   one or more words, separated	by commas, with	optional doublequotes
	   around words	and spaces allowed within the doublequotes.

       _hss_attval_href	( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
	   Attribute value handler for "href" type attributes.	If the
	   "AllowHref" or "AllowMailto"	configuration options are set, uses
	   the validate_href_attribute() method	to check the attribute value.

       _hss_attval_src ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
	   Attribute value handler for "src" type attributes.  If the
	   "AllowSrc" configuration option is set, uses	the
	   validate_src_attribute() method to check the	attribute value.

       _hss_attval_stylesrc ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
	   Attribute value handler for "src" type style	pseudo attributes.

       _hss_attval_novalue ( FILTER, TAGNAME, ATTRNAME,	ATTRVAL	)
	   Attribute value handler for attributes that have no value or	a
	   value that is ignored.  Just	returns	the attribute name as the
	   value.

CANONICAL FORM
       Many of the methods described above deal	with text from the input
       document, encoded in what I call	"canonical form", defined as follows:

       All characters other than ampersands represent themselves.  Literal
       ampersands are encoded as "&amp;".  Non "US-ASCII" characters may
       appear as literals in whatever character	set is in use, or they may
       appear as named or numeric HTML entities	such as	"&aelig;", "&#31337;"
       and "&#xFF;".  Unknown named entities such as "&foo;" may appear.

       The idea is to be able to reduce input text to a minimal
       form, without making too	many assumptions about the character set in
       use.

PRIVATE	METHODS
       The following methods are internal to this class, and should not	be
       invoked from elsewhere.	Subclasses should not use or override these
       methods.

       _hss_prepare_ban_list (CFG)
	   Returns a hash ref representing all the banned tags,	based on the
           values of BanList and BanAllBut.

       _hss_prepare_rules (CFG)
	   Returns a hash ref representing the tag and attribute rules (See
	   "Rules").

	   Returns undef if no filters are specified, in which case the
	   attribute filter code has very little performance impact. If	any
	   rules are specified,	then every tag and attribute is	checked.

       _hss_get_attr_filter ( DEFAULT_FILTERS, TAG_FILTERS, ATTR_NAME )
	   Returns the attribute filter	rule to	apply to this particular
	   attribute.

	   Checks for:

	     - a named attribute rule in a named tag
	     - a default * attribute rule in a named tag
	     - a named attribute rule in the default * rules
	     - a default * attribute rule in the default * rules

       _hss_join_attribs (FILTERED_ATTRIBS)
	   Accepts a hash ref containing the attribute names as	the keys, and
	   the attribute values	as the values.	Escapes	them and returns a
           string ready for output to HTML.

       _hss_decode_numeric ( NUMERIC )
	   Returns the string that should replace the numeric entity NUMERIC
	   in the text_to_canonical_form() method.

       _hss_tag_is_banned ( TAGNAME )
	   Returns true	if the lower case tag name TAGNAME is on the list of
	   harmless tags that the filter is configured to block, false
	   otherwise.

       _hss_get_to_valid_context ( TAG )
	   Tries to get	the filter to a	context	in which the tag TAG is
	   allowed, by introducing extra end tags or start tags	if necessary.
	   TAG can be either the lower case name of a tag or the string
	   'CDATA'.

	   Returns 1 if	an allowed context is reached, or 0 if there's no
	   reasonable way to get to an allowed context and the tag should just
	   be rejected.

       _hss_close_innermost_tag	()
	   Closes the innermost	open tag.

       _hss_context ()
	   Returns the current named context of	the filter.

       _hss_valid_in_context ( TAG, CONTEXT )
	   Returns true	if the lowercase tag name TAG is valid in context
	   CONTEXT, false otherwise.

       _hss_valid_in_current_context ( TAG )
	   Returns true	if the lowercase tag name TAG is valid in the filter's
	   current context, false otherwise.

BUGS AND LIMITATIONS
       Performance
	   This	module does a lot of work to ensure that tags are correctly
	   nested and are not left open, causing unnecessary overhead for
	   applications	where that doesn't matter.

	   Such	applications may benefit from using the	more lightweight
	   HTML::Scrubber::StripScripts	module instead.

       Strictness
	   URIs	and email addresses are	cleaned	up to be safe, but not
	   necessarily accurate.  That would have required adding
	   dependencies.  Attribute callbacks can be used to add this
	   functionality if required, or the validation	methods	can be
	   overridden.

	   By default, filtered	HTML may not be	valid strict XHTML, for
	   instance empty required attributes may be outputted.	 However, with
	   "Rules", it should be possible to force the HTML to validate.

       REPORTING BUGS
	   Please report any bugs or feature requests to
	   bug-html-stripscripts@rt.cpan.org, or through the web interface at
	   <http://rt.cpan.org>.

SEE ALSO
       HTML::Parser, HTML::StripScripts::Parser, HTML::StripScripts::Regex

AUTHOR
       Original	author Nick Cleaton <nick@cleaton.net>

       New code	added and module maintained by Clinton Gormley
       <clint@traveljury.com>

COPYRIGHT
       Copyright (C) 2003 Nick Cleaton.	 All Rights Reserved.

       Copyright (C) 2007 Clinton Gormley.  All	Rights Reserved.

LICENSE
       This module is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.32.1			  2016-05-12		 HTML::StripScripts(3)
