HTML::DOM(3)	      User Contributed Perl Documentation	  HTML::DOM(3)

       HTML::DOM - A Perl implementation of the	HTML Document Object Model

       Version 0.058 (alpha)

       WARNING:	This module is still at	an experimental	stage.	The API	is
       subject to change without notice.

	 use HTML::DOM;

	 my $dom_tree =	new HTML::DOM; # empty tree

	 my $other_dom_tree = new HTML::DOM;


	 print $dom_tree->innerHTML, "\n";

	 my $text = $dom_tree->createTextNode('text');
	 $text->data;		   # get attribute
	 $text->data('new value'); # set attribute

       This module implements the HTML Document	Object Model by	extending the
       HTML::Tree modules.  The	HTML::DOM class	serves both as an HTML parser
       and as the document class.

       The following DOM modules are currently supported:

	 Feature	 Version (aka level)
	 -------	 -------------------
	 HTML		 2.0
	 Core		 2.0
	 Events		 2.0
	 UIEvents	 2.0
	 MouseEvents	 2.0
	 MutationEvents	 2.0
	 HTMLEvents	 2.0
	 StyleSheets	 2.0
	 CSS		 2.0 (partially)
	 CSS2		 2.0
	 Views		 2.0

       StyleSheets, CSS	and CSS2 are actually provided by CSS::DOM.  This list
       corresponds to CSS::DOM versions	0.02 to	0.14.

   Construction	and Parsing
       $tree = new HTML::DOM %options;
	   This	class method constructs	and returns a new HTML::DOM object.
	   The %options, which are all optional, are as	follows:

	   url The value that the "URL" method will return.  This value is
	       also used by the "domain" method.

	   referrer
	       The value that the "referrer" method will return.

	   response
	       An HTTP::Response object.  This will be used for information
	       needed for writing cookies.  It is expected to have a reference
	       to a request object (accessible via its "request" method--see
	       HTTP::Response).  Passing a parameter to the "cookie" method
	       will be a no-op without this.

	   weaken_response
	       If this is passed a true value, then the HTML::DOM object will
	       hold a weak reference to the response.

	   cookie_jar
	       An HTTP::Cookies object.  As with "response", if you omit this,
	       arguments passed to the "cookie" method will be ignored.

	   charset
	       The original character set of the document.  This does not
	       affect parsing via the "write" method (which always assumes
	       Unicode).  "parse_file" will use this, if specified, or
	       HTML::Encoding otherwise.  HTML::DOM::Form's "make_request"
	       method uses this to encode form data unless the form has a
	       valid 'accept-charset' attribute.

	   If "referrer" and "url" are omitted, they can be inferred from
	   "response".
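
	   A minimal sketch of passing options to the constructor, assuming
	   the character-set option is spelled "charset".  The URL and
	   character set are placeholder values, not defaults of the module:

```perl
use strict;
use warnings;
use HTML::DOM;

# Placeholder values chosen for illustration only.
my $tree = HTML::DOM->new(
    url     => 'http://www.example.com/docs/index.html',
    charset => 'iso-8859-1',
);

print $tree->URL, "\n";      # the value given as "url"
print $tree->charset, "\n";  # the value given as "charset"
```

	   Both accessors simply report what the constructor was given, so
	   they work before any HTML has been parsed.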

       $tree->elem_handler($elem_name => sub { ... })
	   If you call this method first, then,	when the DOM tree is in	the
	   process of being built (as a	result of a call to "write" or
	   "parse_file"), the subroutine will be called	after each $elem_name
	   element is added to the tree.  If you give '*' as the element name,
	   the subroutine will be called for each element that does not	have a
	   handler.  The subroutine's two arguments will be the	tree itself
	   and the element in question.	 The subroutine	can call the DOM
	   object's "write" method to insert HTML code into the	source after
	   the element.

	   Here	is a lame example (which does not take Content-Script-Type
	   headers or security into account):

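	     $tree->elem_handler(script => sub {
		 my($document,$elem) = @_;
		 return unless $elem->attr('type') eq 'application/x-perl';
		 # run the script's source, which can refer to $document
		 eval($elem->firstChild->data);
	     });

	     $tree->write(
		 '<p>The time is
		      <script type="application/x-perl">
			   $document->write(scalar localtime)
		      </script>
		  </p>'
	     );
	     $tree->close;

	     print $tree->documentElement->as_text, "\n";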

	   (Note: HTML::DOM::Element's "content_offset"	method might come in
	   handy for reporting line numbers for	script errors.)
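
	   The '*' form described above can be sketched as follows.  This is
	   an illustration only: the %seen hash and the sample markup are
	   assumptions, and the handler does nothing but count the elements it
	   is given.

```perl
use strict;
use warnings;
use HTML::DOM;

my $tree = HTML::DOM->new;

my %seen;   # lower-cased tag name => number of times the handler ran

# No handler is registered for a specific element, so '*' catches them all.
$tree->elem_handler('*' => sub {
    my ($document, $elem) = @_;
    $seen{ lc $elem->tagName }++;
});

$tree->write('<p>one</p><p>two</p><div>three</div>');
$tree->close;

printf "%s: %d\n", $_, $seen{$_} for sort keys %seen;
```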

       css_url_fetcher(	\&sub )
	   With	this method you	can provide a subroutine that fetches URLs
	   referenced by 'link'	tags.  Its sole	argument is the	URL, which is
	   made	absolute based on the HTML page's own base URL (it is assumed
	   that	this is	absolute).  It should return "undef" or	an empty list
	   on failure.	Upon success, it should	return just the	CSS code, if
	   it has been decoded (and is in Unicode), or,	if it has not been
	   decoded, the	CSS code followed by "decode =>	1".  See "STYLE	SHEET
	   ENCODING" in	CSS::DOM for details on	when you should	or should not
	   decode it.  (Note that HTML::DOM automatically provides an encoding
	   hint	based on the HTML document.)

	   HTML::DOM passes the	result of the url fetcher to CSS::DOM and
	   turns it into a style sheet object accessible via the link
	   element's "sheet" method.

       $tree->write(...) (DOM method)
	   This	parses the HTML	code passed to it, adding it to	the end	of the
	   document. It	assumes	that its input is a normal Perl	Unicode
	   string.  Like HTML::TreeBuilder's "parse" method, it can take a
	   coderef.

	   When it is called from an element handler (see "elem_handler",
	   above), the value passed to it will be inserted into	the HTML code
	   after the current element when the element handler returns.	(In
	   this	case a coderef won't do--maybe that will be added later.)

	   If the "close" method has been called, "write" will call "open"
	   before parsing the HTML code	passed to it.

       $tree->writeln(...) (DOM	method)
	   Just	like "write" except that it appends "\n" to its	argument and
	   does not work with code refs.  (Rather pointless, if you ask me.)

       $tree->close() (DOM method)
	   Call	this method to signal to the parser that the end of the	HTML
	   code	has been reached.  It will then	parse any residual HTML	that
	   happens to be buffered.  It also makes the next "write" call "open".

       $tree->open (DOM	method)
	   Deletes the HTML tree, resetting it so that it has just an <html>
	   element, and	a parser hungry	for HTML code.
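
	   The interplay of "write", "close" and "open" can be sketched like
	   this.  The markup is made up for the example; the point is that a
	   "write" issued after "close" starts a fresh document rather than
	   appending to the old one.

```perl
use strict;
use warnings;
use HTML::DOM;

my $tree = HTML::DOM->new;

# HTML may arrive in arbitrary chunks; close flushes whatever is buffered.
$tree->write('<title>First</title><p>Hello, ');
$tree->write('world</p>');
$tree->close;
my $first_title = $tree->title;

# close has been called, so this write reopens (and so resets) the document.
$tree->write('<title>Second</title><p>Goodbye</p>');
$tree->close;
my $second_title = $tree->title;

my @paragraphs = $tree->getElementsByTagName('p');
print "$first_title, $second_title, ", scalar @paragraphs, " paragraph(s)\n";
```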

       $tree->parse_file($file)
	   This method takes a file name or handle and parses the content,
	   (effectively) calling "close" afterwards.  In the former case (a
	   file	name), HTML::Encoding will be used to detect the encoding.  In
	   the latter (a file handle), you'll have to "binmode"	it yourself.
	   This	could be considered a bug.  If you have	a solution to this
	   (how	to make	HTML::Encoding detect an encoding from a file handle),
	   please let me know.

	   As of version 0.12, this method returns true	upon success, or
	   undef/empty list on failure.

       $tree->charset
	   This method returns the name of the character set that was passed
	   to "new", or, if that was not given,	that which "parse_file"	used.

	   It returns undef if "new" was not given a charset and if
	   "parse_file"	was not	used or	was passed a file handle.

	   You can also	set the	charset	by passing an argument,	in which case
	   the old value is returned.

   Other DOM Methods
       doctype
	   Returns nothing.

       implementation
	   Returns the HTML::DOM::Implementation object.

       documentElement
	   Returns the <html> element.

       createElement ( $tag )
       createTextNode (	$text )
       createComment ( $text )
       createAttribute ( $name )
	   Each	of these creates a node	of the appropriate type.

	   These two throw an exception.

       getElementsByTagName ( $name )
	   $name can be	the name of the	tag, or	'*', to	match all tag names.
	   This	returns	a node list object in scalar context, or a list	in
	   list	context.
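
	   A short sketch of the two calling contexts, using made-up markup.
	   The array-ref access on the last line relies on the node list
	   behaviour this page describes for methods that return a NodeList.

```perl
use strict;
use warnings;
use HTML::DOM;

my $tree = HTML::DOM->new;
$tree->write('<p>one</p><p>two</p>');
$tree->close;

my @paras = $tree->getElementsByTagName('p');   # list context: plain list
my $list  = $tree->getElementsByTagName('p');   # scalar context: node list
my @every = $tree->getElementsByTagName('*');   # '*' matches all tag names

print scalar(@paras), "\n";
print $list->length, "\n";
print $list->item(0)->firstChild->data, "\n";   # DOM-style access
print $list->[1]->firstChild->data, "\n";       # array-ref access
```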

       importNode ( $node, $deep )
	   Clones the $node, setting its "ownerDocument" attribute to the
	   document with which this method is called.  If $deep	is true, the
	   $node will be cloned	recursively.
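
	   A minimal sketch of copying a subtree from one document into
	   another.  The two documents and their markup are assumptions made
	   for the example; Scalar::Util's "refaddr" is used to compare object
	   identity without relying on overloaded operators.

```perl
use strict;
use warnings;
use HTML::DOM;
use Scalar::Util 'refaddr';

my $source = HTML::DOM->new;
$source->write('<div id=box><p>copy me</p></div>');
$source->close;

my $target = HTML::DOM->new;
$target->write('<p>existing</p>');
$target->close;

my $div  = ($source->getElementsByTagName('div'))[0];
my $copy = $target->importNode($div, 1);   # true $deep: clone the subtree too

# The clone belongs to $target but is not in its tree until it is inserted.
$target->body->appendChild($copy);

my @paras = $target->getElementsByTagName('p');
print scalar(@paras), " paragraphs in the target document\n";
```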

       alinkColor
       background
       bgColor
       fgColor
       linkColor
       vlinkColor
	   These six methods return (optionally set) the corresponding
	   attributes of the body element.  Note that most of the names do not
	   map directly to the names of the attributes.  "fgColor" refers to
	   the "text" attribute.  Those that end with 'linkColor' refer to the
	   attributes of the same name but without the 'Color' on the end.

       title
	   Returns (or optionally sets) the title of the page.

       referrer
	   Returns the page's referrer.

       domain
	   Returns the domain name portion of the document's URL.

       URL Returns the document's URL.

       body
	   Returns the body element, or the outermost frame set if the
	   document has frames.  You can set the body by passing an element as
	   an argument, in which case the old body element is returned.

       images
       applets
       links
       forms
       anchors
	   These five methods each return a list of the appropriate elements
	   in list context, or an HTML::DOM::Collection	object in scalar
	   context.  In	this latter case, the object will update automatically
	   when	the document is	modified.

	   In the case of "forms" you can access those by using	the HTML::DOM
	   object itself as a hash.  I.e., you can write "$doc->{f}" instead
	   of "$doc->forms->{f}".
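
	   The hash shortcut can be sketched as follows.  The form name
	   "search" and its markup are invented for the example.

```perl
use strict;
use warnings;
use HTML::DOM;
use Scalar::Util 'refaddr';

my $doc = HTML::DOM->new;
$doc->write('<form name=search action="/find"><input name=q value=perl></form>');
$doc->close;

my $via_hash  = $doc->{search};          # the document used as a hash
my $via_forms = $doc->forms->{search};   # the long way round

print $via_hash->tagName, "\n";
print "same object\n" if refaddr($via_hash) == refaddr($via_forms);
```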

       cookie
	   This returns a string containing the document's cookies (the format
	   may still change).  If you pass an argument, it will set a cookie
	   as well.  Both Netscape-style and RFC2965-style cookie headers are
	   supported.

	   These three do what their names imply.  The last two	will return a
	   list	in list	context, or a node list	object in scalar context.
	   Calling them	in list	context	is probably more efficient.

       createEvent ( $category )
	   Creates a new event object, believe it or not.

	   The $category is the	DOM event category, which determines what type
	   of event object will	be returned. The currently supported event
	   categories are MouseEvents, UIEvents, HTMLEvents and MutationEvents.

	   You can omit	the $category to create	an instance of the event base
	   class (not officially part of the DOM).

       defaultView
	   Returns the HTML::DOM::View object associated with the document.

	   There is no such object by default; you have to put one there
	   yourself.

	   Although it is supposed to be read-only according to	the DOM, you
	   can set this	attribute by passing an	argument to it.	 It is still
	   marked as read-only in %HTML::DOM::Interface.

	   If you do set it, it	is recommended that the	object be a subclass
	   of HTML::DOM::View.

	   This	attribute holds	a weak reference to the	object.

       styleSheets
	   Returns a CSS::DOM::StyleSheetList of the document's style sheets,
	   or a	simple list in list context.

       innerHTML
	   Serialises and returns the HTML document.  If you pass an argument,
	   it will set the contents of the document via	"open",	"write"	and
	   "close", returning a	serialisation of the old contents.

       location
       set_location_object (non-DOM)
	   "location" returns the location object, if you've put one there
	   with	"set_location_object". HTML::DOM doesn't actually implement
	   such	an object itself, but provides the appropriate magic to	make
	   "$doc->location($foo)" translate into "$doc->location->href($foo)".

	   BTW,	the location object had	better be true when used as a boolean,
	   or HTML::DOM	will think it doesn't exist.
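
	   Since HTML::DOM does not supply a location object, the sketch below
	   uses a hypothetical My::Location class with nothing but an "href"
	   accessor.  It is only meant to show the forwarding described above.

```perl
use strict;
use warnings;
use HTML::DOM;

{
    package My::Location;   # hypothetical class, not part of HTML::DOM

    sub new {
        my ($class, $href) = @_;
        return bless { href => $href }, $class;
    }

    # Combined getter/setter, which is all the forwarding needs.
    sub href {
        my $self = shift;
        $self->{href} = shift if @_;
        return $self->{href};
    }
}

my $doc = HTML::DOM->new;
my $loc = My::Location->new('http://www.example.com/');
$doc->set_location_object($loc);

# Passing an argument is forwarded to the location object's href method.
$doc->location('http://www.example.com/next');

print $doc->location->href, "\n";
```

	   A plain blessed hash is always true in boolean context, so it
	   satisfies the requirement that the location object be true.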

       lastModified
	   This method returns the document's modification date as gleaned
	   from	the response object passed to the constructor, in MM/DD/YYYY
	   HH:MM:SS format.

	   If there is no modification date, an	empty string is	returned, but
	   this	may change in the future.

   Other (Non-DOM) Methods
       (See also "EVENT	HANDLING", below.)

       $tree->base
	   Returns the base URL of the page; either from a <base href=...>
	   tag, from the response object passed to "new", or the URL passed to
	   "new".
       $tree->magic_forms
	   This is mainly for internal use.  It returns a boolean indicating
	   whether the parser needed to	associate formies with a form that did
	   not contain them.  This happens when	a closing </form> tag is
	   missing and the form	is closed implicitly, but a formie is
	   encountered later.

       You can use an HTML::DOM object as a hash ref to access its form
       elements by name.  So "$doc->{yayaya}" is short for
       "$doc->forms->{yayaya}".
EVENT HANDLING
       HTML::DOM supports both the DOM Level 2 event model and the HTML 4
       event model.

       Throughout this documentation, we make use of HTML 5's distinction
       between handlers	and listeners: An event	handler	is the result of an
       HTML attribute beginning with 'on', e.g. onsubmit.  These are also
       accessible via the DOM.	(We also use the word 'handler'	in other
       contexts, such as the 'default event handler'.)	Event listeners	are
       registered solely with the "addEventListener" method and	can be removed
       with "removeEventListener".

       HTML::DOM accepts as an event handler a coderef,	an object with a
       "call_with" method, or an object	with "&{}" overloading.	 If the
       "call_with" method is present, it is called with	the current event
       target as the first argument and	the event object as the	second.	 This
       is to allow for objects that wrap JavaScript functions (which must be
       called with the event target as the this	value).

       An event	listener is a coderef, an object with a	"handleEvent" method
       or an object with "&{}" overloading.  HTML::DOM does not	implement any
       classes that provide a "handleEvent" method, but	will support any
       object that has one.
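
       A listener object can be sketched as follows.  My::Listener and the
       "change_event" helper are hypothetical names invented for this example;
       the only requirement stated above is a "handleEvent" method.

```perl
use strict;
use warnings;
use HTML::DOM;

{
    package My::Listener;   # hypothetical class, not part of HTML::DOM

    sub new { return bless { seen => [] }, shift }

    # Called with the event object each time a matching event arrives.
    sub handleEvent {
        my ($self, $event) = @_;
        push @{ $self->{seen} }, $event->type;
    }
}

# Hypothetical helper: builds a fresh, cancelable 'change' event.
sub change_event {
    my $doc   = shift;
    my $event = $doc->createEvent('HTMLEvents');
    $event->initEvent('change', 1, 1);   # type, bubbles, cancelable
    return $event;
}

my $doc = HTML::DOM->new;
$doc->write('<p>text</p>');
$doc->close;
my $p = ($doc->getElementsByTagName('p'))[0];

my $listener = My::Listener->new;
$p->addEventListener(change => $listener, 0);
$p->dispatchEvent(change_event($doc));

# Once removed, the listener no longer hears about events.
$p->removeEventListener(change => $listener, 0);
$p->dispatchEvent(change_event($doc));

print scalar @{ $listener->{seen} }, " event(s) seen\n";
```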

       Listeners and handlers differ in	one important aspect.  A listener has
       to call "preventDefault"	on the event object to cancel the default
       action.	A handler simply returns a defined false value (except for
       mouseover events, which must return a true value to cancel the
       default).

   Default Actions
       Default actions that HTML::DOM is capable of handling internally	(such
       as triggering a DOMActivate event when an element is clicked, and
       triggering a form's submit event	when the submit	button is activated)
       are dealt with automatically.  You don't	have to	worry about those.
       For others, read	on....

       To specify the default actions associated with an event,	provide	a
       subroutine (in this case, it not	being part of the DOM, you can't use
       an object with a	"handleEvent" method) via the
       "default_event_handler_for" and "default_event_handler" methods.

       With the	former,	you can	specify	the default action to be taken when a
       particular type of event	occurs.	 The currently supported types are:

	 submit		when a form is submitted
	 link		called when a link is activated	(DOMActivate event)

       Pass the	type of	event as the first argument and	a code ref as the
       second argument.	 When the code ref is called, its sole argument	will
       be the event object.  For instance:

	 $dom_tree->default_event_handler_for( link => sub {
		my $event = shift;
		go_to( $event->target->href );
	 });
	 sub go_to { ... }

       "default_event_handler_for" with	just one argument returns the
       currently assigned coderef.  With two arguments it returns the old one
       after assigning the new one.

       Use "default_event_handler" (without the	"_for")	to specify a fallback
       subroutine that will be used for	events not in the list above, and for
       events in the list above	that do	not have subroutines assigned to them.
       Without any arguments it	will return the	currently assigned coderef.
       With an argument it will return the old one after assigning the new
       one.

   Dispatching Events
       HTML::DOM::Node's "dispatchEvent" method	triggers the appropriate event
       listeners, but does not call any	default	actions	associated with	it.
       The return value	is a boolean that indicates whether the	default	action
       should be taken.

       HTML::DOM::Node's "trigger_event" method will trigger the event for real. It
       will call "dispatchEvent" and, provided it returns true,	will call the
       default event handler.
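
       The difference between the two methods can be sketched as below.  The
       @log array and the "change_event" helper are invented for the example,
       and it assumes that "trigger_event" accepts a ready-made event object
       and that the fallback "default_event_handler" receives that object as
       its argument.

```perl
use strict;
use warnings;
use HTML::DOM;

# Hypothetical helper: builds a fresh, cancelable 'change' event.
sub change_event {
    my $doc   = shift;
    my $event = $doc->createEvent('HTMLEvents');
    $event->initEvent('change', 1, 1);   # type, bubbles, cancelable
    return $event;
}

my $doc = HTML::DOM->new;
$doc->write('<p>text</p>');
$doc->close;
my $p = ($doc->getElementsByTagName('p'))[0];

my @log;
$doc->default_event_handler(sub {
    my $event = shift;
    push @log, 'default action for ' . $event->type;
});
$p->addEventListener(change => sub { push @log, 'listener' }, 0);

# dispatchEvent runs listeners only and reports whether to proceed.
my $proceed = $p->dispatchEvent(change_event($doc));

# trigger_event runs listeners and then the default event handler.
$p->trigger_event(change_event($doc));

print "$_\n" for @log;
```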

   HTML	Event Attributes
       The "event_attr_handler"	can be used to assign a	coderef	that will turn
       text assigned to	an event attribute (e.g., "onclick") into an event
       handler.	The arguments to the routine will be (0) the element, (1) the
       name (aka type) of the event (without the initial 'on'),	(2) the	value
       of the attribute	and (3)	the offset within the source of	the
       attribute's value. (Actually, if	the value is within quotes, it is the
       offset of the first quotation mark.  Also, it will be "undef" for
       generated HTML [source code passed to the "write" method	by an element
       handler].)  As with "default_event_handler", you	can replace an
       existing	handler	with a new one,	in which case the old handler is
       returned. If you	call this method without arguments, it returns the
       current handler.	Here is	an example of its use, that assumes that
       handlers	are Perl code:

	 $dom_tree->event_attr_handler(sub {
		 my($elem, $name, $code, $offset) = @_;
		 my $sub = eval "sub { $code }";
		 return sub {
			 local *_ = \$elem;  # expose the element as $_
			 return $sub->(@_);  # run the compiled attribute code
		 };
	 });

       The event attribute handler will	be called whenever an element
       attribute whose name begins with	'on' (case-tolerant) is	modified. (For
       efficiency's sake, I may	change it to call the event attribute handler
       only when the event is triggered, so it is not called unnecessarily.)

   When	an Event Handler Dies
       Use "error_handler" to assign a coderef that will be called whenever an
       event listener (or handler) raises an error. The	error will be
       contained in $@.
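
       A small sketch of catching a listener's error.  The @errors array and
       the failing listener are made up for the example; the handler reads
       the error from $@ as described above.

```perl
use strict;
use warnings;
use HTML::DOM;

my $doc = HTML::DOM->new;
$doc->write('<p>text</p>');
$doc->close;
my $p = ($doc->getElementsByTagName('p'))[0];

my @errors;
$doc->error_handler(sub { push @errors, $@ });

# A listener that always fails.
$p->addEventListener(change => sub { die "listener failed\n" }, 0);

my $event = $doc->createEvent('HTMLEvents');
$event->initEvent('change', 1, 1);   # type, bubbles, cancelable
$p->dispatchEvent($event);

print "caught: $_" for @errors;
```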

   Other Event-Related Methods
       $tree->event_parent( $new_val )
	   This	method lets you	provide	an object that is added	to the top of
	   the event dispatch chain. E.g., if you want the view	object (the
	   value of "defaultView", aka the window) to have event handlers
	   called before the document in the capture phase, and	after it in
	   the bubbling	phase, you can set it like this	(see also
	   "defaultView", above):

	     $tree->event_parent( $tree->defaultView );

	   This	holds a	weak reference.

       $tree->event_listeners_enabled( $new_val	)
	   This	attribute, which is true by default, can be used to disable
	   event handlers and listeners. (Default event	handlers [see above]
	   still run, though.)
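
	   Switching listeners off and on again can be sketched like this.
	   The counter and the "change_event" helper are invented for the
	   example.

```perl
use strict;
use warnings;
use HTML::DOM;

# Hypothetical helper: builds a fresh, cancelable 'change' event.
sub change_event {
    my $doc   = shift;
    my $event = $doc->createEvent('HTMLEvents');
    $event->initEvent('change', 1, 1);   # type, bubbles, cancelable
    return $event;
}

my $doc = HTML::DOM->new;
$doc->write('<p>text</p>');
$doc->close;
my $p = ($doc->getElementsByTagName('p'))[0];

my $calls = 0;
$p->addEventListener(change => sub { $calls++ }, 0);

$doc->event_listeners_enabled(0);        # listeners are now skipped
$p->dispatchEvent(change_event($doc));
my $while_disabled = $calls;

$doc->event_listeners_enabled(1);        # back to the default
$p->dispatchEvent(change_event($doc));

print "disabled: $while_disabled, enabled: $calls\n";
```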

CLASSES AND DOM INTERFACES
       Here are the inheritance hierarchy of HTML::DOM's various classes and
       the DOM interfaces those	classes	implement. The classes in the left
       column all begin	with 'HTML::DOM::', which is omitted for brevity,
       except for HTML::DOM itself, which is listed with its full name.	Items
       in brackets have	not yet	been implemented. (See also
       HTML::DOM::Interface for	a machine-readable list	of standard methods.)

	 Class Inheritance Hierarchy		 Interfaces
	 ---------------------------		 ----------

	 Exception				 DOMException, EventException
	 Implementation				 DOMImplementation,
	 Node					 Node, EventTarget
	     DocumentFragment			 DocumentFragment
	     HTML::DOM				 Document, HTMLDocument,
						   DocumentEvent, DocumentView,
						   DocumentStyle, [DocumentCSS]
	     CharacterData			 CharacterData
		 Text				 Text
		 Comment			 Comment
	     Element				 Element, HTMLElement,
		 Element::HTML			 HTMLHtmlElement
		 Element::Head			 HTMLHeadElement
		 Element::Link			 HTMLLinkElement, LinkStyle
		 Element::Title			 HTMLTitleElement
		 Element::Meta			 HTMLMetaElement
		 Element::Base			 HTMLBaseElement
		 Element::IsIndex		 HTMLIsIndexElement
		 Element::Style			 HTMLStyleElement, LinkStyle
		 Element::Body			 HTMLBodyElement
		 Element::Form			 HTMLFormElement
		 Element::Select		 HTMLSelectElement
		 Element::OptGroup		 HTMLOptGroupElement
		 Element::Option		 HTMLOptionElement
		 Element::Input			 HTMLInputElement
		 Element::TextArea		 HTMLTextAreaElement
		 Element::Button		 HTMLButtonElement
		 Element::Label			 HTMLLabelElement
		 Element::FieldSet		 HTMLFieldSetElement
		 Element::Legend		 HTMLLegendElement
		 Element::UL			 HTMLUListElement
		 Element::OL			 HTMLOListElement
		 Element::DL			 HTMLDListElement
		 Element::Dir			 HTMLDirectoryElement
		 Element::Menu			 HTMLMenuElement
		 Element::LI			 HTMLLIElement
		 Element::Div			 HTMLDivElement
		 Element::P			 HTMLParagraphElement
		 Element::Heading		 HTMLHeadingElement
		 Element::Quote			 HTMLQuoteElement
		 Element::Pre			 HTMLPreElement
		 Element::Br			 HTMLBRElement
		 Element::BaseFont		 HTMLBaseFontElement
		 Element::Font			 HTMLFontElement
		 Element::HR			 HTMLHRElement
		 Element::Mod			 HTMLModElement
		 Element::A			 HTMLAnchorElement
		 Element::Img			 HTMLImageElement
		 Element::Object		 HTMLObjectElement
		 Element::Param			 HTMLParamElement
		 Element::Applet		 HTMLAppletElement
		 Element::Map			 HTMLMapElement
		 Element::Area			 HTMLAreaElement
		 Element::Script		 HTMLScriptElement
		 Element::Table			 HTMLTableElement
		 Element::Caption		 HTMLTableCaptionElement
		 Element::TableColumn		 HTMLTableColElement
		 Element::TableSection		 HTMLTableSectionElement
		 Element::TR			 HTMLTableRowElement
		 Element::TableCell		 HTMLTableCellElement
		 Element::FrameSet		 HTMLFrameSetElement
		 Element::Frame			 HTMLFrameElement
		 Element::IFrame		 HTMLIFrameElement
	 NodeList				 NodeList
	 NodeList::Magic			 NodeList
	 NamedNodeMap				 NamedNodeMap
	 Attr					 Node, Attr, EventTarget
	 Collection				 HTMLCollection
	 Event					 Event
	     Event::UI				 UIEvent
		 Event::Mouse			 MouseEvent
	     Event::Mutation			 MutationEvent
	 View					 AbstractView, ViewCSS

       The EventListener interface is not implemented by HTML::DOM, but	is
       supported.  See "EVENT HANDLING", above.

       Not listed above	is HTML::DOM::EventTarget, which is a base class both
       for HTML::DOM::Node and HTML::DOM::Attr.	The format I'm using above
       doesn't allow for multiple inheritance, so I probably need to redo it.

       HTML::DOM::Node also implements the HTML::Element interface, but	with a
       few differences.	In particular:

       o   Any methods that expect text	nodes to be just strings are
	   unreliable. See the note under "objectify_text" in HTML::Element.

       o   HTML::Element's tree-manipulation methods don't trigger mutation
	   events.

       o   HTML::Element's "delete" method is not necessary, because HTML::DOM
	   uses	weak references	(for 'upward' references in the	object tree).

       o   Objects' attributes are accessed via	methods	of the same name. When
	   the method is invoked, the current value is returned. If an
	   argument is supplied, the attribute is set (unless it is read-only)
	   and its old value returned.

       o   Where the DOM spec. says to use null, undef or an empty list is
	   used.

       o   Instead of UTF-16 strings, HTML::DOM	uses Perl's Unicode strings
	   (which happen to be stored as UTF-8 internally). The	only
	   significant difference this makes is	to "length", "substringData"
	   and other methods of	Text and Comment nodes.	These methods behave
	   in a	Perlish	way (i.e., the offsets and lengths are specified in
	   Unicode characters, not in UTF-16 code units). The alternate methods
	   "length16", "substringData16" et al.	use UTF-16 for offsets and are
	   standards-compliant in that regard (but the string returned by
	   "substringData16" is	still a	regular	Perl string).

       o   Each	method that returns a NodeList will return a NodeList object
	   in scalar context, or a simple list in list context.	You can	use
	   the object as an array ref in addition to calling its "item"	and
	   "length" methods.

       o   In cases where a method is supposed to return something
	   implementing	the DOMTimeStamp interface, a simple Perl scalar is
	   returned, containing the time as returned by Perl's built-in "time"
	   function.
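
       The note on string offsets above can be illustrated with a character
       outside the Basic Multilingual Plane, which occupies two UTF-16 code
       units but counts as one Perl character.  The sample string is an
       assumption made for the example.

```perl
use strict;
use warnings;
use HTML::DOM;

my $doc = HTML::DOM->new;

# U+1D11E (musical G clef) needs a surrogate pair in UTF-16.
my $text = $doc->createTextNode("\x{1D11E}abc");

my $chars = $text->length;      # counted in Unicode characters
my $units = $text->length16;    # counted in UTF-16 code units

# Offsets are in characters here, so 'abc' starts at offset 1.
my $tail = $text->substringData(1, 3);

print "$chars characters, $units UTF-16 units, tail '$tail'\n";
```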

       Much of the code	was stolen from	HTML::Tree.  In	fact, HTML::DOM	used
       to extend HTML::Tree, but the two were merged to	allow a	whole pile of
       hacks to	be removed.

       perl 5.8.3 or later

       Exporter	5.57 or	later

       LWP 5.13	or later

       CSS::DOM	0.06 or	later

       Scalar::Util 1.14 or later

       HTML::Tagset 3.02 or later

       HTML::Parser 3.46 or later

       HTML::Encoding is required if a file name is passed to "parse_file".

       Tie::RefHash::Weak 0.08 or higher, if you are using perl	5.8.x

       -   Element handlers are	not currently called during assignments	to

       -   HTML::DOM::View's "getComputedStyle"	does not currently return a
	   read-only style object; nor are lengths converted to	absolute
	   values.  Currently there is no way to specify the medium. Any style
	   rules that apply to specific	media are ignored.

       To report bugs, please e-mail the author.

       Copyright (C) 2007-16 Father Chrysostomos

	 $text = new HTML::DOM ->createTextNode('sprout');
	 print $text->data, "\n";

       This program is free software; you may redistribute it and/or modify it
       under the same terms as perl.

       Each of the classes listed above under "CLASSES AND DOM INTERFACES"

       HTML::DOM::Exception, HTML::DOM::Node, HTML::DOM::Event,

       HTML::Tree, HTML::TreeBuilder, HTML::Element, HTML::Parser, LWP,
       WWW::Mechanize, HTTP::Cookies, WWW::Mechanize::Plugin::JavaScript,
       HTML::Form, HTML::Encoding

       The DOM Level 1 specification

       The DOM Level 2 Core specification

       The DOM Level 2 Events specification



perl v5.32.0			  2018-02-02			  HTML::DOM(3)

