SGML::Parser::OpenSP(3)  User Contributed Perl Documentation  SGML::Parser::OpenSP(3)

NAME
       SGML::Parser::OpenSP - Parse SGML documents using OpenSP

SYNOPSIS
	 use SGML::Parser::OpenSP;

	 my $p = SGML::Parser::OpenSP->new;
	 my $h = ExampleHandler->new;

	 $p->catalogs(qw(xhtml.soc));
	 $p->warnings(qw(xml valid));
	 $p->handler($h);

	 $p->parse("example.xhtml");

DESCRIPTION
       This module provides an interface to the OpenSP SGML parser. OpenSP and
       this module are event based: as the parser recognizes parts of the
       document (say, the start or end of an element), any handlers registered
       for that type of event are called with suitable parameters.

COMMON METHODS
       new()
	   Returns a new SGML::Parser::OpenSP object. Takes no arguments.

       parse($file)
	   Parses the file passed as an	argument. Note that this must be a
	   filename and	not a filehandle. See "PROCESSING FILES" below for
	   details.

       parse_string($data)
	   Parses the data passed as an	argument. See "PROCESSING FILES" below
	   for details.

       halt()
	   Halts processing before parsing the entire document.	Takes no
	   arguments.

       split_message()
	   Splits OpenSP's error messages into their component parts.  See
	   "POST-PROCESSING ERROR MESSAGES" below for details.

       get_location()
	   See "POSITIONING INFORMATION" below for details.
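
       A minimal sketch of how "halt" might be used from an event handler:
       the handler below stops the parse as soon as the first "title" element
       has ended. The package name TitleOnly, the title_of() helper, the
       "parser" hash key and the assumption that "data" events carry their
       text in a "Data" entry are illustrative and not taken from this page.

```perl
use strict;
use warnings;

package TitleOnly;

# Hypothetical handler: keeps a reference to the parser so it can call halt().
sub new {
    my ($class, $parser) = @_;
    return bless { parser => $parser, in_title => 0, title => '' }, $class;
}

sub start_element {
    my ($self, $elem) = @_;
    $self->{in_title} = 1 if lc($elem->{Name}) eq 'title';
}

# Assumption: "Data" holds the character data of the event.
sub data {
    my ($self, $event) = @_;
    $self->{title} .= $event->{Data} if $self->{in_title};
}

sub end_element {
    my ($self, $elem) = @_;
    return unless lc($elem->{Name}) eq 'title';
    $self->{in_title} = 0;
    $self->{parser}->halt;    # nothing after the title is needed
}

package main;

# Hypothetical driver: returns the text of the first title element.
sub title_of {
    my ($file) = @_;
    require SGML::Parser::OpenSP;    # loaded only when actually parsing
    my $p = SGML::Parser::OpenSP->new;
    my $h = TitleOnly->new($p);
    $p->handler($h);
    $p->parse($file);
    return $h->{title};
}

if (@ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    print title_of($ARGV[0]), "\n";
}
```

       The module is loaded with "require" inside title_of() only so that the
       handler class can be tried without OpenSP installed; a plain "use
       SGML::Parser::OpenSP" at the top of the script works as well.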

CONFIGURATION
   BOOLEAN OPTIONS
       $p->handler([$handler])
	   Report events to the	blessed	reference $handler.

   ERROR MESSAGE FORMAT
       $p->show_open_entities([$bool])
	   Describe open entities in error messages. Error messages always
	   include the position	of the most recently opened external entity.
	   The default is false.

       $p->show_open_elements([$bool])
	   Show	the generic identifiers	of open	elements in error messages.
	   The default is false.

       $p->show_error_numbers([$bool])
	   Show	message	numbers	in error messages.

   GENERATED EVENTS
       $p->output_comment_decls([$bool])
	   Generate "comment_decl" events. The default is false.

       $p->output_marked_sections([$bool])
	   Generate marked section events ("marked_section_start",
	   "marked_section_end", "ignored_chars"). The default is false.

       $p->output_general_entities([$bool])
	   Generate "general_entity" events. The default is false.

   IO SETTINGS
       $p->map_catalog_document([$bool])
	   "parse" arguments specify catalog files rather than the document
	   entity.  The	document entity	is specified by	the first DOCUMENT
	   entry in the	catalog	files. The default is false.

       $p->restrict_file_reading([$bool])
	   Restrict file reading to the	specified directories (see the
	   "search_dirs" method	and the	"SGML_SEARCH_PATH" environment
	   variable). You should turn this option on and configure the search
	   paths accordingly if	you intend to process untrusted	resources. The
	   default is false.

       $p->catalogs([@catalogs])
	   Map public identifiers and entity names to system identifiers using
	   the specified catalog entry files. Multiple catalogs	are allowed.
	   If there is a catalog entry file called "catalog" in	the same place
	   as the document entity, it will be searched for immediately after
	   those specified.

       $p->search_dirs([@search_dirs])
           Search the specified directories for files specified in system
           identifiers. Multiple directories are allowed. See the description
           of the osfile storage manager in the OpenSP documentation for more
           information about file searching.

       $p->pass_file_descriptor([$bool])
	   Instruct "parse_string" to pass the input data down to the guts of
	   OpenSP using	the "OSFD" storage manager (if true) or	the "OSFILE"
	   storage manager (if false). This amounts to the difference between
	   passing a file descriptor and a (temporary) file name.

	   The default is true except on platforms, such as Win32, which are
	   known to not	support	passing	file descriptors around	in this
	   manner. On platforms	which support it you can call this method with
	   a false parameter to	force use of temporary file names instead.

	   In general, this will do the	right thing on its own so it's best to
	   consider this an internal method. If	your platform is such that you
	   have	to force use of	the OSFILE storage manager, please report it
	   as a	bug and	include	the values of $^O, $Config{archname}, and a
	   description of the platform (e.g. "Windows Vista Service Pack 42").

   PROCESSING OPTIONS
       $p->include_params([@include_params])
	   For each name in @include_params pretend that

	     <!ENTITY %	name "INCLUDE">

	   occurs at the start of the document type declaration	subset in the
	   SGML	document entity. Since repeated	definitions of an entity are
	   ignored, this definition will take precedence over any other
	   definitions of this entity in the document type declaration.
	   Multiple names are allowed.	If the SGML declaration	replaces the
	   reserved name INCLUDE then the new reserved name will be the
	   replacement text of the entity. Typically the document type
	   declaration will contain

	     <!ENTITY %	name "IGNORE">

	   and will use	%name; in the status keyword specification of a	marked
	   section declaration.	In this	case the effect	of the option will be
	   to cause the	marked section not to be ignored.

       $p->active_links([@active_links])
	   ???

   ENABLING WARNINGS
       Additional warnings can be enabled using

	 $p->warnings([@warnings])

       The following values can	be used	to enable warnings:

       xml Warn	about constructs that are not allowed by XML.

       mixed
	   Warn	about mixed content models that	do not allow #pcdata anywhere.

       sgmldecl
	   Warn	about various dubious constructions in the SGML	declaration.

       should
	   Warn	about various recommendations made in ISO 8879 that the
	   document does not comply with. (Recommendations are expressed with
	   ``should'', as distinct from	requirements which are usually
	   expressed with ``shall''.)

       default
	   Warn	about defaulted	references.

       duplicate
	   Warn	about duplicate	entity declarations.

       undefined
	   Warn	about undefined	elements: elements used	in the DTD but not
	   defined.

       unclosed
	   Warn	about unclosed start and end-tags.

       empty
	   Warn	about empty start and end-tags.

       net Warn	about net-enabling start-tags and null end-tags.

       min-tag
           Warn about minimized start and end-tags. Equivalent to the
           combination of the unclosed, empty and net warnings.

       unused-map
	   Warn	about unused short reference maps: maps	that are declared with
	   a short reference mapping declaration but never used	in a short
	   reference use declaration in	the DTD.

       unused-param
	   Warn	about parameter	entities that are defined but not used in a
	   DTD.	 Unused	internal parameter entities whose text is "INCLUDE" or
	   "IGNORE" won't get the warning.

       notation-sysid
	   Warn	about notations	for which no system identifier could be
	   generated.

       all Warn	about conditions that should usually be	avoided	(in the
	   opinion of the author). Equivalent to: "mixed", "should",
	   "default", "undefined", "sgmldecl", "unused-map", "unused-param",
	   "empty" and "unclosed".

   DISABLING WARNINGS
       A warning can be	disabled by using its name prefixed with "no-".	 Thus
       calling warnings(qw(all no-duplicate)) will enable all warnings except
       those about duplicate entity declarations.

       The following values for	"warnings()" disable errors:

       no-idref
	   Do not give an error	for an ID reference value which	no element has
	   as its ID. The effect will be as if each attribute declared as an
	   ID reference	value had been declared	as a name.

       no-significant
	   Do not give an error	when a character that is not a significant
	   character in	the reference concrete syntax occurs in	a literal in
	   the SGML declaration. This may be useful in conjunction with
	   certain buggy test suites.

       no-valid
	   Do not require the document to be type-valid. This has the effect
	   of changing the SGML	declaration to specify "VALIDITY NOASSERT" and
	   "IMPLYDEF ATTLIST YES ELEMENT YES". An option of "valid" has	the
	   effect of changing the SGML declaration to specify "VALIDITY	TYPE"
           and "IMPLYDEF ATTLIST NO ELEMENT NO". If neither "valid" nor
           "no-valid" is specified, then the "VALIDITY" and "IMPLYDEF"
           specified in the SGML declaration will be used.
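
       Since enabling and disabling use the same "warnings" method, the
       argument list can be assembled programmatically. The helper below is a
       hypothetical convenience, not part of this module; it only applies the
       "no-" prefix rule described above.

```perl
use strict;
use warnings;

# Hypothetical helper: build the argument list for $p->warnings() from
# names to enable and names to disable; the latter get the "no-" prefix.
sub warning_names {
    my (%arg) = @_;
    my @enable  = @{ $arg{enable}  || [] };
    my @disable = map { "no-$_" } @{ $arg{disable} || [] };
    return (@enable, @disable);
}

my @names = warning_names(
    enable  => [qw(all valid)],
    disable => [qw(duplicate idref)],
);
print "@names\n";    # prints "all valid no-duplicate no-idref"
```

       The resulting list would then be passed on as $p->warnings(@names).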

   XML WARNINGS
       The following warnings are turned on for	the "xml" warning described
       above:

       inclusion
	   Warn	about inclusions in element type declarations.

       exclusion
	   Warn	about exclusions in element type declarations.

       rcdata-content
	   Warn	about RCDATA declared content in element type declarations.

       cdata-content
	   Warn	about CDATA declared content in	element	type declarations.

       ps-comment
	   Warn	about comments in parameter separators.

       attlist-group-decl
	   Warn	about name groups in attribute declarations.

       element-group-decl
	   Warn	about name groups in element type declarations.

       pi-entity
	   Warn	about PI entities.

       internal-sdata-entity
	   Warn	about internal SDATA entities.

       internal-cdata-entity
	   Warn	about internal CDATA entities.

       external-sdata-entity
	   Warn	about external SDATA entities.

       external-cdata-entity
	   Warn	about external CDATA entities.

       bracket-entity
	   Warn	about bracketed	text entities.

       data-atts
	   Warn	about attribute	definition list	declarations for notations.

       missing-system-id
	   Warn	about external identifiers without system identifiers.

       conref
	   Warn	about content reference	attributes.

       current
	   Warn	about current attributes.

       nutoken-decl-value
	   Warn	about attributes with a	declared value of NUTOKEN or NUTOKENS.

       number-decl-value
	   Warn	about attributes with a	declared value of NUMBER or NUMBERS.

       name-decl-value
	   Warn	about attributes with a	declared value of NAME or NAMES.

       named-char-ref
	   Warn	about named character references.

       refc
           Warn about omitted refc delimiters.

       temp-ms
	   Warn	about TEMP marked sections.

       rcdata-ms
	   Warn	about RCDATA marked sections.

       instance-include-ms
	   Warn	about INCLUDE marked sections in the document instance.

       instance-ignore-ms
	   Warn	about IGNORE marked sections in	the document instance.

       and-group
	   Warn	about AND connectors in	model groups.

       rank
	   Warn	about ranked elements.

       empty-comment-decl
	   Warn	about empty comment declarations.

       att-value-not-literal
	   Warn	about attribute	values which are not literals.

       missing-att-name
           Warn about omitted attribute names in start tags.

       comment-decl-s
	   Warn	about spaces before the	MDC in comment declarations.

       comment-decl-multiple
	   Warn	about comment declarations containing multiple comments.

       missing-status-keyword
	   Warn	about marked sections without a	status keyword.

       multiple-status-keyword
	   Warn	about marked sections with multiple status keywords.

       instance-param-entity
	   Warn	about parameter	entities in the	document instance.

       min-param
	   Warn	about minimization parameters in element type declarations.

       mixed-content-xml
	   Warn	about cases of mixed content which are not allowed in XML.

       name-group-not-or
	   Warn	about name groups with a connector different from OR.

       pi-missing-name
	   Warn	about processing instructions which don't start	with a name.

       instance-status-keyword-s
	   Warn	about spaces between DSO and status keyword in marked
	   sections.

       external-data-entity-ref
	   Warn	about references to external data entities in the content.

       att-value-external-entity-ref
	   Warn	about references to external data entities in attribute
	   values.

       data-delim
           Warn about occurrences of `<' and `&' as data.

       explicit-sgml-decl
	   Warn	about an explicit SGML declaration.

       internal-subset-ms
	   Warn	about marked sections in the internal subset.

       default-entity
	   Warn	about a	default	entity declaration.

       non-sgml-char-ref
	   Warn	about numeric character	references to non-SGML characters.

       internal-subset-ps-param-entity
	   Warn	about parameter	entity references in parameter separators in
	   the internal	subset.

       internal-subset-ts-param-entity
	   Warn	about parameter	entity references in token separators in the
	   internal subset.

       internal-subset-literal-param-entity
	   Warn	about parameter	entity references in parameter literals	in the
	   internal subset.

PROCESSING FILES
       In order to start processing a document and receive events, the
       "parse" method must be called. It takes one argument specifying the
       path to a file (not a file handle). You must set an event handler using
       the "handler" method prior to calling this method. The return value of
       "parse" is currently undefined.

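       The calling sequence for "parse" and "parse_string" can be sketched as
       follows. TextCollector and collect_text() are hypothetical names, and
       the "Data" entry of the data event is an assumption about the event
       structure. Because the return value of "parse" is undefined, results
       are read back from the handler object.

```perl
use strict;
use warnings;

package TextCollector;

sub new { bless { text => '' }, shift }

# Assumption: "Data" holds the character data of the event.
sub data {
    my ($self, $event) = @_;
    $self->{text} .= $event->{Data};
}

package main;

# Hypothetical driver: parse either a named file or an in-memory string.
sub collect_text {
    my (%arg) = @_;
    require SGML::Parser::OpenSP;    # loaded only when actually parsing
    my $p = SGML::Parser::OpenSP->new;
    my $h = TextCollector->new;
    $p->handler($h);                 # must be set before parsing
    if (defined $arg{file}) {
        $p->parse($arg{file});       # a file name, not a file handle
    }
    else {
        $p->parse_string($arg{string});
    }
    return $h->{text};               # results come from the handler
}

if (@ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    print collect_text(file => $ARGV[0]), "\n";
}
```

       Calling collect_text(string => $data) instead would route the same
       handler through "parse_string".
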
EVENT HANDLERS
       In order	to receive data	from the parser	you need to write an event
       handler.	For example,

	 package ExampleHandler;

	 sub new { bless {}, shift }

	 sub start_element
	 {
	     my	($self,	$elem) = @_;
	     printf "  * %s\n",	$elem->{Name};
	 }

       This handler would print all the element names as they are found in the
       document. For a typical XHTML document this might result in something
       like

	 * html
	 * head
	 * title
	 * body
	 * p
	 * ...

       The events closely match those in the generic interface to OpenSP; see
       <http://openjade.sf.net/doc/generic.htm> for more information.

       Event names have been changed to lowercase with underscores separating
       words, and property names are capitalized. Arrays are represented
       as Perl array references. "Position" information is not passed to the
       handler but is made available through the "get_location" method, which
       can be called from event handlers. Some redundant information has also
       been stripped, and the generic identifier of an element is stored in
       the "Name" hash entry.

       For example, for	an EndElementEvent the "end_element" handler gets
       called with a hash reference

	 {
	   Name	=> 'gi'
	 }

       The following events are	defined:

	 * appinfo
	 * processing_instruction
	 * start_element
	 * end_element
	 * data
	 * sdata
	 * external_data_entity_ref
	 * subdoc_entity_ref
	 * start_dtd
	 * end_dtd
	 * end_prolog
	 * general_entity	# set $p->output_general_entities(1)
	 * comment_decl		# set $p->output_comment_decls(1)
	 * marked_section_start	# set $p->output_marked_sections(1)
	 * marked_section_end	# set $p->output_marked_sections(1)
	 * ignored_chars	# set $p->output_marked_sections(1)
	 * error
	 * open_entity_change

       If the documentation of the generic interface to	OpenSP states that
       certain data is not valid, it will not be available through this
       interface (i.e.,	the respective key does	not exist in the hash ref).
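
       A handler may implement any subset of these events. As a slightly
       larger sketch than the one above, the following handler combines
       "start_element" and "end_element" to record an indented outline of the
       document. OutlineHandler and outline_of() are hypothetical names; only
       the "Name" entry described above is read from the events.

```perl
use strict;
use warnings;

package OutlineHandler;

sub new { bless { depth => 0, lines => [] }, shift }

# Record the element name, indented by the current nesting depth.
sub start_element {
    my ($self, $elem) = @_;
    push @{ $self->{lines} }, ('  ' x $self->{depth}) . $elem->{Name};
    $self->{depth}++;
}

# Leave one nesting level; never go below zero.
sub end_element {
    my ($self, $elem) = @_;
    $self->{depth}-- if $self->{depth} > 0;
}

package main;

# Hypothetical driver: returns the outline lines for a file.
sub outline_of {
    my ($file) = @_;
    require SGML::Parser::OpenSP;    # loaded only when actually parsing
    my $p = SGML::Parser::OpenSP->new;
    my $h = OutlineHandler->new;
    $p->handler($h);
    $p->parse($file);
    return @{ $h->{lines} };
}

if (@ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    print "$_\n" for outline_of($ARGV[0]);
}
```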

POSITIONING INFORMATION
       Event handlers can call the "get_location" method on the parser object
       to retrieve positioning information. The method returns a hash
       reference with the following properties:

	 LineNumber   => ..., #	line number
	 ColumnNumber => ..., #	column number
	 ByteOffset   => ..., #	number of preceding bytes
	 EntityOffset => ..., #	number of preceding bit	combinations
	 EntityName   => ..., #	name of	the external entity
	 FileName     => ..., #	name of	the file

       These can be "undef" or an empty	string.
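
       A sketch of a handler that records where each element starts is shown
       below. It substitutes a placeholder for values that are "undef" or
       empty. LocatingHandler, the "parser" hash key and locate_elements()
       are hypothetical names chosen for this example.

```perl
use strict;
use warnings;

package LocatingHandler;

sub new {
    my ($class, $parser) = @_;
    return bless { parser => $parser, seen => [] }, $class;
}

# Location values may be undef or empty, so substitute a placeholder.
sub _or_unknown {
    my ($value) = @_;
    return (defined $value && length $value) ? $value : '?';
}

sub start_element {
    my ($self, $elem) = @_;
    my $loc = $self->{parser}->get_location;
    push @{ $self->{seen} }, sprintf('%s:%s:%s <%s>',
        _or_unknown($loc->{FileName}),
        _or_unknown($loc->{LineNumber}),
        _or_unknown($loc->{ColumnNumber}),
        $elem->{Name});
}

package main;

# Hypothetical driver: returns one "file:line:column <name>" entry per element.
sub locate_elements {
    my ($file) = @_;
    require SGML::Parser::OpenSP;    # loaded only when actually parsing
    my $p = SGML::Parser::OpenSP->new;
    my $h = LocatingHandler->new($p);
    $p->handler($h);
    $p->parse($file);
    return @{ $h->{seen} };
}

if (@ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    print "$_\n" for locate_elements($ARGV[0]);
}
```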

POST-PROCESSING	ERROR MESSAGES
       OpenSP returns error messages in the form of a string rather than as
       individual components of the message, such as line numbers or message
       text. The "split_message" method on the parser object can be used to
       post-process these error message strings as reliably as possible. It
       can be used, for example, from an error event handler if the parser
       object is accessible, like

	 sub error
	 {
	   my $self = shift;
	   my $erro = shift;
	   my $mess = $self->{parser}->split_message($erro);
	 }

       See the documentation of	"split_message"	in the
       SGML::Parser::OpenSP::Tools documentation.
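
       A fuller sketch of the same idea collects the split messages and dumps
       them after the parse, which is a convenient way to inspect the
       structure "split_message" returns without assuming its keys.
       ErrorCollector and errors_in() are hypothetical names.

```perl
use strict;
use warnings;

package ErrorCollector;

sub new {
    my ($class, $parser) = @_;
    return bless { parser => $parser, errors => [] }, $class;
}

# Split each message while the error event is being handled and keep it.
sub error {
    my ($self, $erro) = @_;
    my $mess = $self->{parser}->split_message($erro);
    push @{ $self->{errors} }, $mess;
}

package main;

use Data::Dumper;

# Hypothetical driver: returns a reference to the list of split messages.
sub errors_in {
    my ($file) = @_;
    require SGML::Parser::OpenSP;    # loaded only when actually parsing
    my $p = SGML::Parser::OpenSP->new;
    my $h = ErrorCollector->new($p);
    $p->show_error_numbers(1);
    $p->handler($h);
    $p->parse($file);
    return $h->{errors};
}

if (@ARGV) {
    local $Data::Dumper::Sortkeys = 1;
    print Dumper(errors_in($ARGV[0]));
}
```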

UNICODE	SUPPORT
       All strings returned from event handlers and helper routines are UTF-8
       encoded with the UTF-8 flag turned on. Helper functions like
       "split_message" expect (but don't check) that string arguments are
       UTF-8 encoded and have the UTF-8 flag turned on. The behavior of helper
       functions is undefined when you pass unexpected input, so such input
       should be avoided.
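
       One consequence is that character data has to be encoded again before
       it is written to a filehandle that has no encoding layer. The sketch
       below does this with the core Encode module. Utf8Printer is a
       hypothetical name, and the "Data" entry of the data event is an
       assumption about the event structure.

```perl
use strict;
use warnings;

package Utf8Printer;

use Encode qw(encode);

# $fh is any output filehandle without an encoding layer.
sub new {
    my ($class, $fh) = @_;
    return bless { fh => $fh }, $class;
}

# Assumption: "Data" holds the character data of the event.
sub data {
    my ($self, $event) = @_;
    print { $self->{fh} } encode('UTF-8', $event->{Data});
}

package main;

if (@ARGV) {
    require SGML::Parser::OpenSP;    # loaded only when actually parsing
    my $p = SGML::Parser::OpenSP->new;
    $p->handler(Utf8Printer->new(\*STDOUT));
    $p->parse($ARGV[0]);
}
```

       Setting an ":encoding(UTF-8)" layer on the filehandle with "binmode"
       and printing the strings unchanged is an equivalent alternative.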

       "parse" has limited support for binary input, but the binary input must
       be compatible with OpenSP's generic interface requirements and you must
       specify the encoding through means available to OpenSP to enable	it to
       properly	decode the binary input. Any encoding meta data	about such
       binary input specific to	Perl (such as encoding disciplines for file
       handles when you	pass a file descriptor)	will be	ignored. For more
       specific	information refer to the OpenSP	manual.

       o   <http://openjade.sourceforge.net/doc/sysid.htm>

       o   <http://openjade.sourceforge.net/doc/charset.htm>

ENVIRONMENT VARIABLES
       OpenSP supports a number	of environment variables to control specific
       processing aspects such as "SGML_SEARCH_PATH" or	"SP_CHARSET_FIXED".
       Portable	applications need to ensure that these are set prior to
       loading the OpenSP library into memory which happens when the XS	code
       is loaded. This means you need to wrap the code into a "BEGIN" block:

	 BEGIN { $ENV{SP_CHARSET_FIXED}	= 1; }
	 use SGML::Parser::OpenSP;
	 # ...

       Otherwise changes to the	environment might not propagate	to OpenSP.
       This applies specifically to Win32 systems.

       SGML_SEARCH_PATH
	   See <http://openjade.sourceforge.net/doc/sysid.htm>.

       SP_HTTP_USER_AGENT
	   The "User-Agent" header for HTTP requests.

       SP_HTTP_ACCEPT
	   The "Accept"	header for HTTP	requests.

       SP_MESSAGE_FORMAT
           Enable run-time selection of the message format. The value is one
           of "XML", "NONE", "TRADITIONAL". Whether this will have an effect
           depends on a compile-time setting which might not be enabled in
           your OpenSP build. This module assumes that no such support was
           compiled in.

       SGML_CATALOG_FILES
       SP_USE_DOCUMENT_CATALOG
	   See <http://openjade.sourceforge.net/doc/catalog.htm>.

       SP_SYSTEM_CHARSET
       SP_CHARSET_FIXED
       SP_BCTF
       SP_ENCODING
	   See <http://openjade.sourceforge.net/doc/charset.htm>.

       Note that you can use the "search_dirs" method instead of using
       "SGML_SEARCH_PATH" and the "catalogs" method instead of using
       "SGML_CATALOG_FILES" and	attributes on storage object specifications
       for "SP_BCTF" and "SP_ENCODING" respectively. For example, if
       "SP_CHARSET_FIXED" is set to 1 you can use

	 $p->parse("<OSFILE encoding='UTF-8'>example.xhtml");

       to process "example.xhtml" using	the "UTF-8" character encoding.

KNOWN ISSUES
       OpenSP must be compiled with "SP_MULTI_BYTE" defined and with
       "SP_WIDE_SYSTEM" undefined; otherwise this module will break at
       runtime or not compile.

BUG REPORTS
       Please report bugs in this module via
       <http://rt.cpan.org/NoAuth/Bugs.html?Dist=SGML-Parser-OpenSP>

       Please report bugs in OpenSP via
       <http://sf.net/tracker/?group_id=2115&atid=102115>

       Please send comments and	questions to the spo-devel mailing list, see
       <http://lists.sf.net/lists/listinfo/spo-devel> for details.

SEE ALSO
       o   <http://openjade.sf.net/doc/generic.htm>

       o   <http://openjade.sf.net/>

       o   <http://sf.net/projects/spo/>

AUTHORS
	 Terje Bless <link@cpan.org> wrote version 0.01.
	 Bjoern	Hoehrmann <bjoern@hoehrmann.de>	wrote version 0.02+.

COPYRIGHT AND LICENSE
	 Copyright (c) 2006-2008 Bjoern	Hoehrmann <bjoern@hoehrmann.de>.
	 This module is	licensed under the same	terms as Perl itself.

perl v5.24.1			  2017-07-02	       SGML::Parser::OpenSP(3)
