LaTeX::TOM(3)	      User Contributed Perl Documentation	 LaTeX::TOM(3)

NAME
       LaTeX::TOM - A module for parsing, analyzing, and manipulating LaTeX
       documents

SYNOPSIS
        use LaTeX::TOM;

        $parser = LaTeX::TOM->new;

        $document = $parser->parseFile('mypaper.tex');

        $latex = $document->toLaTeX;

        $specialnodes = $document->getNodesByCondition(sub {
            my $node = shift;
            return (
              $node->getNodeType eq 'TEXT'
                && $node->getNodeText =~ /magic string/
            );
        });

        $sections = $document->getNodesByCondition(sub {
            my $node = shift;
            return (
              $node->getNodeType eq 'COMMAND'
                && $node->getCommandName =~ /section$/
            );
        });

        $indexme = $document->getIndexableText;


DESCRIPTION
       This module provides a parser which parses and interprets (though not
       fully) LaTeX documents and returns a tree-based representation of what
       it finds.  This tree is a "LaTeX::TOM::Tree".  The tree contains
       "LaTeX::TOM::Node" nodes.

       This module should be especially	useful to anyone who wants to do
       processing of LaTeX documents that requires extraction of plain-text
       information, or altering	of the plain-text components (or
       alternatively, the math-text components).

COMPONENTS
   LaTeX::TOM::Parser
       The parser recognizes three parameters upon creation.  The parameters,
       in order, are:

       parse error handling (= 0 || 1 || 2)
	   Determines what happens when	a parse	error is encountered.  0
	   results in a	warning.  1 results in a die.  2 results in silence.
	   Note	that particular	groupings in LaTeX (i.e. newcommands and the
	   like) contain invalid TeX or	LaTeX, so you nearly always need this
	   parameter to	be 0 or	2 to completely	parse the document.

       read inputs flag	(= 0 ||	1)
	   This	flag determines	whether	a scan for "\input" and	"\input-like"
	   commands is performed, and the resulting called files parsed	and
	   added to the	parent parse tree.  0 means no,	1 means	do it.	Note
	   that	this will happen recursively if	it is turned on.  Also,
	   bibliographies (.bbl	files) are detected and	included.

       apply mappings flag (= 0	|| 1)
	   This	flag determines	whether	(most) user-defined mappings are
	   applied.  This means	"\defs", "\newcommands", and
	   "\newenvironments".	This is	critical for properly analyzing	the
	   content of the document, as this must be phrased in terms of	the
	   semantics of	the original TeX and LaTeX commands, not ad hoc	user
	   macros.  So,	for instance, do not expect plain-text extraction to
	   work	properly with this option off.

       The parser returns a "LaTeX::TOM::Tree" ($document in the SYNOPSIS).

   LaTeX::TOM::Node
       Nodes may be of the following types:

	   "TEXT" nodes	can be thought of as representing the plain-text
	   portions of the LaTeX document.  This includes math and anything
	   else	that is	not a recognized TeX or	LaTeX command, or user-defined
	   command.  In	reality, "TEXT"	nodes contain commands that this
	   parser does not yet recognize the semantics of.

	   A "COMMAND" node represents a TeX command.  It always has child
	   nodes in a tree, though the tree might be empty if the command
           operates on zero parameters. An example of a command is

              \textbf{blah}

	   This	would parse into a "COMMAND" node for "textbf",	which would
	   have	a subtree containing the "TEXT"	node with text ``blah.''

	   Similarly, TeX environments parse into "ENVIRONMENT"	nodes, which
	   have	metadata about the environment,	along with a subtree
	   representing	what is	contained in the environment.  For example,

              \begin{equation}
                r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
              \end{equation}

           This would parse into an "ENVIRONMENT" node of the class ``equation''
           with a child tree containing the result of parsing ``r = \frac{-b
           \pm \sqrt{b^2 - 4ac}}{2a}.''

	   A "GROUP" is	like an	anonymous "COMMAND".  Since you	can put
	   whatever you	want in	curly-braces ("{}") in TeX in order to make
	   semantically	isolated regions, this separation is preserved by the
	   parser.  A "GROUP" is just the subtree of the parsed	contents of
	   plain curly-braces.

	   It is important to note that	currently only the first "GROUP" in a
	   series of "GROUP"s following	a LaTeX	command	will actually be
	   parsed into a "COMMAND" node.  The reason is	that, for the initial
	   purposes of this module, it was not necessary to recognize
	   additional "GROUP"s as additional parameters	to the "COMMAND".
	   However, this is something that this	module really should do
	   eventually.	Currently if you want all the parameters to a multi-
	   parametered command,	you'll need to pick out	all the	following
	   "GROUP" nodes yourself.

	   Eventually this will	become something like a	list which is stored
	   in the "COMMAND" node, much like XML::DOM's treatment of
	   attributes.	These are, in a	sense, apart from the rest of the
	   document tree.  Then	"GROUP"	nodes will become much more rare.

           A "COMMENT" node is very similar to a "TEXT" node, except it is
           specifically for lines beginning with ``%'' (the TeX comment
           delimiter) or the right-hand portion of a line that has ``%'' at
           some internal point.
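
       To see which node types a given input produces, one can count them.
       The sketch below assumes a default parser and uses a made-up input
       fragment; the exact counts depend on how the parser splits the text,
       so none are claimed here.

```perl
use strict;
use warnings;
use LaTeX::TOM;

# Illustrative input touching several node types.
my $latex = <<'END';
% a comment line
Some \textbf{blah} text.
\begin{equation}
  r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
\end{equation}
END

my $document = LaTeX::TOM->new->parse($latex);

# getAllNodes returns an arrayref holding every node of the tree.
my %count;
for my $node (@{ $document->getAllNodes }) {
    $count{ $node->getNodeType }++;
}

# Report how many nodes of each type the parser produced.
printf "%-12s %d\n", $_, $count{$_} for sort keys %count;
```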

   LaTeX::TOM::Tree
       As mentioned before, the Tree is the return result of a parse.

       The tree	is nothing more	than an	arrayref of Nodes, some	of which may
       contain their own trees.	 This is useful	knowledge at this point, since
       the user	isn't provided with a full suite of convenient tree-
       modification methods.  However, Trees do	already	have some very
       convenient methods, described in	the next section.
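
       Because child trees nest inside nodes, a recursive walk is the
       natural way to visit everything.  The following is a sketch only:
       walk() is a hypothetical helper, not part of the module, and it
       relies on the sibling links that the parser sets up.

```perl
use strict;
use warnings;
use LaTeX::TOM;

# Recursively print an indented outline of a LaTeX::TOM::Tree.
sub walk {
    my ($tree, $depth) = @_;
    for (my $node = $tree->getFirstNode; $node; $node = $node->getNextSibling) {
        my $type = $node->getNodeType;
        print '  ' x $depth, $type, "\n";
        # Only these three node types carry a child tree.
        if ($type =~ /^(?:COMMAND|ENVIRONMENT|GROUP)$/) {
            my $child = $node->getChildTree;
            walk($child, $depth + 1) if $child;
        }
    }
}

my $document = LaTeX::TOM->new->parse('Intro {grouped \emph{text}} outro.');
walk($document, 0);
```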

METHODS
   LaTeX::TOM
       new

           Instantiate a new parser object.

       In this section all of the methods for each of the components are
       listed and described.

   LaTeX::TOM::Parser
       The methods for the parser (aside from the constructor, discussed
       above) are:

       parseFile (filename)

           Read in the contents of filename and parse them, returning a
           "LaTeX::TOM::Tree".

       parse (string)

           Parse the string string and return a "LaTeX::TOM::Tree".

   LaTeX::TOM::Tree
       This section contains methods for the Trees returned by the parser.

       copy

           Duplicate a tree into new memory.

       print

           A debug print of the structure of the tree.

       plainText

           Returns an arrayref which is a list of strings representing the
           text of all "getNodePlainTextFlag = 1" "TEXT" nodes, in an inorder
           traversal.

       indexableText

           A method like the above but which goes one step further; it cleans
           all of the returned text and concatenates it into a single string
           which one could consider having all of the standard information
           retrieval value for the document, making it useful for indexing.

       toLaTeX

           Return a string representing the LaTeX encoded by the tree.  This
           is especially useful to get a normal document again, after
           modifying nodes of the tree.

       getTopLevelNodes

           Return a list of "LaTeX::TOM::Nodes" at the top level of the Tree.

       getAllNodes

           Return an arrayref with all nodes of the tree.  This "flattens" the
           tree.

       getCommandNodesByName (name)

           Return an arrayref with all "COMMAND" nodes in the tree which have
           a name matching name.

       getEnvironmentsByName (name)

           Return an arrayref with all "ENVIRONMENT" nodes in the tree which
           have a class matching name.

       getNodesByCondition (code reference)

           This is a catch-all search method which can be used to pull out
           nodes that match pretty much any perl expression, without manually
           having to traverse the tree.  code reference is a perl code
           reference which receives as its first argument the node of the tree
           that is currently scrutinized and is expected to return a boolean
           value. See the SYNOPSIS for examples.

       getFirstNode

           Returns the first node of the tree.  This is useful if you want to
           walk the tree yourself, starting with the first node.
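
       The name-based lookups can be combined with the child-tree accessor
       to pull out the contents of specific commands or environments.  This
       is a sketch with a made-up input fragment; it assumes that the child
       tree of a node supports toLaTeX like any other Tree.

```perl
use strict;
use warnings;
use LaTeX::TOM;

my $document = LaTeX::TOM->new->parse(<<'END');
\section{Results}
Some \textbf{blah} text.
\begin{equation}
  E = mc^2
\end{equation}
END

# Both lookups return an arrayref of matching nodes.
my $bold      = $document->getCommandNodesByName('textbf');
my $equations = $document->getEnvironmentsByName('equation');

# Print what each match contains, serialized back to LaTeX.
print "bold: ",     $_->getChildTree->toLaTeX, "\n" for @$bold;
print "equation: ", $_->getChildTree->toLaTeX, "\n" for @$equations;
```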

   LaTeX::TOM::Node
       This section contains the methods for nodes of the parsed Trees.

       getNodeType

           Returns the type, one of "TEXT", "COMMAND", "ENVIRONMENT", "GROUP",
           or "COMMENT", as described above.

       getNodeText

           Applicable for "TEXT" or "COMMENT" nodes; this returns the document
           text they contain.  This is undef for other node types.

       setNodeText

           Set the node text, also for "TEXT" and "COMMENT" nodes.

       getNodeStartingPosition

           Get the starting character position in the document of this node.
           For "TEXT" and "COMMENT" nodes, this will be where the text begins.
           For "ENVIRONMENT", "COMMAND", or "GROUP" nodes, this will be the
           position of the last character of the opening identifier.

       getNodeEndingPosition

           Same as above, but for last character.  For "GROUP", "ENVIRONMENT",
           or "COMMAND" nodes, this will be the first character of the closing
           identifier.

       getNodeOuterStartingPosition

           Same as getNodeStartingPosition, but for "GROUP", "ENVIRONMENT", or
           "COMMAND" nodes, this returns the first character of the opening
           identifier.

       getNodeOuterEndingPosition

           Same as getNodeEndingPosition, but for "GROUP", "ENVIRONMENT", or
           "COMMAND" nodes, this returns the last character of the closing
           identifier.

       getNodeMathFlag

           This applies to any node type.  It is 1 if the node sets, or is
           contained within, a math mode region.  0 otherwise.  "TEXT" nodes
           which have this flag as 1 can be assumed to be the actual
           mathematics contained in the document.

       getNodePlainTextFlag

           This applies only to "TEXT" nodes.  It is 1 if the node is non-math
           and is visible (in other words, will end up being a part of the
           output document). One would only want to index "TEXT" nodes with
           this property, for information retrieval purposes.

       getEnvironmentClass

           This applies only to "ENVIRONMENT" nodes.  Returns what class of
           environment the node represents (the "X" in "\begin{X}" and
           "\end{X}").

       getCommandName

           This applies only to "COMMAND" nodes.  Returns the name of the
           command (the "X" in "\X{...}").

       getChildTree

           This applies only to "COMMAND", "ENVIRONMENT", and "GROUP" nodes:
           it returns the "LaTeX::TOM::Tree" which is ``under'' the calling
           node.

       getFirstChild

           This applies only to "COMMAND", "ENVIRONMENT", and "GROUP" nodes:
           it returns the first node from the first level of the child
           tree.

       getLastChild

           Same as above, but for the last node of the first level.

       getPreviousSibling

           Return the prior node on the same level of the tree.

       getNextSibling

           Same as above, but for following node.

       getParent

           Get the parent node of this node in the tree.

       getNextGroupNode

           This is an interesting function, and kind of a hack because of the
           way the parser makes the current tree.  Basically it will give you
           the next sibling that is a "GROUP" node, until it either hits the
           end of the tree level, a "TEXT" node which doesn't match "/^\s*$/",
           or a "COMMAND" node.

	   This	is useful for finding all "GROUP"ed parameters after a
	   "COMMAND" node (see comments	for "GROUP" in the "COMPONENTS"	/
	   "LaTeX::TOM::Node" section).	 You can just have a while loop	that
	   calls this method until it gets "undef", and	you'll know you've
	   found all the parameters to a command.

	   Note: this may be bad, but "TEXT" Nodes matching
	   "/^\s*\[[0-9]+\]$/" (optional parameter groups) are treated as if
	   they	were 'blank'.
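
       The while loop described above might look like the following sketch.
       trailing_groups() is a hypothetical helper, not part of the module,
       and the input string is made up purely to give a command some
       trailing groups.

```perl
use strict;
use warnings;
use LaTeX::TOM;

# Collect the GROUP nodes that trail a COMMAND node.
sub trailing_groups {
    my ($command) = @_;
    my @groups;
    my $node = $command;
    # getNextGroupNode returns undef once no further GROUP follows.
    while (defined($node = $node->getNextGroupNode)) {
        push @groups, $node;
    }
    return \@groups;
}

my $document = LaTeX::TOM->new->parse('\textbf{first}{second}{third}');
for my $command (@{ $document->getCommandNodesByName('textbf') }) {
    my $groups = trailing_groups($command);
    printf "\\%s is followed by %d GROUP node(s)\n",
        $command->getCommandName, scalar @$groups;
}
```

       Each collected node is a "GROUP", so its contents are available
       through the usual child-tree methods.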

CAVEATS
       Due to the lack of tree-modification methods, currently this module is
       mostly useful for minor modifications to	the parsed document, for
       instance, altering the text of "TEXT" nodes but not deleting the	nodes.
       Of course, the user can still do	this by	breaking abstraction and
       directly	modifying the Tree.
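
       A minor modification of this kind, altering text without deleting
       nodes, could be sketched as follows.  The input string and the
       spelling substitution are illustrative; the math flag is used to
       leave mathematics alone.

```perl
use strict;
use warnings;
use LaTeX::TOM;

my $document = LaTeX::TOM->new->parse('The colour of \emph{colour} is $colour$.');

# Select TEXT nodes that are not flagged as math.
my $text_nodes = $document->getNodesByCondition(sub {
    my $node = shift;
    return $node->getNodeType eq 'TEXT' && !$node->getNodeMathFlag;
});

# Rewrite the text in place; the nodes themselves stay in the tree.
for my $node (@$text_nodes) {
    (my $text = $node->getNodeText) =~ s/colour/color/g;
    $node->setNodeText($text);
}

# Serialize the modified tree back to LaTeX.
print $document->toLaTeX, "\n";
```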

       Also note that the parsing is not complete.  This module	was not
       written with the	intention of being able	to produce output documents
       the way ``latex'' does.	The intent was instead to be able to analyze
       and modify the document on a logical level with regard to the content;
       it doesn't care about the document formatting and outputting side of

       There is	much work still	to be done.  See the TODO list in the

BUGS
       Probably plenty.  However, this module has performed fairly well on a
       set of ~1000 research publications from the Computing Research
       Repository, so I	deemed it ``good enough'' to use for purposes similar
       to mine.

       Please let the maintainer know of parser	errors if you discover any.

CREDITS
       Thanks to those (in order of appearance) who have contributed valuable
       suggestions and patches:

	Otakar Smrz
	Moritz Lenz
	James Bowlin
	Jesse S. Bangs

AUTHORS
       Written by Aaron Krowne

       Maintained by Steven Schubiger

LICENSE
       This program is free software; you may redistribute it and/or modify it
       under the same terms as Perl itself.

       See <>

perl v5.32.1			  2011-12-23			 LaTeX::TOM(3)

