Regexp::Grammars(3)   User Contributed Perl Documentation  Regexp::Grammars(3)

       Regexp::Grammars	- Add grammatical parsing features to Perl 5.10

       This document describes Regexp::Grammars	version	1.026

	   use Regexp::Grammars;

           my $parser = qr{
               (?:
                   <Verb>               # Parse and save a Verb in a scalar
                   <.ws>                # Parse but don't save whitespace
                   <Noun>               # Parse and save a Noun in a scalar

                   <type=(?{ rand > 0.5 ? 'VN' : 'VerbNoun' })>
                                        # Save result of expression in a scalar
               |
                   (?:
                       <[Noun]>         # Parse a Noun and save result in a list
                                        #   (saved under the key 'Noun')
                       <[PostNoun=ws]>  # Parse whitespace, save it in a list
                                        #   (saved under the key 'PostNoun')
                   )+

                   <Verb>               # Parse a Verb and save result in a scalar
                                        #   (saved under the key 'Verb')

                   <type=(?{ 'VN' })>   # Save a literal in a scalar
               |
                   <debug: match>       # Turn on the integrated debugger here
                   <.Cmd= (?: mv? )>    # Parse but don't capture a subpattern
                                        #   (name it 'Cmd' for debugging purposes)
                   <[File]>+            # Parse 1+ Files and save them in a list
                                        #   (saved under the key 'File')
                   <debug: off>         # Turn off the integrated debugger here
                   <Dest=File>          # Parse a File and save it in a scalar
                                        #   (saved under the key 'Dest')
               )

	       <token: File>		  # Define a subrule named File
		   <.ws>		  #  - Parse but don't capture whitespace
		   <MATCH= ([\w-]+) >	  #  - Parse the subpattern and	capture
					  #    matched text as the result of the
					  #    subrule

	       <token: Noun>		  # Define a subrule named Noun
		   cat | dog | fish	  #  - Match an	alternative (as	usual)

	       <rule: Verb>		  # Define a whitespace-sensitive subrule
		   eats			  #  - Match a literal (after any space)
		   <Object=Noun>?	  #  - Parse optional subrule Noun and
					  #    save result under the key 'Object'
	       |			  #  Or	else...
		   <AUX>		  #  - Parse subrule AUX and save result
		   <part= (eaten|seen) >  #  - Match a literal,	save under 'part'

	       <token: AUX>		  # Define a whitespace-insensitive subrule
		   (has	| is)		  #  - Match an	alternative and	capture
                   (?{ $MATCH = uc $^N }) #  - Use captured text as subrule result

           }x;

           # Match the grammar against some text...
           if ($text =~ $parser) {
               # If successful, the hash %/ will have the hierarchy of results...
               process_data_in( %/ );
           }

   In your program...
	   use Regexp::Grammars;    Allow enhanced regexes in lexical scope
	   %/			    Result-hash	for successful grammar match

   Defining and	using named grammars...
	   <grammar:  GRAMMARNAME>  Define a named grammar that	can be inherited
	   <extends:  GRAMMARNAME>  Current grammar inherits named grammar's rules

   Defining rules in your grammar...
	   <rule:     RULENAME>	    Define rule	with magic whitespace
	   <token:    RULENAME>	    Define rule	without	magic whitespace

	   <objrule:  CLASS= NAME>  Define rule	that blesses return-hash into class
	   <objtoken: CLASS= NAME>  Define token that blesses return-hash into class

	   <objrule:  CLASS>	    Shortcut for above (rule name derived from class)
	   <objtoken: CLASS>	    Shortcut for above (token name derived from	class)

   Matching rules in your grammar...
	   <RULENAME>		    Call named subrule (may be fully qualified)
				    save result	to $MATCH{RULENAME}

	   <RULENAME(...)>	    Call named subrule,	passing	args to	it

	   <!RULENAME>		    Call subrule and fail if it	matches
	   <!RULENAME(...)>	    (shorthand for (?!<.RULENAME>) )

	   <:IDENT>		    Match contents of $ARG{IDENT} as a pattern
	   <\:IDENT>		    Match contents of $ARG{IDENT} as a literal
	   </:IDENT>		    Match closing delimiter for	$ARG{IDENT}

	   <%HASH>		    Match longest possible key of hash
	   <%HASH {PAT}>	    Match any key of hash that also matches PAT

	   </IDENT>		    Match closing delimiter for	$MATCH{IDENT}
	   <\_IDENT>		    Match the literal contents of $MATCH{IDENT}

	   <ALIAS= RULENAME>	    Call subrule, save result in $MATCH{ALIAS}
	   <ALIAS= %HASH>	    Match a hash key, save key in $MATCH{ALIAS}
	   <ALIAS= ( PATTERN )>	    Match pattern, save	match in $MATCH{ALIAS}
	   <ALIAS= (?{ CODE })>	    Execute code, save value in	$MATCH{ALIAS}
	   <ALIAS= 'STR' >	    Save specified string in $MATCH{ALIAS}
	   <ALIAS= 42 >		    Save specified number in $MATCH{ALIAS}
	   <ALIAS= /IDENT>	    Match closing delim, save as $MATCH{ALIAS}
	   <ALIAS= \_IDENT>	    Match '$MATCH{IDENT}', save	as $MATCH{ALIAS}

	   <.SUBRULE>		    Call subrule (one of the above forms),
				    but	don't save the result in %MATCH

	   <[SUBRULE]>		    Call subrule (one of the above forms), but
				    append result instead of overwriting it

	   <SUBRULE1>+ % <SUBRULE2> Match one or more repetitions of SUBRULE1
				    as long as they're separated by SUBRULE2
	   <SUBRULE1> ** <SUBRULE2> Same (only for backwards compatibility)

	   <SUBRULE1>* % <SUBRULE2> Match zero or more repetitions of SUBRULE1
				    as long as they're separated by SUBRULE2

   In your grammar's code blocks...
	   $CAPTURE    Alias for $^N (the most recent paren capture)
	   $CONTEXT    Another alias for $^N
	   $INDEX      Current index of	next matching position in string
	   %MATCH      Current rule's result-hash
	   $MATCH      Magic override value (returned instead of result-hash)
	   %ARG	       Current rule's argument hash
	   $DEBUG      Current match-time debugging mode

	   <require: (?{ CODE })   >  Fail if code evaluates false
	   <timeout: INT	   >  Fail if matching takes too long
	   <debug:   COMMAND	   >  Change match-time	debugging mode
	   <logfile: LOGFILE	   >  Change debugging log file	(default: STDERR)
	   <fatal:   TEXT|(?{CODE})>  Queue error message and fail parse
	   <error:   TEXT|(?{CODE})>  Queue error message and backtrack
	   <warning: TEXT|(?{CODE})>  Queue warning message and	continue
	   <log:     TEXT|(?{CODE})>  Explicitly add a message to debugging log
	   <ws:	     PATTERN	   >  Override automatic whitespace matching
	   <minimize:>		      Simplify the result of a subrule match
	   <context:>		      Switch on	context	substring retention
	   <nocontext:>		      Switch off context substring retention

       This module adds a small number of new regex constructs that can be
       used within Perl 5.10 patterns to implement complete recursive-descent
       parsing.

       Perl 5.10 already supports recursive-descent matching, via the new
       "(?<name>...)" and "(?&name)" constructs. For example, here is a simple
       matcher for a subset of the LaTeX markup language:

           $matcher = qr{
               (?&File)

               (?(DEFINE)
                   (?<File>     (?&Element)* )

                   (?<Element>  \s* (?&Command)
                             |  \s* (?&Literal)
                   )

                   (?<Command>  \\ \s* (?&Literal) \s* (?&Options)? \s* (?&Args)? )

                   (?<Options>  \[ \s* (?:(?&Option) (?:\s*,\s* (?&Option) )*)? \s* \])

                   (?<Args>     \{ \s* (?&Element)* \s* \}  )

                   (?<Option>   \s* [^][\$&%#_{}~^\s,]+     )

                   (?<Literal>  \s* [^][\$&%#_{}~^\s]+      )
               )
           }xms;

       This technique makes it possible	to use regexes to recognize complex,
       hierarchical--and even recursive--textual structures. The problem is
       that Perl 5.10 doesn't provide any support for extracting that
       hierarchical data into nested data structures. In other words, using
       Perl 5.10 you can match complex data, but not parse it into an
       internally useful form.
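
       The limitation can be seen in a minimal sketch that uses only core
       Perl 5.10 features. The pattern and the sample strings below are
       illustrative assumptions, not part of the module. The recursive
       pattern recognizes nested parentheses, but a successful match hands
       back only a flat substring, with no record of the nesting:

```perl
use strict;
use warnings;

# Plain Perl 5.10 recursion: (?&group) calls the named pattern defined
# inside the (?(DEFINE)...) block, which may in turn call itself.
my $balanced = qr{
    \A ( (?&group) ) \z

    (?(DEFINE)
        (?<group>  \( (?: [^()]++ | (?&group) )* \)  )
    )
}x;

for my $text ('(a(b)c)', '(a(b c)') {
    if ($text =~ $balanced) {
        # All we get back is the matched text; the nesting is lost.
        print "matched:  $1\n";
    }
    else {
        print "no match: $text\n";
    }
}
```

       Recovering the structure from such a match would require re-parsing
       the captured text by hand.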

       An additional problem when using	Perl 5.10 regexes to match complex
       data formats is that you	have to	make sure you remember to insert
       whitespace-matching constructs (such as "\s*") at every possible
       position	where the data might contain ignorable whitespace. This
       reduces the readability of such patterns, and increases the chance of
       errors (typically caused	by overlooking a location where	whitespace
       might appear).

       The Regexp::Grammars module solves both those problems.

       If you import the module	into a particular lexical scope, it
       preprocesses any	regex in that scope, so	as to implement	a number of
       extensions to the standard Perl 5.10 regex syntax. These	extensions
       simplify	the task of defining and calling subrules within a grammar,
       and allow those subrule calls to capture and retain the components
       they match in a proper hierarchical manner.

       For example, the	above LaTeX matcher could be converted to a full LaTeX
       parser (and considerably	tidied up at the same time), like so:

	   use Regexp::Grammars;
           $parser = qr{
               <File>

               <rule: File>       <[Element]>*

               <rule: Element>    <Command> | <Literal>

               <rule: Command>    \\  <Literal>  <Options>?  <Args>?

               <rule: Options>    \[  <[Option]>+ % (,)  \]

               <rule: Args>       \{  <[Element]>*  \}

               <rule: Option>     [^][\$&%#_{}~^\s,]+

               <rule: Literal>    [^][\$&%#_{}~^\s]+
           }xms;

       Note that there is no need to explicitly	place "\s*" subpatterns
       throughout the rules; that is taken care	of automatically.

       If the Regexp::Grammars version of this regex were successfully matched
       against some appropriate	LaTeX document,	each rule would	call the
       subrules	specified within it, and then return a hash containing
       whatever	result each of those subrules returned,	with each result
       indexed by the subrule's	name.

       That is,	if the rule named "Command" were invoked, it would first try
       to match	a backslash, then it would call	the three subrules
       "<Literal>", "<Options>", and "<Args>" (in that sequence). If they all
       matched successfully, the "Command" rule	would then return a hash with
       three keys: 'Literal', 'Options', and 'Args'. The value for each	of
       those hash entries would	be whatever result-hash	the subrules
       themselves had returned when matched.

       In this way, each level of the hierarchical regex can generate hashes
       recording everything its	own subrules matched, so when the entire
       pattern matches,	it produces a tree of nested hashes that represent the
       structured data the pattern matched.

       For example, if the previous regex grammar were matched against a
       string containing:

           \documentclass[a4paper,11pt]{article}
           \author{D. Conway}

       it would	automatically extract a	data structure equivalent to the
       following (but with several extra "empty" keys, which are described in
       "Subrule	results"):

               'file' => {
                   'element' => [
                       {
                           'command' => {
                               'literal' => 'documentclass',
                               'options' => {
                                   'option'  => [ 'a4paper', '11pt' ],
                               },
                               'args'    => {
                                   'element' => [ 'article' ],
                               },
                           },
                       },
                       {
                           'command' => {
                               'literal' => 'author',
                               'args' => {
                                   'element' => [
                                       { 'literal' => 'D.' },
                                       { 'literal' => 'Conway' },
                                   ],
                               },
                           },
                       },
                   ],
               },

       The data	structure that Regexp::Grammars	produces from a	regex match is
       available to the	surrounding program in the magic variable "%/".

       Regexp::Grammars	provides many features that simplify the extraction of
       hierarchical data via a regex match, and	also some features that	can
       simplify	the processing of that data once it has	been extracted.	The
       following sections explain each of those	features, and some of the
       parsing techniques they support.

   Setting up the module
       Just add:

	   use Regexp::Grammars;

       to any lexical scope. Any regexes within	that scope will	automatically
       now implement the new parsing constructs:

	   use Regexp::Grammars;

	   my $parser =	qr/ regex with $extra <chocolatey> grammar bits	/x;

       Note that you will need to use the "/x" modifier	when declaring a regex
       grammar.	Otherwise, the default "a whitespace character matches exactly
       that whitespace character" behaviour of Perl regexes will mess up your
       grammar's parsing.

       Once the	grammar	has been processed, you	can then match text against
       the extended regexes, in	the usual manner (i.e. via a "=~" match):

           if ($input_text =~ $parser) {
               ...
           }

       After a successful match, the variable "%/" will	contain	a series of
       nested hashes representing the structured hierarchical data captured
       during the parse.
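
       Putting these steps together, a complete program might look like the
       following minimal sketch. The grammar, its rule names ("greeting",
       "salutation", "name") and the input string are hypothetical, chosen
       only to illustrate the workflow:

```perl
use strict;
use warnings;
use Regexp::Grammars;

# A hypothetical two-word grammar; the rule and token names are
# illustrative only.
my $parser = qr{
    <greeting>

    <rule: greeting>
        <salutation> <name>

    <token: salutation>
        hello | goodbye

    <token: name>
        \w+
}x;

my %result;
if ('hello world' =~ $parser) {
    %result = %/;    # copy the result-hash into a lexical for convenience
    print "salutation: $result{greeting}{salutation}\n";
    print "name:       $result{greeting}{name}\n";
}
```

       Copying "%/" into an ordinary lexical hash is optional, but it keeps
       the punctuation variable out of the rest of the program.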

   Structure of	a Regexp::Grammars grammar
       A Regexp::Grammars specification	consists of a start-pattern (which may
       include both standard Perl 5.10 regex syntax, as	well as	special
       Regexp::Grammars directives), followed by one or more rule or token
       definitions.

       For example:

	   use Regexp::Grammars;
	   my $balanced_brackets = qr{

	       # Start-pattern...
	       <paren_pair> | <brace_pair>

	       # Rule definition...
	       <rule: paren_pair>
		   \(  (?: <escape> | <paren_pair> | <brace_pair> | [^()] )*  \)

	       # Rule definition...
	       <rule: brace_pair>
		   \{  (?: <escape> | <paren_pair> | <brace_pair> | [^{}] )*  \}

	       # Token definition...
               <token: escape>
                   \\ .
           }xms;

       The start-pattern at the beginning of the grammar acts like the "top"
       token of the grammar, and must be matched completely for the grammar to
       match.

       This pattern is treated like a token for	whitespace matching behaviour
       (see "Tokens vs rules (whitespace handling)").  That is,	whitespace in
       the start-pattern is treated like whitespace in any normal Perl regex.

       The rules and tokens are	declarations only and they are not directly
       matched.	 Instead, they act like	subroutines, and are invoked by	name
       from the	initial	pattern	(or from within	a rule or token).

       Each rule or token extends from the directive that introduces it	up to
       either the next rule or token directive,	or (in the case	of the final
       rule or token) to the end of the	grammar.

   Tokens vs rules (whitespace handling)
       The difference between a	token and a rule is that a token treats	any
       whitespace within it exactly as a normal	Perl regular expression	would.
       That is,	a sequence of whitespace in a token is ignored if the "/x"
       modifier	is in effect, or else matches the same literal sequence	of
       whitespace characters (if "/x" is not in	effect).

       In a rule, any sequence of whitespace (except those at the very start
       and the very end	of the rule) is	treated	as matching the	implicit
       subrule "<.ws>",	which is automatically predefined to match optional
       whitespace (i.e.	"\s*").

       You can explicitly define a "<ws>" token	to change that default
       behaviour. For example, you could alter the definition of "whitespace"
       to include Perlish comments, by adding an explicit "<token: ws>":

	   <token: ws>
               (?: \s+ | \#[^\n]* )*

       But be careful not to define "<ws>" as a	rule, as this will lead	to all
       kinds of	infinitely recursive unpleasantness.

       Per-rule	whitespace handling

       Redefining the "<ws>" token changes its behaviour throughout the	entire
       grammar,	within every rule definition. Usually that's appropriate, but
       sometimes you need finer-grained	control	over whitespace	handling.

       So Regexp::Grammars provides the	"<ws:>"	directive, which allows	you to
       override	the implicit whitespace-matches-whitespace behaviour only
       within the current rule.

       Note that this directive does not redefine "<ws>" within the rule; it
       simply specifies what to replace each whitespace sequence with (instead
       of replacing each with a "<ws>" call).

       For example, if a language allows one kind of comment between
       statements and another within statements, you could parse it with:

	   <rule: program>
	       # One type of comment between...
	       <ws: (\s++ | \# .*? \n)*	>

               # ...semicolon-separated statements...
	       <[statement]>+ %	( ; )

	   <rule: statement>
	       # Another type of comment...
	       <ws: (\s*+ | \#{	.*? }\#	)* >

	       # ...between comma-separated commands...
	       <cmd>  <[arg]>+ % ( , )

       Note that each directive	only applies to	the rule in which it is
       specified. In every other rule in the grammar, whitespace would still
       match the usual "<ws>" subrule.

   Calling subrules
       To invoke a rule	to match at any	point, just enclose the	rule's name in
       angle brackets (like in Perl 6).	There must be no space between the
       opening bracket and the rulename. For example:

	       file:		 # Match literal sequence 'f' 'i' 'l' 'e' ':'
	       <name>		 # Call	<rule: name>
	       <options>?	 # Call	<rule: options>	(it's okay if it fails)

	       <rule: name>
		   # etc.

       If you need to match a literal pattern that would otherwise look	like a
       subrule call, just backslash-escape the leading angle:

	       file:		 # Match literal sequence 'f' 'i' 'l' 'e' ':'
	       \<name>		 # Match literal sequence '<' 'n' 'a' 'm' 'e' '>'
	       <options>?	 # Call	<rule: options>	(it's okay if it fails)

	       <rule: name>
		   # etc.

   Subrule results
       If a subrule call successfully matches, the result of that match	is a
       reference to a hash. That hash reference	is stored in the current
       rule's own result-hash, under the name of the subrule that was invoked.
       The hash	will, in turn, contain the results of any more deeply nested
       subrule calls, each stored under	the name by which the nested subrule
       was invoked.

       In other	words, if the rule "sentence" is defined:

	   <rule: sentence>
	       <noun> <verb> <object>

       then successfully calling the rule:

           <sentence>

       causes a	new hash entry at the current nesting level. That entry's key
       will be 'sentence' and its value	will be	a reference to a hash, which
       in turn will have keys: 'noun', 'verb', and 'object'.

       In addition each	result-hash has	one extra key: the empty string. The
       value for this key is whatever substring	the entire subrule call
       matched.	 This value is known as	the context substring.

       So, for example,	a successful call to "<sentence>" might	add something
       like the	following to the current result-hash:

           sentence => {
               ""     => 'I saw a dog',
               noun   => 'I',
               verb   => 'saw',
               object => {
                   ""      => 'a dog',
                   article => 'a',
                   noun    => 'dog',
               },
           }

       Note, however, that if the result-hash at any level contains only the
       empty-string key	(i.e. the subrule did not call any sub-subrules	or
       save any	of their nested	result-hashes),	then the hash is "unpacked"
       and just	the context substring itself is	returned.

       For example, if "<rule: sentence>" had been defined:

	   <rule: sentence>
	       I see dead people

       then a successful call to the rule would	only add:

	   sentence => 'I see dead people'

       to the current result-hash.
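
       The difference between a nested result-hash and an unpacked string
       can be checked with "ref". The following is a minimal sketch; the
       grammar is a hypothetical completion of the "sentence" example (the
       "object", "article", "noun" and "verb" definitions are assumptions
       made for illustration):

```perl
use strict;
use warnings;
use Regexp::Grammars;

# Hypothetical definitions for the subrules used by <rule: sentence>.
my $parser = qr{
    <sentence>

    <rule: sentence>   <noun> <verb> <object>
    <rule: object>     <article> <noun>

    <token: noun>      I | dog
    <token: verb>      saw
    <token: article>   a
}x;

if ('I saw a dog' =~ $parser) {
    my $sentence = $/{sentence};
    for my $key (qw( noun verb object )) {
        my $value = $sentence->{$key};
        # Subrules that called no sub-subrules are unpacked to plain strings.
        my $shown = ref($value) ? 'nested result-hash' : "'$value'";
        printf "%-6s => %s\n", $key, $shown;
    }
}
```

       Here "noun" and "verb" call no further subrules, so their results
       should arrive as plain strings, whereas "object" calls two subrules
       and should therefore arrive as a hash reference.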

       This is a useful	feature	because	it prevents a series of	nested subrule
       calls from producing very unwieldy data structures. For example,
       without this automatic unpacking, even the simple earlier example:

	   <rule: sentence>
	       <noun> <verb> <object>

       would produce something needlessly complex, such	as:

           sentence => {
               ""     => 'I saw a dog',
               noun   => {
                   "" => 'I',
               },
               verb   => {
                   "" => 'saw',
               },
               object => {
                   ""      => 'a dog',
                   article => {
                       "" => 'a',
                   },
                   noun    => {
                       "" => 'dog',
                   },
               },
           }

       Turning off the context substring

       The context substring is	convenient for debugging and for generating
       error messages but, in a	large grammar, or when parsing a long string,
       the capture and storage of many nested substrings may quickly become
       prohibitively expensive.

       So Regexp::Grammars provides a directive	to prevent context substrings
       from being retained. Any	rule or	token that includes the	directive
       "<nocontext:>" anywhere in the rule's body will not retain any context
       substring it matches...unless that substring would be the only entry in
       its result hash (which only happens within objrules and objtokens).

       If a "<nocontext:>" directive appears before the	first rule or token
       definition (i.e.	as part	of the main pattern), then the entire grammar
       will discard all context substrings from every one of its rules and
       tokens.

       However,	you can	override this universal	prohibition with a second
       directive: "<context:>".	If this	directive appears in any rule or
       token, that rule	or token will save its context substring, even if a
       global "<nocontext:>" is	in effect.

       This means that this grammar:

           qr{
               <Command>

               <rule: Command>
                   <nocontext:>
                   <Keyword> <arg=(\S+)>+ % <.ws>

               <token: Keyword>
                   <Move> | <Copy> | <Delete>

               # etc.
           }x

       and this grammar:

           qr{
               <nocontext:>
               <Command>

               <rule: Command>
                   <Keyword> <arg=(\S+)>+ % <.ws>

               <token: Keyword>
                   <context:>
                   <Move> | <Copy> | <Delete>

               # etc.
           }x

       will behave identically (saving context substrings for keywords, but
       not for commands), except that the first version will also retain the
       global context substring (i.e. $/{""}), whereas the second version will
       not.

       Note that "<context:>" and "<nocontext:>" have no effect	on, or even
       any interaction with, the various result	distillation mechanisms, which
       continue	to work	in the usual way when either or	both of	the directives
       is used.

   Renaming subrule results
       It is not always	convenient to have subrule results stored under	the
       same name as the	rule itself. Rule names	should be optimized for
       understanding the behaviour of the parser, whereas result names should
       be optimized for	understanding the structure of the data. Often those
       two goals are identical,	but not	always;	sometimes rule names need to
       describe	what the data looks like, while	result names need to describe
       what the	data means.

       For example, sometimes you need to call the same	rule twice, to match
       two syntactically identical components whose positions give them
       semantically distinct meanings:

	   <rule: copy_cmd>
	       copy <file> <file>

       The problem here	is that, if the	second call to "<file>"	succeeds, its
       result-hash will	be stored under	the key	'file',	clobbering the data
       that was	returned from the first	call to	"<file>".

       To avoid	such problems, Regexp::Grammars	allows you to alias any
       subrule call, so	that it	is still invoked by the	original name, but its
       result-hash is stored under a different key. The	syntax for that	is:
       "<alias=rulename>". For example:

	   <rule: copy_cmd>
	       copy <from=file>	<to=file>

       Here, "<rule: file>" is called twice, with the first result-hash	being
       stored under the	key 'from', and	the second result-hash being stored
       under the key 'to'.
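
       As a minimal sketch of aliasing in a complete grammar (the "file"
       token definition and the input string are assumptions added for
       illustration):

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $parser = qr{
    <copy_cmd>

    <rule: copy_cmd>
        copy <from=file> <to=file>

    # Hypothetical definition of a filename, for illustration only.
    <token: file>
        [\w.]+
}x;

if ('copy old.txt new.txt' =~ $parser) {
    my $cmd = $/{copy_cmd};
    # Both results come from <token: file>, but are stored under the aliases.
    print "from: $cmd->{from}\n";
    print "to:   $cmd->{to}\n";
}
```

       Because each call is aliased, neither result is stored under the
       key 'file', so the second call cannot clobber the first.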

       Note, however, that the alias before the	"=" must be a proper
       identifier (i.e.	a letter or underscore,	followed by letters, digits,
       and/or underscores). Aliases that start with an underscore and aliases
       named "MATCH" have special meaning (see "Private	subrule	calls" and
       "Result distillation" respectively).

       Aliases can also	be useful for normalizing data that may	appear in
       different formats and sequences.	For example:

	   <rule: copy_cmd>
	       copy <from=file>	       <to=file>
	     | dup    <to=file>	 as  <from=file>
	     |	    <from=file>	 ->    <to=file>
	     |	      <to=file>	 <-  <from=file>

       Here, regardless	of which order the old and new files are specified,
       the result-hash always gets:

           copy_cmd => {
               from => 'oldfile',
                 to => 'newfile',
           }

   List-like subrule calls
       If a subrule call is quantified with a repetition specifier:

           <rule: file_sequence>
               <file>+

       then each repeated match overwrites the corresponding entry in the
       surrounding rule's result-hash, so only the result of the final
       repetition will be retained. That is, if the above example matched the
       string "foo.pl bar.py baz.php", then the result-hash would contain:

           file_sequence => {
               ""   => 'foo.pl bar.py baz.php',
               file => 'baz.php',
           }

       Usually, that's not the desired outcome, so Regexp::Grammars provides
       another mechanism by which to call a subrule; one that saves all
       repetitions of its results.

       A regular subrule call consists of the rule's name surrounded by angle
       brackets. If, instead, you surround the rule's name with "<[...]>"
       (angle and square brackets) like so:

           <rule: file_sequence>
               <[file]>+

       then the rule is invoked in exactly the same way, but the result of
       that submatch is pushed onto an array nested inside the appropriate
       result-hash entry. In other words, if the above example matched the
       same "foo.pl bar.py baz.php" string, the result-hash would contain:

           file_sequence => {
               ""   => 'foo.pl bar.py baz.php',
               file => [ 'foo.pl', 'bar.py', 'baz.php' ],
           }

       This "listifying	subrule	call" can also be useful for non-repeated
       subrule calls, if the same subrule is invoked in	several	places in a
       grammar.	For example if a cmdline option	could be given either one or
       two values, you might parse it:

	   <rule: size_option>
	       -size <[size]> (?: x <[size]> )?

       The result-hash entry for 'size'	would then always contain an array,
       with either one or two elements,	depending on the input being parsed.

       Listifying subrules can also be given aliases, just like	ordinary
       subrules. The alias is always specified inside the square brackets:

	   <rule: size_option>
	       -size <[size=pos_integer]> (?: x	<[size=pos_integer]> )?

       Here, the sizes are parsed using	the "pos_integer" rule,	but saved in
       the result-hash in an array under the key 'size'.
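
       A minimal sketch of this rule in a complete grammar follows. The
       "pos_integer" token definition and the two input strings are
       assumptions made for illustration:

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $parser = qr{
    <size_option>

    <rule: size_option>
        -size <[size=pos_integer]> (?: x <[size=pos_integer]> )?

    # Hypothetical definition of a positive integer, for illustration only.
    <token: pos_integer>
        [1-9] \d*
}x;

for my $input ('-size 640 x 480', '-size 12') {
    if ($input =~ $parser) {
        # The 'size' entry is always an array reference, whatever its length.
        my @sizes = @{ $/{size_option}{size} };
        print "$input => (@sizes)\n";
    }
}
```

       Code that consumes the result can then always dereference 'size' as
       an array, without first checking how many values were supplied.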

   Parametric subrules
       When a subrule is invoked, it can be passed a set of named arguments
       (specified as key => value pairs). This argument list is placed in a
       normal Perl regex code block and	must appear immediately	after the
       subrule name, before the	closing	angle bracket.

       Within the subrule that has been	invoked, the arguments can be accessed
       via the special hash %ARG. For example:

           <rule: block>
               <tag>
                   <[block]>*
               <end_tag(?{ tag=>$MATCH{tag} })>  # subrule with argument

	   <token: end_tag>
	       end_ (??{ quotemeta $ARG{tag} })

       Here the	"block"	rule first matches a "<tag>", and the corresponding
       substring is saved in $MATCH{tag}. It then matches any number of	nested
       blocks. Finally it invokes the "<end_tag>" subrule, passing it an
       argument	whose name is 'tag' and	whose value is the current value of
       $MATCH{tag} (i.e. the original opening tag).

       When it is thus invoked,	the "end_tag" token first matches 'end_', then
       interpolates the	literal	value of the 'tag' argument and	attempts to
       match it.

       Any number of named arguments can be passed when	a subrule is invoked.
       For example, we could generalize the "end_tag" rule to allow any prefix
       (not just 'end_'), and also to allow for reversed tags (where the
       closing tag is the opening tag spelled backwards), like so:

           <rule: block>
               <tag>
                   <[block]>*
               <end_tag (?{ prefix=>'end', tag=>$MATCH{tag} })>

           <token: end_tag>
               (?:
                   (??{ $ARG{prefix} // q{(?!)} })      # ...prefix as pattern
                   (??{ quotemeta $ARG{tag} })          # ...tag as literal
               |
                   (??{ quotemeta reverse $ARG{tag} })  # ...reversed tag
               )

       Note that, if you do not	need to	interpolate values (such as
       $MATCH{tag}) into a subrule's argument list, you	can use	simple
       parentheses instead of "(?{...})", like so:

	       <end_tag( prefix=>'end',	tag=>'head' )>

       The only	types of values	you can	use in this simplified syntax are
       numbers and single-quote-delimited strings.  For	anything more complex,
       put the argument	list in	a full "(?{...})".

       As the earlier examples show, the single most common type of argument
       is one of the form: IDENTIFIER => $MATCH{IDENTIFIER}. That is, it's
       a common requirement to pass an element of %MATCH into a subrule, named
       with its own key.

       Because this is such a common usage, Regexp::Grammars provides a
       shortcut. If you	use simple parentheses (instead	of "(?{...})"
       parentheses) then instead of a pair, you	can specify an argument	using
       a colon followed	by an identifier.  This	argument is replaced by	a
       named argument whose name is the	identifier and whose value is the
       corresponding item from %MATCH. So, for example,	instead	of:

	       <end_tag(?{ prefix=>'end', tag=>$MATCH{tag} })>

       you can just write:

	       <end_tag( prefix=>'end',	:tag )>

       Accessing subrule arguments more	cleanly

       As the preceding	examples illustrate, using subrule arguments
       effectively generally requires the use of run-time interpolated
       subpatterns via the "(??{...})" construct.

       This produces ugly rule bodies such as:

           <token: end_tag>
               (?:
                   (??{ $ARG{prefix} // q{(?!)} })      # ...prefix as pattern
                   (??{ quotemeta $ARG{tag} })          # ...tag as literal
               |
                   (??{ quotemeta reverse $ARG{tag} })  # ...reversed tag
               )

       To simplify these common	usages,	Regexp::Grammars provides three
       convenience constructs.

       A subrule call of the form "<:identifier>" is equivalent to:

	   (??{	$ARG{'identifier'} // q{(?!)} })

       Namely: "Match the contents of $ARG{'identifier'}, treating those
       contents	as a pattern."

       A subrule call of the form "<\:identifier>" (that is: a matchref with
       a colon after the backslash) is equivalent to:

           (??{ defined $ARG{'identifier'}
                   ? quotemeta($ARG{'identifier'})
                   : '(?!)'
           })

       Namely: "Match the contents of $ARG{'identifier'}, treating those
       contents	as a literal."

       A subrule call of the form "</:identifier>" (that is: an invertref
       with a colon after the forward slash) is equivalent to:

           (??{ defined $ARG{'identifier'}
                   ? quotemeta(reverse $ARG{'identifier'})
                   : '(?!)'
           })

       Namely: "Match the closing delimiter corresponding to the contents of
       $ARG{'identifier'}, as if it were a literal".

       The availability of these three constructs means that we could rewrite
       the above "<end_tag>" token much more cleanly as:

           <token: end_tag>
               (?:
                   <:prefix>      # ...prefix as pattern
                   <\:tag>        # ...tag as a literal
               |
                   </:tag>        # ...reversed tag
               )

       In general these constructs mean that, within a subrule, if you want to
       match an argument passed to that subrule, you use "<:ARGNAME>" (to
       match the argument as a pattern) or "<\:ARGNAME>" (to match the
       argument as a literal).

       Note the consistent mnemonic in these various subrule-like
       interpolations of named arguments: the name is always prefixed by a
       colon.

       In other words, the "<:ARGNAME>" form works just like a "<RULENAME>",
       except that the leading colon tells Regexp::Grammars to use the
       contents of $ARG{'ARGNAME'} as the subpattern, instead of the contents
       of "(?&RULENAME)".

       Likewise, the "<\:ARGNAME>" and "</:ARGNAME>" constructs	work exactly
       like "<\_MATCHNAME>" and	"</INVERTNAME>"	respectively, except that the
       leading colon indicates that the	matchref or invertref should be	taken
       from %ARG instead of from %MATCH.
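
       The argument-passing shortcut and the literal-argument construct can
       be combined as in the following minimal sketch. The grammar is a
       hypothetical, simplified variant of the "block" example (a numeric
       "body" stands in for nested blocks), and the input strings are
       assumptions made for illustration:

```perl
use strict;
use warnings;
use Regexp::Grammars;

my $parser = qr{
    \A <block> \z

    # Simplified, hypothetical block: an opening tag, a numeric body,
    # and a closing tag that must repeat the opening tag.
    <rule: block>
        <tag=(\w+)> <body=(\d+)> <end_tag( :tag )>

    <token: end_tag>
        end_ <\:tag>
}x;

for my $input ('list 42 end_list', 'list 42 end_item') {
    my $outcome = $input =~ $parser ? 'matched' : 'no match';
    print "$input: $outcome\n";
}
```

       The first input should match, because its closing tag repeats the
       opening tag that was passed in via ":tag"; the second should not.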

   Pseudo-subrules
       Aliases can also be given to standard Perl subpatterns, as well as to
       code blocks within a regex. The syntax for subpatterns is:

           <ALIAS= ( PATTERN )>

       In other	words, the syntax is exactly like an aliased subrule call,
       except that the rule name is replaced with a set	of parentheses
       containing the subpattern. Any parentheses--capturing or
       non-capturing--will do.

       The effect of aliasing a	standard subpattern is to cause	whatever that
       subpattern matches to be	saved in the result-hash, using	the alias as
       its key.	For example:

	   <rule: file_command>

	       <cmd=(mv|cp|ln)>	 <from=file>  <to=file>

       Here, the "<cmd=(mv|cp|ln)>" is treated exactly like a regular
       "(mv|cp|ln)", but whatever substring it matches is saved	in the result-
       hash under the key 'cmd'.

       The syntax for aliasing code blocks is:

	   <ALIAS= (?{ your($code->here) }) >

       Note, however, that the code block must be specified in the standard
       Perl 5.10 regex notation: "(?{...})". A common mistake is to write:

	   <ALIAS= { your($code->here) } >

       instead,	which will attempt to interpolate $code	before the regex is
       even compiled, as such variables	are only "protected" from
       interpolation inside a "(?{...})".

       When correctly specified, this construct	executes the code in the block
       and saves the result of that execution in the result-hash, using	the
       alias as	its key. Aliased code blocks are useful	for adding semantic
       information based on which branch of a rule is executed.	For example,
       consider	the "copy_cmd" alternatives shown earlier:

	   <rule: copy_cmd>
	       copy <from=file>	       <to=file>
	     | dup    <to=file>	 as  <from=file>
	     |	    <from=file>	 ->    <to=file>
	     |	      <to=file>	 <-  <from=file>

       Using aliased code blocks, you could add	an extra field to the result-
       hash to describe	which form of the command was detected,	like so:

	   <rule: copy_cmd>
	       copy <from=file>	       <to=file>  <type=(?{ 'std' })>
	     | dup    <to=file>	 as  <from=file>  <type=(?{ 'rev' })>
	     |	    <from=file>	 ->    <to=file>  <type=(?{  +1	  })>
	     |	      <to=file>	 <-  <from=file>  <type=(?{  -1	  })>

       Now, if the rule matched, the result-hash would contain something like:

	   copy_cmd => {
	       from => 'oldfile',
		 to => 'newfile',
	       type => 'std',
	   }

       Note that, in addition to the semantics described above,	aliased
       subpatterns and code blocks also	become visible to Regexp::Grammars'
       integrated debugger (see	Debugging).
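       The branch-tagging idea can also be tried without the module. The
       following is only a plain-Perl sketch of the same technique, not what
       Regexp::Grammars generates: the names $copy_cmd, $type and
       parse_copy_cmd() are hypothetical, only two of the four alternatives
       are shown, and an ordinary package variable stands in for the
       result-hash entry.

```perl
use strict;
use warnings;

our $type;    # package variable, so the embedded code blocks can assign to it

my $copy_cmd = qr{
    \A \s*
    (?:
        copy \s+ (?<from> \S+ ) \s+        (?<to>   \S+ )  (?{ $type = 'std' })
      | dup  \s+ (?<to>   \S+ ) \s+ as \s+ (?<from> \S+ )  (?{ $type = 'rev' })
    )
    \s* \z
}x;

# Return a hash like the one shown above, or nothing if the text does not match.
sub parse_copy_cmd {
    my ($text) = @_;
    return unless $text =~ $copy_cmd;
    return { from => $+{from}, to => $+{to}, type => $type };
}

my $result = parse_copy_cmd('dup newfile as oldfile');
print "$result->{type}: $result->{from} -> $result->{to}\n";
```

       Each code block sits at the end of its alternative, so it only runs
       once the rest of that branch has matched.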

   Aliased literals
       As the previous example illustrates, it is inconveniently verbose to
       assign constants	via aliased code blocks. So Regexp::Grammars provides
       a short-cut. It is possible to directly alias a numeric literal or a
       single-quote delimited literal string, without putting either inside a
       code block. For example,	the previous example could also	be written:

	   <rule: copy_cmd>
	       copy <from=file>	       <to=file>  <type='std'>
	     | dup    <to=file>	 as  <from=file>  <type='rev'>
	     |	    <from=file>	 ->    <to=file>  <type= +1  >
	     |	      <to=file>	 <-  <from=file>  <type= -1  >

       Note that only these two	forms of literal are supported in this
       abbreviated syntax.

   Amnesiac subrule calls
       By default, every subrule call saves its	result into the	result-hash,
       either under its	own name, or under an alias.

       However,	sometimes you may want to refactor some	literal	part of	a rule
       into one	or more	subrules, without having those submatches added	to the
       result-hash. The syntax for calling a subrule, but ignoring its return
       value is:

	   <.SUBRULE>

       (which is stolen	directly from Perl 6).

       For example, you may prefer to rewrite a rule such as:

	   <rule: paren_pair>

	       \(
		   (?: <escape> | <paren_pair> | <brace_pair> | [^()] )*
	       \)

       without any literal matching, like so:

	   <rule: paren_pair>

	       <.left_paren>
		   (?: <escape> | <paren_pair> | <brace_pair> | <.non_paren> )*
	       <.right_paren>

	   <token: left_paren>	 \(
	   <token: right_paren>	 \)
	   <token: non_paren>	 [^()]

       Moreover, as the individual components inside the parentheses probably
       aren't being captured for any useful purpose either, you could further
       optimize that to:

	   <rule: paren_pair>

	       <.left_paren>
		   (?: <.escape> | <.paren_pair> | <.brace_pair> | <.non_paren> )*
	       <.right_paren>

       Note that you can also use the dot modifier on an aliased subpattern:

	   <.Alias= (SUBPATTERN) >

       This seemingly contradictory behaviour (of giving a subpattern a	name,
       then deliberately ignoring that name) actually does make	sense in one
       situation. Providing the	alias makes the	subpattern visible to the
       debugger, while using the dot stops it from affecting the result-hash.
       See "Debugging non-grammars" for	an example of this usage.

   Private subrule calls
       If a rule name (or an alias) begins with an underscore:

	   <_RULENAME>	     <_ALIAS=RULENAME>
	   <[_RULENAME]>     <[_ALIAS=RULENAME]>

       then matching proceeds as normal, and any result	that is	returned is
       stored in the current result-hash in the	usual way.

       However,	when any rule finishes (and just before	it returns) it first
       filters its result-hash,	removing any entries whose keys	begin with an
       underscore. This	means that any subrule with an underscored name	(or
       with an underscored alias) remembers its	result,	but only until the end
       of the current rule. Its results are effectively private to the current
       rule.

       This is especially useful in conjunction	with result distillation.
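       The filtering step is easy to picture in plain Perl. This is only a
       sketch of the behaviour described above, not the module's own code;
       strip_private_keys() is a hypothetical name, and the sample keys are
       made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: copy a result-hash, leaving out every key that
# starts with an underscore.
sub strip_private_keys {
    my ($result) = @_;
    my %public;
    for my $key (grep { !/\A_/ } keys %{$result}) {
        $public{$key} = $result->{$key};
    }
    return \%public;
}

my $filtered = strip_private_keys({ name => 'x', _ws => ' ', _Sep => ',' });
print join(', ', sort keys %{$filtered}), "\n";
```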

   Lookahead (zero-width) subrules
       Non-capturing subrule calls can be used in normal lookaheads:

	   <rule: qualified_typename>
	       # A valid typename and has a :: in it...
	       (?= <.typename> )  [^\s:]+ :: \S+

	   <rule: identifier>
	       # An alpha followed by alnums (but not a	valid typename)...
	       (?! <.typename> )    [^\W\d]\w*

       but the syntax is a little unwieldy. More importantly, an internal
       problem with backtracking causes	positive lookaheads to mess up the
       module's	named capturing	mechanism.

       So Regexp::Grammars provides two	shorthands:

	   <!typename>	      same as: (?! <.typename> )
	   <?typename>	      same as: (?= <.typename> ) ...but	works correctly!

       These two constructs can	also be	called with arguments, if necessary:

	   <rule: Command>
		   <!Terminator(:Keyword)>  <Args=(\S+)>

       Note that, as the above equivalences imply, neither of these forms of a
       subrule call ever captures what it matches.
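       For comparison, the negative-lookahead idiom can be written in core
       Perl with a "(?(DEFINE)...)" block. This is a sketch under
       assumptions: the list of type names is invented, and a "\z" has been
       added inside the lookahead so that only an exact type name is
       rejected (the rule shown earlier has no such anchor).

```perl
use strict;
use warnings;

my $identifier = qr{
    \A
    (?! (?&typename) \z )      # fail if the whole input is a type name
    [^\W\d] \w*                # an alpha followed by alnums
    \z

    (?(DEFINE)
        (?<typename> int | char | float )
    )
}x;

for my $word (qw( count int integer )) {
    printf "%-8s %s\n", $word, ($word =~ $identifier ? 'identifier' : 'rejected');
}
```

       As with "<!typename>", the named subpattern is consulted but nothing
       it matches is kept.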

   Matching separated lists
       One of the commonest tasks in text parsing is to	match a	list of
       unspecified length, in which items are separated	by a fixed token.
       Things like:

	   1, 2, 3 , 4 ,13, 91	      #	Numbers	separated by commas and	spaces

	   g-c-a-g-t-t-a-c-a	      #	DNA bases separated by dashes

	   /usr/local/bin	      #	Names separated	by directory markers

	   /usr:/usr/local:bin	      #	Directories separated by colons

       The usual construct required to parse these kinds of structures is

	   <rule: list>

	       <item> <separator> <list>     # recursive definition
	     | <item>			     # base case

       or, if you want to allow zero-or-more items instead of requiring
       one-or-more:

	   <rule: list_opt>
	       <list>?			     # entire list may be missing

	   <rule: list>			     # as before...
	       <item> <separator> <list>     #	 recursive definition
	     | <item>			     #	 base case

       Or, more	efficiently, but less prettily:

	   <rule: list>
	       <[item]>	(?: <separator>	<[item]> )*	      #	one-or-more

	   <rule: list_opt>
	       (?: <[item]> (?:	<separator> <[item]> )*	)?    #	zero-or-more

       Because separated lists are such	a common component of grammars,
       Regexp::Grammars	provides cleaner ways to specify them:

	   <rule: list>
	       <[item]>+ % <separator>	    # one-or-more

	   <rule: list_zom>
	       <[item]>* % <separator>	    # zero-or-more

       Note that these are just regular repetition qualifiers (i.e. "+" and
       "*") applied to a subrule ("<[item]>"), with a "%" modifier after them
       to specify the required separator between the repeated matches.

       The number of repetitions matched is controlled both by the nature of
       the qualifier ("+" vs "*") and by the subrule specified after the "%".
       The qualified subrule will be repeatedly	matched	for as long as its
       qualifier allows, provided that the second subrule also matches between
       those repetitions.

       For example, you	can match a parenthesized sequence of one-or-more
       numbers separated by commas, such as:

	   (1, 2, 3, 4, 13, 91)	       # Numbers separated by commas (and spaces)

       with:

	   <rule: number_list>

	       \(  <[number]>+ % <comma>  \)

	   <token: number>  \d+
	   <token: comma>   ,

       Note that any spaces around the commas will be ignored because
       "<number_list>" is specified as a rule and the "+%" specifier has
       spaces within and around	it. To disallow	spaces around the commas, make
       sure there are no spaces	in or around the "+%":

	   <rule: number_list_no_spaces>

	       \( <[number]>+%<comma> \)

       (or else	specify	the rule as a token instead).
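       To see what the whitespace-tolerant version has to accept, here is a
       rough core-Perl equivalent. It is a sketch only: $number_list and
       parse_number_list() are hypothetical names, and the explicit "\s*"
       stands in for the whitespace handling that a rule supplies
       automatically.

```perl
use strict;
use warnings;

# One-or-more numbers separated by commas, inside parentheses.
my $number_list = qr{
    \A \s* \( \s*
        ( \d+ (?: \s* , \s* \d+ )* )
    \s* \) \s* \z
}x;

# Return an array ref of the numbers, or nothing if the text does not match.
sub parse_number_list {
    my ($text) = @_;
    my ($body) = $text =~ $number_list or return;
    return [ split /\s*,\s*/, $body ];
}

my $numbers = parse_number_list('(1, 2, 3, 4, 13, 91)');
print join(' ', @{$numbers}), "\n";
```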

       Because the "%" is a modifier applied to	a qualifier, you can modify
       any other repetition qualifier in the same way. For example:

	   <[item]>{2,4} % <sep>   # two-to-four items,	separated

	   <[item]>{7}	 % <sep>   # exactly 7 items, separated

	   <[item]>{10,}? % <sep>   # ten or more items (as few as possible), separated

       You can even do this:

	   <[item]>? % <sep>	   # one-or-zero items,	(theoretically)	separated

       though the separator specification is, of course, meaningless in	that
       case as it will never be	needed to separate a maximum of	one item.
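       The arithmetic behind a bounded separated list is that N items need
       N-1 separators. The sketch below builds the equivalent ordinary regex
       by hand; separated_list() is a hypothetical helper, not part of the
       module, and it assumes that any maximum passed in is at least 1.

```perl
use strict;
use warnings;

# Hypothetical helper: emulate  ITEM{min,max} % SEP  with a plain regex.
# Pass undef as $max for "no upper limit".
sub separated_list {
    my ($item, $sep, $min, $max) = @_;

    # After the first item, the (SEP ITEM) group repeats min-1 to max-1 times
    my $more_min = $min > 1 ? $min - 1 : 0;
    my $more_max = defined $max ? $max - 1 : '';
    my $quant    = "{$more_min,$more_max}";

    my $one_or_more = qr/$item(?:$sep$item)$quant/;

    # A minimum of zero means the whole list may be absent
    return $min == 0 ? qr/(?:$one_or_more)?/ : $one_or_more;
}

my $two_to_four = separated_list(qr/\d+/, qr/\s*,\s*/, 2, 4);
print "matched\n" if '1, 2, 3' =~ /\A$two_to_four\z/;
```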

       If a "%"	appears	anywhere else in a grammar (i.e. not immediately after
       a repetition qualifier),	it is treated normally (i.e. as	a self-
       matching	literal	character):

	   <token: perl_hash>
	       % <ident>		# match	"%foo",	"%bar",	etc.

	   <token: perl_mod>
	       <expr> %	<expr>		# match	"$n % 2", "($n+3) % ($n-1)", etc.

       If you need to match a literal "%" immediately after a repetition,
       either quote it:

	   <token: percentage>
	       \d{1,3} \% solution		    # match "7%	solution", etc.

       or refactor the "%" character:

	   <token: percentage>
	       \d{1,3} <percent_sign> solution	    # match "7%	solution", etc.

	   <token: percent_sign>
	       %

       Note that it's usually necessary	to use the "<[...]>" form for the
       repeated	items being matched, so	that all of them are saved in the
       result hash. You	can also save all the separators (if they're
       important) by specifying	them as	a list-like subrule too:

	   \(  <[number]>* % <[comma]>	\)  # save numbers *and* separators

       The repeated item must be specified as a	subrule	call of	some kind
       (i.e. in	angles), but the separators may	be specified either as a
       subrule or as a raw bracketed pattern. For example:

	   <[number]>* % ( , | : )    #	Numbers	separated by commas or colons

	   <[number]>* % [,:]	      #	Same, but more efficiently matched

       The separator should always be specified	within matched delimiters of
       some kind: either matching "<...>" or matching "(...)" or matching
       "[...]".	Simple,	non-bracketed separators will sometimes	also work:

	   <[number]>+ % ,

       but not always:

	   <[number]>+ % ,\s+	  # Oops! Separator is just: ,

       This is because of the limited way in which the module internally
       parses ordinary regex components	(i.e. without full understanding of
       their implicit precedence). As a	consequence, consistently placing
       brackets	around any separator is	a much safer approach:

	   <[number]>+ % (,\s+)

       You can also use	a simple pattern on the	left of	the "%"	as the item
       matcher,	but in this case it must always	be aliased into	a list-
       collecting subrule, like	so:

	   <[item=(\d+)]>* % [,]

       Note that, for backwards	compatibility with earlier versions of
       Regexp::Grammars, the "+%" operator can also be written:	"**".
       However,	there can be no	space between the two asterisks	of this
       variant.	That is:

	   <[item]> ** <sep>	  # same as <[item]>+ % <sep>

	   <[item]>* * <sep>	  # error (two * qualifiers in a row)

   Matching hash keys
       In some situations a grammar may	need a rule that matches dozens,
       hundreds, or even thousands of one-word alternatives. For example, when
       matching	command	names, or valid	userids, or English words. In such
       cases it	is often impractical (and always inefficient) to list all the
       alternatives between "|" alternation operators:

	   <rule: shell_cmd>
	       a2p | ac	| apply	| ar | automake	| awk |	...
	       # ...and	400 lines later
	       ... | zdiff | zgrep | zip | zmore | zsh

	   <rule: valid_word>
	       a | aa |	aal | aalii | aam | aardvark | aardwolf	| aba |	...
	       # ...and	40,000 lines later...
	       ... | zymotize |	zymotoxic | zymurgy | zythem | zythum

       To simplify such	cases, Regexp::Grammars	provides a special construct
       that allows you to specify all the alternatives as the keys of a	normal
       hash. The syntax for that construct is simply to put the hash name
       inside angle brackets (with no space between the angles and the hash
       name).

       Which means that the rules in the previous example could also be
       written:

	   <rule: shell_cmd>
	       <%cmds>

	   <rule: valid_word>
	       <%dict>

       provided	that the two hashes (%cmds and %dict) are visible in the scope
       where the grammar is created.

       Matching	a hash key in this way is typically significantly faster than
       matching	a large	set of alternations. Specifically, it is O(length of
       longest potential key) ^	2, instead of O(number of keys).

       Internally, the construct is converted to something equivalent to:

	   <rule: shell_cmd>
	       (<.hk>)	<require: (?{ exists $cmds{$CAPTURE} })>

	   <rule: valid_word>
	       (<.hk>)	<require: (?{ exists $dict{$CAPTURE} })>

       The special "<hk>" rule is created automatically, and defaults to
       "\S+", but you can also define it explicitly to handle other kinds of
       keys. For example:

	   <rule: hk>
	       [^\n]+	     # Key may be any number of	chars on a single line

	   <rule: hk>
	       [ACGT]{10,}   # Key is a	base sequence of at least 10 pairs

       Alternatively, you can specify a	different key-matching pattern for
       each hash you're	matching, by placing the required pattern in braces
       immediately after the hash name.	For example:

	   <rule: client_name>
	       # Valid keys match <.hk>	(default or explicitly specified)

	   <rule: shell_cmd>
	       # Valid keys contain only word chars, hyphen, slash, or dot...
	       <%cmds {	[\w-/.]+ }>

	   <rule: valid_word>
	       # Valid keys contain only alphas or internal hyphen or apostrophe...
	       <%dict{ (?i: (?:[a-z]+[-'])* [a-z]+ ) }>

	   <rule: DNA_sequence>
	       # Valid keys are	base sequences of at least 10 pairs...

       This second approach to key-matching is preferred, because it localizes
       any non-standard	key-matching behaviour to each individual hash.
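       The "match a candidate, then require that it is a key" strategy
       described in this section can be imitated in core Perl with a
       conditional code block. This is a sketch only, with a made-up %cmds
       table; it is declared with "our" so that the embedded code block can
       see it, and "\S+" plays the part of the default "<hk>" pattern.

```perl
use strict;
use warnings;

our %cmds = map { $_ => 1 } qw( awk ls mv zsh );

my $shell_cmd = qr{
    \A
    ( \S+ )                              # candidate key
    (?(?{ !exists $cmds{$1} }) (?!) )    # fail unless the candidate is a hash key
    \z
}x;

for my $word (qw( mv rm zsh )) {
    printf "%-4s %s\n", $word, ($word =~ $shell_cmd ? 'known' : 'unknown');
}
```

       The regex itself never lists the alternatives; adding a command means
       adding a hash key.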

   Rematching subrule results
       Sometimes it is useful to be able to rematch a string that has
       previously been matched by some earlier subrule.	For example, consider
       a rule to match shell-like control blocks:

	   <rule: control_block>
		 for   <expr> <[command]>+ endfor
	       | while <expr> <[command]>+ endwhile
	       | if    <expr> <[command]>+ endif
	       | with  <expr> <[command]>+ endwith

       This would be much tidier if we could factor out	the command names
       (which are the only differences between the four	alternatives). The
       problem is that the obvious solution:

	   <rule: control_block>
	       <keyword> <expr>
		   <[command]>+
	       end<keyword>

       doesn't work, because it	would also match an incorrect input like:

	   for 1..10
	       echo $n
	       ls subdir/$n
	   endif

       We need some way	to ensure that the "<keyword>" matched immediately
       after "end" is the same "<keyword>" that	was initially matched.

       That's not difficult, because the first "<keyword>" will	have captured
       what it matched into $MATCH{keyword}, so we could just write:

	   <rule: control_block>
	       <keyword> <expr>
		   <[command]>+
	       end(??{quotemeta $MATCH{keyword}})

       This is such a useful technique,	yet so ugly, scary, and	prone to
       error, that Regexp::Grammars provides a cleaner equivalent:

	   <rule: control_block>
	       <keyword> <expr>
		   <[command]>+
	       end<\_keyword>

       A directive of the form "<\_IDENTIFIER>" is known as a "matchref" (an
       abbreviation of "%MATCH-supplied backreference"). Matchrefs always
       attempt to match, as a literal, the current value of
       $MATCH{IDENTIFIER}.
       By default, a matchref does not capture what it matches,	but you	can
       have it do so by	giving it an alias:

	   <token: delimited_string>
	       <ldelim=str_delim>  .*?	<rdelim=\_ldelim>

	   <token: str_delim> ["'`]

       At first	glance this doesn't seem very useful as, by definition,
       $MATCH{ldelim} and $MATCH{rdelim} must necessarily always end up	with
       identical values. However, it can be useful if the rule also has	other
       alternatives and	you want to create a consistent	internal
       representation for those alternatives, like so:

	   <token: delimited_string>
		 <ldelim=str_delim>  .*?  <rdelim=\_ldelim>
	       | <ldelim=( \[ )>     .*?  <rdelim=( \] )>
	       | <ldelim=( \{ )>     .*?  <rdelim=( \} )>
	       | <ldelim=( \( )>     .*?  <rdelim=( \) )>
	       | <ldelim=( \< )>     .*?  <rdelim=( \> )>

       You can also force a matchref to	save repeated matches as a nested
       array, in the usual way:

	   <token: marked_text>
	       <marker>	<text> <[endmarkers=\_marker]>+

       Be careful though, as the following will not do what you might expect:

	       <[marker]>+ <text> <[endmarkers=\_marker]>+

       Here the value of $MATCH{marker} is an array reference, which the
       matchref flattens and concatenates before matching the resulting
       string as a literal. The rule therefore matches endmarkers that are
       exact multiples of the complete start marker, rather than endmarkers
       consisting of any number of repetitions of the individual start
       marker delimiter. So it matches:

	       ""text here""
	       ""text here""""
	       ""text here""""""

       but not:

	       ""text here"""
	       ""text here"""""

       Uneven start and	end markers such as these are extremely	unusual, so
       this problem rarely arises in practice.

       Note: Prior to Regexp::Grammars version 1.020, the syntax for matchrefs
       was "<\IDENTIFIER>" instead of "<\_IDENTIFIER>". This created problems
       when the identifier started with any of "l", "u", "L", "U", "Q", or
       "E", so the syntax had to be altered in a backwards-incompatible
       way. It will not be altered again.

   Rematching balanced delimiters
       Consider the example in the previous section:

	   <token: delimited_string>
		 <ldelim=str_delim>  .*?  <rdelim=\_ldelim>
	       | <ldelim=( \[ )>     .*?  <rdelim=( \] )>
	       | <ldelim=( \{ )>     .*?  <rdelim=( \} )>
	       | <ldelim=( \( )>     .*?  <rdelim=( \) )>
	       | <ldelim=( \< )>     .*?  <rdelim=( \> )>

       The repeated pattern of the last four alternatives is galling, but we
       can't just refactor those delimiters as well:

	   <token: delimited_string>
		 <ldelim=str_delim>  .*?  <rdelim=\_ldelim>
	       | <ldelim=bracket>    .*?  <rdelim=\_ldelim>

       because that would incorrectly match:

	   { delimited content here {

       while failing to	match:

	   { delimited content here }

       To refactor balanced delimiters like those, we need a second kind of
       matchref; one that's a little smarter.

       Or, preferably, a lot smarter...because there are many other kinds of
       balanced	delimiters, apart from single brackets.	For example:

	     {{{ delimited content here	}}}
	      /* delimited content here	*/
	      (* delimited content here	*)
	      `` delimited content here	''
	      if delimited content here	fi

       The common characteristic of these delimiter pairs is that the closing
       delimiter is the	inverse	of the opening delimiter: the sequence of
       characters is reversed and certain characters (mainly brackets, but
       also single-quotes/backticks) are mirror-reflected.

       Regexp::Grammars supports the parsing of such delimiters with a
       construct known as an invertref, which is specified using the
       "</IDENT>" directive. An invertref acts very like a matchref, except
       that it does not convert to:

	   (??{ quotemeta( $MATCH{IDENT} ) })

       but rather to:

	   (??{ quotemeta( inverse( $MATCH{IDENT} ) ) })

       With this directive available, the balanced delimiters of the previous
       example can be refactored to:

	   <token: delimited_string>
		 <ldelim=str_delim>  .*?  <rdelim=\_ldelim>
	       | <ldelim=( [[{(<] )  .*?  <rdelim=/ldelim>

       Like matchrefs, invertrefs come in the usual range of flavours:

	   </ident>	       # Match the inverse of $MATCH{ident}
	   <ALIAS=/ident>      # Match inverse and capture to $MATCH{ALIAS}
	   <[ALIAS=/ident]>    # Match inverse and push on @{$MATCH{ALIAS}}

       The character pairs that are reversed during mirroring are: "{" and
       "}", "[" and "]", "(" and ")", "<" and ">", "«" and "»", "`" and "'".

       The following mnemonics may be useful in distinguishing invertrefs
       from matchrefs: a matchref starts with a "\" (just like the standard
       Perl regex backrefs "\1" and "\g{-2}" and "\k<name>"), whereas an
       invertref starts with a "/" (like an HTML or XML closing tag). Or just
       remember that "<\_IDENT>" is "match the same again", and if you want
       "the same again, only mirrored" instead, just mirror the "\" to get
       "</IDENT>".
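       The "reverse, then mirror" transformation is simple to sketch in
       plain Perl. The helper names below are hypothetical and this is not
       the module's internal inverse() routine; in particular, only the
       ASCII pairs are handled and guillemets are left out.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse the delimiter, then swap each bracket-like
# character for its mirror image.
sub inverse_delimiter {
    my ($delim) = @_;
    my $mirrored = reverse $delim;            # scalar context: reverses the string
    $mirrored =~ tr/{}[]()<>`'/}{][)(><'`/;
    return $mirrored;
}

# Build a regex matching OPEN ... CLOSE, where CLOSE is derived from OPEN.
sub delimited_regex {
    my ($open) = @_;
    my $close = inverse_delimiter($open);
    return qr/\A \Q$open\E (.*?) \Q$close\E \z/xs;
}

for my $open ('{{{', '/*', '(*', '``', 'if') {
    printf "%-4s closes with %s\n", $open, inverse_delimiter($open);
}
```

       Quoting both delimiters with "\Q...\E" keeps characters such as "*"
       and "(" literal, which is the same job "quotemeta" does in the
       expansion shown earlier.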

   Rematching parametric results and delimiters
       The "<\_IDENTIFIER>" and "</IDENTIFIER>" mechanisms normally locate the
       literal to be matched by looking in $MATCH{IDENTIFIER}.

       However,	you can	cause them to look in $ARG{IDENTIFIER} instead,	by
       prefixing the identifier	with a single ":". This	is especially useful
       when refactoring	subrules. For example, instead of:

	   <rule: Command>
	       <Keyword>  <CommandBody>	 end_ <\_Keyword>

	   <rule: Placeholder>
	       <Keyword>    \.\.\.   end_ <\_Keyword>

       you could parameterize the Terminator rule, like	so:

	   <rule: Command>
	       <Keyword>  <CommandBody>	 <Terminator(:Keyword)>

	   <rule: Placeholder>
	       <Keyword>    \.\.\.   <Terminator(:Keyword)>

	   <token: Terminator>
	       end_ <\:Keyword>

   Tracking and	reporting match	positions
       Regexp::Grammars	automatically predefines a special token that makes it
       easy to track exactly where in its input	a particular subrule matches.
       That token is: "<matchpos>".

       The "<matchpos>"	token implements a zero-width match that never fails.
       It always returns the current index within the string that the grammar
       is matching.

       So, for example you could have your "<delimited_text>" subrule detect
       and report unterminated text like so:

	   <token: delimited_text>
		     qq? <delim> <text=(.*?)> </delim>
	   |   <matchpos> qq? <delim>
	       <error: (?{"Unterminated string starting at index $MATCH{matchpos}"})>

       Matching	"<matchpos>" in	the second alternative causes $MATCH{matchpos}
       to contain the position in the string at	which the "<matchpos>" subrule
       was matched (in this example: the start of the unterminated text).

       If you want the line number instead of the string index,	use the
       predefined "<matchline>"	subrule	instead:

	   <token: delimited_text>
		     qq? <delim> <text=(.*?)> </delim>
	   |   <matchline> qq? <delim>
	       <error: (?{"Unterminated	string starting	at line	$MATCH{matchline}"})>

       Note that the line numbers returned by "<matchline>" start at 1 (not at
       zero, as	with "<matchpos>").

       The "<matchpos>"	and "<matchline>" subrules are just like any other
       subrules; you can alias them ("<started_at=matchpos>") or match them
       repeatedly ( "(?: <[matchline]> <[item]>	)++"), etc.
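       The two numbering conventions (zero-based index, one-based line) can
       be reproduced in core Perl from the match offset. This is a sketch
       only; locate_match() is a hypothetical helper and says nothing about
       how the module computes these values internally.

```perl
use strict;
use warnings;

# Hypothetical helper: report where a pattern first matches, as both a
# zero-based string index and a one-based line number.
sub locate_match {
    my ($text, $pattern) = @_;
    return unless $text =~ $pattern;

    my $index  = $-[0];                      # offset where the match began
    my $before = substr $text, 0, $index;
    my $line   = 1 + ($before =~ tr/\n//);   # newlines before that offset
    return ($index, $line);
}

my ($index, $line) = locate_match("alpha\nbeta\ngamma\n", qr/gamma/);
print "index $index, line $line\n";
```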

       The module also supports	event-based parsing. You can specify a grammar
       in the usual way	and then, for a	particular parse, layer	a collection
       of call-backs (known as "autoactions") over the grammar to handle the
       data as it is parsed.

       Normally, a grammar rule	returns	the result hash	it has accumulated (or
       whatever	else was aliased to "MATCH=" within the	rule). However,	you
       can specify an autoaction object	before the grammar is matched.

       Once the	autoaction object is specified,	every time a rule succeeds
       during the parse, its result is passed to the object via	one of its
       methods;	specifically it	is passed to the method	whose name is the same
       as the rule's.

       For example, suppose you had a grammar that recognizes simple algebraic
       expressions:

	   my $expr_parser = do{
	       use Regexp::Grammars;
	       qr{
		   <Answer>

		   <rule: Answer>     <[Operand=Mult]>+ % <[Op=(\+|\-)]>

		   <rule: Mult>	      <[Operand=Pow]>+	% <[Op=(\*|/|%)]>

		   <rule: Pow>	      <[Operand=Term]>+ % <Op=(\^)>

		   <rule: Term>		 <MATCH=Literal>
			      |	      \( <MATCH=Answer> \)

		   <token: Literal>   <MATCH=( [+-]? \d++ (?: \. \d++ )?+ )>
	       }xms
	   };

       You could convert this grammar to a calculator, by installing a set of
       autoactions that	convert	each rule's result hash	to the corresponding
       value of	the sub-expression that	the rule just parsed. To do that, you
       would create a class with methods whose names match the rules whose
       results you want to change. For example:

	   package Calculator;
	   use List::Util qw< reduce >;

	   sub new {
	       my ($class) = @_;

	       return bless {}, $class;
	   }

	   sub Answer {
	       my ($self, $result_hash) = @_;

	       my $sum = shift @{$result_hash->{Operand}};

	       for my $term (@{$result_hash->{Operand}}) {
		   my $op = shift @{$result_hash->{Op}};
		   if ($op eq '+') { $sum += $term; }
		   else		   { $sum -= $term; }
	       }

	       return $sum;
	   }

	   sub Mult {
	       my ($self, $result_hash) = @_;

	       return reduce { eval($a . shift(@{$result_hash->{Op}}) . $b) }
			     @{$result_hash->{Operand}};
	   }

	   sub Pow {
	       my ($self, $result_hash) = @_;

	       return reduce { $b ** $a } reverse @{$result_hash->{Operand}};
	   }

       Objects of this class (and indeed the class itself) now have methods
       corresponding to	some of	the rules in the expression grammar. To	apply
       those methods to	the results of the rules (as they parse) you simply
       install an object as the	"autoaction" handler, immediately before you
       initiate the parse:

	   if ($text =~ $expr_parser->with_actions(Calculator->new)) {
	       say $/{Answer};	 # Now prints the result of the expression
	   }

       The "with_actions()" method expects to be passed	an object or
       classname. This object or class will be installed as the	autoaction
       handler for the next match against any grammar. After that match, the
       handler will be uninstalled. "with_actions()" returns the grammar it's
       called on, making it easy to call it as part of a match (which is the
       recommended idiom).

       With a "Calculator" object set as the autoaction	handler, whenever the
       "Answer", "Mult", or "Pow" rule of the grammar matches, the
       corresponding "Answer", "Mult", or "Pow"	method of the "Calculator"
       object will be called (with the rule's result value passed as its only
       argument), and the result of the	method will be used as the result of
       the rule.

       Note that nothing new happens when a "Term" or "Literal"	rule matches,
       because the "Calculator"	object doesn't have methods with those names.

       The overall effect, then, is to allow you to specify a grammar without
       rule-specific behaviours and then, later, specify a set of final
       actions (as methods) for	some or	all of the rules of the	grammar.

       Note that, if a particular callback method returns "undef", the result
       of the corresponding rule will be passed	through	without	modification.
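       The dispatch rules just described (call the method named after the
       rule if there is one, and keep the original result when the method is
       missing or returns "undef") can be summarized in a few lines of plain
       Perl. This is a conceptual sketch, not the module's implementation;
       apply_autoaction() and the Doubler class are hypothetical.

```perl
use strict;
use warnings;

{
    package Doubler;                      # hypothetical handler class
    sub Number { my ($self, $result) = @_; return 2 * $result }
    sub Name   { return }                 # returns undef: leave the result alone
}

# Hypothetical dispatcher: what happens, conceptually, each time a rule succeeds.
sub apply_autoaction {
    my ($handler, $rule_name, $result) = @_;

    my $method = $handler->can($rule_name)
        or return $result;                # no such method: result unchanged

    my $replacement = $handler->$method($result);
    return defined $replacement ? $replacement : $result;
}

print apply_autoaction('Doubler', 'Number', 21), "\n";
print apply_autoaction('Doubler', 'Name', 'unchanged'), "\n";
```

       Because can() works on both objects and class names, the same
       dispatcher covers either kind of handler.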

Named grammars
       All the grammars	shown so far are confined to a single regex. However,
       Regexp::Grammars also provides a mechanism that allows you to define
       named grammars, which can then be imported into other regexes. This
       gives you a way of modularizing common grammatical components.

   Defining a named grammar
       You can create a	named grammar using the	"<grammar:...>"	directive.
       This directive must appear before the first rule	definition in the
       grammar, and instead of any start-rule. For example:

	   qr{
	       <grammar: List::Generic>

	       <rule: List>
		   <[MATCH=Item]>+ % <Separator>

	       <rule: Item>
		   \S++

	       <token: Separator>
		   \s* , \s*
	   }x;

       This creates a grammar named "List::Generic", and installs it in	the
       module's	internal caches, for future reference.

       Note that there is no need (or reason) to assign	the resulting regex to
       a variable, as the named	grammar	cannot itself be matched against.

   Using a named grammar
       To make use of a	named grammar, you need	to incorporate it into another
       grammar,	by inheritance.	To do that, use	the "<extends:...>" directive,
       like so:

	   my $parser = qr{
	       <extends: List::Generic>

	       <List>
	   }x;

       The "<extends:...>" directive incorporates the rules defined in the
       specified grammar into the current regex. You can then call any of
       those rules in the start-pattern.

   Overriding an inherited rule	or token
       Subrule dispatch	within a grammar is always polymorphic.	That is, when
       a subrule is called, the	most-derived rule of the same name within the
       grammar's hierarchy is invoked.

       So, to replace a particular rule within a grammar, you simply need to
       inherit that grammar and specify new, more-specific versions of any
       rules you want to change. For example:

	   my $list_of_integers = qr{
	       <List>

	       # Inherit rules from base grammar...
	       <extends: List::Generic>

	       # Replace Item rule from List::Generic...
	       <rule: Item>
		   [+-]? \d++
	   }x;

       You can also use "<extends:...>" in other named grammars, to create
       grammar hierarchies:

	   qr{
	       <grammar: List::Integral>
	       <extends: List::Generic>

	       <token: Item>
		   [+-]? <MATCH=(<.Digit>+)>

	       <token: Digit>
		   \d
	   }x;

	   qr{
	       <grammar: List::ColonSeparated>
	       <extends: List::Generic>

	       <token: Separator>
		   \s* : \s*
	   }x;

	   qr{
	       <grammar: List::Integral::ColonSeparated>
	       <extends: List::Integral>
	       <extends: List::ColonSeparated>
	   }x;

       As shown	in the previous	example, Regexp::Grammars allows you to
       multiply	inherit	two (or	more) base grammars. For example, the
       "List::Integral::ColonSeparated"	grammar	takes the definitions of
       "List" and "Item" from the "List::Integral" grammar, and	the definition
       of "Separator" from "List::ColonSeparated".

       Note that grammars dispatch subrule calls using C3 method lookup,
       rather than Perl's older	DFS lookup. That's why
       "List::Integral::ColonSeparated"	correctly gets the more-specific
       "Separator" rule	defined	in "List::ColonSeparated", rather than the
       more-generic version defined in "List::Generic" (via "List::Integral").
       See "perldoc mro" for more discussion of	the C3 dispatch	algorithm.
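       The difference between the two lookup orders is easy to observe with
       ordinary Perl classes. The sketch below is an analogy only: the
       packages are hypothetical stand-ins that mirror the diamond-shaped
       grammar hierarchy above, with methods in place of rules.

```perl
use strict;
use warnings;
use mro;

{ package Generic;        sub Item { 'generic item' } sub Separator { 'comma' } }
{ package Integral;       our @ISA = ('Generic'); sub Item      { 'integer' } }
{ package ColonSeparated; our @ISA = ('Generic'); sub Separator { 'colon'   } }

# Same parents, different method resolution order
{ package DfsList; our @ISA = ('Integral', 'ColonSeparated'); }
{ package C3List;  use mro 'c3'; our @ISA = ('Integral', 'ColonSeparated'); }

print 'DFS finds: ', DfsList->Separator, "\n";   # reaches Generic via Integral first
print 'C3 finds:  ', C3List->Separator,  "\n";   # checks ColonSeparated before Generic
print 'C3 order:  ', join(' ', @{ mro::get_linear_isa('C3List') }), "\n";
```

       Under depth-first lookup the shared base class is reached before the
       second parent, so its generic method wins; C3 defers the shared base
       until every class derived from it has been checked.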

   Augmenting an inherited rule	or token
       Instead of replacing an inherited rule, you can augment it.

       For example, if you need a grammar for lists of hexadecimal numbers,
       you could inherit the behaviour of "List::Integral" and add the hex
       digits to its "Digit" token:

	   my $list_of_hexadecimal = qr{
	       <List>

	       <extends: List::Integral>

	       <token: Digit>
		   <List::Integral::Digit>
		 | [A-Fa-f]
	   }x;

       If you call a subrule using a fully qualified name (such	as
       "<List::Integral::Digit>"), the grammar calls that version of the rule,
       rather than the most-derived version.

   Debugging named grammars
       Named grammars are independent of each other, even when inherited. This
       means that, if debugging	is enabled in a	derived	grammar, it will not
       be active in any	rules inherited	from a base grammar, unless the	base
       grammar also included a "<debug:...>" directive.

       This is a deliberate design decision, as	activating the debugger	adds a
       significant amount of code to each grammar's implementation, which is
       detrimental to the matching performance of the resulting	regexes.

       If you need to debug a named grammar, the best approach is to include a
       "<debug:	same>" directive at the	start of the grammar. The presence of
       this directive will ensure the necessary	extra debugging	code is
       included	in the regex implementing the grammar, while setting "same"
       mode will ensure	that the debugging mode	isn't altered when the matcher
       uses the	inherited rules.

Common parsing techniques
   Result distillation
       Normally, calls to subrules produce nested result-hashes	within the
       current result-hash. Those nested hashes	always have at least one
       automatically supplied key (""),	whose value is the entire substring
       that the	subrule	matched.

       If there	are no other nested captures within the	subrule, there will be
       no other	keys in	the result-hash. This would be annoying	as a typical
       nested grammar would then produce results consisting of hashes of
       hashes, with each nested	hash having only a single key (""). This in
       turn would make postprocessing the result-hash (in "%/")	far more
       complicated than	it needs to be.

       To avoid	this behaviour,	if a subrule's result-hash doesn't contain any
       keys except "", the module "flattens" the result-hash, by replacing it
       with the	value of its single key.

       So, for example,	the grammar:

	   mv \s* <from> \s* <to>

	   <rule: from>	  [\w/.-]+
	   <rule: to>	  [\w/.-]+

       doesn't return a result-hash like this:

           {
               ""     => 'mv /usr/local/lib/libhuh.dylib  /dev/null/badlib',
               'from' => { "" => '/usr/local/lib/libhuh.dylib' },
               'to'   => { "" => '/dev/null/badlib'            },
           }

       Instead, it returns:

           {
               ""     => 'mv /usr/local/lib/libhuh.dylib  /dev/null/badlib',
               'from' => '/usr/local/lib/libhuh.dylib',
               'to'   => '/dev/null/badlib',
           }

       That is,	because	the 'from' and 'to' subhashes each have	only a single
       entry, they are each "flattened"	to the value of	that entry.

       This flattening also occurs if a	result-hash contains only "private"
       keys (i.e. keys starting	with underscores). For example:

	   mv \s* <from> \s* <to>

	   <rule: from>	  <_dir=path>? <_file=filename>
	   <rule: to>	  <_dir=path>? <_file=filename>

	   <token: path>      [\w/.-]*/
	   <token: filename>  [\w.-]+

       Here, the "from"	rule produces a	result like this:

           from => {
                 "" => '/usr/local/bin/perl',
               _dir => '/usr/local/bin/',
              _file => 'perl',
           }

       which is	automatically stripped of "private" keys, leaving:

           from => {
                 "" => '/usr/local/bin/perl',
           }

       which is	then automatically flattened to:

	   from	=> '/usr/local/bin/perl'
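
       The strip-then-flatten rule can be mimicked in plain Perl. This is a
       minimal sketch on a hand-built hash, not the module's implementation:
       the "distill()" helper is hypothetical, and it assumes that private
       keys are always discarded before the flattening test is applied.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the distillation rule described above to a
# hand-built result-hash. (Simplification: private keys are always dropped.)
sub distill {
    my ($result) = @_;
    return $result if ref $result ne 'HASH';

    # Discard "private" keys, i.e. those starting with an underscore...
    my %public = map  { $_ => $result->{$_} }
                 grep { !/\A_/ }
                 keys %{$result};

    # Flatten to the matched text if "" is the only key left...
    return $public{""} if exists $public{""} && scalar(keys %public) == 1;

    return \%public;
}

my $from = distill({
    ""    => '/usr/local/bin/perl',
    _dir  => '/usr/local/bin/',
    _file => 'perl',
});

print "from => '$from'\n";
```

       A hash that still has a public key after stripping (for example
       'dir' instead of '_dir') would be returned as a hash reference
       rather than flattened.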

       List result distillation

       A special case of result distillation occurs in a separated list, such
       as:

	   <rule: List>

	       <[Item]>+ % <[Sep=(,)]>

       If this construct matches just a	single item, the result	hash will
       contain a single	entry consisting of a nested array with	a single
       value, like so:

	   { Item => [ 'data' ]	}

       Instead of returning this annoyingly nested data	structure, you can
       tell Regexp::Grammars to	flatten	it to just the inner data with a
       special directive:

           <rule: List>

               <[Item]>+ % <[Sep=(,)]>

               <minimize:>

       The "<minimize:>" directive examines the	result hash (i.e.  %MATCH). If
       that hash contains only a single	entry, which is	a reference to an
       array with a single value, then the directive assigns that single value
       directly	to $MATCH, so that it will be returned instead of the usual
       result hash.

       This means that a normal	separated list still results in	a hash
       containing all elements and separators, but a "degenerate" list of only
       one item	results	in just	that single item.
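
       The test that "<minimize:>" is described as performing can be sketched
       in plain Perl on a hand-built hash. The "minimize()" helper below is
       hypothetical, and it assumes that the automatic "" key is not counted
       as an entry; the module's actual behaviour may differ in such details.

```perl
use strict;
use warnings;

# Hypothetical helper: the test described above, applied to a hand-built
# hash. (Assumption: the "" key holding the matched text is not counted.)
sub minimize {
    my ($match) = @_;
    my @entries = map { $match->{$_} } grep { $_ ne "" } keys %{$match};

    # Exactly one entry, which is an array holding exactly one value...
    if (@entries == 1 && ref $entries[0] eq 'ARRAY' && @{ $entries[0] } == 1) {
        return $entries[0][0];
    }
    return $match;    # otherwise keep the usual result hash
}

my $single = minimize({ Item => ['data'] });
my $many   = minimize({ Item => ['a', 'b'], Sep => [','] });

print "single item: $single\n";
print 'several items: ', ref $many, "\n";
```

       The single-item list collapses to the string 'data', while the longer
       list keeps its hash of items and separators.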

       Manual result distillation

       Regexp::Grammars	also offers full manual	control	over the distillation
       process.	If you use the reserved	word "MATCH" as	the alias for a
       subrule call:

           <MATCH=rule_name>

       or a subpattern match:

	   <MATCH=( \w+	)>

       or a code block:

	   <MATCH=(?{ 42 })>

       then the	current	rule will treat	the return value of that subrule,
       pattern,	or code	block as its complete result, and return that value
       instead of the usual result-hash	it constructs. This is the case	even
       if the result has other entries that would normally also	be returned.

       For example, in a rule like:

           <rule: term>
                 <MATCH=literal>
               | <left_paren> <MATCH=expr> <right_paren>

       The use of "MATCH" aliases causes the rule to return either whatever
       "<literal>" returns, or whatever	"<expr>" returns (provided it's
       between left and	right parentheses).

       Note that, in this second case, even though "<left_paren>" and
       "<right_paren>" are captured to the result-hash,	they are not returned,
       because the "MATCH" alias overrides the normal "return the result-hash"
       semantics and returns only what its associated subrule (i.e. "<expr>")
       returns.

       Programmatic result distillation

       It's also possible to control what a rule returns from within a code
       block.  Regexp::Grammars	provides a set of reserved variables that give
       direct access to	the result-hash.

       The result-hash itself can be accessed as %MATCH	within any code	block
       inside a	rule. For example:

	   <rule: sum>
	       <X=product> \+ <Y=product>
		   <MATCH=(?{ $MATCH{X}	+ $MATCH{Y} })>

       Here, the rule matches a	product	(aliased 'X' in	the result-hash), then
       a literal '+', then another product (aliased to 'Y' in the result-
       hash). The rule then executes the code block, which accesses the	two
       saved values (as	$MATCH{X} and $MATCH{Y}), adding them together.
       Because the block is itself aliased to "MATCH", the sum produced	by the
       block becomes the (only)	result of the rule.

       It is also possible to set the rule result from within a	code block
       (instead	of aliasing it). The special "override"	return value is
       represented by the special variable $MATCH. So the previous example
       could be	rewritten:

	   <rule: sum>
	       <X=product> \+ <Y=product>
		   (?{ $MATCH =	$MATCH{X} + $MATCH{Y} })

       Both forms are identical	in effect. Any assignment to $MATCH overrides
       the normal "return all subrule results" behaviour.

       Assigning to $MATCH directly is particularly handy if the result	may
       not always be "distillable", for	example:

           <rule: sum>
               <X=product> \+ <Y=product>
                   (?{ if (!ref $MATCH{X} && !ref $MATCH{Y}) {
                           # Reduce to sum, if both terms are simple scalars...
                           $MATCH = $MATCH{X} + $MATCH{Y};
                       }
                       else {
                           # Return full syntax tree for non-simple case...
                           $MATCH{op} = '+';
                       }
                   })

       Note that you can also partially	override the subrule return behaviour.
       Normally, the subrule returns the complete text it matched as its
       context substring (i.e. under the "empty	key") in its result-hash. That
       is, of course, $MATCH{""}, so you can override just that	behaviour by
       directly	assigning to that entry.

       For example, if you have	a rule that matches key/value pairs from a
       configuration file, you might prefer that any trailing comments not be
       included	in the "matched	text" entry of the rule's result-hash. You
       could hide such comments	like so:

           <rule: config_line>
               <key> : <value>  <comment>?
                   (?{
                       # Edit trailing comments out of "matched text" entry...
                       $MATCH{""} = "$MATCH{key} : $MATCH{value}";
                   })

       Some more examples of the uses of $MATCH:

	   <rule: FuncDecl>
	     # Keyword	Name		   Keep	return the name	(as a string)...
	       func	<Identifier> ;	   (?{ $MATCH =	$MATCH{'Identifier'} })

           <rule: NumList>
             # Numbers in square brackets...
               \[
                   ( \d+ (?: , \d+)* )
               \]

             # Return only the numbers...
               (?{ $MATCH = $CAPTURE })

	   <token: Cmd>
	     # Match standard variants then standardize	the keyword...
	       (?: mv |	move | rename )	     (?{ $MATCH	= 'mv';	})

   Parse-time data processing
       Using code blocks in rules, it's	often possible to fully	process	data
       as you parse it.	For example, the "<sum>" rule shown in the previous
       section might be	part of	a simple calculator, implemented entirely in a
       single grammar. Such a calculator might look like this:

           my $calculator = do{
               use Regexp::Grammars;
               qr{
                   <Answer>

                   <rule: Answer>
                       ( <.Mult>+ % <.Op=([+-])> )
                           <MATCH= (?{ eval $CAPTURE })>

                   <rule: Mult>
                       ( <.Pow>+ % <.Op=([*/%])> )
                           <MATCH= (?{ eval $CAPTURE })>

                   <rule: Pow>
                       <X=Term> \^ <Y=Pow>
                           <MATCH= (?{ $MATCH{X} ** $MATCH{Y}; })>
                     |
                           <MATCH=Term>

                   <rule: Term>
                           <MATCH=Literal>
                     | \(  <MATCH=Answer>  \)

                   <token: Literal>
                           <MATCH= ( [+-]? \d++ (?: \. \d++ )?+ )>
               }xms
           };

           while (my $input = <>) {
               if ($input =~ $calculator) {
                   say "--> $/{Answer}";
               }
           }

       Because every rule computes a value using the results of	the subrules
       below it, and aliases that result to its	"MATCH", each rule returns a
       complete	evaluation of the subexpression	it matches, passing that back
       to higher-level rules, which then do the	same.

       Hence, the result returned to the very top-level	rule (i.e. to
       "<Answer>") is the complete evaluation of the entire expression that
       was matched. That means that, in	the very process of having matched a
       valid expression, the calculator	has also computed the value of that
       expression, which can then simply be printed directly.

       It is often possible to have a grammar fully (or	sometimes at least
       partially) evaluate or transform	the data it is parsing,	and this
       usually leads to	very efficient and easy-to-maintain implementations.

       The main	limitation of this technique is	that the data has to be	in a
       well-structured form, where subsets of the data can be evaluated	using
       only local information. In cases	where the meaning of the data is
       distributed through that	data non-hierarchically, or relies on global
       state, or on external information, it is	often better to	have the
       grammar simply construct	a complete syntax tree for the data first, and
       then evaluate that syntax tree separately, after	parsing	is complete.
       The following section describes a feature of Regexp::Grammars that can
       make this second style of data processing simpler.

   Object-oriented parsing
       When a grammar has parsed successfully, the "%/"	variable will contain
       a series	of nested hashes (and possibly arrays) representing the
       hierarchical structure of the parsed data.

       Typically, the next step	is to walk that	tree, extracting or converting
       or otherwise processing that information. If the	tree has nodes of many
       different types,	it can be difficult to build a recursive subroutine
       that can	navigate it easily.

       A much cleaner solution is possible if the nodes	of the tree are	proper
       objects.	 In that case, you just	define a "process()" or	"traverse()"
       method for each of the classes, and have every node call that method on
       each of its children. For example, if the parser	were to	return a tree
       of nodes	representing the contents of a LaTeX file, then	you could
       define the following methods:

           sub Latex::file::explain
           {
               my ($self, $level) = @_;
               for my $element (@{$self->{element}}) {
                   $element->explain($level);
               }
           }

           sub Latex::element::explain {
               my ($self, $level) = @_;
               (  $self->{command} || $self->{literal})->explain($level)
           }

           sub Latex::command::explain {
               my ($self, $level) = @_;
               say "\t"x$level, "Command:";
               say "\t"x($level+1), "Name: $self->{name}";
               if ($self->{options}) {
                   say "\t"x$level, "\tOptions:";
                   $self->{options}->explain($level+2)
               }

               for my $arg (@{$self->{arg}}) {
                   say "\t"x$level, "\tArg:";
                   $arg->explain($level+2)
               }
           }

           sub Latex::options::explain {
               my ($self, $level) = @_;
               $_->explain($level) foreach @{$self->{option}};
           }

           sub Latex::literal::explain {
               my ($self, $level, $label) = @_;
               $label //= 'Literal';
               say "\t"x$level, "$label: ", $self->{q{}};
           }

       and then	simply write:

           if ($text =~ $LaTeX_parser) {
               $/{LaTeX_file}->explain();
           }

       and the chain of	"explain()" calls would	cascade	down the nodes of the
       tree, each one invoking the appropriate "explain()" method according to
       the type	of node	encountered.

       The only	problem	is that, by default, Regexp::Grammars returns a	tree
       of plain-old hashes, not	LaTeX::Whatever	objects. Fortunately, it's
       easy to request that the	result hashes be automatically blessed into
       the appropriate classes, using the "<objrule:...>" and "<objtoken:...>"
       directives.

       These directives	are identical to the "<rule:...>" and "<token:...>"
       directives (respectively), except that the rule or token	they create
       will also convert the hash it normally returns into an object of	a
       specified class. This conversion is done by passing the result hash to
       the class's constructor:

           $class->new(\%result_hash)

       if the class has a constructor method named "new()", or else (if the
       class doesn't provide a constructor) by directly blessing the result
       hash:

           bless \%result_hash, $class

       Note that, even if the object is constructed via its own constructor, the
       module still expects the	new object to be hash-based, and will fail if
       the object is anything but a blessed hash. The module issues an error
       in this case.
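
       The conversion rule just described can be sketched in plain Perl. The
       "objectify()" helper and the two My:: classes are hypothetical and
       exist only for illustration; the sketch shows the "constructor if
       available, otherwise bless" choice and the blessed-hash requirement,
       not the module's actual code.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed reftype);

# Hypothetical helper illustrating the conversion rule described above.
sub objectify {
    my ($class, $result_hash) = @_;

    # Use the class's own constructor if it has one, else bless directly...
    my $object = $class->can('new') ? $class->new($result_hash)
               :                      bless($result_hash, $class);

    # Either way, the outcome must be a blessed hash...
    die "$class did not produce a hash-based object\n"
        if !blessed($object) || reftype($object) ne 'HASH';

    return $object;
}

# Two hypothetical node classes: one with a constructor, one without.
{
    package My::Element;
    sub new {
        my ($class, $hash) = @_;
        my %copy = %{$hash};
        return bless \%copy, $class;
    }
}
{
    package My::Literal;
    sub text { my ($self) = @_; return $self->{q{}} }
}

my $element = objectify('My::Element', { q{} => 'section' });
my $literal = objectify('My::Literal', { q{} => 'Intro'   });

print ref($element), "\n";
print ref($literal), ': ', $literal->text(), "\n";
```

       Both calls yield hash-based objects, so either kind of class can be
       named in an object rule.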

       The generic syntax for these types of rules and tokens is:

	   <objrule:  CLASS::NAME = RULENAME  >
	   <objtoken: CLASS::NAME = TOKENNAME >

       For example:

	   <objrule: LaTeX::Element=component>
	       # ...Defines a rule that	can be called as <component>
	       # ...and	which returns a	hash-based LaTeX::Element object

           <objtoken: LaTeX::Literal=atom>
	       # ...Defines a token that can be	called as <atom>
	       # ...and	which returns a	hash-based LaTeX::Literal object

       Note that, just as in aliased subrule calls, the	name by	which
       something is referred to	outside	the grammar (in	this case, the class
       name) comes before the "=", whereas the name that it is referred	to
       inside the grammar comes	after the "=".

       You can freely mix object-returning and plain-old-hash-returning	rules
       and tokens within a single grammar, though you have to be careful not
       to subsequently try to call a method on any of the unblessed nodes.

       An important caveat regarding OO	rules

       Prior to	Perl 5.14.0, Perl's regex engine was not fully re-entrant.
       This means that in older	versions of Perl, it is	not possible to	re-
       invoke the regex	engine when already inside the regex engine.

       This means that you need	to be careful that the "new()" constructors
       that are	called by your object-rules do not themselves use regexes in
       any way,	unless you're running under Perl 5.14 or later (in which case
       you can ignore what follows).

       The two ways this is most likely	to happen are:

       1.  If you're using a class built on Moose, where one or more of the
           "has" declarations uses a type constraint (such as 'Int') that is
           implemented via regex matching. For example:

	       has 'id'	=> (is => 'rw',	isa => 'Int');

	   The workaround (for pre-5.14	Perls) is to replace the type
	   constraint with one that doesn't use	a regex. For example:

	       has 'id'	=> (is => 'rw',	isa => 'Num');

	   Alternatively, you could define your	own type constraint that
	   avoids regexes:

	       use Moose::Util::TypeConstraints;

	       subtype 'Non::Regex::Int',
		    as 'Num',
		 where { int($_) == $_ };

	       no Moose::Util::TypeConstraints;

	       # and later...

	       has 'id'	=> (is => 'rw',	isa => 'Non::Regex::Int');

       2.  If your class uses an "AUTOLOAD()" method to	implement its
	   constructor and that	method uses the	typical:

	       $AUTOLOAD =~ s/.*://;

	   technique. The workaround here is to	achieve	the same effect
	   without a regex. For	example:

	       my $last_colon_pos = rindex($AUTOLOAD, ':');
	       substr $AUTOLOAD, 0, $last_colon_pos+1, q{};

       Note that this caveat against using nested regexes also applies to any
       code blocks executed inside a rule or token (whether or not those rules
       or tokens are object-oriented).
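
       The regex-free technique from the second workaround can be checked
       against the usual substitution in isolation. This is a small sketch:
       the "unqualified_name()" helper is hypothetical, and the substitution
       is used here only outside any grammar, for comparison.

```perl
use strict;
use warnings;

# Strip the package qualifier without invoking the regex engine.
sub unqualified_name {
    my ($name) = @_;
    my $last_colon_pos = rindex($name, ':');     # -1 if there is no colon
    substr $name, 0, $last_colon_pos + 1, q{};   # remove everything up to it
    return $name;
}

for my $qualified ('LaTeX::Element::new', 'new') {
    # The regex version, for comparison only...
    (my $via_regex = $qualified) =~ s/.*://;

    printf "%-20s => %s (regex gives %s)\n",
        $qualified, unqualified_name($qualified), $via_regex;
}
```

       When the name contains no colon, "rindex()" returns -1, so the
       "substr" call removes nothing and the name is returned unchanged,
       just as with the substitution.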

       A naming	shortcut

       If an "<objrule:...>" or	"<objtoken:...>" is defined with a class name
       that is not followed by "=" and a rule name, then the rule name is
       determined automatically	from the classname.  Specifically, the final
       component of the	classname (i.e.	after the last "::", if	any) is	used.

       For example:

	   <objrule: LaTeX::Element>
	       # ...Defines a rule that	can be called as <Element>
	       # ...and	which returns a	hash-based LaTeX::Element object

           <objtoken: LaTeX::Literal>
	       # ...Defines a token that can be	called as <Literal>
	       # ...and	which returns a	hash-based LaTeX::Literal object

	   <objtoken: Comment>
	       # ...Defines a token that can be	called as <Comment>
	       # ...and	which returns a	hash-based Comment object

Debugging
       Regexp::Grammars provides a number of features specifically designed to
       help debug both grammars and the data they parse.

       All debugging messages are written to a log file	(which,	by default, is
       just STDERR). However, you can specify a	disk file explicitly by
       placing a "<logfile:...>" directive at the start	of your	grammar:

	   $grammar = qr{

	       <logfile: LaTeX_parser_log >

	       \A <LaTeX_file> \Z    # Pattern to match

               <rule: LaTeX_file>
                   # etc.
           }x;

       You can also explicitly specify that messages go	to the terminal:

	       <logfile: - >

   Debugging grammar creation with "<logfile:...>"
       Whenever	a log file has been directly specified,	Regexp::Grammars
       automatically does verbose static analysis of your grammar.  That is,
       whenever	it compiles a grammar containing an explicit "<logfile:...>"
       directive it logs a series of messages explaining how it	has
       interpreted the various components of that grammar. For example,	the
       following grammar:

           <logfile: parser_log >

           <cmd>

           <rule: cmd>
               mv <from=file> <to=file>
             | cp <source> <[file]>  <.comment>?

       would produce the following analysis in the 'parser_log'	file:

	   info	| Processing the main regex before any rule definitions
		|    |
		|    |...Treating <cmd>	as:
		|    |	    |  match the subrule <cmd>
		|    |	     \ saving the match	in $MATCH{'cmd'}
		|    |
		|     \___End of main regex
	   info	| Defining a rule: <cmd>
		|    |...Returns: a hash
		|    |
		|    |...Treating ' mv ' as:
		|    |	     \ normal Perl regex syntax
		|    |
		|    |...Treating <from=file> as:
		|    |	    |  match the subrule <file>
		|    |	     \ saving the match	in $MATCH{'from'}
		|    |
		|    |...Treating <to=file> as:
		|    |	    |  match the subrule <file>
		|    |	     \ saving the match	in $MATCH{'to'}
		|    |
		|    |...Treating ' | cp ' as:
		|    |	     \ normal Perl regex syntax
		|    |
		|    |...Treating <source> as:
		|    |	    |  match the subrule <source>
		|    |	     \ saving the match	in $MATCH{'source'}
		|    |
		|    |...Treating <[file]> as:
		|    |	    |  match the subrule <file>
		|    |	     \ appending the match to $MATCH{'file'}
		|    |
		|    |...Treating <.comment>? as:
		|    |	    |  match the subrule <comment> if possible
		|    |	     \ but don't save anything
		|    |
		|     \___End of rule definition

       This kind of static analysis is a useful	starting point in debugging a
       miscreant grammar, because it enables you to see	what you actually
       specified (as opposed to	what you thought you'd specified).

   Debugging grammar execution with "<debug:...>"
       Regexp::Grammars	also provides a	simple interactive debugger, with
       which you can observe the process of parsing and	the data being
       collected in any	result-hash.

       To initiate debugging, place a "<debug:...>" directive anywhere in your
       grammar.	When parsing reaches that directive the	debugger will be
       activated, and the command specified in the directive immediately
       executed. The available commands	are:

	   <debug: on>	  - Enable debugging, stop when	a rule matches
	   <debug: match> - Enable debugging, stop when	a rule matches
	   <debug: try>	  - Enable debugging, stop when	a rule is tried
	   <debug: run>	  - Enable debugging, run until	the match completes
	   <debug: same>  - Continue debugging (or not)	as currently
	   <debug: off>	  - Disable debugging and continue parsing silently

	   <debug: continue> - Synonym for <debug: run>
	   <debug: step>     - Synonym for <debug: try>

       These directives	can be placed anywhere within a	grammar	and take
       effect when that	point is reached in the	parsing. Hence,	adding a
       "<debug:step>" directive	is very	much like setting a breakpoint at that
       point in	the grammar. Indeed, a common debugging	strategy is to turn
       debugging on and	off only around	a suspect part of the grammar:

           <rule: tricky>   # This is where we think the problem is...
               <debug:step>
               <preamble> <text> <postscript>
               <debug:off>

       Once the	debugger is active, it steps through the parse,	reporting
       rules that are tried, matches and failures, backtracking	and restarts,
       and the parser's	location within	both the grammar and the text being
       matched.	That report looks like this:

	   ===============> Trying <grammar> from position 0
	   > cp	file1 file2 |...Trying <cmd>
			    |	|...Trying <cmd=(cp)>
			    |	|    \FAIL <cmd=(cp)>
			    |	 \FAIL <cmd>
			     \FAIL <grammar>
	   ===============> Trying <grammar> from position 1
	    cp file1 file2  |...Trying <cmd>
			    |	|...Trying <cmd=(cp)>
	    file1 file2	    |	|    \_____<cmd=(cp)> matched 'cp'
	   file1 file2	    |	|...Trying <[file]>+
	    file2	    |	|    \_____<[file]>+ matched 'file1'
			    |	|...Trying <[file]>+
	   [eos]	    |	|    \_____<[file]>+ matched ' file2'
			    |	|...Trying <[file]>+
			    |	|    \FAIL <[file]>+
			    |	|...Trying <target>
			    |	|   |...Trying <file>
			    |	|   |	 \FAIL <file>
			    |	|    \FAIL <target>
	    <~~~~~~~~~~~~~~ |	|...Backtracking 5 chars and trying new	match
	   file2	    |	|...Trying <target>
			    |	|   |...Trying <file>
			    |	|   |	 \____ <file> matched 'file2'
	   [eos]	    |	|    \_____<target> matched 'file2'
			    |	 \_____<cmd> matched ' cp file1	file2'
			     \_____<grammar> matched ' cp file1	file2'

       The first column	indicates the point in the input at which the parser
       is trying to match, as well as any backtracking or forward searching it
       may need	to do. The remainder of	the columns track the parser's
       hierarchical traversal of the grammar, indicating which rules are
       tried, which succeed, and what they match.

       Provided	the logfile is a terminal (as it is by default), the debugger
       also pauses at various points in	the parsing process--before trying a
       rule, after a rule succeeds, or at the end of the parse--according to
       the most	recent command issued. When it pauses, you can issue a new
       command by entering a single letter:

	   m	   - to	continue until the next	subrule	matches
	   t or	s  - to	continue until the next	subrule	is tried
	   r or	c  - to	continue to the	end of the grammar
	   o	   - to	switch off debugging

       Note that these are the first letters of	the corresponding
       "<debug:...>" commands, listed earlier. Just hitting ENTER while	the
       debugger	is paused repeats the previous command.

       While the debugger is paused you	can also type a	'd', which will
       display the result-hash for the current rule. This can be useful	for
       detecting which rule isn't returning the	data you expected.

       Resizing	the context string

       By default, the first column of the debugger output (which shows	the
       current matching	position within	the string) is limited to a width of
       20 columns.

       However, you can change that limit by calling the
       "Regexp::Grammars::set_context_width()" subroutine. You have to specify
       the fully qualified name, however, as Regexp::Grammars does not export
       this (or	any other) subroutine.

       "set_context_width()" expects a single argument:	a positive integer
       indicating the maximal allowable	width for the context column. It
       issues a	warning	if an invalid value is passed, and ignores it.

       If called in a void context, "set_context_width()" changes the context
       width permanently throughout your application. If called	in a scalar or
       list context, "set_context_width()" returns an object whose destructor
       will cause the context width to revert to its previous value. This
       means you can temporarily change	the context width within a given block
       with something like:

           {
               my $temporary = Regexp::Grammars::set_context_width(50);

               if ($text =~ $parser) {
                   do_stuff_with( %/ );
               }

           } # <--- context width automagically reverts at this point

       and the context width will change back to its previous value when
       $temporary goes out of scope at the end of the block.
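
       The context-sensitive behaviour described above is an instance of the
       general "scope guard" technique. The following sketch shows that
       technique in plain Perl; "set_width()", $context_width and
       My::RestoreGuard are hypothetical names, and nothing here is taken
       from the module's own implementation.

```perl
use strict;
use warnings;

my $context_width = 20;    # hypothetical stand-in for the module's setting

# Guard object: runs a callback when it is destroyed.
{
    package My::RestoreGuard;
    sub new {
        my ($class, $restore) = @_;
        return bless { restore => $restore }, $class;
    }
    sub DESTROY { my ($self) = @_; $self->{restore}->() }
}

sub set_width {
    my ($new_width) = @_;
    my $old_width = $context_width;
    $context_width = $new_width;

    # Void context: nothing to return, so the change is permanent...
    return if !defined wantarray;

    # Otherwise hand back a guard that restores the old value...
    return My::RestoreGuard->new(sub { $context_width = $old_width });
}

{
    my $temporary = set_width(50);
    print "inside block: $context_width\n";
}
print "after block:  $context_width\n";

set_width(30);
print "after void call: $context_width\n";
```

       The width is 50 inside the block, reverts to 20 when the guard is
       destroyed at the closing brace, and stays at 30 after the call in
       void context.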

   User-defined	logging	with "<log:...>"
       Both static and interactive debugging send a series of predefined log
       messages	to whatever log	file you have specified. It is also possible
       to send additional, user-defined	messages to the	log, using the
       "<log:...>" directive.

       This directive expects either plain text or a code block as its
       single argument.	If the argument	is a code block, that code is expected
       to return the text of the message; if the argument is anything else,
       that something else is the literal message. For example:

	   <rule: ListElem>

	       <Elem=	( [a-z]\d+) >
		   <log: Checking for a	suffix,	too...>

	       <Suffix=	( : \d+	  ) >?
		   <log: (?{ "ListElem:	$MATCH{Elem} and $MATCH{Suffix}" })>

       User-defined log	messages implemented using a codeblock can also
       specify a severity level. If the	codeblock of a "<log:...>" directive
       returns two or more values, the first is	treated	as a log message
       severity	indicator, and the remaining values as separate	lines of text
       to be logged. For example:

	   <rule: ListElem>
	       <Elem=	( [a-z]\d+) >
	       <Suffix=	( : \d+	  ) >?

                   <log: (?{
                       warn => "Elem was: $MATCH{Elem}",
                               "Suffix was $MATCH{Suffix}",
                   })>

       When they are encountered, user-defined log messages are	interspersed
       between any automatic log messages (i.e.	from the debugger), at the
       correct level of	nesting	for the	current	rule.

   Debugging non-grammars
       [Note that, with	the release in 2012 of the Regexp::Debugger module (on
       CPAN) the techniques described below are	unnecessary. If	you need to
       debug plain Perl	regexes, use Regexp::Debugger instead.]

       It is possible to use Regexp::Grammars without creating any subrule
       definitions, simply to debug a recalcitrant regex. For example, if the
       following regex wasn't working as expected:

           my $balanced_brackets = qr{
               \(             # left delim
               (?:
                   \\         # escape or
               |   (?R)       # recurse or
               |   .          # whatever
               )*
               \)             # right delim
           }xms;

       you could instrument it with aliased subpatterns	and then debug it
       step-by-step, using Regexp::Grammars:

	   use Regexp::Grammars;

           my $balanced_brackets = qr{
               <debug:step>

               <.left_delim=  (  \(  )>
               (?:
                   <.escape=  (  \\  )>
               |   <.recurse= ( (?R) )>
               |   <.whatever=(  .   )>
               )*
               <.right_delim= (  \)  )>
           }xms;

           while (<>) {
               say 'matched' if /$balanced_brackets/;
           }

       Note the	use of amnesiac	aliased	subpatterns to avoid needlessly
       building	a result-hash. Alternatively, you could	use listifying aliases
       to preserve the matching	structure as an	additional debugging aid:

	   use Regexp::Grammars;

           my $balanced_brackets = qr{
               <debug:step>

               <[left_delim=  (  \(  )]>
               (?:
                   <[escape=  (  \\  )]>
               |   <[recurse= ( (?R) )]>
               |   <[whatever=(  .   )]>
               )*
               <[right_delim= (  \)  )]>
           }xms;

           if ( '(a(bc)d)' =~ /$balanced_brackets/) {
               use Data::Dumper 'Dumper';
               warn Dumper \%/;
           }

Handling errors	when parsing
       Assuming	you have correctly debugged your grammar, the next source of
       problems	will likely be invalid input (especially if that input is
       being provided interactively). So Regexp::Grammars also provides	some
       support for detecting when a parse is likely to fail...and informing
       the user	why.

       The "<require:...>" directive is	useful for testing conditions that
       it's not easy (or even possible) to check within the syntax of the
       regex itself. For example:

           <rule: IPV4_Octet_Decimal>
               # Up to three digits...
               <MATCH= ( \d{1,3}+ )>

               # ...but less than 256...
               <require: (?{ $MATCH <= 255 })>

       A require expects a regex codeblock as its argument and succeeds	if the
       final value of that codeblock is	true. If the final value is false, the
       directive fails and the rule starts backtracking.

       Note that, in this example, the digits are matched with " \d{1,3}+ ".
       The trailing "+"	prevents the "{1,3}" repetition	from backtracking to a
       smaller number of digits	if the "<require:...>" fails.

   Handling failure
       The module has limited support for error	reporting from within a
       grammar,	in the form of the "<error:...>" and "<warning:...>"
       directives and their shortcuts: "<...>", "<!!!>", and "<???>".

       Error messages

       The "<error: MSG>" directive queues a conditional error message within
       "@!" and	then fails to match (that is, it is equivalent to a "(?!)"
       when matching). For example:

           <rule: ListElem>
               <SerialNumber>
             | <ClientName>
             | <error: (?{ $errcount++ . ': Missing list element' })>

       So a common code	pattern	when using grammars that do this kind of error
       detection is:

           if ($text =~ $grammar) {
               # Do something with the data collected in %/
           }
           else {
               say {*STDERR} $_ for @!;   # i.e. report all errors
           }

       Each error message is conditional in the	sense that, if any surrounding
       rule subsequently matches, the message is automatically removed from
       "@!". This implies that you can queue up	as many	error messages as you
       wish, but they will only	remain in "@!" if the match ultimately fails.
       Moreover, only those error messages originating from rules that
       actually contributed to the eventual failure-to-match will remain in
       "@!".
       If a code block is specified as the argument, the error message is
       whatever	final value is produced	when the block is executed. Note that
       this final value	does not have to be a string (though it	does have to
       be a scalar).

	   <rule: ListElem>
	     | <ClientName>
	     | <error: (?{
		   # Return a hash, with the error information...
                   { errnum => $errcount++, msg => 'Missing list element' }
               })>

       If anything else	is specified as	the argument, it is treated as a
       literal error string (and may not contain an unbalanced '<' or '>', nor
       any interpolated	variables).

       However,	if the literal error string begins with	"Expected " or
       "Expecting ", then the error string automatically has the following
       "context	suffix"	appended:

	   , but found '$CONTEXT' instead

       For example:

	   qr{ <Arithmetic_Expression>		      #	...Match arithmetic expression
	     |					      #	Or else
	       <error: Expected	a valid	expression>   #	...Report error, and fail

               # Rule definitions here...
           }xms;

       On an invalid input this	example	might produce an error message like:

	   "Expected a valid expression, but found '(2+3]*7/' instead"

       The value of the	special	$CONTEXT variable is found by looking ahead in
       the string being	matched	against, to locate the next sequence of	non-
       blank characters	after the current parsing position. This variable may
       also be explicitly used within the "<error: (?{...})>" form of the
       directive.

       As a special case, if you omit the message entirely from	the directive,
       it is supplied automatically, derived from the name of the current
       rule.  For example, if the following rule were to fail to match:

	   <rule: Arithmetic_expression>
		 <Multiplicative_Expression>+ %	([+-])
	       | <error:>

       the error message queued	would be:

	   "Expected arithmetic	expression, but	found 'one plus	two' instead"

       Note however, that it is	still essential	to include the colon in	the
       directive. A common mistake is to write:

	   <rule: Arithmetic_expression>
		 <Multiplicative_Expression>+ %	([+-])
	       | <error>

       which merely attempts to call "<rule: error>" if the first alternative
       fails.

       Warning messages

       Sometimes, you want to detect problems, but not invalidate the entire
       parse as	a result. For those occasions, the module provides a "less
       stringent" form of error	reporting: the "<warning:...>" directive.

       This directive is exactly the same as an	"<error:...>" in every respect
       except that it does not induce a failure to match at the point it
       appears.

       The directive is, therefore, useful for reporting non-fatal problems in
       a parse.	For example:

           qr{ \A            # ...Match only at start of input
               <ArithExpr>   # ...Match a valid arithmetic expression

               (?:
                   # Should be at end of input...
                   \s* \Z
                 |
                   # If not, report the fact but don't fail...
                   <warning: Expected end-of-input>
                   <warning: (?{ "Extra junk at index $INDEX: $CONTEXT" })>
               )

               # Rule definitions here...
           }xms;

       Note that, because they do not induce failure, two or more
       "<warning:...>" directives can be "stacked" in sequence,	as in the
       previous	example.


       The module also provides	three useful shortcuts,	specifically to	make
       it easy to declare, but not define, rules and tokens.

       The "<...>" and "<!!!>" directives are equivalent to the directive:

	   <error: Cannot match	RULENAME (not implemented)>

       The "<???>" is equivalent to the	directive:

	   <warning: Cannot match RULENAME (not	implemented)>

       For example, in the following grammar:

	   <grammar: List::Generic>

	   <rule: List>
	       <[Item]>+ % (\s*,\s*)

           <rule: Item>
               <...>

       the "Item" rule is declared but not defined. That means the grammar
       will compile correctly, (the "List" rule	won't complain about a call to
       a non-existent "Item"), but if the "Item" rule isn't overridden in some
       derived grammar,	a match-time error will	occur when "List" tries	to
       match the "<...>" within	"Item".

       Localizing the (semi-)automatic error messages

       Error directives	of any of the following	forms:

	   <error: Expecting identifier>

           <error: >

           <...>

           <!!!>

       or their	warning	equivalents:

	   <warning: Expecting identifier>

           <warning: >

           <???>

       each autogenerate part or all of	the actual error message they produce.
       By default, that	autogenerated message is always	produced in English.

       However,	the module provides a mechanism	by which you can intercept
       every error or warning that is queued to	"@!"  via these
       directives...and	localize those messages.

       To do this, you call "Regexp::Grammars::set_error_translator()" (with
       the full	qualification, since Regexp::Grammars does not export it...nor
       anything	else, for that matter).

       The "set_error_translator()" subroutine expects a single argument,
       which must be a reference to another subroutine.	 This subroutine is
       then called whenever an error or	warning	message	is queued to "@!".

       The subroutine is passed	three arguments:

       o   the message string,

       o   the name of the rule	from which the error or	warning	was queued,

       o   the value of	$CONTEXT when the error	or warning was encountered

       The subroutine is expected to return the	final version of the message
       that is actually	to be appended to "@!".	To accomplish this it may make
       use of one of the many internationalization/localization	modules
       available in Perl, or it	may do the conversion entirely by itself.

       The first argument is always exactly what appeared as a message in the
       original	directive (regardless of whether that message is supposed to
       trigger autogeneration, or is just a "regular" error message).  That
       is:

	   Directive			     1st argument

	   <error: Expecting identifier>     "Expecting	identifier"
	   <warning: That's not	a moon!>     "That's not a moon!"
	   <error: >			     ""
	   <warning: >			     ""
	   <...>			     ""
	   <!!!>			     ""
	   <???>			     ""

       The second argument always contains the name of the rule	in which the
       directive was encountered. For example, when invoked from within
       "<rule: Frinstance>" the	following directives produce:

	   Directive			     2nd argument

	   <error: Expecting identifier>     "Frinstance"
	   <warning: That's not	a moon!>     "Frinstance"
	   <error: >			     "Frinstance"
	   <warning: >			     "Frinstance"
	   <...>			     "-Frinstance"
	   <!!!>			     "-Frinstance"
	   <???>			     "-Frinstance"

       Note that the "unimplemented" markers pass the rule name	with a
       preceding '-'. This allows your translator to distinguish between
       "empty" messages	(which should then be generated	automatically) and the
       "unimplemented" markers (which should report that the rule is not yet
       properly	defined).

       If you call "Regexp::Grammars::set_error_translator()" in a void
       context,	the error translator is	permanently replaced (at least,	until
       the next	call to	"set_error_translator()").

       However,	if you call "Regexp::Grammars::set_error_translator()" in a
       scalar or list context, it returns an object whose destructor will
       restore the previous translator.	This allows you	to install a
       translator only within a	given scope, like so:

           {
               my $temporary
                   = Regexp::Grammars::set_error_translator(\&my_translator);

               if ($text =~ $parser) {
                   do_stuff_with( %/ );
               }
               else {
                   report_errors_in( @! );
               }

           } # <--- error translator automagically reverts at this point
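
       This context-sensitive behaviour is the familiar scope-guard idiom.
       The sketch below is not the module's implementation: the
       "Local::Guard" class, the "set_translator()" function and the
       $current variable are hypothetical stand-ins, showing only how
       "wantarray" and a destructor can combine to produce such behaviour:

```perl
use strict;
use warnings;

package Local::Guard;    # hypothetical helper class

# Remember a callback that undoes some change...
sub new     { my ($class, $undo) = @_; return bless { undo => $undo }, $class }

# ...and run it when the object is destroyed
sub DESTROY { my ($self) = @_; $self->{undo}->() }

package main;

my $current = 'default';    # stands in for the installed translator
my @seen;

sub set_translator {
    my ($new) = @_;
    my $previous = $current;
    $current = $new;

    # Void context: nothing can hold a guard, so the change is permanent
    return if !defined wantarray;

    # Scalar or list context: return an object that undoes the change
    return Local::Guard->new(sub { $current = $previous });
}

{
    my $guard = set_translator('temporary');
    push @seen, $current;       # 'temporary' while $guard is alive
}
push @seen, $current;           # reverted once $guard went out of scope

set_translator('permanent');    # void context: no guard, no revert
push @seen, $current;

print join(', ', @seen), "\n";
```

       The practical consequence is that the returned object must be kept
       in a lexical variable for as long as the translator is needed;
       discarding it restores the previous translator at once.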

       Warning:	any error translation subroutine you install will be called
       during the grammar's parsing phase (i.e.	as the grammar's regex is
       matching). You should therefore ensure that your	translator does	not
       itself use regular expressions, as nested evaluations of	regexes	inside
       other regexes are extremely problematical (i.e. almost always
       disastrous) in Perl.
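
       As an illustration of the interface described above, here is a
       minimal sketch of a translator that follows these rules. The German
       wording, the exact message formats and the name "my_translator" are
       assumptions made for the example; only the three-argument calling
       convention and the leading '-' on "unimplemented" rule names come
       from the description above. Note that it deliberately uses "substr",
       "index" and "tr" rather than regexes:

```perl
use strict;
use warnings;

# Hypothetical translator: renders the autogenerated messages in German
sub my_translator {
    my ($message, $rule_name, $context) = @_;

    # "Unimplemented" markers (<...>, <!!!>, <???>) pass '-RULENAME'
    if (substr($rule_name, 0, 1) eq '-') {
        return sprintf "Regel '%s' ist noch nicht implementiert",
                       substr($rule_name, 1);
    }

    # Empty message: generate one from the rule name
    if ($message eq q{}) {
        (my $description = lc $rule_name) =~ tr/_/ /;
        return sprintf "Erwartet: %s, gefunden: '%s'", $description, $context;
    }

    # "Expected ..." or "Expecting ..." message: append the context
    for my $prefix ('Expected ', 'Expecting ') {
        if (index($message, $prefix) == 0) {
            return sprintf "Erwartet: %s, gefunden: '%s'",
                           substr($message, length $prefix), $context;
        }
    }

    # Any other message is passed through unchanged
    return $message;
}

# Installing it would then look like this (requires Regexp::Grammars):
#     Regexp::Grammars::set_error_translator(\&my_translator);

my @results = (
    my_translator('Expecting identifier', 'Frinstance',  '42'),
    my_translator(q{},                    'Frinstance',  '42'),
    my_translator(q{},                    '-Frinstance', '42'),
    my_translator("That's not a moon!",   'Frinstance',  '42'),
);
print "$_\n" for @results;
```

       Calling the subroutine directly, as in the last few lines, is a
       convenient way to test a translator before installing it.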

   Restricting how long	a parse	runs
       Like the	core Perl 5 regex engine on which they are built, the grammars
       implemented by Regexp::Grammars are essentially top-down	parsers. This
       means that they may occasionally	require	an exponentially long time to
       parse a particular input. This usually occurs if	a particular grammar
       includes	a lot of recursion or nested backtracking, especially if the
       grammar is then matched against a long string.

       The judicious use of non-backtracking repetitions (i.e. "x*+" and
       "x++") can significantly	improve	parsing	performance in many such
       cases. Likewise,	carefully reordering any high-level alternatives (so
       as to test simple common cases first) can substantially reduce parsing
       time.

       However,	some languages are just	intrinsically slow to parse using top-
       down techniques (or, at least, may have slow-to-parse corner cases).

       To help cope with this constraint, Regexp::Grammars provides a
       mechanism by which you can limit	the total effort that a	given grammar
       will expend in attempting to match. The "<timeout:...>" directive
       allows you to specify how long a	grammar	is allowed to continue trying
       to match	before giving up. It expects a single argument,	which must be
       an unsigned integer, and	it treats this integer as the number of
       seconds to continue attempting to match.

       For example:

	   <timeout: 10>

       indicates that the grammar should keep attempting to match for another
       10 seconds from the point where the directive is	encountered during a
       parse. If the complete grammar has not matched in that time, the	entire
       match is	considered to have failed, the matching	process	is immediately
       terminated, and a standard error	message	('Internal error: Timed	out
       after 10	seconds	(as requested)') is returned in	"@!".

       A "<timeout:...>" directive can be placed anywhere in a grammar,	but is
       most usually placed at the very start, so that the entire grammar is
       governed	by the specified time limit. The second	most common
       alternative is to place the timeout at the start	of a particular
       subrule that is known to	be potentially very slow.

       A common	mistake	is to put the timeout specification at the top level
       of the grammar, but place it after the actual subrule to	be matched,
       like so:

	   my $grammar = qr{

	       <Text_Corpus>	  # Subrule to be matched
	       <timeout: 10>	  # Useless use	of timeout

	       <rule: Text_Corpus>
                   # et cetera...
           }xms;

       Since the parser	will only reach	the "<timeout: 10>" directive after it
       has completely matched "<Text_Corpus>", the timeout is only initiated
       at the very end of the matching process and so does not limit that
       process in any useful way.

       Immediate timeouts

       As you might expect, a "<timeout: 0>" directive tells the parser	to
       keep trying for only zero more seconds, and therefore will immediately
       cause the entire	surrounding grammar to fail (no	matter how deeply
       within that grammar the directive is encountered).

       This can occasionally be extremely useful. If you know that detecting
       a particular datum means	that the grammar will never match, no matter
       how many	other alternatives may subsequently be tried, you can short-
       circuit the parser by injecting a "<timeout: 0>"	immediately after the
       offending datum is detected.

       For example, if your grammar only accepts certain versions of the
       language	being parsed, you could	write:

	   <rule: Valid_Language_Version>
		   vers	= <%AcceptableVersions>
               |
                   vers = <bad_version=(\S++)>
		   <warning: (?{ "Cannot parse language	version	$MATCH{bad_version}" })>
		   <timeout: 0>

       In fact,	this "<warning:	MSG> <timeout: 0>" sequence is sufficiently
       useful, sufficiently complex, and sufficiently easy to get wrong, that
       Regexp::Grammars	provides a handy shortcut for it: the "<fatal:...>"
       directive. A "<fatal:...>" is exactly equivalent	to a "<warning:...>"
       followed by a zero-timeout, so the previous example could also be
       written:

	   <rule: Valid_Language_Version>
		   vers	= <%AcceptableVersions>
               |
                   vers = <bad_version=(\S++)>
		   <fatal: (?{ "Cannot parse language version $MATCH{bad_version}" })>

       Like "<error:...>" and "<warning:...>", "<fatal:...>" also provides its
       own failure context in $CONTEXT,	so the previous	example	could be
       further simplified to:

	   <rule: Valid_Language_Version>
		   vers	= <%AcceptableVersions>
               |
                   vers = <fatal:(?{ "Cannot parse language version $CONTEXT" })>

       Also like "<error:...>",	"<fatal:...>" can autogenerate an error
       message if none is provided, so the example could be still further
       reduced to:

	   <rule: Valid_Language_Version>
		   vers	= <%AcceptableVersions>
               |
                   vers = <fatal:>

       In this last case, however, the error message returned in "@!" would no
       longer be:

	   Cannot parse	language version 0.95

       It would	now be:

	   Expected valid language version, but	found '0.95' instead

Scoping	considerations
       If you intend to	use a grammar as part of a larger program that
       contains	other (non-grammatical)	regexes, it is more efficient--and
       less error-prone--to avoid having Regexp::Grammars process those
       regexes as well.	So it's	often a	good idea to declare your grammar in a
       "do" block, thereby restricting the scope of the	module's effects.

       For example:

           my $grammar = do {
               use Regexp::Grammars;
               qr{
                   <file>

                   <rule: file>
                       # etc.

                   <rule: prelude>
                       # etc.
               }x;
           };

       Because the effects of Regexp::Grammars are lexically scoped, any
       regexes defined outside that "do" block will be unaffected by the
       module.

   Perl	API
       "use Regexp::Grammars;"
	   Causes all regexes in the current lexical scope to be compile-time
	   processed for grammar elements.

       "$str =~	$grammar"
       "$str =~	/$grammar/"
	   Attempt to match the	grammar	against	the string, building a nested
	   data	structure from it.

       "%/"
           This hash is assigned the nested data structure created by any
           successful match of a grammar regex.

       "@!"
           This array is assigned the queue of error messages created by any
           unsuccessful match attempt of a grammar regex.

   Grammar syntax

       "<rule: IDENTIFIER>"
	   Define a rule whose name is specified by the	supplied identifier.

	   Everything following	the "<rule:...>" directive (up to the next
	   "<rule:...>"	or "<token:...>" directive) is treated as part of the
	   rule	being defined.

	   Any whitespace in the rule is replaced by a call to the "<.ws>"
           subrule (which defaults to matching "\s*", but may be explicitly
           redefined).

       "<token:	IDENTIFIER>"
	   Define a rule whose name is specified by the	supplied identifier.

	   Everything following	the "<token:...>" directive (up	to the next
	   "<rule:...>"	or "<token:...>" directive) is treated as part of the
	   rule	being defined.

	   Any whitespace in the rule is ignored (under	the "/x" modifier), or
	   explicitly matched (if "/x" is not used).

       "<objrule:  IDENTIFIER>"
       "<objtoken: IDENTIFIER>"
	   Identical to	a "<rule: IDENTIFIER>" or "<token: IDENTIFIER>"
	   declaration,	except that the	rule or	token will also	bless the hash
	   it normally returns,	converting it to an object of a	class whose
	   name	is the same as the rule	or token itself.

       "<require: (?{ CODE }) >"
	   The code block is executed and if its final value is	true, matching
	   continues from the same position. If	the block's final value	is
	   false, the match fails at that point	and starts backtracking.

       "<error:	(?{ CODE })  >"
       "<error:	LITERAL	TEXT >"
       "<error:	>"
	   This	directive queues a conditional error message within the	global
	   special variable "@!" and then fails	to match at that point (that
	   is, it is equivalent	to a "(?!)" or "(*FAIL)" when matching).

       "<fatal:	(?{ CODE })  >"
       "<fatal:	LITERAL	TEXT >"
       "<fatal:	>"
	   This	directive is exactly the same as an "<error:...>" in every
	   respect except that it immediately causes the entire	surrounding
           grammar to fail, and parsing to cease immediately.

       "<warning: (?{ CODE })  >"
       "<warning: LITERAL TEXT >"
	   This	directive is exactly the same as an "<error:...>" in every
	   respect except that it does not induce a failure to match at	the
	   point it appears. That is, it is equivalent to a "(?=)" ["succeed
           and continue matching"], rather than a "(?!)" ["fail and
           backtrack"].

       "<debug:	COMMAND	>"
	   During the matching of grammar regexes send debugging and warning
	   information to the specified	log file (see "<logfile: LOGFILE>").

	   The available "COMMAND"'s are:

	       <debug: continue>    ___	Debug until end	of complete parse
	       <debug: run>	    _/

	       <debug: on>	    ___	Debug until next subrule match
	       <debug: match>	    _/

	       <debug: try>	    ___	Debug until next subrule call or match
	       <debug: step>	    _/

	       <debug: same>	    ___	Maintain current debugging mode

	       <debug: off>	    ___	No debugging

	   See also the	$DEBUG special variable.

       "<logfile: LOGFILE>"
       "<logfile:    -	 >"
	   During the compilation of grammar regexes, send debugging and
	   warning information to the specified	LOGFILE	(or to *STDERR if "-"
	   is specified).

	   If the specified LOGFILE name contains a %t,	it is replaced with a
	   (sortable) "YYYYMMDD.HHMMSS"	timestamp. For example:

	       <logfile: test-run-%t >

	   executed at around 9.30pm on	the 21st of March 2009,	would generate
	   a log file named: "test-run-20090321.213056"

       "<log: (?{ CODE })  >"
       "<log: LITERAL TEXT >"
	   Append a message to the log file. If	the argument is	a code block,
	   that	code is	expected to return the text of the message; if the
           argument is anything else, that something else is the literal
           message.

	   If the block	returns	two or more values, the	first is treated as a
	   log message severity	indicator, and the remaining values as
	   separate lines of text to be	logged.

       "<timeout: INT >"
	   Restrict the	match-time of the parse	to the specified number	of
           seconds.  Queues an error message and terminates the entire match
           process if the parse does not complete within the nominated time
           limit.
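
       The "%t" substitution described under "<logfile: LOGFILE>" above can
       be reproduced in plain Perl. The following is a rough sketch of the
       naming scheme only, not the module's own code; the helper name
       "expand_logfile_name" is hypothetical:

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: expand %t in a log file name for a given moment,
# which is supplied as a localtime()-style list of fields
sub expand_logfile_name {
    my ($template, @time_fields) = @_;

    # Build a sortable YYYYMMDD.HHMMSS timestamp
    my $stamp = strftime('%Y%m%d.%H%M%S', @time_fields);

    # Substitute it for the first %t in a copy of the template
    (my $name = $template) =~ s/%t/$stamp/;
    return $name;
}

# 21:30:56 on 21 March 2009, as (sec, min, hour, mday, mon, year):
# months count from 0 and years from 1900, as with localtime()
my $name = expand_logfile_name('test-run-%t', 56, 30, 21, 21, 2, 109);

print "$name\n";    # test-run-20090321.213056
```

       Passing "localtime" in place of the explicit fields would produce a
       name for the current moment instead.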

       Subrule calls

       "<IDENTIFIER>"
           Call the subrule whose name is IDENTIFIER.

           If it matches successfully, save the hash it returns in the current
           scope's result-hash, under the key 'IDENTIFIER'.

       "<IDENTIFIER_1=IDENTIFIER_2>"
           Call the subrule whose name is IDENTIFIER_2.

           If it matches successfully, save the hash it returns in the current
           scope's result-hash, under the key 'IDENTIFIER_1'.

           In other words, the "IDENTIFIER_1=" prefix changes the key under
           which the result of calling a subrule is stored.

       "<.IDENTIFIER>"
           Call the subrule whose name is IDENTIFIER.  Don't save the hash it
           returns.

           In other words, the "dot" prefix disables saving of subrule
           results.

       "<IDENTIFIER= ( PATTERN )>"
	   Match the subpattern	PATTERN.

	   If it matches successfully, capture the substring it	matched	and
	   save	that substring in the current scope's result-hash, under the
	   key 'IDENTIFIER'.

       "<.IDENTIFIER= (	PATTERN	)>"
	   Match the subpattern	PATTERN.  Don't	save the substring it matched.

       "<IDENTIFIER= %HASH>"
           Match a sequence of non-whitespace then verify that the sequence is
           a key in the specified hash.

           If it matches successfully, capture the sequence it matched and
           save that substring in the current scope's result-hash, under the
           key 'IDENTIFIER'.

       "<%HASH>"
           Match a key from the hash.  Don't save the substring it matched.

       "<IDENTIFIER= (?{ CODE })>"
	   Execute the specified CODE.

	   Save	the result (of the final expression that the CODE evaluates)
	   in the current scope's result-hash, under the key 'IDENTIFIER'.

       "<[IDENTIFIER]>"
           Call the subrule whose name is IDENTIFIER.

           If it matches successfully, append the hash it returns to a nested
           array within the current scope's result-hash, under the key
           'IDENTIFIER'.

       "<[IDENTIFIER_1=IDENTIFIER_2]>"
           Call the subrule whose name is IDENTIFIER_2.

           If it matches successfully, append the hash it returns to a nested
           array within the current scope's result-hash, under the key
           'IDENTIFIER_1'.

       "<ANY_SUBRULE>+ % (PATTERN)"
       "<ANY_SUBRULE>* % (PATTERN)"
	   Repeatedly call the first subrule.  Keep matching as	long as	the
	   subrule matches, provided successive	matches	are separated by
	   matches of the second subrule or the	pattern.

           In other words, match a list of ANY_SUBRULE's separated by
           matches of the second subrule or of PATTERN.

	   Note	that, if a pattern is used to specify the separator, it	must
	   be specified	in some	kind of	matched	parentheses. These may be
	   capturing ["(...)"],	non-capturing ["(?:...)"], non-backtracking
	   ["(?>...)"],	or any other construct enclosed	by an opening and
	   closing paren.

   Special variables within grammar actions
       $CAPTURE
       $CONTEXT
           These are both aliases for the built-in read-only $^N variable,
	   which always	contains the substring matched by the nearest
	   preceding "(...)"  capture. $^N still works perfectly well, but
	   these are provided to improve the readability of code blocks	and
	   error messages respectively.

       $INDEX
           This variable contains the index at which the next match will be
	   attempted within the	string being parsed. It	is most	commonly used
	   in "<error:...>" or "<log:...>" directives:

	       <rule: ListElem>
		   <log: (?{ "Trying words at index $INDEX" })>
                   <MATCH=( \w++ )>
                 |
                   <log: (?{ "Trying digits at index $INDEX" })>
                   <MATCH=( \d++ )>
                 |
                   <error: (?{ "Missing ListElem near index $INDEX" })>

       %MATCH
           This variable contains all the saved results of any subrules called
           from the current rule. In other words, subrule calls like:

               <ListElem>  <Separator= (,)>

           store their respective match results in $MATCH{'ListElem'} and
           $MATCH{'Separator'}.

       $MATCH
           This variable is an alias for $MATCH{"="}. This is the %MATCH entry
	   for the special "override value". If	this entry is defined, its
	   value overrides the usual "return \%MATCH" semantics	of a
	   successful rule.

       %ARG
           This variable contains all the key/value pairs that were passed
           into a particular subrule call. For example, given:

	       <Keyword>  <Command>  <Terminator(:Keyword)>

	   the "Terminator" rule could get access to the text matched by
	   "<Keyword>" like so:

	       <token: Terminator>
		   end_	(??{ $ARG{'Keyword'} })

           Note that to match against the calling subrule's 'Keyword' value,
	   it's	necessary to use either	a deferred interpolation ("(??{...})")
	   or a	qualified matchref:

	       <token: Terminator>
		   end_	<\:Keyword>

           A common mistake is to attempt to directly interpolate the
           argument:

	       <token: Terminator>
		   end_	$ARG{'Keyword'}

	   This	evaluates $ARG{'Keyword'} when the grammar is compiled,	rather
	   than	when the rule is matched.

       $_  At the start	of any code blocks inside any regex, the variable $_
	   contains the	complete string	being matched against. The current
	   matching position within that string	is given by: "pos($_)".

       $DEBUG
           This variable stores the current debugging mode (which may be any
	   of: 'off', 'on', 'run', 'continue', 'match',	'step',	or 'try'). It
	   is set automatically	by the "<debug:...>" command, but may also be
	   set manually	in a code block	(which can be useful for conditional
	   debugging). For example:

               <rule: ListElem>
                   <Identifier>

		   # Conditionally debug if 'foobar' encountered...
		   (?{ $DEBUG =	$MATCH{Identifier} eq 'foobar' ? 'step'	: 'off'	})


           See also: the "<logfile: LOGFILE>" and "<debug: DEBUG_CMD>"
           directives.
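
       The difference between direct and deferred interpolation, noted
       above for %ARG, is a property of Perl regexes in general and can be
       seen without the module. In this minimal sketch an ordinary lexical
       variable, $keyword, stands in for $ARG{'Keyword'} (an assumption
       made only so that the example runs in plain Perl):

```perl
use strict;
use warnings;

my $keyword = q{};    # value not yet known when the patterns are compiled

# Direct interpolation: the (empty) value is baked in at compile time
my $direct   = qr{ \A end_ $keyword \z }x;

# Deferred interpolation: the value is looked up each time a match is tried
my $deferred = qr{ \A end_ (??{ $keyword }) \z }x;

$keyword = 'loop';    # the value only becomes available later

my $direct_result   = 'end_loop' =~ $direct   ? 'matches' : 'fails';
my $deferred_result = 'end_loop' =~ $deferred ? 'matches' : 'fails';

print "direct:   $direct_result\n";      # fails
print "deferred: $deferred_result\n";    # matches
```

       The directly interpolated pattern can only ever match the bare
       string "end_", because that is all it contained when it was
       compiled.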

       o   Prior to Perl 5.14, the Perl 5 regex engine was not reentrant. So
	   any attempt to perform a regex match	inside a "(?{ ... })" or "(??{
	   ... })" under Perl 5.12 or earlier will almost certainly lead to
	   either weird	data corruption	or a segfault.

	   The same calamities can also	occur in any constructor called	by
	   "<objrule:>". If the	constructor invokes another regex in any way,
	   it will most	likely fail catastrophically. In particular, this
	   means that Moose constructors will frequently crash and burn	within
           a Regexp::Grammars grammar (for example, if the Moose-based class
	   declares an attribute type constraint such as 'Int',	which Moose
	   checks using	a regex).

       o   The additional regex	constructs this	module provides	are
	   implemented by rewriting regular expressions. This is a (safer)
	   form	of source filtering, but still subject to all the same
	   limitations and fallibilities of any	other macro-based solution.

       o   In particular, rewriting the	macros involves	the insertion of (a
	   lot of) extra capturing parentheses.	This means you can no longer
	   assume that particular capturing parens correspond to particular
           numeric variables: i.e. to $1, $2, $3 etc. If you want to capture
           directly, use Perl 5.10's named capture construct:

	       (?<name>	[^\W\d]\w* )

	   Better still, capture the data in its correct hierarchical context
	   using the module's "named subpattern" construct:

	       <name= ([^\W\d]\w*) >

       o   No recursive	descent	parser--including those	created	with
	   Regexp::Grammars--can directly handle left-recursive	grammars with
	   rules of the	form:

	       <rule: List>
		   <List> , <ListElem>

	   If you find yourself	attempting to write a left-recursive grammar
	   (which Perl 5.10 may	or may not complain about, but will never
	   successfully	parse with), then you probably need to use the
	   "separated list" construct instead:

	       <rule: List>
		   <[ListElem]>+ % (,)

       o   Grammatical parsing with Regexp::Grammars can fail if your grammar
	   places "non-backtracking" directives	(i.e. the "(?>...)" block or
	   the "?+", "*+", or "++" repetition specifiers) around a subrule
	   call.  The problem appears to be that preventing the	regex from
	   backtracking	through	the in-regex actions that Regexp::Grammars
	   adds	causes the module's internal stack to fall out of sync with
	   the regex match.

	   For the time	being, you need	to make	sure that grammar rules	don't
	   appear inside a "non-backtracking" directive.

       o   Similarly, parsing with Regexp::Grammars will fail if your grammar
	   places a subrule call within	a positive look-ahead, since these
	   don't play nicely with the data stack.

	   This	seems to be an internal	problem	with perl itself.
	   Investigations, and attempts	at a workaround, are proceeding.

	   For the time	being, you need	to make	sure that grammar rules	don't
           appear inside a positive lookahead, or use the "<?RULENAME>"
           construct instead.
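
       The named capture construct recommended in the list above does not
       depend on the numbering of capture groups, which is why it survives
       the extra parentheses that the module inserts. A minimal plain-Perl
       sketch (the sample text and variable names are illustrative only):

```perl
use strict;
use warnings;

my $text = '42 total_sum';
my $name;

# An extra capturing group precedes the named one, so the identifier is
# no longer in $1; the name still locates it via the %+ hash
if ($text =~ m{ (\d+) \s+ (?<name> [^\W\d]\w* ) }x) {
    $name = $+{name};
    print "name: $name\n";    # name: total_sum
}
```

       Inside a grammar, the module's own "<name= (...) >" form remains the
       better choice, since it also files the value in the correct place in
       the result-hash.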

       Note that (because the author cannot find a way to throw	exceptions
       from within a regex) none of the	following diagnostics actually throws
       an exception.

       Instead,	these messages are simply written to the specified parser
       logfile (or to *STDERR, if no logfile is	specified).

       However,	any fatal match-time message will immediately terminate	the
       parser matching and will	still set $@ (as if an exception had been
       thrown and caught at that point in the code). You then have the option
       to check $@ immediately after matching with the grammar, and rethrow if
       necessary:

           if ($input =~ $grammar) {
               # Use the data collected in %/
           }
           else {
               die if $@;
           }

       "Found call to %s, but no %s was	defined	in the grammar"
	   You specified a call	to a subrule for which there was no definition
	   in the grammar. Typically that's either because you forget to
	   define the rule, or because you misspelled either the definition or
           the subrule call. For example:

               <file>

	       <rule: fiel>	       <---- misspelled	rule
		   <lines>	       <---- used but never defined

	   Regexp::Grammars converts any such subrule call attempt to an
	   instant catastrophic	failure	of the entire parse, so	if your	parser
           ever actually tries to perform that call, Very Bad Things will
           happen.

       "Entire parse terminated	prematurely while attempting to	call
       non-existent rule: %s"
           You ignored the previous error and actually tried to call a
	   subrule for which there was no definition in	the grammar. Very Bad
	   Things are now happening. The parser	got very upset,	took its ball,
	   and went home.  See the preceding diagnostic	for remedies.

	   This	diagnostic should throw	an exception, but can't. So it sets $@
	   instead, allowing you to trap the error manually if you wish.

       "Fatal error: <objrule: %s> returned a non-hash-based object"
	   An <objrule:> was specified and returned a blessed object that
	   wasn't a hash. This will break the behaviour	of the grammar,	so the
	   module immediately reports the problem and gives up.

           The solution is to use only hash-based classes with <objrule:>.

       "Can't match against <grammar: %s>"
	   The regex you attempted to match against defined a pure grammar,
	   using the "<grammar:...>" directive.	Pure grammars have no start-
	   pattern and hence cannot be matched against directly.

	   You need to define a	matchable grammar that inherits	from your pure
	   grammar and then calls one of its rules. For	example, instead of:

	       my $greeting = qr{
		   <grammar: Greeting>

		   <rule: greet>
		       Hi there
		       | Hello
                       | Yo!
               }xms;

	   you need:

               qr{
                   <grammar: Greeting>

                   <rule: greet>
                       Hi there
                     | Hello
                     | Yo!
               }xms;

               my $greeting = qr{
                   <extends: Greeting>
                   <greet>
               }xms;

       "Multiple definitions for <%s>"
	   You defined two or more rules or tokens with	the same name.	The
	   first one defined will be used, the rest will be ignored.

	   To get rid of the warning, get rid of the extra definitions (or, at
	   least, comment them out).

       "Possible invalid subrule call %s"
           Your grammar contained something of the form:

               <identifier
               <.identifier
               <[identifier

	   which you might have	intended to be a subrule call, but which
	   didn't correctly parse as one. If it	was supposed to	be a
	   Regexp::Grammars subrule call, you need to check the	syntax you
	   used. If it wasn't supposed to be a subrule call, you can silence
           the warning by rewriting it and quoting the leading angle:

               \<identifier
               \<.identifier
               \<[identifier

       "Possible invalid directive: %s"
           Your grammar contained something of the form:

               <identifier:

	   but which wasn't a known directive like "<rule:...>"	or
	   "<debug:...>". If it	was supposed to	be a Regexp::Grammars
	   directive, check the	spelling of the	directive name.	If it wasn't
	   supposed to be a directive, you can silence the warning by
           rewriting it and quoting the leading angle:

               \<identifier:

       "Repeated subrule %s will only capture its final	match"
           You specified a subrule call with a repetition qualifier, such as:

               <ListElem>*
               <ListElem>+

	   Because each	subrule	call saves its result in a hash	entry of the
	   same	name, each repeated match will overwrite the previous ones, so
	   only	the last match will ultimately be saved. If you	want to	save
	   all the matches, you	need to	tell Regexp::Grammars to save the
           sequence of results as a nested array within the hash entry, like
           so:

               <[ListElem]>*
               <[ListElem]>+

	   If you really did intend to throw away every	result but the final
	   one,	you can	silence	the warning by placing the subrule call	inside
           any kind of parentheses. For example:

	       (?: <ListElem> )+

       "Unable to open log file	'$filename' (%s)"
	   You specified a "<logfile:...>" directive but the file whose	name
	   you specified could not be opened for writing (for the reason given
	   in the parens).

	   Did you misspell the	filename, or get the permissions wrong
	   somewhere in	the filepath?

       "Non-backtracking subrule %s not	fully supported	yet"
	   Because of inherent limitations in the Perl 5.10 regex engine, non-
	   backtracking	constructs like	"++", "*+", "?+", and "(?>...)"	do not
	   always work correctly when applied to subrule calls.

	   If the grammar doesn't work properly, replace the offending
	   constructs with regular backtracking	versions instead. If the
	   grammar does	work, you can silence the warning by enclosing the
	   subrule call in any kind of parentheses. For example, change:

	       <[ListElem]>++

	   to:

	       ( <[ListElem]> )++

       "Unexpected item	before first subrule specification in definition of
       <grammar: %s>"
	   Named grammar definitions must consist only of rule and token
	   definitions. They cannot have patterns before the first
	   definition. You had some kind of pattern before the first
	   definition, which will be completely ignored within the grammar.

	   To silence the warning, either comment out or delete	whatever is
	   before the first rule/token definition.

       "Ignoring useless empty <ws:> directive"
	   The "<ws:...>" directive specifies what whitespace matches within
	   the current rule. An	empty "<ws:>" directive	would cause whitespace
	   to match nothing at all, which is what happens in a token
	   definition, not in a	rule definition.

	   Either put some subpattern inside the empty "<ws:...>" or, if you
	   really do want whitespace to	match nothing at all, remove the
	   directive completely and change the rule definition to a token
	   definition.

       "Ignoring useless <ws: %s > directive in	a token	definition"
	   The "<ws:...>" directive is used to specify what whitespace matches
	   within a rule. Since	whitespace never matches anything inside
	   tokens, putting a "<ws:...>" directive in a token is a waste of
	   time.

	   Either remove the useless directive,	or else	change the surrounding
	   token definition to a rule definition.

       Regexp::Grammars requires no configuration files or environment
       variables.

       This module only	works under Perl 5.10 or later.

       This module is likely to	be incompatible	with any other module that
       automagically rewrites regexes. For example it may conflict with
       Regexp::DefaultFlags, Regexp::DeferredExecution,	or Regexp::Extended.

       No bugs have been reported.

       Please report any bugs or feature requests to
       "", or through the web interface at

       Damian Conway

       Copyright (c) 2009, Damian Conway. All rights reserved.

       This module is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself. See	perlartistic.



perl v5.32.1			  2021-03-01		   Regexp::Grammars(3)

