PPIx::Regexp(3)	      User Contributed Perl Documentation      PPIx::Regexp(3)

NAME
       PPIx::Regexp - Represent	a regular expression of	some sort

SYNOPSIS
	use PPIx::Regexp;
	use PPIx::Regexp::Dumper;
	my $re = PPIx::Regexp->new( 'qr{foo}smx' );
	PPIx::Regexp::Dumper->new( $re )
	    ->print();

DEPRECATION NOTICE
       The postderef argument to new() is being	put through a deprecation
       cycle and retracted. After the retraction, postfix dereferences will
       always be recognized. This is the default behaviour now.

       Starting	with version 0.074_01, the first use of	this argument will
       warn. With the first release after April	1 2021,	all uses will warn.
       After a further six months, all uses will become	fatal.

INHERITANCE
       "PPIx::Regexp" is a PPIx::Regexp::Node.

       "PPIx::Regexp" has no descendants.

DESCRIPTION
       The purpose of the PPIx-Regexp package is to parse regular expressions
       in a manner similar to the way the PPI package parses Perl. This	class
       forms the root of the parse tree, playing a role	similar	to
       PPI::Document.

       This package shares with	PPI the	property of being round-trip safe.
       That is,

	my $expr = 's/ ( \d+ ) ( \D+ ) /$2$1/smxg';
	my $re = PPIx::Regexp->new( $expr );
	print $re->content() eq $expr ? "yes\n" : "no\n";

       should print 'yes' for any valid	regular	expression.

       Navigation is similar to	that provided by PPI. That is to say, things
       like "children",	"find_first", "snext_sibling" and so on	all work
       pretty much the same way	as in PPI.
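
       As a minimal sketch of this kind of navigation (the sample regular
       expression and the variable names are illustrative assumptions, and
       "find" is assumed to behave as described in PPIx::Regexp::Node), one
       might list the top-level children of the root and count the capture
       groups below it:

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Illustrative input; any qr{}, m// or s/// string should do.
my $re = PPIx::Regexp->new( 'qr{(\d+)-(\w+)}smx' )
    or die PPIx::Regexp->errstr();

# Walk the immediate children of the root of the parse tree and show
# the class and content of each.
foreach my $child ( $re->children() ) {
    printf "%-40s %s\n", ref $child, $child->content();
}

# find() returns an array reference on success and a false value
# otherwise, so fall back to an empty array before dereferencing.
my $captures = $re->find( 'PPIx::Regexp::Structure::Capture' ) || [];
print scalar @{ $captures }, " capture group(s) found\n";
```

       The "|| []" fallback keeps the dereference safe when nothing matches.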

       The class hierarchy is also similar to PPI. Except for some utility
       classes (the dumper, the	lexer, and the tokenizer) all classes are
       descended from PPIx::Regexp::Element, which provides basic navigation.
       Tokens are descended from PPIx::Regexp::Token, which provides content.
       All containers are descended from PPIx::Regexp::Node, which provides
       for children, and all structure elements	are descended from
       PPIx::Regexp::Structure,	which provides beginning and ending
       delimiters, and a type.

       There are two features of PPI that this package does not	provide	-
       mutability and operator overloading. There are no plans for serious
       mutability, though something like PPI's "prune" functionality might be
       considered. Similarly there are no plans	for operator overloading,
       which appears to	the author to represent	a performance hit for little
       tangible	gain.

NOTICE
       The author will attempt to preserve the documented interface, but if
       the interface needs to change to	correct	some egregiously bad design or
       implementation decision,	then it	will change.  Any incompatible changes
       will go through a deprecation cycle.

       The goal	of this	package	is to parse well-formed	regular	expressions
       correctly. A secondary goal is not to blow up on	ill-formed regular
       expressions. The	correct	identification and characterization of ill-
       formed regular expressions is not a goal	of this	package, nor is	the
       consistent parsing of ill-formed	regular	expressions from release to
       release.

       This policy attempts to track features in development releases as well
       as public releases. However, features added in a	development release
       and then	removed	before the next	production release will	not be
       tracked,	and any	functionality relating to such features	will be
       removed.	The issue here is the potential	re-use (with different
       semantics) of syntax that did not make it into the production release.

       From time to time the Perl regular expression engine changes in ways
       that change the parse of	a given	regular	expression. When these changes
       occur, "PPIx::Regexp" will be changed to	produce	the more modern	parse.
       Known examples of this include:

       $( no longer interpolates as of Perl 5.005, per "perl5005delta".
	   Newer Perls seem to parse this as "qr{$}" (i.e. an end-of-string or
	   newline assertion) followed by an open parenthesis, and that	is
	   what	"PPIx::Regexp" does.

       $) and $| also seem to parse as the "$" assertion
	   followed by the relevant meta-character, though I have no
	   documentation reference for this.

       "@+" and	"@-" no	longer interpolate as of Perl 5.9.4
	   per "perl594delta". Subsequent Perls	treat "@+" as a	quantified
	   literal and "@-" as two literals, and that is what "PPIx::Regexp"
	   does. Note that subscripted references to these arrays do
	   interpolate,	and are	so parsed by "PPIx::Regexp".

       Only space and horizontal tab are whitespace as of Perl 5.23.4
	   when	inside a bracketed character class inside an extended
	   bracketed character class, per "perl5234delta". Formerly any	white
	   space character parsed as whitespace. This change in	"PPIx::Regexp"
	   will	be reverted if the change in Perl does not make	it into	Perl
	   5.24.0.

       Unescaped literal left curly brackets
	   These are being removed in positions	where quantifiers are legal,
	   so that they	can be used for	new functionality. Some	of them	are
	   gone	in 5.25.1, others will be removed in a future version of Perl.
	   In situations where they have been removed, perl_version_removed()
	   will	return the version in which they were removed. When the	new
	   functionality appears, the parse produced by	this software will
	   reflect the new functionality.

	   NOTE	that the situation with	a literal left curly after a literal
	   character is	complicated. It	was made an error in Perl 5.25.1, and
	   remained so through all 5.26	releases, but became a warning again
	   in 5.27.1 due to its	use in GNU Autoconf. Whether it	will ever
	   become illegal again	is not clear to	me based on the	contents of
	   perl5271delta. At the moment	perl_version_removed() returns
	   "undef", but	obviously that is not the whole	story, and methods
	   accepts_perl() and requirements_for_perl() were introduced to deal
	   with	this complication.

       "\o{...}"
	   is parsed as	the octal equivalent of	"\x{...}". This	is its meaning
	   as of perl 5.13.2. Before 5.13.2 it was simply literal 'o' and so
	   on.

       There are very probably other examples of this. When they come to light
       they will be documented as producing the	modern parse, and the code
       modified	to produce this	parse if necessary.

METHODS
       This class provides the following public	methods. Methods not
       documented here are private, and	unsupported in the sense that the
       author reserves the right to change or remove them without notice.

   new
	my $re = PPIx::Regexp->new('/foo/');

       This method instantiates	a "PPIx::Regexp" object	from a string, a
       PPI::Token::QuoteLike::Regexp, a	PPI::Token::Regexp::Match, or a
       PPI::Token::Regexp::Substitute.	Honestly, any PPI::Element will	work,
       but only	the three Regexp classes mentioned previously are likely to do
       anything	useful.

       Whatever	form the argument takes, it is assumed to consist entirely of
       a valid match, substitution, or "qr<>" string.

       Optionally you can pass one or more name/value pairs after the regular
       expression. The possible	options	are:

       default_modifiers array_reference
	   This	option specifies a reference to	an array of default modifiers
	   to apply to the regular expression being parsed. Each modifier is
	   specified as	a string. Any actual modifiers found supersede the
	   defaults.

	   When	applying the defaults, '?' and '/' are completely ignored, and
	   '^' is ignored unless it occurs at the beginning of the modifier.
	   The first dash ('-')	causes subsequent modifiers to be negated.

	   So, for example, if you wish	to produce a "PPIx::Regexp" object
	   representing	the regular expression in

	    use	re '/smx';
	    {
	       no re '/x';
	       m/ foo /;
	    }

	   you would (after some help from PPI in finding the relevant
	   statements),	do something like

	    my $re = PPIx::Regexp->new(	'm/ foo	/',
		default_modifiers => [ '/smx', '-/x' ] );

       encoding	name
	   This	option specifies the encoding of the regular expression. This
	   is passed to	the tokenizer, which will "decode" the regular
	   expression string before it tokenizes it. For example:

	    my $re = PPIx::Regexp->new(	'/foo/',
		encoding => 'iso-8859-1',
	    );

       index_locations Boolean
	   This	Boolean	option specifies whether the locations of the elements
	   in the regular expression should be indexed.

	   If unspecified or specified as "undef" a default value is used.
	   This	default	is true	if the argument	is a PPI::Element or the
	   "location" option was specified. Otherwise the default is false.

       location	array_reference
	   This	option specifies the location of the new object	in the
	   document from which it was created. It is a reference to a five-
	   element array compatible with that returned by the "location()"
	   method of PPI::Element.

	   If not specified, the location of the original string is used if it
	   was specified as a PPI::Element.

	   If no location can be determined, the various "location()" methods
	   will	return "undef".

       postderef Boolean
	   THIS	ARGUMENT IS DEPRECATED.	 See DEPRECATION NOTICE	above for the
	   details.

	   This	option is passed on to the tokenizer, where it specifies
	   whether postfix dereferences	are recognized in interpolations and
	   code. This experimental feature was introduced in Perl 5.19.5.

	   As of version 0.074_01, the default is true.	 Through release
	   0.074, the default was the value of
	   $PPIx::Regexp::Tokenizer::DEFAULT_POSTDEREF,	which was true.	When
	   originally introduced this was false, but was documented as
	   becoming true when and if postfix dereferencing became mainstream.
	   The	intent to mainstream was announced with	Perl 5.23.1, and
	   became official (so to speak) with Perl 5.24.0, so the default
	   became true with PPIx::Regexp 0.049_01.

	   Note	that if	PPI starts unconditionally recognizing postfix
	   dereferences, this argument will immediately	become ignored,	and
	   will	be put through a deprecation cycle and removed.

       strict Boolean
	   This	option is passed on to the tokenizer and lexer,	where it
	   specifies whether the parse should assume "use re 'strict'" is in
	   effect.

	   The 'strict'	pragma was introduced in Perl 5.22, and	its
	   documentation says that it is experimental, and that	there is no
	   commitment to backward compatibility. The same applies to the parse
	   produced when this option is	asserted. Also,	the usual caveat
	   applies: if "use re 'strict'" ends up being retracted, this option
	   and all related functionality will be also.

	   Given the nature of "use re 'strict'", you should expect that if
	   you assert this option, regular expressions that previously parsed
	   without error might no longer do so.	If an element ends up being
	   declared an error because this option is set, its
	   "perl_version_introduced()" will be the Perl	version	at which "use
	   re 'strict'"	started	rejecting these	elements.

	   The default is false.

       trace number
	   If greater than zero, this option causes trace output from the
	   parse.  The author reserves the right to change or eliminate	this
	   without notice.

       Passing optional	input other than the above is not an error, but
       neither is it supported.

   new_from_cache
       This static method wraps	"new" in a caching mechanism. Only one object
       will be generated for a given PPI::Element, no matter how many times
       this method is called. Calls after the first for	a given	PPI::Element
       simply return the same "PPIx::Regexp" object.

       When the	"PPIx::Regexp" object is returned from cache, the values of
       the optional arguments are ignored.

       Calls to	this method with the regular expression	in a string rather
       than a PPI::Element will	not be cached.

       Caveat: This method is provided for code	like Perl::Critic which	might
       instantiate the same object multiple times. The cache will persist
       until "flush_cache" is called.

   flush_cache
	$re->flush_cache();	       # Remove	$re from cache
	PPIx::Regexp->flush_cache();   # Empty the cache

       This method flushes the cache used by "new_from_cache". If called as a
       static method with no arguments,	the entire cache is emptied. Otherwise
       any objects specified are removed from the cache.
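
       The following sketch illustrates the caching behaviour described for
       "new_from_cache" and "flush_cache". The sample Perl source and the
       variable names are assumptions made for illustration; the regular
       expression is located with PPI's "find_first":

```perl
use strict;
use warnings;
use PPI::Document;
use PPIx::Regexp;
use Scalar::Util qw{ refaddr };

# Illustrative Perl source containing a single match operator.
my $code = 'my $ok = $text =~ m/foo/;';
my $doc  = PPI::Document->new( \$code )
    or die 'PPI could not parse the sample code';
my $elem = $doc->find_first( 'PPI::Token::Regexp::Match' )
    or die 'No match operator found';

# Two calls for the same PPI::Element should yield the same object.
my $first  = PPIx::Regexp->new_from_cache( $elem );
my $second = PPIx::Regexp->new_from_cache( $elem );
my $cached = refaddr( $first ) == refaddr( $second );
print $cached ? "same object\n" : "different objects\n";

# After removing the object from the cache, the next call builds a
# fresh object. $first is still held, so the addresses must differ.
$first->flush_cache();
my $third   = PPIx::Regexp->new_from_cache( $elem );
my $rebuilt = refaddr( $third ) != refaddr( $first );
print $rebuilt ? "rebuilt after flush\n" : "still cached\n";
```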

   capture_names
	foreach	my $name ( $re->capture_names()	) {
	    print "Capture name	'$name'\n";
	}

       This convenience	method returns the capture names found in the regular
       expression.

       This method is equivalent to

	$self->regular_expression()->capture_names();

       except that if "$self->regular_expression()" returns "undef" (meaning
       that something went terribly wrong with the parse) this method will
       simply return.

   delimiters
	print join("\t", PPIx::Regexp->new('s/foo/bar/')->delimiters());
	# prints '//	  //'

       When called in list context, this method	returns	either one or two
       strings,	depending on whether the parsed	expression has a replacement
       string. In the case of non-bracketed substitutions, the start delimiter
       of the replacement string is considered to be the same as its finish
       delimiter, as illustrated by the	above example.

       When called in scalar context, you get the delimiters of	the regular
       expression; that	is, element 0 of the array that	is returned in list
       context.

       Optionally, you can pass	an index value and the corresponding
       delimiters will be returned; index 0 represents the regular
       expression's delimiters,	and index 1 represents the replacement
       string's	delimiters, which may be undef.	For example,

	print PPIx::Regexp->new('s{foo}<bar>')->delimiters(1);
	# prints '<>'

       If the object was not initialized with a	valid regexp of	some sort, the
       results of this method are undefined.

   errstr
       This static method returns the error string from	the most recent
       attempt to instantiate a	"PPIx::Regexp".	It will	be "undef" if the most
       recent attempt succeeded.

   extract_regexps
	my $doc	= PPI::Document->new( $path );
	$doc->index_locations();
	my @res = PPIx::Regexp->extract_regexps( $doc );

       This convenience	(well, sort-of)	static method takes as its argument a
       PPI::Document object and	returns	"PPIx::Regexp" objects corresponding
       to all regular expressions found	in it, in the order in which they
       occur in	the document. You will need to keep a reference	to the
       original	PPI::Document object if	you wish to be able to recover the
       original	PPI::Element objects via the PPIx::Regexp source() method.
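
       A slightly fuller sketch of the same idea follows. The sample source
       text is an assumption made for illustration, and the document is built
       from a scalar reference rather than a file path. The PPI::Document is
       kept in scope so that source() can still return the original elements:

```perl
use strict;
use warnings;
use PPI::Document;
use PPIx::Regexp;

# Illustrative source text with two regular expressions in it.
my $code = <<'EOD';
my $word = qr{ \w+ }smx;
$line =~ s/foo/bar/g;
EOD

my $doc = PPI::Document->new( \$code )
    or die 'PPI could not parse the sample code';
$doc->index_locations();

# One PPIx::Regexp object per regular expression, in document order.
my @res = PPIx::Regexp->extract_regexps( $doc );
foreach my $re ( @res ) {
    # source() is the PPI element the object was built from.
    printf "%-35s %s\n", ref $re->source(), $re->content();
}
```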

   failures
	print "There were ", $re->failures(), "	parse failures\n";

       This method returns the number of parse failures. This is a count of
       the number of unknown tokens plus the number of unterminated structures
       plus the	number of unmatched right brackets of any sort.
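
       One way to combine "errstr" and "failures" is sketched below. The
       helper name "parse_or_report" is hypothetical and not part of this
       package; it merely separates a failure to instantiate from a parse
       that completed with failures:

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Hypothetical helper: returns the parsed object, or nothing if no
# object could be instantiated at all.
sub parse_or_report {
    my ( $source ) = @_;
    my $re = PPIx::Regexp->new( $source );
    if ( ! $re ) {
        # errstr() describes the most recent failed instantiation.
        warn 'Could not instantiate: ', PPIx::Regexp->errstr(), "\n";
        return;
    }
    if ( my $count = $re->failures() ) {
        # The object exists, but parts of it did not parse cleanly.
        warn "Parsed '$source' with $count failure(s)\n";
    }
    return $re;
}

my $good = parse_or_report( 'qr{foo}smx' );
print 'Failures: ', $good->failures(), "\n";
```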

   max_capture_number
	print "Highest used capture number ",
	    $re->max_capture_number(), "\n";

       This convenience	method returns the highest capture number used by the
       regular expression. If there are	no captures, the return	will be	0.

       This method is equivalent to

	$self->regular_expression()->max_capture_number();

       except that if "$self->regular_expression()" returns "undef" (meaning
       that something went terribly wrong with the parse) this method will
       too.

   modifier
	my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
	print $re->modifier()->content(), "\n";
	# prints 'smx'.

       This method retrieves the modifier of the object. This comes from the
       end of the initializing string or object	and will be a
       PPIx::Regexp::Token::Modifier.

       Note that this object represents	the actual modifiers present on	the
       regexp, and does	not take into account any that may have	been applied
       by default (i.e.	via the	"default_modifiers" argument to	"new()"). For
       something that takes account of default modifiers, see
       modifier_asserted(), below.

       In the event of a parse failure,	there may not be a modifier present,
       in which	case nothing is	returned.

   modifier_asserted
	my $re = PPIx::Regexp->new( '/ . /',
	    default_modifiers => [ 'smx' ] );
	print $re->modifier_asserted( 'x' ) ? "yes\n" :	"no\n";
	# prints 'yes'.

       This method returns true	if the given modifier is asserted for the
       regexp, whether explicitly or by	the modifiers passed in	the
       "default_modifiers" argument.

       Starting	with version 0.036_01, if the argument is a single-character
       modifier	followed by an asterisk	(intended as a wild card character),
       the return is the number	of times that modifier appears.	In this	case
       an exception will be thrown if you specify a multi-character modifier
       (e.g.  'ee*'), or if you	specify	one of the match semantics modifiers
       (e.g.  'a*').
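
       The difference between "modifier" and "modifier_asserted" can be
       sketched as follows, reusing the "default_modifiers" values from the
       description of "new" above. The explicit "i" modifier and the variable
       names are assumptions added for illustration:

```perl
use strict;
use warnings;
use PPIx::Regexp;

# Defaults as if "use re '/smx'" and then "no re '/x'" were in effect.
my $re = PPIx::Regexp->new( 'm/ foo /i',
    default_modifiers => [ '/smx', '-/x' ] )
    or die PPIx::Regexp->errstr();

# modifier() reflects only what is written on the regexp itself.
my $explicit = $re->modifier()->content();
print "Explicit modifiers: $explicit\n";

# modifier_asserted() also takes the defaults into account.
my %asserted;
foreach my $mod ( qw{ s m x i } ) {
    $asserted{$mod} = $re->modifier_asserted( $mod ) ? 1 : 0;
    print "$mod is ", ( $asserted{$mod} ? '' : 'not ' ), "asserted\n";
}
```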

   regular_expression
	my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
	print $re->regular_expression()->content(), "\n";
	# prints '/(foo)/'.

       This method returns that	portion	of the object which actually
       represents a regular expression.

   replacement
	my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
	print $re->replacement()->content(), "\n";
	# prints '${1}bar/'.

       This method returns that	portion	of the object which represents the
       replacement string. This	will be	"undef"	unless the regular expression
       actually	has a replacement string. Delimiters will be included, but
       there will be no	beginning delimiter unless the regular expression was
       bracketed.

   source
	my $source = $re->source();

       This method returns the object or string	that was used to instantiate
       the object.

   type
	my $re = PPIx::Regexp->new( 's/(foo)/${1}bar/smx' );
	print $re->type()->content(), "\n";
	# prints 's'.

       This method retrieves the type of the object. This comes	from the
       beginning of the	initializing string or object, and will	be a
       PPIx::Regexp::Token::Structure whose "content" is one of	's', 'm',
       'qr', or	''.
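
       Taken together, the pieces returned by "type", "regular_expression",
       "replacement" and "modifier" should account for the whole of a simple
       expression such as the one used in the examples above. The following
       sketch (variable names are illustrative) reassembles them:

```perl
use strict;
use warnings;
use PPIx::Regexp;

my $expr = 's/(foo)/${1}bar/smx';
my $re   = PPIx::Regexp->new( $expr )
    or die PPIx::Regexp->errstr();

# Collect the four top-level pieces, skipping any that are absent
# (for example, the replacement of a plain match).
my @parts = map { $_->content() } grep { defined } (
    $re->type(),
    $re->regular_expression(),
    $re->replacement(),
    $re->modifier(),
);

my $rebuilt = join '', @parts;
print $rebuilt eq $expr ? "pieces match\n" : "pieces differ\n";
```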

RESTRICTIONS
       By the nature of	this module, it	is never going to get everything
       right.  Many of the known problem areas involve interpolations one way
       or another.

   Ambiguous Syntax
       Perl's regular expressions contain cases	where the syntax is ambiguous.
       A particularly egregious	example	is an interpolation followed by	square
       or curly	brackets, for example $foo[...]. There is nothing in the
       syntax to say whether the programmer wanted to interpolate an element
       of array	@foo, or whether he wanted to interpolate scalar $foo, and
       then follow that	interpolation by a character class.

       The perlop documentation	notes that in this case	what Perl does is to
       guess. That is, it employs various heuristics on	the code to try	to
       figure out what the programmer wanted. These heuristics are documented
       as being	undocumented (!) and subject to	change without notice. As an
       example of the problems even perl faces in parsing Perl,	see
       <https://github.com/perl/perl5/issues/16478>.

       Given this situation, this module's chances of duplicating every	Perl
       version's interpretation	of every regular expression are	pretty much
       nil.  What it does now is to assume that	square brackets	containing
       only an integer or an interpolation represent a subscript; otherwise
       they represent a	character class. Similarly, curly brackets containing
       only a bareword or an interpolation are a subscript; otherwise they
       represent a quantifier.
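
       The effect of this rule can be observed with a sketch like the
       following, which prints what the module takes to be the interpolation
       in each of two illustrative inputs. The inputs and variable names are
       assumptions, and the result may differ between releases:

```perl
use strict;
use warnings;
use PPIx::Regexp;

my %seen;
foreach my $source ( 'm/$foo[1]/', 'm/$foo[a-z]/' ) {
    my $re = PPIx::Regexp->new( $source )
        or die PPIx::Regexp->errstr();

    # The first interpolation token shows how far the module thinks
    # the interpolated expression extends.
    my $interp = $re->find_first( 'PPIx::Regexp::Token::Interpolation' );
    $seen{$source} = $interp ? $interp->content() : '';
    printf "%-14s interpolates %s\n", $source,
        $seen{$source} eq '' ? '(nothing)' : $seen{$source};
}
```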

   Changes in Syntax
       Sometimes the introduction of new syntax	changes	the way	a regular
       expression is parsed. For example, the "\v" character class was
       introduced in Perl 5.9.5. But it	did not	represent a syntax error prior
       to that version of Perl,	it was simply parsed as	"v". So

	$ perl -le 'print "v" =~ m/\v/ ? "yes" : "no"'

       prints "yes" under Perl 5.8.9, but "no" under 5.10.0. "PPIx::Regexp"
       generally assumes the more modern parse in cases	like this.
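
       A sketch of how this surfaces through the API follows. It assumes the
       "perl_version_introduced()" method of PPIx::Regexp::Element and a
       "find_first" callback that receives the candidate element as its
       second argument; the variable names are illustrative:

```perl
use strict;
use warnings;
use PPIx::Regexp;

my $re = PPIx::Regexp->new( 'm/\v/' )
    or die PPIx::Regexp->errstr();

# Locate the element whose content is exactly \v.
my $elem = $re->find_first( sub { $_[1]->content() eq '\\v' } )
    or die 'No \\v element found';

# Under the modern parse this reports when \v became a character
# class, rather than treating it as a literal "v".
printf "%s (%s) introduced in Perl %s\n",
    $elem->content(), ref $elem, $elem->perl_version_introduced();
```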

   Equivocation
       Very occasionally, a construction will be removed and then added	back
       -- and then, conceivably, removed again.	In this	case, the plan is for
       perl_version_introduced() to return the earliest	version	in which the
       construction appeared, and perl_version_removed() to return the version
       after the last version in which it appeared (whether production or
       development), or	"undef"	if it is in the	highest-numbered Perl.

       The constructions involved in this are:

       Un-escaped literal left curly after literal

       That is,	something like "qr<x{>".

       This was	made an	error in 5.25.1, and it	was an error in	5.26.0.	 But
       it became a warning again in 5.27.1. The	perl5271delta says it was re-
       instated	because	the changes broke GNU Autoconf,	and the	warning
       message says it will be removed in Perl 5.30.

       Accordingly, perl_version_introduced() returns 5.0. At the moment
       perl_version_removed() returns '5.025001'. But if it is present with or
       without warning in 5.28,	perl_version_removed() will become "undef". If
       you need	finer resolution than this, see	PPIx::Regexp::Element methods
       accepts_perl() and requirements_for_perl().

   Static Parsing
       It is well known that Perl cannot be statically parsed. That is, you
       cannot completely parse a piece of Perl code without executing that
       same code.

       Nevertheless, this class	is trying to statically	parse regular
       expressions. The	main problem with this is that there is	no way to know
       what is being interpolated into the regular expression by an
       interpolated variable. This is a	problem	because	the interpolated value
       can change the interpretation of	adjacent elements.

       This module deals with this by making assumptions about what is in an
       interpolated variable. These assumptions	will not be enumerated here,
       but in general the principle is to assume the interpolated value does
       not change the interpretation of	the regular expression.	For example,

	my $foo	= 'a-z]';
	my $re = qr{[$foo};

       is fine with the	Perl interpreter, but will confuse the dickens out of
       this module. Similarly and more usefully, something like

	my $mods = 'i';
	my $re = qr{(?$mods:foo)};

       or maybe

	my $mods = 'i';
	my $re = qr{(?$mods)$foo};

       probably	sets a modifier	of some	sort, and that is how this module
       interprets it. If the interpolation is not about	modifiers, this	module
       will get	it wrong. Another such semi-benign example is

	my $foo	= $] >=	5.010 ?	'?<foo>' : '';
	my $re = qr{($foo\w+)};

       which will parse, but this module will never realize that it might be
       looking at a named capture.

   Non-Standard	Syntax
       There are modules out there that	alter the syntax of Perl. If the
       syntax of a regular expression is altered, this module has no way to
       understand that it has been altered, much less to adapt to the
       alteration. The following modules are known to cause problems:

       Acme::PerlML, which renders Perl	as XML.

       "Data::PostfixDeref", which causes Perl to interpret suffixed empty
       brackets	as dereferencing the thing they	suffix.	This module by Ben
       Morrow ("BMORROW") appears to have been retracted.

       Filter::Trigraph, which recognizes ANSI C trigraphs, allowing Perl to
       be written in the ISO 646 character set.

       Perl6::Pugs. Enough said.

       Perl6::Rules, which back-ports some of the Perl 6 regular expression
       syntax to Perl 5.

       Regexp::Extended, which extends regular expressions in various ways,
       some of which seem to conflict with Perl	5.010.

SEE ALSO
       Regexp::Parsertron, which uses Marpa::R2	to parse the regexp, and Tree
       for navigation. Unlike "PPIx::Regexp", Regexp::Parsertron
       supports	modification of	the parse tree.

       Regexp::Parser, which parses a bare regular expression (without
       enclosing "qr{}", "m//",	or whatever) and uses a	different navigation
       model. After a long hiatus, this	module has been	adopted, and is	again
       supported.

SUPPORT
       Support is by the author. Please	file bug reports at
       <https://rt.cpan.org>, or in electronic mail to the author.

AUTHOR
       Thomas R. Wyant,	III wyant at cpan dot org

COPYRIGHT AND LICENSE
       Copyright (C) 2009-2020 by Thomas R. Wyant, III

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl 5.10.0. For	more details, see the full
       text of the licenses in the directory LICENSES.

       This program is distributed in the hope that it will be useful, but
       without any warranty; without even the implied warranty of
       merchantability or fitness for a	particular purpose.

perl v5.32.1			  2020-10-08		       PPIx::Regexp(3)
