Text::Balanced(3)     User Contributed Perl Documentation    Text::Balanced(3)

       Text::Balanced -	Extract	delimited text sequences from strings.

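           use Text::Balanced qw (
               extract_delimited
               extract_bracketed
               extract_quotelike
               extract_codeblock
               extract_variable
               extract_tagged
               extract_multiple
               gen_delimited_pat
               gen_extract_tagged
           );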

	   # Extract the initial substring of $text that is delimited by
	   # two (unescaped) instances of the first character in $delim.

	   ($extracted,	$remainder) = extract_delimited($text,$delim);

	   # Extract the initial substring of $text that is bracketed
	   # with a delimiter(s) specified by $delim (where the	string
	   # in	$delim contains	one or more of '(){}[]<>').

	   ($extracted,	$remainder) = extract_bracketed($text,$delim);

	   # Extract the initial substring of $text that is bounded by
	   # an	XML tag.

	   ($extracted,	$remainder) = extract_tagged($text);

           # Extract the initial substring of $text that is bounded by
           # a BEGIN...END pair. Don't allow nested BEGIN tags.

           ($extracted, $remainder) =
               extract_tagged($text, "BEGIN", "END", undef, {reject => ["BEGIN"]});

	   # Extract the initial substring of $text that represents a
	   # Perl "quote or quote-like operation"

	   ($extracted,	$remainder) = extract_quotelike($text);

	   # Extract the initial substring of $text that represents a block
	   # of	Perl code, bracketed by	any of character(s) specified by $delim
	   # (where the	string $delim contains one or more of '(){}[]<>').

	   ($extracted,	$remainder) = extract_codeblock($text,$delim);

	   # Extract the initial substrings of $text that would	be extracted by
	   # one or more sequential applications of the	specified functions
	   # or	regular	expressions

           @extracted = extract_multiple($text,
                                         [ \&extract_bracketed,
                                           \&extract_quotelike,
                                           qr/[xyz]*/,
                                           'literal',
                                         ]);

	   # Create a string representing an optimized pattern (a la Friedl)
	   # that matches a substring delimited	by any of the specified	characters
	   # (in this case: any	type of	quote or a slash)

	   $patstring =	gen_delimited_pat(q{'"`/});

	   # Generate a	reference to an	anonymous sub that is just like	extract_tagged
	   # but pre-compiled and optimized for	a specific pair	of tags, and
	   # consequently much faster (i.e. 3 times faster). It	uses qr// for better
	   # performance on repeated calls.

	   $extract_head = gen_extract_tagged('<HEAD>','</HEAD>');
	   ($extracted,	$remainder) = $extract_head->($text);

       The various "extract_..." subroutines may be used to extract a
       delimited substring, possibly after skipping a specified	prefix string.
       By default, that	prefix is optional whitespace ("/\s*/"), but you can
       change it to whatever you wish (see below).

       The substring to	be extracted must appear at the	current	"pos" location
       of the string's variable	(or at index zero, if no "pos" position	is
       defined).  In other words, the "extract_..." subroutines	don't extract
       the first occurrence of a substring anywhere in a string	(like an
       unanchored regex	would).	Rather,	they extract an	occurrence of the
       substring appearing immediately at the current matching position	in the
       string (like a "\G"-anchored regex would).

   General Behaviour in	List Contexts
       In a list context, all the subroutines return a list, the first three
       elements	of which are always:

       [0] The extracted string, including the specified delimiters.  If the
	   extraction fails "undef" is returned.

       [1] The remainder of the	input string (i.e. the characters after	the
	   extracted string). On failure, the entire string is returned.

       [2] The skipped prefix (i.e. the	characters before the extracted
	   string).  On	failure, "undef" is returned.

       Note that in a list context, the	contents of the	original input text
       (the first argument) are	not modified in	any way.

       However,	if the input text was passed in	a variable, that variable's
       "pos" value is updated to point at the first character after the
       extracted text. That means that in a list context the various
       subroutines can be used much like regular expressions. For example:

           while ( $next = (extract_quotelike($text))[0] )
           {
               # process next quote-like (in $next)
           }
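
       As a minimal sketch of this list-context idiom (the sample string and
       variable names are made up for illustration), the loop below collects
       consecutive quoted strings with "extract_delimited". The text itself
       is left untouched and only its "pos" advances:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Hypothetical sample input: three quoted strings followed by other text.
my $text     = q{"alpha" 'beta' "gamma" tail};
my $original = $text;
my @found;

# List context: take element [0] (the extracted string) on each pass.
# The loop ends when extraction fails and element [0] is undef.
while ( defined( my $next = (extract_delimited($text, q{"'}))[0] ) )
{
    push @found, $next;
}

# $text is unmodified; pos($text) points just past the last extraction.
my $unparsed = substr($text, pos($text));

print "found: @found\n";
print "unparsed: '$unparsed'\n";
```

       Testing with "defined" rather than plain truth avoids ending the loop
       early on an extracted string such as "0".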

   General Behaviour in	Scalar and Void	Contexts
       In a scalar context, the extracted string is returned, having first
       been removed from the input text. Thus, the following code also
       processes each quote-like operation, but actually removes them from
       $text:

           while ( $next = extract_quotelike($text) )
           {
               # process next quote-like (in $next)
           }

       Note that if the	input text is a	read-only string (i.e. a literal), no
       attempt is made to remove the extracted text.

       In a void context the behaviour of the extraction subroutines is
       exactly the same	as in a	scalar context,	except (of course) that	the
       extracted substring is not returned.
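
       A small sketch of the scalar-context behaviour, using
       "extract_delimited" on a made-up input string. Each successful call
       removes the extracted string (and the skipped whitespace prefix) from
       the variable:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Hypothetical sample input.
my $text = q{"alpha" "beta" tail};
my @found;

# Scalar context: each successful call returns the extracted string and
# deletes it (plus any skipped whitespace prefix) from $text.
while ( defined( my $next = extract_delimited($text, q{"}) ) )
{
    push @found, $next;
}

print "found: @found\n";
print "left in \$text: '$text'\n";
```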

   A Note About	Prefixes
       Prefix patterns are matched without any trailing	modifiers ("/gimsox"
       etc.)  This can bite you	if you're expecting a prefix specification
       like '.*?(?=<H1>)' to skip everything up	to the first <H1> tag. Such a
       prefix pattern will only	succeed	if the <H1> tag	is on the current
       line, since . normally doesn't match newlines.

       To overcome this	limitation, you	need to	turn on	/s matching within the
       prefix pattern, using the "(?s)"	directive: '(?s).*?(?=<H1>)'
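
       The difference can be sketched with "extract_tagged" and a made-up
       multi-line input (the variable names are illustrative only). The
       first call is expected to fail because the prefix cannot cross the
       newlines; the second succeeds:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Hypothetical sample input: the <H1> tag is not on the first line.
my $text = "intro line\nmore intro\n<H1>Title</H1> rest";

# Without (?s), '.' cannot cross the newlines, so the prefix never reaches <H1>.
my @plain = extract_tagged($text, '<H1>', '</H1>', '.*?(?=<H1>)');

# With (?s), the prefix skips everything up to the tag.
my @dotall = extract_tagged($text, '<H1>', '</H1>', '(?s).*?(?=<H1>)');

print defined $plain[0] ? "plain prefix matched\n" : "plain prefix failed\n";
print "extracted: $dotall[0]\n";
```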

	   The "extract_delimited" function formalizes the common idiom	of
	   extracting a	single-character-delimited substring from the start of
	   a string. For example, to extract a single-quote delimited string,
	   the following code is typically used:

	       ($remainder = $text) =~ s/\A('(\\.|[^'])*')//s;
	       $extracted = $1;

	   but with "extract_delimited"	it can be simplified to:

	       ($extracted,$remainder) = extract_delimited($text, "'");

	   "extract_delimited" takes up	to four	scalars	(the input text, the
	   delimiters, a prefix	pattern	to be skipped, and any escape
	   characters) and extracts the	initial	substring of the text that is
	   appropriately delimited. If the delimiter string has	multiple
	   characters, the first one encountered in the	text is	taken to
	   delimit the substring.  The third argument specifies	a prefix
	   pattern that	is to be skipped (but must be present!)	before the
	   substring is	extracted.  The	final argument specifies the escape
	   character to	be used	for each delimiter.

	   All arguments are optional. If the escape characters	are not
	   specified, every delimiter is escaped with a	backslash ("\").  If
	   the prefix is not specified,	the pattern '\s*' - optional
	   whitespace -	is used. If the	delimiter set is also not specified,
	   the set "/["'`]/" is	used. If the text to be	processed is not
	   specified either, $_	is used.
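
           As an illustration of all four arguments together (a sketch on a
           made-up input, not taken from the module's own examples), the
           call below skips a non-default prefix and treats a doubled quote
           as the escape:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Hypothetical sample input using Pascal-style doubled quotes.
my $text = q{key = 'it''s fine' # comment};

# Arguments: text, delimiter, prefix pattern (anything that is not a
# single quote), escape character (the quote itself).
my ($string, $rest, $prefix) =
    extract_delimited($text, q{'}, q{[^']*}, q{'});

print "string: $string\n";
print "prefix: '$prefix'\n";
print "rest:   '$rest'\n";
```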

           In list context, "extract_delimited" returns an array of three
	   elements, the extracted substring (including	the surrounding
	   delimiters),	the remainder of the text, and the skipped prefix (if
	   any). If a suitable delimited substring is not found, the first
	   element of the array	is the empty string, the second	is the
	   complete original text, and the prefix returned in the third
	   element is an empty string.

	   In a	scalar context,	just the extracted substring is	returned. In a
	   void	context, the extracted substring (and any prefix) are simply
	   removed from	the beginning of the first argument.


	       # Remove	a single-quoted	substring from the very	beginning of $text:

		   $substring =	extract_delimited($text, "'", '');

	       # Remove	a single-quoted	Pascalish substring (i.e. one in which
	       # doubling the quote character escapes it) from the very
	       # beginning of $text:

		   $substring =	extract_delimited($text, "'", '', "'");

	       # Extract a single- or double- quoted substring from the
	       # beginning of $text, optionally	after some whitespace
	       # (note the list	context	to protect $text from modification):

		   ($substring)	= extract_delimited $text, q{"'};

	       # Delete	the substring delimited	by the first '/' in $text:

                   $text = join '', (extract_delimited($text,'/','[^/]*'))[2,1];

	   Note	that this last example is not the same as deleting the first
	   quote-like pattern. For instance, if	$text contained	the string:

	       "if ('./cmd' =~ m/$UNIXCMD/s) { $cmd = $1; }"

           then after the deletion it would contain:

               "if ('.$UNIXCMD/s) { $cmd = $1; }"

           not:

               "if ('./cmd' =~ ms) { $cmd = $1; }"

	   See "extract_quotelike" for a (partial) solution to this problem.

	   Like	"extract_delimited", the "extract_bracketed" function takes up
	   to three optional scalar arguments: a string	to extract from, a
	   delimiter specifier,	and a prefix pattern. As before, a missing
	   prefix defaults to optional whitespace and a	missing	text defaults
	   to $_. However, a missing delimiter specifier defaults to
	   '{}()[]<>' (see below).

	   "extract_bracketed" extracts	a balanced-bracket-delimited substring
	   (using any one (or more) of the user-specified delimiter brackets:
	   '(..)', '{..}', '[..]', or '<..>'). Optionally it will also respect
	   quoted unbalanced brackets (see below).

           A "delimiter bracket" is a bracket in the list of delimiters passed as
	   "extract_bracketed"'s second	argument. Delimiter brackets are
	   specified by	giving either the left or right	(or both!) versions of
	   the required	bracket(s). Note that the order	in which two or	more
	   delimiter brackets are specified is not significant.

	   A "balanced-bracket-delimited substring" is a substring bounded by
	   matched brackets, such that any other (left or right) delimiter
	   bracket within the substring	is also	matched	by an opposite (right
	   or left) delimiter bracket at the same level	of nesting. Any	type
           of bracket not in the delimiter list is treated as an ordinary
           character.

           In other words, each type of bracket specified as a delimiter must
           be balanced and correctly nested within the substring, and any
           other kind of ("non-delimiter") bracket in the substring is
           ignored.
	   For example,	given the string:

	       $text = "{ an '[irregularly :-(]	{} parenthesized >:-)' string }";

	   then	a call to "extract_bracketed" in a list	context:

	       @result = extract_bracketed( $text, '{}'	);

	   would return:

	       ( "{ an '[irregularly :-(] {} parenthesized >:-)' string	}" , ""	, "" )

	   since both sets of '{..}' brackets are properly nested and evenly
	   balanced.  (In a scalar context just	the first element of the array
	   would be returned. In a void	context, $text would be	replaced by an
	   empty string.)

	   Likewise the	call in:

	       @result = extract_bracketed( $text, '{['	);

	   would return	the same result, since all sets	of both	types of
	   specified delimiter brackets	are correctly nested and balanced.

	   However, the	call in:

	       @result = extract_bracketed( $text, '{([<' );

	   would fail, returning:

	       ( undef , "{ an '[irregularly :-(] {} parenthesized >:-)' string	}"  );

	   because the embedded	pairs of '(..)'s and '[..]'s are "cross-
	   nested" and the embedded '>'	is unbalanced. (In a scalar context,
	   this	call would return an empty string. In a	void context, $text
	   would be unchanged.)

	   Note	that the embedded single-quotes	in the string don't help in
	   this	case, since they have not been specified as acceptable
	   delimiters and are therefore	treated	as non-delimiter characters
	   (and	ignored).

	   However, if a particular species of quote character is included in
	   the delimiter specification,	then that type of quote	will be
           correctly handled.  For example, if $text is:

               $text = '<A HREF=">>>>">link</A>';

           then

               @result = extract_bracketed( $text, '<">' );

           returns:

               ( '<A HREF=">>>>">', 'link</A>', "" )

           as expected. Without the specification of """ as an embedded
           quoter:

	       @result = extract_bracketed( $text, '<>'	);

	   the result would be:

	       ( '<A HREF=">', '>>>">link</A>',	"" )
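
           The two calls can be run side by side as below. This is only a
           sketch of the example above; the one addition is the "pos" reset,
           needed because a list-context call advances the match position
           of the input variable:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $text = '<A HREF=">>>>">link</A>';

# With '"' in the delimiter set, brackets inside double quotes are ignored.
my @quoted = extract_bracketed($text, '<">');

# Reset the match position before reusing the same variable.
pos($text) = 0;

# With only '<>' every '>' counts, so extraction stops inside the attribute.
my @naive = extract_bracketed($text, '<>');

print "quote-aware: $quoted[0]\n";
print "naive:       $naive[0]\n";
```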

	   In addition to the quote delimiters "'", """, and "`", full Perl
	   quote-like quoting (i.e. q{string}, qq{string}, etc)	can be
	   specified by	including the letter 'q' as a delimiter. Hence:

	       @result = extract_bracketed( $text, '<q>' );

	   would correctly match something like	this:

	       $text = '<leftop: conj /and/ conj>';

	   See also: "extract_quotelike" and "extract_codeblock".

	   "extract_variable" extracts any valid Perl variable or variable-
	   involved expression,	including scalars, arrays, hashes, array
	   accesses, hash look-ups, method calls through objects, subroutine
	   calls through subroutine references,	etc.

	   The subroutine takes	up to two optional arguments:

           1.  A string to be processed ($_ if the string is omitted or
               "undef")

	   2.  A string	specifying a pattern to	be matched as a	prefix (which
	       is to be	skipped). If omitted, optional whitespace is skipped.

	   On success in a list	context, an array of 3 elements	is returned.
	   The elements	are:

	   [0] the extracted variable, or variablish expression

	   [1] the remainder of	the input text,

	   [2] the prefix substring (if	any),

           On failure, all of these values (except the remaining text) are
           "undef".

	   In a	scalar context,	"extract_variable" returns just	the complete
	   substring that matched a variablish expression. "undef" is returned
	   on failure. In addition, the	original input text has	the returned
	   substring (and any prefix) removed from it.

	   In a	void context, the input	text just has the matched substring
	   (and	any specified prefix) removed.
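
           A minimal sketch of a list-context call, assuming a made-up input
           in which a hash look-up is followed by an assignment:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_variable);

# Hypothetical sample input: a hash look-up followed by an assignment.
my $text = '$config{port} = 8080;';

my ($variable, $rest, $prefix) = extract_variable($text);

print "variable: $variable\n";
print "rest:     '$rest'\n";
```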

	   "extract_tagged" extracts and segments text between (balanced)
	   specified tags.

	   The subroutine takes	up to five optional arguments:

           1.  A string to be processed ($_ if the string is omitted or
               "undef")

	   2.  A string	specifying a pattern to	be matched as the opening tag.
	       If the pattern string is	omitted	(or "undef") then a pattern
	       that matches any	standard XML tag is used.

	   3.  A string	specifying a pattern to	be matched at the closing tag.
	       If the pattern string is	omitted	(or "undef") then the closing
	       tag is constructed by inserting a "/" after any leading bracket
	       characters in the actual	opening	tag that was matched (not the
	       pattern that matched the	tag). For example, if the opening tag
	       pattern is specified as '{{\w+}}' and actually matched the
	       opening tag "{{DATA}}", then the	constructed closing tag	would
	       be "{{/DATA}}".

	   4.  A string	specifying a pattern to	be matched as a	prefix (which
	       is to be	skipped). If omitted, optional whitespace is skipped.

	   5.  A hash reference	containing various parsing options (see	below)

	   The various options that can	be specified are:

	   "reject => $listref"
	       The list	reference contains one or more strings specifying
	       patterns	that must not appear within the	tagged text.

	       For example, to extract an HTML link (which should not contain
	       nested links) use:

		       extract_tagged($text, '<A>', '</A>', undef, {reject => ['<A>']} );

	   "ignore => $listref"
	       The list	reference contains one or more strings specifying
	       patterns	that are not to	be treated as nested tags within the
	       tagged text (even if they would match the start tag pattern).

	       For example, to extract an arbitrary XML	tag, but ignore
	       "empty" elements:

		       extract_tagged($text, undef, undef, undef, {ignore => ['<[^>]*/>']} );

	       (also see "gen_delimited_pat" below).

	   "fail => $str"
	       The "fail" option indicates the action to be taken if a
	       matching	end tag	is not encountered (i.e. before	the end	of the
	       string or some "reject" pattern matches). By default, a failure
               to match a closing tag causes "extract_tagged" to immediately
               fail.

               However, if the string value associated with "fail" is "MAX",
               then "extract_tagged" returns the complete text up to the point
               of failure.  If the string is "PARA", "extract_tagged" returns
               only the first paragraph after the tag (up to the first line
               that is either empty or contains only whitespace characters).
               If the string is "", the default behaviour (i.e. failure) is
               reinstated.

	       For example, suppose the	start tag "/para" introduces a
	       paragraph, which	then continues until the next "/endpara" tag
	       or until	another	"/para"	tag is encountered:

		       $text = "/para line 1\n\nline 3\n/para line 4";

                       extract_tagged($text, '/para', '/endpara', undef,
                                      {reject => '/para', fail => 'MAX'} );

                       # EXTRACTED: "/para line 1\n\nline 3\n"

               Suppose instead, that if no matching "/endpara" tag is found,
               the "/para" tag refers only to the immediately following
               paragraph:

                       $text = "/para line 1\n\nline 3\n/para line 4";

                       extract_tagged($text, '/para', '/endpara', undef,
                                      {reject => '/para', fail => 'PARA'} );

                       # EXTRACTED: "/para line 1\n"

	       Note that the specified "fail" behaviour	applies	to nested tags
	       as well.
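
           The three "fail" behaviours can be compared in one script. This
           is a sketch built on the sample text above; the variable names
           are illustrative, and the "reject" pattern is passed as a list
           reference as described for that option:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

my $source = "/para line 1\n\nline 3\n/para line 4";

# fail => 'MAX': keep everything up to the point where matching failed.
my $text = $source;
my ($max) = extract_tagged($text, '/para', '/endpara', undef,
                           { reject => ['/para'], fail => 'MAX' });

# fail => 'PARA': keep only the first paragraph after the opening tag.
$text = $source;
my ($para) = extract_tagged($text, '/para', '/endpara', undef,
                            { reject => ['/para'], fail => 'PARA' });

# No "fail" option: the missing end tag makes the extraction fail.
$text = $source;
my ($default) = extract_tagged($text, '/para', '/endpara', undef,
                               { reject => ['/para'] });

printf "MAX:     %d characters\n", length $max;
printf "PARA:    %d characters\n", length $para;
printf "default: %s\n", defined $default ? 'extracted' : 'failed';
```

           Assigning a fresh copy of the source before each call also resets
           the match position left behind by the previous list-context call.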

	   On success in a list	context, an array of 6 elements	is returned.
	   The elements	are:

	   [0] the extracted tagged substring (including the outermost tags),

	   [1] the remainder of	the input text,

	   [2] the prefix substring (if	any),

	   [3] the opening tag

	   [4] the text	between	the opening and	closing	tags

	   [5] the closing tag (or "" if no closing tag	was found)

           On failure, all of these values (except the remaining text) are
           "undef".

	   In a	scalar context,	"extract_tagged" returns just the complete
	   substring that matched a tagged text	(including the start and end
	   tags). "undef" is returned on failure. In addition, the original
           input text has the returned substring (and any prefix) removed from
           it.

	   In a	void context, the input	text just has the matched substring
	   (and	any specified prefix) removed.

	   "gen_extract_tagged"	generates a new	anonymous subroutine which
	   extracts text between (balanced) specified tags. In other words, it
	   generates a function	identical in function to "extract_tagged".

	   The difference between "extract_tagged" and the anonymous
	   subroutines generated by "gen_extract_tagged", is that those
	   generated subroutines:

	   o   do not have to reparse tag specification	or parsing options
	       every time they are called (whereas "extract_tagged" has	to
	       effectively rebuild its tag parser on every call);

	   o   make use	of the new qr//	construct to pre-compile the regexes
	       they use	(whereas "extract_tagged" uses standard	string
	       variable	interpolation to create	tag-matching patterns).

	   The subroutine takes	up to four optional arguments (the same	set as
	   "extract_tagged" except for the string to be	processed). It returns
	   a reference to a subroutine which in	turn takes a single argument
	   (the	text to	be extracted from).

	   In other words, the implementation of "extract_tagged" is exactly
	   equivalent to:

                   sub extract_tagged
                   {
                           my $text = shift;
                           my $extractor = gen_extract_tagged(@_);
                           return $extractor->($text);
                   }

	   (although "extract_tagged" is not currently implemented that	way).

	   Using "gen_extract_tagged" to create	extraction functions for
	   specific tags is a good idea	if those functions are going to	be
	   called more than once, since	their performance is typically twice
	   as good as the more general-purpose "extract_tagged".
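
           A sketch of the intended usage pattern: build the extractor once,
           then call it repeatedly. The tag names and the sample text here
           are made up for illustration:

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_extract_tagged);

# Build the extractor once...
my $extract_item = gen_extract_tagged('<li>', '</li>');

# ...then reuse it for every item (hypothetical sample input).
my $text = '<li>one</li> <li>two</li> <li>three</li>';
my @items;

# List context: element [0] is the tagged substring, undef on failure.
while ( defined( my $item = ($extract_item->($text))[0] ) )
{
    push @items, $item;
}

print "$_\n" for @items;
```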

	   "extract_quotelike" attempts	to recognize, extract, and segment any
	   one of the various Perl quotes and quotelike	operators (see
           perlop(3)). Nested backslashed delimiters, embedded balanced bracket
	   delimiters (for the quotelike operators), and trailing modifiers
	   are all caught. For example,	in:

		   extract_quotelike 'q	# an octothorpe: \# (not the end of the	q!) #'

		   extract_quotelike '	"You said, \"Use sed\"."  '

		   extract_quotelike ' s{([A-Z]{1,8}\.[A-Z]{3})} /\L$1\E/; '

		   extract_quotelike ' tr/\\\/\\\\/\\\//ds; '

	   the full Perl quotelike operations are all extracted	correctly.

	   Note	too that, when using the /x modifier on	a regex, any comment
	   containing the current pattern delimiter will cause the regex to be
	   immediately terminated. In other words:

                   'm /
                           (?i)            # CASE INSENSITIVE
                           [a-z_]          # LEADING ALPHABETIC/UNDERSCORE
                           [a-z0-9]*       # FOLLOWED BY ANY NUMBER OF ALPHANUMERICS
                      /x'

	   will	be extracted as	if it were:

		   'm /
			   (?i)		   # CASE INSENSITIVE
			   [a-z_]	   # LEADING ALPHABETIC/'

	   This	behaviour is identical to that of the actual compiler.

	   "extract_quotelike" takes two arguments: the	text to	be processed
	   and a prefix	to be matched at the very beginning of the text. If no
	   prefix is specified,	optional whitespace is the default. If no text
	   is given, $_	is used.

	   In a	list context, an array of 11 elements is returned. The
	   elements are:

           [0] the extracted quotelike substring (including trailing
               modifiers),

	   [1] the remainder of	the input text,

	   [2] the prefix substring (if	any),

	   [3] the name	of the quotelike operator (if any),

	   [4] the left	delimiter of the first block of	the operation,

	   [5] the text	of the first block of the operation (that is, the
	       contents	of a quote, the	regex of a match or substitution or
	       the target list of a translation),

	   [6] the right delimiter of the first	block of the operation,

	   [7] the left	delimiter of the second	block of the operation (that
	       is, if it is a "s", "tr", or "y"),

           [8] the text of the second block of the operation (that is, the
               replacement of a substitution or the translation list of a
               translation),

           [9] the right delimiter of the second block of the operation (if
               any),

           [10]
               the trailing modifiers on the operation (if any).

	   For each of the fields marked "(if any)" the	default	value on
	   success is an empty string.	On failure, all	of these values
	   (except the remaining text) are "undef".

	   In a	scalar context,	"extract_quotelike" returns just the complete
	   substring that matched a quotelike operation	(or "undef" on
	   failure). In	a scalar or void context, the input text has the same
	   substring (and any specified	prefix)	removed.
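
           To make the field layout concrete, here is a sketch that pulls a
           few of the eleven elements out of a made-up substitution (the
           variable names are illustrative only):

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_quotelike);

# Hypothetical sample input: a substitution followed by more code.
my $text = 's/foo/bar/gi; print';

my @parts = extract_quotelike($text);

# Pick out the whole match, the remainder, the operator name, the first
# and second blocks, and the trailing modifiers.
my ($whole, $rest, $op, $pattern, $replacement, $modifiers)
    = @parts[0, 1, 3, 5, 8, 10];

print "operator:    $op\n";
print "pattern:     $pattern\n";
print "replacement: $replacement\n";
print "modifiers:   $modifiers\n";
```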


		   # Remove the	first quotelike	literal	that appears in	text

			   $quotelike =	extract_quotelike($text,'.*?');

		   # Replace one or more leading whitespace-separated quotelike
		   # literals in $_ with "<QLL>"

			   do {	$_ = join '<QLL>', (extract_quotelike)[2,1] } until $@;

		   # Isolate the search	pattern	in a quotelike operation from $text

			   ($op,$pat) =	(extract_quotelike $text)[3,5];
                           if ($op =~ /[ms]/)
                           {
                                   print "search pattern: $pat\n";
                           }
                           else
                           {
                                   print "$op is not a pattern matching operation\n";
                           }

	   "extract_quotelike" can successfully	extract	"here documents" from
	   an input string, but	with an	important caveat in list contexts.

	   Unlike other	types of quote-like literals, a	here document is
	   rarely a contiguous substring. For example, a typical piece of code
           using a here document might look like this:

                   <<'EOMSG' || die;
                   This is the message.
                   EOMSG
                   exit;

	   Given this as an input string in a scalar context,
	   "extract_quotelike" would correctly return the string
	   "<<'EOMSG'\nThis is the message.\nEOMSG", leaving the string	" ||
	   die;\nexit;"	in the original	variable. In other words, the two
           separate pieces of the here document are successfully extracted and
           concatenated.

	   In a	list context, "extract_quotelike" would	return the list

	   [0] "<<'EOMSG'\nThis	is the message.\nEOMSG\n" (i.e.	the full
	       extracted here document,	including fore and aft delimiters),

           [1] " || die;\nexit;" (i.e. the remainder of the input text,
               concatenated),

	   [2] "" (i.e.	the prefix substring --	trivial	in this	case),

	   [3] "<<" (i.e. the "name" of	the quotelike operator)

	   [4] "'EOMSG'" (i.e. the left	delimiter of the here document,
	       including any quotes),

	   [5] "This is	the message.\n"	(i.e. the text of the here document),

           [6] "EOMSG" (i.e. the right delimiter of the here document),

           [7..10]
               "" (a here document has no second left delimiter, second text,
               second right delimiter, or trailing modifiers).

	   However, the	matching position of the input variable	would be set
	   to "exit;" (i.e. after the closing delimiter	of the here document),
	   which would cause the earlier " || die;\nexit;" to be skipped in
	   any sequence	of code	fragment extractions.

	   To avoid this problem, when it encounters a here document whilst
	   extracting from a modifiable	string,	"extract_quotelike" silently
	   rearranges the string to an equivalent piece	of Perl:

                   <<'EOMSG'
                   This is the message.
                   EOMSG
                   || die;
                   exit;

	   in which the	here document is contiguous. It	still leaves the
	   matching position after the here document, but now the rest of the
	   line	on which the here document starts is not skipped.

           To prevent "extract_quotelike" from mucking about with the input in
	   this	way (this is the only case where a list-context
	   "extract_quotelike" does so), you can pass the input	variable as an
	   interpolated	literal:

		   $quotelike =	extract_quotelike("$var");

	   "extract_codeblock" attempts	to recognize and extract a balanced
	   bracket delimited substring that may	contain	unbalanced brackets
	   inside Perl quotes or quotelike operations. That is,
	   "extract_codeblock" is like a combination of	"extract_bracketed"
	   and "extract_quotelike".

	   "extract_codeblock" takes the same initial three parameters as
	   "extract_bracketed":	a text to process, a set of delimiter brackets
	   to look for,	and a prefix to	match first. It	also takes an optional
	   fourth parameter, which allows the outermost	delimiter brackets to
	   be specified	separately (see	below).

	   Omitting the	first argument (input text) means process $_ instead.
	   Omitting the	second argument	(delimiter brackets) indicates that
	   only	'{' is to be used.  Omitting the third argument	(prefix
	   argument) implies optional whitespace at the	start.	Omitting the
	   fourth argument (outermost delimiter	brackets) indicates that the
           value of the second argument is to be used for the outermost
           delimiters.

	   Once	the prefix and the outermost opening delimiter bracket have
	   been	recognized, code blocks	are extracted by stepping through the
	   input text and trying the following alternatives in sequence:

	   1.  Try and match a closing delimiter bracket. If the bracket was
	       the same	species	as the last opening bracket, return the
	       substring to that point.	If the bracket was mismatched, return
	       an error.

	   2.  Try to match a quote or quotelike operator. If found, call
	       "extract_quotelike" to eat it. If "extract_quotelike" fails,
	       return the error	it returned. Otherwise go back to step 1.

	   3.  Try to match an opening delimiter bracket. If found, call
	       "extract_codeblock" recursively to eat the embedded block. If
	       the recursive call fails, return	an error. Otherwise, go	back
	       to step 1.

	   4.  Unconditionally match a bareword	or any other single character,
	       and then	go back	to step	1.
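
           The effect of these steps can be seen by running
           "extract_bracketed" and "extract_codeblock" on the same input.
           The sample text below is made up for illustration; its string
           literal contains an unbalanced closing brace:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed extract_codeblock);

# Hypothetical sample input: the string literal contains an unbalanced '}'.
my $source = '{ print "}"; { 1 } } tail';

# extract_bracketed counts every brace, so it stops at the quoted '}'.
my $text = $source;
my ($bracketed) = extract_bracketed($text, '{}');

# extract_codeblock skips the quoted string (step 2) and recurses into
# the nested block (step 3) before matching the final '}' (step 1).
$text = $source;
my ($block, $rest) = extract_codeblock($text, '{}');

print "bracketed: $bracketed\n";
print "codeblock: $block\n";
print "rest:      '$rest'\n";
```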


		   # Find a while loop in the text

                           if ($text =~ s/.*?while\s*\{/{/)
                           {
                                   $loop = "while " . extract_codeblock($text);
                           }

		   # Remove the	first round-bracketed list (which may include
		   # round- or curly-bracketed code blocks or quotelike	operators)

			   extract_codeblock $text, "(){}", '[^(]*';

	   The ability to specify a different outermost	delimiter bracket is
	   useful in some circumstances. For example, in the Parse::RecDescent
	   module, parser actions which	are to be performed only on a
           successful parse are specified using a "<defer:...>" directive. For
           example:

                   sentence: subject verb object
                                   <defer: {$::theVerb = $item{verb}} >

           Parse::RecDescent uses "extract_codeblock($text, '{}<>')" to
           extract the code within the "<defer:...>" directive, but there's a
           problem.
	   A deferred action like this:

				   <defer: {if ($count>10) {$count--}} >

	   will	be incorrectly parsed as:

				   <defer: {if ($count>

           because the "greater than" operator is interpreted as a closing
           delimiter.

           But, by extracting the directive using
           "extract_codeblock($text, '{}', undef, '<>')" the '>' character
           is only treated as a delimiter at the outermost level of the code
           block, so the directive is parsed correctly.

	   The "extract_multiple" subroutine takes a string to be processed
	   and a list of extractors (subroutines or regular expressions) to
	   apply to that string.

	   In an array context "extract_multiple" returns an array of
	   substrings of the original string, as extracted by the specified
	   extractors.	In a scalar context, "extract_multiple"	returns	the
	   first substring successfully	extracted from the original string. In
	   both	scalar and void	contexts the original string has the first
	   successfully	extracted substring removed from it. In	all contexts
	   "extract_multiple" starts at	the current "pos" of the string, and
	   sets	that "pos" appropriately after it matches.

	   Hence, the aim of a call to "extract_multiple" in a list context is
	   to split the	processed string into as many non-overlapping fields
	   as possible,	by repeatedly applying each of the specified
	   extractors to the remainder of the string. Thus "extract_multiple"
	   is a	generalized form of Perl's "split" subroutine.

	   The subroutine takes	up to four optional arguments:

           1.  A string to be processed ($_ if the string is omitted or
               "undef")

	   2.  A reference to a	list of	subroutine references and/or qr//
	       objects and/or literal strings and/or hash references,
	       specifying the extractors to be used to split the string. If
	       this argument is	omitted	(or "undef") the list:

                               [
                                   sub { extract_variable($_[0], '') },
                                   sub { extract_quotelike($_[0],'') },
                                   sub { extract_codeblock($_[0],'{}','') },
                               ]

	       is used.

           3.  A number specifying the maximum number of fields to return. If
	       this argument is	omitted	(or "undef"), split continues as long
	       as possible.

	       If the third argument is	N, then	extraction continues until N
	       fields have been	successfully extracted,	or until the string
	       has been	completely processed.

	       Note that in scalar and void contexts the value of this
	       argument	is automatically reset to 1 (under "-w", a warning is
	       issued if the argument has to be	reset).

	   4.  A value indicating whether unmatched substrings (see below)
	       within the text should be skipped or returned as	fields.	If the
               value is true, such substrings are skipped. Otherwise, they are
               returned.

	   The extraction process works	by applying each extractor in sequence
	   to the text string.

	   If the extractor is a subroutine it is called in a list context and
	   is expected to return a list	of a single element, namely the
	   extracted text. It may optionally also return two further
	   arguments: a	string representing the	text left after	extraction
	   (like $' for	a pattern match), and a	string representing any	prefix
	   skipped before the extraction (like $` in a pattern match). Note
	   that	this is	designed to facilitate the use of other	Text::Balanced
	   subroutines with "extract_multiple".	Note too that the value
	   returned by an extractor subroutine need not	bear any relationship
           to the corresponding substring of the original text (see examples
           below).

	   If the extractor is a precompiled regular expression	or a string,
	   it is matched against the text in a scalar context with a leading
	   '\G'	and the	gc modifiers enabled. The extracted value is either $1
	   if that variable is defined after the match,	or else	the complete
	   match (i.e. $&).

	   If the extractor is a hash reference, it must contain exactly one
	   element.  The value of that element is one of the above extractor
	   types (subroutine reference,	regular	expression, or string).	 The
	   key of that element is the name of a	class into which the
	   successful return value of the extractor will be blessed.

	   If an extractor returns a defined value, that value is immediately
	   treated as the next extracted field and pushed onto the list	of
	   fields.  If the extractor was specified in a	hash reference,	the
	   field is also blessed into the appropriate class.
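
	   A minimal sketch of how blessed fields might be told apart from
	   plain ones. The class names "Delim" and "Brack" and the input
	   string are arbitrary; the sketch assumes that a blessed field is
	   a reference to a scalar holding the extracted text, so it is
	   dereferenced with "$$field".

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_delimited extract_bracketed);

# Hypothetical input: plain text, a quoted string, more text, a braced block.
my $text = q{a'hi'b{now}};

my @fields = extract_multiple($text, [
    { Delim => sub { extract_delimited($_[0], q{'"}, '') } },
    { Brack => sub { extract_bracketed($_[0], '{}', '') } },
]);

for my $field (@fields) {
    if (ref $field) {
        # Blessed field: ref() gives the class name from the hash key.
        printf "%-5s %s\n", ref($field), $$field;
    }
    else {
        # Unmatched text comes back as an ordinary string.
        printf "%-5s %s\n", 'plain', $field;
    }
}
```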

	   If the extractor fails to match (in the case	of a regex extractor),
	   or returns an empty list or an undefined value (in the case of a
	   subroutine extractor), it is	assumed	to have	failed to extract.  If
	   none	of the extractor subroutines succeeds, then one	character is
	   extracted from the start of the text	and the	extraction subroutines
	   reapplied. Characters which are thus	removed	are accumulated	and
	   eventually become the next field (unless the	fourth argument	is
	   true, in which case they are	discarded).

	   For example, the following extracts substrings that are valid Perl
	   variables:

		   @fields = extract_multiple($text,
					      [	sub { extract_variable($_[0]) }	],
					      undef, 1);

	   This	example	separates a text into fields which are quote
	   delimited, curly bracketed, and anything else. The delimited	and
	   bracketed parts are also blessed to identify	them (the "anything
	   else" is unblessed):

		   @fields = extract_multiple($text,
				   [
				     { Delim => sub { extract_delimited($_[0],q{'"}) } },
				     { Brack => sub { extract_bracketed($_[0],'{}') } },
				   ]);

	   This	call extracts the next single substring	that is	a valid	Perl
	   quotelike operator (and removes it from $text):

		   $quotelike = extract_multiple($text,
						 [
						   sub { extract_quotelike($_[0]) },
						 ], undef, 1);

	   Finally, here is yet another way to do comma-separated value
	   parsing:

		   @fields = extract_multiple($csv_text,
			   [ sub { extract_delimited($_[0],q{'"}) }, qr/([^,]+)/ ],
			   undef, 1);

	   The list in the second argument means: "Try and extract a ' or "
	   delimited string, otherwise extract anything	up to a	comma...".
	   The undef third argument means: "...as many times as possible...",
	   and the true	value in the fourth argument means "...discarding
	   anything else that appears (i.e. the	commas)".

	   If you wanted the commas preserved as separate fields (i.e. like
	   split does if your split pattern has	capturing parentheses),	you
	   would just make the last parameter undefined	(or remove it).
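
	   The difference between discarding and preserving the commas can
	   be sketched as below. The pattern "qr/([^,]+)/" is an assumption
	   chosen to mean "anything up to a comma", and "csv_fields" is a
	   hypothetical helper that copies the line so that each call starts
	   at offset 0. The sample line deliberately contains no whitespace
	   around the commas.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_delimited);

# Quoted strings first, then (as an assumption) any run of non-comma text.
my $extractors = [
    sub { extract_delimited($_[0], q{'"}) },
    qr/([^,]+)/,
];

# Hypothetical helper: $skip_commas becomes the fourth argument.
sub csv_fields {
    my ($line, $skip_commas) = @_;
    return extract_multiple($line, $extractors, undef, $skip_commas);
}

my $line = q{a,"b,c",d};

my @without_commas = csv_fields($line, 1);   # commas discarded
my @with_commas    = csv_fields($line);      # commas kept as fields

print join('|', @without_commas), "\n";
print join('|', @with_commas),    "\n";
```

	   Note that the quoted field keeps its surrounding quotes.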

	   The "gen_delimited_pat" subroutine takes a single (string) argument
	   and builds a Friedl-style optimized regex that matches a string
	   delimited by	any one	of the characters in the single	argument. For


	   returns the regex:


	   Note	that the specified delimiters are automatically	quotemeta'd.
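
	   A short sketch of using the returned pattern string inside a
	   regex. The exact text of the pattern may differ between versions
	   of the module, so the sketch prints it rather than assuming it;
	   the sample input is made up.

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_delimited_pat);

# The pattern is returned as a string, not as a compiled regex.
my $pat = gen_delimited_pat(q{'"});

# Hypothetical input: a double-quoted string with backslash-escaped quotes.
my $input = q{"say \"hi\"" and more};

# Interpolate the pattern and anchor it at the start of the input.
my ($quoted) = $input =~ /\A($pat)/;

print "pattern: $pat\n";
print "matched: $quoted\n";
```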

	   A typical use of "gen_delimited_pat"	would be to build special
	   purpose tags	for "extract_tagged". For example, to properly ignore
	   "empty" XML elements	(which might contain quoted strings):

		   my $empty_tag = '<('	. gen_delimited_pat(q{'"}) . '|.)+/>';

		   extract_tagged($text, undef,	undef, undef, {ignore => [$empty_tag]} );

	   "gen_delimited_pat" may also	be called with an optional second
	   argument, which specifies the "escape" character(s) to be used for
	   each	delimiter.  For	example	to match a Pascal-style	string (where
	   ' is	the delimiter and '' is	a literal ' within the string):


	   Different escape characters can be specified	for different
	   delimiters.	For example, to	specify	that '/' is the	escape for
	   single quotes and '%' is the	escape for double quotes:


	   If more delimiters than escape chars	are specified, the last	escape
	   char	is used	for the	remaining delimiters.  If no escape char is
	   specified for a given delimiter, '\' is used.
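
	   The two escape conventions described above might be exercised
	   like this. The calls follow the two-argument form described in
	   the text; the sample strings are made up.

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_delimited_pat);

# Pascal style: ' is the delimiter and '' is a literal ' inside the string.
my $pascal = gen_delimited_pat(q{'}, q{'});
my ($pascal_string) = q{'it''s' := x} =~ /\A($pascal)/;

# Per-delimiter escapes: '/' escapes single quotes, '%' escapes double quotes.
my $mixed = gen_delimited_pat(q{'"}, q{/%});
my ($single) = q{'a/'b' rest} =~ /\A($mixed)/;
my ($double) = q{"x%"y" rest} =~ /\A($mixed)/;

print "$pascal_string\n";
print "$single\n";
print "$double\n";
```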

	   Note that "gen_delimited_pat" was previously called
	   "delimited_pat".  That name may still be used, but is now
	   deprecated.

       In a list context, all the functions return "(undef,$original_text)" on
       failure.	In a scalar context, failure is	indicated by returning "undef"
       (in this	case the input text is not modified in any way).

       In addition, on failure in any context, the $@ variable is set.
       Accessing "$@->{error}" returns one of the error diagnostics listed
       below.  Accessing "$@->{pos}" returns the offset into the original
       string at which the error was detected (although not necessarily where
       it occurred!).  Printing $@ directly produces the error message, with
       the offset appended.  On success, the $@ variable is guaranteed to be
       "undef".
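
       A minimal sketch of checking for failure in a list context and
       reading the error details. The input strings are hypothetical. The
       sketch copies $@ straight after the call, since later code could
       overwrite it.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# A well-formed input: extraction succeeds.
my $good = '(a(b)c) tail';
my ($extracted, $remainder) = extract_bracketed($good, '()');
print "extracted: $extracted\n";

# A malformed input: the opening bracket is never closed.
my $bad = '(never closed';
my ($failed, $untouched) = extract_bracketed($bad, '()');
my $error = $@;    # copy immediately

unless (defined $failed) {
    print "diagnostic: $error->{error}\n";
    print "offset:     $error->{pos}\n";
    print "message:    $error\n";    # stringifies to the full message
}
```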

       The available diagnostics are:

       "Did not	find a suitable	bracket: "%s""
	   The delimiter provided to "extract_bracketed" was not one of
	   '()[]<>{}'.

       "Did not	find prefix: /%s/"
	   A non-optional prefix was specified but wasn't found	at the start
	   of the text.

       "Did not	find opening bracket after prefix: "%s""
	   "extract_bracketed" or "extract_codeblock" was expecting a
	   particular kind of bracket at the start of the text,	and didn't
	   find	it.

       "No quotelike operator found after prefix: "%s""
	   "extract_quotelike" didn't find one of the quotelike	operators "q",
	   "qq", "qw", "qx", "s", "tr" or "y" at the start of the substring it
	   was extracting.

       "Unmatched closing bracket: "%c""
	   "extract_bracketed",	"extract_quotelike" or "extract_codeblock"
	   encountered a closing bracket where none was	expected.

       "Unmatched opening bracket(s): "%s""
	   "extract_bracketed",	"extract_quotelike" or "extract_codeblock" ran
	   out of characters in	the text before	closing	one or more levels of
	   nested brackets.

       "Unmatched embedded quote (%s)"
	   "extract_bracketed" attempted to match an embedded quoted
	   substring, but failed to find a closing quote to match it.

       "Did not	find closing delimiter to match	'%s'"
	   "extract_quotelike" was unable to find a closing delimiter to match
	   the one that	opened the quote-like operation.

       "Mismatched closing bracket: expected "%c" but found "%s""
	   "extract_bracketed",	"extract_quotelike" or "extract_codeblock"
	   found a valid bracket delimiter, but	it was the wrong species. This
	   usually indicates a nesting error, but may indicate incorrect
	   quoting or escaping.

       "No block delimiter found after quotelike "%s""
	   "extract_quotelike" or "extract_codeblock" found one	of the
	   quotelike operators "q", "qq", "qw",	"qx", "s", "tr"	or "y" without
	   a suitable block after it.

       "Did not	find leading dereferencer"
	   "extract_variable" was expecting one	of '$',	'@', or	'%' at the
	   start of a variable,	but didn't find	any of them.

       "Bad identifier after dereferencer"
	   "extract_variable" found a '$', '@',	or '%' indicating a variable,
	   but that character was not followed by a legal Perl identifier.

       "Did not	find expected opening bracket at %s"
	   "extract_codeblock" failed to find any of the outermost opening
	   brackets that were specified.

       "Improperly nested codeblock at %s"
	   A nested code block was found that started with a delimiter that
	   was specified as being only to be used as an	outermost bracket.

       "Missing	second block for quotelike "%s""
	   "extract_codeblock" or "extract_quotelike" found one	of the
	   quotelike operators "s", "tr" or "y"	followed by only one block.

       "No match found for opening bracket"
	   "extract_codeblock" failed to find a	closing	bracket	to match the
	   outermost opening bracket.

       "Did not	find opening tag: /%s/"
	   "extract_tagged" did	not find a suitable opening tag	(after any
	   specified prefix was	removed).

       "Unable to construct closing tag	to match: /%s/"
	   "extract_tagged" matched the	specified opening tag and tried	to
	   modify the matched text to produce a	matching closing tag (because
	   none	was specified).	It failed to generate the closing tag, almost
	   certainly because the opening tag did not start with	a bracket of
	   some	kind.

       "Found invalid nested tag: %s"
	   "extract_tagged" found a nested tag that appeared in	the "reject"
	   list	(and the failure mode was not "MAX" or "PARA").

       "Found unbalanced nested	tag: %s"
	   "extract_tagged" found a nested opening tag that was	not matched by
	   a corresponding nested closing tag (and the failure mode was	not
	   "MAX" or "PARA").

       "Did not	find closing tag"
	   "extract_tagged" reached the	end of the text	without	finding	a
	   closing tag to match	the original opening tag (and the failure mode
	   was not "MAX" or "PARA").
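
       As an illustration, two of the diagnostics above can be provoked
       deliberately. The helper "diagnostic_for" and the inputs are
       hypothetical; the helper runs an extraction in a list context and
       returns "$@->{error}" if the first returned value is undefined.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed extract_variable);

# Hypothetical helper: returns the diagnostic text, or undef on success.
sub diagnostic_for {
    my ($code) = @_;
    my @result = $code->();          # list context
    return defined $result[0] ? undef : $@->{error};
}

my $not_a_variable = 'foo';    # no leading $, @ or %
my $wrong_closer   = '(a]';    # "(" closed by "]"

my $no_deref = diagnostic_for(sub { extract_variable($not_a_variable) });
my $mismatch = diagnostic_for(sub { extract_bracketed($wrong_closer, '()[]') });

print "$no_deref\n";
print "$mismatch\n";
```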

       The following symbols are, or can be, exported by this module:

       Default Exports

       Optional	Exports
	   "extract_delimited",	"extract_bracketed", "extract_quotelike",
	   "extract_codeblock",	"extract_variable", "extract_tagged",
	   "extract_multiple", "gen_delimited_pat", "gen_extract_tagged",

       Export Tags
	       "extract_delimited", "extract_bracketed", "extract_quotelike",
	       "extract_codeblock", "extract_variable",	"extract_tagged",
	       "extract_multiple", "gen_delimited_pat",	"gen_extract_tagged",


       Patches,	bug reports, suggestions or any	other feedback is welcome.

       Patches can be sent as GitHub pull requests at

       Bug reports and suggestions can be made on the CPAN Request Tracker at

       Currently active	requests on the	CPAN Request Tracker can be viewed at

       Please test this	distribution.  See CPAN	Testers	Reports	at
       <> for details of how to get	involved.

       Previous	test results on	CPAN Testers Reports can be viewed at

       Please rate this	distribution on	CPAN Ratings at

       The latest version of this module is available from CPAN	(see "CPAN" in
       perlmodlib for details) at

       <> or

       <> or


       The latest source code is available from	GitHub at

       See the INSTALL file.

       Damian Conway.

       Steve Hay is now maintaining Text::Balanced as of version 2.03.

       Copyright (C) 1997-2001 Damian Conway.  All rights reserved.

       Copyright (C) 2009 Adam Kennedy.

       Copyright (C) 2015, 2020	Steve Hay.  All	rights reserved.

       This module is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself, i.e. under the terms of either the
       GNU General Public License or the Artistic License, as specified	in the
       LICENCE file.

       Version 2.04

       11 Dec 2020

       See the Changes file.

perl v5.32.1			  2020-12-11		     Text::Balanced(3)

