re(3)			   Erlang Module Definition			 re(3)

NAME
       re - Perl-like regular expressions for Erlang.

DESCRIPTION
       This  module contains regular expression	matching functions for strings
       and binaries.

       The regular expression syntax and semantics resemble those of Perl.

       The matching algorithms of the library are based	on the	PCRE  library,
       but not all of the PCRE library is interfaced and some parts of the li-
       brary go	beyond what PCRE offers. Currently PCRE	version	8.40  (release
       date  2017-01-11)  is used. The sections	of the PCRE documentation that
       are relevant to this module are included	here.

   Note:
       The Erlang literal syntax for strings uses the "\" (backslash)  charac-
       ter  as	an  escape  code.  You	need  to escape	backslashes in literal
       strings,	both in	your code and in the shell, with an  extra  backslash,
       that is,	"\\".
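
       A minimal sketch of this escaping rule, written as expressions that can
       be pasted into the Erlang shell (the subject string is illustrative and
       not part of the original text):

       ```erlang
       %% The literal "\\d+" holds three characters: \, d and +.
       3 = length("\\d+"),
       %% The regular expression engine therefore sees \d+ (one or more digits).
       R = re:run("abc42", "\\d+").
       %% R is {match,[{3,2}]}: two digits starting at zero-based offset 3.
       ```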

DATA TYPES
       mp() = {re_pattern, term(), term(), term(), term()}

	      Opaque  data type	containing a compiled regular expression. mp()
	      is guaranteed to be a tuple() having the atom re_pattern as  its
	      first element, to	allow for matching in guards. The arity	of the
	      tuple or the content of the other	fields can  change  in	future
	      Erlang/OTP releases.
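
	      As a sketch of how the guaranteed first element can be used in a
	      guard (the fun name IsCompiled is hypothetical; only the
	      re_pattern tag is relied upon, never the other fields):

	      ```erlang
	      %% Guard-friendly check that inspects only the documented first element.
	      IsCompiled = fun(X) when is_tuple(X), element(1, X) =:= re_pattern -> true;
	                      (_) -> false
	                   end,
	      {ok, MP} = re:compile("abc"),
	      Checks = {IsCompiled(MP), IsCompiled("abc")}.
	      %% Checks is {true,false}.
	      ```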

       nl_spec() = cr |	crlf | lf | anycrlf | any

       compile_option()	=
	   unicode |
	   anchored |
	   caseless |
	   dollar_endonly |
	   dotall |
	   extended |
	   firstline |
	   multiline |
	   no_auto_capture |
	   dupnames |
	   ungreedy |
	   {newline, nl_spec()}	|
	   bsr_anycrlf |
	   bsr_unicode |
	   no_start_optimize |
	   ucp |
	   never_utf

EXPORTS
       version() -> binary()

	      Returns a binary string containing the PCRE version of the system
	      that was used in the Erlang/OTP compilation.

       compile(Regexp) -> {ok, MP} | {error, ErrSpec}

	      Types:

		 Regexp	= iodata()
		 MP = mp()
		 ErrSpec =
		     {ErrString	:: string(), Position :: integer() >= 0}

	      The same as compile(Regexp,[]).

       compile(Regexp, Options)	-> {ok,	MP} | {error, ErrSpec}

	      Types:

		 Regexp	= iodata() | unicode:charlist()
		 Options = [Option]
		 Option	= compile_option()
		 MP = mp()
		 ErrSpec =
		     {ErrString	:: string(), Position :: integer() >= 0}

	      Compiles a regular expression, with the syntax described	below,
	      into an internal format to be used later as a parameter to run/2
	      and run/3.

	      Compiling	the regular expression before matching	is  useful  if
	      the  same	 expression is to be used in matching against multiple
	      subjects during the lifetime of the program. Compiling once  and
	      executing	 many  times is	far more efficient than	compiling each
	      time one wants to	match.
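
	      A minimal sketch of the compile-once pattern, assuming an
	      illustrative list of subjects:

	      ```erlang
	      %% Compile once ...
	      {ok, MP} = re:compile("^[a-z]+$"),
	      %% ... then reuse the compiled pattern for every subject.
	      Subjects = ["abc", "ABC", "x1", "hello"],
	      Results = [re:run(S, MP, [{capture, none}]) || S <- Subjects].
	      %% Results is [match,nomatch,nomatch,match].
	      ```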

	      When option unicode is specified,	the regular expression	is  to
	      be  specified  as	 a  valid Unicode charlist(), otherwise	as any
	      valid iodata().

	      Options:

		unicode:
		  The regular expression is specified as a Unicode  charlist()
		  and  the  resulting  regular	expression  code  is to	be run
		  against a valid Unicode charlist()  subject.	Also  consider
		  option ucp when using	Unicode	characters.

		anchored:
		  The  pattern is forced to be "anchored", that	is, it is con-
		  strained to match only at the	first matching	point  in  the
		  string  that is searched (the	"subject string"). This	effect
		  can also be achieved by appropriate constructs in  the  pat-
		  tern itself.

		caseless:
		  Letters  in  the  pattern match both uppercase and lowercase
		  letters. It is equivalent to	Perl  option  /i  and  can  be
		  changed within a pattern by a	(?i) option setting. Uppercase
		  and lowercase	letters	are defined as in the ISO 8859-1 char-
		  acter	set.

		dollar_endonly:
		  A  dollar  metacharacter  in the pattern matches only	at the
		  end of the subject string. Without  this  option,  a	dollar
		  also	matches	immediately before a newline at	the end	of the
		  string (but not before any other newlines). This  option  is
		  ignored if option multiline is specified. There is no	equiv-
		  alent	option in Perl,	and it cannot be set within a pattern.

		dotall:
		  A dot	in the pattern matches all characters, including those
		  indicating  newline.	Without	 it, a dot does	not match when
		  the current position is at a newline.	This option is equiva-
		  lent	to  Perl option	/s and it can be changed within	a pat-
		  tern by a (?s) option	setting. A  negative  class,  such  as
		  [^a],	 always	matches	newline	characters, independent	of the
		  setting of this option.

		extended:
		  If this option is set, most white space  characters  in  the
		  pattern  are totally ignored except when escaped or inside a
		  character class. However, white space	is not allowed	within
		  sequences such as (?> that introduce various parenthesized
		  subpatterns, nor  within  a  numerical  quantifier  such  as
		  {1,3}.  However,  ignorable white space is permitted between
		  an item and a	following quantifier and between a  quantifier
		  and a	following + that indicates possessiveness.

		  White space formerly did not include the VT character (code
		  11), because Perl did not treat this character as white
		  space. However, Perl changed at release 5.18, so PCRE
		  followed at release 8.34, and VT is now treated as white
		  space.

		  This also causes characters between an unescaped # outside a
		  character  class  and	the next newline, inclusive, to	be ig-
		  nored. This is equivalent to Perl's /x option, and it	can be
		  changed within a pattern by a	(?x) option setting.

		  With	this  option, comments inside complicated patterns can
		  be included. However,	notice that this applies only to  data
		  characters.  Whitespace  characters  can never appear	within
		  special character sequences in a pattern, for	example	within
		  sequence (?( that introduces a conditional subpattern.

		firstline:
		  An  unanchored pattern is required to	match before or	at the
		  first	newline	in the subject string,	although  the  matched
		  text can continue over the newline.

		multiline:
		  By  default, PCRE treats the subject string as consisting of
		  a single line	of characters (even if it contains  newlines).
		  The  "start  of  line" metacharacter (^) matches only	at the
		  start	of the string, while the "end of  line"	 metacharacter
		  ($)  matches only at the end of the string, or before	a ter-
		  minating newline (unless  option  dollar_endonly  is	speci-
		  fied). This is the same as in	Perl.

		  When	this option is specified, the "start of	line" and "end
		  of line" constructs match immediately	following  or  immedi-
		  ately	 before	 internal  newlines in the subject string, re-
		  spectively, as well as at the	very start and	end.  This  is
		  equivalent  to  Perl	option	/m and can be changed within a
		  pattern by a (?m) option setting. If there are  no  newlines
		  in  a	 subject string, or no occurrences of ^	or $ in	a pat-
		  tern,	setting	multiline has no effect.

		no_auto_capture:
		  Disables the use of numbered capturing  parentheses  in  the
		  pattern.  Any	 opening parenthesis that is not followed by ?
		  behaves as if	it is followed by ?:.  Named  parentheses  can
		  still	be used	for capturing (and they	acquire	numbers	in the
		  usual	way). There is no equivalent option in Perl.

		dupnames:
		  Names	used to	identify capturing  subpatterns	 need  not  be
		  unique.  This	 can  be  helpful for certain types of pattern
		  when it is known that	only one instance of the named subpat-
		  tern	can ever be matched. More details of named subpatterns
		  are provided below.

		ungreedy:
		  Inverts the "greediness" of the quantifiers so that they are
		  not greedy by	default, but become greedy if followed by "?".
		  It is	not compatible with Perl. It can also be set by	a (?U)
		  option setting within	the pattern.

		{newline, NLSpec}:
		  Overrides the	default	definition of a	newline	in the subject
		  string, which	is LF (ASCII 10) in Erlang.

		  cr:
		    Newline is indicated by a single character CR (ASCII 13).

		  lf:
		    Newline is indicated by a single character LF (ASCII  10),
		    the	default.

		  crlf:
		    Newline  is	 indicated by the two-character	CRLF (ASCII 13
		    followed by	ASCII 10) sequence.

		  anycrlf:
		    Any	of the three preceding sequences is to be recognized.

		  any:
		    Any	of the newline sequences above,	and  the  Unicode  se-
		    quences  VT	(vertical tab, U+000B),	FF (formfeed, U+000C),
		    NEL	(next line, U+0085), LS	(line separator, U+2028),  and
		    PS (paragraph separator, U+2029).

		bsr_anycrlf:
		  Specifies  specifically that \R is to	match only the CR, LF,
		  or CRLF sequences, not the Unicode-specific newline  charac-
		  ters.

		bsr_unicode:
		  Specifies  specifically  that	\R is to match all the Unicode
		  newline characters (including	CRLF, and so on, the default).

		no_start_optimize:
		  Disables an optimization that can malfunction if "Special
		  start-of-pattern items" are present in the regular
		  expression. A typical example is matching "DEFABC"
		  against "(*COMMIT)ABC", where the start optimization of PCRE
		  skips the subject up to "A" and never realizes that the
		  (*COMMIT) instruction should have made the match fail.
		  This option is only relevant if you use "start-of-pattern
		  items", as discussed in section PCRE Regular Expression
		  Details.

		ucp:
		  Specifies that Unicode character properties are to  be  used
		  when	resolving  \B,	\b, \D,	\d, \S,	\s, \W and \w. Without
		  this flag, only ISO Latin-1 properties are used. Using  Uni-
		  code	properties hurts performance, but is semantically cor-
		  rect when working with Unicode  characters  beyond  the  ISO
		  Latin-1 range.

		never_utf:
		  Specifies  that  the (*UTF) and/or (*UTF8) "start-of-pattern
		  items" are forbidden.	This flag cannot be combined with  op-
		  tion	unicode. Useful	if ISO Latin-1 patterns	from an	exter-
		  nal source are to be compiled.
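
	      The effect of several of the options above can be sketched with
	      run/3, which accepts compile options when the regular expression
	      is not precompiled. The subjects and patterns are illustrative
	      assumptions; the results in the comments follow from the option
	      descriptions above:

	      ```erlang
	      %% caseless: letters match regardless of case.
	      Caseless   = re:run("ABC", "abc", [caseless]),             % {match,[{0,3}]}
	      %% dotall: "." also matches the newline character.
	      NoDotall   = re:run("a\nb", "a.b"),                        % nomatch
	      Dotall     = re:run("a\nb", "a.b", [dotall]),              % {match,[{0,3}]}
	      %% multiline: ^ also matches right after an internal newline.
	      NoMulti    = re:run("one\ntwo", "^two"),                   % nomatch
	      Multi      = re:run("one\ntwo", "^two", [multiline]),      % {match,[{4,3}]}
	      %% dollar_endonly: $ no longer matches before a final newline.
	      Dollar     = re:run("abc\n", "abc$"),                      % {match,[{0,3}]}
	      DollarOnly = re:run("abc\n", "abc$", [dollar_endonly]),    % nomatch
	      %% ungreedy: quantifiers match as little as possible.
	      Greedy     = re:run("aaa", "a+"),                          % {match,[{0,3}]}
	      Ungreedy   = re:run("aaa", "a+", [ungreedy]),              % {match,[{0,1}]}
	      %% extended: unescaped white space and # comments are ignored.
	      Extended   = re:run("a1b", "a \\d b  # a digit between a and b", [extended]),
	      %% firstline: the match must start before or at the first newline.
	      FirstLine  = re:run("one\ntwo", "two", [firstline]),       % nomatch
	      %% ucp: \w uses Unicode properties (1078 is CYRILLIC SMALL LETTER ZHE).
	      NoUcp      = re:run([1078], "\\w", [unicode]),             % nomatch
	      Ucp        = re:run([1078], "\\w", [unicode, ucp, {capture, none}]). % match
	      ```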

       inspect(MP, Item) -> {namelist, [binary()]}

	      Types:

		 MP = mp()
		 Item =	namelist

	      Takes a compiled regular expression and an item, and returns the
	      relevant data from the regular expression. The only supported
	      item is namelist, which returns the tuple
	      {namelist, [binary()]}, containing the names of all (unique)
	      named subpatterns in the regular expression. For example:

	      1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
	      {ok,{re_pattern,3,0,0,
			      <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
				255,255,...>>}}
	      2> re:inspect(MP,namelist).
	      {namelist,[<<"A">>,<<"B">>,<<"C">>]}
	      3> {ok,MPD} = re:compile("(?<C>A)|(?<B>B)|(?<C>C)",[dupnames]).
	      {ok,{re_pattern,3,0,0,
			      <<69,82,67,80,119,0,0,0,0,0,8,0,1,0,0,0,255,255,255,255,
				255,255,...>>}}
	      4> re:inspect(MPD,namelist).
	      {namelist,[<<"B">>,<<"C">>]}

	      Notice in	the second example that	the duplicate name only	occurs
	      once  in the returned list, and that the list is in alphabetical
	      order regardless of where	the names are positioned in the	 regu-
	      lar  expression. The order of the	names is the same as the order
	      of captured subexpressions if {capture, all_names} is  specified
	      as  an option to run/3. You can therefore	create a name-to-value
	      mapping from the result of run/3 like this:

	      1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
	      {ok,{re_pattern,3,0,0,
			      <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
				255,255,...>>}}
	      2> {namelist, N} = re:inspect(MP,namelist).
	      {namelist,[<<"A">>,<<"B">>,<<"C">>]}
	      3> {match,L} = re:run("AA",MP,[{capture,all_names,binary}]).
	      {match,[<<"A">>,<<>>,<<>>]}
	      4> NameMap = lists:zip(N,L).
	      [{<<"A">>,<<"A">>},{<<"B">>,<<>>},{<<"C">>,<<>>}]

       replace(Subject,	RE, Replacement) -> iodata() | unicode:charlist()

	      Types:

		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata()
		 Replacement = iodata()	| unicode:charlist()

	      Same as replace(Subject, RE, Replacement,	[]).

       replace(Subject,	RE, Replacement, Options) ->
		  iodata() | unicode:charlist()

	      Types:

		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata() |	unicode:charlist()
		 Replacement = iodata()	| unicode:charlist()
		 Options = [Option]
		 Option	=
		     anchored |
		     global |
		     notbol |
		     noteol |
		     notempty |
		     notempty_atstart |
		     {offset, integer()	>= 0} |
		     {newline, NLSpec} |
		     bsr_anycrlf |
		     {match_limit, integer() >=	0} |
		     {match_limit_recursion, integer() >= 0} |
		     bsr_unicode |
		     {return, ReturnType} |
		     CompileOpt
		 ReturnType = iodata | list | binary
		 CompileOpt = compile_option()
		 NLSpec	= cr | crlf | lf | anycrlf | any

	      Replaces the matched part	of the Subject string  with  the  con-
	      tents of Replacement.

	      The permissible options are the same as for run/3, except that
	      option capture is not allowed. Instead a {return, ReturnType} is
	      present. The default return type is iodata, constructed in a way
	      to minimize copying. The iodata result can be used directly in
	      many I/O operations. If a flat list() is desired, specify
	      {return, list}. If a binary is desired, specify {return, binary}.
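
	      A short sketch of the three return types, with option global
	      added for comparison (subjects and replacements are
	      illustrative):

	      ```erlang
	      %% Default: iodata, which is flattened here only to inspect it.
	      IoData = re:replace("abcd", "c", "X"),
	      Flat   = iolist_to_binary(IoData),                                % <<"abXd">>
	      List   = re:replace("abcd", "c", "X", [{return, list}]),          % "abXd"
	      Bin    = re:replace("abcd", "c", "X", [{return, binary}]),        % <<"abXd">>
	      %% Without global only the first match is replaced.
	      First  = re:replace("a-b-c", "-", "+", [{return, list}]),         % "a+b-c"
	      All    = re:replace("a-b-c", "-", "+", [global, {return, list}]). % "a+b+c"
	      ```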

	      As in function run/3, an mp() compiled with option unicode
	      requires Subject to be a Unicode charlist(). If compilation is
	      done implicitly and the unicode compilation option is specified
	      to this function, both the regular expression and Subject are to
	      be specified as valid Unicode charlist()s.

	      The replacement string can contain the special character &,
	      which inserts the whole matching expression in the result, and
	      the special sequences \N (where N is an integer > 0), \gN, and
	      \g{N}, which insert subexpression number N in the result. If no
	      subexpression with that number is generated by the regular
	      expression, nothing is inserted.
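
	      As a sketch of numbered subexpression references (the name
	      strings are illustrative; notice the doubled backslashes required
	      by Erlang string literals):

	      ```erlang
	      RE = "(\\w+) (\\w+)",
	      %% \2 and \1 insert the second and first subexpression.
	      Swap  = re:replace("John Smith", RE, "\\2, \\1", [{return, list}]),
	      %% The \g{N} form refers to the same subexpressions.
	      Brace = re:replace("John Smith", RE, "\\g{2}, \\g{1}", [{return, list}]).
	      %% Both Swap and Brace are "Smith, John".
	      ```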

	      To insert	an & or	a \ in the result, precede it with a \.	Notice
	      that Erlang already gives	a special  meaning  to	\  in  literal
	      strings,	so  a single \ must be written as "\\" and therefore a
	      double \ as "\\\\".

	      Example:

	      re:replace("abcd","c","[&]",[{return,list}]).

	      gives

	      "ab[c]d"

	      while

	      re:replace("abcd","c","[\\&]",[{return,list}]).

	      gives

	      "ab[&]d"

	      As with run/3, compilation errors	raise  the  badarg  exception.
	      compile/2	can be used to get more	information about the error.

       run(Subject, RE)	-> {match, Captured} | nomatch

	      Types:

		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata()
		 Captured = [CaptureData]
		 CaptureData = {integer(), integer()}

	      Same as run(Subject,RE,[]).

       run(Subject, RE,	Options) ->
	      {match, Captured}	| match	| nomatch | {error, ErrType}

	      Types:

		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata() |	unicode:charlist()
		 Options = [Option]
		 Option	=
		     anchored |
		     global |
		     notbol |
		     noteol |
		     notempty |
		     notempty_atstart |
		     report_errors |
		     {offset, integer()	>= 0} |
		     {match_limit, integer() >=	0} |
		     {match_limit_recursion, integer() >= 0} |
		     {newline, NLSpec :: nl_spec()} |
		     bsr_anycrlf |
		     bsr_unicode |
		     {capture, ValueSpec} |
		     {capture, ValueSpec, Type}	|
		     CompileOpt
		 Type =	index |	list | binary
		 ValueSpec =
		     all | all_but_first | all_names | first | none |
		     ValueList
		 ValueList = [ValueID]
		 ValueID = integer() | string()	| atom()
		 CompileOpt = compile_option()
		   See compile/2.
		 Captured = [CaptureData] | [[CaptureData]]
		 CaptureData =
		     {integer(), integer()} | ListConversionData | binary()
		 ListConversionData =
		     string() |
		     {error, string(), binary()} |
		     {incomplete, string(), binary()}
		 ErrType =
		     match_limit | match_limit_recursion |
		     {compile, CompileErr}
		 CompileErr =
		     {ErrString	:: string(), Position :: integer() >= 0}

	      Executes	  a   regular	expression   matching,	 and   returns
	      match/{match, Captured} or nomatch. The regular  expression  can
	      be  specified  either  as	iodata() in which case it is automati-
	      cally compiled (as by compile/2) and executed, or	as  a  precom-
	      piled  mp() in which case	it is executed against the subject di-
	      rectly.

	      When compilation is involved, exception badarg is	 thrown	 if  a
	      compilation  error  occurs.  Call	 compile/2  to get information
	      about the	location of the	error in the regular expression.

	      If the regular expression	is  previously	compiled,  the	option
	      list can only contain the	following options:

		* anchored

		* {capture, ValueSpec}/{capture, ValueSpec, Type}

		* global

		* {match_limit, integer() >= 0}

		* {match_limit_recursion, integer() >= 0}

		* {newline, NLSpec}

		* notbol

		* notempty

		* notempty_atstart

		* noteol

		* {offset, integer() >= 0}

		* report_errors

	      Otherwise	 all options valid for function	compile/2 are also al-
	      lowed. Options allowed both for compilation and execution	 of  a
	      match,  namely  anchored	and {newline, NLSpec}, affect both the
	      compilation and execution	if present together with a non-precom-
	      piled regular expression.
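
	      A minimal sketch of this division of options: the compile option
	      caseless is given to compile/2, while the run-time options global
	      and capture are given to run/3 together with the precompiled
	      mp() (the subject is illustrative):

	      ```erlang
	      %% caseless is a compile option, so it is passed to compile/2 ...
	      {ok, MP} = re:compile("abc", [caseless]),
	      %% ... while global and capture are accepted together with an mp().
	      R = re:run("ABCabc", MP, [global, {capture, first, list}]).
	      %% R is {match,[["ABC"],["abc"]]}.
	      ```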

	      If  the  regular	expression was previously compiled with	option
	      unicode,	Subject	 is  to	 be  provided  as  a   valid   Unicode
	      charlist(),  otherwise  any  iodata() will do. If	compilation is
	      involved and option unicode is specified,	both Subject  and  the
	      regular expression are to be specified as valid Unicode
	      charlist()s.
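
	      A sketch of matching with option unicode, with the subject and
	      the pattern written as lists of code points so that the example
	      does not depend on the source file encoding (the word chosen is
	      an illustrative assumption):

	      ```erlang
	      %% "größe" as Unicode code points; 246 is o-diaeresis, 223 is sharp s.
	      Subject = [103, 114, 246, 223, 101],
	      {ok, MP} = re:compile([246], [unicode]),
	      %% With type list the capture is returned as a list of code points.
	      AsList = re:run(Subject, MP, [{capture, all, list}]),    % {match,[[246]]}
	      %% With type binary the capture is returned as UTF-8 encoded bytes.
	      AsBin  = re:run(Subject, MP, [{capture, all, binary}]).  % {match,[<<195,182>>]}
	      ```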

	      {capture,	ValueSpec}/{capture, ValueSpec,	Type} defines what  to
	      return  from  the	function upon successful matching. The capture
	      tuple can	contain	both a value specification, telling  which  of
	      the  captured substrings are to be returned, and a type specifi-
	      cation, telling how captured substrings are to be	 returned  (as
	      index  tuples, lists, or binaries). The options are described in
	      detail below.

	      If the capture options describe that no substring	 capturing  is
	      to  be  done  ({capture, none}), the function returns the	single
	      atom match upon successful matching, otherwise the tuple {match,
	      ValueList}. Disabling capturing can be done either by specifying
	      none or an empty list as ValueSpec.
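
	      The return shapes described above can be sketched as follows
	      (subject and pattern are illustrative):

	      ```erlang
	      S = "ABCabcdABC",
	      Index = re:run(S, "abcd"),                           % {match,[{3,4}]}
	      List  = re:run(S, "abcd", [{capture, all, list}]),   % {match,["abcd"]}
	      Bin   = re:run(S, "abcd", [{capture, all, binary}]), % {match,[<<"abcd">>]}
	      %% Disabling capturing returns the single atom match.
	      None  = re:run(S, "abcd", [{capture, none}]),        % match
	      Empty = re:run(S, "abcd", [{capture, []}]).          % match
	      ```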

	      Option report_errors adds the possibility that an error tuple is
	      returned. The tuple either indicates a matching error
	      (match_limit or match_limit_recursion), or a compilation error,
	      where the error tuple has the format
	      {error, {compile, CompileErr}}. Notice that if option
	      report_errors is not specified, the function never returns error
	      tuples, but reports compilation errors as a badarg exception and
	      failed matches because of exceeded match limits simply as
	      nomatch.
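
	      A sketch of the two error-handling styles, using the deliberately
	      invalid pattern "(" as an assumption. The text of the error
	      message is not asserted, only its shape:

	      ```erlang
	      %% Without report_errors a compilation error raises badarg.
	      Raised = try re:run("abc", "(") catch error:badarg -> badarg end,
	      %% With report_errors the same problem is returned as an error tuple.
	      {error, {compile, {Msg, Pos}}} = re:run("abc", "(", [report_errors]),
	      %% compile/2 reports the same kind of {ErrString, Position} pair.
	      {error, {Msg2, Pos2}} = re:compile("(", []),
	      Shapes = {Raised, is_list(Msg), is_integer(Pos),
	                is_list(Msg2), is_integer(Pos2)}.
	      %% Shapes is {badarg,true,true,true,true}.
	      ```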

	      The following options are	relevant for execution:

		anchored:
		  Limits  run/3	to matching at the first matching position. If
		  a pattern was	compiled with anchored,	or turned  out	to  be
		  anchored  by virtue of its contents, it cannot be made unan-
		  chored at matching time, hence there is  no  unanchored  op-
		  tion.
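
		  A brief sketch of option anchored at run time (subject and
		  pattern are illustrative):

		  ```erlang
		  Unanchored = re:run("xabc", "abc"),                          % {match,[{1,3}]}
		  Anchored   = re:run("xabc", "abc", [anchored]),              % nomatch
		  %% Together with offset, the match must begin exactly at that offset.
		  AtOffset   = re:run("xabc", "abc", [anchored, {offset, 1}]). % {match,[{1,3}]}
		  ```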

		global:
		  Implements global (repetitive) search (flag g in Perl). Each
		  match is returned as a separate list() containing the
		  specific match and any matching subexpressions (or as
		  specified by option capture). The Captured part of the return
		  value is hence a list() of list()s when this option is
		  specified.

		  The  interaction  of option global with a regular expression
		  that matches an empty	string surprises some users. When  op-
		  tion global is specified, run/3 handles empty	matches	in the
		  same way as Perl: a zero-length match	at any point  is  also
		  retried  with	 options [anchored, notempty_atstart]. If that
		  search gives a result	of length > 0, the result is included.
		  Example:

		re:run("cat","(|at)",[global]).

		  The following	matchings are performed:

		  At offset 0:
		    The regular expression (|at) first matches at the initial
		    position of string cat, giving the result set
		    [{0,0},{0,0}] (the second {0,0} is because of the
		    subexpression marked by the parentheses). As the length of
		    the match is 0, we do not advance to the next position yet.

		  At offset 0 with [anchored, notempty_atstart]:
		    The search is retried with options
		    [anchored, notempty_atstart] at the same position, which
		    does not give any interesting result of longer length, so
		    the search position is advanced to the next character (a).

		  At offset 1:
		    The	search results in [{1,0},{1,0}],  so  this  search  is
		    also repeated with the extra options.

		  At offset 1 with [anchored, notempty_atstart]:
		    Alternative at is found and the result is [{1,2},{1,2}].
		    The result is added to the list of results and the
		    position in the search string is advanced two steps.

		  At offset 3:
		    The	 search	 once  again  matches the empty	string,	giving
		    [{3,0},{3,0}].

		  At offset 3 with [anchored, notempty_atstart]:
		    This gives no result of length > 0 and we are at the  last
		    position, so the global search is complete.

		  The result of	the call is:

		{match,[[{0,0},{0,0}],[{1,0},{1,0}],[{1,2},{1,2}],[{3,0},{3,0}]]}
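
		  For a pattern that cannot match the empty string, the
		  global result is simpler. A minimal sketch, with an
		  illustrative subject:

		  ```erlang
		  S = "a1b22c333",
		  %% Without global, only the first match is returned.
		  Single = re:run(S, "\\d+", [{capture, all, list}]),          % {match,["1"]}
		  %% With global, each match is returned as a separate inner list.
		  Global = re:run(S, "\\d+", [global, {capture, all, list}]).
		  %% Global is {match,[["1"],["22"],["333"]]}.
		  ```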

		notempty:
		  An  empty  string  is	 not considered	to be a	valid match if
		  this option is specified. If alternatives in the pattern ex-
		  ist, they are	tried. If all the alternatives match the empty
		  string, the entire match fails.

		  Example:

		  If the following pattern is applied to a string  not	begin-
		  ning	with  "a"  or  "b",  it	would normally match the empty
		  string at the	start of the subject:

		a?b?

		  With option  notempty,  this	match  is  invalid,  so	 run/3
		  searches  further  into the string for occurrences of	"a" or
		  "b".

		notempty_atstart:
		  Like notempty, except	that an	empty string match that	is not
		  at  the start	of the subject is permitted. If	the pattern is
		  anchored, such a match can occur only	if  the	 pattern  con-
		  tains	\K.

		  Perl has no direct equivalent of notempty or
		  notempty_atstart, but it does make a special case of a
		  pattern match of the empty string within its split()
		  function, and when using modifier /g. The Perl behavior can
		  be emulated after matching a null string by first trying the
		  match again at the same offset with notempty_atstart and
		  anchored, and then, if that fails, by advancing the starting
		  offset (see below) and trying an ordinary match again.
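
		  The difference between notempty and notempty_atstart can be
		  sketched with the a?b? pattern from the notempty description
		  and a pattern that can only match the empty string in the
		  given subject (both subjects are illustrative):

		  ```erlang
		  Default  = re:run("xab", "a?b?"),                   % {match,[{0,0}]}
		  NotEmpty = re:run("xab", "a?b?", [notempty]),       % {match,[{1,2}]}
		  %% "xy" contains no "a", so a? can only match the empty string.
		  %% notempty rejects every such match ...
		  Never    = re:run("xy", "a?", [notempty]),          % nomatch
		  %% ... whereas notempty_atstart rejects only the one at the start.
		  AtStart  = re:run("xy", "a?", [notempty_atstart]).  % {match,[{1,0}]}
		  ```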

		notbol:
		  Specifies that the first character of	the subject string  is
		  not the beginning of a line, so the circumflex metacharacter
		  is not to match before it. Setting  this  without  multiline
		  (at compile time) causes circumflex never to match. This op-
		  tion only affects the	behavior of the	circumflex metacharac-
		  ter. It does not affect \A.

		noteol:
		  Specifies  that the end of the subject string	is not the end
		  of a line, so	the dollar metacharacter is not	 to  match  it
		  nor  (except in multiline mode) a newline immediately	before
		  it. Setting this without multiline (at compile time)	causes
		  dollar never to match. This option affects only the behavior
		  of the dollar	metacharacter. It does not affect \Z or	\z.
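
		  A sketch of notbol and noteol, including the \A and \z
		  assertions that they leave unaffected (subjects and patterns
		  are illustrative):

		  ```erlang
		  %% notbol: ^ does not match at the start of the subject.
		  Bol   = re:run("abc", "^abc", [notbol]),                 % nomatch
		  BolA  = re:run("abc", "\\Aabc", [notbol]),               % {match,[{0,3}]}
		  %% With multiline, ^ can still match after an internal newline.
		  BolML = re:run("x\nabc", "^abc", [notbol, multiline]),   % {match,[{2,3}]}
		  %% noteol: $ does not match at the end of the subject.
		  Eol   = re:run("abc", "abc$", [noteol]),                 % nomatch
		  EolZ  = re:run("abc", "abc\\z", [noteol]).               % {match,[{0,3}]}
		  ```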

		report_errors:
		  Gives	better control of the error handling  in  run/3.  When
		  specified,  compilation errors (if the regular expression is
		  not already compiled)	and runtime errors are explicitly  re-
		  turned as an error tuple.

		  The following	are the	possible runtime errors:

		  match_limit:
		    The	PCRE library sets a limit on how many times the	inter-
		    nal	match function can be called. Defaults	to  10,000,000
		    in	 the   library	 compiled   for	  Erlang.  If  {error,
		    match_limit} is returned, the execution of the regular ex-
		    pression  has  reached  this limit.	This is	normally to be
		    regarded as	a nomatch, which is the	default	 return	 value
		    when this occurs, but by specifying	report_errors, you are
		    informed when the match fails because of too many internal
		    calls.

		  match_limit_recursion:
		    This error is very similar to match_limit, but occurs when
		    the internal match function of PCRE is "recursively"
		    called more times than the match_limit_recursion limit,
		    which defaults to 10,000,000 as well. Notice that as long
		    as the match_limit and match_limit_recursion values are
		    kept at the default values, the match_limit_recursion error
		    cannot occur, as the match_limit error occurs before that
		    (each recursive call is also a call, but not conversely).
		    Both limits can however be changed, either by setting
		    limits directly in the regular expression string (see
		    section PCRE Regular Expression Details) or by specifying
		    options to run/3.

		  It is	important to understand	that what is  referred	to  as
		  "recursion"  when limiting matches is	not recursion on the C
		  stack	of the Erlang machine or on the	Erlang process	stack.
		  The  PCRE  version  compiled into the	Erlang VM uses machine
		  "heap" memory	to store values	that must be kept over	recur-
		  sion in regular expression matches.

		{match_limit, integer() >= 0}:
		  Limits  the  execution time of a match in an implementation-
		  specific way.	It is described	as follows by the  PCRE	 docu-
		  mentation:

		The match_limit	field provides a means of preventing PCRE from using
		up a vast amount of resources when running patterns that are not going
		to match, but which have a very	large number of	possibilities in their
		search trees. The classic example is a pattern that uses nested
		unlimited repeats.

		Internally, pcre_exec()	uses a function	called match(),	which it calls
		repeatedly (sometimes recursively). The	limit set by match_limit is
		imposed	on the number of times this function is	called during a	match,
		which has the effect of	limiting the amount of backtracking that can
		take place. For	patterns that are not anchored,	the count restarts
		from zero for each position in the subject string.

		  This	means that runaway regular expression matches can fail
		  faster if the	limit is lowered using this  option.  The  de-
		  fault	value 10,000,000 is compiled into the Erlang VM.

	    Note:
		This option in no way affects the execution of the Erlang
		VM in terms of "long running BIFs". run/3 always gives control
		back to the scheduler of Erlang processes at intervals that
		ensure the real-time properties of the Erlang system.
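
		  A sketch of a lowered match_limit, assuming a pattern with
		  nested unlimited repeats and a subject that cannot match.
		  The limit 1000 is an arbitrary illustrative value:

		  ```erlang
		  %% 24 "a" characters followed by "!", so (a+)+$ can never match and
		  %% the nested repeats cause heavy backtracking.
		  Subject = lists:duplicate(24, $a) ++ "!",
		  %% Without report_errors the exceeded limit looks like a failed match.
		  Quiet = re:run(Subject, "(a+)+$", [{match_limit, 1000}]),
		  %% With report_errors the reason is made explicit.
		  Loud  = re:run(Subject, "(a+)+$", [{match_limit, 1000}, report_errors]).
		  %% Quiet is nomatch and Loud is {error,match_limit}.
		  ```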

		{match_limit_recursion, integer() >= 0}:
		  Limits the execution time and	memory consumption of a	 match
		  in   an   implementation-specific   way,   very  similar  to
		  match_limit. It is described as follows by the PCRE documen-
		  tation:

		The match_limit_recursion field	is similar to match_limit, but instead
		of limiting the	total number of	times that match() is called, it
		limits the depth of recursion. The recursion depth is a	smaller	number
		than the total number of calls,	because	not all	calls to match() are
		recursive. This	limit is of use	only if	it is set smaller than
		match_limit.

		Limiting the recursion depth limits the	amount of machine stack	that
		can be used, or, when PCRE has been compiled to	use memory on the heap
		instead	of the stack, the amount of heap memory	that can be used.

		  The  Erlang VM uses a	PCRE library where heap	memory is used
		  when regular expression match	recursion occurs. This	there-
		  fore limits the use of machine heap, not C stack.

		  Specifying a lower value can result in matches with deep re-
		  cursion failing, when	they should have matched:

		1> re:run("aaaaaaaaaaaaaz","(a+)*z").
		{match,[{0,14},{0,13}]}
		2> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5}]).
		nomatch
		3> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5},report_errors]).
		{error,match_limit_recursion}

		  This option and option match_limit are only to  be  used  in
		  rare	cases.	Understanding of the PCRE library internals is
		  recommended before tampering with these limits.

		{offset, integer() >= 0}:
		  Start	matching at the	offset	(position)  specified  in  the
		  subject  string.  The	 offset	is zero-based, so that the de-
		  fault	is {offset,0} (all of the subject string).
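
		  A minimal sketch (the subject is illustrative). Notice that
		  the returned index is still counted from the start of the
		  subject string, not from the offset:

		  ```erlang
		  FromStart = re:run("abcabc", "abc"),                 % {match,[{0,3}]}
		  FromOne   = re:run("abcabc", "abc", [{offset, 1}]).  % {match,[{3,3}]}
		  ```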

		{newline, NLSpec}:
		  Overrides the	default	definition of a	newline	in the subject
		  string, which	is LF (ASCII 10) in Erlang.

		  cr:
		    Newline is indicated by a single character CR (ASCII 13).

		  lf:
		    Newline  is	indicated by a single character	LF (ASCII 10),
		    the	default.

		  crlf:
		    Newline is indicated by the	two-character CRLF  (ASCII  13
		    followed by	ASCII 10) sequence.

		  anycrlf:
		    Any of the three preceding sequences is to be recognized.

		  any:
		    Any	 of  the  newline sequences above, and the Unicode se-
		    quences VT (vertical tab, U+000B), FF (formfeed,  U+000C),
		    NEL	 (next line, U+0085), LS (line separator, U+2028), and
		    PS (paragraph separator, U+2029).
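
		  As a sketch of how the newline definition changes where $
		  matches in multiline mode (the subject with a CRLF line
		  ending is an illustrative assumption):

		  ```erlang
		  S = "a\r\nb",
		  %% Default (lf): "a" is followed by CR, which is not a newline.
		  Lf      = re:run(S, "a$", [multiline]),                      % nomatch
		  %% crlf: the CRLF pair is the newline, so $ matches right after "a".
		  Crlf    = re:run(S, "a$", [multiline, {newline, crlf}]),     % {match,[{0,1}]}
		  %% anycrlf: CR, LF and CRLF are all recognized.
		  AnyCrlf = re:run(S, "a$", [multiline, {newline, anycrlf}]).  % {match,[{0,1}]}
		  ```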

		bsr_anycrlf:
		  Specifies specifically that \R is to match only the CR, LF,
		  or CRLF sequences, not the Unicode-specific newline
		  characters. (Overrides the compilation option.)

		bsr_unicode:
		  Specifies specifically that \R is to match all  the  Unicode
		  newline characters (including	CRLF, and so on, the default).
		  (Overrides the compilation option.)
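
		  A sketch of the two \R settings, using the vertical tab as
		  an example of a newline character outside the CR, LF, and
		  CRLF set (the subjects are illustrative):

		  ```erlang
		  %% "\v" is the vertical tab (VT) character.
		  Default = re:run("\v", "\\R"),                   % {match,[{0,1}]}
		  AnyCrlf = re:run("\v", "\\R", [bsr_anycrlf]),    % nomatch
		  %% CRLF is matched as one two-character sequence.
		  Crlf    = re:run("\r\n", "\\R", [bsr_anycrlf]).  % {match,[{0,2}]}
		  ```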

		{capture, ValueSpec}/{capture, ValueSpec, Type}:
		  Specifies which captured substrings are returned and in what
		  format. By default, run/3 captures all of the matching part
		  of the subject string and all capturing subpatterns (all of
		  the pattern is automatically captured). The default return
		  type is (zero-based) indexes of the captured parts of the
		  string, specified as {Offset,Length} pairs (the index Type
		  of capturing).

		  As an	example	of the default behavior,  the  following  call
		  returns,  as	first  and  only captured string, the matching
		  part of the subject ("abcd" in the middle) as	an index  pair
		  {3,4},  where	character positions are	zero-based, just as in
		  offsets:

		re:run("ABCabcdABC","abcd",[]).

		  The return value of this call	is:

		{match,[{3,4}]}

		  Another (and quite common) case is where the regular expres-
		  sion matches all of the subject:

		re:run("ABCabcdABC",".*abcd.*",[]).

		  Here	the return value correspondingly points	out all	of the
		  string, beginning at index 0,	and it is 10 characters	long:

		{match,[{0,10}]}

		  If the regular expression  contains  capturing  subpatterns,
		  like in:

		re:run("ABCabcdABC",".*(abcd).*",[]).

		  all  of the matched subject is captured, as well as the cap-
		  tured	substrings:

		{match,[{0,10},{3,4}]}

		  The complete matching	pattern	always gives the first	return
		  value	in the list and	the remaining subpatterns are added in
		  the order they occurred in the regular expression.

		  The capture tuple is built up	as follows:

		  ValueSpec:
		    Specifies which captured (sub)patterns are to be returned.
		    ValueSpec  can  either  be an atom describing a predefined
		    set	of return values, or a list containing the indexes  or
		    the	names of specific subpatterns to return.

		    The	following are the predefined sets of subpatterns:

		    all:
		      All captured subpatterns including the complete matching
		      string. This is the default.

		    all_names:
		      All named	subpatterns in the regular expression, as if a
		      list() of	all the	names in alphabetical order was	speci-
		      fied. The	list of	all names can also be  retrieved  with
		      inspect/2.

		    first:
		      Only  the	first captured subpattern, which is always the
		      complete matching	part of	the  subject.  All  explicitly
		      captured subpatterns are discarded.

		    all_but_first:
		      All  but the first matching subpattern, that is, all ex-
		      plicitly captured	 subpatterns,  but  not	 the  complete
		      matching	part  of the subject string. This is useful if
		      the regular expression as	a whole	matches	a  large  part
		      of the subject, but the part you are interested in is in
		      an explicitly captured subpattern. If the	return type is
		      list  or	binary,	 not returning subpatterns you are not
		      interested in is a good way to optimize.

		    none:
		      Returns no matching subpatterns, gives the  single  atom
		      match  as	the return value of the	function when matching
		      successfully instead  of	the  {match,  list()}  return.
		      Specifying an empty list gives the same behavior.

		    The	value list is a	list of	indexes	for the	subpatterns to
		    return, where index	0 is for all of	the pattern, and 1  is
		    for	the first explicit capturing subpattern	in the regular
		    expression,	and so on. When	using named  captured  subpat-
		    terns  (see	 below)	in the regular expression, one can use
		    atom()s or string()s to specify the	subpatterns to be  re-
		    turned. For	example, consider the regular expression:

		  ".*(abcd).*"

		    matched  against  string  "ABCabcdABC", capturing only the
		    "abcd" part	(the first explicit subpattern):

		  re:run("ABCabcdABC",".*(abcd).*",[{capture,[1]}]).

		    The	call gives the following result, as the	first  explic-
		    itly  captured  subpattern is "(abcd)", matching "abcd" in
		    the	subject, at (zero-based) position 3, of	length 4:

		  {match,[{3,4}]}

		    Consider the same regular expression, but with the subpat-
		    tern explicitly named 'FOO':

		  ".*(?<FOO>abcd).*"

		    With this expression, we could still give the index	of the
		    subpattern with the	following call:

		  re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,[1]}]).

		    giving the same result as before. But, as  the  subpattern
		    is named, we can also specify its name in the value	list:

		  re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,['FOO']}]).

		    This  would	 give the same result as the earlier examples,
		    namely:

		  {match,[{3,4}]}

		    The	values list can	specify	indexes	or names  not  present
		    in the regular expression, in which	case the return	values
		    vary depending on the type.	If the type is index, the  tu-
		    ple	 {-1,0}	 is  returned for values with no corresponding
		    subpattern in the regular expression, but  for  the	 other
		    types  (binary  and	list), the values are the empty	binary
		    or list, respectively.

		  Type:
		    Optionally specifies how captured substrings are to	be re-
		    turned. If omitted,	the default of index is	used.

		    Type can be	one of the following:

		    index:
		      Returns  captured	 substrings  as	 pairs of byte indexes
		      into the subject	string	and  length  of	 the  matching
		      string  in  the  subject	(as  if	the subject string was
		      flattened	  with	 erlang:iolist_to_binary/1   or	  uni-
		      code:characters_to_binary/2   before  matching).	Notice
		      that option unicode results in byte-oriented indexes  in
		      a	 (possibly virtual) UTF-8 encoded binary. A byte index
		      tuple {0,2} can therefore	represent one or  two  charac-
		      ters  when  unicode is in	effect.	This can seem counter-
		      intuitive, but has been deemed the  most	effective  and
		      useful  way to do	it. To return lists instead can	result
		      in simpler code if that is desired. This return type  is
		      the default.

		    list:
		      Returns matching substrings as lists of characters
		      (Erlang string()s). If option unicode is used in
		      combination with the \C sequence in the regular expression, a
		      captured subpattern can contain bytes that are not valid
		      UTF-8  (\C  matches bytes	regardless of character	encod-
		      ing). In that case the list capturing can	result in  the
		      same  types  of tuples that unicode:characters_to_list/2
		      can return, namely three-tuples with tag	incomplete  or
		      error, the successfully converted	characters and the in-
		      valid UTF-8 tail of the conversion as a binary. The best
		      strategy	is to avoid using the \C sequence when captur-
		      ing lists.

		    binary:
		      Returns matching substrings as binaries. If option  uni-
		      code is used, these binaries are in UTF-8. If the	\C se-
		      quence is	used together with unicode, the	 binaries  can
		      be invalid UTF-8.

		  In  general,	subpatterns  that were not assigned a value in
		  the match are	returned as the	tuple {-1,0} when type is  in-
		  dex. Unassigned subpatterns are returned as the empty	binary
		  or list, respectively, for other return types. Consider  the
		  following regular expression:

		".*((?<FOO>abdd)|a(..d)).*"

		  There are three explicitly capturing subpatterns, where the
		  opening parenthesis position determines the order in the
		  result, hence ((?<FOO>abdd)|a(..d)) is subpattern index 1,
		  (?<FOO>abdd) is subpattern index 2, and (..d) is subpattern
		  index 3. When matched against the following string:

		"ABCabcdABC"

		  the  subpattern  at index 2 does not match, as "abdd"	is not
		  present in the string, but the complete pattern matches (be-
		  cause	 of the	alternative a(..d)). The subpattern at index 2
		  is therefore unassigned and the default return value is:

		{match,[{0,10},{3,4},{-1,0},{4,3}]}

		  Setting the capture Type to binary gives:

		{match,[<<"ABCabcdABC">>,<<"abcd">>,<<>>,<<"bcd">>]}

		  Here the empty binary (<<>>) represents the unassigned
		  subpattern. In the binary case, some information about the
		  matching is therefore lost, as <<>> can also be an empty
		  string captured.

		  If  differentiation  between	empty matches and non-existing
		  subpatterns is necessary, use	the type index and do the con-
		  version to the final type in Erlang code.

		  When option global is specified, the capture specification
		  affects each match separately, so that:

		re:run("cacb","c(a|b)",[global,{capture,[1],list}]).

		  gives

		{match,[["a"],["b"]]}

	      For descriptions of options only affecting the compilation
	      step, see	compile/2.
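
	      The capture options described above can be combined in one
	      place. The following is a minimal sketch rather than part of
	      the reference: the module name re_capture_sketch and the
	      function demo/0 are made up for illustration, and each
	      expected value is asserted by pattern matching.

```erlang
-module(re_capture_sketch).
-export([demo/0]).

%% Illustrative only: module and function names are hypothetical.
demo() ->
    Subject = "ABCabcdABC",
    RE = ".*((?<FOO>abdd)|a(..d)).*",
    %% Default: every group as {Offset,Length}; FOO (group 2) is unassigned.
    {match,[{0,10},{3,4},{-1,0},{4,3}]} = re:run(Subject, RE, []),
    %% all_but_first drops the whole-match entry; unassigned becomes [].
    {match,["abcd",[],"bcd"]} =
        re:run(Subject, RE, [{capture,all_but_first,list}]),
    %% A value list can mix subpattern names and indexes.
    {match,[<<>>,<<"bcd">>]} =
        re:run(Subject, RE, [{capture,['FOO',3],binary}]),
    %% none returns the bare atom match instead of {match, list()}.
    match = re:run(Subject, RE, [{capture,none}]),
    ok.
```

	      Calling re_capture_sketch:demo() returns ok when every
	      assertion holds. Only the index form distinguishes the
	      unassigned subpattern from an empty capture.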

       split(Subject, RE) -> SplitList

	      Types:

		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata()
		 SplitList = [iodata() | unicode:charlist()]

	      Same as split(Subject, RE, []).

       split(Subject, RE, Options) -> SplitList

	      Types:

		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata() |	unicode:charlist()
		 Options = [Option]
		 Option	=
		     anchored |
		     notbol |
		     noteol |
		     notempty |
		     notempty_atstart |
		     {offset, integer()	>= 0} |
		     {newline, nl_spec()} |
		     {match_limit, integer() >=	0} |
		     {match_limit_recursion, integer() >= 0} |
		     bsr_anycrlf |
		     bsr_unicode |
		     {return, ReturnType} |
		     {parts, NumParts} |
		     group |
		     trim |
		     CompileOpt
		 NumParts = integer() >= 0 | infinity
		 ReturnType = iodata | list | binary
		 CompileOpt = compile_option()
		   See compile/2.
		 SplitList = [RetData] | [GroupedRetData]
		 GroupedRetData	= [RetData]
		 RetData = iodata() | unicode:charlist() | binary() | list()

	      Splits  the  input into parts by finding tokens according	to the
	      regular expression supplied. The splitting is basically done  by
	      running  a global	regular	expression match and dividing the ini-
	      tial string wherever a match occurs. The matching	 part  of  the
	      string is	removed	from the output.

	      As  in run/3, an mp() compiled with option unicode requires Sub-
	      ject to be a Unicode charlist(). If compilation is done  implic-
	      itly  and	 the  unicode  compilation option is specified to this
	      function,	both the regular expression  and  Subject  are	to  be
	      specified	as valid Unicode charlist()s.

	      The  result  is given as a list of "strings", the	preferred data
	      type specified in	option return (default iodata).

	      If subexpressions	are specified in the regular  expression,  the
	      matching	subexpressions	are  returned in the resulting list as
	      well. For	example:

	      re:split("Erlang","[ln]",[{return,list}]).

	      gives

	      ["Er","a","g"]

	      while

	      re:split("Erlang","([ln])",[{return,list}]).

	      gives

	      ["Er","l","a","n","g"]

	      The text matching	the subexpression (marked by  the  parentheses
	      in  the regular expression) is inserted in the result list where
	      it was found. This means that  concatenating  the	 result	 of  a
	      split  where the whole regular expression	is a single subexpres-
	      sion (as in the last example) always  results  in	 the  original
	      string.

	      As  there	 is no matching	subexpression for the last part	in the
	      example (the "g"), nothing is inserted after that. To  make  the
	      group  of	strings	and the	parts matching the subexpressions more
	      obvious, one can use option group,  which	 groups	 together  the
	      part  of	the  subject string with the parts matching the	subex-
	      pressions	when the string	was split:

	      re:split("Erlang","([ln])",[{return,list},group]).

	      gives

	      [["Er","l"],["a","n"],["g"]]

	      Here the regular expression first	matched	the "l", causing  "Er"
	      to  be the first part in the result. When	the regular expression
	      matched, the (only) subexpression	was bound to the "l",  so  the
	      "l"  is inserted in the group together with "Er".	The next match
	      is of the	"n", making "a"	the next part to be returned.  As  the
	      subexpression is bound to	substring "n" in this case, the	"n" is
	      inserted into this group.	The last group consists	of the remain-
	      ing string, as no	more matches are found.

	      By  default,  all	 parts	of  the	 string,  including  the empty
	      strings, are returned from the function, for example:

	      re:split("Erlang","[lg]",[{return,list}]).

	      gives

	      ["Er","an",[]]

	      as the matching of the "g" at the end of the string leaves an
	      empty  rest,  which is also returned. This behavior differs from
	      the default behavior of the split	function in Perl, where	 empty
	      strings at the end are by	default	removed. To get	the "trimming"
	      default behavior of Perl,	specify	trim as	an option:

	      re:split("Erlang","[lg]",[{return,list},trim]).

	      gives

	      ["Er","an"]

	      The trim option says: "give me as many parts as possible except
	      the empty ones", which can sometimes be useful. You can also
	      specify how many parts you want, by specifying {parts,N}:

	      re:split("Erlang","[lg]",[{return,list},{parts,2}]).

	      gives

	      ["Er","ang"]

	      Notice that the last part	is "ang", not "an", as	splitting  was
	      specified	 into  two  parts, and the splitting stops when	enough
	      parts are	given, which is	why the	result differs	from  that  of
	      trim.

	      More than three parts are not possible with this input, so

	      re:split("Erlang","[lg]",[{return,list},{parts,4}]).

	      gives  the  same result as the default, which is to be viewed as
	      "an infinite number of parts".

	      Specifying 0 as the number of parts gives	the same effect	as op-
	      tion  trim. If subexpressions are	captured, empty	subexpressions
	      matched at the end are also stripped from	the result if trim  or
	      {parts,0}	is specified.

	      The  trim	 behavior  corresponds	exactly	 to  the Perl default.
	      {parts,N}, where N is a positive integer,	corresponds exactly to
	      the Perl behavior	with a positive	numerical third	parameter. The
	      default behavior of split/3 corresponds  to  the	Perl  behavior
	      when  a negative integer is specified as the third parameter for
	      the Perl routine.

	      Summary of options not previously	described for function run/3:

		{return,ReturnType}:
		  Specifies how	the parts of the original string are presented
		  in the result	list. Valid types:

		  iodata:
		    The	 variant  of  iodata() that gives the least copying of
		    data with the current implementation (often	a binary,  but
		    do not depend on it).

		  binary:
		    All	parts returned as binaries.

		  list:
		    All	parts returned as lists	of characters ("strings").

		group:
		  Groups together the part of the string with the parts	of the
		  string matching the subexpressions of	 the  regular  expres-
		  sion.

		  The  return value from the function is in this case a	list()
		  of list()s. Each sublist begins with the string  picked  out
		  of  the  subject string, followed by the parts matching each
		  of the subexpressions	in order of occurrence in the  regular
		  expression.

		{parts,N}:
		  Specifies  the  number  of parts the subject string is to be
		  split	into.

		  The number of	parts is to be a positive integer for  a  spe-
		  cific	 maximum number	of parts, and infinity for the maximum
		  number of parts possible (the	default). Specifying {parts,0}
		  gives	 as many parts as possible disregarding	empty parts at
		  the end, the same as specifying trim.

		trim:
		  Specifies that empty parts at	the end	of the result list are
		  to  be  disregarded.	The same as specifying {parts,0}. This
		  corresponds to the default behavior of  the  split  built-in
		  function in Perl.
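
	      The split options summarized above can be compared side by
	      side. This is a minimal sketch under stated assumptions: the
	      module name re_split_sketch and the function demo/0 are
	      hypothetical, and the subjects are chosen only to keep the
	      results short.

```erlang
-module(re_split_sketch).
-export([demo/0]).

%% Illustrative only: module and function names are hypothetical.
demo() ->
    %% {parts,2}: splitting stops after the first separator.
    ["key","a=b"] = re:split("key=a=b", "=", [{return,list},{parts,2}]),
    %% group: each part is paired with the text its subexpression matched.
    [["a","1"],["b","22"],["c"]] =
        re:split("a1b22c", "([0-9]+)", [{return,list},group]),
    %% Trailing empty parts are kept by default and dropped by trim.
    ["a","b",[],[]] = re:split("a,b,,", ",", [{return,list}]),
    ["a","b"] = re:split("a,b,,", ",", [{return,list},trim]),
    %% binary return type: every part is a binary.
    [<<"a">>,<<"b">>] = re:split("a,b", ",", [{return,binary}]),
    ok.
```

	      The default return type iodata is not asserted here, because
	      its exact shape is not to be depended on.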

PERL-LIKE REGULAR EXPRESSION SYNTAX
       The  following  sections	contain	reference material for the regular ex-
       pressions used by this module. The information is  based	 on  the  PCRE
       documentation,  with  changes  where this module	behaves	differently to
       the PCRE	library.

PCRE REGULAR EXPRESSION	DETAILS
       The syntax and semantics	of the regular expressions supported  by  PCRE
       are  described  in detail in the	following sections. Perl's regular ex-
       pressions are described in its own documentation, and  regular  expres-
       sions in	general	are covered in many books, some	with copious examples.
       Jeffrey	Friedl's  "Mastering  Regular	Expressions",	published   by
       O'Reilly,  covers regular expressions in	great detail. This description
       of the PCRE regular expressions is intended as reference	material.

       The reference material is divided into the following sections:

	 * Special Start-of-Pattern Items

	 * Characters and Metacharacters

	 * Backslash

	 * Circumflex and Dollar

	 * Full	Stop (Period, Dot) and \N

	 * Matching a Single Data Unit

	 * Square Brackets and Character Classes

	 * Posix Character Classes

	 * Vertical Bar

	 * Internal Option Setting

	 * Subpatterns

	 * Duplicate Subpattern	Numbers

	 * Named Subpatterns

	 * Repetition

	 * Atomic Grouping and Possessive Quantifiers

	 * Back	References

	 * Assertions

	 * Conditional Subpatterns

	 * Comments

	 * Recursive Patterns

	 * Subpatterns as Subroutines

	 * Oniguruma Subroutine	Syntax

	 * Backtracking	Control

SPECIAL	START-OF-PATTERN ITEMS
       Some options that can be	passed to compile/2 can	also be	set by special
       items at	the start of a pattern.	These are not Perl-compatible, but are
       provided	to make	these options accessible to pattern  writers  who  are
       not  able  to change the	program	that processes the pattern. Any	number
       of these	items can appear, but they must	all be together	right  at  the
       start of	the pattern string, and	the letters must be in upper case.

       UTF Support

       Unicode	support	 is  basically UTF-8 based. To use Unicode characters,
       you either call compile/2 or run/3 with option unicode, or the  pattern
       must start with one of these special sequences:

       (*UTF8)
       (*UTF)

       Both sequences give the same effect: the input string is interpreted as
       UTF-8. Notice that with these instructions, the automatic conversion of
       lists  to  UTF-8	is not performed by the	re functions. Therefore, using
       these sequences is not recommended. Add	option	unicode	 when  running
       compile/2 instead.
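
       The recommended route, option unicode at compile time, can be sketched
       as follows. The module name re_unicode_sketch and the function demo/0
       are made up for illustration. The subject is the single character
       U+20AC, which occupies three bytes in UTF-8.

```erlang
-module(re_unicode_sketch).
-export([demo/0]).

%% Illustrative only: module and function names are hypothetical.
demo() ->
    Euro = [16#20AC],                       % one character, three UTF-8 bytes
    {ok, MP} = re:compile(".", [unicode]),
    %% index results count bytes of the UTF-8 encoding, not characters.
    {match,[{0,3}]} = re:run(Euro, MP),
    %% list results come back as Unicode code points.
    {match,[[16#20AC]]} = re:run(Euro, MP, [{capture,all,list}]),
    ok.
```

       The first result shows why a byte index tuple can cover fewer
       characters than its length suggests when unicode is in effect.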

       Some applications that allow their users to supply patterns may wish to
       restrict	them to	non-UTF	data for security reasons. If option never_utf
       is  set	at compile time, (*UTF), and so	on, are	not allowed, and their
       appearance causes an error.

       Unicode Property	Support

       The following is	another	special	sequence that can appear at the	 start
       of a pattern:

       (*UCP)

       This  has  the  same  effect as setting option ucp: it causes sequences
       such as \d and \w to use	 Unicode  properties  to  determine  character
       types,  instead of recognizing only characters with codes < 256 through
       a lookup	table.
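
       The effect of ucp can be seen on a character with a code point above
       255. This is a minimal sketch: the module name re_ucp_sketch and the
       function demo/0 are hypothetical, and the Greek letter alpha (U+03B1)
       is chosen only as an example of a letter outside the lookup table.

```erlang
-module(re_ucp_sketch).
-export([demo/0]).

%% Illustrative only: module and function names are hypothetical.
demo() ->
    Alpha = [16#3B1],                       % GREEK SMALL LETTER ALPHA
    %% Without ucp, code points above 255 never match \w.
    nomatch = re:run(Alpha, "\\w", [unicode, {capture,none}]),
    %% With ucp, Unicode properties decide what a word character is.
    match = re:run(Alpha, "\\w", [unicode, ucp, {capture,none}]),
    %% The start-of-pattern item requests the same behavior.
    match = re:run(Alpha, "(*UCP)\\w", [unicode, {capture,none}]),
    ok.
```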

       Disabling Startup Optimizations

       If a pattern starts with	(*NO_START_OPT), it has	 the  same  effect  as
       setting option no_start_optimize	at compile time.

       Newline Conventions

       PCRE supports five conventions for indicating line breaks in strings: a
       single CR (carriage return) character, a	single LF (line	feed)  charac-
       ter,  the  two-character	sequence CRLF, any of the three	preceding, and
       any Unicode newline sequence.

       A newline convention can	also be	specified by starting a	pattern	string
       with one	of the following five sequences:

	 (*CR):
	   Carriage return

	 (*LF):
	   Line	feed

	 (*CRLF):
	   Carriage return followed by line feed

	 (*ANYCRLF):
	   Any of the three above

	 (*ANY):
	   All Unicode newline sequences

       These  override the default and the options specified to	compile/2. For
       example,	the following pattern changes the convention to	CR:

       (*CR)a.b

       This pattern matches a\nb, as LF	is no longer a newline.	If  more  than
       one of them is present, the last	one is used.
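
       The (*CR) example can be checked directly, together with the
       equivalent {newline, cr} option. This is a minimal sketch: the module
       name re_newline_sketch and the function demo/0 are hypothetical.

```erlang
-module(re_newline_sketch).
-export([demo/0]).

%% Illustrative only: module and function names are hypothetical.
demo() ->
    %% Default convention (LF): the dot does not match "\n".
    nomatch = re:run("a\nb", "a.b", [{capture,none}]),
    %% With CR as the newline, LF is an ordinary character for the dot.
    match = re:run("a\nb", "(*CR)a.b", [{capture,none}]),
    match = re:run("a\nb", "a.b", [{newline,cr},{capture,none}]),
    %% Conversely, the dot now refuses to match CR.
    nomatch = re:run("a\rb", "(*CR)a.b", [{capture,none}]),
    ok.
```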

       The  newline  convention	affects	where the circumflex and dollar	asser-
       tions are true. It also affects the interpretation of the dot metachar-
       acter  when dotall is not set, and the behavior of \N. However, it does
       not affect what the \R escape sequence matches. By default, this	is any
       Unicode	newline	sequence, for Perl compatibility. However, this	can be
       changed;	see the	description of \R  in  section	Newline	 Sequences.  A
       change  of  the \R setting can be combined with a change	of the newline
       convention.

       Setting Match and Recursion Limits

       The caller of run/3 can set a limit on the number of times the internal
       match() function	is called and on the maximum depth of recursive	calls.
       These facilities	are provided to	catch runaway matches  that  are  pro-
       voked by	patterns with huge matching trees (a typical example is	a pat-
       tern with nested	unlimited repeats) and to avoid	running	out of	system
       stack  by  too  much  recursion.	 When  one of these limits is reached,
       pcre_exec() gives an error return. The limits can also be set by	 items
       at the start of the pattern of the following forms:

       (*LIMIT_MATCH=d)
       (*LIMIT_RECURSION=d)

       Here  d is any number of	decimal	digits.	However, the value of the set-
       ting must be less than the value	set by the caller of run/3 for	it  to
       have any	effect.	That is, the pattern writer can	lower the limit	set by
       the programmer, but not raise it. If there is more than one setting  of
       one of these limits, the	lower value is used.

       The  default  value for both the	limits is 10,000,000 in	the Erlang VM.
       Notice that the recursion limit does not	affect the stack depth of  the
       VM,  as	PCRE for Erlang	is compiled in such a way that the match func-
       tion never does recursion on the	C stack.

       Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value  of
       the limits set by the caller, not increase them.
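
       A runaway match of the kind described above can be provoked with
       nested unlimited repeats. The following is a minimal sketch, not a
       tuning recommendation: the module name re_limit_sketch, its function
       names, the subject length of 30, and the limit of 1000 are all
       assumptions made for illustration.

```erlang
-module(re_limit_sketch).
-export([demo/0, pattern_item/0]).

%% Illustrative only: module and function names are hypothetical.

%% Thirty "a" characters followed by "!" so that "(a+)+$" cannot match.
subject() -> lists:duplicate(30, $a) ++ "!".

demo() ->
    RE = "(a+)+$",
    %% Without report_errors, hitting the limit looks like a failed match.
    nomatch = re:run(subject(), RE, [{match_limit,1000}]),
    %% With report_errors, the reason is reported.
    {error,match_limit} =
        re:run(subject(), RE, [{match_limit,1000}, report_errors]),
    ok.

%% The same limit requested from inside the pattern. The result is
%% returned rather than asserted, so it can be inspected in the shell.
pattern_item() ->
    re:run(subject(), "(*LIMIT_MATCH=1000)(a+)+$", [report_errors]).
```

       Because the pattern item can only lower the effective limit, the
       value 1000 is chosen well below the default of 10,000,000.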

CHARACTERS AND METACHARACTERS
       A  regular  expression  is  a pattern that is matched against a subject
       string from left	to right. Most characters stand	for  themselves	 in  a
       pattern	and  match  the	 corresponding characters in the subject. As a
       trivial example,	the following pattern matches a	portion	of  a  subject
       string that is identical	to itself:

       The quick brown fox

       When  caseless  matching	 is  specified	(option	caseless), letters are
       matched independently of	case.

       The power of regular expressions	comes from the ability to include  al-
       ternatives  and	repetitions  in	 the pattern. These are	encoded	in the
       pattern by the use of metacharacters, which do not stand	for themselves
       but instead are interpreted in some special way.

       Two sets	of metacharacters exist: those that are	recognized anywhere in
       the pattern except within square	brackets, and those  that  are	recog-
       nized  within square brackets. Outside square brackets, the metacharac-
       ters are	as follows:

	 \:
	   General escape character with many uses

	 ^:
	   Assert start	of string (or line, in multiline mode)

	 $:
	   Assert end of string	(or line, in multiline mode)

	 .:
	   Match any character except newline (by default)

	 [:
	   Start character class definition

	 |:
	   Start of alternative	branch

	 (:
	   Start subpattern

	 ):
	   End subpattern

	 ?:
	   Extends the meaning of (, also 0 or 1 quantifier,  also  quantifier
	   minimizer

	 *:
	   0 or more quantifier

	 +:
	   1 or	more quantifier, also "possessive quantifier"

	 {:
	   Start min/max quantifier

       Part of a pattern within	square brackets	is called a "character class".
       The following are the only metacharacters in a character	class:

	 \:
	   General escape character

	 ^:
	   Negate the class, but only if the first character

	 -:
	   Indicates character range

	 [:
	   Posix character class (only if followed by Posix syntax)

	 ]:
	   Terminates the character class

       The following sections describe the use of each metacharacter.

BACKSLASH
       The backslash character has many	uses. First, if	it is  followed	 by  a
       character  that	is not a number	or a letter, it	takes away any special
       meaning that a character	can have. This use of backslash	as  an	escape
       character applies both inside and outside character classes.

       For  example,  if  you want to match a *	character, you write \*	in the
       pattern.	This escaping action applies if	the following character	 would
       otherwise  be  interpreted  as a	metacharacter, so it is	always safe to
       precede a non-alphanumeric with backslash to specify that it stands for
       itself. In particular, if you want to match a backslash,	write \\.

       In  unicode mode, only ASCII numbers and	letters	have any special mean-
       ing after a backslash. All other	characters (in particular, those whose
       code points are > 127) are treated as literals.

       If  a  pattern is compiled with option extended,	whitespace in the pat-
       tern (other than	in a character class) and characters between a #  out-
       side  a	character  class and the next newline are ignored. An escaping
       backslash can be	used to	include	a whitespace or	# character as part of
       the pattern.
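
       The effect of option extended can be sketched as follows. The module
       name re_extended_sketch and the function demo/0 are hypothetical.

```erlang
-module(re_extended_sketch).
-export([demo/0]).

%% Illustrative only: module and function names are hypothetical.
demo() ->
    RE = "a b   # whitespace and this comment are ignored",
    %% The pattern is effectively "ab".
    match = re:run("ab", RE, [extended, {capture,none}]),
    nomatch = re:run("a b", RE, [extended, {capture,none}]),
    %% An escaping backslash makes the space part of the pattern again.
    match = re:run("a b", "a\\ b", [extended, {capture,none}]),
    ok.
```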

       To  remove  the special meaning from a sequence of characters, put them
       between \Q and \E. This is different from Perl in that $	and @ are han-
       dled  as	 literals  in  \Q...\E	sequences in PCRE, while $ and @ cause
       variable	interpolation in Perl. Notice the following examples:

       Pattern		  PCRE matches	 Perl matches

       \Qabc$xyz\E	  abc$xyz	 abc followed by the contents of $xyz
       \Qabc\$xyz\E	  abc\$xyz	 abc\$xyz
       \Qabc\E\$\Qxyz\E	  abc$xyz	 abc$xyz

       The \Q...\E sequence is recognized both inside  and  outside  character
       classes.	An isolated \E that is not preceded by \Q is ignored. If \Q is
       not followed by \E later	in the	pattern,  the  literal	interpretation
       continues  to  the  end	of  the	pattern	(that is, \E is	assumed	at the
       end). If	the isolated \Q	is inside a character class,  this  causes  an
       error, as the character class is	not terminated.
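
       The \Q...\E behavior can be tried from Erlang as sketched below. The
       module name re_quote_sketch and the function demo/0 are hypothetical.
       Notice that each backslash is doubled in the Erlang string literals.

```erlang
-module(re_quote_sketch).
-export([demo/0]).

%% Illustrative only: module and function names are hypothetical.
demo() ->
    %% "$" is a literal between \Q and \E.
    match = re:run("abc$xyz", "\\Qabc$xyz\\E", [{capture,none}]),
    %% So is the dot: it no longer matches an arbitrary character.
    nomatch = re:run("abc", "\\Qa.c\\E", [{capture,none}]),
    match = re:run("a.c", "\\Qa.c\\E", [{capture,none}]),
    %% \Q without a closing \E quotes to the end of the pattern.
    match = re:run("1+1", "\\Q1+1", [{capture,none}]),
    ok.
```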

       Non-Printing Characters

       A second	use of backslash provides a way	of encoding non-printing char-
       acters in patterns in a visible manner. There is	no restriction on  the
       appearance  of non-printing characters, apart from the binary zero that
       terminates a pattern. When a pattern is prepared	by text	editing, it is
       often  easier to	use one	of the following escape	sequences than the bi-
       nary character it represents:

	 \a:
	   Alarm, that is, the BEL character (hex 07)

	 \cx:
	   "Control-x",	where x	is any ASCII character

	 \e:
	   Escape (hex 1B)

	 \f:
	   Form	feed (hex 0C)

	 \n:
	   Line	feed (hex 0A)

	 \r:
	   Carriage return (hex	0D)

	 \t:
	   Tab (hex 09)

	 \0dd:
	   Character with octal	code 0dd

	 \ddd:
	   Character with octal	code ddd, or back reference

	 \o{ddd..}:
	   Character with octal code ddd..

	 \xhh:
	   Character with hex code hh

	 \x{hhh..}:
	   Character with hex code hhh..

   Note:
       Note that \0dd is always	an octal code, and that	\8 and \9 are the lit-
       eral characters "8" and "9".

       The  precise effect of \cx on ASCII characters is as follows: if	x is a
       lowercase letter, it is converted to upper case.	 Then  bit  6  of  the
       character (hex 40) is inverted. Thus \cA	to \cZ become hex 01 to	hex 1A
       (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is	7B), and  \c;  becomes
       hex  7B (; is 3B). If the data item (byte or 16-bit value) following \c
       has a value > 127, a compile-time error occurs.	This  locks  out  non-
       ASCII characters	in all modes.

       The  \c	facility  was designed for use with ASCII characters, but with
       the extension to	Unicode	it is even less	useful than it once was.
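
       The bit arithmetic behind \cx can be worked through in Erlang. This
       is a minimal sketch: the module name re_control_sketch and the
       function demo/0 are hypothetical.

```erlang
-module(re_control_sketch).
-export([demo/0]).

%% Illustrative only: module and function names are hypothetical.
demo() ->
    %% \cA: "A" is hex 41; inverting bit 6 (hex 40) gives hex 01.
    16#01 = 16#41 bxor 16#40,
    match = re:run([16#01], "\\cA", [{capture,none}]),
    %% A lowercase letter is converted to upper case first.
    match = re:run([16#01], "\\ca", [{capture,none}]),
    %% \c; : ";" is hex 3B; inverting bit 6 gives hex 7B, which is "{".
    16#7B = 16#3B bxor 16#40,
    match = re:run("{", "\\c;", [{capture,none}]),
    ok.
```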

       After \0	up to two further octal	digits are read. If  there  are	 fewer
       than  two  digits,  just	 those that are	present	are used. Thus the se-
       quence \0\x\015 specifies two binary zeros followed by a	 CR  character
       (code value 13).	Make sure you supply two digits	after the initial zero
       if the pattern character	that follows is	itself an octal	digit.

       The escape \o must be followed by a sequence of octal digits,  enclosed
       in  braces.  An	error occurs if	this is	not the	case. This escape is a
       recent addition to Perl; it provides a way of specifying character code
       points  as  octal  numbers  greater than	0777, and it also allows octal
       numbers and back	references to be unambiguously specified.

       For greater clarity and unambiguity, it is best to avoid	following \ by
       a digit greater than zero. Instead, use \o{} or \x{} to specify charac-
       ter numbers, and	\g{} to	specify	back references. The  following	 para-
       graphs describe the old,	ambiguous syntax.

       The handling of a backslash followed by a digit other than 0 is compli-
       cated, and Perl has changed in recent releases, causing	PCRE  also  to
       change. Outside a character class, PCRE reads the digit and any follow-
       ing digits as a decimal number. If the number is	< 8, or	if there  have
       been  at	least that many	previous capturing left	parentheses in the ex-
       pression, the entire sequence is	taken as a back	reference. A  descrip-
       tion  of	 how this works	is provided later, following the discussion of
       parenthesized subpatterns.

       Inside a	character class, or if the decimal number following \ is  >  7
       and  there  have	not been that many capturing subpatterns, PCRE handles
       \8 and \9 as the	literal	characters "8" and "9",	and otherwise re-reads
       up to three octal digits following the backslash, and uses them to
       generate	a data character. Any subsequent digits	stand for  themselves.
       For example:

	 \040:
	   Another way of writing an ASCII space

	 \40:
	   The same, provided there are	< 40 previous capturing	subpatterns

	 \7:
	   Always a back reference

	 \11:
	   Can be a back reference, or another way of writing a	tab

	 \011:
	   Always a tab

	 \0113:
	   A tab followed by character "3"

	 \113:
	   Can	be  a  back reference, otherwise the character with octal code
	   113

	 \377:
	   Can be a back reference, otherwise value 255	(decimal)

	 \81:
	   Either a back reference, or the two characters "8" and "1"

       Notice that octal values	>= 100 that are	specified  using  this	syntax
       must  not  be introduced	by a leading zero, as no more than three octal
       digits are ever read.
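
       Some rows of the list above can be checked as sketched below, next to
       the unambiguous \g{} and \o{} forms. The module name
       re_octal_sketch and the function demo/0 are hypothetical.

```erlang
-module(re_octal_sketch).
-export([demo/0]).

%% Illustrative only: module and function names are hypothetical.
demo() ->
    %% \040 is always octal: an ASCII space.
    match = re:run(" ", "\\040", [{capture,none}]),
    %% \11 with fewer than 11 capturing groups is octal 11, a tab.
    match = re:run("\t", "\\11", [{capture,none}]),
    %% \0113 is a tab followed by the character "3".
    match = re:run("\t3", "\\0113", [{capture,none}]),
    %% \1 after one capturing group is a back reference.
    match = re:run("abab", "(ab)\\1", [{capture,none}]),
    nomatch = re:run("abac", "(ab)\\1", [{capture,none}]),
    %% The unambiguous spellings of the two meanings.
    match = re:run("abab", "(ab)\\g{1}", [{capture,none}]),
    match = re:run("\t", "\\o{11}", [{capture,none}]),
    ok.
```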

       By default, after \x that is not	followed by {, from zero to two	 hexa-
       decimal	digits	are  read (letters can be in upper or lower case). Any
       number of hexadecimal digits may	appear between \x{ and }. If a charac-
       ter  other  than	 a  hexadecimal	digit appears between \x{ and }, or if
       there is	no terminating }, an error occurs.

       Characters whose	value is less than 256 can be defined by either	of the
       two  syntaxes  for  \x. There is	no difference in the way they are han-
       dled. For example, \xdc is exactly the same as \x{dc}.
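
       The two \x syntaxes can be compared as follows. This is a minimal
       sketch: the module name re_hex_sketch and the function demo/0 are
       hypothetical, and the subject is a single byte with value hex DC.

```erlang
-module(re_hex_sketch).
-export([demo/0]).

%% Illustrative only: module and function names are hypothetical.
demo() ->
    Subject = [16#DC],
    %% Both syntaxes, in either letter case, denote the same character.
    match = re:run(Subject, "\\xdc", [{capture,none}]),
    match = re:run(Subject, "\\x{dc}", [{capture,none}]),
    match = re:run(Subject, "\\xDC", [{capture,none}]),
    %% At most two digits are read after a bare \x: \x412 is "A" then "2".
    match = re:run("A2", "^\\x412$", [{capture,none}]),
    ok.
```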

       Constraints on character	values

       Characters that are specified using octal or  hexadecimal  numbers  are
       limited to certain values, as follows:

	 8-bit non-UTF mode:
	   < 0x100

	 8-bit UTF-8 mode:
	   < 0x10ffff and a valid codepoint

       Invalid	Unicode	 codepoints  are  the  range 0xd800 to 0xdfff (the so-
       called "surrogate" codepoints), and 0xffef.

       Escape sequences	in character classes

       All the sequences that define a single character	value can be used both
       inside  and  outside character classes. Also, inside a character	class,
       \b is interpreted as the	backspace character (hex 08).

       \N is not allowed in a character	class. \B, \R, and \X are not  special
       inside  a  character  class.  Like other	unrecognized escape sequences,
       they are	treated	as the literal characters "B", "R", and	"X". Outside a
       character class,	these sequences	have different meanings.

       Unsupported Escape Sequences

       In  Perl, the sequences \l, \L, \u, and \U are recognized by its	string
       handler and used	to modify the case of following	characters. PCRE  does
       not support these escape	sequences.

       Absolute	and Relative Back References

       The  sequence  \g followed by an	unsigned or a negative number, option-
       ally enclosed in	braces,	is an absolute or relative back	 reference.  A
       named back reference can	be coded as \g{name}. Back references are dis-
       cussed later, following the discussion of parenthesized subpatterns.

       Absolute	and Relative Subroutine	Calls

       For compatibility with Oniguruma, the non-Perl syntax \g	followed by  a
       name or a number	enclosed either	in angle brackets or single quotes, is
       alternative syntax for referencing a subpattern as a "subroutine".  De-
       tails  are  discussed  later.  Notice  that  \g{...}  (Perl syntax) and
       \g<...> (Oniguruma syntax) are not synonymous. The  former  is  a  back
       reference and the latter	is a subroutine	call.
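
       The difference between a back reference and a subroutine call can be
       illustrated as follows. This is a sketch only; the module name
       backref_demo and the function demo/0 are illustrative assumptions.

```erlang
-module(backref_demo).
-export([demo/0]).

demo() ->
    Opts = [{capture, none}],
    [re:run("12-12", "(\\d\\d)-\\g{1}", Opts),    % back reference: same text "12"
     re:run("12-34", "(\\d\\d)-\\g{1}", Opts),    % "34" is not the text "12"
     re:run("12-34", "(\\d\\d)-\\g<1>", Opts),    % subroutine: reuses pattern \d\d
     re:run("12-12", "(\\d\\d)-\\g{-1}", Opts)].  % relative reference to group 1
```

       The expected result is [match,nomatch,match,match]: \g{1} requires
       the same text that group 1 captured, while \g<1> only requires text
       that matches the subpattern of group 1.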

       Generic Character Types

       Another use of backslash	is for specifying generic character types:

	 \d:
	   Any decimal digit

	 \D:
	   Any character that is not a decimal digit

	 \h:
	   Any horizontal whitespace character

	 \H:
	   Any character that is not a horizontal whitespace character

	 \s:
	   Any whitespace character

	 \S:
	   Any character that is not a whitespace character

	 \v:
	   Any vertical	whitespace character

	 \V:
	   Any character that is not a vertical	whitespace character

	 \w:
	   Any "word" character

	 \W:
	   Any "non-word" character

       There is	also the single	sequence \N, which matches a non-newline char-
       acter. This is the same as the "." metacharacter	 when  dotall  is  not
       set.  Perl  also	uses \N	to match characters by name, but PCRE does not
       support this.

       Each pair of lowercase and uppercase escape  sequences  partitions  the
       complete	 set of	characters into	two disjoint sets. Any given character
       matches one, and	only one, of each pair.	The sequences can appear  both
       inside  and outside character classes. They each	match one character of
       the appropriate type. If	the current matching point is at  the  end  of
       the subject string, all fail, as	there is no character to match.

       For compatibility with Perl, \s formerly did not match the VT character
       (code 11), which made it different from the POSIX "space" class.
       However,	 Perl  added VT	at release 5.18, and PCRE followed suit	at re-
       lease 8.34. The default \s characters are now HT	(9), LF	(10), VT (11),
       FF  (12),  CR (13), and space (32), which are defined as	white space in
       the "C" locale. This list may vary if locale-specific matching is  tak-
       ing  place. For example,	in some	locales	the "non-breaking space" char-
       acter (\xA0) is recognized as white space, and in others	the VT charac-
       ter is not.

       A  "word"  character is an underscore or	any character that is a	letter
       or a digit. By default, the definition of letters and  digits  is  con-
       trolled	by the PCRE low-valued character tables, in Erlang's case (and
       without option unicode),	the ISO	Latin-1	character set.

       By default, in unicode mode, characters with values > 255, that is, all
       characters  outside  the	ISO Latin-1 character set, never match \d, \s,
       or \w, and always match \D, \S, and \W. These  sequences	 retain	 their
       original	meanings from before UTF support was available,	mainly for ef-
       ficiency	reasons. However, if  option  ucp  is  set,  the  behavior  is
       changed	so  that  Unicode  properties  are used	to determine character
       types, as follows:

	 \d:
	   Any character that \p{Nd} matches (decimal digit)

	 \s:
	   Any character that \p{Z}, \h, or \v matches

	 \w:
	   Any character that \p{L} or \p{N} matches, plus underscore

       The uppercase escapes match the inverse sets of characters. Notice that
       \d matches only decimal digits, while \w	matches	any Unicode digit, any
       Unicode letter, and underscore. Notice also that	ucp affects \b and \B,
       as  they	are defined in terms of	\w and \W. Matching these sequences is
       noticeably slower when ucp is set.
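
       The effect of option ucp on \d can be sketched with a digit outside
       ISO Latin-1. The module name ucp_demo and the function demo/0 are
       illustrative assumptions.

```erlang
-module(ucp_demo).
-export([demo/0]).

demo() ->
    Digit = [16#0663],    % ARABIC-INDIC DIGIT THREE, general category Nd
    Opts = [{capture, none}],
    {re:run(Digit, "\\d", [unicode | Opts]),        % no ucp: \d ignores it
     re:run(Digit, "\\d", [unicode, ucp | Opts]),   % ucp: \d follows \p{Nd}
     re:run(Digit, "\\D", [unicode | Opts])}.       % no ucp: counts as non-digit
```

       The expected result is {nomatch,match,match}.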

       The sequences \h, \H, \v, and \V	are features that were added  to  Perl
       in  release  5.10. In contrast to the other sequences, which match only
       ASCII characters	by default, these  always  match  certain  high-valued
       code points, regardless of whether ucp is set.

       The following are the horizontal	space characters:

	 U+0009:
	   Horizontal tab (HT)

	 U+0020:
	   Space

	 U+00A0:
	   Non-break space

	 U+1680:
	   Ogham space mark

	 U+180E:
	   Mongolian vowel separator

	 U+2000:
	   En quad

	 U+2001:
	   Em quad

	 U+2002:
	   En space

	 U+2003:
	   Em space

	 U+2004:
	   Three-per-em	space

	 U+2005:
	   Four-per-em space

	 U+2006:
	   Six-per-em space

	 U+2007:
	   Figure space

	 U+2008:
	   Punctuation space

	 U+2009:
	   Thin	space

	 U+200A:
	   Hair	space

	 U+202F:
	   Narrow no-break space

	 U+205F:
	   Medium mathematical space

	 U+3000:
	   Ideographic space

       The following are the vertical space characters:

	 U+000A:
	   Line	feed (LF)

	 U+000B:
	   Vertical tab	(VT)

	 U+000C:
	   Form	feed (FF)

	 U+000D:
	   Carriage return (CR)

	 U+0085:
	   Next	line (NEL)

	 U+2028:
	   Line	separator

	 U+2029:
	   Paragraph separator

       In  8-bit,  non-UTF-8  mode, only the characters	with code points < 256
       are relevant.
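
       A short sketch of \h and \v in 8-bit (non-unicode) mode follows. The
       module name hv_space_demo and the function demo/0 are illustrative
       assumptions.

```erlang
-module(hv_space_demo).
-export([demo/0]).

demo() ->
    %% Tab, space, and non-break space (16#A0) are all horizontal space.
    Subject = [$a, $\t, $b, $\s, $c, 16#a0, $d],
    Fields = re:split(Subject, "\\h+", [{return, list}]),
    %% A line feed is vertical, not horizontal, space.
    {Fields,
     re:run("\n", "\\h", [{capture, none}]),
     re:run("\n", "\\v", [{capture, none}])}.
```

       The expected result is {["a","b","c","d"],nomatch,match}.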

       Newline Sequences

       Outside a character class, by default, the escape sequence  \R  matches
       any  Unicode  newline  sequence.	In non-UTF-8 mode, \R is equivalent to
       the following:

       (?>\r\n|\n|\x0b|\f|\r|\x85)

       This is an example of an "atomic group"; details are provided below.

       This particular group matches either the	two-character sequence CR fol-
       lowed by	LF, or one of the single characters LF (line feed, U+000A), VT
       (vertical tab, U+000B), FF (form	feed, U+000C),	CR  (carriage  return,
       U+000D),	 or  NEL  (next	 line,	U+0085). The two-character sequence is
       treated as a single unit	that cannot be split.

       In Unicode mode,	two more characters whose code points are  >  255  are
       added:  LS  (line  separator,  U+2028)  and  PS	(paragraph  separator,
       U+2029).	Unicode	character property support is  not  needed  for	 these
       characters to be	recognized.

       \R can be restricted to match only CR, LF, or CRLF (instead of the com-
       plete set of Unicode line endings) by setting option bsr_anycrlf	either
       at  compile time	or when	the pattern is matched.	(BSR is	an acronym for
       "backslash R".) This can	be made	the default when PCRE is built;	if so,
       the  other  behavior can	be requested through option bsr_unicode. These
       settings	can also be specified by starting a pattern string with	one of
       the following sequences:

	 (*BSR_ANYCRLF):
	   CR, LF, or CRLF only

	 (*BSR_UNICODE):
	   Any Unicode newline sequence

       These  override	the default and	the options specified to the compiling
       function, but they can themselves be overridden by options specified to
       a  matching function. Notice that these special settings, which are not
       Perl-compatible,	are recognized only at the very	start  of  a  pattern,
       and  that  they	must  be  in  upper  case. If more than	one of them is
       present,	the last one is	used. They can be combined with	 a  change  of
       newline convention; for example,	a pattern can start with:

       (*ANY)(*BSR_ANYCRLF)

       They  can  also be combined with	the (*UTF8), (*UTF), or	(*UCP) special
       sequences. Inside a character class, \R is treated as  an  unrecognized
       escape sequence,	and so matches the letter "R" by default.
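
       The following sketch compares the default \R with the two ways of
       restricting it described above. It assumes that the default in this
       build is to match any Unicode newline sequence; the module name
       bsr_demo and the function demo/0 are illustrative assumptions.

```erlang
-module(bsr_demo).
-export([demo/0]).

demo() ->
    %% CRLF, then a vertical tab ("\v" is VT in an Erlang string literal).
    Subject = "a\r\nb\vc",
    {re:split(Subject, "\\R", [{return, list}]),
     re:split(Subject, "\\R", [bsr_anycrlf, {return, list}]),
     re:split(Subject, "(*BSR_ANYCRLF)\\R", [{return, list}])}.
```

       With the default setting, both CRLF and VT are expected to act as
       separators, giving ["a","b","c"]. With bsr_anycrlf or (*BSR_ANYCRLF),
       only CRLF is expected to split the subject, giving ["a","b\vc"].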

       Unicode Character Properties

       Three more escape sequences that	match characters with specific proper-
       ties are	available. When	in 8-bit non-UTF-8 mode, these	sequences  are
       limited	to testing characters whose code points	are < 256, but they do
       work in this mode. The following	are the	extra escape sequences:

	 \p{xx}:
	   A character with property xx

	 \P{xx}:
	   A character without property	xx

	 \X:
	   A Unicode extended grapheme cluster

       The property names represented by xx above are limited to  the  Unicode
       script names, the general category properties, "Any", which matches any
       character (including newline), and some special	PCRE  properties  (de-
       scribed	in  the	next section). Other Perl properties, such as "InMusi-
       calSymbols", are	currently not supported	by PCRE. Notice	 that  \P{Any}
       does not	match any characters and always	causes a match failure.

       Sets of Unicode characters are defined as belonging to certain scripts.
       A character from	one of these sets can be matched using a script	 name,
       for example:

       \p{Greek} \P{Han}

       Those  that are not part	of an identified script	are lumped together as
       "Common". The following is the current list of scripts:

	 * Arabic

	 * Armenian

	 * Avestan

	 * Balinese

	 * Bamum

	 * Bassa_Vah

	 * Batak

	 * Bengali

	 * Bopomofo

	 * Braille

	 * Buginese

	 * Buhid

	 * Canadian_Aboriginal

	 * Carian

	 * Caucasian_Albanian

	 * Chakma

	 * Cham

	 * Cherokee

	 * Common

	 * Coptic

	 * Cuneiform

	 * Cypriot

	 * Cyrillic

	 * Deseret

	 * Devanagari

	 * Duployan

	 * Egyptian_Hieroglyphs

	 * Elbasan

	 * Ethiopic

	 * Georgian

	 * Glagolitic

	 * Gothic

	 * Grantha

	 * Greek

	 * Gujarati

	 * Gurmukhi

	 * Han

	 * Hangul

	 * Hanunoo

	 * Hebrew

	 * Hiragana

	 * Imperial_Aramaic

	 * Inherited

	 * Inscriptional_Pahlavi

	 * Inscriptional_Parthian

	 * Javanese

	 * Kaithi

	 * Kannada

	 * Katakana

	 * Kayah_Li

	 * Kharoshthi

	 * Khmer

	 * Khojki

	 * Khudawadi

	 * Lao

	 * Latin

	 * Lepcha

	 * Limbu

	 * Linear_A

	 * Linear_B

	 * Lisu

	 * Lycian

	 * Lydian

	 * Mahajani

	 * Malayalam

	 * Mandaic

	 * Manichaean

	 * Meetei_Mayek

	 * Mende_Kikakui

	 * Meroitic_Cursive

	 * Meroitic_Hieroglyphs

	 * Miao

	 * Modi

	 * Mongolian

	 * Mro

	 * Myanmar

	 * Nabataean

	 * New_Tai_Lue

	 * Nko

	 * Ogham

	 * Ol_Chiki

	 * Old_Italic

	 * Old_North_Arabian

	 * Old_Permic

	 * Old_Persian

	 * Oriya

	 * Old_South_Arabian

	 * Old_Turkic

	 * Osmanya

	 * Pahawh_Hmong

	 * Palmyrene

	 * Pau_Cin_Hau

	 * Phags_Pa

	 * Phoenician

	 * Psalter_Pahlavi

	 * Rejang

	 * Runic

	 * Samaritan

	 * Saurashtra

	 * Sharada

	 * Shavian

	 * Siddham

	 * Sinhala

	 * Sora_Sompeng

	 * Sundanese

	 * Syloti_Nagri

	 * Syriac

	 * Tagalog

	 * Tagbanwa

	 * Tai_Le

	 * Tai_Tham

	 * Tai_Viet

	 * Takri

	 * Tamil

	 * Telugu

	 * Thaana

	 * Thai

	 * Tibetan

	 * Tifinagh

	 * Tirhuta

	 * Ugaritic

	 * Vai

	 * Warang_Citi

	 * Yi
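
       As a sketch of matching by script name, the following module extracts
       a run of Greek letters from a mixed subject. Code points are written
       numerically to avoid depending on source file encoding; the module
       name script_demo and the function demo/0 are illustrative
       assumptions.

```erlang
-module(script_demo).
-export([demo/0]).

demo() ->
    %% "ab", then Greek alpha, beta, gamma, then "c".
    Subject = [$a, $b, 16#3b1, 16#3b2, 16#3b3, $c],
    re:run(Subject, "\\p{Greek}+", [unicode, {capture, first, list}]).
```

       The expected result is {match,[[945,946,947]]}, that is, the three
       Greek code points.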

       Each character has exactly one Unicode general category property, spec-
       ified  by  a  two-letter	acronym. For compatibility with	Perl, negation
       can be specified	by including a circumflex between  the	opening	 brace
       and the property	name. For example, \p{^Lu} is the same as \P{Lu}.

       If only one letter is specified with \p or \P, it includes all the gen-
       eral category properties	that start with	that letter. In	this case,  in
       the  absence of negation, the curly brackets in the escape sequence are
       optional. The following two examples have the same effect:

       \p{L}
       \pL

       The following general category property codes are supported:

	 C:
	   Other

	 Cc:
	   Control

	 Cf:
	   Format

	 Cn:
	   Unassigned

	 Co:
	   Private use

	 Cs:
	   Surrogate

	 L:
	   Letter

	 Ll:
	   Lowercase letter

	 Lm:
	   Modifier letter

	 Lo:
	   Other letter

	 Lt:
	   Title case letter

	 Lu:
	   Uppercase letter

	 M:
	   Mark

	 Mc:
	   Spacing mark

	 Me:
	   Enclosing mark

	 Mn:
	   Non-spacing mark

	 N:
	   Number

	 Nd:
	   Decimal number

	 Nl:
	   Letter number

	 No:
	   Other number

	 P:
	   Punctuation

	 Pc:
	   Connector punctuation

	 Pd:
	   Dash	punctuation

	 Pe:
	   Close punctuation

	 Pf:
	   Final punctuation

	 Pi:
	   Initial punctuation

	 Po:
	   Other punctuation

	 Ps:
	   Open	punctuation

	 S:
	   Symbol

	 Sc:
	   Currency symbol

	 Sk:
	   Modifier symbol

	 Sm:
	   Mathematical	symbol

	 So:
	   Other symbol

	 Z:
	   Separator

	 Zl:
	   Line	separator

	 Zp:
	   Paragraph separator

	 Zs:
	   Space separator

       The special property L& is also supported. It matches a character  that
       has  the	 Lu, Ll, or Lt property, that is, a letter that	is not classi-
       fied as a modifier or "other".

       The Cs (Surrogate) property applies only	to  characters	in  the	 range
       U+D800 to U+DFFF. Such characters are invalid in	Unicode	strings	and so
       cannot be tested	by PCRE. Perl does not support the Cs property.

       The long	synonyms for property names supported by Perl (such as \p{Let-
       ter})  are  not supported by PCRE. It is	not permitted to prefix	any of
       these properties	with "Is".

       No character in the Unicode table has  the  Cn  (unassigned)  property.
       This  property is instead assumed for any code point that is not	in the
       Unicode table.

       Specifying caseless matching does not affect  these  escape  sequences.
       For example, \p{Lu} always matches only uppercase letters. This is dif-
       ferent from the behavior	of current versions of Perl.

       Matching	characters by Unicode property is not fast, as PCRE must do  a
       multistage  table  lookup to find a character property. That is why the
       traditional escape sequences such as \d and \w do not use Unicode prop-
       erties  in PCRE by default. However, you	can make them do so by setting
       option ucp or by	starting the pattern with (*UCP).

       Extended	Grapheme Clusters

       The \X escape matches any number	of Unicode  characters	that  form  an
       "extended grapheme cluster", and	treats the sequence as an atomic group
       (see below). Up to and including	release	8.31, PCRE matched an earlier,
       simpler definition that was equivalent to (?>\PM\pM*). That is, it
       matched a character without the "mark" property,	followed  by  zero  or
       more  characters	 with  the "mark" property. Characters with the	"mark"
       property	are typically non-spacing accents that	affect	the  preceding
       character.

       This  simple definition was extended in Unicode to include more compli-
       cated kinds of composite	character by giving each character a  grapheme
       breaking	 property, and creating	rules that use these properties	to de-
       fine the	boundaries of extended grapheme	 clusters.  In	PCRE  releases
       later than 8.31,	\X matches one of these	clusters.

       \X  always  matches  at least one character. Then it decides whether to
       add more	characters according to	the following rules for	ending a clus-
       ter:

	 * End at the end of the subject string.

	 * Do not end between CR and LF; otherwise end after any control char-
	   acter.

	 * Do not break	Hangul (a Korean script)  syllable  sequences.	Hangul
	   characters  are of five types: L, V,	T, LV, and LVT.	An L character
	   can be followed by an L, V, LV, or LVT character. An	LV or V	 char-
	   acter  can be followed by a V or T character. An LVT	or T character
	   can be followed only	by a T character.

	 * Do not end before extending characters or spacing marks. Characters
	   with	the "mark" property always have	the "extend" grapheme breaking
	   property.

	 * Do not end after prepend characters.

	 * Otherwise, end the cluster.
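
       Two of the rules above can be sketched as follows: a base letter
       followed by a combining mark, and a CR LF pair. The module name
       grapheme_demo and the function demo/0 are illustrative assumptions.

```erlang
-module(grapheme_demo).
-export([demo/0]).

demo() ->
    Opts = [unicode, {capture, first, list}],
    {re:run([$e, 16#0301, $x], "\\X", Opts),   % "e" + COMBINING ACUTE ACCENT
     re:run("\r\nx", "\\X", Opts)}.            % CR LF is not split
```

       The first call is expected to return {match,[[101,769]]} and the
       second {match,["\r\n"]}; in both cases the trailing "x" starts a new
       cluster.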

       PCRE Additional Properties

       In addition to the standard Unicode properties described	earlier,  PCRE
       supports	 four more that	make it	possible to convert traditional	escape
       sequences, such as \w and \s, to use Unicode properties. PCRE uses these
       non-standard,  non-Perl	properties  internally	when the ucp option is
       passed. However,	they can also be used explicitly. The  properties  are
       as follows:

	 Xan:
	   Any alphanumeric character. Matches characters that have either the
	   L (letter) or the N (number)	property.

	 Xps:
	   Any Posix space character. Matches the characters tab,  line	 feed,
	   vertical  tab,  form	feed, carriage return, and any other character
	   that	has the	Z (separator) property.

	 Xsp:
	   Any Perl space character. Matches the same as Xps. Before PCRE
	   release 8.34, vertical tab was excluded.

	 Xwd:
	   Any Perl "word" character. Matches the same characters as Xan, plus
	   underscore.

       Perl and	POSIX space are	now the	same. Perl added VT to its space char-
       acter set at release 5.18 and PCRE changed at release 8.34.

       Xan  matches  characters	that have either the L (letter)	or the N (num-
       ber) property. Xps matches the characters tab, linefeed,	vertical  tab,
       form  feed,  or carriage	return,	and any	other character	that has the Z
       (separator) property. Xsp is the	same as	Xps; it	used to	exclude	verti-
       cal tab,	for Perl compatibility,	but Perl changed, and so PCRE followed
       at release 8.34.	Xwd matches the	same characters	as  Xan,  plus	under-
       score.

       There  is another non-standard property,	Xuc, which matches any charac-
       ter that	can be represented by a	Universal Character Name  in  C++  and
       other  programming  languages.  These are the characters	$, @, `	(grave
       accent),	and all	characters with	Unicode	code points >= U+00A0,	except
       for  the	 surrogates  U+D800  to	 U+DFFF. Notice	that most base (ASCII)
       characters are excluded.	(Universal Character Names  are	 of  the  form
       \uHHHH  or  \UHHHHHHHH, where H is a hexadecimal	digit. Notice that the
       Xuc property does not match these sequences  but	 the  characters  that
       they represent.)

       Resetting the Match Start

       The  escape sequence \K causes any previously matched characters	not to
       be included in the final	matched	sequence. For example,	the  following
       pattern matches "foobar", but reports that it has matched "bar":

       foo\Kbar

       This  feature  is  similar to a lookbehind assertion (described below).
       However,	in this	case, the part of the subject before  the  real	 match
       does  not  have to be of	fixed length, as lookbehind assertions do. The
       use of \K does not interfere with the setting of	 captured  substrings.
       For  example,  when  the	 following pattern matches "foobar", the first
       substring is still set to "foo":

       (foo)\Kbar

       Perl documents that the use of \K within	assertions is  "not  well  de-
       fined". In PCRE,	\K is acted upon when it occurs	inside positive	asser-
       tions, but is ignored in	negative assertions. Note that when a  pattern
       such  as	 (?=ab\K)  matches,  the  reported  start  of the match	can be
       greater than the	end of the match.
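
       The effect of \K on the reported match, on captured substrings, and
       on replacement can be sketched as follows. The module name
       reset_start_demo and the function demo/0 are illustrative
       assumptions.

```erlang
-module(reset_start_demo).
-export([demo/0]).

demo() ->
    All = [{capture, all, list}],
    {re:run("foobar", "foo\\Kbar", All),      % whole match reported as "bar"
     re:run("foobar", "(foo)\\Kbar", All),    % group 1 is still "foo"
     %% Only the part after \K is replaced.
     re:replace("foobar", "foo\\Kbar", "BAZ", [{return, list}])}.
```

       The expected result is
       {{match,["bar"]},{match,["bar","foo"]},"fooBAZ"}.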

       Simple Assertions

       The final use of	backslash is for certain simple	assertions. An	asser-
       tion  specifies a condition that	must be	met at a particular point in a
       match, without consuming	any characters from the	 subject  string.  The
       use  of subpatterns for more complicated	assertions is described	below.
       The following are the backslashed assertions:

	 \b:
	   Matches at a	word boundary.

	 \B:
	   Matches when	not at a word boundary.

	 \A:
	   Matches at the start	of the subject.

	 \Z:
	   Matches at the end of the subject, and before a newline at the  end
	   of the subject.

	 \z:
	   Matches only	at the end of the subject.

	 \G:
	   Matches at the first	matching position in the subject.

       Inside  a  character  class, \b has a different meaning;	it matches the
       backspace character. If any other of  these  assertions	appears	 in  a
       character  class, by default it matches the corresponding literal char-
       acter (for example, \B matches the letter B).

       A word boundary is a position in	the subject string where  the  current
       character  and  the previous character do not both match	\w or \W (that
       is, one matches \w and the other	matches	\W), or	the start  or  end  of
       the  string if the first	or last	character matches \w, respectively. In
       UTF mode, the meanings of \w and	\W can be changed  by  setting	option
       ucp. When this is done, it also affects \b and \B. PCRE and Perl	do not
       have a separate "start of word" or "end of word"	metasequence. However,
       whatever	 follows  \b normally determines which it is. For example, the
       fragment	\ba matches "a"	at the start of	a word.

       The \A, \Z, and \z assertions differ from  the  traditional  circumflex
       and dollar (described in	the next section) in that they only ever match
       at the very start and end of the	subject	string,	whatever  options  are
       set.  Thus,  they are independent of multiline mode. These three	asser-
       tions are not affected by options notbol	or noteol, which  affect  only
       the  behavior  of the circumflex	and dollar metacharacters. However, if
       argument	startoffset of run/3 is	non-zero, indicating that matching  is
       to  start  at  a	 point other than the beginning	of the subject,	\A can
       never match. The	difference between \Z and \z is	that \Z	matches	before
       a  newline  at  the  end	 of  the  string and at	the very end, while \z
       matches only at the end.
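
       The following sketch contrasts \Z with \z, and \A with circumflex.
       The module name anchor_demo and the function demo/0 are illustrative
       assumptions.

```erlang
-module(anchor_demo).
-export([demo/0]).

demo() ->
    N = [{capture, none}],
    [re:run("abc\n", "abc\\Z", N),                  % \Z allows a final newline
     re:run("abc\n", "abc\\z", N),                  % \z does not
     re:run("x\nabc", "\\Aabc", [multiline | N]),   % \A ignores multiline
     re:run("x\nabc", "^abc", [multiline | N]),     % ^ matches after a newline
     re:run("xabc", "\\Aabc", [{offset, 1} | N])].  % non-zero offset: \A fails
```

       The expected result is [match,nomatch,nomatch,match,nomatch].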

       The \G assertion	is true	only when the current matching position	is  at
       the  start  point of the	match, as specified by argument	startoffset of
       run/3. It differs from \A when the value	of startoffset is non-zero. By
       calling	run/3 multiple times with appropriate arguments, you can mimic
       the Perl	option /g, and it is in	this kind of implementation  where  \G
       can be useful.
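
       A minimal sketch of such a loop follows. It collects comma-separated
       numbers from the start of a flat string and stops at the first
       position where the pattern no longer matches. The module name
       g_anchor_demo and the function tokens/1 are illustrative
       assumptions.

```erlang
-module(g_anchor_demo).
-export([tokens/1]).

%% Subject must be a flat string, because lists:sublist/3 is used on it.
tokens(Subject) ->
    tokens(Subject, 0, []).

tokens(Subject, Offset, Acc) ->
    %% \G pins each attempt to Offset, so the scan cannot skip ahead.
    case re:run(Subject, "\\G(\\d+),?", [{offset, Offset}]) of
        {match, [{Start, Len}, {NumStart, NumLen}]} ->
            Num = lists:sublist(Subject, NumStart + 1, NumLen),
            tokens(Subject, Start + Len, [Num | Acc]);
        nomatch ->
            lists:reverse(Acc)
    end.
```

       For example, g_anchor_demo:tokens("12,345,6 7,8") is expected to
       return ["12","345","6"]. Without \G, the fourth call would skip the
       space and continue with "7".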

       Notice,	however,  that	the PCRE interpretation	of \G, as the start of
       the current match, is subtly different from Perl, which defines	it  as
       the end of the previous match. In Perl, these can be different when the
       previously matched string was empty. As PCRE does only one match	 at  a
       time, it	cannot reproduce this behavior.

       If  all	the alternatives of a pattern begin with \G, the expression is
       anchored	to the starting	match position,	and the	"anchored" flag	is set
       in the compiled regular expression.

CIRCUMFLEX AND DOLLAR
       The  circumflex	and  dollar  metacharacters are	zero-width assertions.
       That is,	they test for a	particular condition to	be true	 without  con-
       suming any characters from the subject string.

       Outside a character class, in the default matching mode,	the circumflex
       character is an assertion that is true only  if	the  current  matching
       point is	at the start of	the subject string. If argument	startoffset of
       run/3 is	non-zero, circumflex can never match if	 option	 multiline  is
       unset.  Inside  a character class, circumflex has an entirely different
       meaning (see below).

       Circumflex need not be the first character of the pattern if several
       alternatives are involved, but it must be the first thing in each
       alternative in which it appears if the pattern is ever to match that
       branch. If all possible alternatives start with a circumflex, that is,
       if the pattern is constrained to	match only at the start	 of  the  sub-
       ject,  it  is  said  to be an "anchored"	pattern. (There	are also other
       constructs that can cause a pattern to be anchored.)

       The dollar character is an assertion that is true only if the current
       matching point is at the end of the subject string, or immediately
       before a newline at the end of the string (by default). Notice,
       however, that it does not match the newline. Dollar need not be the
       last character of the pattern if several alternatives are involved,
       but it must be the last item in any branch in which it appears. Dollar
       has no special meaning in a character class.

       The meaning of dollar can be changed so that it	matches	 only  at  the
       very  end  of  the  string, by setting option dollar_endonly at compile
       time. This does not affect the \Z assertion.

       The meanings of the circumflex and dollar characters are	changed	if op-
       tion  multiline is set. When this is the	case, a	circumflex matches im-
       mediately after internal	newlines and  at  the  start  of  the  subject
       string.	It does	not match after	a newline that ends the	string.	A dol-
       lar matches before any newlines in the string, and  at  the  very  end,
       when  multiline	is set.	When newline is	specified as the two-character
       sequence	CRLF, isolated CR and LF characters do not indicate newlines.

       For example, the	pattern	/^abc$/	matches	the subject string  "def\nabc"
       (where  \n  represents a	newline) in multiline mode, but	not otherwise.
       So, patterns that are anchored in single-line mode because all branches
       start  with  ^ are not anchored in multiline mode, and a	match for cir-
       cumflex is possible when	argument startoffset of	run/3 is non-zero. Op-
       tion dollar_endonly is ignored if multiline is set.
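
       The behavior of circumflex and dollar under options multiline and
       dollar_endonly can be sketched as follows. The module name
       circumflex_dollar_demo and the function demo/0 are illustrative
       assumptions.

```erlang
-module(circumflex_dollar_demo).
-export([demo/0]).

demo() ->
    N = [{capture, none}],
    {re:run("def\nabc", "^abc$"),                           % single-line mode
     re:run("def\nabc", "^abc$", [multiline]),              % ^ after newline
     re:run("abc\n", "abc$", N),                            % $ before final \n
     re:run("abc\n", "abc$", [dollar_endonly | N]),         % $ only at very end
     re:run("def\nabc", "^abc", [multiline, {offset, 2}])}. % non-zero offset
```

       The expected result is
       {nomatch,{match,[{4,3}]},match,nomatch,{match,[{4,3}]}}.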

       Notice that the sequences \A, \Z, and \z	can be used to match the start
       and end of the subject in both modes. If	 all  branches	of  a  pattern
       start with \A, it is always anchored, regardless of whether multiline
       is set.

FULL STOP (PERIOD, DOT)	AND \N
       Outside	a  character class, a dot in the pattern matches any character
       in the subject string except (by	default) a  character  that  signifies
       the end of a line.

       When  a line ending is defined as a single character, dot never matches
       that character. When the	two-character sequence CRLF is used, dot  does
       not  match CR if	it is immediately followed by LF, otherwise it matches
       all characters (including isolated CRs and LFs).	When any Unicode  line
       endings	are recognized,	dot does not match CR, LF, or any of the other
       line-ending characters.

       The behavior of dot regarding newlines can be changed. If option	dotall
       is  set,	 a  dot	 matches any character,	without	exception. If the two-
       character sequence CRLF is present in the subject string, it takes  two
       dots to match it.

       The handling of dot is entirely independent of the handling of
       circumflex and dollar; the only relationship is that both involve
       newlines. Dot has no special meaning in a character class.

       The escape sequence \N behaves like a dot, except that it is not
       affected by option dotall. That is, it matches any character except
       one that signifies the end of a line. Perl also uses \N to match
       characters by name, but PCRE does not support this.
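
       The following sketch compares dot and \N with and without option
       dotall, assuming the default newline setting (LF) unless stated
       otherwise. The module name dot_demo and the function demo/0 are
       illustrative assumptions.

```erlang
-module(dot_demo).
-export([demo/0]).

demo() ->
    N = [{capture, none}],
    {re:run("\n", ".", N),                     % dot skips the newline
     re:run("\n", ".", [dotall | N]),          % dotall: dot matches it
     re:run("\n", "\\N", [dotall | N]),        % \N is unaffected by dotall
     re:run("a\r\nb", "a..b", [dotall | N]),   % CRLF needs two dots
     re:run("\r", ".", [{newline, crlf} | N])}. % isolated CR is ordinary
```

       The expected result is {nomatch,match,nomatch,match,match}.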

MATCHING A SINGLE DATA UNIT
       Outside a character class, the escape sequence \C matches any data
       unit, regardless of whether a UTF mode is set. One data unit is one
       byte. Unlike a dot, \C always matches line-ending characters. The
       feature is provided in Perl to match individual bytes in UTF-8 mode,
       but it is unclear how it can usefully be used. As \C breaks up
       characters into individual data units, matching one unit with \C in a
       UTF mode means that the remaining string can start with a malformed
       UTF character. This has undefined results, as PCRE assumes that it
       deals with valid UTF strings.

       PCRE does not allow \C to appear	in  lookbehind	assertions  (described
       below) in a UTF mode, as	this would make	it impossible to calculate the
       length of the lookbehind.

       The \C escape sequence is best avoided. However,	one way	 of  using  it
       that  avoids  the problem of malformed UTF characters is	to use a look-
       ahead to	check the length of the	next character,	as  in	the  following
       pattern,	 which	can be used with a UTF-8 string	(ignore	whitespace and
       line breaks):

       (?| (?=[\x00-\x7f])(\C) |
	   (?=[\x80-\x{7ff}])(\C)(\C) |
	   (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
	   (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))

       A group that starts with	(?| resets the capturing  parentheses  numbers
       in each alternative (see	section	Duplicate Subpattern Numbers). The as-
       sertions	at the start of	each branch check the next UTF-8 character for
       values whose encoding uses 1, 2,	3, or 4	bytes, respectively. The indi-
       vidual bytes of the character are then captured by the appropriate num-
       ber of groups.

SQUARE BRACKETS	AND CHARACTER CLASSES
       An opening square bracket introduces a character	class, terminated by a
       closing square bracket. A closing square	bracket	on its own is not spe-
       cial  by	 default.  However, if option PCRE_JAVASCRIPT_COMPAT is	set, a
       lone closing square bracket causes a compile-time error. If a closing
       square bracket is required as a member of the class, it must be the
       first data character in the class (after an initial circumflex, if
       present) or escaped with a backslash.

       A  character  class matches a single character in the subject. In a UTF
       mode, the character can be more than one	 data  unit  long.  A  matched
       character must be in the	set of characters defined by the class,	unless
       the first character in the class	definition is a	circumflex,  in	 which
       case the	subject	character must not be in the set defined by the	class.
       If a circumflex is required as a	member of the class, ensure that it is
       not the first character,	or escape it with a backslash.

       For  example,  the character class [aeiou] matches any lowercase	vowel,
       while [^aeiou] matches any character that is not	a lowercase vowel. No-
       tice that a circumflex is just a	convenient notation for	specifying the
       characters that are in the class	by enumerating those that are  not.  A
       class  that starts with a circumflex is not an assertion; it still con-
       sumes a character from the subject string, and therefore	 it  fails  if
       the current pointer is at the end of the	string.
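
       The points above can be sketched as follows, including a class in
       which "]" and "^" are ordinary members. The module name
       class_basics_demo and the function demo/0 are illustrative
       assumptions.

```erlang
-module(class_basics_demo).
-export([demo/0]).

demo() ->
    {re:run("sky", "[aeiou]"),    % no lowercase vowel in the subject
     re:run("sky", "[^aeiou]"),   % "s" is not a vowel
     re:run("", "[^aeiou]"),      % a negated class still needs a character
     re:run("a]b", "[]^]")}.      % "]" first and "^" not first: both literal
```

       The expected result is
       {nomatch,{match,[{0,1}]},nomatch,{match,[{1,1}]}}.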

       In UTF-8 mode, characters with values > 255 can be included in a
       class as a literal string of data units, or by using the \x{ escaping
       mechanism.

       When  caseless  matching	 is set, any letters in	a class	represent both
       their uppercase and lowercase versions. For example, a caseless [aeiou]
       matches	"A" and	"a", and a caseless [^aeiou] does not match "A", but a
       caseful version would. In a UTF mode, PCRE always understands the  con-
       cept  of	case for characters whose values are < 256, so caseless	match-
       ing is always possible. For characters with higher values, the  concept
       of  case	 is  supported	only if	PCRE is	compiled with Unicode property
       support. If you want to use caseless matching in a UTF mode for
       characters with such higher values, ensure that PCRE is compiled with
       Unicode property support and with UTF support.

       Characters that can indicate line breaks are never treated in any
       special way when matching character classes, whatever line-ending
       sequence is in use, and whatever setting of options dotall and
       multiline is used. A class such as [^a] always matches one of these
       characters.

       The minus (hyphen) character can	be used	to specify a range of  charac-
       ters  in	 a  character class. For example, [d-m]	matches	any letter be-
       tween d and m, inclusive. If a minus character is required in a	class,
       it  must	 be  escaped with a backslash or appear	in a position where it
       cannot be interpreted as	indicating a range, typically as the first  or
       last character in the class, or immediately after a range. For example,
       [b-d-z] matches letters in the range b to d, a hyphen character,	or z.

       The literal character "]" cannot	be the end character  of  a  range.  A
       pattern	such  as  [W-]46]  is interpreted as a class of	two characters
       ("W" and	"-") followed by a literal string "46]",  so  it  would	 match
       "W46]"  or  "-46]".  However, if	"]" is escaped with a backslash, it is
       interpreted as the end of range,	so [W-\]46] is interpreted as a	 class
       containing a range followed by two other	characters. The	octal or hexa-
       decimal representation of "]" can also be used to end a range.
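
       The difference between the two patterns can be sketched as follows.
       The subjects are invented for illustration; note the doubled backslash
       required by the Erlang string syntax:

```erlang
Opts = [{capture, all, list}].
%% Unescaped "]" closes the class early: [W-] followed by literal "46]".
R1 = re:run("-46]", "[W-]46]", Opts).
R2 = re:run("Z", "[W-]46]", Opts).
%% Escaped "]" ends the range W-], which includes "Z"; "4" and "6" follow.
R3 = re:run("Z", "[W-\\]46]", Opts).
%% R1 =:= {match, ["-46]"]}, R2 =:= nomatch, R3 =:= {match, ["Z"]}
```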

       An error	is generated if	a POSIX	character class	(see below) or an  es-
       cape sequence other than	one that defines a single character appears at
       a point where a	range  ending  character  is  expected.	 For  example,
       [z-\xff]	is valid, but [A-\d] and [A-[:digit:]] are not.
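
       A sketch of how these cases surface through compile/1, assuming the
       PCRE behavior described above. The exact error message and position
       are not shown, as they can differ between releases:

```erlang
%% A single-character escape may end a range.
{ok, _MP} = re:compile("[z-\\xff]").
%% A character type or a POSIX class may not; both calls return an
%% {error, {Message, Position}} tuple instead of a compiled pattern.
E1 = re:compile("[A-\\d]").
E2 = re:compile("[A-[:digit:]]").
```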

       Ranges  operate in the collating	sequence of character values. They can
       also  be	 used  for  characters	specified  numerically,	 for  example,
       [\000-\037].  Ranges  can include any characters	that are valid for the
       current mode.

       If a range that includes	letters	is used	when caseless matching is set,
       it matches the letters in either	case. For example, [W-c] is equivalent
       to [][\\^_`wxyzabc], matched caselessly.	In a non-UTF mode, if  charac-
       ter tables for a	French locale are in use, [\xc8-\xcb] matches accented
       E characters in both cases. In UTF modes, PCRE supports the concept  of
       case  for  characters  with  values > 255 only when it is compiled with
       Unicode property	support.

       The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,  \V,
       \w, and \W can appear in	a character class, and add the characters that
       they match to the class.	For example, [\dABCDEF]	matches	any  hexadeci-
       mal  digit. In UTF modes, option	ucp affects the	meanings of \d,	\s, \w
       and their uppercase partners, just as it	does when they appear  outside
       a character class, as described in section Generic Character Types ear-
       lier. The escape	sequence \b has	a different meaning inside a character
       class;  it  matches  the	backspace character. The sequences \B, \N, \R,
       and \X are not special inside a character class.	Like any other	unrec-
       ognized	escape	sequences,  they are treated as	the literal characters
       "B", "N", "R", and "X".

       A circumflex can	conveniently be	 used  with  the  uppercase  character
       types  to specify a more	restricted set of characters than the matching
       lowercase type. For example, class [^\W_] matches any letter or	digit,
       but  not	underscore, while [\w] includes	underscore. A positive charac-
       ter class is to be read as "something OR	something OR ..." and a	 nega-
       tive class as "NOT something AND	NOT something AND NOT ...".
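
       For example, the following sketch compares the two classes on an
       invented subject, "a_1":

```erlang
Opts = [global, {capture, all, list}].
%% "NOT a non-word character AND NOT underscore": letters and digits only.
{match, NoUnderscore} = re:run("a_1", "[^\\W_]", Opts).
%% \w alone also accepts the underscore.
{match, WithUnderscore} = re:run("a_1", "[\\w]", Opts).
%% NoUnderscore =:= [["a"], ["1"]]
%% WithUnderscore =:= [["a"], ["_"], ["1"]]
```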

       Only the	following metacharacters are recognized	in character classes:

	 * Backslash

	 * Hyphen (only	where it can be	interpreted as specifying a range)

	 * Circumflex (only at the start)

	 * Opening  square  bracket (only when it can be interpreted as	intro-
	   ducing a Posix class	name, or for a special compatibility  feature;
	   see the next	two sections)

	 * Terminating closing square bracket

       However,	escaping other non-alphanumeric	characters does	no harm.

POSIX CHARACTER	CLASSES
       Perl supports the Posix notation	for character classes. This uses names
       enclosed	by [: and :] within the	enclosing square brackets.  PCRE  also
       supports	 this  notation.  For example, the following matches "0", "1",
       any alphabetic character, or "%":

       [01[:alpha:]%]

       The following are the supported class names:

	 alnum:
	   Letters and digits

	 alpha:
	   Letters

	 ascii:
	   Character codes 0-127

	 blank:
	   Space or tab	only

	 cntrl:
	   Control characters

	 digit:
	   Decimal digits (same	as \d)

	 graph:
	   Printing characters,	excluding space

	 lower:
	   Lowercase letters

	 print:
	   Printing characters,	including space

	 punct:
	   Printing characters,	excluding letters, digits, and space

	 space:
	   Whitespace (the same	as \s from PCRE	8.34)

	 upper:
	   Uppercase letters

	 word:
	   "Word" characters (same as \w)

	 xdigit:
	   Hexadecimal digits

       The default "space" characters are HT (9), LF (10), VT (11),  FF	 (12),
       CR  (13),  and space (32). If locale-specific matching is taking	place,
       the list	of space characters may	be different; there may	 be  fewer  or
       more of them. "Space" used to be	different to \s, which did not include
       VT, for Perl compatibility. However, Perl changed at release 5.18,  and
       PCRE followed at	release	8.34. "Space" and \s now match the same	set of
       characters.

       The name	"word" is a Perl extension, and	"blank"	 is  a	GNU  extension
       from  Perl  5.8.	Another	Perl extension is negation, which is indicated
       by a ^ character	after the colon. For example,  the  following  matches
       "1", "2", or any	non-digit:

       [12[:^digit:]]
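
       A short sketch of both Posix class examples, run with option global
       against invented subjects:

```erlang
Opts = [global, {capture, all, list}].
%% "7" is not 0, 1, alphabetic, or "%", so only "%" and "x" match.
{match, A} = re:run("7%x", "[01[:alpha:]%]", Opts).
%% "3" is a digit other than 1 or 2, so it is the only character skipped.
{match, B} = re:run("123a", "[12[:^digit:]]", Opts).
%% A =:= [["%"], ["x"]]
%% B =:= [["1"], ["2"], ["a"]]
```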

       PCRE (and Perl) also recognize the Posix	syntax [.ch.] and [=ch=] where
       "ch" is a "collating element", but these	are not	supported, and an  er-
       ror is given if they are	encountered.

       By default, characters with values > 255 do not match any of the Posix
       character classes. However, if option ucp is set at compile time, some
       of the classes are changed so that Unicode character properties are
       used. This is achieved by replacing certain Posix classes by other
       sequences, as follows:

	 [:alnum:]:
	   Becomes \p{Xan}

	 [:alpha:]:
	   Becomes \p{L}

	 [:blank:]:
	   Becomes \h

	 [:digit:]:
	   Becomes \p{Nd}

	 [:lower:]:
	   Becomes \p{Ll}

	 [:space:]:
	   Becomes \p{Xps}

	 [:upper:]:
	   Becomes \p{Lu}

	 [:word:]:
	   Becomes \p{Xwd}

       Negated versions, such as [:^alpha:], use \P instead of \p. Three other
       POSIX classes are handled specially in UCP mode:

	 [:graph:]:
	   This	matches	characters that	have glyphs that mark  the  page  when
	   printed.  In	Unicode	property terms,	it matches all characters with
	   the L, M, N,	P, S, or Cf properties,	except for:

	   U+061C:
	     Arabic Letter Mark

	   U+180E:
	     Mongolian Vowel Separator

	   U+2066 - U+2069:
	     Various "isolate"s

	 [:print:]:
	   This	matches	the same characters as [:graph:] plus space characters
	   that	are not	controls, that is, characters with the Zs property.

	 [:punct:]:
	   This	 matches  all characters that have the Unicode P (punctuation)
	   property, plus those	characters whose code points are less than 128
	   that	have the S (Symbol) property.

       The  other  POSIX classes are unchanged,	and match only characters with
       code points less	than 128.
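
       The effect of option ucp on a Posix class can be sketched as follows.
       The subject, a single Greek small letter alpha (U+03B1), is chosen
       only because its value is > 255:

```erlang
Alpha = [16#3B1].   % one code point, U+03B1
%% Without ucp, a character with a value > 255 matches no Posix class.
R1 = re:run(Alpha, "[[:alpha:]]", [unicode, {capture, none}]).
%% With ucp, [:alpha:] behaves as \p{L}, which includes U+03B1.
R2 = re:run(Alpha, "[[:alpha:]]", [unicode, ucp, {capture, none}]).
%% R1 =:= nomatch, R2 =:= match
```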

       Compatibility Feature for Word Boundaries

       In the POSIX.2 compliant	library	that was included in 4.4BSD Unix,  the
       ugly  syntax  [[:<:]]  and [[:>:]] is used for matching "start of word"
       and "end	of word". PCRE treats these items as follows:

	 [[:<:]]:
	   is converted	to \b(?=\w)

	 [[:>:]]:
	   is converted	to \b(?<=\w)

       Only these exact	character sequences are	recognized. A sequence such as
       [a[:<:]b] provokes an error for an unrecognized POSIX class name.  This
       support is not compatible with Perl. It is provided to help  migrations
       from other environments,	and is best not	used in	any new	patterns. Note
       that \b matches at the start and	the end	of a word (see "Simple	asser-
       tions"  above),	and in a Perl-style pattern the	preceding or following
       character normally shows	which is wanted, without the need for the  as-
       sertions	 that are used above in	order to give exactly the POSIX	behav-
       iour.

VERTICAL BAR
       Vertical	bar characters are used	to separate alternative	patterns.  For
       example,	the following pattern matches either "gilbert" or "sullivan":

       gilbert|sullivan

       Any number of alternatives can appear, and an empty alternative is per-
       mitted (matching	the empty string). The matching	process	tries each al-
       ternative  in  turn, from left to right,	and the	first that succeeds is
       used. If	the alternatives are within a subpattern (defined  in  section
       Subpatterns),  "succeeds" means matching	the remaining main pattern and
       the alternative in the subpattern.
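
       A minimal sketch of the left-to-right rule. The second pattern and
       its subject are invented for illustration:

```erlang
Opts = [{capture, all, list}].
R1 = re:run("sullivan", "gilbert|sullivan", Opts).
%% Alternatives are tried left to right; "a" succeeds first, so the
%% longer alternatives are never needed.
R2 = re:run("abc", "a|ab|abc", Opts).
%% R1 =:= {match, ["sullivan"]}, R2 =:= {match, ["a"]}
```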

INTERNAL OPTION	SETTING
       The  settings  of  the  Perl-compatible	options	 caseless,  multiline,
       dotall,	and  extended  can be changed from within the pattern by a se-
       quence of Perl option letters enclosed between "(?" and ")". The	option
       letters are as follows:

	 i:
	   For caseless

	 m:
	   For multiline

	 s:
	   For dotall

	 x:
	   For extended

       For example, (?im) sets caseless, multiline matching. These options can
       also be unset by	preceding the letter with a hyphen. A combined setting
       and  unsetting  such  as	 (?im-sx),  which sets caseless	and multiline,
       while unsetting dotall and extended, is also permitted. If a letter ap-
       pears both before and after the hyphen, the option is unset.

       The  PCRE-specific options dupnames, ungreedy, and extra	can be changed
       in the same way as the Perl-compatible options by using the  characters
       J, U, and X respectively.

       When  one of these option changes occurs	at top-level (that is, not in-
       side subpattern parentheses), the change	applies	to  the	 remainder  of
       the pattern that	follows.

       An  option change within	a subpattern (see section Subpatterns) affects
       only that part of the subpattern	that follows  it.  So,	the  following
       matches	abc  and  aBc  and  no other strings (assuming caseless	is not
       used):

       (a(?i)b)c

       By this means, options can be made to have different settings  in  dif-
       ferent  parts  of  the  pattern.	Any changes made in one	alternative do
       carry on	into subsequent	branches within	the same subpattern. For exam-
       ple:

       (a(?i)b|c)

       matches	"ab", "aB", "c", and "C", although when	matching "C" the first
       branch is abandoned before the option setting. This is because the  ef-
       fects  of  option  settings  occur at compile time. There would be some
       weird behavior otherwise.
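
       The scope of an option change inside a subpattern can be sketched as
       follows, using the two patterns above and invented subjects:

```erlang
Opts = [{capture, none}].
%% (?i) applies only to "b"; "a" and "c" stay caseful.
R1 = re:run("aBc", "(a(?i)b)c", Opts).
R2 = re:run("aBC", "(a(?i)b)c", Opts).
R3 = re:run("Abc", "(a(?i)b)c", Opts).
%% The setting carries into the second branch, so "C" matches.
R4 = re:run("C", "(a(?i)b|c)", Opts).
%% R1 =:= match, R2 =:= nomatch, R3 =:= nomatch, R4 =:= match
```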

   Note:
       Other PCRE-specific options can be set by the application when the com-
       piling or matching functions are	called.	Sometimes the pattern can con-
       tain special leading sequences, such as (*CRLF),	to override  what  the
       application has set or what has been defaulted. Details are provided in
       section	Newline	Sequences earlier.

       The (*UTF8) and (*UCP) leading sequences	can be used  to	 set  UTF  and
       Unicode	property modes.	They are equivalent to setting options unicode
       and ucp,	respectively. The (*UTF) sequence is a	generic	 version  that
       can be used with	any of the libraries. However, the application can set
       option never_utf, which locks out the use of the	(*UTF) sequences.

SUBPATTERNS
       Subpatterns are delimited by parentheses	(round brackets), which	can be
       nested. Turning part of a pattern into a	subpattern does	two things:

	 1.:
	   It localizes	a set of alternatives. For example, the	following pat-
	   tern	matches	"cataract", "caterpillar", or "cat":

	 cat(aract|erpillar|)

	   Without the parentheses, it would match "cataract", "erpillar",  or
	   an empty string.

	 2.:
	   It  sets up the subpattern as a capturing subpattern. That is, when
	   the complete	pattern	matches, that portion of  the  subject	string
	   that	 matched  the  subpattern is passed back to the	caller through
	   the return value of run/3.

       Opening parentheses are counted from left to right (starting from 1) to
       obtain  numbers	for  the  capturing  subpatterns.  For example,	if the
       string "the red king" is	matched	against	 the  following	 pattern,  the
       captured	substrings are "red king", "red", and "king", and are numbered
       1, 2, and 3, respectively:

       the ((red|white)	(king|queen))

       It is not always	helpful	that plain parentheses fulfill two  functions.
       Often  a	 grouping  subpattern is required without a capturing require-
       ment. If	an opening parenthesis is followed by a	question  mark	and  a
       colon,  the  subpattern	does  not do any capturing, and	is not counted
       when computing the number of any	subsequent capturing subpatterns.  For
       example,	if the string "the white queen"	is matched against the follow-
       ing pattern, the	captured substrings are	"white queen" and "queen", and
       are numbered 1 and 2:

       the ((?:red|white) (king|queen))

       The maximum number of capturing subpatterns is 65535.
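
       The numbering of capturing and non-capturing groups can be sketched
       with run/3 as follows. The capture option all_but_first is used here
       only to leave out the overall match:

```erlang
Opts = [{capture, all_but_first, list}].
%% Three capturing groups, numbered by their opening parentheses.
R1 = re:run("the red king", "the ((red|white) (king|queen))", Opts).
%% (?:...) groups without capturing, so only two groups are numbered.
R2 = re:run("the white queen", "the ((?:red|white) (king|queen))", Opts).
%% R1 =:= {match, ["red king", "red", "king"]}
%% R2 =:= {match, ["white queen", "queen"]}
```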

       As  a  convenient shorthand, if any option settings are required	at the
       start of	a non-capturing	subpattern, the	option letters can appear  be-
       tween  "?" and ":". Thus, the following two patterns match the same set
       of strings:

       (?i:saturday|sunday)
       (?:(?i)saturday|sunday)

       As alternative branches are tried from left to right, and  options  are
       not reset until the end of the subpattern is reached, an	option setting
       in one branch does affect subsequent branches, so  the  above  patterns
       match both "SUNDAY" and "Saturday".
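
       A brief sketch confirming that both spellings behave the same way on
       the two subjects named above:

```erlang
Opts = [{capture, none}].
Short = "(?i:saturday|sunday)".
Long = "(?:(?i)saturday|sunday)".
%% Collect the result of every pattern/subject combination.
Results = [re:run(S, P, Opts) || P <- [Short, Long],
                                 S <- ["SUNDAY", "Saturday"]].
%% Results =:= [match, match, match, match]
```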

DUPLICATE SUBPATTERN NUMBERS
       Perl  5.10  introduced a	feature	where each alternative in a subpattern
       uses the	same numbers for its capturing parentheses. Such a  subpattern
       starts  with (?|	and is itself a	non-capturing subpattern. For example,
       consider	the following pattern:

       (?|(Sat)ur|(Sun))day

       As the two alternatives are inside a (?|	group, both sets of  capturing
       parentheses  are	 numbered one. Thus, when the pattern matches, you can
       look at captured	substring number one, whichever	 alternative  matched.
       This  construct is useful when you want to capture a part, but not all,
       of one of many alternatives. Inside a (?| group,	parentheses  are  num-
       bered  as  usual,  but the number is reset at the start of each branch.
       The numbers of any capturing parentheses	 that  follow  the  subpattern
       start  after the	highest	number used in any branch. The following exam-
       ple is from the Perl documentation;  the	 numbers  underneath  show  in
       which buffer the	captured content is stored:

       # before	 ---------------branch-reset-----------	after
       / ( a )	(?| x (	y ) z |	(p (q) r) | (t)	u (v) )	( z ) /x
       # 1	      2		2  3	    2	  3	4

       A  back	reference  to a	numbered subpattern uses the most recent value
       that is set for that number by any subpattern.  The  following  pattern
       matches "abcabc"	or "defdef":

       /(?|(abc)|(def))\1/

       In  contrast,  a	subroutine call	to a numbered subpattern always	refers
       to the first one	in the pattern with the	given  number.	The  following
       pattern matches "abcabc"	or "defabc":

       /(?|(abc)|(def))(?1)/

       If  a  condition	 test for a subpattern having matched refers to	a non-
       unique number, the test is true if any of the subpatterns of that  num-
       ber have	matched.
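
       The branch reset rules can be sketched as follows. The Perl-style
       delimiting slashes are dropped, as run/3 takes the bare pattern:

```erlang
Caps = [{capture, all_but_first, list}].
%% Both branches store their capture in group 1.
R1 = re:run("Saturday", "(?|(Sat)ur|(Sun))day", Caps).
R2 = re:run("Sunday", "(?|(Sat)ur|(Sun))day", Caps).
%% \1 uses the most recent value of group 1 ("def" here).
R3 = re:run("defdef", "(?|(abc)|(def))\\1", [{capture, none}]).
%% (?1) always calls the first group numbered 1, that is, (abc).
R4 = re:run("defabc", "(?|(abc)|(def))(?1)", [{capture, none}]).
R5 = re:run("defdef", "(?|(abc)|(def))(?1)", [{capture, none}]).
%% R1 =:= {match, ["Sat"]}, R2 =:= {match, ["Sun"]}
%% R3 =:= match, R4 =:= match, R5 =:= nomatch
```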

       An alternative approach using this "branch reset" feature is to use du-
       plicate named subpatterns, as described in the next section.

NAMED SUBPATTERNS
       Identifying capturing parentheses by number is simple, but  it  can  be
       hard  to	 keep track of the numbers in complicated regular expressions.
       Also, if	an expression is modified, the numbers	can  change.  To  help
       with  this  difficulty,	PCRE  supports the naming of subpatterns. This
       feature was not added to	Perl until release 5.10. Python	had  the  fea-
       ture  earlier,  and PCRE	introduced it at release 4.0, using the	Python
       syntax. PCRE now	supports both the Perl and the Python syntax. Perl al-
       lows identically	numbered subpatterns to	have different names, but PCRE
       does not.

       In PCRE, a subpattern can be named in one of three ways:  (?<name>...)
       or (?'name'...) as in Perl, or (?P<name>...) as in Python.  References
       to capturing parentheses	from other parts of the	pattern, such as  back
       references,  recursion, and conditions, can be made by name and by num-
       ber.

       Names consist of	up to 32 alphanumeric characters and underscores,  but
       must  start with	a non-digit. Named capturing parentheses are still al-
       located numbers as well as names, exactly as  if	 the  names  were  not
       present.	 The  capture  specification  to run/3 can use named values if
       they are	present	in the regular expression.

       By default, a name must be unique within	a pattern, but this constraint
       can  be	relaxed	by setting option dupnames at compile time. (Duplicate
       names are also always permitted for subpatterns with the	 same  number,
       set  up	as  described in the previous section.)	Duplicate names	can be
       useful for patterns where only one instance of  the  named  parentheses
       can match. Suppose that you want	to match the name of a weekday,	either
       as a 3-letter abbreviation or as	the full name, and in both  cases  you
       want  to	 extract the abbreviation. The following pattern (ignoring the
       line breaks) does the job:

       (?<DN>Mon|Fri|Sun)(?:day)?|
       (?<DN>Tue)(?:sday)?|
       (?<DN>Wed)(?:nesday)?|
       (?<DN>Thu)(?:rsday)?|
       (?<DN>Sat)(?:urday)?

       There are five capturing	substrings, but	only one is ever set  after  a
       match.  (An alternative way of solving this problem is to use a "branch
       reset" subpattern, as described in the previous section.)

       For capturing named subpatterns whose names are not unique, the  first
       matching	 occurrence (counted from left to right	in the subject)	is re-
       turned from run/3, if the name is specified in the values part  of  the
       capture	statement. The all_names capturing value matches all the names
       in the same way.
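
       A sketch of capturing by name with run/3. The date pattern and the
       names year and month are invented for illustration; the weekday pat-
       tern is the one shown above, joined into a single string and used
       with option dupnames:

```erlang
%% Unique names, requested as atoms in the capture specification.
Date = "(?<year>\\d{4})-(?<month>\\d{2})".
R1 = re:run("2017-01-11", Date, [{capture, [year, month], list}]).

%% Duplicate names: adjacent string literals are concatenated.
Day = "(?<DN>Mon|Fri|Sun)(?:day)?|"
      "(?<DN>Tue)(?:sday)?|"
      "(?<DN>Wed)(?:nesday)?|"
      "(?<DN>Thu)(?:rsday)?|"
      "(?<DN>Sat)(?:urday)?".
R2 = re:run("Thursday", Day, [dupnames, {capture, ['DN'], list}]).
%% R1 =:= {match, ["2017", "01"]}
%% R2 =:= {match, ["Thu"]}
```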

   Note:
       You cannot use different	names to distinguish between  two  subpatterns
       with  the same number, as PCRE uses only	the numbers when matching. For
       this reason, an error is	given at compile time if different  names  are
       specified to subpatterns	with the same number. However, you can specify
       the same	name to	subpatterns with the same number, even	when  dupnames
       is not set.

REPETITION
       Repetition  is  specified  by  quantifiers, which can follow any	of the
       following items:

	 * A literal data character

	 * The dot metacharacter

	 * The \C escape sequence

	 * The \X escape sequence

	 * The \R escape sequence

	 * An escape such as \d	or \pL that matches a single character

	 * A character class

	 * A back reference (see the next section)

	 * A parenthesized subpattern (including assertions)

	 * A subroutine	call to	a subpattern (recursive	or otherwise)

       The general repetition quantifier specifies a minimum and maximum  num-
       ber  of	permitted matches, by giving the two numbers in	curly brackets
       (braces), separated by a	comma. The numbers must	be <  65536,  and  the
       first  must  be less than or equal to the second. For example, the fol-
       lowing matches "zz", "zzz", or "zzzz":

       z{2,4}

       A closing brace on its own is not a special character.  If  the	second
       number  is  omitted, but	the comma is present, there is no upper	limit.
       If the second number and	the comma are  both  omitted,  the  quantifier
       specifies  an  exact  number  of	 required matches. Thus, the following
       matches at least	three successive vowels, but can match many more:

       [aeiou]{3,}

       The following matches exactly eight digits:

       \d{8}
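
       The three quantifier forms can be sketched as follows. The subjects
       are invented, and the anchors ^ and $ in the last call are added only
       to show that exactly eight digits are required:

```erlang
Opts = [{capture, all, list}].
%% At most four of the five "z" characters are consumed.
R1 = re:run("zzzzz", "z{2,4}", Opts).
%% No upper limit: the whole run of five vowels is matched.
R2 = re:run("queueing", "[aeiou]{3,}", Opts).
%% Exact count.
R3 = re:run("20170111", "^\\d{8}$", [{capture, none}]).
R4 = re:run("2017011", "^\\d{8}$", [{capture, none}]).
%% R1 =:= {match, ["zzzz"]}, R2 =:= {match, ["ueuei"]}
%% R3 =:= match, R4 =:= nomatch
```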

       An opening curly	bracket	that appears in	a position where a  quantifier
       is  not allowed,	or one that does not match the syntax of a quantifier,
       is taken	as a literal character.	For example, {,6} is not a quantifier,
       but a literal string of four characters.

       In  Unicode  mode, quantifiers apply to characters rather than to indi-
       vidual data units. Thus,	for example, \x{100}{2}	 matches  two  charac-
       ters,  each  of	which  is  represented by a 2-byte sequence in a UTF-8
       string. Similarly, \X{3}	matches	three Unicode extended grapheme	 clus-
       ters,  each  of	which  can be many data	units long (and	they can be of
       different lengths).

       The quantifier {0} is permitted,	causing	the expression to behave as if
       the previous item and the quantifier were not present. This can be use-
       ful for subpatterns that	are referenced as subroutines  from  elsewhere
       in  the	pattern	(but see also section  Defining	Subpatterns for	Use by
       Reference Only).	Items other than subpatterns that have a  {0}  quanti-
       fier are	omitted	from the compiled pattern.

       For  convenience, the three most	common quantifiers have	single-charac-
       ter abbreviations:

	 *:
	   Equivalent to {0,}

	 +:
	   Equivalent to {1,}

	 ?:
	   Equivalent to {0,1}

       Infinite	loops can be constructed by following a	 subpattern  that  can
       match  no characters with a quantifier that has no upper	limit, for ex-
       ample:

       (a?)*

       Earlier versions of Perl and PCRE used to give an error at compile time
       for such patterns. However, as there are cases where this can be  use-
       ful, such patterns are now accepted. If any repetition of the  subpat-
       tern matches no characters, though, the loop is forcibly broken.

       By  default,  the quantifiers are "greedy", that	is, they match as much
       as possible (up to the maximum  number  of  permitted  times),  without
       causing	the  remaining	pattern	 to fail. The classic example of where
       this gives problems is in trying	to match comments in C programs. These
       appear  between /* and */. Within the comment, individual * and / char-
       acters can appear. An attempt to	match C	comments by applying the  pat-
       tern

       /\*.*\*/

       to the string

       /* first	comment	*/  not	comment	 /* second comment */

       fails,  as  it matches the entire string	owing to the greediness	of the
       .* item.

       However,	if a quantifier	is followed by a question mark,	it  ceases  to
       be greedy, and instead matches the minimum number of times possible, so
       the following pattern does the right thing with the C comments:

       /\*.*?\*/
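
       The greedy and non-greedy forms can be compared with the following
       sketch, which uses the subject string from the text. The delimiting
       slashes seen in Perl are not needed; the slashes below are part of
       the pattern itself:

```erlang
Subject = "/* first comment */  not comment  /* second comment */".
Opts = [{capture, first, list}].
%% Greedy .* runs to the last "*/" in the subject.
{match, [Greedy]} = re:run(Subject, "/\\*.*\\*/", Opts).
%% Non-greedy .*? stops at the first "*/".
{match, [Lazy]} = re:run(Subject, "/\\*.*?\\*/", Opts).
%% Greedy =:= Subject
%% Lazy =:= "/* first comment */"
```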

       The meaning of the various quantifiers is not otherwise	changed,  only
       the  preferred  number  of matches. Do not confuse this use of question
       mark with its use as a quantifier in its	own right. As it has two uses,
       it can sometimes	appear doubled,	as in

       \d??\d

       which matches one digit by preference, but can match two	if that	is the
       only way	the remaining pattern matches.

       If option ungreedy is set (an option that is not	 available  in	Perl),
       the  quantifiers	 are not greedy	by default, but	individual ones	can be
       made greedy by following	them with a question mark. That	is, it inverts
       the default behavior.

       When  a	parenthesized  subpattern  is quantified with a	minimum	repeat
       count that is > 1 or with a limited maximum, more  memory  is  required
       for  the	 compiled pattern, in proportion to the	size of	the minimum or
       maximum.

       If a pattern starts with	.* or .{0,} and	option dotall  (equivalent  to
       Perl  option  /s)  is set, thus allowing	the dot	to match newlines, the
       pattern is implicitly  anchored,	 because  whatever  follows  is	 tried
       against every character position	in the subject string. So, there is no
       point in	retrying the overall match at any position  after  the	first.
       PCRE normally treats such a pattern as if it was	preceded by \A.

       In  cases  where	 it  is	known that the subject string contains no new-
       lines, it is worth setting dotall to obtain this	optimization,  or  al-
       ternatively using ^ to indicate anchoring explicitly.

       However,	 there	are  some cases	where the optimization cannot be used.
       When .* is inside capturing parentheses that are	the subject of a  back
       reference elsewhere in the pattern, a match at the start	can fail where
       a later one succeeds. Consider, for example:

       (.*)abc\1

       If the subject is "xyz123abc123", the match point is the	fourth charac-
       ter. Therefore, such a pattern is not implicitly	anchored.
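
       A sketch of this case with run/3. The second call adds option
       anchored only to show what would be lost if the pattern were treated
       as anchored:

```erlang
%% Default capture returns {Offset, Length} pairs, offsets from zero.
R1 = re:run("xyz123abc123", "(.*)abc\\1").
%% Forcing the match to start at the first character fails.
R2 = re:run("xyz123abc123", "(.*)abc\\1", [anchored]).
%% R1 =:= {match, [{3, 9}, {3, 3}]}   (match starts at the fourth character)
%% R2 =:= nomatch
```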

       Another	case where implicit anchoring is not applied is	when the lead-
       ing .* is inside	an atomic group. Once again, a match at	the start  can
       fail where a later one succeeds.	Consider the following pattern:

       (?>.*?a)b

       It matches "ab" in the subject "aab". The use of the backtracking con-
       trol verbs (*PRUNE) and (*SKIP) also disables this optimization.

       When a capturing	subpattern is repeated,	the value captured is the sub-
       string that matched the final iteration.	For example, after

       (tweedle[dume]{3}\s*)+

       has  matched  "tweedledum  tweedledee",	the value of the captured sub-
       string is "tweedledee". However,	if there are nested capturing  subpat-
       terns,  the corresponding captured values can have been set in previous
       iterations. For example,	after

       /(a|(b))+/

       matches "aba", the value	of the second captured substring is "b".
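
       Both cases can be sketched with run/3 as follows, using the patterns
       and subjects from the text:

```erlang
%% Only the final iteration of the repeated group is kept.
R1 = re:run("tweedledum tweedledee", "(tweedle[dume]{3}\\s*)+",
            [{capture, all_but_first, list}]).
%% Group 2 keeps "b" from the second iteration, although the final
%% iteration of group 1 matched "a".
R2 = re:run("aba", "(a|(b))+", [{capture, all, list}]).
%% R1 =:= {match, ["tweedledee"]}
%% R2 =:= {match, ["aba", "a", "b"]}
```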

ATOMIC GROUPING	AND POSSESSIVE QUANTIFIERS
       With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
       repetition,  failure  of	what follows normally causes the repeated item
       to be re-evaluated to see if a different	number of repeats  allows  the
       remaining pattern to match. Sometimes it	is useful to prevent this, ei-
       ther to change the nature of the	match, or to cause it to fail  earlier
       than  it	 otherwise  might,  when  the author of	the pattern knows that
       there is	no point in carrying on.

       Consider, for example, the pattern \d+foo when applied to the following
       subject line:

       123456bar

       After matching all six digits and then failing to match "foo", the nor-
       mal action of the matcher is to try again with only five	digits	match-
       ing item	\d+, and then with four, and so	on, before ultimately failing.
       "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
       the  means for specifying that once a subpattern	has matched, it	is not
       to be re-evaluated in this way.

       If atomic grouping is used for the previous example, the	matcher	 gives
       up  immediately	on failing to match "foo" the first time. The notation
       is a kind of special parenthesis, starting with (?> as in the following
       example:

       (?>\d+)foo

       This kind of parenthesis	"locks up" the part of the pattern it contains
       once it has matched, and	a failure further into	the  pattern  is  pre-
       vented  from  backtracking  into	 it.  Backtracking past	it to previous
       items, however, works as	normal.

       An alternative description is that a subpattern of  this	 type  matches
       the  string  of	characters  that an identical standalone pattern would
       match, if anchored at the current point in the subject string.

       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
       such as the above example can be	thought	of as a	maximizing repeat that
       must swallow everything it can. So, while both \d+ and  \d+?  are  pre-
       pared  to  adjust the number of digits they match to make the remaining
       pattern match, (?>\d+) can only match an entire sequence of digits.

       Atomic groups in	general	can contain any	complicated  subpatterns,  and
       can be nested. However, when the	subpattern for an atomic group is just
       a single	repeated item, as in the example above,	 a  simpler  notation,
       called a	"possessive quantifier"	can be used. This consists of an extra
       + character following a quantifier. Using this notation,	 the  previous
       example can be rewritten	as

       \d++foo

       Notice  that  a possessive quantifier can be used with an entire	group,
       for example:

       (abc|xyz){2,3}+

       Possessive quantifiers are always greedy; the  setting  of  option  un-
       greedy is ignored. They are a convenient notation for the simpler forms
       of an atomic group. There is no difference in meaning between a posses-
       sive quantifier and the equivalent atomic group, but there  can  be  a
       performance difference; possessive quantifiers  are  probably  slightly
       faster.
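
       The change in matching behavior can be sketched as follows. The pat-
       tern \d+4 and the subject "1234" are invented for illustration; they
       are chosen so that a match is possible only if the repeat gives back
       its last digit:

```erlang
%% Plain greedy repeat: \d+ backtracks by one digit so that "4" can match.
R1 = re:run("1234", "\\d+4", [{capture, all, list}]).
%% Atomic group: once \d+ has taken all four digits, none are given back.
R2 = re:run("1234", "(?>\\d+)4", [{capture, none}]).
%% Possessive quantifier: same meaning as the atomic group.
R3 = re:run("1234", "\\d++4", [{capture, none}]).
%% R1 =:= {match, ["1234"]}, R2 =:= nomatch, R3 =:= nomatch
```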

       The  possessive	quantifier syntax is an	extension to the Perl 5.8 syn-
       tax. Jeffrey Friedl originated the idea (and the	 name)	in  the	 first
       edition of his book. Mike McCloskey liked it, so	implemented it when he
       built the Sun Java package, and PCRE copied it  from  there.  It	 ulti-
       mately found its	way into Perl at release 5.10.

       PCRE has	an optimization	that automatically "possessifies" certain sim-
       ple pattern constructs. For example, the	sequence  A+B  is  treated  as
       A++B, as there is no point in backtracking into a sequence of  A's  when
       B must follow.

       When a pattern contains an unlimited repeat inside  a  subpattern  that
       can  itself  be	repeated  an  unlimited	number of times, the use of an
       atomic group is the only	way to avoid some  failing  matches  taking  a
       long time. The pattern

       (\D+|<\d+>)*[!?]

       matches	an  unlimited number of	substrings that	either consist of non-
       digits, or digits enclosed in <>, followed by ! or ?. When it  matches,
       it runs quickly.	However, if it is applied to

       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

       it  takes  a  long  time	 before	reporting failure. This	is because the
       string can be divided between the internal \D+ repeat and the  external
       *  repeat  in  many ways, and all must be tried.	(The example uses [!?]
       rather than a single character at the end, as both PCRE and  Perl  have
       an optimization that allows for fast failure when a single character is
       used. They remember the last single character that is  required	for  a
       match,  and fail	early if it is not present in the string.) If the pat-
       tern is changed so that it uses an atomic group,	 like  the  following,
       sequences of non-digits cannot be broken, and failure happens quickly:

       ((?>\D+)|<\d+>)*[!?]
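
       A sketch of the atomic form applied to the subject above, built here
       with lists:duplicate/2 as a string of 52 "a" characters. Only the
       atomic pattern is run; the original pattern is left out because its
       failing match is the slow case being described:

```erlang
Subject = lists:duplicate(52, $a).
Atomic = "((?>\\D+)|<\\d+>)*[!?]".
%% No "!" or "?" follows, and the atomic group cannot be split up.
R1 = re:run(Subject, Atomic, [{capture, none}]).
%% With a trailing "!" the same pattern matches.
R2 = re:run(Subject ++ "!", Atomic, [{capture, none}]).
%% R1 =:= nomatch, R2 =:= match
```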

BACK REFERENCES
       Outside	a  character  class,  a	backslash followed by a	digit >	0 (and
       possibly	further	digits)	is a back reference to a capturing  subpattern
       earlier (that is, to its	left) in the pattern, provided there have been
       that many previous capturing left parentheses.

       However,	if the decimal number following	the backslash is < 10,	it  is
       always taken as a back reference, and causes an error only if there are
       not that	many capturing left parentheses	in the	entire	pattern.  That
       is, the parentheses that are referenced need not be to the  left  of
       the reference for numbers < 10. A "forward back reference" of this type
       can  make sense when a repetition is involved and the subpattern	to the
       right has participated in an earlier iteration.

       It is not possible to have a numerical "forward back  reference"	 to  a
       subpattern  whose number	is 10 or more using this syntax, as a sequence
       such as \50 is interpreted as a character defined in  octal.  For  more
       details	of  the	 handling of digits following a	backslash, see section
       Non-Printing Characters earlier.	There is no such  problem  when	 named
       parentheses  are	 used.	A back reference to any	subpattern is possible
       using named parentheses (see below).

       Another way to avoid the	ambiguity inherent in the use of  digits  fol-
       lowing  a  backslash is to use the \g escape sequence. This escape must
       be followed by an unsigned number or a negative number, optionally  en-
       closed in braces. The following examples	are identical:

       (ring), \1
       (ring), \g1
       (ring), \g{1}

       An  unsigned number specifies an	absolute reference without the ambigu-
       ity that	is present in the older	syntax.	It is also useful when literal
       digits follow the reference. A negative number is a relative reference.
       Consider	the following example:

       (abc(def)ghi)\g{-1}

       The sequence \g{-1} is a	reference to the most recently started captur-
       ing subpattern before \g, that is, it is	equivalent to \2 in this exam-
       ple. Similarly, \g{-2} would be equivalent to \1. The use  of  relative
       references  can	be helpful in long patterns, and also in patterns that
       are created by joining fragments	 containing  references	 within	 them-
       selves.
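
       As a small sketch of the \g forms from Erlang (the  module  and  func-
       tion  names are illustrative only; notice the doubled backslashes re-
       quired in Erlang string literals):

```erlang
-module(g_reference_demo).
-export([relative/0, absolute/0, digit_after/0]).

subject() ->
    "abcdefghidef".

%% \g{-1} refers to the most recently started group, here group 2.
relative() ->
    re:run(subject(), "(abc(def)ghi)\\g{-1}", [{capture, all, list}]).

%% The same pattern written with the absolute reference \2.
absolute() ->
    re:run(subject(), "(abc(def)ghi)\\2", [{capture, all, list}]).

%% Braces separate the reference from the literal digit "1" after it.
digit_after() ->
    re:run("aa1", "(a)\\g{1}1", [{capture, none}]).
```

       Both relative/0 and absolute/0 return the same  match,  with  "def"  as
       the value of the second subpattern.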

       A  back	reference matches whatever matched the capturing subpattern in
       the current subject string, rather than anything matching the  subpat-
       tern itself (section Subpatterns as Subroutines describes a way of doing
       that). So, the following	pattern	matches	"sense	and  sensibility"  and
       "response and responsibility", but not "sense and responsibility":

       (sens|respons)e and \1ibility

       If  caseful matching is in force	at the time of the back	reference, the
       case of letters is relevant. For	example, the  following	 matches  "rah
       rah"  and "RAH RAH", but	not "RAH rah", although	the original capturing
       subpattern is matched caselessly:

       ((?i)rah)\s+\1

       There are many different	ways of	writing	back references	to named  sub-
       patterns.  The  .NET  syntax  \k{name}  and the Perl syntax \k<name> or
       \k'name'	are supported, as is the Python	syntax (?P=name). The  unified
       back  reference	syntax	in Perl	5.10, in which \g can be used for both
       numeric and named references, is	also supported.	The  previous  example
       can be rewritten	in the following ways:

       (?<p1>(?i)rah)\s+\k<p1>
       (?'p1'(?i)rah)\s+\k{p1}
       (?P<p1>(?i)rah)\s+(?P=p1)
       (?<p1>(?i)rah)\s+\g{p1}

       A  subpattern  that is referenced by name can appear in the pattern be-
       fore or after the reference.
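
       The four spellings can be compared from Erlang  with  a  sketch  like
       the  following  (module  and function names are assumptions made for
       the example). Each pattern is run against the same subject:

```erlang
-module(named_backref_demo).
-export([results/1]).

%% The four equivalent patterns from the text, as Erlang strings.
patterns() ->
    ["(?<p1>(?i)rah)\\s+\\k<p1>",
     "(?'p1'(?i)rah)\\s+\\k{p1}",
     "(?P<p1>(?i)rah)\\s+(?P=p1)",
     "(?<p1>(?i)rah)\\s+\\g{p1}"].

%% Returns one result (match | nomatch) per pattern.
results(Subject) ->
    [re:run(Subject, P, [{capture, none}]) || P <- patterns()].
```

       For "RAH RAH" every pattern matches; for "RAH rah" none does, as  the
       back reference itself is matched casefully.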

       There can be more than one back reference to the	same subpattern. If  a
       subpattern has not been used in a particular match, any back references
       to it always fail. For example, the following pattern always fails  if
       it starts to match "a" rather than "bc":

       (a|(bc))\2
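
       A minimal sketch of this behavior (the  module  name  is  illustrative
       only):

```erlang
-module(unset_backref_demo).
-export([run/1]).

%% \2 can only succeed when the second alternative, (bc), was used.
run(Subject) ->
    re:run(Subject, "(a|(bc))\\2", [{capture, none}]).
```

       The subject "bcbc" matches, while "aa" does not, as subpattern  2  is
       unset whenever the first alternative is taken.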

       As  there  can  be  many	capturing parentheses in a pattern, all	digits
       following the backslash are taken as part of a potential	back reference
       number. If the pattern continues	with a digit character,	some delimiter
       must be used to terminate the back reference.  If  option  extended  is
       set,  this  can	be whitespace. Otherwise an empty comment (see section
       Comments) can be	used.

       Recursive Back References

       A back reference	that occurs inside the parentheses to which it	refers
       fails  when  the	subpattern is first used, so, for example, (a\1) never
       matches.	However, such references can be	useful inside repeated subpat-
       terns.  For  example,  the following pattern matches any	number of "a"s
       and also	"aba", "ababbaa", and so on:

       (a|b\1)+

       At each iteration of the	subpattern, the	 back  reference  matches  the
       character  string corresponding to the previous iteration. In order for
       this to work, the pattern must be such that the	first  iteration  does
       not  need  to match the back reference. This can	be done	using alterna-
       tion, as	in the example above, or by a quantifier  with	a  minimum  of
       zero.

       Back  references	of this	type cause the group that they reference to be
       treated as an atomic group. Once	the whole group	has  been  matched,  a
       subsequent  matching  failure cannot cause backtracking into the	middle
       of the group.
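
       The following sketch tries the pattern  above  on  a  few  subjects.
       The  anchors  ^  and $ are an addition made for the example, so that
       only whole-string matches count; the module name is illustrative.

```erlang
-module(recursive_backref_demo).
-export([matches/1]).

%% true if the whole subject is matched by (a|b\1)+
matches(Subject) ->
    re:run(Subject, "^(a|b\\1)+$", [{capture, none}]) =:= match.
```

       With this helper, "aaa", "aba", and "ababbaa" are accepted, but "abab"
       is  not,  as the final "b" would have to be followed by a copy of the
       previous iteration ("ba").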

ASSERTIONS
       An assertion is a test on the characters	 following  or	preceding  the
       current matching	point that does	not consume any	characters. The	simple
       assertions coded	as \b, \B, \A, \G, \Z, \z, ^, and $ are	 described  in
       the previous sections.

       More  complicated  assertions  are  coded as subpatterns. There are two
       kinds: those that look ahead of the current  position  in  the  subject
       string,	and  those  that  look	behind	it. An assertion subpattern is
       matched in the normal way, except that it does not  cause  the  current
       matching	position to be changed.

       Assertion  subpatterns are not capturing	subpatterns. If	such an	asser-
       tion contains capturing subpatterns within it, these  are  counted  for
       the  purposes  of numbering the capturing subpatterns in	the whole pat-
       tern. However, substring	capturing is done  only	 for  positive	asser-
       tions.  (Perl sometimes,	but not	always,	performs capturing in negative
       assertions.)

   Warning:
       If a positive assertion containing one or  more	capturing  subpatterns
       succeeds, but failure to	match later in the pattern causes backtracking
       over this assertion, the	captures within	the assertion are  reset  only
       if no higher numbered captures are already set. This is,	unfortunately,
       a fundamental limitation	of the current implementation, and as PCRE1 is
       now in maintenance-only status, it is unlikely ever to change.

       For  compatibility  with	 Perl,	assertion subpatterns can be repeated.
       Although it makes no sense to assert the same thing  many  times,  the
       side  effect  of  capturing  parentheses can occasionally be useful. In
       practice, there are only	three cases:

	 * If the quantifier is	{0}, the  assertion  is	 never	obeyed	during
	   matching.  However, it can contain internal capturing parenthesized
	   groups that are called from elsewhere through the subroutine	mecha-
	   nism.

	 * If the quantifier is {0,n}, where n > 0, it is treated as if it was
	   {0,1}. At runtime, the remaining pattern match is tried with and
	   without the assertion; the order depends on the greediness of the
	   quantifier.

	 * If the minimum repetition is	> 0, the quantifier  is	 ignored.  The
	   assertion is	obeyed only once when encountered during matching.

       Lookahead Assertions

       Lookahead assertions start with (?= for positive	assertions and (?! for
       negative	assertions. For	example, the following matches a word followed
       by a semicolon, but does	not include the	semicolon in the match:

       \w+(?=;)

       The  following  matches any occurrence of "foo" that is not followed by
       "bar":

       foo(?!bar)

       Notice that the apparently similar pattern

       (?!foo)bar

       does not	find an	occurrence of "bar"  that  is  preceded	 by  something
       other  than  "foo". It finds any	occurrence of "bar" whatsoever,	as the
       assertion (?!foo) is always true	when the  next	three  characters  are
       "bar". A	lookbehind assertion is	needed to achieve the other effect.

       If you want to force a matching failure at some point in	a pattern, the
       most convenient way to do it is with (?!), as an	 empty	string	always
       matches. So, an assertion that requires there not to be  an  empty
       string must always fail.	The backtracking control verb (*FAIL) or  (*F)
       is a synonym for	(?!).
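
       The lookahead examples above can be tried from  Erlang  with  a  small
       helper  such  as  the  following. The module and function names are
       assumptions made for this sketch.

```erlang
-module(lookahead_demo).
-export([first/2]).

%% Returns the text of the first match, or nomatch.
first(Subject, Pattern) ->
    case re:run(Subject, Pattern, [{capture, first, list}]) of
        {match, [Text]} -> Text;
        nomatch -> nomatch
    end.
```

       For example, first("name; value", "\\w+(?=;)") returns "name"  without
       the   semicolon,   first("foobar",  "foo(?!bar)")  returns  nomatch,
       and first("foobar", "(?!foo)bar") returns "bar" even though  "foo"  pre-
       cedes it.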

       Lookbehind Assertions

       Lookbehind  assertions start with (?<= for positive assertions and (?<!
       for negative assertions.	For example, the following finds an occurrence
       of "bar"	that is	not preceded by	"foo":

       (?<!foo)bar

       The contents of a lookbehind assertion are restricted such that all the
       strings it matches must have a fixed length. However, if	there are many
       top-level  alternatives,	 they  do  not all have	to have	the same fixed
       length. Thus, the following is permitted:

       (?<=bullock|donkey)

       The following causes an error at	compile	time:

       (?<!dogs?|cats?)

       Branches	that match different length strings are	permitted only at  the
       top-level of a lookbehind assertion. This is an extension compared with
       Perl, which requires all	branches to match the same length  of  string.
       An assertion such as the	following is not permitted, as its single top-
       level branch can	match two different lengths:

       (?<=ab(c|de))

       However,	it is acceptable to PCRE if rewritten  to  use	two  top-level
       branches:

       (?<=abc|abde)

       Sometimes  the  escape sequence \K (see above) can be used instead of a
       lookbehind assertion to get round the fixed-length restriction.

       The implementation of lookbehind	assertions is, for  each  alternative,
       to  move	 the current position back temporarily by the fixed length and
       then try	to match. If there are insufficient characters before the cur-
       rent position, the assertion fails.

       In  a UTF mode, PCRE does not allow the \C escape (which	matches	a sin-
       gle data	unit even in a UTF mode) to appear in  lookbehind  assertions,
       as  it  makes  it impossible to calculate the length of the lookbehind.
       The \X and \R escapes, which can	match different	numbers	of data	units,
       are not permitted either.

       "Subroutine" calls (see below), such as (?2) or (?&X), are permitted in
       lookbehinds, as long as the subpattern matches a	 fixed-length  string.
       Recursion, however, is not supported.

       Possessive  quantifiers can be used with	lookbehind assertions to spec-
       ify efficient matching of fixed-length strings at the  end  of  subject
       strings.	 Consider  the following simple	pattern	when applied to	a long
       string that does	not match:

       abcd$

       As matching proceeds from left to right,	PCRE looks for each "a"	in the
       subject and then	sees if	what follows matches the remaining pattern. If
       the pattern is specified	as

       ^.*abcd$

       the initial .* matches the entire string	at first. However,  when  this
       fails  (as  there  is no	following "a"),	it backtracks to match all but
       the last	character, then	all but	the last two characters,  and  so  on.
       Once  again  the	search for "a" covers the entire string, from right to
       left, so	we are no better off. However, if the pattern is written as

       ^.*+(?<=abcd)

       there can be no backtracking for	the .*+	item; it can  match  only  the
       entire  string.	The subsequent lookbehind assertion does a single test
       on the last four	characters. If it fails, the match fails  immediately.
       For  long  strings, this	approach makes a significant difference	to the
       processing time.
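
       A short sketch of lookbehind assertions from Erlang (the module  name
       is illustrative; timing is not measured here, only the results):

```erlang
-module(lookbehind_demo).
-export([run/2]).

%% Returns match or nomatch for Pattern applied to Subject.
run(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]).
```

       With this helper,  run("bazbar", "(?<!foo)bar")  returns  match  and
       run("foobar", "(?<!foo)bar")  returns  nomatch.  The  end-of-subject
       test  run("xxabcd", "^.*+(?<=abcd)")  returns  match,   whereas   the
       same pattern applied to "xxabce" returns nomatch.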

       Using Multiple Assertions

       Many assertions (of any sort) can occur in succession. For example, the
       following matches "foo" preceded	by three digits	that are not "999":

       (?<=\d{3})(?<!999)foo

       Notice that each	of the assertions is applied independently at the same
       point in	the subject string. First there	is a check that	 the  previous
       three  characters  are  all  digits, and	then there is a	check that the
       same three characters are not "999". This pattern does not match	 "foo"
       preceded by six characters, the first three of which are digits and the
       last three of which are not "999". For example, it does not match
       "123abcfoo". A pattern to do that is the following:

       (?<=\d{3}...)(?<!999)foo

       This  time  the	first assertion	looks at the preceding six characters,
       checks that the first three are digits, and then	the  second  assertion
       checks that the preceding three characters are not "999".

       Assertions can be nested	in any combination. For	example, the following
       matches an occurrence of	"baz" that is preceded by "bar", which in turn
       is not preceded by "foo":

       (?<=(?<!foo)bar)baz

       The  following  pattern	matches	"foo" preceded by three	digits and any
       three characters	that are not "999":

       (?<=\d{3}(?!999)...)foo
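
       The four patterns in this section can be compared with a sketch  like
       the  following.  The  module  and function names are assumptions made
       for the example.

```erlang
-module(multi_assertion_demo).
-export([digits_not_999/1, six_back/1, nested_behind/1, nested_ahead/1]).

run(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]).

%% Both lookbehinds inspect the same three preceding characters.
digits_not_999(Subject) ->
    run(Subject, "(?<=\\d{3})(?<!999)foo").

%% The first lookbehind now inspects the six preceding characters.
six_back(Subject) ->
    run(Subject, "(?<=\\d{3}...)(?<!999)foo").

%% "baz" preceded by "bar" that is not itself preceded by "foo".
nested_behind(Subject) ->
    run(Subject, "(?<=(?<!foo)bar)baz").

%% Three digits, then any three characters that are not "999".
nested_ahead(Subject) ->
    run(Subject, "(?<=\\d{3}(?!999)...)foo").
```

       For  instance,  digits_not_999("123abcfoo")  returns  nomatch,  while
       six_back("123abcfoo") and nested_ahead("123abcfoo") return match.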

CONDITIONAL SUBPATTERNS
       It is possible to cause the matching process to obey a subpattern  con-
       ditionally  or to choose	between	two alternative	subpatterns, depending
       on the result of	an assertion, or whether a specific capturing  subpat-
       tern has	already	been matched. The following are	the two	possible forms
       of conditional subpattern:

       (?(condition)yes-pattern)
       (?(condition)yes-pattern|no-pattern)

       If the condition	is satisfied, the yes-pattern is used,	otherwise  the
       no-pattern  (if	present).  If  more than two alternatives exist	in the
       subpattern, a compile-time error	occurs.	Each of	the  two  alternatives
       can  itself  contain  nested  subpatterns of any	form, including	condi-
       tional subpatterns; the restriction to two alternatives applies only at
       the  level of the condition. The	following pattern fragment is an exam-
       ple where the alternatives are complex:

       (?(1) (A|B|C) | (D | (?(2)E|F) |	E) )

       There are four kinds of condition: references  to  subpatterns,	refer-
       ences to	recursion, a pseudo-condition called DEFINE, and assertions.

       Checking	for a Used Subpattern By Number

       If  the	text between the parentheses consists of a sequence of digits,
       the condition is	true if	a capturing subpattern of that number has pre-
       viously	matched.  If  more than	one capturing subpattern with the same
       number exists (see section  Duplicate Subpattern	Numbers	earlier),  the
       condition  is true if any of them have matched. An alternative notation
       is to precede the digits	with a plus or minus sign. In this  case,  the
       subpattern  number  is relative rather than absolute. The most recently
       opened parentheses can be referenced by (?(-1), the next	most recent by
       (?(-2),	and  so	 on.  Inside loops, it can also	make sense to refer to
       subsequent groups. The next parentheses to be opened can	be  referenced
       as  (?(+1),  and	 so  on.  (The value zero in any of these forms	is not
       used; it	provokes a compile-time	error.)

       Consider	the following pattern, which contains  non-significant	white-
       space  to  make it more readable	(assume	option extended) and to	divide
       it into three parts for ease of discussion:

       ( \( )?	  [^()]+    (?(1) \) )

       The first part matches an optional opening  parenthesis,	 and  if  that
       character is present, sets it as	the first captured substring. The sec-
       ond part	matches	one or more characters that are	not  parentheses.  The
       third part is a conditional subpattern that tests whether the first set
       of parentheses matched or not. If they did, that	is, if subject started
       with an opening parenthesis, the	condition is true, and so the yes-pat-
       tern is executed	and a closing parenthesis is required.	Otherwise,  as
       no-pattern  is  not  present,  the subpattern matches nothing. That is,
       this pattern matches a sequence of non-parentheses, optionally enclosed
       in parentheses.
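
       A minimal sketch of this pattern from Erlang follows. The anchors  ^
       and  $  are  an addition made for the example, so that unbalanced in-
       put is rejected rather than matched in part; the module name  is  il-
       lustrative.

```erlang
-module(conditional_demo).
-export([matches/1]).

%% Option extended makes the whitespace in the pattern insignificant.
matches(Subject) ->
    Pattern = "^ ( \\( )? [^()]+ (?(1) \\) ) $",
    re:run(Subject, Pattern, [extended, {capture, none}]) =:= match.
```

       Both "abc" and "(abc)" are accepted, while "(abc" and "abc)" are  re-
       jected.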

       If  this	 pattern is embedded in	a larger one, a	relative reference can
       be used:

       ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...

       This makes the fragment independent of the parentheses  in  the	larger
       pattern.

       Checking	for a Used Subpattern By Name

       Perl  uses  the	syntax	(?(<name>)...) or (?('name')...) to test for a
       used subpattern by name.	For compatibility  with	 earlier  versions  of
       PCRE,  which  had this facility before Perl, the	syntax (?(name)...) is
       also recognized.

       Rewriting the previous example to use a named subpattern	gives:

       (?<OPEN>	\( )?	 [^()]+	   (?(<OPEN>) \) )

       If the name used	in a condition of this kind is a duplicate,  the  test
       is  applied to all subpatterns of the same name,	and is true if any one
       of them has matched.

       Checking	for Pattern Recursion

       If the condition	is the string (R), and there is	no subpattern with the
       name  R,	the condition is true if a recursive call to the whole pattern
       or any subpattern has been made.	If digits or a name preceded by	amper-
       sand follow the letter R, for example:

       (?(R3)...) or (?(R&name)...)

       the condition is	true if	the most recent	recursion is into a subpattern
       whose number or name is given. This condition does not check the	entire
       recursion  stack. If the	name used in a condition of this kind is a du-
       plicate,	the test is applied to all subpatterns of the same  name,  and
       is true if any one of them is the most recent recursion.

       At "top-level", all these recursion test	conditions are false. The syn-
       tax for recursive patterns is described below.

       Defining	Subpatterns for	Use By Reference Only

       If the condition	is the string (DEFINE),	and  there  is	no  subpattern
       with  the  name	DEFINE,	 the  condition	is always false. In this case,
       there can be only one alternative  in  the  subpattern.	It  is	always
       skipped	if  control reaches this point in the pattern. The idea	of DE-
       FINE is that it can be used to define "subroutines" that	can be	refer-
       enced  from elsewhere. (The use of subroutines is described below.) For
       example,	a pattern to match an IPv4 address, such as  "192.168.23.245",
       can be written like this	(ignore	whitespace and line breaks):

       (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] |	1\d\d |	[1-9]?\d) ) \b (?&byte)	(\.(?&byte)){3}	\b

       The first part of the pattern is a DEFINE group inside  which  another
       group named "byte" is defined. This matches an individual component of
       an IPv4 address (a number < 256). When matching takes place,
       this part of the	pattern	is skipped, as DEFINE acts like	a false	condi-
       tion. The remaining pattern uses	references to the named	group to match
       the four	dot-separated components of an IPv4 address,  insisting	 on  a
       word boundary at	each end.
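
       The IPv4 pattern can be used from  Erlang  as  sketched  below,  with
       option  extended  so  that the whitespace in the pattern is ignored.
       The module and function names are assumptions made for the example.

```erlang
-module(define_demo).
-export([first_ipv4/1]).

%% The pattern from the text, split over two adjacent string literals.
pattern() ->
    "(?(DEFINE) (?<byte> 2[0-4]\\d | 25[0-5] | 1\\d\\d | [1-9]?\\d) )"
    " \\b (?&byte) (\\.(?&byte)){3} \\b".

%% Returns the first IPv4-like substring of Subject, or nomatch.
first_ipv4(Subject) ->
    case re:run(Subject, pattern(), [extended, {capture, first, list}]) of
        {match, [Address]} -> Address;
        nomatch -> nomatch
    end.
```

       For  example,  first_ipv4("host 10.0.0.1 up")  returns   "10.0.0.1",
       and first_ipv4("256.1.1.1") returns nomatch.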

       Assertion Conditions

       If  the condition is not	in any of the above formats, it	must be	an as-
       sertion.	This can be a positive or negative lookahead or	lookbehind as-
       sertion.	 Consider  the	following  pattern, containing non-significant
       whitespace, and with the	two alternatives on the	second line:

       (?(?=[^a-z]*[a-z])
       \d{2}-[a-z]{3}-\d{2}  |	\d{2}-\d{2}-\d{2} )

       The condition is	a positive lookahead assertion	that  matches  an  op-
       tional  sequence	of non-letters followed	by a letter. That is, it tests
       for the presence	of at least one	letter in the subject. If a letter  is
       found,  the subject is matched against the first	alternative, otherwise
       it is matched against the second. This pattern matches strings  in  one
       of  the	two  forms dd-aaa-dd or	dd-dd-dd, where	aaa are	letters	and dd
       are digits.
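
       As a sketch, the pattern can be applied from Erlang like  this  (the
       module and function names are illustrative only):

```erlang
-module(assertion_condition_demo).
-export([first/1]).

%% The lookahead condition selects which alternative is tried.
pattern() ->
    "(?(?=[^a-z]*[a-z]) \\d{2}-[a-z]{3}-\\d{2} | \\d{2}-\\d{2}-\\d{2} )".

%% Returns the first matching substring, or nomatch.
first(Subject) ->
    case re:run(Subject, pattern(), [extended, {capture, first, list}]) of
        {match, [Text]} -> Text;
        nomatch -> nomatch
    end.
```

       Both first("12-abc-34") and first("12-34-56") return the whole  sub-
       ject.  In  contrast, first("12-34-56 x") returns nomatch: the trail-
       ing letter makes the condition true, so only the  first  alternative
       is tried against the digits.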

COMMENTS
       There are two ways to include comments in patterns that	are  processed
       by PCRE.	In both	cases, the start of the	comment	must not be in a char-
       acter class, or in the middle of	any other sequence of related  charac-
       ters  such  as  (?: or a	subpattern name	or number. The characters that
       make up a comment play no part in the pattern matching.

       The sequence (?#	marks the start	of a comment that continues up to  the
       next closing parenthesis. Nested parentheses are not permitted. If op-
       tion extended is set, an unescaped # character also introduces a
       comment,	 which	in  this  case continues to immediately	after the next
       newline character or character sequence in the pattern.	Which  charac-
       ters are	interpreted as newlines	is controlled by the options passed to
       a compiling function or by a special sequence at	the start of the  pat-
       tern, as	described in section  Newline Conventions earlier.

       Notice  that  the  end of this type of comment is a literal newline se-
       quence in the pattern; escape sequences that happen to represent	a new-
       line do not count. For example, consider	the following pattern when ex-
       tended is set, and the default newline convention is in force:

       abc #comment \n still comment

       On encountering character #, pcre_compile() skips along,	looking	for  a
       newline in the pattern. The sequence \n is still	literal	at this	stage,
       so it does not terminate	the comment. Only a character with code	 value
       0x0a (the default newline) does so.
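
       In Erlang source code the distinction is easy to miss,  as  "\\n"  in
       a  string  literal  is the two-character escape sequence seen by the
       pattern compiler, whereas "\n" is a real line feed. A  minimal  sketch
       (module and function names are assumptions made for the example):

```erlang
-module(extended_comment_demo).
-export([escaped/1, literal/1]).

%% Pattern text: abc #comment \n still comment
%% The characters "\" and "n" do not end the comment, so the effective
%% pattern is just "abc".
escaped(Subject) ->
    re:run(Subject, "abc #comment \\n still comment",
           [extended, {capture, first, list}]).

%% Here the Erlang string contains a real line feed (0x0a), which ends
%% the comment. The effective pattern is "abcdef".
literal(Subject) ->
    re:run(Subject, "abc #comment \n def",
           [extended, {capture, first, list}]).
```

       Thus   escaped("abc")   returns   {match,["abc"]},    literal("abc")
       returns nomatch, and literal("abcdef") returns {match,["abcdef"]}.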

RECURSIVE PATTERNS
       Consider	 the problem of	matching a string in parentheses, allowing for
       unlimited nested	parentheses. Without the use of	 recursion,  the  best
       that  can  be  done  is	to use a pattern that matches up to some fixed
       depth of	nesting. It is not possible to	handle	an  arbitrary  nesting
       depth.

       For some	time, Perl has provided	a facility that	allows regular expres-
       sions to	recurse	(among other things). It does  this  by	 interpolating
       Perl  code  in the expression at	runtime, and the code can refer	to the
       expression itself. A Perl pattern using code interpolation to solve the
       parentheses problem can be created like this:

       $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

       Item  (?p{...})	interpolates  Perl  code  at runtime, and in this case
       refers recursively to the pattern in which it appears.

       Obviously, PCRE cannot support the interpolation	of Perl	code. Instead,
       it supports special syntax for recursion	of the entire pattern, and for
       individual subpattern recursion.	After its  introduction	 in  PCRE  and
       Python,	this  kind  of recursion was later introduced into Perl	at re-
       lease 5.10.

       A special item that consists of (? followed by a	number > 0 and a clos-
       ing parenthesis is a recursive subroutine call of the subpattern	of the
       given number, if	it occurs inside that subpattern. (If  not,  it	 is  a
       non-recursive subroutine	call, which is described in the	next section.)
       The special item	(?R) or	(?0) is	a recursive call of the	entire regular
       expression.

       This  PCRE  pattern  solves the nested parentheses problem (assume that
       option extended is set so that whitespace is ignored):

       \( ( [^()]++ | (?R) )* \)

       First it	matches	an opening parenthesis.	Then it	matches	any number  of
       substrings,  which can either be	a sequence of non-parentheses or a re-
       cursive match of	the pattern itself (that is, a correctly parenthesized
       substring). Finally there is a closing parenthesis. Notice the use of a
       possessive quantifier to	avoid  backtracking  into  sequences  of  non-
       parentheses.

       If this was part	of a larger pattern, you would not want	to recurse the
       entire pattern, so instead you can use:

       ( \( ( [^()]++ |	(?1) )*	\) )

       The pattern is here within parentheses so that the recursion refers  to
       them instead of the whole pattern.

       In  a  larger  pattern,	keeping	 track	of  parenthesis	numbers	can be
       tricky. This is made easier by the use of relative references.  Instead
       of  (?1)	in the pattern above, you can write (?-2) to refer to the sec-
       ond most	recently opened	parentheses preceding the recursion. That  is,
       a negative number counts	capturing parentheses leftwards	from the point
       at which	it is encountered.

       It is also possible to refer to later opened  parentheses,  by  writing
       references  such	 as  (?+2). However, these cannot be recursive,	as the
       reference is not	inside the parentheses that are	referenced.  They  are
       always  non-recursive  subroutine  calls, as described in the next sec-
       tion.

       An alternative approach is to use named parentheses instead.  The  Perl
       syntax  for this	is (?&name). The earlier PCRE syntax (?P>name) is also
       supported. We can rewrite the above example as follows:

       (?<pn> \( ( [^()]++ | (?&pn) )* \) )

       If there	is more	than one subpattern with the same name,	 the  earliest
       one is used.

       This  particular	 example  pattern that we have studied contains	nested
       unlimited repeats, and so the use of a possessive quantifier for	match-
       ing  strings  of	non-parentheses	is important when applying the pattern
       to strings that do not match. For example, when this pattern is applied
       to

       (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

       it gives	"no match" quickly. However, if	a possessive quantifier	is not
       used, the match runs for	a long time, as	there are  so  many  different
       ways  the  +  and  *  repeats can carve up the subject, and all must be
       tested before failure can be reported.

       At the end of a match, the values of capturing  parentheses  are	 those
       from the	outermost level. If the	pattern	above is matched against

       (ab(cd)ef)

       the  value  for	the  inner capturing parentheses (numbered 2) is "ef",
       which is	the last value taken on	at the top-level. If a capturing  sub-
       pattern	is  not	 matched at the	top level, its final captured value is
       unset, even if it was (temporarily) set at a deeper  level  during  the
       matching	process.
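
       The captured values can be inspected from Erlang with a sketch  such
       as  the  following,  which  uses the numbered form of the pattern (the
       one with (?1)). The module and function names are illustrative only.

```erlang
-module(recursion_capture_demo).
-export([run/1]).

%% Group 1 is the whole parenthesized string, group 2 the repeated
%% inner group whose top-level value is reported after the match.
run(Subject) ->
    Pattern = "( \\( ( [^()]++ | (?1) )* \\) )",
    re:run(Subject, Pattern, [extended, {capture, all, list}]).
```

       For the subject "(ab(cd)ef)", the returned list holds the whole match,
       then the value of subpattern 1 (the same string), and then "ef" for
       subpattern 2. A subject with no closing parenthesis, such as "(ab(cd",
       gives nomatch.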

       Do not confuse item (?R)	with condition (R), which tests	for recursion.
       Consider	the following pattern, which matches text in  angle  brackets,
       allowing	 for  arbitrary	 nesting.  Only	 digits	 are allowed in	nested
       brackets	(that is, when recursing), while any characters	are  permitted
       at the outer level.

       < (?: (?(R) \d++	 | [^<>]*+) | (?R)) * >

       Here (?(R) is the start of a conditional	subpattern, with two different
       alternatives for	the recursive and non-recursive	cases.	Item  (?R)  is
       the actual recursive call.
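
       A sketch of this pattern from Erlang follows.  The  space  before  the
       final  * in the pattern above is left out here, which does not change
       its meaning; the module and function names are assumptions.

```erlang
-module(recursion_condition_demo).
-export([first/1]).

%% (?(R) ...) tests for recursion; (?R) performs the recursive call.
pattern() ->
    "< (?: (?(R) \\d++ | [^<>]*+ ) | (?R) )* >".

%% Returns the first matching substring, or nomatch.
first(Subject) ->
    case re:run(Subject, pattern(), [extended, {capture, first, list}]) of
        {match, [Text]} -> Text;
        nomatch -> nomatch
    end.
```

       For "<abc<123>def>" the whole subject is returned. For "<abc<xyz>def>"
       the nested brackets contain non-digits, so the outer  brackets  cannot
       match and only "<xyz>" is found, this time at the outer level.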

       Differences in Recursion	Processing between PCRE	and Perl

       Recursion  processing  in PCRE differs from Perl	in two important ways.
       In PCRE (like Python, but unlike	Perl), a recursive subpattern call  is
       always treated as an atomic group. That is, once	it has matched some of
       the subject string, it is never re-entered, even	if it contains untried
       alternatives  and  there	 is a subsequent matching failure. This	can be
       illustrated by the following pattern, which means  to  match  a	palin-
       dromic string containing	an odd number of characters (for example, "a",
       "aba", "abcba", "abcdcba"):

       ^(.|(.)(?1)\2)$

       The idea	is that	it either matches a single character, or two identical
       characters surrounding a	subpalindrome. In Perl,	this pattern works; in
       PCRE it does not work if the subject is longer than three  characters.
       Consider	the subject string "abcba".

       At  the	top level, the first character is matched, but as it is	not at
       the end of the string, the first	alternative fails, the second alterna-
       tive  is	 taken,	and the	recursion kicks	in. The	recursive call to sub-
       pattern 1 successfully matches the next character ("b").	 (Notice  that
       the beginning and end of	line tests are not part	of the recursion.)

       Back  at	 the top level,	the next character ("c") is compared with what
       subpattern 2 matched, which was "a". This fails.	As  the	 recursion  is
       treated	as  an atomic group, there are now no backtracking points, and
       so the entire match fails. (Perl	can now	re-enter the recursion and try
       the  second  alternative.)  However, if the pattern is written with the
       alternatives in the other order,	things are different:

       ^((.)(?1)\2|.)$

       This time, the recursing	alternative is tried first, and	 continues  to
       recurse	until  it runs out of characters, at which point the recursion
       fails. But this time we have another alternative	to try at  the	higher
       level. That is the significant difference: in the previous case the re-
       maining alternative is at a deeper recursion level, which  PCRE	cannot
       use.

       To  change  the pattern so that it matches all palindromic strings, not
       only those with an odd number of	characters, it is tempting  to	change
       the pattern to this:

       ^((.)(?1)\2|.?)$

       Again,  this  works  in Perl, but not in	PCRE, and for the same reason.
       When a deeper recursion has matched a single character,	it  cannot  be
       entered again to	match an empty string. The solution is to separate the
       two cases, and write out	the odd	and even cases as alternatives at  the
       higher level:

       ^(?:((.)(?1)\2|)|((.)(?3)\4|.))

       If  you want to match typical palindromic phrases, the pattern must ig-
       nore all	non-word characters, which can be done as follows:

       ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$

       If run with option caseless, this pattern matches phrases  such	as  "A
       man, a plan, a canal: Panama!" and it works well	in both	PCRE and Perl.
       Notice the use of the possessive	quantifier *+  to  avoid  backtracking
       into  sequences	of  non-word characters. Without this, PCRE takes much
       longer (10 times	or more) to match typical phrases, and Perl  takes  so
       long that you think it has gone into a loop.

   Note:
       The  palindrome-matching	patterns above work only if the	subject	string
       does not	start with a  palindrome  that	is  shorter  than  the	entire
       string. For example, although "abcba" is	correctly matched, if the sub-
       ject is "ababa",	PCRE finds palindrome "aba" at	the  start,  and  then
       fails  at  top  level,  as  the end of the string does not follow. Once
       again, it cannot	jump back into the recursion  to  try  other  alterna-
       tives, so the entire match fails.

       The second way in which PCRE and Perl differ in their  recursion  pro-
       cessing is in the handling of captured values. In Perl, when a  subpat-
       tern is called recursively or as a subroutine (see the  next  section),
       it has no access	to any values that were	captured  outside  the	recur-
       sion.  In  PCRE	these values can be referenced.	Consider the following
       pattern:

       ^(.)(\1|a(?2))

       In PCRE,	it matches "bab". The first capturing parentheses  match  "b",
       then  in	 the  second  group, when the back reference \1	fails to match
       "b", the	second alternative matches "a",	and then recurses. In the  re-
       cursion,	 \1  does  now	match  "b" and so the whole match succeeds. In
       Perl, the pattern fails to match	because	inside the recursive  call  \1
       cannot access the externally set	value.
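
       The PCRE side of this comparison can be checked with a minimal sketch
       (the module name is illustrative only):

```erlang
-module(recursion_access_demo).
-export([run/1]).

%% Inside the recursive call (?2), \1 still refers to the value that
%% was captured outside the recursion.
run(Subject) ->
    re:run(Subject, "^(.)(\\1|a(?2))", [{capture, none}]).
```

       Applied to "bab", run/1 returns match.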

SUBPATTERNS AS SUBROUTINES
       If  the	syntax for a recursive subpattern call (either by number or by
       name) is	used outside the parentheses to	which it refers,  it  operates
       like  a subroutine in a programming language. The called	subpattern can
       be defined before or after the reference. A numbered reference  can  be
       absolute	or relative, as	in the following examples:

       (...(absolute)...)...(?2)...
       (...(relative)...)...(?-1)...
       (...(?+1)...(relative)...

       An  earlier  example  pointed  out  that	 the following pattern matches
       "sense and sensibility" and  "response  and  responsibility",  but  not
       "sense and responsibility":

       (sens|respons)e and \1ibility

       If instead the following	pattern	is used, it matches "sense and respon-
       sibility" and the other two strings:

       (sens|respons)e and (?1)ibility

       Another example is provided in the discussion of	DEFINE earlier.
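
       The difference between a back reference and a subroutine call can be
       sketched in Erlang as follows. The variable name Subject1 is
       illustrative only.

```erlang
Subject1 = "sense and responsibility",
%% Back reference: \1 must repeat the captured text "sens", so this fails.
nomatch = re:run(Subject1, "(sens|respons)e and \\1ibility", [{capture, none}]),
%% Subroutine call: (?1) runs the subpattern again, so "respons" is accepted.
match = re:run(Subject1, "(sens|respons)e and (?1)ibility", [{capture, none}]).
```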

       All subroutine calls, recursive or not, are always  treated  as	atomic
       groups.	That  is,  once	 a  subroutine has matched some	of the subject
       string, it is never re-entered, even if it  contains  untried  alterna-
       tives  and there	is a subsequent	matching failure. Any capturing	paren-
       theses that are set during the subroutine call revert to	their previous
       values afterwards.

       Processing  options  such as case-independence are fixed	when a subpat-
       tern is defined,	so if it is used as a subroutine, such options	cannot
       be  changed  for	 different  calls.  For	example, the following pattern
       matches "abcabc"	but not	"abcABC", as the change	of  processing	option
       does not	affect the called subpattern:

       (abc)(?i:(?-1))
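
       A minimal sketch of this behavior in Erlang follows. The variable name
       SubPat is illustrative; the last expression is added only for
       contrast and uses a literal "abc" instead of a subroutine call.

```erlang
SubPat = "(abc)(?i:(?-1))",
%% The called subpattern keeps the case-sensitive setting it was defined with.
match = re:run("abcabc", SubPat, [{capture, none}]),
nomatch = re:run("abcABC", SubPat, [{capture, none}]),
%% A literal inside (?i:...) is affected by the option change.
match = re:run("abcABC", "(abc)(?i:abc)", [{capture, none}]).
```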

ONIGURUMA SUBROUTINE SYNTAX
       For compatibility with Oniguruma, the non-Perl syntax \g followed by a
       name or a number enclosed either in angle brackets or single quotes is
       an alternative syntax for referencing a subpattern as a subroutine,
       possibly recursively. Here follow two of the examples used above,
       rewritten using this syntax:

       (?<pn> \( ( (?>[^()]+) |	\g<pn> )* \) )
       (sens|respons)e and \g'1'ibility

       PCRE  supports  an extension to Oniguruma: if a number is preceded by a
       plus or minus sign, it is taken as a relative reference,	for example:

       (abc)(?i:\g<-1>)

       Notice that \g{...} (Perl syntax) and \g<...>  (Oniguruma  syntax)  are
       not synonymous. The former is a back reference; the latter is a subrou-
       tine call.
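
       The distinction between the two \g forms can be sketched in Erlang as
       follows. The variable name Subject2 is illustrative only.

```erlang
Subject2 = "sense and responsibility",
%% \g'1' (Oniguruma syntax) calls group 1 as a subroutine.
match = re:run(Subject2, "(sens|respons)e and \\g'1'ibility", [{capture, none}]),
%% \g{1} (Perl syntax) is a back reference to the text matched by group 1.
nomatch = re:run(Subject2, "(sens|respons)e and \\g{1}ibility", [{capture, none}]).
```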

BACKTRACKING CONTROL
       Perl 5.10 introduced some "Special Backtracking Control	Verbs",	 which
       are still described in the Perl documentation as	"experimental and sub-
       ject to change or removal in a future version of	Perl". It goes	on  to
       say:  "Their usage in production	code should be noted to	avoid problems
       during upgrades." The same remarks apply	to the PCRE features described
       in this section.

       The  new	verbs make use of what was previously invalid syntax: an open-
       ing parenthesis followed	by an asterisk.	They are generally of the form
       (*VERB)	or  (*VERB:NAME). Some can take	either form, possibly behaving
       differently depending on	whether	a name is present. A name is  any  se-
       quence  of  characters that does	not include a closing parenthesis. The
       maximum name length is 255 in the 8-bit library and 65535 in the	16-bit
       and  32-bit  libraries.	If  the	name is	empty, that is,	if the closing
       parenthesis immediately follows the colon, the  effect  is  as  if  the
       colon was not there. Any	number of these	verbs can occur	in a pattern.

       The behavior of these verbs in repeated groups, assertions, and in sub-
       patterns	called as subroutines (whether	or  not	 recursively)  is  de-
       scribed below.

       Optimizations That Affect Backtracking Verbs

       PCRE contains some optimizations that are used to speed up matching by
       running some checks at the start of each match attempt. For example,
       it may know the minimum length of a matching subject, or that a
       particular character must be present. When one of these optimizations
       bypasses the running of a match, any included backtracking verbs are
       not processed. You can suppress the start-of-match optimizations by
       setting option no_start_optimize when calling compile/2 or run/3, or
       by starting the pattern with (*NO_START_OPT).

       Experiments with	Perl suggest that it too  has  similar	optimizations,
       sometimes leading to anomalous results.

       Verbs That Act Immediately

       The  following verbs act	as soon	as they	are encountered. They must not
       be followed by a	name.

       (*ACCEPT)

       This verb causes	the match to end successfully, skipping	the  remainder
       of  the pattern.	However, when it is inside a subpattern	that is	called
       as a subroutine,	only that subpattern is	ended  successfully.  Matching
       then continues at the outer level. If (*ACCEPT) is triggered in a posi-
       tive assertion, the assertion succeeds; in a  negative  assertion,  the
       assertion fails.

       If  (*ACCEPT)  is inside	capturing parentheses, the data	so far is cap-
       tured. For example, the following matches "AB", "AAD", or  "ACD".  When
       it matches "AB",	"B" is captured	by the outer parentheses.

       A((?:A|B(*ACCEPT)|C)D)
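
       As a sketch, the (*ACCEPT) pattern above can be exercised from Erlang
       like this. The variable name AcceptPat and the subject "AC" are
       illustrative only.

```erlang
AcceptPat = "A((?:A|B(*ACCEPT)|C)D)",
%% (*ACCEPT) ends the match right after "B"; group 1 holds the data so far.
{match, ["AB", "B"]} = re:run("AB", AcceptPat, [{capture, all, list}]),
%% The other branches must still be followed by "D".
{match, ["AAD", "AD"]} = re:run("AAD", AcceptPat, [{capture, all, list}]),
nomatch = re:run("AC", AcceptPat, [{capture, none}]).
```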

       The  following  verb causes a matching failure, forcing backtracking to
       occur. It is equivalent to (?!) but easier to read.

       (*FAIL) or (*F)

       The Perl	documentation states that it is	probably useful	only when com-
       bined  with  (?{})  or  (??{}).	Those  are  Perl features that are not
       present in PCRE.

       The nearest equivalent in PCRE is the callout feature, as in the
       following pattern:

       a+(?C)(*FAIL)

       A match with the string "aaaa" always fails, but the callout is taken
       before each backtrack occurs (in this example, 10 times).

       Recording Which Path Was	Taken

       The main purpose of the (*MARK) verb, described below, is to track how
       a match was arrived at, although it also has a secondary use in
       conjunction with advancing the match starting point (see (*SKIP)
       below).

   Note:
       In  Erlang,  there  is no interface to retrieve a mark with run/2,3, so
       only the	secondary purpose is relevant to the Erlang programmer.

       The rest	of this	section	is  therefore  deliberately  not  adapted  for
       reading	by  the	Erlang programmer, but the examples can	help in	under-
       standing	NAMES as they can be used by (*SKIP).

       (*MARK:NAME) or (*:NAME)

       A name is always	required with this verb. There	can  be	 as  many  in-
       stances	of  (*MARK)  as	 you like in a pattern,	and their names	do not
       have to be unique.

       When a match succeeds, the name of the last  encountered	 (*MARK:NAME),
       (*PRUNE:NAME),  or  (*THEN:NAME)	on the matching	path is	passed back to
       the caller as described in section "Extra data for pcre_exec()" in  the
       pcreapi documentation. In the following example of pcretest output, the
       /K modifier requests the	retrieval and outputting of (*MARK) data:

	 re> /X(*MARK:A)Y|X(*MARK:B)Z/K
       data> XY
	0: XY
       MK: A
       XZ
	0: XZ
       MK: B

       The (*MARK) name	is tagged with "MK:" in	this output, and in this exam-
       ple  it indicates which of the two alternatives matched.	This is	a more
       efficient way of	obtaining this information than	putting	each  alterna-
       tive in its own capturing parentheses.

       If  a  verb  with a name	is encountered in a positive assertion that is
       true, the name is recorded and passed back if it	is  the	 last  encoun-
       tered.  This does not occur for negative	assertions or failing positive
       assertions.

       After a partial match or	a failed match,	the last encountered  name  in
       the entire match	process	is returned, for example:

	 re> /X(*MARK:A)Y|X(*MARK:B)Z/K
       data> XP
       No match, mark =	B

       Notice that in this unanchored example, the mark is retained from the
       match attempt that started at letter "X" in the subject. Subsequent
       match attempts starting at "P" and then with an empty string do not
       get as far as the (*MARK) item, but nevertheless do not reset it.

       Verbs That Act after Backtracking

       The following verbs do nothing when they	are encountered. Matching con-
       tinues  with what follows, but if there is no subsequent	match, causing
       a backtrack to the verb,	a failure is  forced.  That  is,  backtracking
       cannot  pass  to	the left of the	verb. However, when one	of these verbs
       appears inside an atomic	group or an assertion that is true, its	effect
       is confined to that group, as once the group has	been matched, there is
       never any backtracking into it. In  this	 situation,  backtracking  can
       "jump  back"  to	the left of the	entire atomic group or assertion. (Re-
       member also, as stated above, that this localization  also  applies  in
       subroutine calls.)

       These  verbs  differ  in	exactly	what kind of failure occurs when back-
       tracking	reaches	them. The behavior described below is what occurs when
       the  verb  is  not in a subroutine or an	assertion. Subsequent sections
       cover these special cases.

       The following verb, which must not be followed by a  name,  causes  the
       whole  match to fail outright if	there is a later matching failure that
       causes backtracking to reach it.	Even if	the pattern is unanchored,  no
       further	attempts  to find a match by advancing the starting point take
       place.

       (*COMMIT)

       If (*COMMIT) is the only backtracking verb that is encountered, once
       it has been passed, run/2,3 is committed to finding a match at the
       current starting point, or not at all, for example:

       a+(*COMMIT)b

       This matches "xxaab" but	not "aacaab". It can be	thought	of as  a  kind
       of dynamic anchor, or "I've started, so I must finish". The name	of the
       most recently passed (*MARK) in the path	is passed back when  (*COMMIT)
       forces a	match failure.
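
       The two subjects mentioned above can be checked with a sketch such as
       the following.

```erlang
%% No attempt reaches (*COMMIT) until the one starting at the first "a",
%% and that attempt succeeds.
{match, ["aab"]} = re:run("xxaab", "a+(*COMMIT)b", [{capture, all, list}]),
%% The attempt at offset 0 passes (*COMMIT) and then fails on "c", so the
%% later "aab" is never tried.
nomatch = re:run("aacaab", "a+(*COMMIT)b", [{capture, none}]).
```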

       If more than one	backtracking verb exists in a pattern, a different one
       that follows (*COMMIT) can be triggered first, so merely	passing	(*COM-
       MIT)  during  a match does not always guarantee that a match must be at
       this starting point.

       Notice that (*COMMIT) at	the start of a pattern is not the same	as  an
       anchor, unless the PCRE start-of-match optimizations are	turned off, as
       shown in	the following example:

       1> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list}]).
       {match,["abc"]}
       2> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list},no_start_optimize]).
       nomatch

       For this	pattern, PCRE knows that any match must	start with "a",	so the
       optimization skips along	the subject to "a" before applying the pattern
       to the first set	of data. The match attempt then	succeeds. In the  sec-
       ond  call  the  no_start_optimize  disables the optimization that skips
       along to	the first character. The pattern is now	 applied  starting  at
       "x",  and  so the (*COMMIT) causes the match to fail without trying any
       other starting points.
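
       The same effect as option no_start_optimize can be obtained from
       within the pattern. The following is a minimal sketch that uses the
       (*NO_START_OPT) item mentioned earlier in this section.

```erlang
%% (*NO_START_OPT) disables the start-of-match optimizations, so the first
%% attempt starts at "x" and (*COMMIT) makes the whole match fail.
nomatch = re:run("xyzabc", "(*NO_START_OPT)(*COMMIT)abc", [{capture, none}]).
```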

       The following verb causes the match to fail at the current starting po-
       sition  in the subject if there is a later matching failure that	causes
       backtracking to reach it:

       (*PRUNE)	or (*PRUNE:NAME)

       If the pattern is unanchored, the normal	 "bumpalong"  advance  to  the
       next starting character then occurs. Backtracking can occur as usual to
       the left	of (*PRUNE), before it is reached, or  when  matching  to  the
       right  of (*PRUNE), but if there	is no match to the right, backtracking
       cannot cross (*PRUNE). In simple	cases, the use of (*PRUNE) is just  an
       alternative  to an atomic group or possessive quantifier, but there are
       some uses of (*PRUNE) that cannot be expressed in any other way.	In  an
       anchored	pattern, (*PRUNE) has the same effect as (*COMMIT).
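
       A sketch of the difference between (*PRUNE) and (*COMMIT) in an
       unanchored pattern, reusing the subject from the (*COMMIT) example:

```erlang
%% (*PRUNE) abandons only the current starting point; the bumpalong advance
%% continues and "aab" is found at offset 3.
{match, ["aab"]} = re:run("aacaab", "a+(*PRUNE)b", [{capture, all, list}]),
%% (*COMMIT) abandons the whole match after the first failed attempt.
nomatch = re:run("aacaab", "a+(*COMMIT)b", [{capture, none}]).
```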

       The behavior of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
       It is like (*MARK:NAME) in that the name is remembered for passing
       back to the caller. However, (*SKIP:NAME) searches only for names set
       with (*MARK).

   Note:
       The fact	that (*PRUNE:NAME) remembers the name is useless to the	Erlang
       programmer, as names cannot be retrieved.

       The  following  verb,  when specified without a name, is	like (*PRUNE),
       except that if the pattern is unanchored, the  "bumpalong"  advance  is
       not  to	the  next  character, but to the position in the subject where
       (*SKIP) was encountered.

       (*SKIP)

       (*SKIP) signifies that whatever text was	matched	leading	up to it  can-
       not be part of a	successful match. Consider:

       a+(*SKIP)b

       If  the	subject	 is  "aaaac...",  after	 the first match attempt fails
       (starting at the	first character	in the	string),  the  starting	 point
       skips  on  to  start  the next attempt at "c". Notice that a possessive
       quantifier does not have	the same effect	as this	example;  although  it
       would  suppress backtracking during the first match attempt, the	second
       attempt would start at the second character instead of skipping	on  to
       "c".
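
       The effect of (*SKIP) on the next starting point becomes visible when
       the pattern has a second alternative that could match inside the
       skipped text. The patterns and the subject below are illustrative
       only and are not taken from the PCRE documentation.

```erlang
%% No verb: the attempt at offset 0 fails, the attempt at offset 1 finds "ab".
{match, ["ab"]} = re:run("aab", "aac|ab", [{capture, all, list}]),
%% (*PRUNE): the attempt at offset 0 is abandoned, the next one starts at
%% offset 1 and still finds "ab".
{match, ["ab"]} = re:run("aab", "aa(*PRUNE)c|ab", [{capture, all, list}]),
%% (*SKIP): the next attempt starts at offset 2, where (*SKIP) was reached,
%% so the "ab" at offset 1 is never considered.
nomatch = re:run("aab", "aa(*SKIP)c|ab", [{capture, none}]).
```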

       When (*SKIP) has	an associated name, its	behavior is modified:

       (*SKIP:NAME)

       When  this  is  triggered,  the	previous  path	through	the pattern is
       searched	for the	most recent (*MARK) that has the same name. If one  is
       found,  the  "bumpalong"	advance	is to the subject position that	corre-
       sponds to that (*MARK) instead of to where (*SKIP) was encountered.  If
       no (*MARK) with a matching name is found, (*SKIP) is ignored.

       Notice  that  (*SKIP:NAME) searches only	for names set by (*MARK:NAME).
       It ignores names	that are set by	(*PRUNE:NAME) or (*THEN:NAME).
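
       A sketch of how a named (*SKIP) changes the restart position follows.
       The mark name m, the patterns, and the subject are illustrative only.

```erlang
%% (*SKIP:m) restarts at the position recorded by (*MARK:m), that is,
%% offset 1, where the alternative "ab" then matches.
{match, ["ab"]} = re:run("aab", "a(*MARK:m)a(*SKIP:m)c|ab",
                         [{capture, all, list}]),
%% A plain (*SKIP) at the same place restarts at offset 2 instead.
nomatch = re:run("aab", "aa(*SKIP)c|ab", [{capture, none}]).
```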

       The following verb causes a skip	to the next innermost alternative when
       backtracking  reaches  it. That is, it cancels any further backtracking
       within the current alternative.

       (*THEN) or (*THEN:NAME)

       The verb	name comes from	the observation	that it	can be used for	a pat-
       tern-based if-then-else block:

       ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...

       If  the COND1 pattern matches, FOO is tried (and	possibly further items
       after the end of	the group if FOO succeeds). On	failure,  the  matcher
       skips  to  the second alternative and tries COND2, without backtracking
       into COND1. If that succeeds and	BAR fails, COND3 is tried. If BAZ then
       fails, there are	no more	alternatives, so there is a backtrack to what-
       ever came before	the entire group. If (*THEN) is	not inside an alterna-
       tion, it	acts like (*PRUNE).
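
       That (*THEN) cancels further backtracking within the current
       alternative can be sketched as follows. The patterns and the subject
       are illustrative only.

```erlang
%% Without (*THEN), a+ gives back one "a" so that "ab" can match.
{match, ["aaab"]} = re:run("aaab", "^(?:a+ab|b)", [{capture, all, list}]),
%% With (*THEN), the failure of "ab" skips straight to the alternative "b",
%% which cannot match at offset 0.
nomatch = re:run("aaab", "^(?:a+(*THEN)ab|b)", [{capture, none}]).
```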

       The behavior of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
       It is like (*MARK:NAME) in that the name is remembered for passing
       back to the caller. However, (*SKIP:NAME) searches only for names set
       with (*MARK).

   Note:
       The fact	that (*THEN:NAME) remembers the	name is	useless	to the	Erlang
       programmer, as names cannot be retrieved.

       A  subpattern that does not contain a | character is just a part	of the
       enclosing alternative; it is not	a nested alternation with only one al-
       ternative.  The	effect	of (*THEN) extends beyond such a subpattern to
       the enclosing alternative. Consider the following pattern, where	A,  B,
       and  so	on,  are  complex  pattern fragments that do not contain any |
       characters at this level:

       A (B(*THEN)C) | D

       If A and	B are matched, but there is a failure in C, matching does  not
       backtrack into A; instead it moves to the next alternative, that	is, D.
       However,	if the subpattern containing (*THEN) is	given an  alternative,
       it behaves differently:

       A (B(*THEN)C | (*FAIL)) | D

       The  effect of (*THEN) is now confined to the inner subpattern. After a
       failure in C, matching moves to (*FAIL),	which causes the whole subpat-
       tern  to	 fail, as there	are no more alternatives to try. In this case,
       matching	does now backtrack into	A.
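
       The two forms above can be made concrete with a small sketch. Here A
       is taken to be a*?, B is a, C is b, and D is x; these choices and the
       subject are illustrative only.

```erlang
%% The inner group has no alternative of its own: after "b" fails, (*THEN)
%% skips to "x" without backtracking into a*?, so there is no match.
nomatch = re:run("aab", "^(?:a*?(a(*THEN)b)|x)", [{capture, none}]),
%% With |(*FAIL) the effect of (*THEN) is confined to the inner group, which
%% fails normally; a*? is then extended by one "a" and the match succeeds.
{match, ["aab"]} = re:run("aab", "^(?:a*?(a(*THEN)b|(*FAIL))|x)",
                          [{capture, first, list}]).
```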

       Notice that a conditional subpattern is not considered  as  having  two
       alternatives,  as  only one is ever used. That is, the |	character in a
       conditional subpattern has a different  meaning.	 Ignoring  whitespace,
       consider:

       ^.*? (?(?=a) a |	b(*THEN)c )

       If  the	subject	 is  "ba",  this pattern does not match. As .*?	is un-
       greedy, it initially matches zero characters. The condition (?=a)  then
       fails,  the  character  "b"  is matched,	but "c"	is not.	At this	point,
       matching	does not backtrack to .*? as can perhaps be expected from  the
       presence	 of the	| character. The conditional subpattern	is part	of the
       single alternative that comprises the whole pattern, and	so  the	 match
       fails.  (If  there  was a backtrack into	.*?, allowing it to match "b",
       the match would succeed.)
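
       A sketch of this case in Erlang follows. Option extended is used so
       that the whitespace in the pattern is ignored; the second pattern,
       without (*THEN), is added only for contrast.

```erlang
%% (*THEN) prevents the backtrack into .*? and the match fails.
nomatch = re:run("ba", "^.*? (?(?=a) a | b(*THEN)c )",
                 [extended, {capture, none}]),
%% Without (*THEN), .*? is extended to "b", the condition (?=a) then holds,
%% and "a" is matched.
{match, ["ba"]} = re:run("ba", "^.*? (?(?=a) a | bc )",
                         [extended, {capture, all, list}]).
```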

       The verbs described above provide four different	"strengths" of control
       when subsequent matching	fails:

	 * (*THEN)  is the weakest, carrying on	the match at the next alterna-
	   tive.

	 * (*PRUNE) comes next,	fails the match	at the current starting	 posi-
	   tion,  but  allows  an  advance to the next character (for an unan-
	   chored pattern).

	 * (*SKIP) is similar, except that the advance can be  more  than  one
	   character.

	 * (*COMMIT) is	the strongest, causing the entire match	to fail.

       More than One Backtracking Verb

       If  more	 than  one  backtracking verb is present in a pattern, the one
       that is backtracked onto	first acts. For	example, consider the  follow-
       ing pattern, where A, B,	and so on, are complex pattern fragments:

       (A(*COMMIT)B(*THEN)C|ABD)

       If A matches but B fails, the backtrack to (*COMMIT) causes the entire
       match to fail. However, if A and B match, but C fails, the backtrack
       to (*THEN) causes the next alternative (ABD) to be tried. This
       behavior is consistent, but is not always the same as in Perl. It
       means that if two or more backtracking verbs appear in succession,
       all but the last of them have no effect. Consider the following
       example:

       ...(*COMMIT)(*PRUNE)...

       If there is a matching failure to the right, backtracking onto
       (*PRUNE) causes it to be triggered, and its action is taken. There
       can never be a backtrack onto (*COMMIT).
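
       The rule that the verb backtracked onto first is the one that acts can
       be sketched as follows. Single letters stand in for the fragments A,
       B, C, and D; the variable name VerbPat and the subjects are
       illustrative only.

```erlang
VerbPat = "(a(*COMMIT)b(*THEN)c|abd)",
%% "c" fails and (*THEN) is reached first, so the alternative "abd" is tried.
{match, ["abd"]} = re:run("abd", VerbPat, [{capture, first, list}]),
%% "b" fails and (*COMMIT) is reached, so the later "abd" is never tried.
nomatch = re:run("axabd", VerbPat, [{capture, none}]),
%% In (*COMMIT)(*PRUNE), backtracking always reaches (*PRUNE) first, so later
%% starting points are still tried.
{match, ["aab"]} = re:run("aacaab", "a+(*COMMIT)(*PRUNE)b",
                          [{capture, all, list}]).
```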

       Backtracking Verbs in Repeated Groups

       PCRE differs from Perl in its handling of  backtracking	verbs  in  re-
       peated groups. For example, consider:

       /(a(*COMMIT)b)+ac/

       If  the	subject	 is  "abac",  Perl matches, but	PCRE fails because the
       (*COMMIT) in the	second repeat of the group acts.
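
       The PCRE result can be checked with the following sketch. The second
       pattern, without the verb, is added only for contrast.

```erlang
%% The (*COMMIT) in the second repeat of the group acts, so the match fails.
nomatch = re:run("abac", "(a(*COMMIT)b)+ac", [{capture, none}]),
%% Without the verb, the group repeats once and "ac" follows.
{match, ["abac"]} = re:run("abac", "(ab)+ac", [{capture, first, list}]).
```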

       Backtracking Verbs in Assertions

       (*FAIL) in an assertion has its normal effect: it forces	 an  immediate
       backtrack.

       (*ACCEPT) in a positive assertion causes	the assertion to succeed with-
       out any further processing. In a	negative assertion,  (*ACCEPT)	causes
       the assertion to	fail without any further processing.

       The other backtracking verbs are not treated specially if they appear
       in a positive assertion. In particular, (*THEN) skips to the next
       alternative in the innermost enclosing group that has alternations,
       regardless of whether this is within the assertion.

       Negative	assertions are,	however, different, to ensure that changing  a
       positive	 assertion into	a negative assertion changes its result. Back-
       tracking	into (*COMMIT),	(*SKIP), or (*PRUNE) causes a negative	asser-
       tion  to	 be true, without considering any further alternative branches
       in the assertion. Backtracking into (*THEN) causes it to	 skip  to  the
       next  enclosing alternative within the assertion	(the normal behavior),
       but if the assertion does not have such an alternative, (*THEN) behaves
       like (*PRUNE).

       Backtracking Verbs in Subroutines

       These behaviors occur regardless of whether the subpattern is called
       recursively. The treatment of subroutines in Perl is different in
       some cases.

	 * (*FAIL)  in	a subpattern called as a subroutine has	its normal ef-
	   fect: it forces an immediate	backtrack.

	 * (*ACCEPT) in	a subpattern called as a subroutine causes the subrou-
	   tine	match to succeed without any further processing. Matching then
	   continues after the subroutine call.

	 * (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as  a  sub-
	   routine cause the subroutine	match to fail.

	 * (*THEN)  skips  to  the next	alternative in the innermost enclosing
	   group within	the subpattern that has	alternatives. If there	is  no
	   such	 group	within	the  subpattern, (*THEN) causes	the subroutine
	   match to fail.

Ericsson AB			  stdlib 3.8				 re(3)
