Skip site navigation (1)Skip section navigation (2)

FreeBSD Manual Pages


home | help
re(3)			   Erlang Module Definition			 re(3)

       re - Perl like regular expressions for Erlang

       This  module contains regular expression	matching functions for strings
       and binaries.

       The regular expression syntax and semantics resemble that of Perl.

       The library's matching algorithms  are  currently  based	 on  the  PCRE
       library,	 but  not all of the PCRE library is interfaced	and some parts
       of the library go beyond	what PCRE offers. The  sections	 of  the  PCRE
       documentation which are relevant	to this	module are included here.

       The  Erlang literal syntax for strings uses the "\" (backslash) charac-
       ter as an escape	code.  You  need  to  escape  backslashes  in  literal
       strings,	 both  in your code and	in the shell, with an additional back-
       slash, i.e.: "\\".

       mp() = {re_pattern, term(), term(), term(), term()}

	      Opaque datatype containing a compiled  regular  expression.  The
	      mp()  is guaranteed to be	a tuple() having the atom 're_pattern'
	      as its first element, to allow for matching in guards. The arity
	      of  the tuple() or the content of	the other fields may change in
	      future releases.

       nl_spec() = cr |	crlf | lf | anycrlf | any

       compile_option()	= unicode
			| anchored
			| caseless
			| dollar_endonly
			| dotall
			| extended
			| firstline
			| multiline
			| no_auto_capture
			| dupnames
			| ungreedy
			| {newline, nl_spec()}
			| bsr_anycrlf
			| bsr_unicode
			| no_start_optimize
			| ucp
			| never_utf

       compile(Regexp) -> {ok, MP} | {error, ErrSpec}


		 Regexp	= iodata()
		 MP = mp()
		 ErrSpec =
		     {ErrString	:: string(), Position :: integer() >= 0}

	      The same as compile(Regexp,[])

       compile(Regexp, Options)	-> {ok,	MP} | {error, ErrSpec}


		 Regexp	= iodata() | unicode:charlist()
		 Options = [Option]
		 Option	= compile_option()
		 MP = mp()
		 ErrSpec =
		     {ErrString	:: string(), Position :: integer() >= 0}

	      This function compiles a	regular	 expression  with  the	syntax
	      described	 below	into  an internal format to be used later as a
	      parameter	to the run/2,3 functions.

	      Compiling	the regular expression before matching	is  useful  if
	      the  same	 expression is to be used in matching against multiple
	      subjects during the program's lifetime. Compiling	once and  exe-
	      cuting many times	is far more efficient than compiling each time
	      one wants	to match.

	      When the unicode option is given,	the regular expression	should
	      be  given	 as a valid Unicode charlist(),	otherwise as any valid

	      The options have the following meanings:

		  The regular expression is given as a Unicode charlist()  and
		  the resulting	regular	expression code	is to be run against a
		  valid	Unicode	charlist()  subject.  Also  consider  the  ucp
		  option when using Unicode characters.

		  The  pattern is forced to be "anchored", that	is, it is con-
		  strained to match only at the	first matching	point  in  the
		  string  that	is being searched (the "subject	string"). This
		  effect can also be achieved by appropriate constructs	in the
		  pattern itself.

		  Letters  in the pattern match	both upper and lower case let-
		  ters.	It is equivalent to Perl's /i option, and  it  can  be
		  changed within a pattern by a	(?i) option setting. Uppercase
		  and lowercase	letters	are defined as in the ISO-8859-1 char-
		  acter	set.

		  A  dollar  metacharacter  in the pattern matches only	at the
		  end of the subject string. Without  this  option,  a	dollar
		  also	matches	immediately before a newline at	the end	of the
		  string  (but	not  before  any  other	 newlines).  The  dol-
		  lar_endonly  option  is ignored if multiline is given. There
		  is no	equivalent option in Perl, and no way to set it	within
		  a pattern.

		  A dot	in the pattern matches all characters, including those
		  that indicate	newline. Without it, a dot does	not match when
		  the current position is at a newline.	This option is equiva-
		  lent to Perl's /s option, and	it can	be  changed  within  a
		  pattern  by  a (?s) option setting. A	negative class such as
		  [^a] always matches newline characters, independent of  this
		  option's setting.

		  Whitespace data characters in	the pattern are	ignored	except
		  when escaped or inside a character  class.  Whitespace  does
		  not  include the VT character	(ASCII 11). In addition, char-
		  acters between an unescaped #	outside	a character class  and
		  the  next  newline,  inclusive,  are	also  ignored. This is
		  equivalent to	Perl's /x option, and it can be	changed	within
		  a  pattern  by  a  (?x) option setting. This option makes it
		  possible to include comments	inside	complicated  patterns.
		  Note,	 however,  that	 this applies only to data characters.
		  Whitespace characters	may never appear within	special	 char-
		  acter	 sequences  in	a  pattern,  for  example  within  the
		  sequence (?( which introduces	a conditional subpattern.

		  An unanchored	pattern	is required to match before or at  the
		  first	newline	in the subject string, though the matched text
		  may continue over the	newline.

		  By default, PCRE treats the subject string as	consisting  of
		  a  single  line  of characters (even if it actually contains
		  newlines). The "start	of  line"  metacharacter  (^)  matches
		  only	at  the	 start	of the string, while the "end of line"
		  metacharacter	($) matches only at the	end of the string,  or
		  before  a  terminating  newline  (unless  dollar_endonly  is
		  given). This is the same as Perl.

		  When multiline is given, the "start of  line"	 and  "end  of
		  line"	 constructs match immediately following	or immediately
		  before internal newlines  in	the  subject  string,  respec-
		  tively, as well as at	the very start and end.	This is	equiv-
		  alent	to Perl's /m option, and it can	be  changed  within  a
		  pattern  by  a (?m) option setting. If there are no newlines
		  in a subject string, or no occurrences of ^ or $ in  a  pat-
		  tern,	setting	multiline has no effect.

		  Disables  the	 use  of numbered capturing parentheses	in the
		  pattern. Any opening parenthesis that	is not followed	 by  ?
		  behaves  as  if it were followed by ?: but named parentheses
		  can still be used for	capturing (and they acquire numbers in
		  the  usual  way).  There  is no equivalent of	this option in

		  Names	used to	identify capturing  subpatterns	 need  not  be
		  unique.  This	 can  be  helpful for certain types of pattern
		  when it is known that	only one instance of the named subpat-
		  tern	can  ever  be matched. There are more details of named
		  subpatterns below

		  This option inverts the "greediness" of the  quantifiers  so
		  that	they  are  not greedy by default, but become greedy if
		  followed by "?". It is not compatible	with Perl. It can also
		  be set by a (?U) option setting within the pattern.

		{newline, NLSpec}:
		  Override  the	default	definition of a	newline	in the subject
		  string, which	is LF (ASCII 10) in Erlang.

		    Newline is indicated by a single character CR (ASCII 13)

		    Newline is indicated by a single character LF (ASCII  10),
		    the	default

		    Newline  is	 indicated by the two-character	CRLF (ASCII 13
		    followed by	ASCII 10) sequence.

		    Any	of the three preceding sequences should	be recognized.

		    Any	of the	newline	 sequences  above,  plus  the  Unicode
		    sequences	VT   (vertical	tab,  U+000B),	FF  (formfeed,
		    U+000C), NEL (next	line,  U+0085),	 LS  (line  separator,
		    U+2028), and PS (paragraph separator, U+2029).

		  Specifies  specifically  that	\R is to match only the	cr, lf
		  or crlf sequences, not the Unicode specific newline  charac-

		  Specifies  specifically  that	\R is to match all the Unicode
		  newline characters (including	crlf etc, the default).

		  This option disables optimization that  may  malfunction  if
		  "Special  start-of-pattern items" are	present	in the regular
		  expression.  A  typical  example  would  be  when   matching
		  "DEFABC"  against  "(*COMMIT)ABC", where the start optimiza-
		  tion of PCRE would skip the subject up to the	"A" and	 would
		  never	 realize  that	the  (*COMMIT) instruction should have
		  made the matching fail. This option is only relevant if  you
		  use  "start-of-pattern  items",  as discussed	in the section
		  "PCRE	regular	expression details" below.

		  Specifies that Unicode Character Properties should  be  used
		  when	resolving  \B,	\b, \D,	\d, \S,	\s, \W and \w. Without
		  this flag, only ISO-Latin-1 properties are used. Using  Uni-
		  code	properties hurts performance, but is semantically cor-
		  rect when working with Unicode characters  beyond  the  ISO-
		  Latin-1 range.

		  Specifies  that  the (*UTF) and/or (*UTF8) "start-of-pattern
		  items" are forbidden.	This flag can  not  be	combined  with
		  unicode.  Useful  if	ISO-Latin-1  patterns from an external
		  source are to	be compiled.

       inspect(MP, Item) -> {namelist, [binary()]}


		 MP = mp()
		 Item =	namelist

	      This function takes a compiled regular expression	and  an	 item,
	      returning	 the  relevant	data from the regular expression. Cur-
	      rently the only supported	item is	namelist,  which  returns  the
	      tuple  {namelist,	 [  binary()]},	 containing  the  names	of all
	      (unique) named subpatterns in the	regular	expression.


	      1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
	      2> re:inspect(MP,namelist).
	      3> {ok,MPD} = re:compile("(?<C>A)|(?<B>B)|(?<C>C)",[dupnames]).
	      4> re:inspect(MPD,namelist).

	      Note specifically	in the second example that the duplicate  name
	      only  occurs  once in the	returned list, and that	the list is in
	      alphabetical order regardless of where the names are  positioned
	      in the regular expression. The order of the names	is the same as
	      the order	of captured subexpressions if {capture,	all_names}  is
	      given as an option to re:run/3. You can therefore	create a name-
	      to-value mapping from the	result of re:run/3 like	this:

	      1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
	      2> {namelist, N} = re:inspect(MP,namelist).
	      3> {match,L} = re:run("AA",MP,[{capture,all_names,binary}]).
	      4> NameMap = lists:zip(N,L).

	      More items are expected to be added in the future.

       run(Subject, RE)	-> {match, Captured} | nomatch


		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata()
		 Captured = [CaptureData]
		 CaptureData = {integer(), integer()}

	      The same as run(Subject,RE,[]).

       run(Subject, RE,	Options) ->
	      {match, Captured}	| match	| nomatch | {error, ErrType}


		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata() |	unicode:charlist()
		 Options = [Option]
		 Option	= anchored
			| global
			| notbol
			| noteol
			| notempty
			| notempty_atstart
			| report_errors
			| {offset, integer() >=	0}
			| {match_limit,	integer() >= 0}
			| {match_limit_recursion, integer() >= 0}
			| {newline, NLSpec :: nl_spec()}
			| bsr_anycrlf
			| bsr_unicode
			| {capture, ValueSpec}
			| {capture, ValueSpec, Type}
			| CompileOpt
		 Type =	index |	list | binary
		 ValueSpec = all
			   | all_but_first
			   | all_names
			   | first
			   | none
			   | ValueList
		 ValueList = [ValueID]
		 ValueID = integer() | string()	| atom()
		 CompileOpt = compile_option()
		   See compile/2 above.
		 Captured = [CaptureData] | [[CaptureData]]
		 CaptureData = {integer(), integer()}
			     | ListConversionData
			     | binary()
		 ListConversionData = string()
				    | {error, string(),	binary()}
				    | {incomplete, string(), binary()}
		 ErrType = match_limit
			 | match_limit_recursion
			 | {compile, CompileErr}
		 CompileErr =
		     {ErrString	:: string(), Position :: integer() >= 0}

	      Executes a regexp	matching, returning match/{match, Captured} or
	      nomatch.	The regular expression can be given either as iodata()
	      in which case it is automatically	compiled (as by	 re:compile/2)
	      and executed, or as a pre-compiled mp() in which case it is exe-
	      cuted against the	subject	directly.

	      When compilation is involved, the	exception badarg is thrown  if
	      a	compilation error occurs. Call re:compile/2 to get information
	      about the	location of the	error in the regular expression.

	      If the regular expression	is  previously	compiled,  the	option
	      list  can	 only  contain	the  options anchored, global, notbol,
	      noteol,  report_errors,  notempty,  notempty_atstart,   {offset,
	      integer()	   _=	 0},	{match_limit,	 integer()    _=   0},
	      {match_limit_recursion, integer()	_= 0}, {newline,  NLSpec}  and
	      {capture,	 ValueSpec}/{capture,  ValueSpec, Type}. Otherwise all
	      options valid for	the re:compile/2 function are allowed as well.
	      Options  allowed	both for compilation and execution of a	match,
	      namely anchored and {newline, NLSpec}, will affect both the com-
	      pilation	and  execution if present together with	a non pre-com-
	      piled regular expression.

	      If the regular  expression  was  previously  compiled  with  the
	      option  unicode,	the Subject should be provided as a valid Uni-
	      code charlist(), otherwise any iodata() will do. If  compilation
	      is  involved  and	 the option unicode is given, both the Subject
	      and the regular expression should	 be  given  as	valid  Unicode

	      The {capture, ValueSpec}/{capture, ValueSpec, Type} defines what
	      to return	from the function upon successful matching.  The  cap-
	      ture  tuple may contain both a value specification telling which
	      of the captured substrings are to	be returned, and a type	speci-
	      fication,	telling	how captured substrings	are to be returned (as
	      index tuples, lists or binaries).	The capture option  makes  the
	      function	quite flexible and powerful. The different options are
	      described	in detail below.

	      If the capture options describe that no substring	 capturing  at
	      all  is  to  be done ({capture, none}), the function will	return
	      the single atom match upon successful  matching,	otherwise  the
	      tuple {match, ValueList} is returned. Disabling capturing	can be
	      done either by specifying	none or	an empty list as ValueSpec.

	      The report_errors	option adds  the  possibility  that  an	 error
	      tuple  is	 returned.  The	 tuple will either indicate a matching
	      error (match_limit or match_limit_recursion)  or	a  compilation
	      error,  where  the  error	tuple has the format {error, {compile,
	      CompileErr}}. Note that  if  the	option	report_errors  is  not
	      given,  the function never returns error tuples, but will	report
	      compilation errors as a badarg exception and failed matches  due
	      to exceeded match	limits simply as nomatch.

	      The options relevant for execution are:

		  Limits  re:run/3 to matching at the first matching position.
		  If a pattern was compiled with anchored, or turned out to be
		  anchored  by virtue of its contents, it cannot be made unan-
		  chored at  matching  time,  hence  there  is	no  unanchored

		  Implements  global (repetitive) search (the g	flag in	Perl).
		  Each match is	returned as a separate list()  containing  the
		  specific match as well as any	matching subexpressions	(or as
		  specified by the capture option). The	Captured part  of  the
		  return  value	 will  hence  be a list() of list()s when this
		  option is given.

		  The interaction of the global	option with a regular  expres-
		  sion	which  matches	an  empty string surprises some	users.
		  When the global option  is  given,  re:run/3	handles	 empty
		  matches  in the same way as Perl: a zero-length match	at any
		  point	 will  be  retried   with   the	  options   [anchored,
		  notempty_atstart]  as	well. If that search gives a result of
		  length > 0, the result is included. For example:


		  The following	matching will be performed:

		  At offset 0:
		    The	regexp (|at) will first	match at the initial  position
		    of	the  string  cat,  giving the result set [{0,0},{0,0}]
		    (the second	{0,0} is due to	the  subexpression  marked  by
		    the	 parentheses).	As  the	 length	 of the	match is 0, we
		    don't advance to the next position yet.

		  At offset 0 with [anchored, notempty_atstart]:
		     The  search  is  retried  with  the  options   [anchored,
		    notempty_atstart]  at  the	same  position,	which does not
		    give any interesting  result  of  longer  length,  so  the
		    search position is now advanced to the next	character (a).

		  At offset 1:
		    This  time,	 the  search results in	[{1,0},{1,0}], so this
		    search will	also be	repeated with the extra	options.

		  At offset 1 with [anchored, notempty_atstart]:
		    Now	the ab alternative is found and	 the  result  will  be
		    [{1,2},{1,2}].  The	result is added	to the list of results
		    and	the position in	the  search  string  is	 advanced  two

		  At offset 3:
		    The	search now once	again matches the empty	string,	giving

		  At offset 1 with [anchored, notempty_atstart]:
		    This will give no result of	length > 0 and we are  at  the
		    last position, so the global search	is complete.

		  The result of	the call is:


		  An  empty  string  is	 not considered	to be a	valid match if
		  this option is given.	If there are alternatives in the  pat-
		  tern,	 they  are  tried.  If	all the	alternatives match the
		  empty	string,	the entire match fails.	For  example,  if  the


		  is  applied  to  a  string not beginning with	"a" or "b", it
		  would	normally match the empty string	at the	start  of  the
		  subject.  With the notempty option, this match is not	valid,
		  so re:run/3 searches further into the	string for occurrences
		  of "a" or "b".

		  This	is  like  notempty,  except that an empty string match
		  that is not at the start of the subject is permitted.	If the
		  pattern is anchored, such a match can	occur only if the pat-
		  tern contains	\K.

		  Perl	 has   no   direct   equivalent	  of	notempty    or
		  notempty_atstart,  but it does make a	special	case of	a pat-
		  tern match of	the empty string within	its split()  function,
		  and  when  using  the	/g modifier. It	is possible to emulate
		  Perl's behavior after	matching a null	string by first	trying
		  the match again at the same offset with notempty_atstart and
		  anchored, and	then, if that fails, by	advancing the starting
		  offset (see below) and trying	an ordinary match again.

		  This	option	specifies that the first character of the sub-
		  ject string is not the beginning of a	line, so  the  circum-
		  flex	metacharacter should not match before it. Setting this
		  without multiline (at	compile	time) causes circumflex	 never
		  to  match. This option only affects the behavior of the cir-
		  cumflex metacharacter. It does not affect \A.

		  This option specifies	that the end of	the subject string  is
		  not  the  end	 of a line, so the dollar metacharacter	should
		  not match it nor (except in multiline	mode) a	newline	 imme-
		  diately  before  it. Setting this without multiline (at com-
		  pile time) causes dollar never to match. This	option affects
		  only	the  behavior of the dollar metacharacter. It does not
		  affect \Z or \z.

		  This option gives better control of the  error  handling  in
		  re:run/3. When it is given, compilation errors (if the regu-
		  lar expression isn't already compiled) as well  as  run-time
		  errors are explicitly	returned as an error tuple.

		  The possible run-time	errors are:

		    The	PCRE library sets a limit on how many times the	inter-
		    nal	match function can be called. The  default  value  for
		    this  is  10000000	in the library compiled	for Erlang. If
		    {error, match_limit} is returned, it means that the	execu-
		    tion  of  the  regular  expression has reached this	limit.
		    Normally this is to	be regarded as a nomatch, which	is the
		    default  return value when this happens, but by specifying
		    report_errors, you will get	informed when the match	 fails
		    due	to to many internal calls.

		    This error is very similar to match_limit, but occurs when
		    the	internal  match	 function  of  PCRE  is	 "recursively"
		    called  more times than the	"match_limit_recursion"	limit,
		    which is by	default	10000000 as well. Note that as long as
		    the	match_limit and	match_limit_default values are kept at
		    the	default	values,	the  match_limit_recursion  error  can
		    not	occur, as the match_limit error	will occur before that
		    (each recursive call is also a call, but not vice  versa).
		    Both limits	can however be changed,	either by setting lim-
		    its	directly in the	regular	expression string (see	refer-
		    ence section below)	or by giving options to	re:run/3

		  It  is  important  to	understand that	what is	referred to as
		  "recursion" when limiting matches is not actually  recursion
		  on  the  C stack of the Erlang machine, neither is it	recur-
		  sion on the Erlang process stack. The	version	of  PCRE  com-
		  piled	into the Erlang	VM uses	machine	"heap" memory to store
		  values that needs to	be  kept  over	recursion  in  regular
		  expression matches.

		{match_limit, integer()	_= 0}:
		  This	option	limits	the  execution	time  of a match in an
		  implementation-specific way. It is described in the  follow-
		  ing way by the PCRE documentation:

		The match_limit	field provides a means of preventing PCRE from using
		up a vast amount of resources when running patterns that are not going
		to match, but which have a very	large number of	possibilities in their
		search trees. The classic example is a pattern that uses nested
		unlimited repeats.

		Internally, pcre_exec()	uses a function	called match(),	which it calls
		repeatedly (sometimes recursively). The	limit set by match_limit is
		imposed	on the number of times this function is	called during a	match,
		which has the effect of	limiting the amount of backtracking that can
		take place. For	patterns that are not anchored,	the count restarts
		from zero for each position in the subject string.

		  This	means that runaway regular expression matches can fail
		  faster if the	 limit	is  lowered  using  this  option.  The
		  default  value  compiled  into the Erlang virtual machine is

		This option does in no way affect the execution	of the	Erlang
		virtual	 machine  in  terms  of	 "long	running	BIF's".	re:run
		always give control back to the	scheduler of Erlang  processes
		at  intervals  that  ensures  the  real	time properties	of the
		Erlang system.

		{match_limit_recursion,	integer() _= 0}:
		  This option limits the execution time	and memory consumption
		  of  a	 match in an implementation-specific way, very similar
		  to match_limit. It is	described in the following way by  the
		  PCRE documentation:

		The match_limit_recursion field	is similar to match_limit, but instead
		of limiting the	total number of	times that match() is called, it
		limits the depth of recursion. The recursion depth is a	smaller	number
		than the total number of calls,	because	not all	calls to match() are
		recursive. This	limit is of use	only if	it is set smaller than

		Limiting the recursion depth limits the	amount of machine stack	that
		can be used, or, when PCRE has been compiled to	use memory on the heap
		instead	of the stack, the amount of heap memory	that can be

		  The  Erlang  virtual	machine	uses a PCRE library where heap
		  memory is used when regular expression match recursion  hap-
		  pens,	 why  this  limits  the	 usage	of machine heap, not C

		  Specifying a lower value may result  in  matches  with  deep
		  recursion failing, when they should actually have matched:

		1> re:run("aaaaaaaaaaaaaz","(a+)*z").
		2> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5}]).
		3> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5},report_errors]).

		  This	option,	 as well as the	match_limit option should only
		  be used in  very  rare  cases.  Understanding	 of  the  PCRE
		  library internals is recommended before tampering with these

		{offset, integer() _= 0}:
		  Start	matching at the	offset (position) given	in the subject
		  string.  The	offset	is  zero-based,	so that	the default is
		  {offset,0} (all of the subject string).

		{newline, NLSpec}:
		  Override the default definition of a newline in the  subject
		  string, which	is LF (ASCII 10) in Erlang.

		    Newline is indicated by a single character CR (ASCII 13)

		    Newline  is	indicated by a single character	LF (ASCII 10),
		    the	default

		    Newline is indicated by the	two-character CRLF  (ASCII  13
		    followed by	ASCII 10) sequence.

		    Any	of the three preceding sequences should	be recognized.

		    Any	 of  the  newline  sequences  above,  plus the Unicode
		    sequences  VT  (vertical  tab,  U+000B),   FF   (formfeed,
		    U+000C),  NEL  (next  line,	 U+0085),  LS (line separator,
		    U+2028), and PS (paragraph separator, U+2029).

		  Specifies specifically that \R is to match only the  cr,  lf
		  or  crlf sequences, not the Unicode specific newline charac-
		  ters.	(overrides compilation option)

		  Specifies specifically that \R is to match all  the  Unicode
		  newline  characters (including crlf etc, the default).(over-
		  rides	compilation option)

		{capture, ValueSpec}/{capture, ValueSpec, Type}:
		  Specifies which captured substrings are returned and in what
		  format.  By  default,	 re:run/3 captures all of the matching
		  part of the substring	as well	as all	capturing  subpatterns
		  (all	of the pattern is automatically	captured). The default
		  return type is (zero-based) indexes of the captured parts of
		  the  string,	given as {Offset,Length} pairs (the index Type
		  of capturing).

		  As an	example	of the default behavior, the following call:


		  returns, as first and	only captured string the matching part
		  of the subject ("abcd" in the	middle)	as a index pair	{3,4},
		  where	character positions are	zero based, just  as  in  off-
		  sets.	The return value of the	call above would then be:


		  Another (and quite common) case is where the regular expres-
		  sion matches all of the subject, as in:


		  where	the return value correspondingly will point out	all of
		  the  string,	beginning  at  index 0 and being 10 characters


		  If the regular expression  contains  capturing  subpatterns,
		  like in the following	case:


		  all  of the matched subject is captured, as well as the cap-
		  tured	substrings:


		  the complete matching	pattern	always giving the first	return
		  value	 in  the  list	and  the rest of the subpatterns being
		  added	in the order they occurred in the regular expression.

		  The capture tuple is built up	as follows:

		    Specifies which captured (sub)patterns are to be returned.
		    The	 ValueSpec  can	 either	be an atom describing a	prede-
		    fined set of return	values,	or a  list  containing	either
		    the	 indexes  or  the  names  of  specific	subpatterns to

		    The	predefined sets	of subpatterns are:

		      All captured subpatterns including the complete matching
		      string. This is the default.

		      All named	subpatterns in the regular expression, as if a
		      list() of	all the	names in alphabetical order was	given.
		      The  list	 of  all  names	can also be retrieved with the
		      inspect/2	function.

		      Only the first captured subpattern, which	is always  the
		      complete	matching  part	of the subject.	All explicitly
		      captured subpatterns are discarded.

		      All but the first	matching subpattern, i.e. all  explic-
		      itly captured subpatterns, but not the complete matching
		      part of the subject string. This is useful if the	 regu-
		      lar  expression  as  a whole matches a large part	of the
		      subject, but the part you're  interested	in  is	in  an
		      explicitly  captured  subpattern.	 If the	return type is
		      list or binary, not  returning  subpatterns  you're  not
		      interested in is a good way to optimize.

		      Do  not return matching subpatterns at all, yielding the
		      single atom match	as the return value  of	 the  function
		      when   matching  successfully  instead  of  the  {match,
		      list()} return. Specifying an empty list gives the  same

		    The	value list is a	list of	indexes	for the	subpatterns to
		    return, where index	0 is for all of	the pattern, and 1  is
		    for	the first explicit capturing subpattern	in the regular
		    expression,	and so forth. When using named	captured  sub-
		    patterns  (see  below)  in the regular expression, one can
		    use	atom()s	or string()s to	specify	the subpatterns	to  be
		    returned. For example, consider the	regular	expression:


		    matched  against  the  string "ABCabcdABC",	capturing only
		    the	"abcd" part (the first explicit	subpattern):


		    The	call will yield	the following result:


		    as the first explicitly captured subpattern	 is  "(abcd)",
		    matching  "abcd"  in the subject, at (zero-based) position
		    3, of length 4.

		    Now	consider the same regular  expression,	but  with  the
		    subpattern explicitly named	'FOO':


		    With this expression, we could still give the index	of the
		    subpattern with the	following call:


		    giving the same result as before. But, since  the  subpat-
		    tern  is  named, we	can also specify its name in the value


		    which would	yield the same result as the earlier examples,


		    The	values list might specify indexes or names not present
		    in the regular expression, in which	case the return	values
		    vary  depending  on	 the  type.  If	the type is index, the
		    tuple {-1,0} is returned for values	having no  correspond-
		    ing	 subpattern  in	 the  regexp,  but for the other types
		    (binary and	list), the values are the empty	binary or list

		    Optionally	specifies  how	captured  substrings are to be
		    returned. If omitted, the default of index	is  used.  The
		    Type can be	one of the following:

		      Return captured substrings as pairs of byte indexes into
		      the subject string and length of the matching string  in
		      the subject (as if the subject string was	flattened with
		      iolist_to_binary/1   or	unicode:characters_to_binary/2
		      prior to matching). Note that the	unicode	option results
		      in byte-oriented indexes in a (possibly  virtual)	 UTF-8
		      encoded binary. A	byte index tuple {0,2} might therefore
		      represent	one or	two  characters	 when  unicode	is  in
		      effect.  This might seem counter-intuitive, but has been
		      deemed the most effective	and useful way to  way	to  do
		      it. To return lists instead might	result in simpler code
		      if that is desired. This return type is the default.

		      Return  matching	substrings  as	lists  of   characters
		      (Erlang  string()s).  It	the  unicode option is used in
		      combination with the \C sequence in the regular  expres-
		      sion,  a	captured subpattern can	contain	bytes that are
		      not valid	UTF-8 (\C matches bytes	regardless of  charac-
		      ter  encoding).  In  that	 case  the  list capturing may
		      result in	the same types of tuples that  unicode:charac-
		      ters_to_list/2  can return, namely three-tuples with the
		      tag incomplete  or  error,  the  successfully  converted
		      characters  and the invalid UTF-8	tail of	the conversion
		      as a binary. The best strategy is	to avoid using the  \C
		      sequence when capturing lists.

		      Return  matching	substrings as binaries.	If the unicode
		      option is	used, these binaries are in UTF-8. If  the  \C
		      sequence	is used	together with unicode the binaries may
		      be invalid UTF-8.

		  In general, subpatterns that were not	assigned  a  value  in
		  the  match  are  returned  as	 the tuple {-1,0} when type is
		  index. Unassigned subpatterns	 are  returned	as  the	 empty
		  binary  or  list, respectively, for other return types. Con-
		  sider	the regular expression:


		  There	are three explicitly capturing subpatterns, where  the
		  opening  parenthesis	position  determines  the order	in the
		  result, hence	((?_FOO_abdd)|a(..d)) is subpattern  index  1,
		  (?_FOO_abdd)	is  subpattern index 2 and (..d) is subpattern
		  index	3. When	matched	against	the following string:


		  the subpattern at index 2 won't  match,  as  "abdd"  is  not
		  present in the string, but the complete pattern matches (due
		  to the alternative a(..d). The  subpattern  at  index	 2  is
		  therefore unassigned and the default return value will be:


		  Setting the capture Type to binary would give	the following:


		  where	the empty binary (____)	represents the unassigned sub-
		  pattern. In the binary  case,	 some  information  about  the
		  matching  is	therefore lost,	the ____ might just as well be
		  an empty string captured.

		  If differentiation between empty matches  and	 non  existing
		  subpatterns is necessary, use	the type index and do the con-
		  version to the final type in Erlang code.

		  When the option global is given, the	capture	 specification
		  affects each match separately, so that:


		  gives	the result:


	      The  options solely affecting the	compilation step are described
	      in the re:compile/2 function.

       replace(Subject,	RE, Replacement) -> iodata() | unicode:charlist()


		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata()
		 Replacement = iodata()	| unicode:charlist()

	      The same as replace(Subject,RE,Replacement,[]).

       replace(Subject,	RE, Replacement, Options) ->
		  iodata() | unicode:charlist()


		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata() |	unicode:charlist()
		 Replacement = iodata()	| unicode:charlist()
		 Options = [Option]
		 Option	= anchored
			| global
			| notbol
			| noteol
			| notempty
			| notempty_atstart
			| {offset, integer() >=	0}
			| {newline, NLSpec}
			| bsr_anycrlf
			| {match_limit,	integer() >= 0}
			| {match_limit_recursion, integer() >= 0}
			| bsr_unicode
			| {return, ReturnType}
			| CompileOpt
		 ReturnType = iodata | list | binary
		 CompileOpt = compile_option()
		 NLSpec	= cr | crlf | lf | anycrlf | any

	      Replaces the matched part	of the Subject string  with  the  con-
	      tents of Replacement.

	      The  permissible	options	 are  the same as for re:run/3,	except
	      that the capture option  is  not	allowed.  Instead  a  {return,
	      ReturnType}  is present. The default return type is iodata, con-
	      structed in a way	to minimize copying. The iodata	result can  be
	      used  directly  in  many	I/O-operations.	 If  a	flat list() is
	      desired, specify {return,	list} and if a	binary	is  preferred,
	      specify {return, binary}.

	      As  in  the re:run/3 function, an	mp() compiled with the unicode
	      option requires the Subject to be	a Unicode charlist(). If  com-
	      pilation	is  done implicitly and	the unicode compilation	option
	      is given to this function, both the regular expression  and  the
	      Subject should be	given as valid Unicode charlist()s.

	      The  replacement	string	can  contain  the special character _,
	      which inserts the	whole matching expression in the  result,  and
	      the  special  sequence  \N  (where  N is an integer > 0),	\gN or
	      \g{N} resulting in the subexpression number N will  be  inserted
	      in the result. If	no subexpression with that number is generated
	      by the regular expression, nothing is inserted.

	      To insert	an _ or	\ in the result, precede it  with  a  \.  Note
	      that  Erlang  already  gives  a  special meaning to \ in literal
	      strings, so a single \ has to be written as "\\" and therefore a
	      double \ as "\\\\". Example:








	      As with re:run/3,	compilation errors raise the badarg exception,
	      re:compile/2 can be used	to  get	 more  information  about  the

       split(Subject, RE) -> SplitList


		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata()
		 SplitList = [iodata() | unicode:charlist()]

	      The same as split(Subject,RE,[]).

       split(Subject, RE, Options) -> SplitList


		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata() |	unicode:charlist()
		 Options = [Option]
		 Option	= anchored
			| notbol
			| noteol
			| notempty
			| notempty_atstart
			| {offset, integer() >=	0}
			| {newline, nl_spec()}
			| {match_limit,	integer() >= 0}
			| {match_limit_recursion, integer() >= 0}
			| bsr_anycrlf
			| bsr_unicode
			| {return, ReturnType}
			| {parts, NumParts}
			| group
			| trim
			| CompileOpt
		 NumParts = integer() >= 0 | infinity
		 ReturnType = iodata | list | binary
		 CompileOpt = compile_option()
		   See compile/2 above.
		 SplitList = [RetData] | [GroupedRetData]
		 GroupedRetData	= [RetData]
		 RetData = iodata() | unicode:charlist() | binary() | list()

	      This  function  splits  the  input  into parts by	finding	tokens
	      according	to the regular expression supplied.

	      The splitting is done basically by running a global regexp match
	      and  dividing  the  initial  string wherever a match occurs. The
	      matching part of the string is removed from the output.

	      As in the	re:run/3 function, an mp() compiled with  the  unicode
	      option  requires the Subject to be a Unicode charlist(). If com-
	      pilation is done implicitly and the unicode  compilation	option
	      is  given	 to this function, both	the regular expression and the
	      Subject should be	given as valid Unicode charlist()s.

	      The result is given  as  a  list	of  "strings",	the  preferred
	      datatype given in	the return option (default iodata).

	      If  subexpressions  are  given  in  the  regular expression, the
	      matching subexpressions are returned in the  resulting  list  as
	      well. An example:


	      will yield the result:




	      will yield


	      The  text	 matching the subexpression (marked by the parentheses
	      in the regexp) is	inserted in  the  result  list	where  it  was
	      found.  In  effect this means that concatenating the result of a
	      split where the whole regexp is a	single	subexpression  (as  in
	      the example above) will always result in the original string.

	      As  there	 is no matching	subexpression for the last part	in the
	      example (the "g"), there is nothing inserted after that. To make
	      the  group  of strings and the parts matching the	subexpressions
	      more obvious, one	might  use  the	 group	option,	 which	groups
	      together	the part of the	subject	string with the	parts matching
	      the subexpressions when the string was split:




	      Here the regular expression matched first	the "l", causing  "Er"
	      to  be the first part in the result. When	the regular expression
	      matched, the (only) subexpression	was bound to the "l",  so  the
	      "l"  is inserted in the group together with "Er".	The next match
	      is of the	"n", making "a"	the next part to  be  returned.	 Since
	      the  subexpression  is  bound to the substring "n" in this case,
	      the "n" is inserted into this group. The last group consists  of
	      the rest of the string, as no more matches are found.

	      By  default,  all	 parts	of  the	 string,  including  the empty
	      strings, are returned from the function. For example:


	      will return:


	      since the	matching of the	"g" in the end of the string leaves an
	      empty  rest  which is also returned. This	behaviour differs from
	      the default behaviour of the split function in Perl, where empty
	      strings at the end are by	default	removed. To get	the "trimming"
	      default behavior of Perl,	specify	trim as	an option:


	      The result will be:


	      The "trim" option	in effect says;	"give me as many parts as pos-
	      sible except the empty ones", which might	be useful in some cir-
	      cumstances. You can also specify how many	 parts	you  want,  by
	      specifying {parts,N}:


	      This will	give:


	      Note that	the last part is "ang",	not "an", as we	only specified
	      splitting	into two parts,	and the	splitting  stops  when	enough
	      parts  are  given,  which	is why the result differs from that of

	      More than	three parts are	not possible with this indata, so


	      will give	the same result	as the default,	which is to be	viewed
	      as "an infinite number of	parts".

	      Specifying 0 as the number of parts gives	the same effect	as the
	      option trim. If subexpressions are captured, empty subexpression
	      matches  at the end are also stripped from the result if trim or
	      {parts,0}	is specified.

	      If you are familiar with Perl, the  trim	behaviour  corresponds
	      exactly to the Perl default, the {parts,N} where N is a positive
	      integer corresponds exactly to the Perl behaviour	with  a	 posi-
	      tive  numerical  third  parameter	 and  the default behaviour of
	      re:split/3 corresponds to	that when the Perl routine is given  a
	      negative integer as the third parameter.

	      Summary  of  options  not	 previously described for the re:run/3

		  Specifies how	the parts of the original string are presented
		  in the result	list. The possible types are:

		    The	 variant  of  iodata() that gives the least copying of
		    data with the current implementation (often	a binary,  but
		    don't depend on it).

		    All	parts returned as binaries.

		    All	parts returned as lists	of characters ("strings").

		  Groups together the part of the string with the parts	of the
		  string matching the subexpressions of	the regexp.

		  The return value from	the function will in this  case	 be  a
		  list()  of  list()s.	Each  sublist  begins  with the	string
		  picked out of	the subject  string,  followed	by  the	 parts
		  matching  each  of the subexpressions	in order of occurrence
		  in the regular expression.

		  Specifies the	number of parts	the subject string  is	to  be
		  split	into.

		  The  number of parts should be a positive integer for	a spe-
		  cific	maximum	on the number of parts and  infinity  for  the
		  maximum  number  of parts possible (the default). Specifying
		  {parts,0} gives as many parts	as possible disregarding empty
		  parts	at the end, the	same as	specifying trim

		  Specifies that empty parts at	the end	of the result list are
		  to be	disregarded. The same as  specifying  {parts,0}.  This
		  corresponds  to  the default behaviour of the	split built in
		  function in Perl.

       The following sections  contain	reference  material  for  the  regular
       expressions  used  by  this module. The regular expression reference is
       based on	the PCRE documentation,	with changes in	 cases	where  the  re
       module behaves differently to the PCRE library.

       The  syntax and semantics of the	regular	expressions that are supported
       by PCRE are described in	detail below. Perl's regular  expressions  are
       described  in its own documentation, and	regular	expressions in general
       are covered in a	number of books, some of which have copious  examples.
       Jeffrey	 Friedl's   "Mastering	 Regular  Expressions",	 published  by
       O'Reilly, covers	regular	expressions in great detail. This  description
       of PCRE's regular expressions is	intended as reference material.

       The reference material is divided into the following sections:

	 * Special start-of-pattern items

	 * Characters and metacharacters

	 * Backslash

	 * Circumflex and dollar

	 * Full	stop (period, dot) and \N

	 * Matching a single data unit

	 * Square brackets and character classes

	 * POSIX character classes

	 * Vertical bar

	 * Internal option setting

	 * Subpatterns

	 * Duplicate subpattern	numbers

	 * Named subpatterns

	 * Repetition

	 * Atomic grouping and possessive quantifiers

	 * Back	references

	 * Assertions

	 * Conditional subpatterns

	 * Comments

	 * Recursive patterns

	 * Subpatterns as subroutines

	 * Oniguruma subroutine	syntax

	 * Backtracking	control

       A  number of options that can be	passed to re:compile/2 can also	be set
       by special items	at the start of	a pattern. These are not Perl-compati-
       ble, but	are provided to	make these options accessible to pattern writ-
       ers who are not able to change the program that processes the  pattern.
       Any  number  of	these  items may appear, but they must all be together
       right at	the start of the pattern string, and the letters  must	be  in
       upper case.

       UTF support

       Unicode	support	 is  basically UTF-8 based. To use Unicode characters,
       you either call re:compile/2/re:run/3 with the unicode option,  or  the
       pattern must start with one of these special sequences:



       Both  options  give the same effect, the	input string is	interpreted as
       UTF-8. Note that	with these instructions, the automatic	conversion  of
       lists  to  UTF-8	 is not	performed by the re functions, why using these
       options is not recommended. Add the unicode option when running re:com-
       pile/2 instead.

       Some applications that allow their users	to supply patterns may wish to
       restrict	them to	non-UTF	data for security reasons.  If	the  never_utf
       option  is  set at compile time,	(*UTF) etc. are	not allowed, and their
       appearance causes an error.

       Unicode property	support

       Another special sequence	that may appear	at the start of	a pattern is


       This has	the same effect	as setting the ucp option: it causes sequences
       such  as	 \d  and  \w  to use Unicode properties	to determine character
       types, instead of recognizing only characters with codes	less than  256
       via a lookup table.

       Disabling start-up optimizations

       If  a  pattern  starts  with (*NO_START_OPT), it	has the	same effect as
       setting the no_Start_optimize option at compile time.

       Newline conventions

       PCRE supports five different conventions	for indicating line breaks  in
       strings:	 a  single  CR (carriage return) character, a single LF	(line-
       feed) character,	the two-character sequence CRLF	,  any	of  the	 three
       preceding, or any Unicode newline sequence.

       It  is also possible to specify a newline convention by starting	a pat-
       tern string with	one of the following five sequences:

	   carriage return


	   carriage return, followed by	linefeed

	   any of the three above

	   all Unicode newline sequences

       These override the default and the options given	to  re:compile/2.  For
       example,	the pattern:


       changes the convention to CR. That pattern matches "a\nb" because LF is
       no longer a newline. If more than one of	them is	present, the last  one
       is used.

       The  newline  convention	affects	where the circumflex and dollar	asser-
       tions are true. It also affects the interpretation of the dot metachar-
       acter when dotall is not	set, and the behaviour of \N. However, it does
       not affect what the \R escape sequence matches. By default, this	is any
       Unicode	newline	sequence, for Perl compatibility. However, this	can be
       changed;	see the	description of \R in  the  section  entitled  "Newline
       sequences"  below. A change of \R setting can be	combined with a	change
       of newline convention.

       Setting match and recursion limits

       The caller of re:run/3 can set a	limit  on  the	number	of  times  the
       internal	 match() function is called and	on the maximum depth of	recur-
       sive calls. These facilities are	provided to catch runaway matches that
       are provoked by patterns	with huge matching trees (a typical example is
       a pattern with nested unlimited repeats)	and to avoid  running  out  of
       system  stack  by  too  much  recursion.	 When  one  of these limits is
       reached,	pcre_exec() gives an error return. The limits can also be  set
       by items	at the start of	the pattern of the form



       where d is any number of	decimal	digits.	However, the value of the set-
       ting must be less than the value	set by the caller of re:run/3  for  it
       to  have	 any  effect. In other words, the pattern writer can lower the
       limit set by the	programmer, but	not raise it. If there	is  more  than
       one setting of one of these limits, the lower value is used.

       The  current  default  value  for  both	the limits are 10000000	in the
       Erlang VM. Note that the	recursion limit	does not actually  affect  the
       stack  depth  of	 the  VM, as PCRE for Erlang is	compiled in such a way
       that the	match function never does recursion on the "C-stack".

       A regular expression is a pattern that is  matched  against  a  subject
       string  from  left  to right. Most characters stand for themselves in a
       pattern,	and match the corresponding characters in the  subject.	 As  a
       trivial example,	the pattern

       The quick brown fox

       matches a portion of a subject string that is identical to itself. When
       caseless	matching is  specified	(the  caseless	option),  letters  are
       matched independently of	case.

       The  power  of  regular	expressions  comes from	the ability to include
       alternatives and	repetitions in the pattern. These are encoded  in  the
       pattern by the use of metacharacters, which do not stand	for themselves
       but instead are interpreted in some special way.

       There are two different sets of metacharacters: those that  are	recog-
       nized  anywhere in the pattern except within square brackets, and those
       that are	recognized within square brackets.  Outside  square  brackets,
       the metacharacters are as follows:

	   general escape character with several uses

	   assert start	of string (or line, in multiline mode)

	   assert end of string	(or line, in multiline mode)

	   match any character except newline (by default)

	   start character class definition

	   start of alternative	branch

	   start subpattern

	   end subpattern

	   extends  the	 meaning of (, also 0 or 1 quantifier, also quantifier

	   0 or	more quantifier

	   1 or	more quantifier, also "possessive quantifier"

	   start min/max quantifier

       Part of a pattern that is in square brackets  is	 called	 a  "character
       class". In a character class the	only metacharacters are:

	   general escape character

	   negate the class, but only if the first character

	   indicates character range

	   POSIX character class (only if followed by POSIX syntax)

	   terminates the character class

       The following sections describe the use of each of the metacharacters.

       The backslash character has several uses. Firstly, if it	is followed by
       a character that	is not a number	or a letter, it	takes away any special
       meaning	that  character	 may  have. This use of	backslash as an	escape
       character applies both inside and outside character classes.

       For example, if you want	to match a * character,	you write  \*  in  the
       pattern.	 This  escaping	 action	 applies  whether or not the following
       character would otherwise be interpreted	as a metacharacter, so	it  is
       always  safe  to	 precede  a non-alphanumeric with backslash to specify
       that it stands for itself. In particular, if you	want to	match a	 back-
       slash, you write	\\.

       In  unicode mode, only ASCII numbers and	letters	have any special mean-
       ing after a backslash. All other	characters (in particular, those whose
       codepoints are greater than 127)	are treated as literals.

       If  a  pattern is compiled with the extended option, white space	in the
       pattern (other than in a	character class) and characters	 between  a  #
       outside a character class and the next newline are ignored. An escaping
       backslash can be	used to	include	a white	space or # character  as  part
       of the pattern.

       If  you	want  to remove	the special meaning from a sequence of charac-
       ters, you can do	so by putting them between \Q and \E. This is  differ-
       ent  from  Perl	in  that  $  and  @ are	handled	as literals in \Q...\E
       sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-
       tion. Note the following	examples:

	 Pattern	   PCRE	matches	  Perl matches

	 \Qabc$xyz\E	   abc$xyz	  abc followed by the contents of $xyz
	 \Qabc\$xyz\E	   abc\$xyz	  abc\$xyz
	 \Qabc\E\$\Qxyz\E  abc$xyz	  abc$xyz

       The  \Q...\E  sequence  is recognized both inside and outside character
       classes.	An isolated \E that is not preceded by \Q is ignored. If \Q is
       not  followed  by  \E  later in the pattern, the	literal	interpretation
       continues to the	end of the pattern (that is,  \E  is  assumed  at  the
       end).  If  the  isolated	\Q is inside a character class,	this causes an
       error, because the character class is not terminated.

       Non-printing characters

       A second	use of backslash provides a way	of encoding non-printing char-
       acters  in patterns in a	visible	manner.	There is no restriction	on the
       appearance of non-printing characters, apart from the binary zero  that
       terminates  a  pattern,	but  when  a pattern is	being prepared by text
       editing,	it is  often  easier  to  use  one  of	the  following	escape
       sequences than the binary character it represents:

	   alarm, that is, the BEL character (hex 07)

	   "control-x",	where x	is any ASCII character

	 \e :
	   escape (hex 1B)

	   form	feed (hex 0C)

	   linefeed (hex 0A)

	   carriage return (hex	0D)

	 \t :
	   tab (hex 09)

	   character with octal	code ddd, or back reference

	 \xhh :
	   character with hex code hh

	   character with hex code hhh..

       The  precise effect of \cx on ASCII characters is as follows: if	x is a
       lower case letter, it is	converted to upper case. Then  bit  6  of  the
       character (hex 40) is inverted. Thus \cA	to \cZ become hex 01 to	hex 1A
       (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is	7B), and  \c;  becomes
       hex  7B (; is 3B). If the data item (byte or 16-bit value) following \c
       has a value greater than	127, a compile-time error occurs.  This	 locks
       out non-ASCII characters	in all modes.

       The  \c	facility  was designed for use with ASCII characters, but with
       the extension to	Unicode	it is even less	useful than it once was.

       By default, after \x, from zero to  two	hexadecimal  digits  are  read
       (letters	can be in upper	or lower case).	Any number of hexadecimal dig-
       its may appear between \x{ and }, but the character code	is constrained
       as follows:

	 8-bit non-Unicode mode:
	   less	than 0x100

	 8-bit UTF-8 mode:
	   less	than 0x10ffff and a valid codepoint

       Invalid	Unicode	 codepoints  are  the  range 0xd800 to 0xdfff (the so-
       called "surrogate" codepoints), and 0xffef.

       If characters other than	hexadecimal digits appear between \x{  and  },
       or if there is no terminating },	this form of escape is not recognized.
       Instead,	the initial \x will be	interpreted  as	 a  basic  hexadecimal
       escape,	with  no  following  digits, giving a character	whose value is

       Characters whose	value is less than 256 can be defined by either	of the
       two  syntaxes  for  \x. There is	no difference in the way they are han-
       dled. For example, \xdc is exactly the same as \x{dc}.

       After \0	up to two further octal	digits are read. If  there  are	 fewer
       than  two  digits,  just	 those	that  are  present  are	used. Thus the
       sequence	\0\x\07	specifies two binary zeros followed by a BEL character
       (code  value 7).	Make sure you supply two digits	after the initial zero
       if the pattern character	that follows is	itself an octal	digit.

       The handling of a backslash followed by a digit other than 0 is compli-
       cated.  Outside a character class, PCRE reads it	and any	following dig-
       its as a	decimal	number.	If the number is less than  10,	 or  if	 there
       have been at least that many previous capturing left parentheses	in the
       expression, the entire  sequence	 is  taken  as	a  back	 reference.  A
       description  of how this	works is given later, following	the discussion
       of parenthesized	subpatterns.

       Inside a	character class, or if the decimal number is  greater  than  9
       and  there have not been	that many capturing subpatterns, PCRE re-reads
       up to three octal digits	following the backslash, and uses them to gen-
       erate a data character. Any subsequent digits stand for themselves. The
       value of	the character is constrained in	the  same  way	as  characters
       specified in hexadecimal. For example:

	   is another way of writing a ASCII space

	   is  the  same,  provided there are fewer than 40 previous capturing

	   is always a back reference

	    might be a back reference, or another way of writing a tab

	   is always a tab

	   is a	tab followed by	the character "3"

	   might be a back reference, otherwise	the character with octal  code

	   might be a back reference, otherwise	the value 255 (decimal)

	   is  either  a  back reference, or a binary zero followed by the two
	   characters "8" and "1"

       Note that octal values of 100 or	greater	must not be  introduced	 by  a
       leading zero, because no	more than three	octal digits are ever read.

       All the sequences that define a single character	value can be used both
       inside and outside character classes. In	addition, inside  a  character
       class, \b is interpreted	as the backspace character (hex	08).

       \N  is not allowed in a character class.	\B, \R,	and \X are not special
       inside a	character class. Like  other  unrecognized  escape  sequences,
       they are	treated	as the literal characters "B", "R", and	"X". Outside a
       character class,	these sequences	have different meanings.

       Unsupported escape sequences

       In Perl,	the sequences \l, \L, \u, and \U are recognized	by its	string
       handler	and used to modify the case of following characters. PCRE does
       not support these escape	sequences.

       Absolute	and relative back references

       The sequence \g followed	by an unsigned or a negative  number,  option-
       ally  enclosed  in braces, is an	absolute or relative back reference. A
       named back reference can	be coded as \g{name}. Back references are dis-
       cussed later, following the discussion of parenthesized subpatterns.

       Absolute	and relative subroutine	calls

       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
       name or a number	enclosed either	in angle brackets or single quotes, is
       an  alternative	syntax for referencing a subpattern as a "subroutine".
       Details are discussed  later.  Note  that  \g{...}  (Perl  syntax)  and
       \g<...>	(Oniguruma  syntax)  are  not synonymous. The former is	a back
       reference; the latter is	a subroutine call.

       Generic character types

       Another use of backslash	is for specifying generic character types:

	   any decimal digit

	   any character that is not a decimal digit

	   any horizontal white	space character

	   any character that is not a horizontal white	space character

	   any white space character

	   any character that is not a white space character

	   any vertical	white space character

	   any character that is not a vertical	white space character

	   any "word" character

	   any "non-word" character

       There is	also the single	sequence \N, which matches a non-newline char-
       acter.  This  is	 the  same as the "." metacharacter when dotall	is not
       set. Perl also uses \N to match characters by name; PCRE	does not  sup-
       port this.

       Each  pair of lower and upper case escape sequences partitions the com-
       plete set of characters into two	disjoint  sets.	 Any  given  character
       matches	one, and only one, of each pair. The sequences can appear both
       inside and outside character classes. They each match one character  of
       the  appropriate	 type.	If the current matching	point is at the	end of
       the subject string, all of them fail, because there is no character  to

       For  compatibility  with	Perl, \s does not match	the VT character (code
       11). This makes it different from the POSIX "space" class. The \s char-
       acters  are  HT (9), LF (10), FF	(12), CR (13), and space (32). If "use
       locale;"	is included in a Perl script, \s may match the	VT  character.
       In PCRE,	it never does.

       A  "word"  character is an underscore or	any character that is a	letter
       or digit. By default, the definition of	letters	 and  digits  is  con-
       trolled	by  PCRE's  low-valued character tables, in Erlang's case (and
       without the unicode option), the	ISO-Latin-1 character set.

       By default, in unicode mode, characters with values greater  than  255,
       i.e.  all characters outside the	ISO-Latin-1 character set, never match
       \d, \s, or \w, and always match \D, \S, and \W. These sequences	retain
       their  original	meanings from before UTF support was available,	mainly
       for efficiency reasons. However,	if the ucp option is set,  the	behav-
       iour  is	changed	so that	Unicode	properties are used to determine char-
       acter types, as follows:

	   any character that \p{Nd} matches (decimal digit)

	   any character that \p{Z} matches, plus HT, LF, FF, CR)

	   any character that \p{L} or \p{N} matches, plus underscore)

       The upper case escapes match the	inverse	sets of	characters. Note  that
       \d  matches  only decimal digits, whereas \w matches any	Unicode	digit,
       as well as any Unicode letter,  and  underscore.	 Note  also  that  ucp
       affects	\b,  and  \B  because  they are	defined	in terms of \w and \W.
       Matching	these sequences	is noticeably slower when ucp is set.

       The sequences \h, \H, \v, and \V	are features that were added  to  Perl
       at  release  5.10. In contrast to the other sequences, which match only
       ASCII characters	by default, these  always  match  certain  high-valued
       codepoints,  whether or not ucp is set. The horizontal space characters

	   Horizontal tab (HT)


	   Non-break space

	   Ogham space mark

	   Mongolian vowel separator

	   En quad

	   Em quad

	   En space

	   Em space

	   Three-per-em	space

	   Four-per-em space

	   Six-per-em space

	   Figure space

	   Punctuation space

	   Thin	space

	   Hair	space

	   Narrow no-break space

	   Medium mathematical space

	   Ideographic space

       The vertical space characters are:

	   Linefeed (LF)

	   Vertical tab	(VT)

	   Form	feed (FF)

	   Carriage return (CR)

	   Next	line (NEL)

	   Line	separator

	   Paragraph separator

       In 8-bit, non-UTF-8 mode, only the characters with codepoints less than
       256 are relevant.

       Newline sequences

       Outside	a  character class, by default,	the escape sequence \R matches
       any Unicode newline sequence. In	non-UTF-8 mode \R is equivalent	to the


       This  is	 an  example  of an "atomic group", details of which are given

       This particular group matches either the	two-character sequence CR fol-
       lowed  by LF, or	one of the single characters LF	(linefeed, U+000A), VT
       (vertical tab, U+000B), FF (form	feed, U+000C),	CR  (carriage  return,
       U+000D),	 or  NEL  (next	 line,	U+0085). The two-character sequence is
       treated as a single unit	that cannot be split.

       In Unicode mode,	two additional characters whose	codepoints are greater
       than 255	are added: LS (line separator, U+2028) and PS (paragraph sepa-
       rator, U+2029). Unicode character property support is  not  needed  for
       these characters	to be recognized.

       It is possible to restrict \R to	match only CR, LF, or CRLF (instead of
       the complete set	 of  Unicode  line  endings)  by  setting  the	option
       bsr_anycrlf either at compile time or when the pattern is matched. (BSR
       is an abbreviation for "backslash R".) This can	be  made  the  default
       when  PCRE  is  built;  if this is the case, the	other behaviour	can be
       requested via the bsr_unicode option. It	is also	 possible  to  specify
       these  settings	by starting a pattern string with one of the following

       (*BSR_ANYCRLF) CR, LF, or CRLF only (*BSR_UNICODE) any Unicode  newline

       These override the default and the options given	to the compiling func-
       tion, but they can themselves be	 overridden  by	 options  given	 to  a
       matching	 function.  Note  that	these  special settings, which are not
       Perl-compatible,	are recognized only at the very	start  of  a  pattern,
       and  that  they	must  be  in  upper  case. If more than	one of them is
       present,	the last one is	used. They can be combined with	 a  change  of
       newline convention; for example,	a pattern can start with:


       They  can  also	be combined with the (*UTF8), (*UTF) or	(*UCP) special
       sequences. Inside a character class, \R is treated as  an  unrecognized
       escape sequence,	and so matches the letter "R" by default.

       Unicode character properties

       Three  additional  escape sequences that	match characters with specific
       properties are available. When in 8-bit non-UTF-8 mode, these sequences
       are  of	course limited to testing characters whose codepoints are less
       than 256, but they do work in this mode.	 The  extra  escape  sequences

	   a character with the	xx property

	   a character without the xx property

	   a Unicode extended grapheme cluster

       The  property  names represented	by xx above are	limited	to the Unicode
       script names, the general category properties, "Any", which matches any
       character   (including  newline),  and  some  special  PCRE  properties
       (described in the next section).	Other Perl properties such as "InMusi-
       calSymbols" are not currently supported by PCRE.	Note that \P{Any} does
       not match any characters, so always causes a match failure.

       Sets of Unicode characters are defined as belonging to certain scripts.
       A  character from one of	these sets can be matched using	a script name.
       For example:

       \p{Greek} \P{Han}

       Those that are not part of an identified	script are lumped together  as
       "Common". The current list of scripts is:

	 * Arabic

	 * Armenian

	 * Avestan

	 * Balinese

	 * Bamum

	 * Batak

	 * Bengali

	 * Bopomofo

	 * Braille

	 * Buginese

	 * Buhid

	 * Canadian_Aboriginal

	 * Carian

	 * Chakma

	 * Cham

	 * Cherokee

	 * Common

	 * Coptic

	 * Cuneiform

	 * Cypriot

	 * Cyrillic

	 * Deseret

	 * Devanagari

	 * Egyptian_Hieroglyphs

	 * Ethiopic

	 * Georgian

	 * Glagolitic

	 * Gothic

	 * Greek

	 * Gujarati

	 * Gurmukhi

	 * Han

	 * Hangul

	 * Hanunoo

	 * Hebrew

	 * Hiragana

	 * Imperial_Aramaic

	 * Inherited

	 * Inscriptional_Pahlavi

	 * Inscriptional_Parthian

	 * Javanese

	 * Kaithi

	 * Kannada

	 * Katakana

	 * Kayah_Li

	 * Kharoshthi

	 * Khmer

	 * Lao

	 * Latin

	 * Lepcha

	 * Limbu

	 * Linear_B

	 * Lisu

	 * Lycian

	 * Lydian

	 * Malayalam

	 * Mandaic

	 * Meetei_Mayek

	 * Meroitic_Cursive

	 * Meroitic_Hieroglyphs

	 * Miao

	 * Mongolian

	 * Myanmar

	 * New_Tai_Lue

	 * Nko

	 * Ogham

	 * Old_Italic

	 * Old_Persian

	 * Oriya

	 * Old_South_Arabian

	 * Old_Turkic

	 * Ol_Chiki

	 * Osmanya

	 * Phags_Pa

	 * Phoenician

	 * Rejang

	 * Runic

	 * Samaritan

	 * Saurashtra

	 * Sharada

	 * Shavian

	 * Sinhala

	 * Sora_Sompeng

	 * Sundanese

	 * Syloti_Nagri

	 * Syriac

	 * Tagalog

	 * Tagbanwa

	 * Tai_Le

	 * Tai_Tham

	 * Tai_Viet

	 * Takri

	 * Tamil

	 * Telugu

	 * Thaana

	 * Thai

	 * Tibetan

	 * Tifinagh

	 * Ugaritic

	 * Vai

	 * Yi

       Each character has exactly one Unicode general category property, spec-
       ified by	a two-letter abbreviation. For compatibility with Perl,	 nega-
       tion  can  be  specified	 by including a	circumflex between the opening
       brace and the property name.  For  example,  \p{^Lu}  is	 the  same  as

       If only one letter is specified with \p or \P, it includes all the gen-
       eral category properties	that start with	that letter. In	this case,  in
       the  absence of negation, the curly brackets in the escape sequence are
       optional; these two examples have the same effect:

	 * \p{L}

	 * \pL

       The following general category property codes are supported:





	   Private use



	   Lower case letter

	   Modifier letter

	   Other letter

	   Title case letter

	   Upper case letter


	   Spacing mark

	   Enclosing mark

	   Non-spacing mark


	   Decimal number

	   Letter number

	   Other number


	   Connector punctuation

	   Dash	punctuation

	   Close punctuation

	   Final punctuation

	   Initial punctuation

	   Other punctuation

	   Open	punctuation


	   Currency symbol

	   Modifier symbol

	   Mathematical	symbol

	   Other symbol


	   Line	separator

	   Paragraph separator

	   Space separator

       The special property L& is also supported: it matches a character  that
       has  the	 Lu,  Ll, or Lt	property, in other words, a letter that	is not
       classified as a modifier	or "other".

       The Cs (Surrogate) property applies only	to  characters	in  the	 range
       U+D800  to U+DFFF. Such characters are not valid	in Unicode strings and
       so cannot be tested by PCRE. Perl does not support the Cs property

       The long	synonyms for  property	names  that  Perl  supports  (such  as
       \p{Letter})  are	 not  supported	by PCRE, nor is	it permitted to	prefix
       any of these properties with "Is".

       No character that is in the Unicode table has the Cn (unassigned) prop-
       erty.  Instead, this property is	assumed	for any	code point that	is not
       in the Unicode table.

       Specifying caseless matching does not affect  these  escape  sequences.
       For  example,  \p{Lu}  always  matches only upper case letters. This is
       different from the behaviour of current versions	of Perl.

       Matching	characters by Unicode property is not fast, because  PCRE  has
       to  do  a  multistage table lookup in order to find a character's prop-
       erty. That is why the traditional escape	sequences such as \d and \w do
       not use Unicode properties in PCRE by default, though you can make them
       do so by	setting	the ucp	option or by starting the pattern with (*UCP).

       Extended	grapheme clusters

       The \X escape matches any number	of Unicode  characters	that  form  an
       "extended grapheme cluster", and	treats the sequence as an atomic group
       (see below). Up to and including	release	8.31, PCRE matched an earlier,
       simpler definition that was equivalent to


       That  is,  it matched a character without the "mark" property, followed
       by zero or more characters with the "mark"  property.  Characters  with
       the  "mark"  property are typically non-spacing accents that affect the
       preceding character.

       This simple definition was extended in Unicode to include more  compli-
       cated  kinds of composite character by giving each character a grapheme
       breaking	property, and creating rules  that  use	 these	properties  to
       define  the  boundaries	of  extended grapheme clusters.	In releases of
       PCRE later than 8.31, \X	matches	one of these clusters.

       \X always matches at least one character. Then it  decides  whether  to
       add additional characters according to the following rules for ending a

	   End at the end of the subject string.

	   Do not end between CR and LF; otherwise end after any control char-

	   Do  not  break  Hangul (a Korean script) syllable sequences.	Hangul
	   characters are of five types: L, V, T, LV, and LVT. An L  character
	   may	be followed by an L, V,	LV, or LVT character; an LV or V char-
	   acter may be	followed by a V	or T character;	an LVT or T  character
	   may be follwed only by a T character.

	   Do not end before extending characters or spacing marks. Characters
	   with	the "mark" property always have	the "extend" grapheme breaking

	   Do not end after prepend characters.

	   Otherwise, end the cluster.

       PCRE's additional properties

       As  well	 as the	standard Unicode properties described above, PCRE sup-
       ports four more that make it possible  to  convert  traditional	escape
       sequences  such as \w and \s and	POSIX character	classes	to use Unicode
       properties. PCRE	uses these non-standard,  non-Perl  properties	inter-
       nally  when PCRE_UCP is set. However, they may also be used explicitly.
       These properties	are:

	   Any alphanumeric character

	   Any POSIX space character

	   Any Perl space character

	   Any Perl "word" character

       Xan matches characters that have	either the L (letter) or the  N	 (num-
       ber)  property. Xps matches the characters tab, linefeed, vertical tab,
       form feed, or carriage return, and any other character that has	the  Z
       (separator)  property. Xsp is the same as Xps, except that vertical tab
       is excluded. Xwd	matches	the same characters as Xan, plus underscore.

       There is	another	non-standard property, Xuc, which matches any  charac-
       ter  that  can  be represented by a Universal Character Name in C++ and
       other programming languages. These are the characters $,	 @,  `	(grave
       accent),	 and  all  characters with Unicode code	points greater than or
       equal to	U+00A0,	except for the surrogates U+D800 to U+DFFF. Note  that
       most  base  (ASCII) characters are excluded. (Universal Character Names
       are of the form \uHHHH or \UHHHHHHHH where H is	a  hexadecimal	digit.
       Note that the Xuc property does not match these sequences but the char-
       acters that they	represent.)

       Resetting the match start

       The escape sequence \K causes any previously matched characters not  to
       be included in the final	matched	sequence. For example, the pattern:


       matches	"foobar",  but reports that it has matched "bar". This feature
       is similar to a lookbehind assertion  (described	 below).  However,  in
       this  case, the part of the subject before the real match does not have
       to be of	fixed length, as lookbehind assertions do. The use of \K  does
       not  interfere  with  the  setting of captured substrings. For example,
       when the	pattern


       matches "foobar", the first substring is	still set to "foo".

       Perl documents that the use  of	\K  within  assertions	is  "not  well
       defined".  In  PCRE,  \K	 is  acted upon	when it	occurs inside positive
       assertions, but is ignored in negative assertions.

       Simple assertions

       The final use of	backslash is for certain simple	assertions. An	asser-
       tion  specifies a condition that	has to be met at a particular point in
       a match,	without	consuming any characters from the subject string.  The
       use  of subpatterns for more complicated	assertions is described	below.
       The backslashed assertions are:

	   matches at a	word boundary

	   matches when	not at a word boundary

	   matches at the start	of the subject

	   matches at the end of the subject also matches before a newline  at
	   the end of the subject

	   matches only	at the end of the subject

	   matches at the first	matching position in the subject

       Inside  a  character  class, \b has a different meaning;	it matches the
       backspace character. If any other of  these  assertions	appears	 in  a
       character  class, by default it matches the corresponding literal char-
       acter (for example, \B matches the letter B).

       A word boundary is a position in	the subject string where  the  current
       character  and  the previous character do not both match	\w or \W (i.e.
       one matches \w and the other matches \W), or the	start or  end  of  the
       string  if  the	first or last character	matches	\w, respectively. In a
       UTF mode, the meanings of \w and	\W can be changed by setting  the  ucp
       option.	When this is done, it also affects \b and \B. Neither PCRE nor
       Perl has	a separate "start of word" or "end of word" metasequence. How-
       ever, whatever follows \b normally determines which it is. For example,
       the fragment \ba	matches	"a" at the start of a word.

       The \A, \Z, and \z assertions differ from  the  traditional  circumflex
       and dollar (described in	the next section) in that they only ever match
       at the very start and end of the	subject	string,	whatever  options  are
       set.  Thus,  they are independent of multiline mode. These three	asser-
       tions are not affected by the notbol or noteol  options,	 which	affect
       only  the  behaviour  of	the circumflex and dollar metacharacters. How-
       ever, if	the startoffset	argument of re:run/3 is	 non-zero,  indicating
       that  matching  is  to start at a point other than the beginning	of the
       subject,	\A can never match. The	difference between \Z and \z  is  that
       \Z  matches before a newline at the end of the string as	well as	at the
       very end, whereas \z matches only at the	end.

       The \G assertion	is true	only when the current matching position	is  at
       the  start point	of the match, as specified by the startoffset argument
       of re:run/3. It differs from \A when the	value of startoffset  is  non-
       zero.  By  calling  re:run/3 multiple times with	appropriate arguments,
       you can mimic Perl's /g option, and it is in this kind  of  implementa-
       tion where \G can be useful.

       Note,  however,	that  PCRE's interpretation of \G, as the start	of the
       current match, is subtly	different from Perl's, which defines it	as the
       end  of	the  previous  match. In Perl, these can be different when the
       previously matched string was empty. Because PCRE does just  one	 match
       at a time, it cannot reproduce this behaviour.

       If  all	the alternatives of a pattern begin with \G, the expression is
       anchored	to the starting	match position,	and the	"anchored" flag	is set
       in the compiled regular expression.

       The  circumflex	and  dollar  metacharacters are	zero-width assertions.
       That is,	they test for a	particular condition being true	 without  con-
       suming any characters from the subject string.

       Outside a character class, in the default matching mode,	the circumflex
       character is an assertion that is true only  if	the  current  matching
       point  is  at the start of the subject string. If the startoffset argu-
       ment of re:run/3	is non-zero, circumflex	can never match	if the	multi-
       line  option  is	 unset.	 Inside	 a  character class, circumflex	has an
       entirely	different meaning (see below).

       Circumflex need not be the first	character of the pattern if  a	number
       of  alternatives	are involved, but it should be the first thing in each
       alternative in which it appears if the pattern is ever  to  match  that
       branch.	If all possible	alternatives start with	a circumflex, that is,
       if the pattern is constrained to	match only at the start	 of  the  sub-
       ject,  it  is  said  to be an "anchored"	pattern. (There	are also other
       constructs that can cause a pattern to be anchored.)

       The dollar character is an assertion that is true only if  the  current
       matching	 point	is  at	the  end of the	subject	string,	or immediately
       before a	newline	at the end of the string (by default). Note,  however,
       that  it	 does  not  actually match the newline.	Dollar need not	be the
       last character of the pattern if	a number of alternatives are involved,
       but  it should be the last item in any branch in	which it appears. Dol-
       lar has no special meaning in a character class.

       The meaning of dollar can be changed so that it	matches	 only  at  the
       very end	of the string, by setting the dollar_endonly option at compile
       time. This does not affect the \Z assertion.

       The meanings of the circumflex and dollar characters are	changed	if the
       multiline  option  is  set. When	this is	the case, a circumflex matches
       immediately after internal newlines as well as at the start of the sub-
       ject  string. It	does not match after a newline that ends the string. A
       dollar matches before any newlines in the string, as  well  as  at  the
       very  end, when multiline is set. When newline is specified as the two-
       character sequence CRLF,	isolated CR and	LF characters do not  indicate

       For  example, the pattern /^abc$/ matches the subject string "def\nabc"
       (where \n represents a newline) in multiline mode, but  not  otherwise.
       Consequently,  patterns	that  are anchored in single line mode because
       all branches start with ^ are not anchored in  multiline	 mode,	and  a
       match  for  circumflex  is  possible  when  the startoffset argument of
       re:run/3	is non-zero. The dollar_endonly	option is ignored if multiline
       is set.

       Note  that  the sequences \A, \Z, and \z	can be used to match the start
       and end of the subject in both modes, and if all	branches of a  pattern
       start with \A it	is always anchored, whether or not multiline is	set.

       Outside a character class, a dot	in the pattern matches any one charac-
       ter in the subject string except	(by default) a character  that	signi-
       fies the	end of a line.

       When  a line ending is defined as a single character, dot never matches
       that character; when the	two-character sequence CRLF is used, dot  does
       not  match  CR  if  it  is immediately followed by LF, but otherwise it
       matches all characters (including isolated CRs and LFs).	When any  Uni-
       code  line endings are being recognized,	dot does not match CR or LF or
       any of the other	line ending characters.

       The behaviour of	dot with regard	to newlines can	 be  changed.  If  the
       dotall  option  is set, a dot matches any one character,	without	excep-
       tion. If	the two-character sequence CRLF	 is  present  in  the  subject
       string, it takes	two dots to match it.

       The  handling of	dot is entirely	independent of the handling of circum-
       flex and	dollar,	the only relationship being  that  they	 both  involve
       newlines. Dot has no special meaning in a character class.

       The  escape  sequence  \N  behaves  like	 a  dot, except	that it	is not
       affected	by the PCRE_DOTALL option. In  other  words,  it  matches  any
       character  except  one that signifies the end of	a line.	Perl also uses
       \N to match characters by name; PCRE does not support this.

       Outside a character class, the escape sequence \C matches any one  data
       unit,  whether  or  not	a  UTF mode is set. One	data unit is one byte.
       Unlike a	dot, \C	always matches line-ending characters. The feature  is
       provided	 in Perl in order to match individual bytes in UTF-8 mode, but
       it is unclear how it can	usefully be used. Because \C breaks up charac-
       ters  into  individual  data  units, matching one unit with \C in a UTF
       mode means that the rest	of the string may start	with a	malformed  UTF
       character.  This	has undefined results, because PCRE assumes that it is
       dealing with valid UTF strings.

       PCRE does not allow \C to appear	in  lookbehind	assertions  (described
       below)  in  a UTF mode, because this would make it impossible to	calcu-
       late the	length of the lookbehind.

       In general, the \C escape sequence is best avoided. However, one	way of
       using  it that avoids the problem of malformed UTF characters is	to use
       a lookahead to check the	length of the next character, as in this  pat-
       tern,  which  could be used with	a UTF-8	string (ignore white space and
       line breaks):

	 (?| (?=[\x00-\x7f])(\C) |
	     (?=[\x80-\x{7ff}])(\C)(\C)	|
	     (?=[\x{800}-\x{ffff}])(\C)(\C)(\C)	|

       A group that starts with	(?| resets the capturing  parentheses  numbers
       in  each	 alternative  (see  "Duplicate Subpattern Numbers" below). The
       assertions at the start of each branch check the	next  UTF-8  character
       for  values  whose encoding uses	1, 2, 3, or 4 bytes, respectively. The
       character's individual bytes are	then captured by the appropriate  num-
       ber of groups.

       An opening square bracket introduces a character	class, terminated by a
       closing square bracket. A closing square	bracket	on its own is not spe-
       cial  by	default. However, if the PCRE_JAVASCRIPT_COMPAT	option is set,
       a lone closing square bracket causes a compile-time error. If a closing
       square  bracket	is required as a member	of the class, it should	be the
       first data character in the class  (after  an  initial  circumflex,  if
       present)	or escaped with	a backslash.

       A  character  class matches a single character in the subject. In a UTF
       mode, the character may be more than one	 data  unit  long.  A  matched
       character must be in the	set of characters defined by the class,	unless
       the first character in the class	definition is a	circumflex,  in	 which
       case the	subject	character must not be in the set defined by the	class.
       If a circumflex is actually required as a member	of the	class,	ensure
       it is not the first character, or escape	it with	a backslash.

       For  example, the character class [aeiou] matches any lower case	vowel,
       while [^aeiou] matches any character that is not	a  lower  case	vowel.
       Note that a circumflex is just a	convenient notation for	specifying the
       characters that are in the class	by enumerating those that are  not.  A
       class  that starts with a circumflex is not an assertion; it still con-
       sumes a character from the subject string, and therefore	 it  fails  if
       the current pointer is at the end of the	string.

       In  UTF-8 mode, characters with values greater than 255 (0xffff)	can be
       included	in a class as a	literal	string of data units, or by using  the
       \x{ escaping mechanism.

       When  caseless  matching	 is set, any letters in	a class	represent both
       their upper case	and lower case versions, so for	 example,  a  caseless
       [aeiou]	matches	 "A"  as well as "a", and a caseless [^aeiou] does not
       match "A", whereas a caseful version would. In a	UTF mode, PCRE	always
       understands  the	 concept  of case for characters whose values are less
       than 256, so caseless matching is always	possible. For characters  with
       higher  values,	the  concept  of case is supported if PCRE is compiled
       with Unicode property support, but not otherwise. If you	 want  to  use
       caseless	 matching in a UTF mode	for characters 256 and above, you must
       ensure that PCRE	is compiled with Unicode property support as  well  as
       with UTF	support.

       Characters  that	 might	indicate  line breaks are never	treated	in any
       special way  when  matching  character  classes,	 whatever  line-ending
       sequence	 is  in	 use,  and  whatever  setting  of  the PCRE_DOTALL and
       PCRE_MULTILINE options is used. A class such as [^a] always matches one
       of these	characters.

       The  minus (hyphen) character can be used to specify a range of charac-
       ters in a character  class.  For	 example,  [d-m]  matches  any	letter
       between	d  and	m,  inclusive.	If  a minus character is required in a
       class, it must be escaped with a	backslash  or  appear  in  a  position
       where  it cannot	be interpreted as indicating a range, typically	as the
       first or	last character in the class.

       It is not possible to have the literal character	"]" as the end charac-
       ter  of a range.	A pattern such as [W-]46] is interpreted as a class of
       two characters ("W" and "-") followed by	a literal string "46]",	so  it
       would  match  "W46]"  or	 "-46]". However, if the "]" is	escaped	with a
       backslash it is interpreted as the end of range,	so [W-\]46] is	inter-
       preted  as a class containing a range followed by two other characters.
       The octal or hexadecimal	representation of "]" can also be used to  end
       a range.

       Ranges  operate in the collating	sequence of character values. They can
       also  be	 used  for  characters	specified  numerically,	 for   example
       [\000-\037].  Ranges  can include any characters	that are valid for the
       current mode.

       If a range that includes	letters	is used	when caseless matching is set,
       it matches the letters in either	case. For example, [W-c] is equivalent
       to [][\\^_`wxyzabc], matched caselessly,	and  in	 a  non-UTF  mode,  if
       character  tables  for  a French	locale are in use, [\xc8-\xcb] matches
       accented	E characters in	both cases. In UTF modes,  PCRE	 supports  the
       concept	of  case for characters	with values greater than 255 only when
       it is compiled with Unicode property support.

       The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,  \V,
       \w, and \W may appear in	a character class, and add the characters that
       they match to the class.	For example, [\dABCDEF]	matches	any  hexadeci-
       mal digit. In UTF modes,	the ucp	option affects the meanings of \d, \s,
       \w and their upper case partners, just as it does when they appear out-
       side  a	character class, as described in the section entitled "Generic
       character types"	above. The escape sequence \b has a different  meaning
       inside  a  character  class;  it	 matches  the backspace	character. The
       sequences \B, \N, \R, and \X are	not special inside a character	class.
       Like  any  other	unrecognized escape sequences, they are	treated	as the
       literal characters "B", "N", "R", and "X".

       A circumflex can	conveniently be	used with  the	upper  case  character
       types  to specify a more	restricted set of characters than the matching
       lower case type.	For example, the class [^\W_] matches  any  letter  or
       digit, but not underscore, whereas [\w] includes	underscore. A positive
       character class should be read as "something OR something OR ..." and a
       negative	class as "NOT something	AND NOT	something AND NOT ...".

       The  only  metacharacters  that are recognized in character classes are
       backslash, hyphen (only where it	can be	interpreted  as	 specifying  a
       range),	circumflex  (only  at the start), opening square bracket (only
       when it can be interpreted as introducing a POSIX class name - see  the
       next  section),	and  the  terminating closing square bracket. However,
       escaping	other non-alphanumeric characters does no harm.

       Perl supports the POSIX notation	for character classes. This uses names
       enclosed	 by  [:	and :] within the enclosing square brackets. PCRE also
       supports	this notation. For example,


       matches "0", "1", any alphabetic	character, or "%". The supported class
       names are:

	   letters and digits


	   character codes 0 - 127

	   space or tab	only

	   control characters

	   decimal digits (same	as \d)

	   printing characters,	excluding space

	   lower case letters

	   printing characters,	including space

	   printing characters,	excluding letters and digits and space

	   whitespace (not quite the same as \s)

	   upper case letters

	   "word" characters (same as \w)

	   hexadecimal digits

       The  "space" characters are HT (9), LF (10), VT (11), FF	(12), CR (13),
       and space (32). Notice that this	list includes the VT  character	 (code
       11). This makes "space" different to \s,	which does not include VT (for
       Perl compatibility).

       The name	"word" is a Perl extension, and	"blank"	 is  a	GNU  extension
       from  Perl  5.8.	Another	Perl extension is negation, which is indicated
       by a ^ character	after the colon. For example,


       matches "1", "2", or any	non-digit. PCRE	(and Perl) also	recognize  the
       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
       these are not supported,	and an error is	given if they are encountered.

       By default, in UTF modes, characters with values	greater	 than  255  do
       not  match any of the POSIX character classes. However, if the PCRE_UCP
       option is passed	to pcre_compile() , some of the	classes	are changed so
       that Unicode character properties are used. This	is achieved by replac-
       ing the POSIX classes by	other sequences, as follows:

	   becomes \p{Xan}

	   becomes \p{L}

	   becomes \h

	   becomes \p{Nd}

	   becomes \p{Ll}

	   becomes \p{Xps}

	   becomes \p{Lu}

	   becomes \p{Xwd}

       Negated versions, such as [:^alpha:] use	\P instead of  \p.  The	 other
       POSIX classes are unchanged, and	match only characters with code	points
       less than 256.

       Vertical	bar characters are used	to separate alternative	patterns.  For
       example,	the pattern


       matches	either "gilbert" or "sullivan".	Any number of alternatives may
       appear, and an empty  alternative  is  permitted	 (matching  the	 empty
       string).	The matching process tries each	alternative in turn, from left
       to right, and the first one that	succeeds is used. If the  alternatives
       are  within a subpattern	(defined below), "succeeds" means matching the
       rest of the main	pattern	as well	as the alternative in the subpattern.

       The settings of the caseless, multiline,	dotall,	and  extended  options
       (which are Perl-compatible) can be changed from within the pattern by a
       sequence	of Perl	option letters enclosed	 between  "(?"	and  ")".  The
       option letters are

	   for caseless

	   for multiline

	   for dotall

	   for extended

       For example, (?im) sets caseless, multiline matching. It	is also	possi-
       ble to unset these options by preceding the letter with a hyphen, and a
       combined	 setting  and  unsetting such as (?im-sx), which sets caseless
       and multiline while unsetting dotall and	extended, is  also  permitted.
       If  a  letter  appears  both before and after the hyphen, the option is

       The PCRE-specific options dupnames, ungreedy, and extra can be  changed
       in  the same way	as the Perl-compatible options by using	the characters
       J, U and	X respectively.

       When one	of these option	changes	occurs at  top	level  (that  is,  not
       inside  subpattern parentheses),	the change applies to the remainder of
       the pattern that	follows. If the	change is placed right at the start of
       a pattern, PCRE extracts	it into	the global options.

       An  option  change  within a subpattern (see below for a	description of
       subpatterns) affects only that part of the subpattern that follows  it,


       matches	abc  and  aBc  and  no other strings (assuming caseless	is not
       used). By this means, options can be made to have different settings in
       different  parts	of the pattern.	Any changes made in one	alternative do
       carry on	into subsequent	branches within	the same subpattern. For exam-


       matches	"ab",  "aB",  "c",  and	"C", even though when matching "C" the
       first branch is abandoned before	the option setting.  This  is  because
       the  effects  of	option settings	happen at compile time.	There would be
       some very weird behaviour otherwise.

       Note: There are other PCRE-specific options that	 can  be  set  by  the
       application  when  the  compiling  or matching functions	are called. In
       some cases the pattern can contain special leading  sequences  such  as
       (*CRLF)	to  override  what  the	 application  has set or what has been
       defaulted.  Details  are	 given	in  the	 section   entitled   "Newline
       sequences"  above.  There  are  also  the  (*UTF8)  and	(*UCP) leading
       sequences that can be used to set UTF and Unicode property modes;  they
       are  equivalent	to  setting  the  unicode and the ucp options, respec-
       tively. The (*UTF) sequence is a	generic	version	that can be used  with
       any  of	the  libraries.	However, the application can set the never_utf
       option, which locks out the use of the (*UTF) sequences.

       Subpatterns are delimited by parentheses	(round brackets), which	can be
       nested. Turning part of a pattern into a	subpattern does	two things:

       1. It localizes a set of	alternatives. For example, the pattern


       matches	"cataract",  "caterpillar", or "cat". Without the parentheses,
       it would	match "cataract", "erpillar" or	an empty string.

       2. It sets up the subpattern as	a  capturing  subpattern.  This	 means
       that,  when  the	 complete pattern matches, that	portion	of the subject
       string that matched the subpattern is passed back to the	caller via the
       return value of re:run/3.

       Opening parentheses are counted from left to right (starting from 1) to
       obtain numbers for the capturing	subpatterns.For	example, if the	string
       "the red	king" is matched against the pattern

       the ((red|white)	(king|queen))

       the captured substrings are "red	king", "red", and "king", and are num-
       bered 1,	2, and 3, respectively.

       The fact	that plain parentheses fulfil  two  functions  is  not	always
       helpful.	 There	are often times	when a grouping	subpattern is required
       without a capturing requirement.	If an opening parenthesis is  followed
       by  a question mark and a colon,	the subpattern does not	do any captur-
       ing, and	is not counted when computing the  number  of  any  subsequent
       capturing  subpatterns. For example, if the string "the white queen" is
       matched against the pattern

       the ((?:red|white) (king|queen))

       the captured substrings are "white queen" and "queen", and are numbered
       1 and 2.	The maximum number of capturing	subpatterns is 65535.

       As  a  convenient shorthand, if any option settings are required	at the
       start of	a non-capturing	subpattern,  the  option  letters  may	appear
       between the "?" and the ":". Thus the two patterns

	 * (?i:saturday|sunday)

	 * (?:(?i)saturday|sunday)

       match exactly the same set of strings. Because alternative branches are
       tried from left to right, and options are not reset until  the  end  of
       the  subpattern is reached, an option setting in	one branch does	affect
       subsequent branches, so the above patterns match	"SUNDAY"  as  well  as

       Perl 5.10 introduced a feature whereby each alternative in a subpattern
       uses the	same numbers for its capturing parentheses. Such a  subpattern
       starts  with (?|	and is itself a	non-capturing subpattern. For example,
       consider	this pattern:


       Because the two alternatives are	inside a (?| group, both sets of  cap-
       turing  parentheses  are	 numbered one. Thus, when the pattern matches,
       you can look at captured	substring number  one,	whichever  alternative
       matched.	 This  construct  is useful when you want to capture part, but
       not all,	of one of a number of alternatives. Inside a (?| group,	paren-
       theses  are  numbered as	usual, but the number is reset at the start of
       each branch. The	numbers	of any capturing parentheses that  follow  the
       subpattern  start after the highest number used in any branch. The fol-
       lowing example is taken from the	Perl documentation. The	numbers	under-
       neath show in which buffer the captured content will be stored.

	 # before  ---------------branch-reset----------- after
	 / ( a )  (?| x	( y ) z	| (p (q) r) | (t) u (v)	) ( z )	/x
	 # 1		2	  2  3	      2	    3	  4

       A  back	reference  to a	numbered subpattern uses the most recent value
       that is set for that number by any subpattern.  The  following  pattern
       matches "abcabc"	or "defdef":


       In  contrast,  a	subroutine call	to a numbered subpattern always	refers
       to the first one	in the pattern with the	given  number.	The  following
       pattern matches "abcabc"	or "defabc":


       If  a condition test for	a subpattern's having matched refers to	a non-
       unique number, the test is true if any of the subpatterns of that  num-
       ber have	matched.

       An  alternative approach	to using this "branch reset" feature is	to use
       duplicate named subpatterns, as described in the	next section.

       Identifying capturing parentheses by number is simple, but  it  can  be
       very  hard  to keep track of the	numbers	in complicated regular expres-
       sions. Furthermore, if an  expression  is  modified,  the  numbers  may
       change.	To help	with this difficulty, PCRE supports the	naming of sub-
       patterns. This feature was not added to Perl until release 5.10.	Python
       had  the	 feature earlier, and PCRE introduced it at release 4.0, using
       the Python syntax. PCRE now supports both the Perl and the Python  syn-
       tax.  Perl  allows  identically	numbered subpatterns to	have different
       names, but PCRE does not.

       In PCRE,	a subpattern can be named in one of three  ways:  (?<name>...)
       or  (?'name'...)	 as in Perl, or	(?P<name>...) as in Python. References
       to capturing parentheses	from other parts of the	pattern, such as  back
       references,  recursion,	and conditions,	can be made by name as well as
       by number.

       Names consist of	up to  32  alphanumeric	 characters  and  underscores.
       Named  capturing	 parentheses  are  still  allocated numbers as well as
       names, exactly as if the	names were not present.	The capture specifica-
       tion  to	re:run/3 can use named values if they are present in the regu-
       lar expression.

       By default, a name must be unique within	a pattern, but it is  possible
       to  relax  this	constraint  by	setting	the dupnames option at compile
       time. (Duplicate	names are also always permitted	for  subpatterns  with
       the  same  number, set up as described in the previous section.)	Dupli-
       cate names can be useful	for patterns where only	one  instance  of  the
       named  parentheses  can	match. Suppose you want	to match the name of a
       weekday,	either as a 3-letter abbreviation or as	the full name, and  in
       both cases you want to extract the abbreviation.	This pattern (ignoring
       the line	breaks)	does the job:


       There are five capturing	substrings, but	only one is ever set  after  a
       match.  (An alternative way of solving this problem is to use a "branch
       reset" subpattern, as described in the previous section.)

       In case of capturing named subpatterns which names are not unique,  the
       first  matching	occurrence (counted from left to right in the subject)
       is returned from	re:exec/3, if the name is specified in the values part
       of  the capture statement. The all_names	capturing value	will match all
       of the names in the same	way.

       Warning:	You cannot use different names to distinguish between two sub-
       patterns	 with  the same	number because PCRE uses only the numbers when
       matching. For this reason, an error is given at compile time if differ-
       ent  names  are given to	subpatterns with the same number. However, you
       can give	the same name to subpatterns with the same number,  even  when
       dupnames	is not set.

       Repetition  is  specified  by  quantifiers, which can follow any	of the
       following items:

	 * a literal data character

	 * the dot metacharacter

	 * the \C escape sequence

	 * the \X escape sequence

	 * the \R escape sequence

	 * an escape such as \d	or \pL that matches a single character

	 * a character class

	 * a back reference (see next section)

	 * a parenthesized subpattern (including assertions)

	 * a subroutine	call to	a subpattern (recursive	or otherwise)

       The general repetition quantifier specifies a minimum and maximum  num-
       ber  of	permitted matches, by giving the two numbers in	curly brackets
       (braces), separated by a	comma. The numbers must	be  less  than	65536,
       and the first must be less than or equal	to the second. For example:


       matches	"zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
       special character. If the second	number is omitted, but	the  comma  is
       present,	 there	is  no upper limit; if the second number and the comma
       are both	omitted, the quantifier	specifies an exact number of  required
       matches.	Thus


       matches at least	3 successive vowels, but may match many	more, while


       matches	exactly	 8  digits. An opening curly bracket that appears in a
       position	where a	quantifier is not allowed, or one that does not	 match
       the  syntax of a	quantifier, is taken as	a literal character. For exam-
       ple, {,6} is not	a quantifier, but a literal string of four characters.

       In Unicode mode,	quantifiers apply to characters	rather than  to	 indi-
       vidual  data  units.  Thus, for example,	\x{100}{2} matches two charac-
       ters, each of which is represented by a two-byte	sequence  in  a	 UTF-8
       string.	Similarly, \X{3} matches three Unicode extended	grapheme clus-
       ters, each of which may be several data units long (and they may	be  of
       different lengths).

       The quantifier {0} is permitted,	causing	the expression to behave as if
       the previous item and the quantifier were not present. This may be use-
       ful  for	 subpatterns that are referenced as subroutines	from elsewhere
       in the pattern (but see also the	section	entitled "Defining subpatterns
       for  use	 by  reference only" below). Items other than subpatterns that
       have a {0} quantifier are omitted from the compiled pattern.

       For convenience,	the three most common quantifiers have	single-charac-
       ter abbreviations:

	   is equivalent to {0,}

	   is equivalent to {1,}

	   is equivalent to {0,1}

       It  is  possible	 to construct infinite loops by	following a subpattern
       that can	match no characters with a quantifier that has no upper	limit,
       for example:


       Earlier versions	of Perl	and PCRE used to give an error at compile time
       for such	patterns. However, because there are cases where this  can  be
       useful,	such  patterns	are now	accepted, but if any repetition	of the
       subpattern does in fact match no	characters, the	loop is	forcibly  bro-

       By  default,  the quantifiers are "greedy", that	is, they match as much
       as possible (up to the maximum  number  of  permitted  times),  without
       causing	the  rest of the pattern to fail. The classic example of where
       this gives problems is in trying	to match comments in C programs. These
       appear  between	/*  and	 */ and	within the comment, individual * and /
       characters may appear. An attempt to match C comments by	 applying  the


       to the string

       /* first	comment	*/ not comment /* second comment */

       fails,  because it matches the entire string owing to the greediness of
       the .* item.

       However,	if a quantifier	is followed by a question mark,	it  ceases  to
       be greedy, and instead matches the minimum number of times possible, so
       the pattern


       does the	right thing with the C comments. The meaning  of  the  various
       quantifiers  is	not  otherwise	changed,  just the preferred number of
       matches.	Do not confuse this use	of question mark with  its  use	 as  a
       quantifier  in its own right. Because it	has two	uses, it can sometimes
       appear doubled, as in


       which matches one digit by preference, but can match two	if that	is the
       only way	the rest of the	pattern	matches.

       If  the	ungreedy  option  is  set  (an option that is not available in
       Perl), the quantifiers are not greedy by	default, but  individual  ones
       can  be	made  greedy  by following them	with a question	mark. In other
       words, it inverts the default behaviour.

       When a parenthesized subpattern is quantified  with  a  minimum	repeat
       count  that is greater than 1 or	with a limited maximum,	more memory is
       required	for the	compiled pattern, in proportion	to  the	 size  of  the
       minimum or maximum.

       If  a pattern starts with .* or .{0,} and the dotall option (equivalent
       to Perl's /s) is	set, thus allowing the dot to match newlines, the pat-
       tern  is	 implicitly  anchored,	because	whatever follows will be tried
       against every character position	in the subject string, so there	is  no
       point  in  retrying  the	overall	match at any position after the	first.
       PCRE normally treats such a pattern as though it	were preceded by \A.

       In cases	where it is known that the subject  string  contains  no  new-
       lines, it is worth setting dotall in order to obtain this optimization,
       or alternatively	using ^	to indicate anchoring explicitly.

       However,	there are some cases where the optimization  cannot  be	 used.
       When  .*	is inside capturing parentheses	that are the subject of	a back
       reference elsewhere in the pattern, a match at the start	may fail where
       a later one succeeds. Consider, for example:


       If  the subject is "xyz123abc123" the match point is the	fourth charac-
       ter. For	this reason, such a pattern is not implicitly anchored.

       Another case where implicit anchoring is	not applied is when the	 lead-
       ing  .* is inside an atomic group. Once again, a	match at the start may
       fail where a later one succeeds.	Consider this pattern:


       It matches "ab" in the subject "aab". The use of	the backtracking  con-
       trol verbs (*PRUNE) and (*SKIP) also disable this optimization.

       When a capturing	subpattern is repeated,	the value captured is the sub-
       string that matched the final iteration.	For example, after


       has matched "tweedledum tweedledee" the value of	the captured substring
       is  "tweedledee".  However,  if there are nested	capturing subpatterns,
       the corresponding captured values may have been set in previous	itera-
       tions. For example, after


       matches "aba" the value of the second captured substring	is "b".

       With  both  maximizing ("greedy") and minimizing	("ungreedy" or "lazy")
       repetition, failure of what follows normally causes the	repeated  item
       to  be  re-evaluated to see if a	different number of repeats allows the
       rest of the pattern to match. Sometimes it is useful to	prevent	 this,
       either  to  change the nature of	the match, or to cause it fail earlier
       than it otherwise might,	when the author	of the pattern knows there  is
       no point	in carrying on.

       Consider,  for  example,	the pattern \d+foo when	applied	to the subject


       After matching all 6 digits and then failing to match "foo", the	normal
       action  of  the matcher is to try again with only 5 digits matching the
       \d+ item, and then with	4,  and	 so  on,  before  ultimately  failing.
       "Atomic	grouping"  (a  term taken from Jeffrey Friedl's	book) provides
       the means for specifying	that once a subpattern has matched, it is  not
       to be re-evaluated in this way.

       If  we  use atomic grouping for the previous example, the matcher gives
       up immediately on failing to match "foo"	the first time.	 The  notation
       is a kind of special parenthesis, starting with (?> as in this example:


       This kind of parenthesis	"locks up" the part of the pattern it contains
       once it has matched, and	a failure further into	the  pattern  is  pre-
       vented  from  backtracking  into	 it.  Backtracking past	it to previous
       items, however, works as	normal.

       An alternative description is that a subpattern of  this	 type  matches
       the  string  of	characters  that an identical standalone pattern would
       match, if anchored at the current point in the subject string.

       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
       such as the above example can be	thought	of as a	maximizing repeat that
       must swallow everything it can. So, while both \d+ and  \d+?  are  pre-
       pared  to  adjust  the number of	digits they match in order to make the
       rest of the pattern match, (?>\d+) can only match an entire sequence of

       Atomic  groups in general can of	course contain arbitrarily complicated
       subpatterns, and	can be nested. However,	when  the  subpattern  for  an
       atomic group is just a single repeated item, as in the example above, a
       simpler notation, called	a "possessive quantifier" can  be  used.  This
       consists	 of  an	 additional  + character following a quantifier. Using
       this notation, the previous example can be rewritten as


       Note that a possessive quantifier can be	used with an entire group, for


       Possessive  quantifiers	are always greedy; the setting of the ungreedy
       option is ignored. They are a convenient	notation for the simpler forms
       of  atomic  group.  However, there is no	difference in the meaning of a
       possessive quantifier and the equivalent	atomic group, though there may
       be  a performance difference; possessive	quantifiers should be slightly

       The possessive quantifier syntax	is an extension	to the Perl  5.8  syn-
       tax.  Jeffrey  Friedl  originated  the idea (and	the name) in the first
       edition of his book. Mike McCloskey liked it, so	implemented it when he
       built  Sun's Java package, and PCRE copied it from there. It ultimately
       found its way into Perl at release 5.10.

       PCRE has	an optimization	that automatically "possessifies" certain sim-
       ple  pattern  constructs.  For  example,	the sequence A+B is treated as
       A++B because there is no	point in backtracking into a sequence  of  A's
       when B must follow.

       When  a	pattern	 contains an unlimited repeat inside a subpattern that
       can itself be repeated an unlimited number of  times,  the  use	of  an
       atomic  group  is  the  only way	to avoid some failing matches taking a
       very long time indeed. The pattern


       matches an unlimited number of substrings that either consist  of  non-
       digits,	or  digits  enclosed in	<>, followed by	either ! or ?. When it
       matches,	it runs	quickly. However, if it	is applied to


       it takes	a long time before reporting  failure.	This  is  because  the
       string  can be divided between the internal \D+ repeat and the external
       * repeat	in a large number of ways, and all  have  to  be  tried.  (The
       example	uses  [!?]  rather than	a single character at the end, because
       both PCRE and Perl have an optimization that allows  for	 fast  failure
       when  a single character	is used. They remember the last	single charac-
       ter that	is required for	a match, and fail early	if it is  not  present
       in  the	string.)  If  the pattern is changed so	that it	uses an	atomic
       group, like this:


       sequences of non-digits cannot be broken, and failure happens quickly.

       Outside a character class, a backslash followed by a digit greater than
       0 (and possibly further digits) is a back reference to a	capturing sub-
       pattern earlier (that is, to its	left) in the pattern,  provided	 there
       have been that many previous capturing left parentheses.

       However,	if the decimal number following	the backslash is less than 10,
       it is always taken as a back reference, and causes  an  error  only  if
       there  are  not that many capturing left	parentheses in the entire pat-
       tern. In	other words, the parentheses that are referenced need  not  be
       to  the left of the reference for numbers less than 10. A "forward back
       reference" of this type can make	sense when a  repetition  is  involved
       and  the	 subpattern to the right has participated in an	earlier	itera-

       It is not possible to have a numerical "forward back  reference"	 to  a
       subpattern  whose  number  is  10  or  more using this syntax because a
       sequence	such as	\50 is interpreted as a	character  defined  in	octal.
       See the subsection entitled "Non-printing characters" above for further
       details of the handling of digits following a backslash.	 There	is  no
       such  problem  when named parentheses are used. A back reference	to any
       subpattern is possible using named parentheses (see below).

       Another way of avoiding the ambiguity inherent in  the  use  of	digits
       following  a  backslash	is  to use the \g escape sequence. This	escape
       must be followed	by an unsigned number or a negative number, optionally
       enclosed	in braces. These examples are all identical:

	 * (ring), \1

	 * (ring), \g1

	 * (ring), \g{1}

       An  unsigned number specifies an	absolute reference without the ambigu-
       ity that	is present in the older	syntax.	It is also useful when literal
       digits follow the reference. A negative number is a relative reference.
       Consider	this example:


       The sequence \g{-1} is a	reference to the most recently started captur-
       ing subpattern before \g, that is, is it	equivalent to \2 in this exam-
       ple. Similarly, \g{-2} would be equivalent to \1. The use  of  relative
       references  can	be helpful in long patterns, and also in patterns that
       are created by  joining	together  fragments  that  contain  references
       within themselves.

       A  back	reference matches whatever actually matched the	capturing sub-
       pattern in the current subject string, rather  than  anything  matching
       the subpattern itself (see "Subpatterns as subroutines" below for a way
       of doing	that). So the pattern

       (sens|respons)e and \1ibility

       matches "sense and sensibility" and "response and responsibility",  but
       not  "sense and responsibility".	If caseful matching is in force	at the
       time of the back	reference, the case of letters is relevant. For	 exam-


       matches	"rah  rah"  and	 "RAH RAH", but	not "RAH rah", even though the
       original	capturing subpattern is	matched	caselessly.

       There are several different ways	of writing back	 references  to	 named
       subpatterns.  The  .NET syntax \k{name} and the Perl syntax \k<name> or
       \k'name'	are supported, as is the Python	syntax (?P=name). Perl	5.10's
       unified back reference syntax, in which \g can be used for both numeric
       and named references, is	also supported.	We  could  rewrite  the	 above
       example in any of the following ways:

	 * (?<p1>(?i)rah)\s+\k<p1>

	 * (?'p1'(?i)rah)\s+\k{p1}

	 * (?P<p1>(?i)rah)\s+(?P=p1)

	 * (?<p1>(?i)rah)\s+\g{p1}

       A  subpattern  that  is	referenced  by	name may appear	in the pattern
       before or after the reference.

       There may be more than one back reference to the	same subpattern. If  a
       subpattern  has	not actually been used in a particular match, any back
       references to it	always fail. For example, the pattern


       always fails if it starts to match "a" rather than "bc".	Because	 there
       may  be	many  capturing	parentheses in a pattern, all digits following
       the backslash are taken as part of a potential back  reference  number.
       If the pattern continues	with a digit character,	some delimiter must be
       used to terminate the back reference. If	the extended  option  is  set,
       this  can  be  whitespace.  Otherwise  an empty comment (see "Comments"
       below) can be used.

       Recursive back references

       A back reference	that occurs inside the parentheses to which it	refers
       fails  when  the	subpattern is first used, so, for example, (a\1) never
       matches.	However, such references can be	useful inside repeated subpat-
       terns. For example, the pattern


       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
       ation of	the subpattern,	 the  back  reference  matches	the  character
       string  corresponding  to  the previous iteration. In order for this to
       work, the pattern must be such that the first iteration does  not  need
       to  match the back reference. This can be done using alternation, as in
       the example above, or by	a quantifier with a minimum of zero.

       Back references of this type cause the group that they reference	to  be
       treated	as  an	atomic group. Once the whole group has been matched, a
       subsequent matching failure cannot cause	backtracking into  the	middle
       of the group.

       An  assertion  is  a  test on the characters following or preceding the
       current matching	point that does	not actually consume  any  characters.
       The  simple  assertions	coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are
       described above.

       More complicated	assertions are coded as	 subpatterns.  There  are  two
       kinds:  those  that  look  ahead	of the current position	in the subject
       string, and those that look  behind  it.	 An  assertion	subpattern  is
       matched	in  the	 normal	way, except that it does not cause the current
       matching	position to be changed.

       Assertion subpatterns are not capturing subpatterns. If such an	asser-
       tion  contains  capturing  subpatterns within it, these are counted for
       the purposes of numbering the capturing subpatterns in the  whole  pat-
       tern.  However,	substring  capturing  is carried out only for positive
       assertions. (Perl sometimes, but	not always, does do capturing in nega-
       tive assertions.)

       For  compatibility  with	 Perl,	assertion subpatterns may be repeated;
       though it makes no sense	to assert the same thing  several  times,  the
       side  effect  of	 capturing  parentheses	may occasionally be useful. In
       practice, there only three cases:

	   If the quantifier is	{0}, the  assertion  is	 never	obeyed	during
	   matching.  However, it may contain internal capturing parenthesized
	   groups that are called from elsewhere via the subroutine mechanism.

	   If quantifier is {0,n} where	n is greater than zero,	it is  treated
	   as  if it were {0,1}. At run	time, the rest of the pattern match is
	   tried with and without the assertion, the order  depending  on  the
	   greediness of the quantifier.

	   If  the  minimum repetition is greater than zero, the quantifier is
	   ignored. The	assertion is obeyed just once when encountered	during

       Lookahead assertions

       Lookahead assertions start with (?= for positive	assertions and (?! for
       negative	assertions. For	example,


       matches a word followed by a semicolon, but does	not include the	 semi-
       colon in	the match, and


       matches	any  occurrence	 of  "foo" that	is not followed	by "bar". Note
       that the	apparently similar pattern


       does not	find an	occurrence of "bar"  that  is  preceded	 by  something
       other  than "foo"; it finds any occurrence of "bar" whatsoever, because
       the assertion (?!foo) is	always true when the next three	characters are
       "bar". A	lookbehind assertion is	needed to achieve the other effect.

       If you want to force a matching failure at some point in	a pattern, the
       most convenient way to do it is	with  (?!)  because  an	 empty	string
       always  matches,	so an assertion	that requires there not	to be an empty
       string must always fail.	The backtracking control verb (*FAIL) or  (*F)
       is a synonym for	(?!).

       Lookbehind assertions

       Lookbehind  assertions start with (?<= for positive assertions and (?<!
       for negative assertions.	For example,


       does find an occurrence of "bar"	that is	not  preceded  by  "foo".  The
       contents	 of  a	lookbehind  assertion are restricted such that all the
       strings it matches must have a fixed length. However, if	there are sev-
       eral  top-level	alternatives,  they  do	 not all have to have the same
       fixed length. Thus


       is permitted, but


       causes an error at compile time.	Branches that match  different	length
       strings	are permitted only at the top level of a lookbehind assertion.
       This is an extension compared with Perl,	which requires all branches to
       match the same length of	string.	An assertion such as


       is  not	permitted,  because  its single	top-level branch can match two
       different lengths, but it is acceptable to PCRE if rewritten to use two
       top-level branches:


       In  some	 cases,	the escape sequence \K (see above) can be used instead
       of a lookbehind assertion to get	round the fixed-length restriction.

       The implementation of lookbehind	assertions is, for  each  alternative,
       to  temporarily	move the current position back by the fixed length and
       then try	to match. If there are insufficient characters before the cur-
       rent position, the assertion fails.

       In  a UTF mode, PCRE does not allow the \C escape (which	matches	a sin-
       gle data	unit even in a UTF mode) to appear in  lookbehind  assertions,
       because	it  makes it impossible	to calculate the length	of the lookbe-
       hind. The \X and	\R escapes, which can match different numbers of  data
       units, are also not permitted.

       "Subroutine"  calls  (see below)	such as	(?2) or	(?&X) are permitted in
       lookbehinds, as long as the subpattern matches a	 fixed-length  string.
       Recursion, however, is not supported.

       Possessive  quantifiers	can  be	 used  in  conjunction with lookbehind
       assertions to specify efficient matching	of fixed-length	strings	at the
       end of subject strings. Consider	a simple pattern such as


       when  applied  to  a  long string that does not match. Because matching
       proceeds	from left to right, PCRE will look for each "a"	in the subject
       and  then  see  if what follows matches the rest	of the pattern.	If the
       pattern is specified as


       the initial .* matches the entire string	at first, but when this	 fails
       (because	there is no following "a"), it backtracks to match all but the
       last character, then all	but the	last two characters, and so  on.  Once
       again  the search for "a" covers	the entire string, from	right to left,
       so we are no better off.	However, if the	pattern	is written as


       there can be no backtracking for	the .*+	item; it can  match  only  the
       entire  string.	The subsequent lookbehind assertion does a single test
       on the last four	characters. If it fails, the match fails  immediately.
       For  long  strings, this	approach makes a significant difference	to the
       processing time.

       Using multiple assertions

       Several assertions (of any sort)	may occur in succession. For example,


       matches "foo" preceded by three digits that are not "999". Notice  that
       each  of	 the  assertions is applied independently at the same point in
       the subject string. First there is a  check  that  the  previous	 three
       characters  are	all  digits,  and  then	there is a check that the same
       three characters	are not	"999". This pattern does not match "foo"  pre-
       ceded  by  six  characters,  the	first of which are digits and the last
       three of	which are not "999". For example, it  doesn't  match  "123abc-
       foo". A pattern to do that is


       This  time  the	first assertion	looks at the preceding six characters,
       checking	that the first three are digits, and then the second assertion
       checks that the preceding three characters are not "999".

       Assertions can be nested	in any combination. For	example,


       matches	an occurrence of "baz" that is preceded	by "bar" which in turn
       is not preceded by "foo", while


       is another pattern that matches "foo" preceded by three digits and  any
       three characters	that are not "999".

       It  is possible to cause	the matching process to	obey a subpattern con-
       ditionally or to	choose between two alternative subpatterns,  depending
       on  the result of an assertion, or whether a specific capturing subpat-
       tern has	already	been matched. The two possible	forms  of  conditional
       subpattern are:

	 * (?(condition)yes-pattern)

	 * (?(condition)yes-pattern|no-pattern)

       If  the	condition is satisfied,	the yes-pattern	is used; otherwise the
       no-pattern (if present) is used.	If there are more  than	 two  alterna-
       tives  in  the subpattern, a compile-time error occurs. Each of the two
       alternatives may	itself contain nested subpatterns of any form, includ-
       ing  conditional	 subpatterns;  the  restriction	 to  two  alternatives
       applies only at the level of the	condition. This	pattern	fragment is an
       example where the alternatives are complex:

       (?(1) (A|B|C) | (D | (?(2)E|F) |	E) )

       There  are  four	 kinds of condition: references	to subpatterns,	refer-
       ences to	recursion, a pseudo-condition called DEFINE, and assertions.

       Checking	for a used subpattern by number

       If the text between the parentheses consists of a sequence  of  digits,
       the condition is	true if	a capturing subpattern of that number has pre-
       viously matched.	If there is more than one  capturing  subpattern  with
       the  same  number  (see	the earlier section about duplicate subpattern
       numbers), the condition is true if any of them have matched. An	alter-
       native  notation	is to precede the digits with a	plus or	minus sign. In
       this case, the subpattern number	is relative rather than	absolute.  The
       most  recently opened parentheses can be	referenced by (?(-1), the next
       most recent by (?(-2), and so on. Inside	loops it can also  make	 sense
       to refer	to subsequent groups. The next parentheses to be opened	can be
       referenced as (?(+1), and so on.	(The value zero	in any of these	 forms
       is not used; it provokes	a compile-time error.)

       Consider	 the  following	pattern, which contains	non-significant	white-
       space to	make it	more readable (assume  the  extended  option)  and  to
       divide it into three parts for ease of discussion:

       ( \( )? [^()]+ (?(1) \) )

       The  first  part	 matches  an optional opening parenthesis, and if that
       character is present, sets it as	the first captured substring. The sec-
       ond  part  matches one or more characters that are not parentheses. The
       third part is a conditional subpattern that tests whether  or  not  the
       first  set of parentheses matched or not. If they did, that is, if sub-
       ject started with an opening parenthesis, the condition is true,	and so
       the yes-pattern is executed and a closing parenthesis is	required. Oth-
       erwise, since no-pattern	is not present,	the subpattern	matches	 noth-
       ing.  In	 other words, this pattern matches a sequence of non-parenthe-
       ses, optionally enclosed	in parentheses.

       If you were embedding this pattern in a larger one,  you	 could	use  a
       relative	reference:

       ...other	stuff... ( \( )? [^()]+	(?(-1) \) ) ...

       This  makes  the	 fragment independent of the parentheses in the	larger

       Checking	for a used subpattern by name

       Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test	for  a
       used  subpattern	 by  name.  For	compatibility with earlier versions of
       PCRE, which had this facility before Perl, the syntax  (?(name)...)  is
       also  recognized. However, there	is a possible ambiguity	with this syn-
       tax, because subpattern names may  consist  entirely  of	 digits.  PCRE
       looks  first for	a named	subpattern; if it cannot find one and the name
       consists	entirely of digits, PCRE looks for a subpattern	of  that  num-
       ber,  which must	be greater than	zero. Using subpattern names that con-
       sist entirely of	digits is not recommended.

       Rewriting the above example to use a named subpattern gives this:

       (?<OPEN>	\( )? [^()]+ (?(<OPEN>)	\) )

       If the name used	in a condition of this kind is a duplicate,  the  test
       is  applied to all subpatterns of the same name,	and is true if any one
       of them has matched.

       Checking	for pattern recursion

       If the condition	is the string (R), and there is	no subpattern with the
       name  R,	the condition is true if a recursive call to the whole pattern
       or any subpattern has been made.	If digits or a name preceded by	amper-
       sand follow the letter R, for example:

       (?(R3)...) or (?(R&name)...)

       the condition is	true if	the most recent	recursion is into a subpattern
       whose number or name is given. This condition does not check the	entire
       recursion  stack.  If  the  name	 used in a condition of	this kind is a
       duplicate, the test is applied to all subpatterns of the	same name, and
       is true if any one of them is the most recent recursion.

       At "top level", all these recursion test	conditions are false. The syn-
       tax for recursive patterns is described below.

       Defining	subpatterns for	use by reference only

       If the condition	is the string (DEFINE),	and  there  is	no  subpattern
       with  the  name	DEFINE,	 the  condition	is always false. In this case,
       there may be only one alternative  in  the  subpattern.	It  is	always
       skipped	if  control  reaches  this  point  in the pattern; the idea of
       DEFINE is that it can be	used to	define "subroutines" that can be  ref-
       erenced	from  elsewhere.  (The use of subroutines is described below.)
       For  example,  a	 pattern  to   match   an   IPv4   address   such   as
       ""	could be written like this (ignore whitespace and line

       (?(DEFINE) (?<byte> 2[0-4]\d  |	25[0-5]	 |  1\d\d  |  [1-9]?\d)	 )  \b
       (?&byte)	(\.(?&byte)){3}	\b

       The  first part of the pattern is a DEFINE group	inside which a another
       group named "byte" is defined. This matches an individual component  of
       an  IPv4	 address  (a number less than 256). When matching takes	place,
       this part of the	pattern	is skipped because DEFINE acts	like  a	 false
       condition.  The	rest of	the pattern uses references to the named group
       to match	the four dot-separated components of an	IPv4 address,  insist-
       ing on a	word boundary at each end.

       Assertion conditions

       If  the	condition  is  not  in any of the above	formats, it must be an
       assertion. This may be a	positive or negative lookahead	or  lookbehind
       assertion.  Consider  this  pattern,  again  containing non-significant
       whitespace, and with the	two alternatives on the	second line:

	 \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

       The condition  is  a  positive  lookahead  assertion  that  matches  an
       optional	 sequence of non-letters followed by a letter. In other	words,
       it tests	for the	presence of at least one letter	in the subject.	 If  a
       letter  is found, the subject is	matched	against	the first alternative;
       otherwise it is	matched	 against  the  second.	This  pattern  matches
       strings	in  one	 of the	two forms dd-aaa-dd or dd-dd-dd, where aaa are
       letters and dd are digits.

       There are two ways of including comments	in patterns that are processed
       by PCRE.	In both	cases, the start of the	comment	must not be in a char-
       acter class, nor	in the middle of any other sequence of related charac-
       ters  such  as  (?: or a	subpattern name	or number. The characters that
       make up a comment play no part in the pattern matching.

       The sequence (?#	marks the start	of a comment that continues up to  the
       next  closing parenthesis. Nested parentheses are not permitted.	If the
       PCRE_EXTENDED option is set, an unescaped # character also introduces a
       comment,	 which	in  this  case continues to immediately	after the next
       newline character or character sequence in the pattern.	Which  charac-
       ters are	interpreted as newlines	is controlled by the options passed to
       a compiling function or by a special sequence at	the start of the  pat-
       tern, as	described in the section entitled "Newline conventions"	above.
       Note that the end of this type of comment is a literal newline sequence
       in  the pattern;	escape sequences that happen to	represent a newline do
       not count. For example, consider	this pattern when extended is set, and
       the default newline convention is in force:

       abc #comment \n still comment

       On  encountering	 the # character, pcre_compile()  skips	along, looking
       for a newline in	the pattern. The sequence \n is	still literal at  this
       stage,  so  it does not terminate the comment. Only an actual character
       with the	code value 0x0a	(the default newline) does so.

       Consider	the problem of matching	a string in parentheses, allowing  for
       unlimited  nested  parentheses.	Without	the use	of recursion, the best
       that can	be done	is to use a pattern that  matches  up  to  some	 fixed
       depth  of  nesting.  It	is not possible	to handle an arbitrary nesting

       For some	time, Perl has provided	a facility that	allows regular expres-
       sions  to recurse (amongst other	things). It does this by interpolating
       Perl code in the	expression at run time,	and the	code can refer to  the
       expression itself. A Perl pattern using code interpolation to solve the
       parentheses problem can be created like this:

       $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

       The (?p{...}) item interpolates Perl code at run	time, and in this case
       refers recursively to the pattern in which it appears.

       Obviously, PCRE cannot support the interpolation	of Perl	code. Instead,
       it supports special syntax for recursion	of  the	 entire	 pattern,  and
       also  for  individual  subpattern  recursion. After its introduction in
       PCRE and	Python,	this kind of  recursion	 was  subsequently  introduced
       into Perl at release 5.10.

       A  special  item	 that consists of (? followed by a number greater than
       zero and	a closing parenthesis is a recursive subroutine	 call  of  the
       subpattern  of  the  given  number, provided that it occurs inside that
       subpattern. (If not, it is a non-recursive subroutine  call,  which  is
       described  in  the  next	 section.)  The	special	item (?R) or (?0) is a
       recursive call of the entire regular expression.

       This PCRE pattern solves	the nested  parentheses	 problem  (assume  the
       extended	option is set so that whitespace is ignored):

       \( ( [^()]++ | (?R) )* \)

       First  it matches an opening parenthesis. Then it matches any number of
       substrings which	can either be a	 sequence  of  non-parentheses,	 or  a
       recursive  match	 of the	pattern	itself (that is, a correctly parenthe-
       sized substring). Finally there is a closing parenthesis. Note the  use
       of a possessive quantifier to avoid backtracking	into sequences of non-

       If this were part of a larger pattern, you would	not  want  to  recurse
       the entire pattern, so instead you could	use this:

       ( \( ( [^()]++ |	(?1) )*	\) )

       We  have	 put the pattern into parentheses, and caused the recursion to
       refer to	them instead of	the whole pattern.

       In a larger pattern,  keeping  track  of	 parenthesis  numbers  can  be
       tricky.	This is	made easier by the use of relative references. Instead
       of (?1) in the pattern above you	can write (?-2)	to refer to the	second
       most  recently  opened  parentheses  preceding  the recursion. In other
       words, a	negative number	counts capturing  parentheses  leftwards  from
       the point at which it is	encountered.

       It  is  also  possible  to refer	to subsequently	opened parentheses, by
       writing references such as (?+2). However, these	 cannot	 be  recursive
       because	the  reference	is  not	inside the parentheses that are	refer-
       enced. They are always non-recursive subroutine calls, as described  in
       the next	section.

       An  alternative	approach is to use named parentheses instead. The Perl
       syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also
       supported. We could rewrite the above example as	follows:

       (?<pn> \( ( [^()]++ | (?&pn) )* \) )

       If  there  is more than one subpattern with the same name, the earliest
       one is used.

       This particular example pattern that we have been looking  at  contains
       nested unlimited	repeats, and so	the use	of a possessive	quantifier for
       matching	strings	of non-parentheses is important	when applying the pat-
       tern  to	 strings  that do not match. For example, when this pattern is
       applied to


       it yields "no match" quickly. However, if a  possessive	quantifier  is
       not  used, the match runs for a very long time indeed because there are
       so many different ways the + and	* repeats can carve  up	 the  subject,
       and all have to be tested before	failure	can be reported.

       At  the	end  of	a match, the values of capturing parentheses are those
       from the	outermost level. If the	pattern	above is matched against


       the value for the inner capturing parentheses  (numbered	 2)  is	 "ef",
       which  is the last value	taken on at the	top level. If a	capturing sub-
       pattern is not matched at the top level,	its final  captured  value  is
       unset,  even  if	 it was	(temporarily) set at a deeper level during the
       matching	process.

       Do not confuse the (?R) item with the condition (R),  which  tests  for
       recursion. Consider this	pattern, which matches text in angle brackets,
       allowing	for arbitrary nesting.	Only  digits  are  allowed  in	nested
       brackets	 (that is, when	recursing), whereas any	characters are permit-
       ted at the outer	level.

       < (?: (?(R) \d++	| [^<>]*+) | (?R)) * >

       In this pattern,	(?(R) is the start of a	conditional  subpattern,  with
       two  different  alternatives for	the recursive and non-recursive	cases.
       The (?R)	item is	the actual recursive call.

       Differences in recursion	processing between PCRE	and Perl

       Recursion processing in PCRE differs from Perl in two  important	 ways.
       In  PCRE	(like Python, but unlike Perl),	a recursive subpattern call is
       always treated as an atomic group. That is, once	it has matched some of
       the subject string, it is never re-entered, even	if it contains untried
       alternatives and	there is a subsequent matching failure.	 This  can  be
       illustrated  by the following pattern, which purports to	match a	palin-
       dromic string that contains an odd number of characters	(for  example,
       "a", "aba", "abcba", "abcdcba"):


       The idea	is that	it either matches a single character, or two identical
       characters surrounding a	sub-palindrome.	In Perl, this  pattern	works;
       in  PCRE	 it  does  not if the pattern is longer	than three characters.
       Consider	the subject string "abcba":

       At the top level, the first character is	matched, but as	it is  not  at
       the end of the string, the first	alternative fails; the second alterna-
       tive is taken and the recursion kicks in. The recursive call to subpat-
       tern  1	successfully  matches the next character ("b").	(Note that the
       beginning and end of line tests are not part of the recursion).

       Back at the top level, the next character ("c") is compared  with  what
       subpattern  2 matched, which was	"a". This fails. Because the recursion
       is treated as an	atomic group, there are	now  no	 backtracking  points,
       and  so	the  entire  match fails. (Perl	is able, at this point,	to re-
       enter the recursion and try the second alternative.)  However,  if  the
       pattern is written with the alternatives	in the other order, things are


       This time, the recursing	alternative is tried first, and	 continues  to
       recurse	until  it runs out of characters, at which point the recursion
       fails. But this time we do have	another	 alternative  to  try  at  the
       higher  level.  That  is	 the  big difference: in the previous case the
       remaining alternative is	at a deeper recursion level, which PCRE	cannot

       To  change  the pattern so that it matches all palindromic strings, not
       just those with an odd number of	characters, it is tempting  to	change
       the pattern to this:


       Again,  this  works  in Perl, but not in	PCRE, and for the same reason.
       When a deeper recursion has matched a single character,	it  cannot  be
       entered	again  in  order  to match an empty string. The	solution is to
       separate	the two	cases, and write out the odd and even cases as	alter-
       natives at the higher level:


       If  you	want  to match typical palindromic phrases, the	pattern	has to
       ignore all non-word characters, which can be done like this:


       If run with the caseless	option,	this pattern matches phrases  such  as
       "A  man,	 a  plan, a canal: Panama!" and	it works well in both PCRE and
       Perl. Note the use of the possessive quantifier *+ to avoid  backtrack-
       ing  into  sequences of non-word	characters. Without this, PCRE takes a
       great deal longer (ten times or more) to	 match	typical	 phrases,  and
       Perl takes so long that you think it has	gone into a loop.

       WARNING:	 The  palindrome-matching patterns above work only if the sub-
       ject string does	not start with a palindrome that is shorter  than  the
       entire  string.	For example, although "abcba" is correctly matched, if
       the subject is "ababa", PCRE finds the palindrome "aba" at  the	start,
       then  fails at top level	because	the end	of the string does not follow.
       Once again, it cannot jump back into the	recursion to try other	alter-
       natives,	so the entire match fails.

       The  second  way	 in which PCRE and Perl	differ in their	recursion pro-
       cessing is in the handling of captured values. In Perl, when a  subpat-
       tern  is	 called	recursively or as a subpattern (see the	next section),
       it has no access	to any values that were	captured  outside  the	recur-
       sion,  whereas  in  PCRE	 these values can be referenced. Consider this


       In PCRE,	this pattern matches "bab". The	 first	capturing  parentheses
       match  "b",  then in the	second group, when the back reference \1 fails
       to match	"b", the second	alternative matches "a"	and then recurses.  In
       the  recursion,	\1 does	now match "b" and so the whole match succeeds.
       In Perl,	the pattern fails to match because inside the  recursive  call
       \1 cannot access	the externally set value.

       If  the	syntax for a recursive subpattern call (either by number or by
       name) is	used outside the parentheses to	which it refers,  it  operates
       like  a subroutine in a programming language. The called	subpattern may
       be defined before or after the reference. A numbered reference  can  be
       absolute	or relative, as	in these examples:

	 * (...(absolute)...)...(?2)...

	 * (...(relative)...)...(?-1)...

	 * (...(?+1)...(relative)...

       An earlier example pointed out that the pattern

       (sens|respons)e and \1ibility

       matches	"sense and sensibility"	and "response and responsibility", but
       not "sense and responsibility". If instead the pattern

       (sens|respons)e and (?1)ibility

       is used,	it does	match "sense and responsibility" as well as the	 other
       two  strings.  Another  example	is  given  in the discussion of	DEFINE

       All subroutine calls, whether recursive or not, are always  treated  as
       atomic  groups. That is,	once a subroutine has matched some of the sub-
       ject string, it is never	re-entered, even if it contains	untried	alter-
       natives	and  there  is	a  subsequent  matching	failure. Any capturing
       parentheses that	are set	during the subroutine  call  revert  to	 their
       previous	values afterwards.

       Processing  options  such as case-independence are fixed	when a subpat-
       tern is defined,	so if it is used as a subroutine, such options	cannot
       be changed for different	calls. For example, consider this pattern:


       It  matches  "abcabc". It does not match	"abcABC" because the change of
       processing option does not affect the called subpattern.

       For compatibility with Oniguruma, the non-Perl syntax \g	followed by  a
       name or a number	enclosed either	in angle brackets or single quotes, is
       an alternative syntax for referencing a	subpattern  as	a  subroutine,
       possibly	 recursively. Here are two of the examples used	above, rewrit-
       ten using this syntax:

       (?<pn> \( ( (?>[^()]+) |	\g<pn> )* \) )

       (sens|respons)e and \g'1'ibility

       PCRE supports an	extension to Oniguruma:	if a number is preceded	 by  a
       plus or a minus sign it is taken	as a relative reference. For example:


       Note  that \g{...} (Perl	syntax)	and \g<...> (Oniguruma syntax) are not
       synonymous. The former is a back	reference; the latter is a  subroutine

       Perl  5.10 introduced a number of "Special Backtracking Control Verbs",
       which are still described in the	Perl  documentation  as	 "experimental
       and  subject to change or removal in a future version of	Perl". It goes
       on to say: "Their usage in production code should  be  noted  to	 avoid
       problems	 during	upgrades." The same remarks apply to the PCRE features
       described in this section.

       The new verbs make use of what was previously invalid syntax: an	 open-
       ing parenthesis followed	by an asterisk.	They are generally of the form
       (*VERB) or (*VERB:NAME).	Some may take either form,  possibly  behaving
       differently  depending  on  whether or not a name is present. A name is
       any sequence of characters that does not	include	a closing parenthesis.
       The maximum length of name is 255 in the	8-bit library and 65535	in the
       16-bit and 32-bit libraries. If the name	is  empty,  that  is,  if  the
       closing	parenthesis immediately	follows	the colon, the effect is as if
       the colon were not there. Any number of these verbs may occur in	a pat-

       The  behaviour  of  these  verbs	in repeated groups, assertions,	and in
       subpatterns called as subroutines (whether or not recursively) is docu-
       mented below.

       Optimizations that affect backtracking verbs

       PCRE  contains some optimizations that are used to speed	up matching by
       running some checks at the start	of each	match attempt. For example, it
       may  know  the minimum length of	matching subject, or that a particular
       character must be present. When one of these optimizations bypasses the
       running	of  a  match,  any  included  backtracking  verbs will not, of
       course, be processed. You can suppress the start-of-match optimizations
       by  setting  the	 no_start_optimize option when calling re:compile/2 or
       re:run/3, or by starting	the pattern with (*NO_START_OPT).

       Experiments with	Perl suggest that it too  has  similar	optimizations,
       sometimes leading to anomalous results.

       Verbs that act immediately

       The  following  verbs act as soon as they are encountered. They may not
       be followed by a	name.


       This verb causes	the match to end successfully, skipping	the  remainder
       of  the pattern.	However, when it is inside a subpattern	that is	called
       as a subroutine,	only that subpattern is	ended  successfully.  Matching
       then continues at the outer level. If (*ACCEPT) in triggered in a posi-
       tive assertion, the assertion succeeds; in a  negative  assertion,  the
       assertion fails.

       If  (*ACCEPT)  is inside	capturing parentheses, the data	so far is cap-
       tured. For example:


       This matches "AB", "AAD", or "ACD"; when	it matches "AB", "B"  is  cap-
       tured by	the outer parentheses.

       (*FAIL) or (*F)

       This  verb causes a matching failure, forcing backtracking to occur. It
       is equivalent to	(?!) but easier	to read. The Perl documentation	 notes
       that  it	 is  probably  useful only when	combined with (?{}) or (??{}).
       Those are, of course, Perl features that	are not	present	in  PCRE.  The
       nearest	equivalent is the callout feature, as for example in this pat-


       A match with the	string "aaaa" always fails, but	the callout  is	 taken
       before each backtrack happens (in this example, 10 times).

       Recording which path was	taken

       There  is  one  verb  whose  main  purpose  is to track how a match was
       arrived at, though it also has a	 secondary  use	 in  conjunction  with
       advancing the match starting point (see (*SKIP) below).

       In  Erlang, there is no interface to retrieve a mark with re:run/{2,3],
       so only the secondary purpose is	relevant to the	Erlang programmer!

       The rest	of this	section	is  therefore  deliberately  not  adapted  for
       reading	by  the	 Erlang	programmer, however the	examples might help in
       understanding NAMES as they can be used by (*SKIP).

       (*MARK:NAME) or (*:NAME)

       A name is always	 required  with	 this  verb.  There  may  be  as  many
       instances  of  (*MARK) as you like in a pattern,	and their names	do not
       have to be unique.

       When a match succeeds, the name of the  last-encountered	 (*MARK:NAME),
       (*PRUNE:NAME),  or  (*THEN:NAME)	on the matching	path is	passed back to
       the caller as  described	 in  the  section  entitled  "Extra  data  for
       pcre_exec()"  in	 the  pcreapi  documentation.  Here  is	 an example of
       pcretest	output,	where the /K modifier requests the retrieval and  out-
       putting of (*MARK) data:

	   re> /X(*MARK:A)Y|X(*MARK:B)Z/K
	 data> XY
	  0: XY
	 MK: A
	  0: XZ
	 MK: B

       The (*MARK) name	is tagged with "MK:" in	this output, and in this exam-
       ple it indicates	which of the two alternatives matched. This is a  more
       efficient  way of obtaining this	information than putting each alterna-
       tive in its own capturing parentheses.

       If a verb with a	name is	encountered in a positive  assertion  that  is
       true,  the  name	 is recorded and passed	back if	it is the last-encoun-
       tered. This does	not happen for negative	assertions or failing positive

       After  a	 partial match or a failed match, the last encountered name in
       the entire match	process	is returned. For example:

	   re> /X(*MARK:A)Y|X(*MARK:B)Z/K
	 data> XP
	 No match, mark	= B

       Note that in this unanchored example the	 mark  is  retained  from  the
       match attempt that started at the letter	"X" in the subject. Subsequent
       match attempts starting at "P" and then with an empty string do not get
       as far as the (*MARK) item, but nevertheless do not reset it.

       Verbs that act after backtracking

       The following verbs do nothing when they	are encountered. Matching con-
       tinues with what	follows, but if	there is no subsequent match,  causing
       a  backtrack  to	 the  verb, a failure is forced. That is, backtracking
       cannot pass to the left of the verb. However, when one of  these	 verbs
       appears inside an atomic	group or an assertion that is true, its	effect
       is confined to that group, because once the  group  has	been  matched,
       there  is never any backtracking	into it. In this situation, backtrack-
       ing can "jump back" to the left of the entire atomic  group  or	asser-
       tion.  (Remember	 also,	as  stated  above, that	this localization also
       applies in subroutine calls.)

       These verbs differ in exactly what kind of failure  occurs  when	 back-
       tracking	 reaches  them.	 The behaviour described below is what happens
       when the	verb is	not in a subroutine or an assertion.  Subsequent  sec-
       tions cover these special cases.


       This  verb, which may not be followed by	a name,	causes the whole match
       to fail outright	if there is a later matching failure that causes back-
       tracking	 to  reach  it.	 Even if the pattern is	unanchored, no further
       attempts	to find	a match	by advancing the starting point	take place. If
       (*COMMIT)  is  the  only	backtracking verb that is encountered, once it
       has been	passed re:run/{2,3} is committed to finding  a	match  at  the
       current starting	point, or not at all. For example:


       This  matches  "xxaab" but not "aacaab".	It can be thought of as	a kind
       of dynamic anchor, or "I've started, so I must finish." The name	of the
       most  recently passed (*MARK) in	the path is passed back	when (*COMMIT)
       forces a	match failure.

       If there	is more	than one backtracking verb in a	pattern,  a  different
       one  that  follows  (*COMMIT) may be triggered first, so	merely passing
       (*COMMIT) during	a match	does not always	guarantee that a match must be
       at this starting	point.

       Note  that  (*COMMIT)  at  the start of a pattern is not	the same as an
       anchor, unless PCRE's start-of-match optimizations are turned  off,  as
       shown in	this example:

	 1> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list}]).
	 2> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list},no_start_optimize]).

       PCRE  knows  that  any  match  must start with "a", so the optimization
       skips along the subject to "a" before running the first match  attempt,
       which succeeds. When the	optimization is	disabled by the	no_start_opti-
       mize option, the	match starts at	"x" and	so the (*COMMIT) causes	it  to
       fail without trying any other starting points.

       (*PRUNE)	or (*PRUNE:NAME)

       This  verb causes the match to fail at the current starting position in
       the subject if there is a later matching	failure	that causes backtrack-
       ing  to	reach it. If the pattern is unanchored,	the normal "bumpalong"
       advance to the next starting character then happens.  Backtracking  can
       occur  as  usual	to the left of (*PRUNE), before	it is reached, or when
       matching	to the right of	(*PRUNE), but if there	is  no	match  to  the
       right,  backtracking cannot cross (*PRUNE). In simple cases, the	use of
       (*PRUNE)	is just	an alternative to an atomic group or possessive	 quan-
       tifier, but there are some uses of (*PRUNE) that	cannot be expressed in
       any other way. In an anchored pattern (*PRUNE) has the same  effect  as

       The   behaviour	 of   (*PRUNE:NAME)   is   the	 not   the   same   as
       (*MARK:NAME)(*PRUNE). It	is like	 (*MARK:NAME)  in  that	 the  name  is
       remembered  for	passing	 back  to  the	caller.	 However, (*SKIP:NAME)
       searches	only for names set with	(*MARK).

       The fact	that (*PRUNE:NAME) remembers the name is useless to the	Erlang
       programmer, as names can	not be retrieved.


       This  verb, when	given without a	name, is like (*PRUNE),	except that if
       the pattern is unanchored, the "bumpalong" advance is not to  the  next
       character, but to the position in the subject where (*SKIP) was encoun-
       tered. (*SKIP) signifies	that whatever text was matched leading	up  to
       it cannot be part of a successful match.	Consider:


       If  the	subject	 is  "aaaac...",  after	 the first match attempt fails
       (starting at the	first character	in the	string),  the  starting	 point
       skips on	to start the next attempt at "c". Note that a possessive quan-
       tifer does not have the same effect as this example; although it	 would
       suppress	 backtracking  during  the  first  match  attempt,  the	second
       attempt would start at the second character instead of skipping	on  to


       When (*SKIP) has	an associated name, its	behaviour is modified. When it
       is triggered, the previous path through the pattern is searched for the
       most  recent  (*MARK)  that  has	 the  same  name. If one is found, the
       "bumpalong" advance is to the subject position that corresponds to that
       (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with
       a matching name is found, the (*SKIP) is	ignored.

       Note that (*SKIP:NAME) searches only for	names set by (*MARK:NAME).  It
       ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME).

       (*THEN) or (*THEN:NAME)

       This  verb  causes  a skip to the next innermost	alternative when back-
       tracking	reaches	it. That  is,  it  cancels  any	 further  backtracking
       within  the  current  alternative.  Its name comes from the observation
       that it can be used for a pattern-based if-then-else block:

       ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...

       If the COND1 pattern matches, FOO is tried (and possibly	further	 items
       after  the  end	of the group if	FOO succeeds); on failure, the matcher
       skips to	the second alternative and tries COND2,	 without  backtracking
       into  COND1.  If	that succeeds and BAR fails, COND3 is tried. If	subse-
       quently BAZ fails, there	are no more alternatives, so there is a	 back-
       track  to  whatever  came  before  the  entire group. If	(*THEN)	is not
       inside an alternation, it acts like (*PRUNE).

       The   behaviour	 of   (*THEN:NAME)   is	  the	not   the   same    as
       (*MARK:NAME)(*THEN). It is like (*MARK:NAME) in that the	name is	remem-
       bered for passing back to the caller.  However,	(*SKIP:NAME)  searches
       only for	names set with (*MARK).

       The  fact that (*THEN:NAME) remembers the name is useless to the	Erlang
       programmer, as names can	not be retrieved.

       A subpattern that does not contain a | character	is just	a part of  the
       enclosing  alternative;	it  is	not a nested alternation with only one
       alternative. The	effect of (*THEN) extends beyond such a	subpattern  to
       the  enclosing alternative. Consider this pattern, where	A, B, etc. are
       complex pattern fragments that do not contain any | characters at  this

       A (B(*THEN)C) | D

       If  A and B are matched,	but there is a failure in C, matching does not
       backtrack into A; instead it moves to the next alternative, that	is, D.
       However,	 if the	subpattern containing (*THEN) is given an alternative,
       it behaves differently:

       A (B(*THEN)C | (*FAIL)) | D

       The effect of (*THEN) is	now confined to	the inner subpattern. After  a
       failure in C, matching moves to (*FAIL),	which causes the whole subpat-
       tern to fail because there are no more alternatives  to	try.  In  this
       case, matching does now backtrack into A.

       Note  that  a  conditional  subpattern  is not considered as having two
       alternatives, because only one is ever used.  In	 other	words,	the  |
       character in a conditional subpattern has a different meaning. Ignoring
       white space, consider:

       ^.*? (?(?=a) a |	b(*THEN)c )

       If the subject is "ba", this pattern does not  match.  Because  .*?  is
       ungreedy,  it  initially	 matches  zero characters. The condition (?=a)
       then fails, the character "b" is	matched,  but  "c"  is	not.  At  this
       point,  matching	does not backtrack to .*? as might perhaps be expected
       from the	presence of the	| character.  The  conditional	subpattern  is
       part of the single alternative that comprises the whole pattern,	and so
       the match fails.	(If there was a	backtrack into	.*?,  allowing	it  to
       match "b", the match would succeed.)

       The  verbs just described provide four different	"strengths" of control
       when subsequent matching	fails. (*THEN) is the weakest, carrying	on the
       match  at  the next alternative.	(*PRUNE) comes next, failing the match
       at the current starting position, but allowing an advance to  the  next
       character  (for an unanchored pattern). (*SKIP) is similar, except that
       the advance may be more than one	character. (*COMMIT) is	the strongest,
       causing the entire match	to fail.

       More than one backtracking verb

       If  more	 than  one  backtracking verb is present in a pattern, the one
       that is backtracked onto	first acts. For	example,  consider  this  pat-
       tern, where A, B, etc. are complex pattern fragments:


       If  A matches but B fails, the backtrack	to (*COMMIT) causes the	entire
       match to	fail. However, if A and	B match, but C fails, the backtrack to
       (*THEN)	causes	the next alternative (ABD) to be tried.	This behaviour
       is consistent, but is not always	the same as Perl's. It means  that  if
       two  or	more backtracking verbs	appear in succession, all the the last
       of them has no effect. Consider this example:


       If there	is a matching failure to the right, backtracking onto (*PRUNE)
       cases it	to be triggered, and its action	is taken. There	can never be a
       backtrack onto (*COMMIT).

       Backtracking verbs in repeated groups

       PCRE differs from  Perl	in  its	 handling  of  backtracking  verbs  in
       repeated	groups.	For example, consider:


       If  the	subject	 is  "abac",  Perl matches, but	PCRE fails because the
       (*COMMIT) in the	second repeat of the group acts.

       Backtracking verbs in assertions

       (*FAIL) in an assertion has its normal effect: it forces	 an  immediate

       (*ACCEPT) in a positive assertion causes	the assertion to succeed with-
       out any further processing. In a	negative assertion,  (*ACCEPT)	causes
       the assertion to	fail without any further processing.

       The  other  backtracking	verbs are not treated specially	if they	appear
       in a positive assertion.	In  particular,	 (*THEN)  skips	 to  the  next
       alternative  in	the  innermost	enclosing group	that has alternations,
       whether or not this is within the assertion.

       Negative	assertions are,	however, different, in order  to  ensure  that
       changing	 a  positive  assertion	 into a	negative assertion changes its
       result. Backtracking into (*COMMIT), (*SKIP), or	(*PRUNE) causes	a neg-
       ative assertion to be true, without considering any further alternative
       branches	in the assertion. Backtracking into (*THEN) causes it to  skip
       to  the next enclosing alternative within the assertion (the normal be-
       haviour), but if	the assertion  does  not  have	such  an  alternative,
       (*THEN) behaves like (*PRUNE).

       Backtracking verbs in subroutines

       These  behaviours  occur	whether	or not the subpattern is called	recur-
       sively. Perl's treatment	of subroutines is different in some cases.

       (*FAIL) in a subpattern called as a subroutine has its  normal  effect:
       it forces an immediate backtrack.

       (*ACCEPT)  in a subpattern called as a subroutine causes	the subroutine
       match to	succeed	without	any further processing.	Matching then  contin-
       ues after the subroutine	call.

       (*COMMIT), (*SKIP), and (*PRUNE)	in a subpattern	called as a subroutine
       cause the subroutine match to fail.

       (*THEN) skips to	the next alternative in	the innermost enclosing	 group
       within  the subpattern that has alternatives. If	there is no such group
       within the subpattern, (*THEN) causes the subroutine match to fail.

Ericsson AB			  stdlib 2.4				 re(3)


Want to link to this manual page? Use this URL:

home | help