re(3)			   Erlang Module Definition			 re(3)

       re - Perl-like regular expressions for Erlang

       This  module contains regular expression	matching functions for strings
       and binaries.

       The regular expression syntax and semantics resemble those of Perl.

       The library's matching algorithms are currently based on	the  PCRE  li-
       brary,  but not all of the PCRE library is interfaced and some parts of
       the library go beyond what PCRE offers. The sections of the PCRE	 docu-
       mentation which are relevant to this module are included	here.

       The  Erlang literal syntax for strings uses the "\" (backslash) charac-
       ter as an escape	code.  You  need  to  escape  backslashes  in  literal
       strings,	 both  in your code and	in the shell, with an additional back-
       slash, i.e.: "\\".
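
       As a minimal sketch of the escaping rule above (the subjects and
       patterns are made up for illustration), the following shell
       expressions each assert their expected result by pattern matching:

```erlang
%% "\\." in Erlang source is the two-character regexp \. (a literal dot).
{match, [{0,3}]} = re:run("a.b", "a\\.b").
nomatch = re:run("axb", "a\\.b").
%% "\\d+" in Erlang source is the regexp \d+ (one or more digits).
{match, [{2,3}]} = re:run("id123", "\\d+").
```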

       mp() = {re_pattern, term(), term(), term(), term()}

	      Opaque datatype containing a compiled  regular  expression.  The
	      mp()  is guaranteed to be	a tuple() having the atom 're_pattern'
	      as its first element, to allow for matching in guards. The arity
	      of  the tuple() or the content of	the other fields may change in
	      future releases.
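
	      A sketch of how the guaranteed first element could be used in
	      a guard. IsCompiled is a hypothetical helper, not part of the
	      re module, and it deliberately avoids relying on the tuple's
	      arity:

```erlang
%% Hypothetical helper: true only for terms shaped like a compiled pattern.
IsCompiled = fun(T) when is_tuple(T), element(1, T) =:= re_pattern -> true;
                (_) -> false
             end.
{ok, Compiled} = re:compile("ab+c").
true = IsCompiled(Compiled).
false = IsCompiled("ab+c").
```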

       nl_spec() = cr |	crlf | lf | anycrlf | any

       compile_option()	= unicode
			| anchored
			| caseless
			| dollar_endonly
			| dotall
			| extended
			| firstline
			| multiline
			| no_auto_capture
			| dupnames
			| ungreedy
			| {newline, nl_spec()}
			| bsr_anycrlf
			| bsr_unicode
			| no_start_optimize
			| ucp
			| never_utf

       compile(Regexp) -> {ok, MP} | {error, ErrSpec}


		 Regexp	= iodata()
		 MP = mp()
		 ErrSpec =
		     {ErrString	:: string(), Position :: integer() >= 0}

	      The same as compile(Regexp,[])

       compile(Regexp, Options)	-> {ok,	MP} | {error, ErrSpec}


		 Regexp	= iodata() | unicode:charlist()
		 Options = [Option]
		 Option	= compile_option()
		 MP = mp()
		 ErrSpec =
		     {ErrString	:: string(), Position :: integer() >= 0}

	      This function compiles a regular expression with the syntax  de-
	      scribed  below into an internal format to	be used	later as a pa-
	      rameter to the run/2,3 functions.

	      Compiling	the regular expression before matching	is  useful  if
	      the  same	 expression is to be used in matching against multiple
	      subjects during the program's lifetime. Compiling	once and  exe-
	      cuting many times	is far more efficient than compiling each time
	      one wants	to match.
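
	      The compile-once pattern might look like the following sketch,
	      where the pattern and the subject list are illustrative
	      assumptions:

```erlang
{ok, Digits} = re:compile("^[0-9]+$").
Subjects = ["123", "12a", "", "42"].
%% Reuse the compiled pattern for every subject. Nothing is captured, so
%% re:run/3 returns the bare atom match on success.
Results = [re:run(S, Digits, [{capture, none}]) || S <- Subjects].
[match, nomatch, nomatch, match] = Results.
```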

	      When the unicode option is given, the regular expression should
	      be given as a valid Unicode charlist(), otherwise as any valid
	      iodata().

	      The options have the following meanings:

		unicode:
		  The regular expression is given as a Unicode charlist() and
		  the resulting	regular	expression code	is to be run against a
		  valid	Unicode	charlist() subject. Also consider the ucp  op-
		  tion when using Unicode characters.

		anchored:
		  The pattern is forced to be "anchored", that is, it is
		  constrained to match only at the first matching point in the
		  string  that	is being searched (the "subject	string"). This
		  effect can also be achieved by appropriate constructs	in the
		  pattern itself.

		caseless:
		  Letters in the pattern match both upper and lower case
		  letters. It is equivalent to Perl's /i option, and it can be
		  changed within a pattern by a	(?i) option setting. Uppercase
		  and lowercase	letters	are defined as in the ISO-8859-1 char-
		  acter	set.

		dollar_endonly:
		  A dollar metacharacter in the pattern matches only at the
		  end of the subject string. Without this option, a dollar
		  also matches immediately before a newline at the end of the
		  string (but not before any other newlines). The
		  dollar_endonly option is ignored if multiline is given.
		  There is no equivalent option in Perl, and no way to set it
		  within a pattern.

		dotall:
		  A dot in the pattern matches all characters, including those
		  that indicate	newline. Without it, a dot does	not match when
		  the current position is at a newline.	This option is equiva-
		  lent to Perl's /s option, and	it can	be  changed  within  a
		  pattern  by  a (?s) option setting. A	negative class such as
		  [^a] always matches newline characters, independent of  this
		  option's setting.

		extended:
		  Whitespace data characters in the pattern are ignored except
		  when escaped or inside a character  class.  Whitespace  does
		  not  include the VT character	(ASCII 11). In addition, char-
		  acters between an unescaped #	outside	a character class  and
		  the  next  newline,  inclusive,  are	also  ignored. This is
		  equivalent to	Perl's /x option, and it can be	changed	within
		  a  pattern  by  a  (?x) option setting. This option makes it
		  possible to include comments	inside	complicated  patterns.
		  Note,	 however,  that	 this applies only to data characters.
		  Whitespace characters	may never appear within	special	 char-
		  acter	 sequences  in	a  pattern, for	example	within the se-
		  quence (?( which introduces a	conditional subpattern.

		firstline:
		  An unanchored pattern is required to match before or at the
		  first	newline	in the subject string, though the matched text
		  may continue over the	newline.

		multiline:
		  By default, PCRE treats the subject string as consisting of
		  a  single  line  of characters (even if it actually contains
		  newlines). The "start	of  line"  metacharacter  (^)  matches
		  only	at  the	 start	of the string, while the "end of line"
		  metacharacter	($) matches only at the	end of the string,  or
		  before  a  terminating  newline  (unless  dollar_endonly  is
		  given). This is the same as Perl.

		  When multiline is given, the "start of  line"	 and  "end  of
		  line"	 constructs match immediately following	or immediately
		  before internal newlines  in	the  subject  string,  respec-
		  tively, as well as at	the very start and end.	This is	equiv-
		  alent	to Perl's /m option, and it can	be  changed  within  a
		  pattern  by  a (?m) option setting. If there are no newlines
		  in a subject string, or no occurrences of ^ or $ in  a  pat-
		  tern,	setting	multiline has no effect.

		no_auto_capture:
		  Disables the use of numbered capturing parentheses in the
		  pattern. Any opening parenthesis that is not followed by ?
		  behaves as if it were followed by ?: but named parentheses
		  can still be used for capturing (and they acquire numbers in
		  the usual way). There is no equivalent of this option in
		  Perl.

		dupnames:
		  Names used to identify capturing subpatterns need not be
		  unique. This can be helpful for certain types of pattern
		  when it is known that only one instance of the named
		  subpattern can ever be matched. There are more details of
		  named subpatterns below.

		ungreedy:
		  This option inverts the "greediness" of the quantifiers so
		  that	they  are  not greedy by default, but become greedy if
		  followed by "?". It is not compatible	with Perl. It can also
		  be set by a (?U) option setting within the pattern.

		{newline, NLSpec}:
		  Override  the	default	definition of a	newline	in the subject
		  string, which	is LF (ASCII 10) in Erlang.

		  cr:
		    Newline is indicated by a single character CR (ASCII 13).

		  lf:
		    Newline is indicated by a single character LF (ASCII 10),
		    the default.

		  crlf:
		    Newline is indicated by the two-character CRLF (ASCII 13
		    followed by ASCII 10) sequence.

		  anycrlf:
		    Any of the three preceding sequences should be recognized.

		  any:
		    Any of the newline sequences above, plus the Unicode
		    sequences VT (vertical tab, U+000B), FF (formfeed, U+000C),
		    NEL (next line, U+0085), LS (line separator, U+2028), and
		    PS (paragraph separator, U+2029).
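
		  A small sketch of how the newline definition changes where
		  "$" can match in multiline mode. The subject is an
		  illustrative assumption:

```erlang
Crlf = "a\r\nb".
%% With the default LF newline, "a" is followed by CR, not by a newline.
nomatch = re:run(Crlf, "a$", [multiline]).
%% With CRLF as the newline sequence, "$" matches right after "a".
{match, [{0,1}]} = re:run(Crlf, "a$", [multiline, {newline, crlf}]).
%% anycrlf recognizes CR, LF and CRLF alike.
{match, [{0,1}]} = re:run(Crlf, "a$", [multiline, {newline, anycrlf}]).
```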

		bsr_anycrlf:
		  Specifies that \R is to match only the CR, LF, or CRLF
		  sequences, not the Unicode-specific newline characters.

		bsr_unicode:
		  Specifies that \R is to match all the Unicode newline
		  characters (including CRLF and so on). This is the default.

		no_start_optimize:
		  This option disables optimization that may malfunction if
		  "Special start-of-pattern items" are present in the regular
		  expression. A typical example would be when matching
		  "DEFABC" against "(*COMMIT)ABC", where the start optimization
		  of PCRE would skip the subject up to the "A" and would never
		  realize that the (*COMMIT) instruction should have made the
		  matching fail. This option is only relevant if you use
		  "start-of-pattern items", as discussed in the section "PCRE
		  regular expression details" below.

		ucp:
		  Specifies that Unicode Character Properties should be used
		  when	resolving  \B,	\b, \D,	\d, \S,	\s, \W and \w. Without
		  this flag, only ISO-Latin-1 properties are used. Using  Uni-
		  code	properties hurts performance, but is semantically cor-
		  rect when working with Unicode characters  beyond  the  ISO-
		  Latin-1 range.

		never_utf:
		  Specifies that the (*UTF) and/or (*UTF8) "start-of-pattern
		  items" are forbidden. This flag cannot be combined with
		  unicode.  Useful  if	ISO-Latin-1  patterns from an external
		  source are to	be compiled.
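
	      The effect of a few of these options can be sketched as
	      follows. The subjects and patterns are illustrative
	      assumptions, and the options are passed to re:run/3, which
	      compiles the pattern on the fly:

```erlang
%% caseless: letters match regardless of case.
{match, [{0,3}]} = re:run("ABC", "abc", [caseless]).
nomatch = re:run("ABC", "abc").
%% multiline: ^ and $ also match at internal newlines.
{match, [{2,1}]} = re:run("a\nb", "^b$", [multiline]).
nomatch = re:run("a\nb", "^b$").
%% dotall: a dot also matches newline characters.
{match, [{0,3}]} = re:run("a\nb", "a.b", [dotall]).
nomatch = re:run("a\nb", "a.b").
%% ungreedy: quantifiers are lazy unless followed by "?".
{match, [{0,1}]} = re:run("aaa", "a+", [ungreedy]).
{match, [{0,3}]} = re:run("aaa", "a+").
```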

       inspect(MP, Item) -> {namelist, [binary()]}


		 MP = mp()
		 Item =	namelist

	      This function takes a compiled regular expression	and  an	 item,
	      returning	 the  relevant	data from the regular expression. Cur-
	      rently the only supported	item is	namelist,  which  returns  the
	      tuple  {namelist,	 [  binary()]},	 containing  the  names	of all
	      (unique) named subpatterns in the	regular	expression.


	      1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
	      2> re:inspect(MP,namelist).
	      {namelist,[<<"A">>,<<"B">>,<<"C">>]}
	      3> {ok,MPD} = re:compile("(?<C>A)|(?<B>B)|(?<C>C)",[dupnames]).
	      4> re:inspect(MPD,namelist).
	      {namelist,[<<"B">>,<<"C">>]}

	      Note specifically	in the second example that the duplicate  name
	      only  occurs  once in the	returned list, and that	the list is in
	      alphabetical order regardless of where the names are  positioned
	      in the regular expression. The order of the names	is the same as
	      the order	of captured subexpressions if {capture,	all_names}  is
	      given as an option to re:run/3. You can therefore	create a name-
	      to-value mapping from the	result of re:run/3 like	this:

	      1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
	      2> {namelist, N} = re:inspect(MP,namelist).
	      3> {match,L} = re:run("AA",MP,[{capture,all_names,binary}]).
	      {match,[<<"A">>,<<>>,<<>>]}
	      4> NameMap = lists:zip(N,L).
	      [{<<"A">>,<<"A">>},{<<"B">>,<<>>},{<<"C">>,<<>>}]

	      More items are expected to be added in the future.

       run(Subject, RE)	-> {match, Captured} | nomatch


		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata()
		 Captured = [CaptureData]
		 CaptureData = {integer(), integer()}

	      The same as run(Subject,RE,[]).

       run(Subject, RE,	Options) ->
	      {match, Captured}	| match	| nomatch | {error, ErrType}


		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata() |	unicode:charlist()
		 Options = [Option]
		 Option	= anchored
			| global
			| notbol
			| noteol
			| notempty
			| notempty_atstart
			| report_errors
			| {offset, integer() >=	0}
			| {match_limit,	integer() >= 0}
			| {match_limit_recursion, integer() >= 0}
			| {newline, NLSpec :: nl_spec()}
			| bsr_anycrlf
			| bsr_unicode
			| {capture, ValueSpec}
			| {capture, ValueSpec, Type}
			| CompileOpt
		 Type =	index |	list | binary
		 ValueSpec = all
			   | all_but_first
			   | all_names
			   | first
			   | none
			   | ValueList
		 ValueList = [ValueID]
		 ValueID = integer() | string()	| atom()
		 CompileOpt = compile_option()
		   See compile/2 above.
		 Captured = [CaptureData] | [[CaptureData]]
		 CaptureData = {integer(), integer()}
			     | ListConversionData
			     | binary()
		 ListConversionData = string()
				    | {error, string(),	binary()}
				    | {incomplete, string(), binary()}
		 ErrType = match_limit
			 | match_limit_recursion
			 | {compile, CompileErr}
		 CompileErr =
		     {ErrString	:: string(), Position :: integer() >= 0}

	      Executes a regular expression match, returning match/{match,
	      Captured} or nomatch. The regular expression can be given either
	      as iodata(), in which case it is automatically compiled (as by
	      re:compile/2) and executed, or as a pre-compiled mp(), in which
	      case it is executed against the subject directly.

	      When compilation is involved, the	exception badarg is thrown  if
	      a	compilation error occurs. Call re:compile/2 to get information
	      about the	location of the	error in the regular expression.

	      If the regular expression is previously compiled, the option
	      list can only contain the options anchored, global, notbol,
	      noteol, report_errors, notempty, notempty_atstart, {offset,
	      integer() >= 0}, {match_limit, integer() >= 0},
	      {match_limit_recursion, integer() >= 0}, {newline, NLSpec} and
	      {capture, ValueSpec}/{capture, ValueSpec, Type}. Otherwise all
	      options valid for the re:compile/2 function are allowed as
	      well. Options allowed both for compilation and execution of a
	      match, namely anchored and {newline, NLSpec}, will affect both
	      the compilation and execution if present together with a non
	      pre-compiled regular expression.

	      If the regular expression was previously compiled with the
	      option unicode, the Subject should be provided as a valid
	      Unicode charlist(), otherwise any iodata() will do. If
	      compilation is involved and the option unicode is given, both
	      the Subject and the regular expression should be given as valid
	      Unicode charlist()s.

	      The {capture, ValueSpec}/{capture, ValueSpec, Type} defines what
	      to return	from the function upon successful matching.  The  cap-
	      ture  tuple may contain both a value specification telling which
	      of the captured substrings are to	be returned, and a type	speci-
	      fication,	telling	how captured substrings	are to be returned (as
	      index tuples, lists or binaries).	The capture option  makes  the
	      function	quite flexible and powerful. The different options are
	      described	in detail below.

	      If the capture options describe that no substring	 capturing  at
	      all  is  to  be done ({capture, none}), the function will	return
	      the single atom match upon successful  matching,	otherwise  the
	      tuple {match, ValueList} is returned. Disabling capturing	can be
	      done either by specifying	none or	an empty list as ValueSpec.

	      The report_errors	option adds the	possibility that an error  tu-
	      ple is returned. The tuple will either indicate a	matching error
	      (match_limit or match_limit_recursion) or	a  compilation	error,
	      where  the  error	 tuple	has  the format	{error,	{compile, Com-
	      pileErr}}. Note that if the option report_errors is  not	given,
	      the  function never returns error	tuples,	but will report	compi-
	      lation errors as a badarg	exception and failed  matches  due  to
	      exceeded match limits simply as nomatch.
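
	      The two error-handling styles can be sketched as below. The
	      pattern "(" is an assumed example of an invalid expression, and
	      the exact error string is left unmatched because it depends on
	      the underlying library:

```erlang
%% "(" is not a valid pattern (unbalanced parenthesis).
{error, {compile, {ErrString, Position}}} =
    re:run("abc", "(", [report_errors]).
true = is_list(ErrString) andalso is_integer(Position).
%% Without report_errors the same call raises badarg.
Outcome = try re:run("abc", "(") of
              _ -> no_exception
          catch
              error:badarg -> badarg_raised
          end.
badarg_raised = Outcome.
```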

	      The options relevant for execution are:

		anchored:
		  Limits re:run/3 to matching at the first matching position.
		  If a pattern was compiled with anchored, or turned out to be
		  anchored by virtue of its contents, it cannot be made
		  unanchored at matching time, hence there is no unanchored
		  option.

		global:
		  Implements global (repetitive) search (the g flag in Perl).
		  Each match is	returned as a separate list()  containing  the
		  specific match as well as any	matching subexpressions	(or as
		  specified by the capture option). The	Captured part  of  the
		  return value will hence be a list() of list()s when this op-
		  tion is given.
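
		  A short sketch of the list-of-lists shape, using made-up
		  subjects and patterns:

```erlang
%% Two matches, each reported as its own inner list.
{match, [[{0,2}], [{2,2}]]} = re:run("abab", "ab", [global]).
%% With a capturing group, each inner list holds the whole match first,
%% then the group.
{match, [[{0,2},{1,1}], [{2,2},{3,1}]]} = re:run("abab", "a(b)", [global]).
%% The capture option is applied to every match separately.
{match, [["ab"], ["ab"]]} =
    re:run("abab", "ab", [global, {capture, first, list}]).
```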

		  The interaction of the global	option with a regular  expres-
		  sion	which  matches	an  empty string surprises some	users.
		  When the global option  is  given,  re:run/3	handles	 empty
		  matches  in the same way as Perl: a zero-length match	at any
		  point	 will  be  retried   with   the	  options   [anchored,
		  notempty_atstart]  as	well. If that search gives a result of
		  length > 0, the result is included. For example:

		  re:run("cat","(|at)",[global]).

		  The following	matching will be performed:

		  At offset 0:
		    The	regexp (|at) will first	match at the initial  position
		    of	the  string  cat,  giving the result set [{0,0},{0,0}]
		    (the second	{0,0} is due to	the  subexpression  marked  by
		    the	 parentheses).	As  the	 length	 of the	match is 0, we
		    don't advance to the next position yet.

		  At offset 0 with [anchored, notempty_atstart]:
		     The  search  is  retried  with  the  options   [anchored,
		    notempty_atstart]  at  the	same  position,	which does not
		    give any interesting  result  of  longer  length,  so  the
		    search position is now advanced to the next	character (a).

		  At offset 1:
		    This  time,	 the  search results in	[{1,0},{1,0}], so this
		    search will	also be	repeated with the extra	options.

		  At offset 1 with [anchored, notempty_atstart]:
		    Now the at alternative is found and the result will be
		    [{1,2},{1,2}]. The result is added to the list of results
		    and the position in the search string is advanced two
		    steps.

		  At offset 3:
		    The search now once again matches the empty string, giving
		    [{3,0},{3,0}].

		  At offset 3 with [anchored, notempty_atstart]:
		    This will give no result of length > 0 and we are at the
		    last position, so the global search is complete.

		  The result of the call is:

		  {match,[[{0,0},{0,0}],[{1,0},{1,0}],[{1,2},{1,2}],[{3,0},{3,0}]]}

		notempty:
		  An empty string is not considered to be a valid match if
		  this option is given. If there are alternatives in the
		  pattern, they are tried. If all the alternatives match the
		  empty string, the entire match fails. For example, if the
		  pattern

		  a?b?

		  is  applied  to  a  string not beginning with	"a" or "b", it
		  would	normally match the empty string	at the	start  of  the
		  subject.  With the notempty option, this match is not	valid,
		  so re:run/3 searches further into the	string for occurrences
		  of "a" or "b".

		notempty_atstart:
		  This is like notempty, except that an empty string match
		  that is not at the start of the subject is permitted.	If the
		  pattern is anchored, such a match can	occur only if the pat-
		  tern contains	\K.

		  Perl has no direct equivalent	of  notempty  or  notempty_at-
		  start, but it	does make a special case of a pattern match of
		  the empty string within its split() function,	and when using
		  the  /g  modifier. It	is possible to emulate Perl's behavior
		  after	matching a null	string by first	trying the match again
		  at  the  same	offset with notempty_atstart and anchored, and
		  then,	if that	fails, by advancing the	starting  offset  (see
		  below) and trying an ordinary	match again.
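
		  The difference between the two options can be sketched with
		  a pattern that is able to match the empty string. The
		  subjects are illustrative assumptions:

```erlang
%% By default the empty match at offset 0 wins.
{match, [{0,0}]} = re:run("xab", "a?b?").
%% notempty rejects empty matches, so the search moves on to "ab".
{match, [{1,2}]} = re:run("xab", "a?b?", [notempty]).
%% With the subject "x", only empty matches are possible.
nomatch = re:run("x", "a?b?", [notempty]).
%% notempty_atstart rejects only the empty match at the start of the
%% subject; the one at offset 1 is allowed.
{match, [{1,0}]} = re:run("x", "a?b?", [notempty_atstart]).
```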

		notbol:
		  This option specifies that the first character of the
		  subject string is not the beginning of a line, so the
		  circumflex metacharacter should not match before it. Setting
		  this without multiline (at compile time) causes circumflex
		  never to match. This option only affects the behavior of the
		  circumflex metacharacter. It does not affect \A.

		noteol:
		  This option specifies that the end of the subject string is
		  not  the  end	 of a line, so the dollar metacharacter	should
		  not match it nor (except in multiline	mode) a	newline	 imme-
		  diately  before  it. Setting this without multiline (at com-
		  pile time) causes dollar never to match. This	option affects
		  only	the  behavior of the dollar metacharacter. It does not
		  affect \Z or \z.

		report_errors:
		  This option gives better control of the error handling in
		  re:run/3. When it is given, compilation errors (if the regu-
		  lar expression isn't already compiled) as well  as  run-time
		  errors are explicitly	returned as an error tuple.

		  The possible run-time	errors are:

		  match_limit:
		    The PCRE library sets a limit on how many times the
		    internal match function can be called. The default value
		    for this is 10000000 in the library compiled for Erlang.
		    If {error, match_limit} is returned, it means that the
		    execution of the regular expression has reached this
		    limit. Normally this is to be regarded as a nomatch, which
		    is the default return value when this happens, but by
		    specifying report_errors, you will get informed when the
		    match fails due to too many internal calls.

		  match_limit_recursion:
		    This error is very similar to match_limit, but occurs when
		    the internal match function of PCRE is "recursively"
		    called more times than the "match_limit_recursion" limit,
		    which is by default 10000000 as well. Note that as long as
		    the match_limit and match_limit_recursion values are kept
		    at the default values, the match_limit_recursion error
		    cannot occur, as the match_limit error will occur before
		    that (each recursive call is also a call, but not vice
		    versa). Both limits can however be changed, either by
		    setting limits directly in the regular expression string
		    (see reference section below) or by giving options to
		    re:run/3.

		  It is important to understand that what is referred to as
		  "recursion" when limiting matches is not actually recursion
		  on the C stack of the Erlang machine, nor is it recursion
		  on the Erlang process stack. The version of PCRE compiled
		  into the Erlang VM uses machine "heap" memory to store
		  values that need to be kept over recursion in regular
		  expression matches.

		{match_limit, integer() >= 0}:
		  This	option	limits the execution time of a match in	an im-
		  plementation-specific	way. It	is described in	the  following
		  way by the PCRE documentation:

		The match_limit	field provides a means of preventing PCRE from using
		up a vast amount of resources when running patterns that are not going
		to match, but which have a very	large number of	possibilities in their
		search trees. The classic example is a pattern that uses nested
		unlimited repeats.

		Internally, pcre_exec()	uses a function	called match(),	which it calls
		repeatedly (sometimes recursively). The	limit set by match_limit is
		imposed	on the number of times this function is	called during a	match,
		which has the effect of	limiting the amount of backtracking that can
		take place. For	patterns that are not anchored,	the count restarts
		from zero for each position in the subject string.

		  This means that runaway regular expression matches can fail
		  faster if the limit is lowered using this option. The
		  default value compiled into the Erlang virtual machine is
		  10000000.

		This option in no way affects the execution of the Erlang
		virtual machine in terms of "long running BIFs". re:run
		always gives control back to the scheduler of Erlang processes
		at intervals that ensure the real time properties of the
		Erlang system.
		{match_limit_recursion, integer() >= 0}:
		  This option limits the execution time	and memory consumption
		  of  a	 match in an implementation-specific way, very similar
		  to match_limit. It is	described in the following way by  the
		  PCRE documentation:

		The match_limit_recursion field	is similar to match_limit, but instead
		of limiting the	total number of	times that match() is called, it
		limits the depth of recursion. The recursion depth is a	smaller	number
		than the total number of calls,	because	not all	calls to match() are
		recursive. This limit is of use only if it is set smaller than
		match_limit.

		Limiting the recursion depth limits the	amount of machine stack	that
		can be used, or, when PCRE has been compiled to	use memory on the heap
		instead of the stack, the amount of heap memory that can be
		used.

		  The Erlang virtual machine uses a PCRE library where heap
		  memory is used when regular expression match recursion
		  happens, which is why this limits the usage of machine heap,
		  not C stack.

		  Specifying a lower value may result in matches with deep re-
		  cursion failing, when	they should actually have matched:

		1> re:run("aaaaaaaaaaaaaz","(a+)*z").
		2> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5}]).
		3> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5},report_errors]).

		  This option, as well as the match_limit option, should only
		  be used in very rare cases. Understanding of the PCRE
		  library internals is recommended before tampering with these
		  limits.

		{offset, integer() >= 0}:
		  Start	matching at the	offset (position) given	in the subject
		  string.  The	offset	is  zero-based,	so that	the default is
		  {offset,0} (all of the subject string).
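
		  A minimal sketch of the offset option, with a made-up
		  subject. Note that the returned indexes still refer to the
		  whole subject, not to the part after the offset:

```erlang
{match, [{0,3}]} = re:run("abcabc", "abc").
%% Starting at offset 1 skips the first occurrence.
{match, [{3,3}]} = re:run("abcabc", "abc", [{offset, 1}]).
%% No occurrence starts at or after offset 4.
nomatch = re:run("abcabc", "abc", [{offset, 4}]).
```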

		{newline, NLSpec}:
		  Override the default definition of a newline in the  subject
		  string, which	is LF (ASCII 10) in Erlang.

		  cr:
		    Newline is indicated by a single character CR (ASCII 13).

		  lf:
		    Newline is indicated by a single character LF (ASCII 10),
		    the default.

		  crlf:
		    Newline is indicated by the two-character CRLF (ASCII 13
		    followed by ASCII 10) sequence.

		  anycrlf:
		    Any of the three preceding sequences should be recognized.

		  any:
		    Any of the newline sequences above, plus the Unicode
		    sequences VT (vertical tab, U+000B), FF (formfeed, U+000C),
		    NEL (next line, U+0085), LS (line separator, U+2028), and
		    PS (paragraph separator, U+2029).

		bsr_anycrlf:
		  Specifies that \R is to match only the CR, LF, or CRLF
		  sequences, not the Unicode-specific newline characters
		  (overrides the compilation option).

		bsr_unicode:
		  Specifies that \R is to match all the Unicode newline
		  characters (including CRLF and so on, the default);
		  overrides the compilation option.

		{capture, ValueSpec}/{capture, ValueSpec, Type}:
		  Specifies which captured substrings are returned and in what
		  format.  By  default,	 re:run/3 captures all of the matching
		  part of the substring	as well	as all	capturing  subpatterns
		  (all	of the pattern is automatically	captured). The default
		  return type is (zero-based) indexes of the captured parts of
		  the  string,	given as {Offset,Length} pairs (the index Type
		  of capturing).

		  As an example of the default behavior, the following call:

		  re:run("ABCabcdABC","abcd",[]).

		  returns, as first and only captured string, the matching
		  part of the subject ("abcd" in the middle) as an index pair
		  {3,4}, where character positions are zero-based, just as in
		  offsets. The return value of the call above would then be:

		  {match,[{3,4}]}

		  Another (and quite common) case is where the regular
		  expression matches all of the subject, as in:

		  re:run("ABCabcdABC",".*abcd.*",[]).

		  where the return value correspondingly will point out all of
		  the string, beginning at index 0 and being 10 characters
		  long:

		  {match,[{0,10}]}

		  If the regular expression  contains  capturing  subpatterns,
		  like in the following case:

		  re:run("ABCabcdABC",".*(abcd).*",[]).

		  all of the matched subject is captured, as well as the
		  captured substrings:

		  {match,[{0,10},{3,4}]}

		  the complete matching	pattern	always giving the first	return
		  value	 in  the  list	and  the rest of the subpatterns being
		  added	in the order they occurred in the regular expression.

		  The capture tuple is built up	as follows:

		  ValueSpec:
		    Specifies which captured (sub)patterns are to be returned.
		    The ValueSpec can either be an atom describing a predefined
		    set of return values, or a list containing either the
		    indexes or the names of specific subpatterns to return.

		    The	predefined sets	of subpatterns are:

		    all:
		      All captured subpatterns including the complete matching
		      string. This is the default.

		    all_names:
		      All named subpatterns in the regular expression, as if a
		      list() of	all the	names in alphabetical order was	given.
		      The list of all names can	also be	retrieved with the in-
		      spect/2 function.

		    first:
		      Only the first captured subpattern, which is always the
		      complete	matching  part	of the subject.	All explicitly
		      captured subpatterns are discarded.

		    all_but_first:
		      All but the first matching subpattern, i.e. all explicitly
		      captured subpatterns, but not the complete matching
		      part of the subject string. This is useful if the	 regu-
		      lar  expression  as  a whole matches a large part	of the
		      subject, but the part you're interested in is in an  ex-
		      plicitly captured	subpattern. If the return type is list
		      or binary, not returning subpatterns you're  not	inter-
		      ested in is a good way to	optimize.

		    none:
		      Do not return matching subpatterns at all, yielding the
		      single atom match as the return value of the function
		      when matching successfully instead of the {match,
		      list()} return. Specifying an empty list gives the same
		      behavior.

		    The	value list is a	list of	indexes	for the	subpatterns to
		    return, where index	0 is for all of	the pattern, and 1  is
		    for	the first explicit capturing subpattern	in the regular
		    expression,	and so forth. When using named	captured  sub-
		    patterns  (see  below)  in the regular expression, one can
		    use	atom()s	or string()s to	specify	the subpatterns	to  be
		    returned. For example, consider the regular expression:

		    ".*(abcd).*"

		    matched  against  the  string "ABCabcdABC",	capturing only
		    the "abcd" part (the first explicit subpattern):

		    re:run("ABCabcdABC",".*(abcd).*",[{capture,[1]}]).

		    The call will yield the following result:

		    {match,[{3,4}]}

		    as the first explicitly captured subpattern	 is  "(abcd)",
		    matching  "abcd"  in the subject, at (zero-based) position
		    3, of length 4.

		    Now	consider the same regular  expression,	but  with  the
		    subpattern explicitly named 'FOO':

		    ".*(?<FOO>abcd).*"

		    With this expression, we could still give the index	of the
		    subpattern with the following call:

		    re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,[1]}]).

		    giving the same result as before. But, since the
		    subpattern is named, we can also specify its name in the
		    value list:

		    re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,['FOO']}]).

		    which would yield the same result as the earlier examples,
		    namely:

		    {match,[{3,4}]}

		    The	values list might specify indexes or names not present
		    in the regular expression, in which	case the return	values
		    vary  depending on the type. If the	type is	index, the tu-
		    ple	{-1,0} is returned for values having no	 corresponding
		    subpattern	in the regexp, but for the other types (binary
		    and list), the values are the empty binary or list,
		    respectively.

		  Type:
		    Optionally specifies how captured substrings are to be
		    returned. If omitted, the default of index is used. The Type
		    can	be one of the following:

		    index:
		      Return captured substrings as pairs of byte indexes into
		      the subject string and length of the matching string in
		      the subject (as if the subject string was flattened with
		      iolist_to_binary/1 or unicode:characters_to_binary/2
		      prior to matching). Note that the unicode option results
		      in byte-oriented indexes in a (possibly virtual) UTF-8
		      encoded binary. A byte index tuple {0,2} might therefore
		      represent one or two characters when unicode is in
		      effect. This might seem counter-intuitive, but has been
		      deemed the most effective and useful way to do it. To
		      return lists instead might result in simpler code if
		      that is desired. This return type is the default.

		    list:
		      Return matching substrings as lists of characters (Erlang
		      string()s). If the unicode option is used in combination
		      with the \C sequence in the regular expression, a
		      captured subpattern can contain bytes that are not valid
		      UTF-8 (\C	matches	bytes regardless of  character	encod-
		      ing).  In	that case the list capturing may result	in the
		      same types of tuples  that  unicode:characters_to_list/2
		      can  return, namely three-tuples with the	tag incomplete
		      or error,	the successfully converted characters and  the
		      invalid  UTF-8  tail  of the conversion as a binary. The
		      best strategy is to avoid	using  the  \C	sequence  when
		      capturing	lists.

		    binary:
		      Return matching substrings as binaries. If the unicode
		      option is	used, these binaries are in UTF-8. If  the  \C
		      sequence	is used	together with unicode the binaries may
		      be invalid UTF-8.

		  In general, subpatterns that were not	assigned  a  value  in
		  the  match are returned as the tuple {-1,0} when type	is in-
		  dex. Unassigned subpatterns are returned as the empty	binary
		  or  list, respectively, for other return types. Consider the
		  regular expression:


		  There	are three explicitly capturing subpatterns, where  the
		  opening parenthesis position determines the order in the
		  result, hence ((?<FOO>abdd)|a(..d)) is subpattern index 1,
		  (?<FOO>abdd) is subpattern index 2 and (..d) is subpattern
		  index 3. When matched against the following string:


		  the subpattern at index 2 won't  match,  as  "abdd"  is  not
		  present in the string, but the complete pattern matches (due
		  to the alternative a(..d)). The subpattern at index 2 is
		  therefore unassigned and the default return value will be:


		  Setting the capture Type to binary would give	the following:


		  where the empty binary (<<>>) represents the unassigned
		  subpattern. In the binary case, some information about the
		  matching is therefore lost: the <<>> might just as well be
		  a captured empty string.

		  If differentiation between empty matches  and	 non  existing
		  subpatterns is necessary, use	the type index and do the con-
		  version to the final type in Erlang code.
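
		  A minimal sketch of the difference between the return types,
		  written as expressions for the Erlang shell. The subject
		  string "ABCabcdABC" and the surrounding .* in the pattern
		  are assumptions chosen for illustration; each line asserts
		  its result by pattern matching.

```erlang
%% Three capturing subpatterns; (?<FOO>abdd) never takes part in the match.
RE = ".*((?<FOO>abdd)|a(..d)).*".
Subject = "ABCabcdABC".

%% Default type (index): the unassigned subpattern shows up as {-1,0}.
{match, [{0,10},{3,4},{-1,0},{4,3}]} = re:run(Subject, RE).

%% binary and list: the unassigned subpattern is indistinguishable
%% from a subpattern that captured an empty string.
{match, [<<"ABCabcdABC">>,<<"abcd">>,<<>>,<<"bcd">>]} =
    re:run(Subject, RE, [{capture, all, binary}]).
{match, ["ABCabcdABC","abcd",[],"bcd"]} =
    re:run(Subject, RE, [{capture, all, list}]).
```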

		  When the option global is given, the	capture	 specification
		  affects each match separately, so that:


		  gives	the result:


	      The  options solely affecting the	compilation step are described
	      in the re:compile/2 function.
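
	      As a sketch of how global interacts with a capture
	      specification, the following shell expression captures only
	      subpattern 1 for every match. The subject "cacb" and the pattern
	      are illustrative choices.

```erlang
%% Each match contributes its own list of captured values.
{match, [["a"],["b"]]} =
    re:run("cacb", "c(a|b)", [global, {capture, [1], list}]).
```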

       replace(Subject,	RE, Replacement) -> iodata() | unicode:charlist()


		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata()
		 Replacement = iodata()	| unicode:charlist()

	      The same as replace(Subject,RE,Replacement,[]).

       replace(Subject,	RE, Replacement, Options) ->
		  iodata() | unicode:charlist()


		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata() |	unicode:charlist()
		 Replacement = iodata()	| unicode:charlist()
		 Options = [Option]
		 Option	= anchored
			| global
			| notbol
			| noteol
			| notempty
			| notempty_atstart
			| {offset, integer() >=	0}
			| {newline, NLSpec}
			| bsr_anycrlf
			| {match_limit,	integer() >= 0}
			| {match_limit_recursion, integer() >= 0}
			| bsr_unicode
			| {return, ReturnType}
			| CompileOpt
		 ReturnType = iodata | list | binary
		 CompileOpt = compile_option()
		 NLSpec	= cr | crlf | lf | anycrlf | any

	      Replaces the matched part	of the Subject string  with  the  con-
	      tents of Replacement.

	      The  permissible	options	 are  the same as for re:run/3,	except
	      that the capture option is not allowed. Instead a	 {return,  Re-
	      turnType}	 is  present.  The default return type is iodata, con-
	      structed in a way	to minimize copying. The iodata	result can  be
	      used  directly  in  many I/O-operations. If a flat list()	is de-
	      sired, specify {return, list} and	 if  a	binary	is  preferred,
	      specify {return, binary}.
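
	      The return types can be compared with a small sketch for the
	      Erlang shell. The exact shape of the default iodata result is
	      not shown, as it depends on the implementation; it is flattened
	      with iolist_to_binary/1 instead. The subject and replacement
	      strings are illustrative.

```erlang
%% Flat list and binary results.
"abXd" = re:replace("abcd", "c", "X", [{return, list}]).
<<"abXd">> = re:replace("abcd", "c", "X", [{return, binary}]).

%% Default iodata result: do not rely on its structure, flatten it if needed.
<<"abXd">> = iolist_to_binary(re:replace("abcd", "c", "X")).
```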

	      As  in  the re:run/3 function, an	mp() compiled with the unicode
	      option requires the Subject to be	a Unicode charlist(). If  com-
	      pilation	is  done implicitly and	the unicode compilation	option
	      is given to this function, both the regular expression  and  the
	      Subject should be	given as valid Unicode charlist()s.

	      The replacement string can contain the special character &,
	      which inserts the whole matching expression in the result, and
	      the special sequence \N (where N is an integer > 0), \gN or
	      \g{N}, which inserts subexpression number N in the result. If
	      no subexpression with that number is generated by the regular
	      expression, nothing is inserted.

	      To insert an & or \ in the result, precede it with a \. Note
	      that Erlang already gives a special meaning to \ in literal
	      strings, so a single \ has to be written as "\\" and therefore a
	      double \ as "\\\\". Example:








	      As with re:run/3, compilation errors raise the badarg exception;
	      re:compile/2 can be used to get more information about the error.
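
	      The replacement syntax described above can be sketched with a
	      few shell expressions. The subject strings and patterns are
	      illustrative; note the doubled backslashes required by Erlang
	      string literals.

```erlang
%% & inserts the whole match.
"ab[c]d" = re:replace("abcd", "c", "[&]", [{return, list}]).

%% \& (written "\\&" in Erlang source) inserts a literal ampersand.
"ab[&]d" = re:replace("abcd", "c", "[\\&]", [{return, list}]).

%% \N inserts subexpression N; here the two groups are swapped.
"acbd" = re:replace("abcd", "(b)(c)", "\\2\\1", [{return, list}]).

%% With global, every match is replaced.
"aXcaXc" = re:replace("abcabc", "b", "X", [global, {return, list}]).
```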

       split(Subject, RE) -> SplitList


		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata()
		 SplitList = [iodata() | unicode:charlist()]

	      The same as split(Subject,RE,[]).

       split(Subject, RE, Options) -> SplitList


		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata() |	unicode:charlist()
		 Options = [Option]
		 Option	= anchored
			| notbol
			| noteol
			| notempty
			| notempty_atstart
			| {offset, integer() >=	0}
			| {newline, nl_spec()}
			| {match_limit,	integer() >= 0}
			| {match_limit_recursion, integer() >= 0}
			| bsr_anycrlf
			| bsr_unicode
			| {return, ReturnType}
			| {parts, NumParts}
			| group
			| trim
			| CompileOpt
		 NumParts = integer() >= 0 | infinity
		 ReturnType = iodata | list | binary
		 CompileOpt = compile_option()
		   See compile/2 above.
		 SplitList = [RetData] | [GroupedRetData]
		 GroupedRetData	= [RetData]
		 RetData = iodata() | unicode:charlist() | binary() | list()

	      This  function splits the	input into parts by finding tokens ac-
	      cording to the regular expression	supplied.

	      The splitting is done basically by running a global regexp match
	      and  dividing  the  initial  string wherever a match occurs. The
	      matching part of the string is removed from the output.

	      As in the	re:run/3 function, an mp() compiled with  the  unicode
	      option  requires the Subject to be a Unicode charlist(). If com-
	      pilation is done implicitly and the unicode  compilation	option
	      is  given	 to this function, both	the regular expression and the
	      Subject should be	given as valid Unicode charlist()s.

	      The result is given  as  a  list	of  "strings",	the  preferred
	      datatype given in	the return option (default iodata).

	      If  subexpressions  are  given  in  the  regular expression, the
	      matching subexpressions are returned in the  resulting  list  as
	      well. An example:


	      will yield the result:




	      will yield


	      The  text	 matching the subexpression (marked by the parentheses
	      in the regexp) is	inserted in  the  result  list	where  it  was
	      found.  In  effect this means that concatenating the result of a
	      split where the whole regexp is a	single	subexpression  (as  in
	      the example above) will always result in the original string.

	      As  there	 is no matching	subexpression for the last part	in the
	      example (the "g"), there is nothing inserted after that. To make
	      the  group  of strings and the parts matching the	subexpressions
	      more obvious, one	might use the group option, which  groups  to-
	      gether  the  part	 of the	subject	string with the	parts matching
	      the subexpressions when the string was split:




	      Here the regular expression matched first	the "l", causing  "Er"
	      to  be the first part in the result. When	the regular expression
	      matched, the (only) subexpression	was bound to the "l",  so  the
	      "l"  is inserted in the group together with "Er".	The next match
	      is of the	"n", making "a"	the next part to  be  returned.	 Since
	      the  subexpression  is  bound to the substring "n" in this case,
	      the "n" is inserted into this group. The last group consists  of
	      the rest of the string, as no more matches are found.
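
	      The behaviour described above can be reproduced with the
	      following sketch for the Erlang shell, assuming the subject
	      "Erlang" and a pattern matching "l" or "n":

```erlang
%% Without a subexpression, the matching parts are dropped.
["Er","a","g"] = re:split("Erlang", "[ln]", [{return, list}]).

%% With a subexpression, the matched text is inserted where it was found.
["Er","l","a","n","g"] = re:split("Erlang", "([ln])", [{return, list}]).

%% group collects each part together with its subexpression matches.
[["Er","l"],["a","n"],["g"]] =
    re:split("Erlang", "([ln])", [{return, list}, group]).
```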

	      By  default,  all	 parts	of  the	 string,  including  the empty
	      strings, are returned from the function. For example:


	      will return:


	      since the	matching of the	"g" in the end of the string leaves an
	      empty  rest  which is also returned. This	behaviour differs from
	      the default behaviour of the split function in Perl, where empty
	      strings at the end are by	default	removed. To get	the "trimming"
	      default behavior of Perl,	specify	trim as	an option:


	      The result will be:


	      The "trim" option	in effect says;	"give me as many parts as pos-
	      sible except the empty ones", which might	be useful in some cir-
	      cumstances. You can also specify how many	 parts	you  want,  by
	      specifying {parts,N}:


	      This will	give:


	      Note that the last part is "ang", not "an", as we only specified
	      splitting into two parts, and the splitting stops when enough
	      parts are given, which is why the result differs from that of
	      trim.

	      More than	three parts are	not possible with this indata, so


	      will give	the same result	as the default,	which is to be	viewed
	      as "an infinite number of	parts".

	      Specifying 0 as the number of parts gives	the same effect	as the
	      option trim. If subexpressions are captured, empty subexpression
	      matches  at the end are also stripped from the result if trim or
	      {parts,0}	is specified.

	      If you are familiar with Perl, the  trim	behaviour  corresponds
	      exactly to the Perl default, the {parts,N} where N is a positive
	      integer corresponds exactly to the Perl behaviour	with  a	 posi-
	      tive  numerical  third  parameter	 and  the default behaviour of
	      re:split/3 corresponds to	that when the Perl routine is given  a
	      negative integer as the third parameter.
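
	      A sketch comparing the default, trim and {parts,N} behaviour,
	      assuming the subject "Erlang" and a pattern matching "l" or
	      "g":

```erlang
%% Default: the empty part after the final "g" is kept.
["Er","an",[]] = re:split("Erlang", "[lg]", [{return, list}]).

%% trim and {parts,0} both drop empty parts at the end.
["Er","an"] = re:split("Erlang", "[lg]", [{return, list}, trim]).
["Er","an"] = re:split("Erlang", "[lg]", [{return, list}, {parts, 0}]).

%% {parts,2} stops splitting once two parts exist.
["Er","ang"] = re:split("Erlang", "[lg]", [{return, list}, {parts, 2}]).

%% Asking for more parts than are possible equals the default.
["Er","an",[]] = re:split("Erlang", "[lg]", [{return, list}, {parts, 4}]).
```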

	      Summary of options not previously described for the re:run/3
	      function:

		{return,ReturnType} :
		  Specifies how the parts of the original string are presented
		  in the result list. The possible types are:

		  iodata :
		    The variant of iodata() that gives the least copying of
		    data with the current implementation (often a binary, but
		    don't depend on it).

		  binary :
		    All parts returned as binaries.

		  list :
		    All parts returned as lists of characters ("strings").

		group :
		  Groups together the part of the string with the parts of the
		  string matching the subexpressions of the regexp.

		  The return value from the function will in this case be a
		  list() of list()s. Each sublist begins with the string
		  picked out of the subject string, followed by the parts
		  matching each of the subexpressions in order of occurrence
		  in the regular expression.

		{parts,N} :
		  Specifies the number of parts the subject string is to be
		  split into.

		  The number of parts should be a positive integer for a
		  specific maximum on the number of parts and infinity for the
		  maximum number of parts possible (the default). Specifying
		  {parts,0} gives as many parts as possible disregarding empty
		  parts at the end, the same as specifying trim.

		trim :
		  Specifies that empty parts at the end of the result list are
		  to be disregarded. The same as specifying {parts,0}. This
		  corresponds to the default behaviour of the split built-in
		  function in Perl.

       The following sections contain reference	material for the  regular  ex-
       pressions  used	by  this  module.  The regular expression reference is
       based on	the PCRE documentation,	with changes in	 cases	where  the  re
       module behaves differently to the PCRE library.

       The  syntax and semantics of the	regular	expressions that are supported
       by PCRE are described in	detail below. Perl's regular  expressions  are
       described  in its own documentation, and	regular	expressions in general
       are covered in a	number of books, some of which have copious  examples.
       Jeffrey	 Friedl's   "Mastering	 Regular  Expressions",	 published  by
       O'Reilly, covers	regular	expressions in great detail. This  description
       of PCRE's regular expressions is	intended as reference material.

       The reference material is divided into the following sections:

	 * Special start-of-pattern items

	 * Characters and metacharacters

	 * Backslash

	 * Circumflex and dollar

	 * Full	stop (period, dot) and \N

	 * Matching a single data unit

	 * Square brackets and character classes

	 * POSIX character classes

	 * Vertical bar

	 * Internal option setting

	 * Subpatterns

	 * Duplicate subpattern	numbers

	 * Named subpatterns

	 * Repetition

	 * Atomic grouping and possessive quantifiers

	 * Back	references

	 * Assertions

	 * Conditional subpatterns

	 * Comments

	 * Recursive patterns

	 * Subpatterns as subroutines

	 * Oniguruma subroutine	syntax

	 * Backtracking	control

       A  number of options that can be	passed to re:compile/2 can also	be set
       by special items	at the start of	a pattern. These are not Perl-compati-
       ble, but	are provided to	make these options accessible to pattern writ-
       ers who are not able to change the program that processes the  pattern.
       Any  number  of	these  items may appear, but they must all be together
       right at	the start of the pattern string, and the letters  must	be  in
       upper case.

       UTF support

       Unicode support is basically UTF-8 based. To use Unicode characters,
       you either call re:compile/2 or re:run/3 with the unicode option, or
       the pattern must start with one of these special sequences:

	 (*UTF8)
	 (*UTF)

       Both sequences give the same effect: the input string is interpreted
       as UTF-8. Note that with these instructions, the automatic conversion
       of lists to UTF-8 is not performed by the re functions, so using these
       sequences is not recommended. Add the unicode option when running
       re:compile/2 instead.
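
       A small sketch of the effect of the unicode option, for the Erlang
       shell. The code point U+03A9 (two bytes in UTF-8) is an illustrative
       choice.

```erlang
%% With unicode, "." matches one character, here two bytes long,
%% so the byte-oriented index tuple is {0,2}.
{match, [{0,2}]} = re:run(<<16#3A9/utf8>>, "^.$", [unicode]).

%% Without unicode, the same binary is two separate one-byte characters.
nomatch = re:run(<<16#3A9/utf8>>, "^.$").

%% A charlist subject is converted to UTF-8 when unicode is given.
{match, [{0,2}]} = re:run([16#3A9], "^.$", [unicode]).
{match, [[16#3A9]]} = re:run([16#3A9], "^.$", [unicode, {capture, all, list}]).
```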

       Some applications that allow their users	to supply patterns may wish to
       restrict	them to	non-UTF	data for security reasons.  If	the  never_utf
       option  is  set at compile time,	(*UTF) etc. are	not allowed, and their
       appearance causes an error.

       Unicode property	support

       Another special sequence that may appear at the start of a pattern is

	 (*UCP)

       This has	the same effect	as setting the ucp option: it causes sequences
       such  as	 \d  and  \w  to use Unicode properties	to determine character
       types, instead of recognizing only characters with codes	less than  256
       via a lookup table.

       Disabling start-up optimizations

       If a pattern starts with (*NO_START_OPT), it has the same effect as
       setting the no_start_optimize option at compile time.

       Newline conventions

       PCRE supports five different conventions	for indicating line breaks  in
       strings: a single CR (carriage return) character, a single LF
       (linefeed) character, the two-character sequence CRLF, any of the
       three preceding, or any Unicode newline sequence.

       It  is also possible to specify a newline convention by starting	a pat-
       tern string with	one of the following five sequences:

	 (*CR) :
	   carriage return

	 (*LF) :
	   linefeed

	 (*CRLF) :
	   carriage return, followed by linefeed

	 (*ANYCRLF) :
	   any of the three above

	 (*ANY) :
	   all Unicode newline sequences

       These override the default and the options given	to  re:compile/2.  For
       example, the pattern:

	 (*CR)a.b

       changes the convention to CR. That pattern matches "a\nb" because LF is
       no longer a newline. If more than one of	them is	present, the last  one
       is used.
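
       The effect of changing the newline convention can be sketched as
       follows, assuming the default convention of the Erlang build is LF:

```erlang
%% With the default convention, "." does not match the linefeed.
nomatch = re:run("a\nb", "a.b").

%% With CR as the only newline, LF is an ordinary character.
{match, [{0,3}]} = re:run("a\nb", "(*CR)a.b").

%% The same convention selected through the newline option.
{match, [{0,3}]} = re:run("a\nb", "a.b", [{newline, cr}]).
```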

       The  newline  convention	affects	where the circumflex and dollar	asser-
       tions are true. It also affects the interpretation of the dot metachar-
       acter when dotall is not	set, and the behaviour of \N. However, it does
       not affect what the \R escape sequence matches. By default, this	is any
       Unicode	newline	sequence, for Perl compatibility. However, this	can be
       changed;	see the	description of \R in the section entitled "Newline se-
       quences"	below. A change	of \R setting can be combined with a change of
       newline convention.

       Setting match and recursion limits

       The caller of re:run/3 can set a	limit on the number of times  the  in-
       ternal match() function is called and on	the maximum depth of recursive
       calls. These facilities are provided to catch runaway matches that  are
       provoked	 by  patterns with huge	matching trees (a typical example is a
       pattern with nested unlimited repeats) and to avoid running out of sys-
       tem  stack  by too much recursion. When one of these limits is reached,
       pcre_exec() gives an error return. The limits can also be set by	 items
       at the start of the pattern of the form

	 (*LIMIT_MATCH=d)
	 (*LIMIT_RECURSION=d)

       where d is any number of	decimal	digits.	However, the value of the set-
       ting must be less than the value	set by the caller of re:run/3  for  it
       to  have	 any  effect. In other words, the pattern writer can lower the
       limit set by the	programmer, but	not raise it. If there	is  more  than
       one setting of one of these limits, the lower value is used.

       The current default value for both limits is 10000000 in the Erlang
       VM. Note that the recursion limit does not actually affect the
       stack  depth  of	 the  VM, as PCRE for Erlang is	compiled in such a way
       that the	match function never does recursion on the "C-stack".

       A regular expression is a pattern that is  matched  against  a  subject
       string  from  left  to right. Most characters stand for themselves in a
       pattern,	and match the corresponding characters in the  subject.	 As  a
       trivial example,	the pattern

       The quick brown fox

       matches a portion of a subject string that is identical to itself. When
       caseless	matching is  specified	(the  caseless	option),  letters  are
       matched independently of	case.

       The  power of regular expressions comes from the	ability	to include al-
       ternatives and repetitions in the pattern. These	 are  encoded  in  the
       pattern by the use of metacharacters, which do not stand	for themselves
       but instead are interpreted in some special way.

       There are two different sets of metacharacters: those that  are	recog-
       nized  anywhere in the pattern except within square brackets, and those
       that are	recognized within square brackets.  Outside  square  brackets,
       the metacharacters are as follows:

	 \ :
	   general escape character with several uses

	 ^ :
	   assert start of string (or line, in multiline mode)

	 $ :
	   assert end of string (or line, in multiline mode)

	 . :
	   match any character except newline (by default)

	 [ :
	   start character class definition

	 | :
	   start of alternative branch

	 ( :
	   start subpattern

	 ) :
	   end subpattern

	 ? :
	   extends the meaning of (, also 0 or 1 quantifier, also quantifier
	   minimizer

	 * :
	   0 or more quantifier

	 + :
	   1 or more quantifier, also "possessive quantifier"

	 { :
	   start min/max quantifier

       Part of a pattern that is in square brackets  is	 called	 a  "character
       class". In a character class the	only metacharacters are:

	 \ :
	   general escape character

	 ^ :
	   negate the class, but only if the first character

	 - :
	   indicates character range

	 [ :
	   POSIX character class (only if followed by POSIX syntax)

	 ] :
	   terminates the character class

       The following sections describe the use of each of the metacharacters.

       The backslash character has several uses. Firstly, if it	is followed by
       a character that	is not a number	or a letter, it	takes away any special
       meaning	that  character	 may  have. This use of	backslash as an	escape
       character applies both inside and outside character classes.

       For example, if you want	to match a * character,	you write  \*  in  the
       pattern.	 This  escaping	 action	 applies  whether or not the following
       character would otherwise be interpreted	as a metacharacter, so	it  is
       always  safe  to	 precede  a non-alphanumeric with backslash to specify
       that it stands for itself. In particular, if you	want to	match a	 back-
       slash, you write	\\.

       In  unicode mode, only ASCII numbers and	letters	have any special mean-
       ing after a backslash. All other	characters (in particular, those whose
       codepoints are greater than 127)	are treated as literals.

       If  a  pattern is compiled with the extended option, white space	in the
       pattern (other than in a	character class) and characters	 between  a  #
       outside a character class and the next newline are ignored. An escaping
       backslash can be	used to	include	a white	space or # character  as  part
       of the pattern.

       If  you	want  to remove	the special meaning from a sequence of charac-
       ters, you can do	so by putting them between \Q and \E. This is  differ-
       ent  from  Perl	in that	$ and @	are handled as literals	in \Q...\E se-
       quences in PCRE,	whereas	in Perl, $ and @ cause variable	interpolation.
       Note the	following examples:

	 Pattern	   PCRE	matches	  Perl matches

	 \Qabc$xyz\E	   abc$xyz	  abc followed by the contents of $xyz
	 \Qabc\$xyz\E	   abc\$xyz	  abc\$xyz
	 \Qabc\E\$\Qxyz\E  abc$xyz	  abc$xyz

       The  \Q...\E  sequence  is recognized both inside and outside character
       classes.	An isolated \E that is not preceded by \Q is ignored. If \Q is
       not  followed  by  \E  later in the pattern, the	literal	interpretation
       continues to the	end of the pattern (that is,  \E  is  assumed  at  the
       end).  If  the  isolated	\Q is inside a character class,	this causes an
       error, because the character class is not terminated.
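
       A brief sketch of escaping from Erlang, for the shell. The subject
       strings are illustrative; each backslash in the pattern is doubled
       because of Erlang's string literal syntax.

```erlang
%% Unquoted, $ is an end-of-subject assertion, so the pattern cannot match.
nomatch = re:run("abc$xyz", "abc$xyz").

%% Between \Q and \E the $ is a literal character.
{match, [{0,7}]} = re:run("abc$xyz", "\\Qabc$xyz\\E").

%% A single metacharacter can be escaped with a backslash.
{match, [{0,3}]} = re:run("a*b", "a\\*b").
```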

       Non-printing characters

       A second	use of backslash provides a way	of encoding non-printing char-
       acters  in patterns in a	visible	manner.	There is no restriction	on the
       appearance of non-printing characters, apart from the binary zero  that
       terminates  a  pattern,	but  when  a pattern is	being prepared by text
       editing,	it is often easier to use one  of  the	following  escape  se-
       quences than the	binary character it represents:

	 \a :
	   alarm, that is, the BEL character (hex 07)

	 \cx :
	   "control-x", where x is any ASCII character

	 \e :
	   escape (hex 1B)

	 \f :
	   form feed (hex 0C)

	 \n :
	   linefeed (hex 0A)

	 \r :
	   carriage return (hex 0D)

	 \t :
	   tab (hex 09)

	 \ddd :
	   character with octal code ddd, or back reference

	 \xhh :
	   character with hex code hh

	 \x{hhh..} :
	   character with hex code hhh..

       The  precise effect of \cx on ASCII characters is as follows: if	x is a
       lower case letter, it is	converted to upper case. Then  bit  6  of  the
       character (hex 40) is inverted. Thus \cA	to \cZ become hex 01 to	hex 1A
       (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is	7B), and  \c;  becomes
       hex  7B (; is 3B). If the data item (byte or 16-bit value) following \c
       has a value greater than	127, a compile-time error occurs.  This	 locks
       out non-ASCII characters	in all modes.

       The  \c	facility  was designed for use with ASCII characters, but with
       the extension to	Unicode	it is even less	useful than it once was.

       By default, after \x, from zero to  two	hexadecimal  digits  are  read
       (letters	can be in upper	or lower case).	Any number of hexadecimal dig-
       its may appear between \x{ and }, but the character code	is constrained
       as follows:

	 8-bit non-Unicode mode:
	   less	than 0x100

	 8-bit UTF-8 mode:
	   less	than 0x10ffff and a valid codepoint

       Invalid	Unicode	 codepoints  are  the  range 0xd800 to 0xdfff (the so-
       called "surrogate" codepoints), and 0xffef.

       If characters other than	hexadecimal digits appear between \x{  and  },
       or if there is no terminating },	this form of escape is not recognized.
       Instead,	the initial \x will be interpreted as a	basic hexadecimal  es-
       cape, with no following digits, giving a	character whose	value is zero.

       Characters whose	value is less than 256 can be defined by either	of the
       two syntaxes for	\x. There is no	difference in the way  they  are  han-
       dled. For example, \xdc is exactly the same as \x{dc}.
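
       As a sketch, the two \x syntaxes can be compared from the Erlang
       shell. The subjects are illustrative. The backslash is doubled so
       that the escape reaches the regular expression library instead of
       being interpreted by the Erlang string literal.

```erlang
%% Both forms denote the character with code 16#41 ("A").
{match, [{0,1}]} = re:run("A", "\\x41").
{match, [{0,1}]} = re:run("A", "\\x{41}").

%% \xdc and \x{dc} are handled identically (non-unicode mode, one byte).
{match, [{0,1}]} = re:run(<<16#DC>>, "\\xdc").
{match, [{0,1}]} = re:run(<<16#DC>>, "\\x{dc}").
```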

       After  \0  up  to two further octal digits are read. If there are fewer
       than two	digits,	just those that	are present are	 used.	Thus  the  se-
       quence  \0\x\07	specifies two binary zeros followed by a BEL character
       (code value 7). Make sure you supply two	digits after the initial  zero
       if the pattern character	that follows is	itself an octal	digit.

       The handling of a backslash followed by a digit other than 0 is compli-
       cated. Outside a	character class, PCRE reads it and any following  dig-
       its  as	a  decimal  number. If the number is less than 10, or if there
       have been at least that many previous capturing left parentheses	in the
       expression,  the	 entire	 sequence  is taken as a back reference. A de-
       scription of how	this works is given later, following the discussion of
       parenthesized subpatterns.

       Inside  a  character  class, or if the decimal number is	greater	than 9
       and there have not been that many capturing subpatterns,	PCRE  re-reads
       up to three octal digits	following the backslash, and uses them to gen-
       erate a data character. Any subsequent digits stand for themselves. The
       value  of  the  character  is constrained in the	same way as characters
       specified in hexadecimal. For example:

	 \040 :
	   is another way of writing an ASCII space

	 \40 :
	   is the same, provided there are fewer than 40 previous capturing
	   subpatterns

	 \7 :
	   is always a back reference

	 \11 :
	   might be a back reference, or another way of writing a tab

	 \011 :
	   is always a tab

	 \0113 :
	   is a tab followed by the character "3"

	 \113 :
	   might be a back reference, otherwise the character with octal code
	   113

	 \377 :
	   might be a back reference, otherwise the value 255 (decimal)

	 \81 :
	   is either a back reference, or a binary zero followed by the two
	   characters "8" and "1"

       Note  that  octal  values of 100	or greater must	not be introduced by a
       leading zero, because no	more than three	octal digits are ever read.

       All the sequences that define a single character	value can be used both
       inside  and  outside character classes. In addition, inside a character
       class, \b is interpreted	as the backspace character (hex	08).

       \N is not allowed in a character	class. \B, \R, and \X are not  special
       inside  a  character  class.  Like other	unrecognized escape sequences,
       they are	treated	as the literal characters "B", "R", and	"X". Outside a
       character class,	these sequences	have different meanings.

       Unsupported escape sequences

       In  Perl, the sequences \l, \L, \u, and \U are recognized by its	string
       handler and used	to modify the case of following	characters. PCRE  does
       not support these escape	sequences.

       Absolute	and relative back references

       The  sequence  \g followed by an	unsigned or a negative number, option-
       ally enclosed in	braces,	is an absolute or relative back	 reference.  A
       named back reference can	be coded as \g{name}. Back references are dis-
       cussed later, following the discussion of parenthesized subpatterns.

       Absolute	and relative subroutine	calls

       For compatibility with Oniguruma, the non-Perl syntax \g	followed by  a
       name or a number	enclosed either	in angle brackets or single quotes, is
       an alternative syntax for referencing a subpattern as  a	 "subroutine".
       Details	are  discussed	later.	Note  that  \g{...}  (Perl syntax) and
       \g<...> (Oniguruma syntax) are not synonymous. The  former  is  a  back
       reference; the latter is	a subroutine call.

       Generic character types

       Another use of backslash	is for specifying generic character types:

	 \d :
	   any decimal digit

	 \D :
	   any character that is not a decimal digit

	 \h :
	   any horizontal white space character

	 \H :
	   any character that is not a horizontal white space character

	 \s :
	   any white space character

	 \S :
	   any character that is not a white space character

	 \v :
	   any vertical white space character

	 \V :
	   any character that is not a vertical white space character

	 \w :
	   any "word" character

	 \W :
	   any "non-word" character

       There is	also the single	sequence \N, which matches a non-newline char-
       acter. This is the same as the "." metacharacter	 when  dotall  is  not
       set.  Perl also uses \N to match	characters by name; PCRE does not sup-
       port this.

       Each pair of lower and upper case escape	sequences partitions the  com-
       plete  set  of  characters  into	two disjoint sets. Any given character
       matches one, and	only one, of each pair.	The sequences can appear  both
       inside  and outside character classes. They each	match one character of
       the appropriate type. If	the current matching point is at  the  end  of
       the subject string, all of them fail, because there is no character to
       match.

       For compatibility with Perl, \s does not	match the VT  character	 (code
       11). This makes it different from the POSIX "space" class. The \s char-
       acters are HT (9), LF (10), FF (12), CR (13), and space (32).  If  "use
       locale;"	 is  included in a Perl	script,	\s may match the VT character.
       In PCRE,	it never does.

       A "word"	character is an	underscore or any character that is  a	letter
       or  digit.  By  default,	 the  definition of letters and	digits is con-
       trolled by PCRE's low-valued character tables, in  Erlang's  case  (and
       without the unicode option), the	ISO-Latin-1 character set.

       By  default,  in	unicode	mode, characters with values greater than 255,
       i.e. all	characters outside the ISO-Latin-1 character set, never	 match
       \d,  \s,	or \w, and always match	\D, \S,	and \W.	These sequences	retain
       their original meanings from before UTF support was  available,	mainly
       for  efficiency	reasons. However, if the ucp option is set, the	behav-
       iour is changed so that Unicode properties are used to determine	 char-
       acter types, as follows:

	 \d :
	   any character that \p{Nd} matches (decimal digit)

	 \s :
	   any character that \p{Z} matches, plus HT, LF, FF, CR

	 \w :
	   any character that \p{L} or \p{N} matches, plus underscore

       The  upper case escapes match the inverse sets of characters. Note that
       \d matches only decimal digits, whereas \w matches any  Unicode	digit,
       as well as any Unicode letter, and underscore. Note also that ucp
       affects \b and \B because they are defined in terms of \w and \W.
       Matching these sequences is noticeably slower when ucp is set.
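
       A sketch of the effect of ucp on \w, for the Erlang shell. The code
       point U+03A9 (a letter with a value greater than 255) is an
       illustrative choice.

```erlang
%% In unicode mode without ucp, characters above 255 never match \w.
nomatch = re:run(<<16#3A9/utf8>>, "\\w", [unicode]).

%% With ucp, Unicode properties decide, and the letter matches.
{match, [{0,2}]} = re:run(<<16#3A9/utf8>>, "\\w", [unicode, ucp]).
```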

       The  sequences  \h, \H, \v, and \V are features that were added to Perl
       at release 5.10.	In contrast to the other sequences, which  match  only
       ASCII  characters  by  default,	these always match certain high-valued
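       codepoints, whether or not ucp is set. The horizontal space characters
       are:

	 U+0009 :
	   Horizontal tab (HT)

	 U+0020 :
	   Space

	 U+00A0 :
	   Non-break space

	 U+1680 :
	   Ogham space mark

	 U+180E :
	   Mongolian vowel separator

	 U+2000 :
	   En quad

	 U+2001 :
	   Em quad

	 U+2002 :
	   En space

	 U+2003 :
	   Em space

	 U+2004 :
	   Three-per-em space

	 U+2005 :
	   Four-per-em space

	 U+2006 :
	   Six-per-em space

	 U+2007 :
	   Figure space

	 U+2008 :
	   Punctuation space

	 U+2009 :
	   Thin space

	 U+200A :
	   Hair space

	 U+202F :
	   Narrow no-break space

	 U+205F :
	   Medium mathematical space

	 U+3000 :
	   Ideographic space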

       The vertical space characters are:

	 U+000A :
	   Linefeed (LF)

	 U+000B :
	   Vertical tab (VT)

	 U+000C :
	   Form feed (FF)

	 U+000D :
	   Carriage return (CR)

	 U+0085 :
	   Next line (NEL)

	 U+2028 :
	   Line separator

	 U+2029 :
	   Paragraph separator

       In 8-bit, non-UTF-8 mode, only the characters with codepoints less than
       256 are relevant.

       Newline sequences

       Outside a character class, by default, the escape sequence  \R  matches
       any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the
       following:

	 (?>\r\n|\n|\x0b|\f|\r|\x85)

       This is an example of an "atomic group", details of which are given
       below.

       This particular group matches either the	two-character sequence CR fol-
       lowed by	LF, or one of the single characters LF (linefeed, U+000A),  VT
       (vertical  tab,	U+000B),  FF (form feed, U+000C), CR (carriage return,
       U+000D),	or NEL (next line,  U+0085).  The  two-character  sequence  is
       treated as a single unit	that cannot be split.

       In Unicode mode,	two additional characters whose	codepoints are greater
       than 255	are added: LS (line separator, U+2028) and PS (paragraph sepa-
       rator,  U+2029).	 Unicode  character property support is	not needed for
       these characters	to be recognized.

       It is possible to restrict \R to	match only CR, LF, or CRLF (instead of
       the  complete  set  of  Unicode	line  endings)	by  setting the	option
       bsr_anycrlf either at compile time or when the pattern is matched. (BSR
       is  an  abbreviation  for  "backslash R".) This can be made the default
       when PCRE is built; if this is the case, the other behaviour can be
       requested via the bsr_unicode option. It is also possible to specify
       these settings by starting a pattern string with one of the following
       sequences:

           (*BSR_ANYCRLF)  CR, LF, or CRLF only
           (*BSR_UNICODE)  any Unicode newline sequence

       These override the default and the options given	to the compiling func-
       tion,  but  they	 can  themselves  be  overridden by options given to a
       matching	function. Note that these  special  settings,  which  are  not
       Perl-compatible,	 are  recognized  only at the very start of a pattern,
       and that	they must be in	upper case.  If	 more  than  one  of  them  is
       present,	 the  last  one	is used. They can be combined with a change of
       newline convention; for example, a pattern can start with:

         (*ANY)(*BSR_ANYCRLF)

       They can	also be	combined with the (*UTF8), (*UTF)  or  (*UCP)  special
       sequences.  Inside  a character class, \R is treated as an unrecognized
       escape sequence,	and so matches the letter "R" by default.
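
       A small sketch of how bsr_anycrlf narrows \R, assuming that the
       default in this module is the full Unicode set. The module bsr_demo
       and its functions are illustrative names, not part of re.

```erlang
-module(bsr_demo).
-export([newline_between/2, newline_between_anycrlf/1]).

%% Does Subject consist of "a", exactly one \R newline sequence, and "b"?
%% Options may contain bsr_anycrlf or bsr_unicode.
newline_between(Subject, Options) ->
    re:run(Subject, "\\Aa\\Rb\\z", [{capture, none} | Options]).

%% The same check with the restriction written into the pattern itself.
newline_between_anycrlf(Subject) ->
    re:run(Subject, "(*BSR_ANYCRLF)\\Aa\\Rb\\z", [{capture, none}]).
```

       A vertical tab (byte 11) between "a" and "b" is accepted by the
       default \R but rejected once \R is restricted to CR, LF, and CRLF.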

       Unicode character properties

       Three additional	escape sequences that match characters	with  specific
       properties are available. When in 8-bit non-UTF-8 mode, these sequences
       are of course limited to	testing	characters whose codepoints  are  less
       than 256, but they do work in this mode. The extra escape sequences
       are:

           \p{xx}  a character with the xx property
           \P{xx}  a character without the xx property
           \X      a Unicode extended grapheme cluster

       The property names represented by xx above are limited to  the  Unicode
       script names, the general category properties, "Any", which matches any
       character (including newline), and some special	PCRE  properties  (de-
       scribed in the next section). Other Perl	properties such	as "InMusical-
       Symbols"	are not	currently supported by PCRE. Note  that	 \P{Any}  does
       not match any characters, so always causes a match failure.

       Sets of Unicode characters are defined as belonging to certain scripts.
       A character from	one of these sets can be matched using a script	 name.
       For example:

       \p{Greek} \P{Han}

       Those  that are not part	of an identified script	are lumped together as
       "Common". The current list of scripts is:

	 * Arabic

	 * Armenian

	 * Avestan

	 * Balinese

	 * Bamum

	 * Batak

	 * Bengali

	 * Bopomofo

	 * Braille

	 * Buginese

	 * Buhid

	 * Canadian_Aboriginal

	 * Carian

	 * Chakma

	 * Cham

	 * Cherokee

	 * Common

	 * Coptic

	 * Cuneiform

	 * Cypriot

	 * Cyrillic

	 * Deseret

	 * Devanagari

	 * Egyptian_Hieroglyphs

	 * Ethiopic

	 * Georgian

	 * Glagolitic

	 * Gothic

	 * Greek

	 * Gujarati

	 * Gurmukhi

	 * Han

	 * Hangul

	 * Hanunoo

	 * Hebrew

	 * Hiragana

	 * Imperial_Aramaic

	 * Inherited

	 * Inscriptional_Pahlavi

	 * Inscriptional_Parthian

	 * Javanese

	 * Kaithi

	 * Kannada

	 * Katakana

	 * Kayah_Li

	 * Kharoshthi

	 * Khmer

	 * Lao

	 * Latin

	 * Lepcha

	 * Limbu

	 * Linear_B

	 * Lisu

	 * Lycian

	 * Lydian

	 * Malayalam

	 * Mandaic

	 * Meetei_Mayek

	 * Meroitic_Cursive

	 * Meroitic_Hieroglyphs

	 * Miao

	 * Mongolian

	 * Myanmar

	 * New_Tai_Lue

	 * Nko

	 * Ogham

	 * Old_Italic

	 * Old_Persian

	 * Oriya

	 * Old_South_Arabian

	 * Old_Turkic

	 * Ol_Chiki

	 * Osmanya

	 * Phags_Pa

	 * Phoenician

	 * Rejang

	 * Runic

	 * Samaritan

	 * Saurashtra

	 * Sharada

	 * Shavian

	 * Sinhala

	 * Sora_Sompeng

	 * Sundanese

	 * Syloti_Nagri

	 * Syriac

	 * Tagalog

	 * Tagbanwa

	 * Tai_Le

	 * Tai_Tham

	 * Tai_Viet

	 * Takri

	 * Tamil

	 * Telugu

	 * Thaana

	 * Thai

	 * Tibetan

	 * Tifinagh

	 * Ugaritic

	 * Vai

	 * Yi
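
       Matching by script name can be sketched as follows. The module
       script_demo and the function first_greek_run/1 are hypothetical; the
       subject is assumed to be a UTF-8 encoded binary.

```erlang
-module(script_demo).
-export([first_greek_run/1]).

%% Returns {match, [Binary]} holding the first run of Greek-script
%% characters in a UTF-8 binary, or nomatch if there is none.
first_greek_run(Subject) ->
    re:run(Subject, "\\p{Greek}+", [unicode, {capture, first, binary}]).
```

       Note that \p{...} consults Unicode properties regardless of whether
       ucp is set; ucp only changes the traditional escapes such as \w.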

       Each character has exactly one Unicode general category property, spec-
       ified  by a two-letter abbreviation. For	compatibility with Perl, nega-
       tion can	be specified by	including a  circumflex	 between  the  opening
       brace and the property name. For example, \p{^Lu} is the same as
       \P{Lu}.

       If only one letter is specified with \p or \P, it includes all the gen-
       eral  category properties that start with that letter. In this case, in
       the absence of negation,	the curly brackets in the escape sequence  are
       optional; these two examples have the same effect:

	 * \p{L}

	 * \pL
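
       The equivalences described above (\pL and \p{L}, and \p{^Lu} and
       \P{Lu}) can be checked with a short sketch. The module gc_demo and its
       functions are illustrative names only.

```erlang
-module(gc_demo).
-export([letters/1, same_result/3]).

%% First run of characters with any letter property (Ll, Lm, Lo, Lt, Lu).
letters(Subject) ->
    re:run(Subject, "\\pL+", [unicode, {capture, first, list}]).

%% True if both patterns report the same first match on Subject.
same_result(Subject, PatternA, PatternB) ->
    Opts = [unicode, {capture, first, index}],
    re:run(Subject, PatternA, Opts) =:= re:run(Subject, PatternB, Opts).
```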

       The following general category property codes are supported:

           C   Other
           Cc  Control
           Cf  Format
           Cn  Unassigned
           Co  Private use
           Cs  Surrogate

           L   Letter
           Ll  Lower case letter
           Lm  Modifier letter
           Lo  Other letter
           Lt  Title case letter
           Lu  Upper case letter

           M   Mark
           Mc  Spacing mark
           Me  Enclosing mark
           Mn  Non-spacing mark

           N   Number
           Nd  Decimal number
           Nl  Letter number
           No  Other number

           P   Punctuation
           Pc  Connector punctuation
           Pd  Dash punctuation
           Pe  Close punctuation
           Pf  Final punctuation
           Pi  Initial punctuation
           Po  Other punctuation
           Ps  Open punctuation

           S   Symbol
           Sc  Currency symbol
           Sk  Modifier symbol
           Sm  Mathematical symbol
           So  Other symbol

           Z   Separator
           Zl  Line separator
           Zp  Paragraph separator
           Zs  Space separator

       The  special property L&	is also	supported: it matches a	character that
       has the Lu, Ll, or Lt property, in other	words, a letter	 that  is  not
       classified as a modifier	or "other".

       The  Cs	(Surrogate)  property  applies only to characters in the range
       U+D800 to U+DFFF. Such characters are not valid in Unicode strings  and
       so cannot be tested by PCRE. Perl does not support the Cs property.

       The  long  synonyms  for	 property  names  that	Perl supports (such as
       \p{Letter}) are not supported by	PCRE, nor is it	 permitted  to	prefix
       any of these properties with "Is".

       No character that is in the Unicode table has the Cn (unassigned) prop-
       erty. Instead, this property is assumed for any code point that is  not
       in the Unicode table.

       Specifying  caseless  matching  does not	affect these escape sequences.
       For example, \p{Lu} always matches only upper  case  letters.  This  is
       different from the behaviour of current versions	of Perl.

       Matching	 characters  by	Unicode	property is not	fast, because PCRE has
       to do a multistage table	lookup in order	to find	 a  character's	 prop-
       erty. That is why the traditional escape	sequences such as \d and \w do
       not use Unicode properties in PCRE by default, though you can make them
       do so by	setting	the ucp	option or by starting the pattern with (*UCP).

       Extended	grapheme clusters

       The  \X	escape	matches	 any number of Unicode characters that form an
       "extended grapheme cluster", and	treats the sequence as an atomic group
       (see below). Up to and including	release	8.31, PCRE matched an earlier,
       simpler definition that was equivalent to

         (?>\PM\pM*)

       That is,	it matched a character without the "mark"  property,  followed
       by  zero	 or  more characters with the "mark" property. Characters with
       the "mark" property are typically non-spacing accents that  affect  the
       preceding character.

       This  simple definition was extended in Unicode to include more compli-
       cated kinds of composite	character by giving each character a  grapheme
       breaking	 property, and creating	rules that use these properties	to de-
       fine the	boundaries of extended grapheme	clusters. In releases of  PCRE
       later than 8.31,	\X matches one of these	clusters.

       \X always matches at least one character. Then it decides whether to
       add additional characters according to the following rules for ending a
       cluster:

           1. End at the end of the subject string.

           2. Do not end between CR and LF; otherwise end after any control
              character.

           3. Do not break Hangul (a Korean script) syllable sequences. Hangul
              characters are of five types: L, V, T, LV, and LVT. An L
              character may be followed by an L, V, LV, or LVT character; an
              LV or V character may be followed by a V or T character; an LVT
              or T character may be followed only by a T character.

           4. Do not end before extending characters or spacing marks.
              Characters with the "mark" property always have the "extend"
              grapheme breaking property.

           5. Do not end after prepend characters.

           6. Otherwise, end the cluster.
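
       A sketch of \X in use, assuming a PCRE release later than 8.31
       underneath this module. The module grapheme_demo and the function
       first_cluster/1 are hypothetical, and the input is assumed to be a
       non-empty list of code points.

```erlang
-module(grapheme_demo).
-export([first_cluster/1]).

%% Returns the first extended grapheme cluster of a non-empty list of
%% code points, again as a list of code points.
first_cluster(CodePoints) ->
    Subject = unicode:characters_to_binary(CodePoints),
    {match, [Cluster]} =
        re:run(Subject, "\\X", [unicode, {capture, first, binary}]),
    unicode:characters_to_list(Cluster).
```

       For example, "e" followed by a combining acute accent (16#301) is
       returned as one two-code-point cluster, and CR followed by LF is not
       split.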

       PCRE's additional properties

       As well as the standard Unicode properties described above,  PCRE  sup-
       ports four more that make it possible to	convert	traditional escape se-
       quences such as \w and \s and POSIX character classes  to  use  Unicode
       properties.  PCRE  uses	these non-standard, non-Perl properties	inter-
       nally when PCRE_UCP is set. However, they may also be used  explicitly.
       These properties are:

           Xan  Any alphanumeric character
           Xps  Any POSIX space character
           Xsp  Any Perl space character
           Xwd  Any Perl "word" character

       Xan  matches  characters	that have either the L (letter)	or the N (num-
       ber) property. Xps matches the characters tab, linefeed,	vertical  tab,
       form  feed,  or carriage	return,	and any	other character	that has the Z
       (separator) property. Xsp is the	same as	Xps, except that vertical  tab
       is excluded. Xwd	matches	the same characters as Xan, plus underscore.

       There  is another non-standard property,	Xuc, which matches any charac-
       ter that	can be represented by a	Universal Character Name  in  C++  and
       other  programming  languages.  These are the characters	$, @, `	(grave
       accent),	and all	characters with	Unicode	code points  greater  than  or
       equal  to U+00A0, except	for the	surrogates U+D800 to U+DFFF. Note that
       most base (ASCII) characters are	excluded. (Universal  Character	 Names
       are  of	the  form \uHHHH or \UHHHHHHHH where H is a hexadecimal	digit.
       Note that the Xuc property does not match these sequences but the char-
       acters that they	represent.)

       Resetting the match start

       The  escape sequence \K causes any previously matched characters	not to
       be included in the final matched sequence. For example, the pattern:

         foo\Kbar

       matches "foobar", but reports that it has matched "bar".	 This  feature
       is  similar  to	a  lookbehind assertion	(described below). However, in
       this case, the part of the subject before the real match	does not  have
       to  be of fixed length, as lookbehind assertions	do. The	use of \K does
       not interfere with the setting of  captured  substrings.	 For  example,
       when the pattern

         (foo)\Kbar

       matches "foobar", the first substring is	still set to "foo".

       Perl  documents	that  the use of \K within assertions is "not well de-
       fined". In PCRE,	\K is acted upon when it occurs	inside positive	asser-
       tions, but is ignored in	negative assertions.
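
       The behaviour of \K can be sketched with a thin wrapper around
       re:run/3. The module keep_demo and the function run/2 are illustrative
       names; note the doubled backslashes required in Erlang string literals.

```erlang
-module(keep_demo).
-export([run/2]).

%% Returns the whole match followed by all captured substrings, as lists.
run(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, all, list}]).
```

       keep_demo:run("foobar", "(foo)\\Kbar") reports "bar" as the overall
       match while the first substring is still "foo". Unlike a lookbehind,
       the part before \K may have variable length, as in "\\d+\\.\\K\\d+".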

       Simple assertions

       The  final use of backslash is for certain simple assertions. An	asser-
       tion specifies a	condition that has to be met at	a particular point  in
       a  match, without consuming any characters from the subject string. The
       use of subpatterns for more complicated assertions is described	below.
       The backslashed assertions are:

           \b  matches at a word boundary
           \B  matches when not at a word boundary
           \A  matches at the start of the subject
           \Z  matches at the end of the subject, and also before a newline
               at the end of the subject
           \z  matches only at the end of the subject
           \G  matches at the first matching position in the subject

       Inside a	character class, \b has	a different meaning;  it  matches  the
       backspace  character.  If  any  other  of these assertions appears in a
       character class,	by default it matches the corresponding	literal	 char-
       acter (for example, \B matches the letter B).

       A  word	boundary is a position in the subject string where the current
       character and the previous character do not both	match \w or  \W	 (i.e.
       one  matches  \w	 and the other matches \W), or the start or end	of the
       string if the first or last character matches \w,  respectively.	 In  a
       UTF  mode,  the meanings	of \w and \W can be changed by setting the ucp
       option. When this is done, it also affects \b and \B. Neither PCRE  nor
       Perl has	a separate "start of word" or "end of word" metasequence. How-
       ever, whatever follows \b normally determines which it is. For example,
       the fragment \ba	matches	"a" at the start of a word.

       The  \A,	 \Z,  and \z assertions	differ from the	traditional circumflex
       and dollar (described in	the next section) in that they only ever match
       at  the	very start and end of the subject string, whatever options are
       set. Thus, they are independent of multiline mode. These	 three	asser-
       tions  are  not	affected by the	notbol or noteol options, which	affect
       only the	behaviour of the circumflex and	 dollar	 metacharacters.  How-
       ever,  if  the startoffset argument of re:run/3 is non-zero, indicating
       that matching is	to start at a point other than the  beginning  of  the
       subject,	 \A  can never match. The difference between \Z	and \z is that
       \Z matches before a newline at the end of the string as well as at  the
       very end, whereas \z matches only at the	end.
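
       The following sketch exercises \b, \B, \A, \Z, and \z. The module
       simple_assert_demo and the function matches/2,3 are hypothetical
       helpers, not part of this module.

```erlang
-module(simple_assert_demo).
-export([matches/2, matches/3]).

%% True if Pattern matches somewhere in Subject.
matches(Subject, Pattern) ->
    matches(Subject, Pattern, []).

matches(Subject, Pattern, Options) ->
    re:run(Subject, Pattern, [{capture, none} | Options]) =:= match.
```

       For example, "abc\\Z" matches the subject "abc\n" but "abc\\z" does
       not, and "\\Aabc" does not match "x\nabc" even when multiline is set,
       whereas "^abc" does.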

       The  \G assertion is true only when the current matching	position is at
       the start point of the match, as	specified by the startoffset  argument
       of  re:run/3.  It differs from \A when the value	of startoffset is non-
       zero. By	calling	re:run/3 multiple times	 with  appropriate  arguments,
       you  can	 mimic Perl's /g option, and it	is in this kind	of implementa-
       tion where \G can be useful.

       Note, however, that PCRE's interpretation of \G,	as the	start  of  the
       current match, is subtly	different from Perl's, which defines it	as the
       end of the previous match. In Perl, these can  be  different  when  the
       previously  matched  string was empty. Because PCRE does	just one match
       at a time, it cannot reproduce this behaviour.

       If all the alternatives of a pattern begin with \G, the	expression  is
       anchored	to the starting	match position,	and the	"anchored" flag	is set
       in the compiled regular expression.
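
       A sketch of \G together with a non-zero start offset, given here
       through the {offset, N} option of re:run/3. The module g_assert_demo
       and both functions are illustrative; consecutive/3 is only one
       possible way to mimic an anchored /g loop.

```erlang
-module(g_assert_demo).
-export([at_offset/3, consecutive/3]).

%% Run Pattern against Subject, starting the match attempt at Offset.
at_offset(Subject, Pattern, Offset) ->
    re:run(Subject, Pattern, [{offset, Offset}]).

%% Collect consecutive non-empty matches of Pattern, each one attempted
%% at the position where the previous match ended. With a pattern that
%% begins with \G, collection stops at the first gap.
consecutive(Subject, Pattern, Offset) ->
    case re:run(Subject, Pattern, [{offset, Offset}]) of
        {match, [{Start, Len} | _]} when Len > 0 ->
            [{Start, Len} | consecutive(Subject, Pattern, Start + Len)];
        _ ->
            []
    end.
```

       With the subject "aaba", the pattern "\\Ga" yields only the two
       leading matches, because the third attempt starts at "b" and \G does
       not allow the match to move forward.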

       The circumflex and dollar  metacharacters  are  zero-width  assertions.
       That  is,  they test for	a particular condition being true without con-
       suming any characters from the subject string.

       Outside a character class, in the default matching mode,	the circumflex
       character  is  an  assertion  that is true only if the current matching
       point is	at the start of	the subject string. If the  startoffset	 argu-
       ment  of	re:run/3 is non-zero, circumflex can never match if the	multi-
       line option is unset. Inside a character	class, circumflex has  an  en-
       tirely different	meaning	(see below).

       Circumflex  need	 not be	the first character of the pattern if a	number
       of alternatives are involved, but it should be the first	thing in  each
       alternative  in	which  it appears if the pattern is ever to match that
       branch. If all possible alternatives start with a circumflex, that  is,
       if  the	pattern	 is constrained	to match only at the start of the sub-
       ject, it	is said	to be an "anchored" pattern.  (There  are  also	 other
       constructs that can cause a pattern to be anchored.)

       The  dollar  character is an assertion that is true only	if the current
       matching	point is at the	end of the subject string, or immediately  be-
       fore  a	newline	 at the	end of the string (by default).	Note, however,
       that it does not	actually match the newline. Dollar  need  not  be  the
       last character of the pattern if	a number of alternatives are involved,
       but it should be	the last item in any branch in which it	appears.  Dol-
       lar has no special meaning in a character class.

       The  meaning  of	 dollar	 can be	changed	so that	it matches only	at the
       very end	of the string, by setting the dollar_endonly option at compile
       time. This does not affect the \Z assertion.

       The meanings of the circumflex and dollar characters are	changed	if the
       multiline option	is set.	When this is the case,	a  circumflex  matches
       immediately after internal newlines as well as at the start of the sub-
       ject string. It does not	match after a newline that ends	the string.  A
       dollar  matches	before	any  newlines in the string, as	well as	at the
       very end, when multiline is set. When newline is specified as the
       two-character sequence CRLF, isolated CR and LF characters do not
       indicate newlines.

       For example, the	pattern	/^abc$/	matches	the subject string  "def\nabc"
       (where  \n  represents a	newline) in multiline mode, but	not otherwise.
       Consequently, patterns that are anchored	in single  line	 mode  because
       all  branches  start  with  ^ are not anchored in multiline mode, and a
       match for circumflex is	possible  when	the  startoffset  argument  of
       re:run/3	is non-zero. The dollar_endonly	option is ignored if multiline
       is set.

       Note that the sequences \A, \Z, and \z can be used to match  the	 start
       and  end	of the subject in both modes, and if all branches of a pattern
       start with \A it	is always anchored, whether or not multiline is	set.
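
       The interaction of circumflex, dollar, multiline, dollar_endonly, and
       a start offset can be sketched as a list of labelled calls. The module
       anchor_demo and the labels are illustrative only.

```erlang
-module(anchor_demo).
-export([demo/0]).

%% Each tuple pairs a label with the result of one re:run/3 call.
demo() ->
    [{single_line,     re:run("def\nabc", "^abc$", [])},
     {multiline,       re:run("def\nabc", "^abc$", [multiline])},
     {before_final_nl, re:run("abc\n", "abc$", [])},
     {dollar_endonly,  re:run("abc\n", "abc$", [dollar_endonly])},
     {endonly_ignored, re:run("abc\ndef", "abc$",
                              [multiline, dollar_endonly])},
     {offset_single,   re:run("def\nabc", "^abc", [{offset, 4}])},
     {offset_multi,    re:run("def\nabc", "^abc",
                              [multiline, {offset, 4}])}].
```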

       Outside a character class, a dot	in the pattern matches any one charac-
       ter  in	the subject string except (by default) a character that	signi-
       fies the	end of a line.

       When a line ending is defined as	a single character, dot	never  matches
       that  character;	when the two-character sequence	CRLF is	used, dot does
       not match CR if it is immediately followed  by  LF,  but	 otherwise  it
       matches	all characters (including isolated CRs and LFs). When any Uni-
       code line endings are being recognized, dot does	not match CR or	LF  or
       any of the other	line ending characters.

       The  behaviour  of  dot	with regard to newlines	can be changed.	If the
       dotall option is	set, a dot matches any one character,  without	excep-
       tion.  If  the  two-character  sequence	CRLF is	present	in the subject
       string, it takes	two dots to match it.

       The handling of dot is entirely independent of the handling of  circum-
       flex  and  dollar,  the	only relationship being	that they both involve
       newlines. Dot has no special meaning in a character class.

       The escape sequence \N behaves like a dot, except that it  is  not  af-
       fected  by the PCRE_DOTALL option. In other words, it matches any char-
       acter except one	that signifies the end of a line. Perl also uses \N to
       match characters	by name; PCRE does not support this.
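
       A sketch of how dot and \N treat line endings, assuming the default
       newline convention of this module is LF. The module dot_demo and the
       labels are hypothetical.

```erlang
-module(dot_demo).
-export([demo/0]).

%% Each tuple pairs a label with match | nomatch.
demo() ->
    None = [{capture, none}],
    [{dot_vs_lf,      re:run("a\nb", "a.b", None)},
     {dotall,         re:run("a\nb", "a.b", [dotall | None])},
     {backslash_n,    re:run("a\nb", "a\\Nb", [dotall | None])},
     {crlf_two_dots,  re:run("a\r\nb", "a..b", [dotall | None])},
     {crlf_one_dot,   re:run("a\r\nb", "a.b", [dotall | None])},
     {cr_before_lf,   re:run("a\r\nb", "a.", [{newline, crlf} | None])},
     {isolated_cr,    re:run("a\rb", "a.b", [{newline, crlf} | None])}].
```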

       Outside	a character class, the escape sequence \C matches any one data
       unit, whether or	not a UTF mode is set. One data	unit is	one byte.  Un-
       like  a	dot,  \C always	matches	line-ending characters.	The feature is
       provided	in Perl	in order to match individual bytes in UTF-8 mode,  but
       it is unclear how it can	usefully be used. Because \C breaks up charac-
       ters into individual data units,	matching one unit with	\C  in	a  UTF
       mode  means  that the rest of the string	may start with a malformed UTF
       character. This has undefined results, because PCRE assumes that	it  is
       dealing with valid UTF strings.

       PCRE  does  not	allow \C to appear in lookbehind assertions (described
       below) in a UTF mode, because this would	make it	impossible  to	calcu-
       late the	length of the lookbehind.

       In general, the \C escape sequence is best avoided. However, one	way of
       using it	that avoids the	problem	of malformed UTF characters is to  use
       a  lookahead to check the length	of the next character, as in this pat-
       tern, which could be used with a	UTF-8 string (ignore white  space  and
       line breaks):

         (?| (?=[\x00-\x7f])(\C) |
             (?=[\x80-\x{7ff}])(\C)(\C) |
             (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
             (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))

       A  group	 that starts with (?| resets the capturing parentheses numbers
       in each alternative (see	"Duplicate Subpattern Numbers" below). The as-
       sertions	at the start of	each branch check the next UTF-8 character for
       values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
       character's individual bytes are then captured by the appropriate
       number of groups.

       An opening square bracket introduces a character	class, terminated by a
       closing square bracket. A closing square	bracket	on its own is not spe-
       cial by default.	However, if the	PCRE_JAVASCRIPT_COMPAT option is  set,
       a lone closing square bracket causes a compile-time error. If a closing
       square bracket is required as a member of the class, it should  be  the
       first  data  character  in  the	class (after an	initial	circumflex, if
       present)	or escaped with	a backslash.

       A character class matches a single character in the subject. In	a  UTF
       mode,  the  character  may  be  more than one data unit long. A matched
       character must be in the	set of characters defined by the class,	unless
       the  first  character in	the class definition is	a circumflex, in which
       case the	subject	character must not be in the set defined by the	class.
       If  a  circumflex is actually required as a member of the class,	ensure
       it is not the first character, or escape	it with	a backslash.

       For example, the	character class	[aeiou]	matches	any lower case	vowel,
       while  [^aeiou]	matches	 any character that is not a lower case	vowel.
       Note that a circumflex is just a	convenient notation for	specifying the
       characters  that	 are in	the class by enumerating those that are	not. A
       class that starts with a	circumflex is not an assertion;	it still  con-
       sumes  a	 character  from the subject string, and therefore it fails if
       the current pointer is at the end of the	string.

       In UTF-8 mode, characters with values greater than 255 (0xff) can be
       included in a class as a literal string of data units, or by using the
       \x{ escaping mechanism.

       When caseless matching is set, any letters in a	class  represent  both
       their  upper  case  and lower case versions, so for example, a caseless
       [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not
       match  "A", whereas a caseful version would. In a UTF mode, PCRE	always
       understands the concept of case for characters whose  values  are  less
       than  256, so caseless matching is always possible. For characters with
       higher values, the concept of case is supported	if  PCRE  is  compiled
       with  Unicode  property	support, but not otherwise. If you want	to use
       caseless	matching in a UTF mode for characters 256 and above, you  must
       ensure  that  PCRE is compiled with Unicode property support as well as
       with UTF	support.

       Characters that might indicate line breaks are  never  treated  in  any
       special	way  when matching character classes, whatever line-ending se-
       quence is in use, and whatever setting of the PCRE_DOTALL and PCRE_MUL-
       TILINE  options	is  used.  A  class such as [^a] always	matches	one of
       these characters.

       The minus (hyphen) character can	be used	to specify a range of  charac-
       ters  in	 a  character class. For example, [d-m]	matches	any letter be-
       tween d and m, inclusive. If a minus character is required in a	class,
       it  must	 be  escaped with a backslash or appear	in a position where it
       cannot be interpreted as	indicating a range, typically as the first  or
       last character in the class.

       It is not possible to have the literal character	"]" as the end charac-
       ter of a	range. A pattern such as [W-]46] is interpreted	as a class  of
       two  characters ("W" and	"-") followed by a literal string "46]", so it
       would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a
       backslash  it is	interpreted as the end of range, so [W-\]46] is	inter-
       preted as a class containing a range followed by	two other  characters.
       The  octal or hexadecimal representation	of "]" can also	be used	to end
       a range.
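
       The rules for negation, hyphens, and closing square brackets can be
       tried out with a small helper. The module class_demo and the function
       in_class/2,3 are illustrative; the helper wraps the pattern so that
       it has to match the whole subject.

```erlang
-module(class_demo).
-export([in_class/2, in_class/3]).

%% True if Pattern matches the whole of Subject.
in_class(Subject, Pattern) ->
    in_class(Subject, Pattern, []).

in_class(Subject, Pattern, Options) ->
    Anchored = ["\\A(?:", Pattern, ")\\z"],
    re:run(Subject, Anchored, [{capture, none} | Options]) =:= match.
```

       Note in particular that in_class("", "[^aeiou]") is false: a negated
       class still has to consume one character.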

       Ranges operate in the collating sequence	of character values. They  can
       also   be  used	for  characters	 specified  numerically,  for  example
       [\000-\037]. Ranges can include any characters that are valid  for  the
       current mode.

       If a range that includes	letters	is used	when caseless matching is set,
       it matches the letters in either	case. For example, [W-c] is equivalent
       to  [][\\^_`wxyzabc],  matched  caselessly,  and	 in a non-UTF mode, if
       character tables	for a French locale are	in  use,  [\xc8-\xcb]  matches
       accented	 E  characters	in both	cases. In UTF modes, PCRE supports the
       concept of case for characters with values greater than 255  only  when
       it is compiled with Unicode property support.

       The  character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
       \w, and \W may appear in	a character class, and add the characters that
       they  match to the class. For example, [\dABCDEF] matches any hexadeci-
       mal digit. In UTF modes,	the ucp	option affects the meanings of \d, \s,
       \w and their upper case partners, just as it does when they appear out-
       side a character	class, as described in the section  entitled  "Generic
       character  types" above.	The escape sequence \b has a different meaning
       inside a	character class; it matches the	backspace character.  The  se-
       quences	\B,  \N,  \R, and \X are not special inside a character	class.
       Like any	other unrecognized escape sequences, they are treated  as  the
       literal characters "B", "N", "R", and "X".

       A  circumflex  can  conveniently	 be used with the upper	case character
       types to	specify	a more restricted set of characters than the  matching
       lower  case  type.  For example,	the class [^\W_] matches any letter or
       digit, but not underscore, whereas [\w] includes	underscore. A positive
       character class should be read as "something OR something OR ..." and a
       negative	class as "NOT something	AND NOT	something AND NOT ...".
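
       Escape sequences inside a class can be sketched in the same way. The
       module class_escape_demo and the function whole/2 are hypothetical
       names used only for this example.

```erlang
-module(class_escape_demo).
-export([whole/2]).

%% True if Pattern matches the whole of Subject.
whole(Subject, Pattern) ->
    Anchored = ["\\A(?:", Pattern, ")\\z"],
    re:run(Subject, Anchored, [{capture, none}]) =:= match.
```

       Here "[^\\W_]" reads as "NOT a non-word character AND NOT underscore",
       so it accepts letters and digits only, and "[\\b]" matches the
       backspace character (code 8) rather than a word boundary.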

       The only	metacharacters that are	recognized in  character  classes  are
       backslash,  hyphen  (only  where	 it can	be interpreted as specifying a
       range), circumflex (only	at the start), opening	square	bracket	 (only
       when  it	can be interpreted as introducing a POSIX class	name - see the
       next section), and the terminating closing square bracket. However, es-
       caping other non-alphanumeric characters	does no	harm.

       Perl supports the POSIX notation	for character classes. This uses names
       enclosed	by [: and :] within the	enclosing square brackets.  PCRE  also
       supports this notation. For example,

         [01[:alpha:]%]

       matches "0", "1", any alphabetic	character, or "%". The supported class
       names are:

           alnum   letters and digits
           alpha   letters
           ascii   character codes 0 - 127
           blank   space or tab only
           cntrl   control characters
           digit   decimal digits (same as \d)
           graph   printing characters, excluding space
           lower   lower case letters
           print   printing characters, including space
           punct   printing characters, excluding letters and digits and space
           space   whitespace (not quite the same as \s)
           upper   upper case letters
           word    "word" characters (same as \w)
           xdigit  hexadecimal digits

       The "space" characters are HT (9), LF (10), VT (11), FF (12), CR	 (13),
       and  space  (32). Notice	that this list includes	the VT character (code
       11). This makes "space" different to \s,	which does not include VT (for
       Perl compatibility).

       The  name  "word"  is  a	Perl extension,	and "blank" is a GNU extension
       from Perl 5.8. Another Perl extension is	negation, which	 is  indicated
       by a ^ character after the colon. For example,

         [12[:^digit:]]

       matches	"1", "2", or any non-digit. PCRE (and Perl) also recognize the
       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
       these are not supported,	and an error is	given if they are encountered.
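
       A sketch of POSIX class names in use, including the compile error for
       collating elements. The module posix_class_demo and its functions are
       illustrative names; re:compile/1 is the function documented on this
       page.

```erlang
-module(posix_class_demo).
-export([whole/2, compiles/1]).

%% True if Pattern matches the whole of Subject.
whole(Subject, Pattern) ->
    Anchored = ["\\A(?:", Pattern, ")\\z"],
    re:run(Subject, Anchored, [{capture, none}]) =:= match.

%% True if Pattern is accepted by re:compile/1.
compiles(Pattern) ->
    case re:compile(Pattern) of
        {ok, _MP} -> true;
        {error, _ErrSpec} -> false
    end.
```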

       By default, in UTF modes, characters with values greater than 255 do
       not match any of the POSIX character classes. However, if the PCRE_UCP
       option is passed to pcre_compile() (in this module, the ucp compile
       option), some of the classes are changed so that Unicode character
       properties are used. This is achieved by replacing the POSIX classes by
       other sequences, as follows:

           [:alnum:]  becomes \p{Xan}
           [:alpha:]  becomes \p{L}
           [:blank:]  becomes \h
           [:digit:]  becomes \p{Nd}
           [:lower:]  becomes \p{Ll}
           [:space:]  becomes \p{Xps}
           [:upper:]  becomes \p{Lu}
           [:word:]   becomes \p{Xwd}

       Negated	versions,  such	 as [:^alpha:] use \P instead of \p. The other
       POSIX classes are unchanged, and	match only characters with code	points
       less than 256.

       Vertical	 bar characters	are used to separate alternative patterns. For
       example, the pattern

         gilbert|sullivan

       matches either "gilbert"	or "sullivan". Any number of alternatives  may
       appear,	and  an	 empty	alternative  is	 permitted (matching the empty
       string).	The matching process tries each	alternative in turn, from left
       to  right, and the first	one that succeeds is used. If the alternatives
       are within a subpattern (defined	below),	"succeeds" means matching  the
       rest of the main	pattern	as well	as the alternative in the subpattern.
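
       The left-to-right ordering of alternatives can be sketched as
       follows. The module alt_demo and the function first_match/2 are
       hypothetical names.

```erlang
-module(alt_demo).
-export([first_match/2]).

%% Returns the text of the first overall match as a list.
first_match(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, first, list}]).
```

       With the subject "abcd", the pattern "ab|abcd" stops at "ab" because
       the first alternative already succeeds, whereas "^(?:ab|abcd)$" has to
       fall through to the second alternative so that the rest of the
       pattern can match.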

       The  settings  of the caseless, multiline, dotall, and extended options
       (which are Perl-compatible) can be changed from within the pattern by a
       sequence of Perl option letters enclosed between "(?" and ")". The
       option letters are:

           i  for caseless
           m  for multiline
           s  for dotall
           x  for extended

       For example, (?im) sets caseless, multiline matching. It	is also	possi-
       ble to unset these options by preceding the letter with a hyphen, and a
       combined	setting	and unsetting such as (?im-sx),	 which	sets  caseless
       and  multiline  while unsetting dotall and extended, is also permitted.
       If a letter appears both before and after the hyphen, the option is
       unset.

       The  PCRE-specific options dupnames, ungreedy, and extra	can be changed
       in the same way as the Perl-compatible options by using the  characters
       J, U and	X respectively.

       When  one of these option changes occurs	at top level (that is, not in-
       side subpattern parentheses), the change	applies	to  the	 remainder  of
       the pattern that	follows. If the	change is placed right at the start of
       a pattern, PCRE extracts	it into	the global options.

       An option change	within a subpattern (see below for  a  description  of
       subpatterns) affects only that part of the subpattern that follows it,
       so

       (a(?i)b)c

       matches abc and aBc and no other strings (assuming caseless is not
       used). By this means, options can be made to have different settings in
       different parts of the pattern. Any changes made	in one alternative  do
       carry on into subsequent branches within the same subpattern. For
       example,

       (a(?i)b|c)

       matches "ab", "aB", "c", and "C", even though when matching "C" the
       first  branch  is  abandoned before the option setting. This is because
       the effects of option settings happen at	compile	time. There  would  be
       some very weird behaviour otherwise.
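
       The scoping rules for in-pattern option changes can be tried out in
       the Erlang shell. This is a sketch only; the subject strings are
       assumptions chosen to exercise each rule described above.

```erlang
%% (?i) at top level applies to the remainder of the pattern.
{match, ["ABC"]} = re:run("ABC", "(?i)abc", [{capture, first, list}]).
%% Inside a subpattern the change ends at the closing parenthesis,
%% so the final "c" is still matched case-sensitively.
{match, ["aBc"]} = re:run("aBc", "(a(?i)b)c", [{capture, first, list}]).
nomatch = re:run("abC", "(a(?i)b)c").
%% The setting carries into later branches of the same subpattern.
{match, ["C"]} = re:run("C", "(a(?i)b|c)", [{capture, first, list}]).
```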

       Note:  There are	other PCRE-specific options that can be	set by the ap-
       plication when the compiling or matching	functions are called. In  some
       cases the pattern can contain special leading sequences such as (*CRLF)
       to override what	the application	has set	or what	 has  been  defaulted.
       Details	are  given  in the section entitled "Newline sequences"	above.
       There are also the (*UTF8) and (*UCP) leading  sequences	 that  can  be
       used to set UTF and Unicode property modes; they	are equivalent to set-
       ting the	unicode	and the	ucp options, respectively. The (*UTF) sequence
       is  a  generic version that can be used with any	of the libraries. How-
       ever, the application can set the never_utf option, which locks out the
       use of the (*UTF) sequences.

       Subpatterns are delimited by parentheses	(round brackets), which	can be
       nested. Turning part of a pattern into a	subpattern does	two things:

       1. It localizes a set of alternatives. For example, the pattern

       cat(aract|erpillar|)

       matches "cataract", "caterpillar", or "cat". Without the parentheses,
       it would	match "cataract", "erpillar" or	an empty string.

       2.  It  sets  up	 the  subpattern as a capturing	subpattern. This means
       that, when the complete pattern matches,	that portion  of  the  subject
       string that matched the subpattern is passed back to the	caller via the
       return value of re:run/3.

       Opening parentheses are counted from left to right (starting from 1) to
       obtain numbers for the capturing subpatterns. For example, if the string
       "the red	king" is matched against the pattern

       the ((red|white)	(king|queen))

       the captured substrings are "red	king", "red", and "king", and are num-
       bered 1,	2, and 3, respectively.

       The  fact  that	plain  parentheses  fulfil two functions is not	always
       helpful.	There are often	times when a grouping subpattern  is  required
       without	a capturing requirement. If an opening parenthesis is followed
       by a question mark and a	colon, the subpattern does not do any  captur-
       ing,  and  is  not  counted when	computing the number of	any subsequent
       capturing subpatterns. For example, if the string "the white queen"  is
       matched against the pattern

       the ((?:red|white) (king|queen))

       the captured substrings are "white queen" and "queen", and are numbered
       1 and 2.	The maximum number of capturing	subpatterns is 65535.
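
       The numbering rules can be confirmed with re:run/3. In this sketch
       the captures are requested by explicit group number, which is a
       choice made here for clarity rather than something the text requires.

```erlang
%% Parentheses localize the alternatives after "cat".
{match, ["caterpillar"]} =
    re:run("caterpillar", "cat(aract|erpillar|)", [{capture, first, list}]).
{match, ["cat"]} =
    re:run("cat", "cat(aract|erpillar|)", [{capture, first, list}]).
%% Groups are numbered by their opening parenthesis, left to right.
{match, ["red king", "red", "king"]} =
    re:run("the red king", "the ((red|white) (king|queen))",
           [{capture, [1, 2, 3], list}]).
%% (?: ... ) groups without capturing, so (king|queen) becomes group 2.
{match, ["white queen", "queen"]} =
    re:run("the white queen", "the ((?:red|white) (king|queen))",
           [{capture, [1, 2], list}]).
```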

       As a convenient shorthand, if any option	settings are required  at  the
       start  of a non-capturing subpattern, the option	letters	may appear be-
       tween the "?" and the ":". Thus the two patterns

	 * (?i:saturday|sunday)

	 * (?:(?i)saturday|sunday)

       match exactly the same set of strings. Because alternative branches are
       tried  from  left  to right, and	options	are not	reset until the	end of
       the subpattern is reached, an option setting in one branch does	affect
       subsequent branches, so the above patterns match "SUNDAY" as well as
       "Saturday".

       Perl 5.10 introduced a feature whereby each alternative in a subpattern
       uses  the same numbers for its capturing	parentheses. Such a subpattern
       starts with (?| and is itself a non-capturing subpattern. For  example,
       consider this pattern:

       (?|(Sat)ur|(Sun))day

       Because the two alternatives are inside a (?| group, both sets of
       capturing parentheses are numbered one. Thus, when the pattern matches,
       you  can	 look  at captured substring number one, whichever alternative
       matched.	This construct is useful when you want to  capture  part,  but
       not all,	of one of a number of alternatives. Inside a (?| group,	paren-
       theses are numbered as usual, but the number is reset at	the  start  of
       each  branch.  The numbers of any capturing parentheses that follow the
       subpattern start	after the highest number used in any branch. The  fol-
       lowing example is taken from the	Perl documentation. The	numbers	under-
       neath show in which buffer the captured content will be stored.

         # before  ---------------branch-reset----------- after
         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
         # 1            2         2  3        2     3     4

       A back reference	to a numbered subpattern uses the  most	 recent	 value
       that  is	 set  for that number by any subpattern. The following pattern
       matches "abcabc" or "defdef":

       /(?|(abc)|(def))\1/

       In contrast, a subroutine call to a numbered subpattern always refers
       to  the	first  one in the pattern with the given number. The following
       pattern matches "abcabc" or "defabc":

       /(?|(abc)|(def))(?1)/

       If a condition test for a subpattern's having matched refers to a non-
       unique  number, the test	is true	if any of the subpatterns of that num-
       ber have	matched.
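
       A short shell sketch of the branch reset behaviour follows. The ^
       and $ anchors are additions made here so that each subject must
       match as a whole; they are not part of the patterns in the text.

```erlang
%% Both branches of (?| ... ) store their capture in group 1.
{match, ["Sat"]} = re:run("Saturday", "(?|(Sat)ur|(Sun))day", [{capture, [1], list}]).
{match, ["Sun"]} = re:run("Sunday",   "(?|(Sat)ur|(Sun))day", [{capture, [1], list}]).
%% A back reference uses the most recent value set for group 1 ...
{match, _} = re:run("defdef", "^(?|(abc)|(def))\\1$").
%% ... while a subroutine call always uses the first group numbered 1, (abc).
{match, _} = re:run("defabc", "^(?|(abc)|(def))(?1)$").
nomatch    = re:run("defdef", "^(?|(abc)|(def))(?1)$").
```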

       An alternative approach to using	this "branch reset" feature is to  use
       duplicate named subpatterns, as described in the	next section.

       Identifying  capturing  parentheses  by number is simple, but it	can be
       very hard to keep track of the numbers in complicated  regular  expres-
       sions.  Furthermore,  if	 an  expression	 is  modified, the numbers may
       change. To help with this difficulty, PCRE supports the naming of  sub-
       patterns. This feature was not added to Perl until release 5.10.	Python
       had the feature earlier,	and PCRE introduced it at release  4.0,	 using
       the  Python syntax. PCRE	now supports both the Perl and the Python syn-
       tax. Perl allows	identically numbered  subpatterns  to  have  different
       names, but PCRE does not.

       In  PCRE,  a subpattern can be named in one of three ways: (?<name>...)
       or (?'name'...) as in Perl, or (?P<name>...) as in  Python.  References
       to  capturing parentheses from other parts of the pattern, such as back
       references, recursion, and conditions, can be made by name as  well  as
       by number.

       Names  consist  of  up  to  32 alphanumeric characters and underscores.
       Named capturing parentheses are still  allocated	 numbers  as  well  as
       names, exactly as if the	names were not present.	The capture specifica-
       tion to re:run/3	can use	named values if	they are present in the	 regu-
       lar expression.

       By  default, a name must	be unique within a pattern, but	it is possible
       to relax	this constraint	by setting  the	 dupnames  option  at  compile
       time.  (Duplicate  names	are also always	permitted for subpatterns with
       the same	number,	set up as described in the previous  section.)	Dupli-
       cate  names  can	 be useful for patterns	where only one instance	of the
       named parentheses can match. Suppose you	want to	match the  name	 of  a
       weekday,	 either	as a 3-letter abbreviation or as the full name,	and in
       both cases you want to extract the abbreviation.	This pattern (ignoring
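       the line breaks) does the job:

       (?<DN>Mon|Fri|Sun)(?:day)?|
       (?<DN>Tue)(?:sday)?|
       (?<DN>Wed)(?:nesday)?|
       (?<DN>Thu)(?:rsday)?|
       (?<DN>Sat)(?:urday)?

       There are five capturing substrings, but only one is ever set after a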
       match. (An alternative way of solving this problem is to	use a  "branch
       reset" subpattern, as described in the previous section.)

       In case of capturing named subpatterns whose names are not unique, the
       first matching occurrence (counted from left to right in the subject)
       is returned from re:run/3, if the name is specified in the values part
       of the capture statement. The all_names capturing value will match  all
       of the names in the same	way.
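
       Named captures and the dupnames option might be used as sketched
       below. The date pattern and the variable name WeekdayRE are
       illustrative assumptions; the weekday pattern is the one discussed
       above, written without line breaks.

```erlang
%% Capture by name; the same groups are still reachable by number.
{match, ["2024", "05"]} =
    re:run("2024-05", "(?<year>\\d{4})-(?P<month>\\d{2})",
           [{capture, [year, month], list}]).
{match, ["2024", "05"]} =
    re:run("2024-05", "(?<year>\\d{4})-(?P<month>\\d{2})",
           [{capture, [1, 2], list}]).
%% Duplicate names need the dupnames option at compile time.
WeekdayRE = "(?<DN>Mon|Fri|Sun)(?:day)?|(?<DN>Tue)(?:sday)?|"
            "(?<DN>Wed)(?:nesday)?|(?<DN>Thu)(?:rsday)?|(?<DN>Sat)(?:urday)?".
{error, _} = re:compile(WeekdayRE).
{match, ["Thu"]} =
    re:run("Thursday", WeekdayRE, [dupnames, {capture, ['DN'], list}]).
```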

       Warning:	You cannot use different names to distinguish between two sub-
       patterns	with the same number because PCRE uses only the	 numbers  when
       matching. For this reason, an error is given at compile time if differ-
       ent names are given to subpatterns with the same	number.	 However,  you
       can  give  the same name	to subpatterns with the	same number, even when
       dupnames	is not set.

       Repetition is specified by quantifiers, which can  follow  any  of  the
       following items:

	 * a literal data character

	 * the dot metacharacter

	 * the \C escape sequence

	 * the \X escape sequence

	 * the \R escape sequence

	 * an escape such as \d	or \pL that matches a single character

	 * a character class

	 * a back reference (see next section)

	 * a parenthesized subpattern (including assertions)

	 * a subroutine	call to	a subpattern (recursive	or otherwise)

       The  general repetition quantifier specifies a minimum and maximum num-
       ber of permitted	matches, by giving the two numbers in  curly  brackets
       (braces),  separated  by	 a comma. The numbers must be less than	65536,
       and the first must be less than or equal to the second. For example:

       z{2,4}

       matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
       special	character.  If	the second number is omitted, but the comma is
       present,	there is no upper limit; if the	second number  and  the	 comma
       are  both omitted, the quantifier specifies an exact number of required
       matches. Thus

       [aeiou]{3,}

       matches at least 3 successive vowels, but may match many more, while

       \d{8}

       matches exactly 8 digits. An opening curly bracket that appears in a
       position	 where a quantifier is not allowed, or one that	does not match
       the syntax of a quantifier, is taken as a literal character. For	 exam-
       ple, {,6} is not	a quantifier, but a literal string of four characters.

       In  Unicode  mode, quantifiers apply to characters rather than to indi-
       vidual data units. Thus,	for example, \x{100}{2}	 matches  two  charac-
       ters,  each  of	which is represented by	a two-byte sequence in a UTF-8
       string. Similarly, \X{3}	matches	three Unicode extended grapheme	 clus-
       ters,  each of which may	be several data	units long (and	they may be of
       different lengths).
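
       The quantifier forms above can be exercised as follows. This is a
       sketch; the subject strings are assumptions, and the last call
       passes the subject as a list of code points together with the
       unicode option.

```erlang
%% {2,4} is greedy: it takes four of the five "z" characters.
{match, ["zzzz"]} = re:run("zzzzz", "z{2,4}", [{capture, first, list}]).
nomatch = re:run("z", "z{2,4}").
%% {3,} has no upper limit.
{match, ["aeiou"]} = re:run("xaeioux", "[aeiou]{3,}", [{capture, first, list}]).
%% {8} requires exactly eight repetitions.
{match, ["12345678"]} = re:run("id 12345678", "\\d{8}", [{capture, first, list}]).
nomatch = re:run("1234567", "\\d{8}").
%% In Unicode mode the count applies to characters, not to bytes.
{match, [[256, 256]]} =
    re:run([256, 256, 256], "\\x{100}{2}", [unicode, {capture, first, list}]).
```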

       The quantifier {0} is permitted,	causing	the expression to behave as if
       the previous item and the quantifier were not present. This may be use-
       ful for subpatterns that	are referenced as subroutines  from  elsewhere
       in the pattern (but see also the	section	entitled "Defining subpatterns
       for use by reference only" below). Items	other  than  subpatterns  that
       have a {0} quantifier are omitted from the compiled pattern.

       For convenience, the three most common quantifiers have
       single-character abbreviations:

           *  is equivalent to {0,}

           +  is equivalent to {1,}

           ?  is equivalent to {0,1}

       It is possible to construct infinite loops by  following	 a  subpattern
       that can	match no characters with a quantifier that has no upper	limit,
       for example:

       (a?)*

       Earlier versions of Perl and PCRE used to give an error at compile time
       for  such  patterns. However, because there are cases where this	can be
       useful, such patterns are now accepted, but if any  repetition  of  the
       subpattern does in fact match no characters, the loop is forcibly
       broken.

       By default, the quantifiers are "greedy", that is, they match as much
       as  possible  (up  to  the  maximum number of permitted times), without
       causing the rest	of the pattern to fail.	The classic example  of	 where
       this gives problems is in trying	to match comments in C programs. These
       appear between /* and */	and within the comment,	 individual  *	and  /
       characters may appear. An attempt to match C comments by applying the
       pattern

       /\*.*\*/

       to the string

       /* first	comment	*/ not comment /* second comment */

       fails, because it matches the entire string owing to the	greediness  of
       the .* item.

       However,	 if  a quantifier is followed by a question mark, it ceases to
       be greedy, and instead matches the minimum number of times possible, so
       the pattern

       /\*.*?\*/

       does the right thing with the C comments. The meaning of the various
       quantifiers is not otherwise changed,  just  the	 preferred  number  of
       matches.	 Do  not  confuse  this	use of question	mark with its use as a
       quantifier in its own right. Because it has two uses, it	can  sometimes
       appear doubled, as in

       \d??\d

       which matches one digit by preference, but can match two if that is the
       only way	the rest of the	pattern	matches.

       If the ungreedy option is set (an  option  that	is  not	 available  in
       Perl),  the  quantifiers	are not	greedy by default, but individual ones
       can be made greedy by following them with a  question  mark.  In	 other
       words, it inverts the default behaviour.
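
       Greedy and lazy matching can be compared directly in the shell. In
       this sketch the variable name CommentSubject is an assumption, and
       the backslashes are doubled because the patterns are Erlang string
       literals.

```erlang
CommentSubject = "/* first comment */ not comment /* second comment */".
%% Greedy .* runs to the last "*/", so the whole subject is matched.
{match, [CommentSubject]} =
    re:run(CommentSubject, "/\\*.*\\*/", [{capture, first, list}]).
%% Lazy .*? stops at the first "*/".
{match, ["/* first comment */"]} =
    re:run(CommentSubject, "/\\*.*?\\*/", [{capture, first, list}]).
%% The ungreedy option makes plain .* behave lazily.
{match, ["/* first comment */"]} =
    re:run(CommentSubject, "/\\*.*\\*/", [ungreedy, {capture, first, list}]).
%% \d??\d prefers one digit, but takes two when the rest requires it.
{match, ["1"]}  = re:run("12", "\\d??\\d", [{capture, first, list}]).
{match, ["12"]} = re:run("12", "^\\d??\\d$", [{capture, first, list}]).
```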

       When  a	parenthesized  subpattern  is quantified with a	minimum	repeat
       count that is greater than 1 or with a limited maximum, more memory  is
       required	 for  the  compiled  pattern, in proportion to the size	of the
       minimum or maximum.

       If a pattern starts with	.* or .{0,} and	the dotall option  (equivalent
       to Perl's /s) is	set, thus allowing the dot to match newlines, the pat-
       tern is implicitly anchored, because whatever  follows  will  be	 tried
       against	every character	position in the	subject	string,	so there is no
       point in	retrying the overall match at any position  after  the	first.
       PCRE normally treats such a pattern as though it	were preceded by \A.

       In  cases  where	 it  is	known that the subject string contains no new-
       lines, it is worth setting dotall in order to obtain this optimization,
       or alternatively	using ^	to indicate anchoring explicitly.

       However,	 there	are  some cases	where the optimization cannot be used.
       When .* is inside capturing parentheses that are	the subject of a  back
       reference elsewhere in the pattern, a match at the start	may fail where
       a later one succeeds. Consider, for example:

       (.*)abc\1

       If the subject is "xyz123abc123" the match point is the fourth
       character. For this reason, such a pattern is not implicitly anchored.
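
       The match point can be seen in the offsets that re:run/2 returns by
       default. This is a sketch; offsets are zero-based, so the fourth
       character is reported as offset 3.

```erlang
%% Whole match "123abc123" starts at offset 3; group 1 is "123".
{match, [{3, 9}, {3, 3}]} = re:run("xyz123abc123", "(.*)abc\\1").
```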

       Another	case where implicit anchoring is not applied is	when the lead-
       ing .* is inside	an atomic group. Once again, a match at	the start  may
       fail where a later one succeeds. Consider this pattern:

       (?>.*?a)b

       It matches "ab" in the subject "aab". The use of the backtracking
       control verbs (*PRUNE) and (*SKIP) also disables this optimization.

       When a capturing	subpattern is repeated,	the value captured is the sub-
       string that matched the final iteration. For example, after

       (tweedle[dume]{3}\s*)+

       has matched "tweedledum tweedledee" the value of the captured substring
       is "tweedledee".	However, if there are  nested  capturing  subpatterns,
       the corresponding captured values may have been set in previous
       iterations. For example, after

       /(a|(b))+/

       matches "aba" the value of the second captured substring is "b".
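
       Both cases can be reproduced with explicit capture lists. This is a
       sketch that assumes nothing beyond the two patterns and subjects
       given above.

```erlang
%% Group 1 holds the text of the final iteration only.
{match, ["tweedledee"]} =
    re:run("tweedledum tweedledee", "(tweedle[dume]{3}\\s*)+",
           [{capture, [1], list}]).
%% Group 2 keeps the "b" set in the second of the three iterations.
{match, ["a", "b"]} = re:run("aba", "(a|(b))+", [{capture, [1, 2], list}]).
```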

       With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
       repetition,  failure  of	what follows normally causes the repeated item
       to be re-evaluated to see if a different	number of repeats  allows  the
       rest  of	 the pattern to	match. Sometimes it is useful to prevent this,
       either to change the nature of the match, or to cause it to fail earlier
       than  it	otherwise might, when the author of the	pattern	knows there is
       no point	in carrying on.

       Consider, for example, the pattern \d+foo when applied to the subject
       line

       123456bar

       After matching all 6 digits and then failing to match "foo", the normal
       action of the matcher is	to try again with only 5 digits	 matching  the
       \d+  item,  and	then  with  4,	and  so	on, before ultimately failing.
       "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
       the  means for specifying that once a subpattern	has matched, it	is not
       to be re-evaluated in this way.

       If we use atomic	grouping for the previous example, the	matcher	 gives
       up  immediately	on failing to match "foo" the first time. The notation
       is a kind of special parenthesis, starting with (?> as in this example:

       (?>\d+)foo

       This kind of parenthesis "locks up" the part of the pattern it contains
       once  it	 has  matched,	and a failure further into the pattern is pre-
       vented from backtracking	into it.  Backtracking	past  it  to  previous
       items, however, works as	normal.

       An  alternative	description  is	that a subpattern of this type matches
       the string of characters	that an	 identical  standalone	pattern	 would
       match, if anchored at the current point in the subject string.

       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
       such as the above example can be	thought	of as a	maximizing repeat that
       must  swallow  everything  it can. So, while both \d+ and \d+? are pre-
       pared to	adjust the number of digits they match in order	 to  make  the
       rest of the pattern match, (?>\d+) can only match an entire sequence of
       digits.

       Atomic groups in general can of course contain arbitrarily complicated
       subpatterns,  and  can  be  nested. However, when the subpattern	for an
       atomic group is just a single repeated item, as in the example above, a
       simpler	notation,  called  a "possessive quantifier" can be used. This
       consists	of an additional + character  following	 a  quantifier.	 Using
       this notation, the previous example can be rewritten as

       \d++foo

       Note that a possessive quantifier can be used with an entire group, for
       example:

       (abc|xyz){2,3}+

       Possessive quantifiers are always greedy; the setting of the ungreedy
       option is ignored. They are a convenient	notation for the simpler forms
       of atomic group.	However, there is no difference	in the	meaning	 of  a
       possessive quantifier and the equivalent	atomic group, though there may
       be a performance difference; possessive quantifiers should be slightly
       faster.

       The possessive quantifier syntax is an extension to the Perl 5.8
       syntax. Jeffrey Friedl originated the idea (and the name) in the first
       edition of his book. Mike McCloskey liked it, so	implemented it when he
       built Sun's Java	package, and PCRE copied it from there.	It  ultimately
       found its way into Perl at release 5.10.

       PCRE has	an optimization	that automatically "possessifies" certain sim-
       ple pattern constructs. For example, the	sequence  A+B  is  treated  as
       A++B  because  there is no point	in backtracking	into a sequence	of A's
       when B must follow.
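
       The difference between an ordinary repeat and an atomic or
       possessive one shows up when the following item needs a character
       that the repeat has already consumed. This is a sketch; the
       subject "1234" and the patterns ending in "4" are illustrative
       assumptions.

```erlang
%% Neither form can match here; the atomic versions simply fail sooner.
nomatch = re:run("123456bar", "(?>\\d+)foo").
nomatch = re:run("123456bar", "\\d++foo").
%% \d+ gives back the final "4" so that the literal 4 can match ...
{match, ["1234"]} = re:run("1234", "\\d+4", [{capture, first, list}]).
%% ... but an atomic group or possessive quantifier never gives it back.
nomatch = re:run("1234", "(?>\\d+)4").
nomatch = re:run("1234", "\\d++4").
```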

       When a pattern contains an unlimited repeat inside  a  subpattern  that
       can  itself  be	repeated  an  unlimited	number of times, the use of an
       atomic group is the only	way to avoid some  failing  matches  taking  a
       very long time indeed. The pattern

       (\D+|<\d+>)*[!?]

       matches an unlimited number of substrings that either consist of non-
       digits, or digits enclosed in <>, followed by either ! or  ?.  When  it
       matches, it runs quickly. However, if it is applied to

       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

       it takes a long time before reporting failure. This is because the
       string can be divided between the internal \D+ repeat and the  external
       *  repeat in a large number of ways, and	all have to be tried. (The ex-
       ample uses [!?] rather than a single character at the end, because both
       PCRE  and Perl have an optimization that	allows for fast	failure	when a
       single character	is used. They remember the last	single character  that
       is  required  for  a  match, and	fail early if it is not	present	in the
       string.)	If the pattern is changed so that it  uses  an	atomic	group,
       like this:

       ((?>\D+)|<\d+>)*[!?]

       sequences of non-digits cannot be broken, and failure happens quickly.

       Outside a character class, a backslash followed by a digit greater than
       0 (and possibly further digits) is a back reference to a	capturing sub-
       pattern	earlier	 (that is, to its left)	in the pattern,	provided there
       have been that many previous capturing left parentheses.

       However,	if the decimal number following	the backslash is less than 10,
       it  is  always  taken  as a back	reference, and causes an error only if
       there are not that many capturing left parentheses in the  entire  pat-
       tern.  In  other	words, the parentheses that are	referenced need	not be
       to the left of the reference for	numbers	less than 10. A	"forward  back
       reference"  of  this  type can make sense when a	repetition is involved
       and the subpattern to the right has participated in an earlier
       iteration.

       It is not possible to have a numerical "forward back reference" to a
       subpattern whose	number is 10 or	more using this	syntax because	a  se-
       quence  such as \50 is interpreted as a character defined in octal. See
       the subsection entitled "Non-printing characters" above for further de-
       tails of	the handling of	digits following a backslash. There is no such
       problem when named parentheses are used.	A back reference to  any  sub-
       pattern is possible using named parentheses (see	below).

       Another	way  of	 avoiding  the ambiguity inherent in the use of	digits
       following a backslash is	to use the \g  escape  sequence.  This	escape
       must be followed	by an unsigned number or a negative number, optionally
       enclosed	in braces. These examples are all identical:

	 * (ring), \1

	 * (ring), \g1

	 * (ring), \g{1}

       An unsigned number specifies an absolute	reference without the  ambigu-
       ity that	is present in the older	syntax.	It is also useful when literal
       digits follow the reference. A negative number is a relative reference.
       Consider this example:

       (abc(def)ghi)\g{-1}

       The sequence \g{-1} is a reference to the most recently started
       capturing subpattern before \g, that is, it is equivalent to \2 in this
       example. Similarly, \g{-2} would be equivalent to \1. The use of relative
       references can be helpful in long patterns, and also in	patterns  that
       are  created  by	 joining  together  fragments  that contain references
       within themselves.

       A back reference	matches	whatever actually matched the  capturing  sub-
       pattern	in  the	 current subject string, rather	than anything matching
       the subpattern itself (see "Subpatterns as subroutines" below for a way
       of doing	that). So the pattern

       (sens|respons)e and \1ibility

       matches	"sense and sensibility"	and "response and responsibility", but
       not "sense and responsibility". If caseful matching is in force at  the
       time of the back reference, the case of letters is relevant. For
       example,

       ((?i)rah)\s+\1

       matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
       original	capturing subpattern is	matched	caselessly.
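
       The back reference behaviour described so far can be checked in the
       shell. This is a sketch; TitleRE is a variable name chosen here, and
       the subject "abcdefghidef" is an assumption that fits the \g{-1}
       example.

```erlang
TitleRE = "(sens|respons)e and \\1ibility".
%% \1 must repeat the text that group 1 actually matched.
{match, _} = re:run("sense and sensibility", TitleRE).
{match, _} = re:run("response and responsibility", TitleRE).
nomatch    = re:run("sense and responsibility", TitleRE).
%% The group is matched caselessly, but \1 is compared case-sensitively.
{match, _} = re:run("RAH RAH", "((?i)rah)\\s+\\1").
nomatch    = re:run("RAH rah", "((?i)rah)\\s+\\1").
%% \g{-1} refers to the most recently started group, here (def).
{match, ["abcdefghidef"]} =
    re:run("abcdefghidef", "(abc(def)ghi)\\g{-1}", [{capture, first, list}]).
```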

       There  are  several  different ways of writing back references to named
       subpatterns. The	.NET syntax \k{name} and the Perl syntax  \k<name>  or
       \k'name'	 are supported,	as is the Python syntax	(?P=name). Perl	5.10's
       unified back reference syntax, in which \g can be used for both numeric
       and named references, is	also supported.	We could rewrite the above ex-
       ample in	any of the following ways:

	 * (?<p1>(?i)rah)\s+\k<p1>

	 * (?'p1'(?i)rah)\s+\k{p1}

	 * (?P<p1>(?i)rah)\s+(?P=p1)

	 * (?<p1>(?i)rah)\s+\g{p1}

       A subpattern that is referenced by name may appear in the  pattern  be-
       fore or after the reference.

       There  may be more than one back	reference to the same subpattern. If a
       subpattern has not actually been	used in	a particular match,  any  back
       references to it always fail. For example, the pattern

       (a|(bc))\2

       always fails if it starts to match "a" rather than "bc". Because there
       may be many capturing parentheses in a pattern,	all  digits  following
       the  backslash  are taken as part of a potential	back reference number.
       If the pattern continues	with a digit character,	some delimiter must be
       used  to	 terminate  the	back reference.	If the extended	option is set,
       this can	be whitespace. Otherwise an empty comment (see "Comments"  be-
       low) can	be used.

       Recursive back references

       A  back reference that occurs inside the	parentheses to which it	refers
       fails when the subpattern is first used,	so, for	example,  (a\1)	 never
       matches. However, such references can be useful inside repeated
       subpatterns. For example, the pattern

       (a|b\1)+

       matches any number of "a"s and also "aba", "ababbaa" etc. At each
       iteration of the subpattern, the back reference matches the character
       string corresponding to the previous iteration. In order	 for  this  to
       work,  the  pattern must	be such	that the first iteration does not need
       to match	the back reference. This can be	done using alternation,	as  in
       the example above, or by	a quantifier with a minimum of zero.
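
       A sketch of the recursive back reference follows. The ^ and $
       anchors are additions made here so that each subject must be
       matched in full; "ababbaa" is consumed as "a", "ba", "bba", "a".

```erlang
{match, _} = re:run("aaa",     "^(a|b\\1)+$").
{match, _} = re:run("aba",     "^(a|b\\1)+$").
{match, _} = re:run("ababbaa", "^(a|b\\1)+$").
%% After "a", the second iteration would need "ba", but only "b" is left.
nomatch = re:run("ab", "^(a|b\\1)+$").
%% A reference inside its own group fails on first use.
nomatch = re:run("aa", "(a\\1)").
```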

       Back  references	of this	type cause the group that they reference to be
       treated as an atomic group. Once	the whole group	has  been  matched,  a
       subsequent  matching  failure cannot cause backtracking into the	middle
       of the group.

       An assertion is a test on the characters	 following  or	preceding  the
       current	matching  point	that does not actually consume any characters.
       The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^	and $ are  de-
       scribed above.

       More  complicated  assertions  are  coded as subpatterns. There are two
       kinds: those that look ahead of the current  position  in  the  subject
       string,	and  those  that  look	behind	it. An assertion subpattern is
       matched in the normal way, except that it does not  cause  the  current
       matching	position to be changed.

       Assertion  subpatterns are not capturing	subpatterns. If	such an	asser-
       tion contains capturing subpatterns within it, these  are  counted  for
       the  purposes  of numbering the capturing subpatterns in	the whole pat-
       tern. However, substring	capturing is carried out only for positive as-
       sertions.  (Perl	 sometimes, but	not always, does do capturing in nega-
       tive assertions.)

       For compatibility with Perl, assertion  subpatterns  may	 be  repeated;
       though  it  makes  no sense to assert the same thing several times, the
       side effect of capturing	parentheses may	 occasionally  be  useful.  In
       practice, there are only three cases:

         * If the quantifier is {0}, the assertion is never obeyed during
           matching. However, it may contain internal capturing parenthesized
           groups that are called from elsewhere via the subroutine mechanism.

         * If the quantifier is {0,n} where n is greater than zero, it is
           treated as if it were {0,1}. At run time, the rest of the pattern
           match is tried with and without the assertion, the order depending
           on the greediness of the quantifier.

         * If the minimum repetition is greater than zero, the quantifier is
           ignored. The assertion is obeyed just once when encountered during
           matching.

       Lookahead assertions

       Lookahead assertions start with (?= for positive	assertions and (?! for
       negative assertions. For example,

       \w+(?=;)

       matches a word followed by a semicolon, but does not include the
       semicolon in the match, and

       foo(?!bar)

       matches any occurrence of "foo" that is not followed by "bar". Note
       that the apparently similar pattern

       (?!foo)bar

       does not find an occurrence of "bar" that is preceded by something
       other than "foo"; it finds any occurrence of "bar" whatsoever,  because
       the assertion (?!foo) is	always true when the next three	characters are
       "bar". A	lookbehind assertion is	needed to achieve the other effect.

       If you want to force a matching failure at some point in	a pattern, the
       most  convenient	 way to	do it is with (?!) because an empty string al-
       ways matches, so	an assertion that requires there not to	 be  an	 empty
       string  must always fail. The backtracking control verb (*FAIL) or (*F)
       is a synonym for	(?!).
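
       The lookahead examples can be run as shown in this sketch. The
       subject strings are assumptions; results without a capture option
       are zero-based {Offset, Length} pairs.

```erlang
%% The semicolon is tested but not included in the match.
{match, ["word"]} = re:run("a word; here", "\\w+(?=;)", [{capture, first, list}]).
%% The first "foo" is followed by "bar", so the second one (offset 7) matches.
{match, [{7, 3}]} = re:run("foobar foobaz", "foo(?!bar)").
%% (?!foo)bar still finds the "bar" that comes right after "foo".
{match, [{3, 3}]} = re:run("foobar", "(?!foo)bar").
%% (?!) always fails.
nomatch = re:run("abc", "a(?!)").
```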

       Lookbehind assertions

       Lookbehind assertions start with	(?<= for positive assertions and  (?<!
       for negative assertions. For example,

       (?<!foo)bar

       does find an occurrence of "bar" that is not preceded by "foo". The
       contents	of a lookbehind	assertion are restricted  such	that  all  the
       strings it matches must have a fixed length. However, if	there are sev-
       eral top-level alternatives, they do not	all  have  to  have  the  same
       fixed length. Thus

       (?<=bullock|donkey)

       is permitted, but

       (?<!dogs?|cats?)

       causes an error at compile time. Branches that match different length
       strings are permitted only at the top level of a	lookbehind  assertion.
       This is an extension compared with Perl,	which requires all branches to
       match the same length of string. An assertion such as

       (?<=ab(c|de))

       is not permitted, because its single top-level branch can match two
       different lengths, but it is acceptable to PCRE if rewritten to use two
       top-level branches:

       (?<=abc|abde)

       In some cases, the escape sequence \K (see above) can be used instead
       of a lookbehind assertion to get	round the fixed-length restriction.

       The  implementation  of lookbehind assertions is, for each alternative,
       to temporarily move the current position	back by	the fixed  length  and
       then try	to match. If there are insufficient characters before the cur-
       rent position, the assertion fails.
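
       A sketch of lookbehind in use follows. The subjects and the
       trailing "s" in the second pattern are illustrative assumptions;
       the last call shows \K standing in for a lookbehind.

```erlang
%% The "bar" at offset 3 follows "foo"; the one at offset 8 does not.
{match, [{8, 3}]} = re:run("foobar xbar", "(?<!foo)bar").
%% Top-level alternatives may have different fixed lengths.
{match, [{6, 1}]} = re:run("donkeys", "(?<=bullock|donkey)s").
%% Too few characters before the current position: the assertion fails.
nomatch = re:run("ar", "(?<=b)ar").
%% \K resets the start of the reported match instead of looking behind.
{match, ["bar"]} = re:run("foobar", "foo\\Kbar", [{capture, first, list}]).
```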

       In a UTF	mode, PCRE does	not allow the \C escape	(which matches a  sin-
       gle  data  unit even in a UTF mode) to appear in	lookbehind assertions,
       because it makes	it impossible to calculate the length of  the  lookbe-
       hind.  The \X and \R escapes, which can match different numbers of data
       units, are also not permitted.

       "Subroutine" calls (see below) such as (?2) or (?&X) are	 permitted  in
       lookbehinds,  as	 long as the subpattern	matches	a fixed-length string.
       Recursion, however, is not supported.

       Possessive quantifiers can be used in conjunction with  lookbehind  as-
       sertions	 to  specify efficient matching	of fixed-length	strings	at the
       end of subject strings. Consider a simple pattern such as

       abcd$

       when applied to a long string that does not match. Because matching
       proceeds	from left to right, PCRE will look for each "a"	in the subject
       and then	see if what follows matches the	rest of	the  pattern.  If  the
       pattern is specified as

       ^.*abcd$

       the initial .* matches the entire string at first, but when this fails
       (because	there is no following "a"), it backtracks to match all but the
       last  character,	 then all but the last two characters, and so on. Once
       again the search	for "a"	covers the entire string, from right to	 left,
       so we are no better off. However, if the pattern is written as

       ^.*+(?<=abcd)

       there  can  be  no backtracking for the .*+ item; it can	match only the
       entire string. The subsequent lookbehind	assertion does a  single  test
       on  the last four characters. If	it fails, the match fails immediately.
       For long	strings, this approach makes a significant difference  to  the
       processing time.
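
       The technique can be tried from Erlang with a minimal sketch such as
       the following. The module and function names are hypothetical, and the
       dotall option is an addition made here so that "." also consumes
       newlines in the subject.

```erlang
%% Hypothetical module, for illustration only.
-module(re_tail_demo).
-export([ends_with_abcd/1]).

%% Returns true if Subject ends with "abcd".
%% ".*+" consumes the whole subject possessively, so there are no
%% backtracking points; the lookbehind then tests the last four
%% characters exactly once.
ends_with_abcd(Subject) ->
    re:run(Subject, "^.*+(?<=abcd)", [dotall, {capture, none}]) =:= match.
```

       With this sketch, ends_with_abcd("xxxabcd") returns true and
       ends_with_abcd("abcdxxx") returns false.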

       Using multiple assertions

       Several assertions (of any sort) may occur in succession. For example,

       (?<=\d{3})(?<!999)foo

       matches	"foo" preceded by three	digits that are	not "999". Notice that
       each of the assertions is applied independently at the  same  point  in
       the  subject  string.  First  there  is a check that the	previous three
       characters are all digits, and then there is  a	check  that  the  same
       three characters are not "999". This pattern does not match "foo"
       preceded by six characters, the first three of which are digits and
       the last three of which are not "999". For example, it does not match
       "123abcfoo". A pattern to do that is

       (?<=\d{3}...)(?<!999)foo

       This time the first assertion looks at the  preceding  six  characters,
       checking	that the first three are digits, and then the second assertion
       checks that the preceding three characters are not "999".
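
       The difference between the two patterns can be checked with a short
       sketch like this one. The module name is hypothetical; note that each
       backslash is doubled because the patterns are Erlang string literals.

```erlang
%% Hypothetical module, for illustration only.
-module(re_assert_demo).
-export([run/0]).

run() ->
    Opts = [{capture, none}],
    %% Both assertions inspect the same three preceding characters.
    Same = "(?<=\\d{3})(?<!999)foo",
    %% The first assertion inspects six characters, the second three.
    Six = "(?<=\\d{3}...)(?<!999)foo",
    [re:run("123foo", Same, Opts),
     re:run("999foo", Same, Opts),
     re:run("123abcfoo", Same, Opts),
     re:run("123abcfoo", Six, Opts)].
```

       Calling run() returns [match,nomatch,nomatch,match].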

       Assertions can be nested in any combination. For example,

       (?<=(?<!foo)bar)baz

       matches an occurrence of	"baz" that is preceded by "bar"	which in  turn
       is not preceded by "foo", while

       (?<=\d{3}(?!999)...)foo

       is  another pattern that	matches	"foo" preceded by three	digits and any
       three characters	that are not "999".

       It is possible to cause the matching process to obey a subpattern  con-
       ditionally  or to choose	between	two alternative	subpatterns, depending
       on the result of	an assertion, or whether a specific capturing  subpat-
       tern  has  already  been	matched. The two possible forms	of conditional
       subpattern are:

	 * (?(condition)yes-pattern)

	 * (?(condition)yes-pattern|no-pattern)

       If the condition	is satisfied, the yes-pattern is used;	otherwise  the
       no-pattern  (if	present)  is used. If there are	more than two alterna-
       tives in	the subpattern,	a compile-time error occurs. Each of  the  two
       alternatives may	itself contain nested subpatterns of any form, includ-
       ing conditional subpatterns; the	restriction to	two  alternatives  ap-
       plies  only  at the level of the	condition. This	pattern	fragment is an
       example where the alternatives are complex:

       (?(1) (A|B|C) | (D | (?(2)E|F) |	E) )

       There are four kinds of condition: references  to  subpatterns,	refer-
       ences to	recursion, a pseudo-condition called DEFINE, and assertions.

       Checking	for a used subpattern by number

       If  the	text between the parentheses consists of a sequence of digits,
       the condition is	true if	a capturing subpattern of that number has pre-
       viously	matched.  If  there is more than one capturing subpattern with
       the same	number (see the	earlier	 section  about	 duplicate  subpattern
       numbers),  the condition	is true	if any of them have matched. An	alter-
       native notation is to precede the digits	with a plus or minus sign.  In
       this  case, the subpattern number is relative rather than absolute. The
       most recently opened parentheses	can be referenced by (?(-1), the  next
       most  recent  by	(?(-2),	and so on. Inside loops	it can also make sense
       to refer	to subsequent groups. The next parentheses to be opened	can be
       referenced  as (?(+1), and so on. (The value zero in any	of these forms
       is not used; it provokes	a compile-time error.)

       Consider	the following pattern, which contains  non-significant	white-
       space  to make it more readable (assume the extended option) and	to di-
       vide it into three parts	for ease of discussion:

       ( \( )? [^()]+ (?(1) \) )

       The first part matches an optional opening  parenthesis,	 and  if  that
       character is present, sets it as	the first captured substring. The sec-
       ond part	matches	one or more characters that are	not  parentheses.  The
       third part is a conditional subpattern that tests whether the first
       set of parentheses matched. If they did, that is, if the subject
       started with an opening parenthesis, the condition is true, and so
       the yes-pattern is executed and a closing parenthesis is	required. Oth-
       erwise,	since  no-pattern is not present, the subpattern matches noth-
       ing. In other words, this pattern matches a sequence  of	 non-parenthe-
       ses, optionally enclosed	in parentheses.
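
       As a rough illustration, the pattern can be exercised as follows. The
       module name is hypothetical, and the ^ and $ anchors are an assumption
       added here so that the whole subject has to match.

```erlang
%% Hypothetical module, for illustration only.
-module(re_cond_demo).
-export([check/1]).

%% The closing parenthesis is required only if group 1 (the optional
%% opening parenthesis) took part in the match.
check(Subject) ->
    Pattern = "^ ( \\( )? [^()]+ (?(1) \\) ) $",
    re:run(Subject, Pattern, [extended, {capture, none}]).
```

       check("(abc)") and check("abc") return match, whereas check("(abc")
       and check("abc)") return nomatch.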

       If  you	were  embedding	 this pattern in a larger one, you could use a
       relative	reference:

       ...other	stuff... ( \( )? [^()]+	(?(-1) \) ) ...

       This makes the fragment independent of the parentheses in the larger
       pattern.

       Checking	for a used subpattern by name

       Perl  uses  the	syntax	(?(<name>)...) or (?('name')...) to test for a
       used subpattern by name.	For compatibility  with	 earlier  versions  of
       PCRE,  which  had this facility before Perl, the	syntax (?(name)...) is
       also recognized.	However, there is a possible ambiguity with this  syn-
       tax,  because  subpattern  names	 may  consist entirely of digits. PCRE
       looks first for a named subpattern; if it cannot	find one and the  name
       consists	 entirely  of digits, PCRE looks for a subpattern of that num-
       ber, which must be greater than zero. Using subpattern names that  con-
       sist entirely of	digits is not recommended.

       Rewriting the above example to use a named subpattern gives this:

       (?<OPEN>	\( )? [^()]+ (?(<OPEN>)	\) )

       If  the	name used in a condition of this kind is a duplicate, the test
       is applied to all subpatterns of	the same name, and is true if any  one
       of them has matched.
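
       A named condition can be combined with the name-based capture
       selection of re:run/3, as in this sketch. The module name and the
       added ^ and $ anchors are assumptions made for the example.

```erlang
%% Hypothetical module, for illustration only.
-module(re_named_cond_demo).
-export([open_paren/1]).

%% Returns the text captured by the named group OPEN. A group that
%% did not take part in the match is returned as an empty list.
open_paren(Subject) ->
    Pattern = "^ (?<OPEN> \\( )? [^()]+ (?(<OPEN>) \\) ) $",
    re:run(Subject, Pattern, [extended, {capture, ['OPEN'], list}]).
```

       open_paren("(abc)") returns {match,["("]}, open_paren("abc") returns
       {match,[[]]}, and open_paren("(abc") returns nomatch.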

       Checking	for pattern recursion

       If the condition	is the string (R), and there is	no subpattern with the
       name R, the condition is	true if	a recursive call to the	whole  pattern
       or any subpattern has been made.	If digits or a name preceded by	amper-
       sand follow the letter R, for example:

       (?(R3)...) or (?(R&name)...)

       the condition is	true if	the most recent	recursion is into a subpattern
       whose number or name is given. This condition does not check the	entire
       recursion stack.	If the name used in a condition	of this	kind is	a  du-
       plicate,	 the  test is applied to all subpatterns of the	same name, and
       is true if any one of them is the most recent recursion.

       At "top level", all these recursion test	conditions are false. The syn-
       tax for recursive patterns is described below.

       Defining	subpatterns for	use by reference only

       If  the	condition  is  the string (DEFINE), and	there is no subpattern
       with the	name DEFINE, the condition is  always  false.  In  this	 case,
       there  may  be  only  one  alternative  in the subpattern. It is	always
       skipped if control reaches this point in	the pattern; the idea  of  DE-
       FINE  is	that it	can be used to define "subroutines" that can be	refer-
       enced from elsewhere. (The use of subroutines is	described below.)  For
       example, a pattern to match an IPv4 address such as "192.168.23.245"
       could be written like this (ignore whitespace and line breaks):

       (?(DEFINE) (?<byte> 2[0-4]\d  |	25[0-5]	 |  1\d\d  |  [1-9]?\d)	 )  \b
       (?&byte)	(\.(?&byte)){3}	\b

       The first part of the pattern is a DEFINE group inside which another
       group named "byte" is defined. This matches an individual component of
       an  IPv4	 address  (a number less than 256). When matching takes	place,
       this part of the	pattern	is skipped because DEFINE acts	like  a	 false
       condition.  The	rest of	the pattern uses references to the named group
       to match	the four dot-separated components of an	IPv4 address,  insist-
       ing on a	word boundary at each end.
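
       A minimal sketch of using the DEFINE pattern from Erlang is shown
       here. The module and function names are hypothetical; the pattern is
       split over two adjacent string literals only to keep the lines short.

```erlang
%% Hypothetical module, for illustration only.
-module(re_define_demo).
-export([find_ipv4/1]).

%% Returns the first IPv4-like address found in Subject.
%% The DEFINE group is skipped during matching; (?&byte) calls the
%% group it defines as a subroutine.
find_ipv4(Subject) ->
    Pattern =
        "(?(DEFINE) (?<byte> 2[0-4]\\d | 25[0-5] | 1\\d\\d | [1-9]?\\d) )"
        " \\b (?&byte) (\\.(?&byte)){3} \\b",
    re:run(Subject, Pattern, [extended, {capture, first, list}]).
```

       find_ipv4("host 192.168.23.245 up") returns
       {match,["192.168.23.245"]}, and find_ipv4("1.2.3") returns nomatch
       because only three components are present.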

       Assertion conditions

       If  the condition is not	in any of the above formats, it	must be	an as-
       sertion.	This may be a positive or negative lookahead or	lookbehind as-
       sertion.	Consider this pattern, again containing	non-significant	white-
       space, and with the two alternatives on the second line:

         (?(?=[^a-z]*[a-z])
         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

       The condition is	a positive lookahead assertion	that  matches  an  op-
       tional sequence of non-letters followed by a letter. In other words, it
       tests for the presence of at least one letter in	the subject. If	a let-
       ter  is	found,	the  subject is	matched	against	the first alternative;
       otherwise it is	matched	 against  the  second.	This  pattern  matches
       strings	in  one	 of the	two forms dd-aaa-dd or dd-dd-dd, where aaa are
       letters and dd are digits.
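
       The assertion condition can be tried with a sketch along these lines.
       The module name is hypothetical, and the ^ and $ anchors are added
       here (an assumption) so that the whole subject is tested.

```erlang
%% Hypothetical module, for illustration only.
-module(re_assert_cond_demo).
-export([classify/1]).

%% The lookahead decides which alternative is used: if a lowercase
%% letter occurs anywhere ahead, only dd-aaa-dd is tried; otherwise
%% only dd-dd-dd is tried.
classify(Subject) ->
    Pattern = "^(?(?=[^a-z]*[a-z]) \\d{2}-[a-z]{3}-\\d{2}"
              " | \\d{2}-\\d{2}-\\d{2} )$",
    re:run(Subject, Pattern, [extended, {capture, none}]).
```

       classify("12-abc-34") and classify("12-34-56") return match, whereas
       classify("12-ab-34") returns nomatch.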

       There are two ways of including comments	in patterns that are processed
       by PCRE.	In both	cases, the start of the	comment	must not be in a char-
       acter class, nor	in the middle of any other sequence of related charac-
       ters  such  as  (?: or a	subpattern name	or number. The characters that
       make up a comment play no part in the pattern matching.

       The sequence (?#	marks the start	of a comment that continues up to  the
       next  closing parenthesis. Nested parentheses are not permitted.	If the
       extended option is set, an unescaped # character also introduces a
       comment,	 which	in  this  case continues to immediately	after the next
       newline character or character sequence in the pattern.	Which  charac-
       ters are	interpreted as newlines	is controlled by the options passed to
       a compiling function or by a special sequence at	the start of the  pat-
       tern, as	described in the section entitled "Newline conventions"	above.
       Note that the end of this type of comment is a literal newline sequence
       in  the pattern;	escape sequences that happen to	represent a newline do
       not count. For example, consider	this pattern when extended is set, and
       the default newline convention is in force:

       abc #comment \n still comment

       On  encountering	 the # character, pcre_compile()  skips	along, looking
       for a newline in	the pattern. The sequence \n is	still literal at  this
       stage,  so  it does not terminate the comment. Only an actual character
       with the	code value 0x0a	(the default newline) does so.
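
       In Erlang source the distinction is visible in the string literal
       itself: "\\n" places the two characters \ and n in the pattern, while
       "\n" places a real newline there. The following sketch, with a
       hypothetical module name, shows both cases together with a (?#
       comment.

```erlang
%% Hypothetical module, for illustration only.
-module(re_comment_demo).
-export([run/0]).

run() ->
    Opts = [extended, {capture, first, list}],
    %% Escape sequence in the pattern: the comment runs to the end of
    %% the pattern, so only "abc" is compiled.
    Escaped = re:run("abcdef", "abc #comment \\n def", Opts),
    %% Real newline (0x0a) in the pattern: the comment ends there and
    %% "def" is part of the pattern.
    Literal = re:run("abcdef", "abc #comment \n def", Opts),
    %% (?#...) comments do not need the extended option.
    Inline = re:run("abc", "ab(?#inline)c", [{capture, first, list}]),
    {Escaped, Literal, Inline}.
```

       Calling run() returns
       {{match,["abc"]},{match,["abcdef"]},{match,["abc"]}}.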

       Consider	the problem of matching	a string in parentheses, allowing  for
       unlimited  nested  parentheses.	Without	the use	of recursion, the best
       that can	be done	is to use a pattern that  matches  up  to  some	 fixed
       depth of nesting. It is not possible to handle an arbitrary nesting
       depth.

       For some	time, Perl has provided	a facility that	allows regular expres-
       sions  to recurse (amongst other	things). It does this by interpolating
       Perl code in the	expression at run time,	and the	code can refer to  the
       expression itself. A Perl pattern using code interpolation to solve the
       parentheses problem can be created like this:

       $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

       The (?p{...}) item interpolates Perl code at run	time, and in this case
       refers recursively to the pattern in which it appears.

       Obviously, PCRE cannot support the interpolation	of Perl	code. Instead,
       it supports special syntax for recursion	of  the	 entire	 pattern,  and
       also  for  individual  subpattern  recursion. After its introduction in
       PCRE and	Python,	this kind of  recursion	 was  subsequently  introduced
       into Perl at release 5.10.

       A  special  item	 that consists of (? followed by a number greater than
       zero and	a closing parenthesis is a recursive subroutine	 call  of  the
       subpattern  of  the  given  number, provided that it occurs inside that
       subpattern. (If not, it is a non-recursive subroutine  call,  which  is
       described  in the next section.)	The special item (?R) or (?0) is a re-
       cursive call of the entire regular expression.

       This PCRE pattern solves	the nested parentheses problem (assume the ex-
       tended option is	set so that whitespace is ignored):

       \( ( [^()]++ | (?R) )* \)

       First  it matches an opening parenthesis. Then it matches any number of
       substrings which	can either be a	sequence of non-parentheses, or	a  re-
       cursive match of	the pattern itself (that is, a correctly parenthesized
       substring). Finally there is a closing parenthesis. Note the use of a
       possessive quantifier to avoid backtracking into sequences of
       non-parentheses.

       If this were part of a larger pattern, you would	not  want  to  recurse
       the entire pattern, so instead you could	use this:

       ( \( ( [^()]++ |	(?1) )*	\) )

       We  have	 put the pattern into parentheses, and caused the recursion to
       refer to	them instead of	the whole pattern.
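
       A sketch of a balanced-parentheses check built on the subpattern
       recursion follows. The module name is hypothetical, and the ^ and $
       anchors are an assumption added so that the whole subject must be one
       parenthesized group.

```erlang
%% Hypothetical module, for illustration only.
-module(re_paren_demo).
-export([balanced/1]).

%% (?1) recurses into group 1 rather than the whole pattern, so the
%% anchors are not part of the recursion.
balanced(Subject) ->
    Pattern = "^ ( \\( ( [^()]++ | (?1) )* \\) ) $",
    re:run(Subject, Pattern, [extended, {capture, none}]) =:= match.
```

       balanced("(a(b)c)") and balanced("()") return true, whereas
       balanced("(a(b)c") returns false.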

       In a larger pattern,  keeping  track  of	 parenthesis  numbers  can  be
       tricky.	This is	made easier by the use of relative references. Instead
       of (?1) in the pattern above you	can write (?-2)	to refer to the	second
       most  recently  opened  parentheses  preceding  the recursion. In other
       words, a	negative number	counts capturing  parentheses  leftwards  from
       the point at which it is	encountered.

       It  is  also  possible  to refer	to subsequently	opened parentheses, by
       writing references such as (?+2). However, these	 cannot	 be  recursive
       because	the  reference	is  not	inside the parentheses that are	refer-
       enced. They are always non-recursive subroutine calls, as described  in
       the next	section.

       An  alternative	approach is to use named parentheses instead. The Perl
       syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also
       supported. We could rewrite the above example as	follows:

       (?<pn> \( ( [^()]++ | (?&pn) )* \) )

       If  there  is more than one subpattern with the same name, the earliest
       one is used.

       This particular example pattern that we have been looking  at  contains
       nested unlimited	repeats, and so	the use	of a possessive	quantifier for
       matching	strings	of non-parentheses is important	when applying the pat-
       tern  to	 strings  that do not match. For example, when this pattern is
       applied to


       it yields "no match" quickly. However, if a  possessive	quantifier  is
       not  used, the match runs for a very long time indeed because there are
       so many different ways the + and	* repeats can carve  up	 the  subject,
       and all have to be tested before	failure	can be reported.

       At  the	end  of	a match, the values of capturing parentheses are those
       from the outermost level. If the pattern above is matched against

       (ab(cd)ef)

       the value for the inner capturing parentheses  (numbered	 2)  is	 "ef",
       which  is the last value	taken on at the	top level. If a	capturing sub-
       pattern is not matched at the top level,	its final  captured  value  is
       unset,  even  if	 it was	(temporarily) set at a deeper level during the
       matching	process.
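
       The captured values can be inspected from Erlang with a sketch such
       as this. The module name is hypothetical; the pattern is the numbered
       recursion ( \( ( [^()]++ | (?1) )* \) ) and the subject is
       "(ab(cd)ef)".

```erlang
%% Hypothetical module, for illustration only.
-module(re_recursion_capture_demo).
-export([run/0]).

%% Returns the whole match followed by groups 1 and 2. Group 2 holds
%% the last value it took at the top level, not a value set inside
%% the recursive call.
run() ->
    Pattern = "( \\( ( [^()]++ | (?1) )* \\) )",
    re:run("(ab(cd)ef)", Pattern, [extended, {capture, all, list}]).
```

       Calling run() returns {match,["(ab(cd)ef)","(ab(cd)ef)","ef"]}.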

       Do not confuse the (?R) item with the condition (R),  which  tests  for
       recursion. Consider this	pattern, which matches text in angle brackets,
       allowing	for arbitrary nesting.	Only  digits  are  allowed  in	nested
       brackets	 (that is, when	recursing), whereas any	characters are permit-
       ted at the outer	level.

       < (?: (?(R) \d++	| [^<>]*+) | (?R)) * >

       In this pattern,	(?(R) is the start of a	conditional  subpattern,  with
       two  different  alternatives for	the recursive and non-recursive	cases.
       The (?R)	item is	the actual recursive call.
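
       The effect of the (R) condition can be observed with the following
       sketch. The module name is hypothetical, and the space before the *
       quantifier has been dropped in this version of the pattern.

```erlang
%% Hypothetical module, for illustration only.
-module(re_recursion_cond_demo).
-export([first_match/1]).

%% Returns the first bracketed text found. Inside a recursive call
%% only digits are accepted between the brackets.
first_match(Subject) ->
    Pattern = "< (?: (?(R) \\d++ | [^<>]*+) | (?R))* >",
    re:run(Subject, Pattern, [extended, {capture, first, list}]).
```

       first_match("<abc<123>def>") returns {match,["<abc<123>def>"]}. For
       "<abc<12a>def>" the nested brackets are rejected when recursing, so
       the first match is the inner text at top level: {match,["<12a>"]}.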

       Differences in recursion	processing between PCRE	and Perl

       Recursion processing in PCRE differs from Perl in two  important	 ways.
       In  PCRE	(like Python, but unlike Perl),	a recursive subpattern call is
       always treated as an atomic group. That is, once	it has matched some of
       the subject string, it is never re-entered, even	if it contains untried
       alternatives and	there is a subsequent matching failure.	 This  can  be
       illustrated by the following pattern, which purports to match a palin-
       dromic string that contains an odd number of characters (for example,
       "a", "aba", "abcba", "abcdcba"):

       ^(.|(.)(?1)\2)$

       The idea	is that	it either matches a single character, or two identical
       characters surrounding a	sub-palindrome.	In Perl, this  pattern	works;
       in PCRE it does not if the subject is longer than three characters.
       Consider	the subject string "abcba":

       At the top level, the first character is	matched, but as	it is  not  at
       the end of the string, the first	alternative fails; the second alterna-
       tive is taken and the recursion kicks in. The recursive call to subpat-
       tern  1	successfully  matches the next character ("b").	(Note that the
       beginning and end of line tests are not part of the recursion).

       Back at the top level, the next character ("c") is compared  with  what
       subpattern  2 matched, which was	"a". This fails. Because the recursion
       is treated as an	atomic group, there are	now  no	 backtracking  points,
       and so the entire match fails. (Perl is able, at this point, to
       re-enter the recursion and try the second alternative.) However, if
       the pattern is written with the alternatives in the other order,
       things are different:

       ^((.)(?1)\2|.)$

       This time, the recursing	alternative is tried first, and	 continues  to
       recurse	until  it runs out of characters, at which point the recursion
       fails. But this time we do have	another	 alternative  to  try  at  the
       higher level. That is the big difference: in the previous case the re-
       maining alternative is at a deeper recursion level, which PCRE cannot
       use.

       To  change  the pattern so that it matches all palindromic strings, not
       just those with an odd number of	characters, it is tempting  to	change
       the pattern to this:

       ^((.)(?1)\2|.?)$

       Again,  this  works  in Perl, but not in	PCRE, and for the same reason.
       When a deeper recursion has matched a single character,	it  cannot  be
       entered	again  in  order  to match an empty string. The	solution is to
       separate the two cases, and write out the odd and even cases as alter-
       natives at the higher level:

       ^(?:((.)(?1)\2|)|((.)(?3)\4|.))$

       If  you	want  to match typical palindromic phrases, the	pattern	has to
       ignore all non-word characters, which can be done like this:

       ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$

       If run with the caseless	option,	this pattern matches phrases  such  as
       "A  man,	 a  plan, a canal: Panama!" and	it works well in both PCRE and
       Perl. Note the use of the possessive quantifier *+ to avoid  backtrack-
       ing  into  sequences of non-word	characters. Without this, PCRE takes a
       great deal longer (ten times or more) to	 match	typical	 phrases,  and
       Perl takes so long that you think it has	gone into a loop.

       WARNING:	 The  palindrome-matching patterns above work only if the sub-
       ject string does	not start with a palindrome that is shorter  than  the
       entire  string.	For example, although "abcba" is correctly matched, if
       the subject is "ababa", PCRE finds the palindrome "aba" at  the	start,
       then  fails at top level	because	the end	of the string does not follow.
       Once again, it cannot jump back into the	recursion to try other	alter-
       natives,	so the entire match fails.
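
       As a rough sketch, the pattern that writes out the even and odd cases
       as separate top-level alternatives can be wrapped like this. The
       module and function names are hypothetical, and the limitation given
       in the warning above still applies to subjects such as "ababa".

```erlang
%% Hypothetical module, for illustration only.
-module(re_palindrome_demo).
-export([is_palindrome/1]).

%% First alternative: even-length palindromes (group 1 recursion).
%% Second alternative: odd-length palindromes (group 3 recursion).
is_palindrome(Subject) ->
    Pattern = "^(?:((.)(?1)\\2|)|((.)(?3)\\4|.))$",
    re:run(Subject, Pattern, [{capture, none}]) =:= match.
```

       is_palindrome("abba") and is_palindrome("abcba") return true, and
       is_palindrome("abcd") returns false.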

       The second way in which PCRE and Perl differ in their recursion pro-
       cessing is in the handling of captured values. In Perl, when a subpat-
       tern is called recursively or as a subroutine (see the next section),
       it has no access to any values that were captured outside the recur-
       sion, whereas in PCRE these values can be referenced. Consider this
       pattern:

       ^(.)(\1|a(?2))

       In PCRE,	this pattern matches "bab". The	 first	capturing  parentheses
       match  "b",  then in the	second group, when the back reference \1 fails
       to match	"b", the second	alternative matches "a"	and then recurses.  In
       the  recursion,	\1 does	now match "b" and so the whole match succeeds.
       In Perl,	the pattern fails to match because inside the  recursive  call
       \1 cannot access	the externally set value.
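
       The PCRE behaviour can be confirmed from Erlang with a sketch like
       the following, which runs the pattern ^(.)(\1|a(?2)) against "bab".
       The module name is hypothetical.

```erlang
%% Hypothetical module, for illustration only.
-module(re_recursion_backref_demo).
-export([run/0]).

%% Inside the recursive call (?2), the back reference \1 still sees
%% the "b" captured outside the recursion.
run() ->
    re:run("bab", "^(.)(\\1|a(?2))", [{capture, first, list}]).
```

       Calling run() returns {match,["bab"]}.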

       If  the	syntax for a recursive subpattern call (either by number or by
       name) is	used outside the parentheses to	which it refers,  it  operates
       like  a subroutine in a programming language. The called	subpattern may
       be defined before or after the reference. A numbered reference  can  be
       absolute	or relative, as	in these examples:

	 * (...(absolute)...)...(?2)...

	 * (...(relative)...)...(?-1)...

	 * (...(?+1)...(relative)...

       An earlier example pointed out that the pattern

       (sens|respons)e and \1ibility

       matches	"sense and sensibility"	and "response and responsibility", but
       not "sense and responsibility". If instead the pattern

       (sens|respons)e and (?1)ibility

       is used,	it does	match "sense and responsibility" as well as the	 other
       two strings. Another example is given in the discussion of DEFINE
       above.

       All subroutine calls, whether recursive or not, are always  treated  as
       atomic  groups. That is,	once a subroutine has matched some of the sub-
       ject string, it is never	re-entered, even if it contains	untried	alter-
       natives	and  there  is	a  subsequent  matching	failure. Any capturing
       parentheses that	are set	during the subroutine  call  revert  to	 their
       previous	values afterwards.
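
       The difference between a back reference and a subroutine call in the
       "sense and responsibility" example can be sketched as follows. The
       module name is hypothetical.

```erlang
%% Hypothetical module, for illustration only.
-module(re_subroutine_demo).
-export([run/0]).

run() ->
    Subject = "sense and responsibility",
    Opts = [{capture, none}],
    %% \1 must repeat the text that group 1 actually matched.
    BackRef = re:run(Subject, "(sens|respons)e and \\1ibility", Opts),
    %% (?1) re-runs the subpattern, so either alternative may match.
    Call = re:run(Subject, "(sens|respons)e and (?1)ibility", Opts),
    {BackRef, Call}.
```

       Calling run() returns {nomatch,match}.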

       Processing  options  such as case-independence are fixed	when a subpat-
       tern is defined,	so if it is used as a subroutine, such options	cannot
       be changed for different calls. For example, consider this pattern:

       (abc)(?i:(?-1))

       It  matches  "abcabc". It does not match	"abcABC" because the change of
       processing option does not affect the called subpattern.
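
       A short sketch showing that the caseless setting at the point of the
       call does not reach the called subpattern (abc). The module name is
       hypothetical.

```erlang
%% Hypothetical module, for illustration only.
-module(re_subroutine_opts_demo).
-export([run/0]).

%% Group 1 was defined as case-sensitive, so calling it from inside
%% (?i:...) does not make it caseless.
run() ->
    Pattern = "(abc)(?i:(?-1))",
    Opts = [{capture, none}],
    {re:run("abcabc", Pattern, Opts), re:run("abcABC", Pattern, Opts)}.
```

       Calling run() returns {match,nomatch}.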

       For compatibility with Oniguruma, the non-Perl syntax \g	followed by  a
       name or a number	enclosed either	in angle brackets or single quotes, is
       an alternative syntax for referencing a	subpattern  as	a  subroutine,
       possibly	 recursively. Here are two of the examples used	above, rewrit-
       ten using this syntax:

       (?<pn> \( ( (?>[^()]+) |	\g<pn> )* \) )

       (sens|respons)e and \g'1'ibility

       PCRE supports an	extension to Oniguruma:	if a number is preceded	 by  a
       plus or a minus sign it is taken as a relative reference. For example:

       (abc)(?i:\g<-1>)

       Note  that \g{...} (Perl	syntax)	and \g<...> (Oniguruma syntax) are not
       synonymous. The former is a back reference; the latter is a subroutine
       call.

       Perl  5.10 introduced a number of "Special Backtracking Control Verbs",
       which are still described in the	Perl  documentation  as	 "experimental
       and  subject to change or removal in a future version of	Perl". It goes
       on to say: "Their usage in production code should  be  noted  to	 avoid
       problems	 during	upgrades." The same remarks apply to the PCRE features
       described in this section.

       The new verbs make use of what was previously invalid syntax: an	 open-
       ing parenthesis followed	by an asterisk.	They are generally of the form
       (*VERB) or (*VERB:NAME).	Some may take either form,  possibly  behaving
       differently  depending  on  whether or not a name is present. A name is
       any sequence of characters that does not	include	a closing parenthesis.
       The maximum length of name is 255 in the	8-bit library and 65535	in the
       16-bit and 32-bit libraries. If the name	is  empty,  that  is,  if  the
       closing	parenthesis immediately	follows	the colon, the effect is as if
       the colon were not there. Any number of these verbs may occur in a
       pattern.

       The  behaviour  of  these  verbs	in repeated groups, assertions,	and in
       subpatterns called as subroutines (whether or not recursively) is docu-
       mented below.

       Optimizations that affect backtracking verbs

       PCRE  contains some optimizations that are used to speed	up matching by
       running some checks at the start	of each	match attempt. For example, it
       may  know  the minimum length of	matching subject, or that a particular
       character must be present. When one of these optimizations bypasses the
       running	of  a  match,  any  included  backtracking  verbs will not, of
       course, be processed. You can suppress the start-of-match optimizations
       by  setting  the	 no_start_optimize option when calling re:compile/2 or
       re:run/3, or by starting	the pattern with (*NO_START_OPT).

       Experiments with	Perl suggest that it too  has  similar	optimizations,
       sometimes leading to anomalous results.

       Verbs that act immediately

       The  following  verbs act as soon as they are encountered. They may not
       be followed by a name.

       (*ACCEPT)

       This verb causes	the match to end successfully, skipping	the  remainder
       of  the pattern.	However, when it is inside a subpattern	that is	called
       as a subroutine,	only that subpattern is	ended  successfully.  Matching
       then continues at the outer level. If (*ACCEPT) is triggered in a posi-
       tive assertion, the assertion succeeds; in a negative assertion, the
       assertion fails.

       If (*ACCEPT) is inside capturing parentheses, the data so far is cap-
       tured. For example:

       A((?:A|B(*ACCEPT)|C)D)

       This matches "AB", "AAD", or "ACD"; when	it matches "AB", "B"  is  cap-
       tured by	the outer parentheses.
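
       The captures produced by (*ACCEPT) can be listed with a sketch such
       as this one, which applies the pattern A((?:A|B(*ACCEPT)|C)D) to the
       three subjects. The module name is hypothetical.

```erlang
%% Hypothetical module, for illustration only.
-module(re_accept_demo).
-export([run/0]).

%% For "AB" the match ends at (*ACCEPT), so the outer group captures
%% only what had been matched inside it so far.
run() ->
    Pattern = "A((?:A|B(*ACCEPT)|C)D)",
    Opts = [{capture, all, list}],
    [re:run(Subject, Pattern, Opts) || Subject <- ["AB", "AAD", "ACD"]].
```

       Calling run() returns [{match,["AB","B"]},{match,["AAD","AD"]},
       {match,["ACD","CD"]}].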

       (*FAIL) or (*F)

       This  verb causes a matching failure, forcing backtracking to occur. It
       is equivalent to	(?!) but easier	to read. The Perl documentation	 notes
       that  it	 is  probably  useful only when	combined with (?{}) or (??{}).
       Those are, of course, Perl features that	are not	present	in  PCRE.  The
       nearest equivalent is the callout feature, as for example in this
       pattern:

       a+(?C)(*FAIL)

       A match with the	string "aaaa" always fails, but	the callout  is	 taken
       before each backtrack happens (in this example, 10 times).

       Recording which path was	taken

       There  is  one  verb whose main purpose is to track how a match was ar-
       rived at, though	it also	has a secondary	use in	conjunction  with  ad-
       vancing the match starting point	(see (*SKIP) below).

       In Erlang, there is no interface to retrieve a mark with re:run/{2,3},
       so only the secondary purpose is relevant to the Erlang programmer.

       The rest of this section is therefore deliberately not adapted for
       reading by the Erlang programmer. However, the examples might help in
       understanding NAMES as they can be used by (*SKIP).

       (*MARK:NAME) or (*:NAME)

       A name is always	required with this verb. There	may  be	 as  many  in-
       stances	of  (*MARK)  as	 you like in a pattern,	and their names	do not
       have to be unique.

       When a match succeeds, the name of the  last-encountered	 (*MARK:NAME),
       (*PRUNE:NAME),  or  (*THEN:NAME)	on the matching	path is	passed back to
       the caller as  described	 in  the  section  entitled  "Extra  data  for
       pcre_exec()"  in	 the  pcreapi  documentation.  Here  is	 an example of
       pcretest output, where the /K modifier requests the retrieval and out-
       putting of (*MARK) data:

           re> /X(*MARK:A)Y|X(*MARK:B)Z/K
         data> XY
          0: XY
         MK: A
         XZ
          0: XZ
         MK: B

       The (*MARK) name	is tagged with "MK:" in	this output, and in this exam-
       ple it indicates	which of the two alternatives matched. This is a  more
       efficient  way of obtaining this	information than putting each alterna-
       tive in its own capturing parentheses.

       If a verb with a	name is	encountered in a positive  assertion  that  is
       true, the name is recorded and passed back if it is the last-encoun-
       tered. This does not happen for negative assertions or failing positive
       assertions.

       After  a	 partial match or a failed match, the last encountered name in
       the entire match	process	is returned. For example:

	   re> /X(*MARK:A)Y|X(*MARK:B)Z/K
	 data> XP
	 No match, mark	= B

       Note that in this unanchored example the	 mark  is  retained  from  the
       match attempt that started at the letter	"X" in the subject. Subsequent
       match attempts starting at "P" and then with an empty string do not get
       as far as the (*MARK) item, but nevertheless do not reset it.

       Verbs that act after backtracking

       The following verbs do nothing when they	are encountered. Matching con-
       tinues with what	follows, but if	there is no subsequent match,  causing
       a  backtrack  to	 the  verb, a failure is forced. That is, backtracking
       cannot pass to the left of the verb. However, when one of  these	 verbs
       appears inside an atomic	group or an assertion that is true, its	effect
       is confined to that group, because once the  group  has	been  matched,
       there  is never any backtracking	into it. In this situation, backtrack-
       ing can "jump back" to the left of the entire atomic  group  or	asser-
       tion.  (Remember	also, as stated	above, that this localization also ap-
       plies in	subroutine calls.)

       These verbs differ in exactly what kind of failure  occurs  when	 back-
       tracking	 reaches  them.	 The behaviour described below is what happens
       when the verb is not in a subroutine or an assertion. Subsequent sec-
       tions cover these special cases.

       (*COMMIT)

       This  verb, which may not be followed by	a name,	causes the whole match
       to fail outright	if there is a later matching failure that causes back-
       tracking	to reach it. Even if the pattern is unanchored,	no further at-
       tempts to find a	match by advancing the starting	point take  place.  If
       (*COMMIT)  is  the  only	backtracking verb that is encountered, once it
       has been	passed re:run/{2,3} is committed to finding  a	match  at  the
       current starting point, or not at all. For example:

       a+(*COMMIT)b

       This  matches  "xxaab" but not "aacaab".	It can be thought of as	a kind
       of dynamic anchor, or "I've started, so I must finish." The name	of the
       most  recently passed (*MARK) in	the path is passed back	when (*COMMIT)
       forces a	match failure.
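
       The two subjects can be tried against the pattern a+(*COMMIT)b with a
       sketch like this. The module name is hypothetical.

```erlang
%% Hypothetical module, for illustration only.
-module(re_commit_demo).
-export([run/0]).

%% In "aacaab" the first attempt passes (*COMMIT) after "aa" and then
%% fails at "c", so no later starting point is tried.
run() ->
    Pattern = "a+(*COMMIT)b",
    Opts = [{capture, first, list}],
    {re:run("xxaab", Pattern, Opts), re:run("aacaab", Pattern, Opts)}.
```

       Calling run() returns {{match,["aab"]},nomatch}.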

       If there	is more	than one backtracking verb in a	pattern,  a  different
       one  that  follows  (*COMMIT) may be triggered first, so	merely passing
       (*COMMIT) during	a match	does not always	guarantee that a match must be
       at this starting	point.

       Note that (*COMMIT) at the start	of a pattern is	not the	same as	an an-
       chor, unless PCRE's start-of-match optimizations	 are  turned  off,  as
       shown in this example:

         1> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list}]).
         {match,["abc"]}
         2> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list},no_start_optimize]).
         nomatch

       PCRE  knows  that  any  match  must start with "a", so the optimization
       skips along the subject to "a" before running the first match  attempt,
       which succeeds. When the	optimization is	disabled by the	no_start_opti-
       mize option, the	match starts at	"x" and	so the (*COMMIT) causes	it  to
       fail without trying any other starting points.

       (*PRUNE)	or (*PRUNE:NAME)

       This  verb causes the match to fail at the current starting position in
       the subject if there is a later matching	failure	that causes backtrack-
       ing  to	reach it. If the pattern is unanchored,	the normal "bumpalong"
       advance to the next starting character then happens.  Backtracking  can
       occur  as  usual	to the left of (*PRUNE), before	it is reached, or when
       matching	to the right of	(*PRUNE), but if there	is  no	match  to  the
       right,  backtracking cannot cross (*PRUNE). In simple cases, the	use of
       (*PRUNE)	is just	an alternative to an atomic group or possessive	 quan-
       tifier, but there are some uses of (*PRUNE) that	cannot be expressed in
       any other way. In an anchored pattern (*PRUNE) has the same effect as
       (*COMMIT).

       The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
       It is like (*MARK:NAME) in that the name is remembered for passing back
       to the caller. However, (*SKIP:NAME) searches only for names set with
       (*MARK).

       The fact	that (*PRUNE:NAME) remembers the name is useless to the	Erlang
       programmer, as names can	not be retrieved.
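
       The simple case, where (*PRUNE) acts like a possessive quantifier, can
       be sketched as follows. The module name prune_demo, the function run/0
       and the subject "aaab" are assumptions chosen for illustration.

```erlang
-module(prune_demo).
-export([run/0]).

%% Illustrative module; only re:run/3 is a real stdlib call.
run() ->
    Subject = "aaab",
    Opts = [{capture, first, list}],
    %% Plain pattern: a+ gives back one "a" so that "ab" can match.
    Plain = re:run(Subject, "a+ab", Opts),
    %% With (*PRUNE), backtracking into a+ is blocked at every starting
    %% position, so no match is found.
    Pruned = re:run(Subject, "a+(*PRUNE)ab", Opts),
    %% The possessive quantifier forbids the same backtracking.
    Possessive = re:run(Subject, "a++ab", Opts),
    [Plain, Pruned, Possessive].
```

       Calling prune_demo:run() should return
       [{match,["aaab"]},nomatch,nomatch].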


       (*SKIP)

       This verb, when given without a name, is like (*PRUNE), except that if
       the pattern is unanchored, the "bumpalong" advance is not to  the  next
       character, but to the position in the subject where (*SKIP) was encoun-
       tered. (*SKIP) signifies	that whatever text was matched leading	up  to
       it cannot be part of a successful match.	Consider:


       a+(*SKIP)b

       If the subject is "aaaac...", after the first match attempt fails
       (starting at the	first character	in the	string),  the  starting	 point
       skips on to start the next attempt at "c". Note that a possessive
       quantifier does not have the same effect as this example; although it
       would suppress backtracking during the first match attempt, the second
       attempt would start at the second character instead of skipping on to
       "c".


       (*SKIP:NAME)

       When (*SKIP) has an associated name, its behaviour is modified. When it
       is triggered, the previous path through the pattern is searched for the
       most  recent  (*MARK)  that  has	 the  same  name. If one is found, the
       "bumpalong" advance is to the subject position that corresponds to that
       (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with
       a matching name is found, the (*SKIP) is	ignored.

       Note that (*SKIP:NAME) searches only for	names set by (*MARK:NAME).  It
       ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME).
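
       The three ways (*SKIP) can choose the next starting point can be
       observed with a pattern whose second alternative matches at whichever
       position the matcher resumes from. This is a sketch only: skip_demo,
       run/0, the mark names m and nope, and the subject "aaaac" are
       assumptions made for the example.

```erlang
-module(skip_demo).
-export([run/0]).

%% Illustrative module; only re:run/3 is a real stdlib call.
run() ->
    Subject = "aaaac",
    Opts = [{capture, first, list}],
    %% Unnamed: after "b" fails, the next attempt starts where (*SKIP) was
    %% encountered (offset 3), so the second alternative matches "ac".
    Unnamed = re:run(Subject, "aaa(*SKIP)b|a+c", Opts),
    %% Named with a matching (*MARK): the next attempt starts at the mark,
    %% one character after the previous start, ending with "aac".
    Named = re:run(Subject, "a(*MARK:m)aa(*SKIP:m)b|a+c", Opts),
    %% Named without a matching (*MARK): the verb is ignored, so ordinary
    %% backtracking reaches the second alternative at offset 0.
    Ignored = re:run(Subject, "aaa(*SKIP:nope)b|a+c", Opts),
    [Unnamed, Named, Ignored].
```

       Calling skip_demo:run() should return
       [{match,["ac"]},{match,["aac"]},{match,["aaaac"]}].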

       (*THEN) or (*THEN:NAME)

       This  verb  causes  a skip to the next innermost	alternative when back-
       tracking	reaches	it. That  is,  it  cancels  any	 further  backtracking
       within  the  current  alternative.  Its name comes from the observation
       that it can be used for a pattern-based if-then-else block:

       ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...

       If the COND1 pattern matches, FOO is tried (and possibly	further	 items
       after  the  end	of the group if	FOO succeeds); on failure, the matcher
       skips to	the second alternative and tries COND2,	 without  backtracking
       into  COND1.  If	that succeeds and BAR fails, COND3 is tried. If	subse-
       quently BAZ fails, there	are no more alternatives, so there is a	 back-
       track  to  whatever came	before the entire group. If (*THEN) is not in-
       side an alternation, it acts like (*PRUNE).
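
       A small instance of the if-then-else shape, with COND1 = a+, FOO = ab
       and a fallback alternative aa, might look like this. The module name
       then_demo, the function run/0 and the subject are illustrative
       assumptions.

```erlang
-module(then_demo).
-export([run/0]).

%% Illustrative module; only re:run/3 is a real stdlib call.
run() ->
    Subject = "aaab",
    Opts = [{capture, first, list}],
    %% Without the verb, a+ gives back one "a" and the first alternative
    %% matches the whole subject.
    Plain = re:run(Subject, "a+ab|aa", Opts),
    %% With (*THEN), the failure after a+ jumps straight to the second
    %% alternative instead of backtracking into a+.
    WithThen = re:run(Subject, "a+(*THEN)ab|aa", Opts),
    [Plain, WithThen].
```

       Calling then_demo:run() should return
       [{match,["aaab"]},{match,["aa"]}].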

       The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN). It
       is like (*MARK:NAME) in that the name is remembered for passing back to
       the caller. However, (*SKIP:NAME) searches only for names set with
       (*MARK).

       The  fact that (*THEN:NAME) remembers the name is useless to the	Erlang
       programmer, as names can	not be retrieved.

       A subpattern that does not contain a | character	is just	a part of  the
       enclosing alternative; it is not	a nested alternation with only one al-
       ternative. The effect of	(*THEN)	extends	beyond such  a	subpattern  to
       the  enclosing alternative. Consider this pattern, where	A, B, etc. are
       complex pattern fragments that do not contain any | characters at this
       level:

       A (B(*THEN)C) | D

       If  A and B are matched,	but there is a failure in C, matching does not
       backtrack into A; instead it moves to the next alternative, that	is, D.
       However,	 if the	subpattern containing (*THEN) is given an alternative,
       it behaves differently:

       A (B(*THEN)C | (*FAIL)) | D

       The effect of (*THEN) is	now confined to	the inner subpattern. After  a
       failure in C, matching moves to (*FAIL),	which causes the whole subpat-
       tern to fail because there are no more alternatives  to	try.  In  this
       case, matching does now backtrack into A.
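
       The two shapes can be compared with concrete fragments. In this sketch
       A = a+?, B = a, C = c and D = aaa; these choices, the subject "aaac"
       and the names then_scope_demo and run/0 are assumptions made for
       illustration.

```erlang
-module(then_scope_demo).
-export([run/0]).

%% Illustrative module; only re:run/3 is a real stdlib call.
run() ->
    Subject = "aaac",
    Opts = [{capture, first, list}],
    %% Shape A (B(*THEN)C) | D: the failure in C skips to D without
    %% backtracking into A, so D matches "aaa".
    NoInnerAlt = re:run(Subject, "a+?(a(*THEN)c)|aaa", Opts),
    %% Shape A (B(*THEN)C | (*FAIL)) | D: (*THEN) is confined to the inner
    %% group, which fails, so A is retried with one more "a".
    InnerAlt = re:run(Subject, "a+?(a(*THEN)c|(*FAIL))|aaa", Opts),
    [NoInnerAlt, InnerAlt].
```

       Calling then_scope_demo:run() should return
       [{match,["aaa"]},{match,["aaac"]}].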

       Note  that a conditional	subpattern is not considered as	having two al-
       ternatives, because only	one is ever used. In other words, the |	 char-
       acter  in  a  conditional  subpattern has a different meaning. Ignoring
       white space, consider:

       ^.*? (?(?=a) a |	b(*THEN)c )

       If the subject is "ba", this pattern does not match. Because .*?	is un-
       greedy,	it initially matches zero characters. The condition (?=a) then
       fails, the character "b"	is matched, but	"c" is	not.  At  this	point,
       matching	 does  not  backtrack to .*? as	might perhaps be expected from
       the presence of the | character.	The conditional	subpattern is part  of
       the  single  alternative	 that  comprises the whole pattern, and	so the
       match fails. (If	there was a backtrack into .*?,	allowing it  to	 match
       "b", the	match would succeed.)

       The  verbs just described provide four different	"strengths" of control
       when subsequent matching	fails. (*THEN) is the weakest, carrying	on the
       match  at  the next alternative.	(*PRUNE) comes next, failing the match
       at the current starting position, but allowing an advance to  the  next
       character  (for an unanchored pattern). (*SKIP) is similar, except that
       the advance may be more than one	character. (*COMMIT) is	the strongest,
       causing the entire match	to fail.
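
       The four strengths can be put side by side by placing each verb in the
       same position of one pattern. The pattern family aaa(*VERB)b|a+c, the
       subject "aaaac" and the names verb_strength_demo and run/0 are
       assumptions chosen so that each verb gives a different result.

```erlang
-module(verb_strength_demo).
-export([run/0]).

%% Illustrative module; only re:run/3 is a real stdlib call.
run() ->
    Opts = [{capture, first, list}],
    %% Build aaa(*VERB)b|a+c for each verb and match it against "aaaac".
    %% In every case "aaa" matches and "b" then fails, so the verb is
    %% backtracked onto during the first match attempt.
    [{Verb, re:run("aaaac", "aaa(*" ++ Verb ++ ")b|a+c", Opts)}
     || Verb <- ["THEN", "PRUNE", "SKIP", "COMMIT"]].
```

       Calling verb_strength_demo:run() should return "aaaac" for THEN (the
       second alternative at the same start), "aac" for PRUNE (start advanced
       by one character at a time), "ac" for SKIP (start moved to the verb's
       position) and nomatch for COMMIT.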

       More than one backtracking verb

       If  more	 than  one  backtracking verb is present in a pattern, the one
       that is backtracked onto	first acts. For	example,  consider  this  pat-
       tern, where A, B, etc. are complex pattern fragments:


       (A(*COMMIT)B(*THEN)C|ABD)

       If A matches but B fails, the backtrack to (*COMMIT) causes the entire
       match to	fail. However, if A and	B match, but C fails, the backtrack to
       (*THEN)	causes	the next alternative (ABD) to be tried.	This behaviour
       is consistent, but is not always	the same as Perl's. It means  that  if
       two or more backtracking verbs appear in succession, all but the last
       of them have no effect. Consider this example:


       ...(*COMMIT)(*PRUNE)...

       If there is a matching failure to the right, backtracking onto (*PRUNE)
       causes it to be triggered, and its action is taken. There can never be a
       backtrack onto (*COMMIT).
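
       Both situations can be tried with single characters standing in for the
       fragments A, B, C and D. The substitutions, the subjects and the names
       multi_verb_demo and run/0 are assumptions made for this sketch.

```erlang
-module(multi_verb_demo).
-export([run/0]).

%% Illustrative module; only re:run/3 is a real stdlib call.
run() ->
    Opts = [{capture, first, list}],
    Pattern = "(a(*COMMIT)b(*THEN)c|abd)",
    %% A and B match, C fails: (*THEN) is reached first and selects "abd".
    ThenFirst = re:run("abd", Pattern, Opts),
    %% A matches, B fails: (*COMMIT) is reached and the whole match fails,
    %% although "abd" occurs later in the subject.
    CommitFirst = re:run("acabd", Pattern, Opts),
    %% Adjacent verbs: only (*PRUNE) can be backtracked onto, so the start
    %% point still advances and "aab" is found.
    Adjacent = re:run("aacaab", "a+(*COMMIT)(*PRUNE)b", Opts),
    [ThenFirst, CommitFirst, Adjacent].
```

       Calling multi_verb_demo:run() should return
       [{match,["abd"]},nomatch,{match,["aab"]}].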

       Backtracking verbs in repeated groups

       PCRE differs from Perl in its handling of  backtracking	verbs  in  re-
       peated groups. For example, consider:


       /(a(*COMMIT)b)+ac/

       If the subject is "abac", Perl matches, but PCRE fails because the
       (*COMMIT) in the	second repeat of the group acts.
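
       The PCRE side of this difference can be checked from Erlang. In this
       sketch the names repeat_verb_demo and run/0 are illustrative, and the
       second call uses the same pattern without the verb for comparison.

```erlang
-module(repeat_verb_demo).
-export([run/0]).

%% Illustrative module; only re:run/3 is a real stdlib call.
run() ->
    Opts = [{capture, first, list}],
    %% The second repeat of the group matches "a", passes (*COMMIT), and then
    %% fails on "b"; backtracking onto the verb fails the whole match.
    WithCommit = re:run("abac", "(a(*COMMIT)b)+ac", Opts),
    %% Without the verb, the group matches once and "ac" follows.
    Plain = re:run("abac", "(ab)+ac", Opts),
    [WithCommit, Plain].
```

       Calling repeat_verb_demo:run() should return
       [nomatch,{match,["abac"]}].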

       Backtracking verbs in assertions

       (*FAIL) in an assertion has its normal effect: it forces an immediate
       backtrack.

       (*ACCEPT) in a positive assertion causes	the assertion to succeed with-
       out any further processing. In a	negative assertion,  (*ACCEPT)	causes
       the assertion to	fail without any further processing.

       The  other  backtracking	verbs are not treated specially	if they	appear
       in a positive assertion.	In particular, (*THEN) skips to	the  next  al-
       ternative  in  the  innermost  enclosing	 group	that has alternations,
       whether or not this is within the assertion.

       Negative	assertions are,	however, different, in order  to  ensure  that
       changing	a positive assertion into a negative assertion changes its re-
       sult. Backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a	 nega-
       tive  assertion to be true, without considering any further alternative
       branches	in the assertion. Backtracking into (*THEN) causes it to  skip
       to  the next enclosing alternative within the assertion (the normal be-
       haviour), but if	the assertion  does  not  have	such  an  alternative,
       (*THEN) behaves like (*PRUNE).
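
       The rule for negative assertions can be illustrated with an assertion
       that has two branches. This is a sketch under the behaviour described
       above; the patterns, the subject "abd" and the names assert_verb_demo
       and run/0 are assumptions.

```erlang
-module(assert_verb_demo).
-export([run/0]).

%% Illustrative module; only re:run/3 is a real stdlib call.
run() ->
    Opts = [{capture, first, list}],
    %% The second branch "bd" matches, so the negative assertion is false
    %% and the overall match fails.
    Plain = re:run("abd", "a(?!bc|bd)", Opts),
    %% "c" fails after (*COMMIT); backtracking onto the verb makes the
    %% negative assertion true without trying the "bd" branch.
    WithCommit = re:run("abd", "a(?!b(*COMMIT)c|bd)", Opts),
    [Plain, WithCommit].
```

       Calling assert_verb_demo:run() should return [nomatch,{match,["a"]}].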

       Backtracking verbs in subroutines

       These  behaviours  occur	whether	or not the subpattern is called	recur-
       sively. Perl's treatment	of subroutines is different in some cases.

       (*FAIL) in a subpattern called as a subroutine has its  normal  effect:
       it forces an immediate backtrack.

       (*ACCEPT)  in a subpattern called as a subroutine causes	the subroutine
       match to	succeed	without	any further processing.	Matching then  contin-
       ues after the subroutine	call.

       (*COMMIT), (*SKIP), and (*PRUNE)	in a subpattern	called as a subroutine
       cause the subroutine match to fail.

       (*THEN) skips to	the next alternative in	the innermost enclosing	 group
       within  the subpattern that has alternatives. If	there is no such group
       within the subpattern, (*THEN) causes the subroutine match to fail.
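
       The localization of (*COMMIT) inside a subroutine can be sketched by
       comparing an inline group with the same group called through (?1). The
       use of a (?(DEFINE)...) group, the subject "ac" and the names
       sub_verb_demo and run/0 are assumptions made for this example.

```erlang
-module(sub_verb_demo).
-export([run/0]).

%% Illustrative module; only re:run/3 is a real stdlib call.
run() ->
    Opts = [{capture, first, list}],
    %% Inline: backtracking onto (*COMMIT) fails the whole match.
    Inline = re:run("ac", "(?:a(*COMMIT)b|ac)", Opts),
    %% Group 1 is only defined, then called with (?1): the verb makes just
    %% the subroutine call fail, so the "ac" alternative is still tried.
    Called = re:run("ac", "(?(DEFINE)(a(*COMMIT)b))(?:(?1)|ac)", Opts),
    [Inline, Called].
```

       Calling sub_verb_demo:run() should return [nomatch,{match,["ac"]}].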

Ericsson AB			  stdlib 2.4				 re(3)

