re(3)			   Erlang Module Definition			 re(3)

       re - Perl-like regular expressions for Erlang

       This  module contains regular expression	matching functions for strings
       and binaries.

       The regular expression syntax and semantics resemble those of Perl.

       The library's matching algorithms  are  currently  based	 on  the  PCRE
       library,	 but  not all of the PCRE library is interfaced	and some parts
       of the library go beyond	what PCRE offers. The  sections	 of  the  PCRE
       documentation which are relevant	to this	module are included here.

       The  Erlang literal syntax for strings uses the "\" (backslash) charac-
       ter as an escape	code.  You  need  to  escape  backslashes  in  literal
       strings,	 both  in your code and	in the shell, with an additional back-
       slash, i.e.: "\\".
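As a minimal sketch of the escaping rule above (the subjects and patterns are made up for illustration), the following shell expressions use pattern matching as assertions. The regular expression \d has to be written "\\d" in an Erlang string, and a regular expression that matches one literal backslash needs four backslashes in source:

```erlang
%% "\\d" is the two-character regular expression \d (a digit).
{match, [{1,1}]} = re:run("a1", "\\d").

%% The subject "a\\b" is the three characters a, \ and b. The pattern
%% "\\\\" is the regular expression \\, which matches one backslash.
{match, [{1,1}]} = re:run("a\\b", "\\\\").
```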

       mp() = {re_pattern, term(), term(), term(), term()}

	      Opaque datatype containing a compiled  regular  expression.  The
	      mp()  is guaranteed to be	a tuple() having the atom 're_pattern'
	      as its first element, to allow for matching in guards. The arity
	      of  the tuple() or the content of	the other fields may change in
	      future releases.
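Because only the re_pattern tag is guaranteed, a guard is probably best written against the first tuple element rather than the full tuple shape. A small sketch, where IsCompiled is a hypothetical helper name and not part of the module:

```erlang
%% Hypothetical helper: true for a compiled pattern, false otherwise.
%% Only the first tuple element is inspected, since the arity may change.
IsCompiled = fun(T) when is_tuple(T), element(1, T) =:= re_pattern -> true;
                (_) -> false
             end.

{ok, MP} = re:compile("a+").
true  = IsCompiled(MP).
false = IsCompiled("a+").
```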

       nl_spec() = cr |	crlf | lf | anycrlf | any

       compile_option()	= unicode
			| anchored
			| caseless
			| dollar_endonly
			| dotall
			| extended
			| firstline
			| multiline
			| no_auto_capture
			| dupnames
			| ungreedy
			| {newline, nl_spec()}
			| bsr_anycrlf
			| bsr_unicode
			| no_start_optimize
			| ucp
			| never_utf

       compile(Regexp) -> {ok, MP} | {error, ErrSpec}


		 Regexp	= iodata()
		 MP = mp()
		 ErrSpec =
		     {ErrString	:: string(), Position :: integer() >= 0}

	      The same as compile(Regexp, []).

       compile(Regexp, Options)	-> {ok,	MP} | {error, ErrSpec}


		 Regexp	= iodata() | unicode:charlist()
		 Options = [Option]
		 Option	= compile_option()
		 MP = mp()
		 ErrSpec =
		     {ErrString	:: string(), Position :: integer() >= 0}

	      This function compiles a	regular	 expression  with  the	syntax
	      described	 below	into  an internal format to be used later as a
	      parameter	to the run/2,3 functions.

	      Compiling	the regular expression before matching	is  useful  if
	      the  same	 expression is to be used in matching against multiple
	      subjects during the program's lifetime. Compiling	once and  exe-
	      cuting many times	is far more efficient than compiling each time
	      one wants	to match.
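The compile-once pattern described above might look as follows. This is only a sketch: the pattern, the variable names and the subjects are assumptions chosen for illustration.

```erlang
%% Compile once ...
{ok, Digits} = re:compile("[0-9]+").

%% ... then run the compiled pattern against several subjects.
Subjects = ["a1", "b22", "c"].
[{match, ["1"]}, {match, ["22"]}, nomatch] =
    [re:run(S, Digits, [{capture, first, list}]) || S <- Subjects].
```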

	      When the unicode option is given, the regular expression should
	      be given as a valid Unicode charlist(), otherwise as any valid
	      iodata().

	      The options have the following meanings:

		unicode:
		  The regular expression is given as a Unicode charlist()  and
		  the resulting	regular	expression code	is to be run against a
		  valid	Unicode	charlist()  subject.  Also  consider  the  ucp
		  option when using Unicode characters.

		anchored:
		  The pattern is forced to be "anchored", that is, it is con-
		  strained to match only at the first matching point in the
		  string  that	is being searched (the "subject	string"). This
		  effect can also be achieved by appropriate constructs	in the
		  pattern itself.

		caseless:
		  Letters in the pattern match both upper and lower case let-
		  ters. It is equivalent to Perl's /i option, and it can be
		  changed within a pattern by a	(?i) option setting. Uppercase
		  and lowercase	letters	are defined as in the ISO-8859-1 char-
		  acter	set.

		dollar_endonly:
		  A dollar metacharacter in the pattern matches only at the
		  end of the subject string. Without  this  option,  a	dollar
		  also	matches	immediately before a newline at	the end	of the
		  string  (but	not  before  any  other	 newlines).  The  dol-
		  lar_endonly  option  is ignored if multiline is given. There
		  is no	equivalent option in Perl, and no way to set it	within
		  a pattern.

		dotall:
		  A dot in the pattern matches all characters, including those
		  that indicate	newline. Without it, a dot does	not match when
		  the current position is at a newline.	This option is equiva-
		  lent to Perl's /s option, and	it can	be  changed  within  a
		  pattern  by  a (?s) option setting. A	negative class such as
		  [^a] always matches newline characters, independent of  this
		  option's setting.

		extended:
		  Whitespace data characters in the pattern are ignored except
		  when escaped or inside a character  class.  Whitespace  does
		  not  include the VT character	(ASCII 11). In addition, char-
		  acters between an unescaped #	outside	a character class  and
		  the  next  newline,  inclusive,  are	also  ignored. This is
		  equivalent to	Perl's /x option, and it can be	changed	within
		  a  pattern  by  a  (?x) option setting. This option makes it
		  possible to include comments	inside	complicated  patterns.
		  Note,	 however,  that	 this applies only to data characters.
		  Whitespace characters	may never appear within	special	 char-
		  acter	 sequences  in	a  pattern,  for  example  within  the
		  sequence (?( which introduces	a conditional subpattern.

		firstline:
		  An unanchored pattern is required to match before or at the
		  first	newline	in the subject string, though the matched text
		  may continue over the	newline.

		multiline:
		  By default, PCRE treats the subject string as consisting of
		  a  single  line  of characters (even if it actually contains
		  newlines). The "start	of  line"  metacharacter  (^)  matches
		  only	at  the	 start	of the string, while the "end of line"
		  metacharacter	($) matches only at the	end of the string,  or
		  before  a  terminating  newline  (unless  dollar_endonly  is
		  given). This is the same as Perl.

		  When multiline is given, the "start of  line"	 and  "end  of
		  line"	 constructs match immediately following	or immediately
		  before internal newlines  in	the  subject  string,  respec-
		  tively, as well as at	the very start and end.	This is	equiv-
		  alent	to Perl's /m option, and it can	be  changed  within  a
		  pattern  by  a (?m) option setting. If there are no newlines
		  in a subject string, or no occurrences of ^ or $ in  a  pat-
		  tern,	setting	multiline has no effect.

		no_auto_capture:
		  Disables the use of numbered capturing parentheses in the
		  pattern. Any opening parenthesis that is not followed by ?
		  behaves as if it were followed by ?: but named parentheses
		  can still be used for capturing (and they acquire numbers in
		  the usual way). There is no equivalent of this option in
		  Perl.

		dupnames:
		  Names used to identify capturing subpatterns need not be
		  unique. This can be helpful for certain types of pattern
		  when it is known that only one instance of the named subpat-
		  tern can ever be matched. There are more details of named
		  subpatterns below.

		ungreedy:
		  This option inverts the "greediness" of the quantifiers so
		  that	they  are  not greedy by default, but become greedy if
		  followed by "?". It is not compatible	with Perl. It can also
		  be set by a (?U) option setting within the pattern.

		{newline, NLSpec}:
		  Override  the	default	definition of a	newline	in the subject
		  string, which	is LF (ASCII 10) in Erlang.

		  cr:
		    Newline is indicated by a single character CR (ASCII 13).

		  lf:
		    Newline is indicated by a single character LF (ASCII 10),
		    the default.

		  crlf:
		    Newline is indicated by the two-character CRLF (ASCII 13
		    followed by ASCII 10) sequence.

		  anycrlf:
		    Any of the three preceding sequences should be recognized.

		  any:
		    Any of the newline sequences above, plus the Unicode
		    sequences VT (vertical tab, U+000B), FF (formfeed,
		    U+000C), NEL (next line, U+0085), LS (line separator,
		    U+2028), and PS (paragraph separator, U+2029).

		bsr_anycrlf:
		  Specifies that \R is to match only the cr, lf or crlf
		  sequences, not the Unicode-specific newline characters.

		bsr_unicode:
		  Specifies that \R is to match all the Unicode newline
		  characters (including crlf etc., the default).

		no_start_optimize:
		  This option disables optimization that may malfunction if
		  "Special start-of-pattern items" are present in the  regular
		  expression.	A  typical  example  would  be	when  matching
		  "DEFABC" against "(*COMMIT)ABC", where the  start  optimiza-
		  tion	of PCRE	would skip the subject up to the "A" and would
		  never	realize	that the  (*COMMIT)  instruction  should  have
		  made	the matching fail. This	option is only relevant	if you
		  use "start-of-pattern	items",	as discussed  in  the  section
		  "PCRE	regular	expression details" below.

		ucp:
		  Specifies that Unicode Character Properties should be used
		  when resolving \B, \b, \D, \d, \S, \s, \W  and  \w.  Without
		  this	flag, only ISO-Latin-1 properties are used. Using Uni-
		  code properties hurts	performance, but is semantically  cor-
		  rect	when  working  with Unicode characters beyond the ISO-
		  Latin-1 range.

		never_utf:
		  Specifies that the (*UTF) and/or (*UTF8) "start-of-pattern
		  items" are forbidden. This flag cannot be combined with
		  unicode. Useful if ISO-Latin-1 patterns from an external
		  source are to be compiled.
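A few of the compile options above can be tried directly through re:run/3, which accepts them when the regular expression is not pre-compiled. The subjects and patterns below are made up for illustration, and each pattern match doubles as an assertion in the shell:

```erlang
%% caseless: letters match regardless of case.
{match, [{0,3}]} = re:run("ABC", "abc", [caseless]).

%% dotall: without it, a dot does not match a newline.
nomatch          = re:run("a\nb", "a.b").
{match, [{0,3}]} = re:run("a\nb", "a.b", [dotall]).

%% multiline: ^ and $ also match at internal newlines.
nomatch          = re:run("a\nb", "^b$").
{match, [{2,1}]} = re:run("a\nb", "^b$", [multiline]).

%% dollar_endonly: $ no longer matches before a final newline.
{match, [{0,3}]} = re:run("abc\n", "abc$").
nomatch          = re:run("abc\n", "abc$", [dollar_endonly]).

%% extended: unescaped whitespace and # comments in the pattern are ignored.
{match, [{0,2}]} = re:run("ab", "a b  # trailing comment", [extended]).

%% ungreedy: quantifiers match as little as possible by default.
{match, [{0,3}]} = re:run("aaa", "a+").
{match, [{0,1}]} = re:run("aaa", "a+", [ungreedy]).
```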

       inspect(MP, Item) -> {namelist, [binary()]}


		 MP = mp()
		 Item =	namelist

	      This  function  takes a compiled regular expression and an item,
	      returning	the relevant data from the  regular  expression.  Cur-
	      rently  the  only	 supported item	is namelist, which returns the
	      tuple {namelist, [  binary()]},  containing  the	names  of  all
	      (unique) named subpatterns in the	regular	expression.


	      1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
	      2> re:inspect(MP,namelist).
	      {namelist,[<<"A">>,<<"B">>,<<"C">>]}
	      3> {ok,MPD} = re:compile("(?<C>A)|(?<B>B)|(?<C>C)",[dupnames]).
	      4> re:inspect(MPD,namelist).
	      {namelist,[<<"B">>,<<"C">>]}

	      Note  specifically in the	second example that the	duplicate name
	      only occurs once in the returned list, and that the list	is  in
	      alphabetical  order regardless of	where the names	are positioned
	      in the regular expression. The order of the names	is the same as
	      the  order of captured subexpressions if {capture, all_names} is
	      given as an option to re:run/3. You can therefore	create a name-
	      to-value mapping from the	result of re:run/3 like	this:

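	      1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
	      2> {namelist, N} = re:inspect(MP,namelist).
	      {namelist,[<<"A">>,<<"B">>,<<"C">>]}
	      3> {match,L} = re:run("AA",MP,[{capture,all_names,binary}]).
	      {match,[<<"A">>,<<>>,<<>>]}
	      4> NameMap = lists:zip(N,L).
	      [{<<"A">>,<<"A">>},{<<"B">>,<<>>},{<<"C">>,<<>>}]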

	      More items are expected to be added in the future.

       run(Subject, RE)	-> {match, Captured} | nomatch


		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata()
		 Captured = [CaptureData]
		 CaptureData = {integer(), integer()}

	      The same as run(Subject,RE,[]).

       run(Subject, RE,	Options) ->
	      {match, Captured}	| match	| nomatch | {error, ErrType}


		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata() |	unicode:charlist()
		 Options = [Option]
		 Option	= anchored
			| global
			| notbol
			| noteol
			| notempty
			| notempty_atstart
			| report_errors
			| {offset, integer() >=	0}
			| {match_limit,	integer() >= 0}
			| {match_limit_recursion, integer() >= 0}
			| {newline, NLSpec :: nl_spec()}
			| bsr_anycrlf
			| bsr_unicode
			| {capture, ValueSpec}
			| {capture, ValueSpec, Type}
			| CompileOpt
		 Type =	index |	list | binary
		 ValueSpec = all
			   | all_but_first
			   | all_names
			   | first
			   | none
			   | ValueList
		 ValueList = [ValueID]
		 ValueID = integer() | string()	| atom()
		 CompileOpt = compile_option()
		   See compile/2 above.
		 Captured = [CaptureData] | [[CaptureData]]
		 CaptureData = {integer(), integer()}
			     | ListConversionData
			     | binary()
		 ListConversionData = string()
				    | {error, string(),	binary()}
				    | {incomplete, string(), binary()}
		 ErrType = match_limit
			 | match_limit_recursion
			 | {compile, CompileErr}
		 CompileErr =
		     {ErrString	:: string(), Position :: integer() >= 0}

	      Executes a regular expression match, returning match or {match,
	      Captured} on success, or nomatch otherwise. The regular
	      expression can be given either as iodata(), in which case it is
	      automatically compiled (as by re:compile/2) and executed, or as
	      a pre-compiled mp(), in which case it is executed against the
	      subject directly.

	      When  compilation	is involved, the exception badarg is thrown if
	      a	compilation error occurs. Call re:compile/2 to get information
	      about the	location of the	error in the regular expression.

	      If the regular expression is previously compiled, the option
	      list can only contain the options anchored, global, notbol,
	      noteol, report_errors, notempty, notempty_atstart, {offset,
	      integer() >= 0}, {match_limit, integer() >= 0},
	      {match_limit_recursion, integer() >= 0}, {newline, NLSpec} and
	      {capture, ValueSpec}/{capture, ValueSpec, Type}. Otherwise all
	      options valid for the re:compile/2 function are allowed as well.
	      Options allowed both for compilation and execution of a match,
	      namely anchored and {newline, NLSpec}, affect both the
	      compilation and the execution if given together with a regular
	      expression that is not pre-compiled.

	      If the regular expression was previously compiled with the
	      option unicode, the Subject should be provided as a valid Uni-
	      code charlist(), otherwise any iodata() will do. If compilation
	      is involved and the option unicode is given, both the Subject
	      and the regular expression should be given as valid Unicode
	      charlist()s.

	      The {capture, ValueSpec}/{capture, ValueSpec, Type} defines what
	      to  return  from the function upon successful matching. The cap-
	      ture tuple may contain both a value specification	telling	 which
	      of the captured substrings are to	be returned, and a type	speci-
	      fication,	telling	how captured substrings	are to be returned (as
	      index  tuples,  lists or binaries). The capture option makes the
	      function quite flexible and powerful. The	different options  are
	      described	in detail below.

	      If  the  capture options describe	that no	substring capturing at
	      all is to	be done	({capture, none}), the	function  will	return
	      the  single  atom	 match upon successful matching, otherwise the
	      tuple {match, ValueList} is returned. Disabling capturing	can be
	      done either by specifying	none or	an empty list as ValueSpec.

	      The  report_errors  option  adds	the  possibility that an error
	      tuple is returned. The tuple will	 either	 indicate  a  matching
	      error  (match_limit  or  match_limit_recursion) or a compilation
	      error, where the error tuple has the  format  {error,  {compile,
	      CompileErr}}.  Note  that	 if  the  option  report_errors	is not
	      given, the function never	returns	error tuples, but will	report
	      compilation  errors as a badarg exception	and failed matches due
	      to exceeded match	limits simply as nomatch.
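The difference report_errors makes for compilation errors can be sketched as follows. The invalid pattern "(" is an assumption chosen for illustration, and the error string itself is deliberately not asserted:

```erlang
%% Without report_errors, a pattern that fails to compile raises badarg.
badarg = try re:run("abc", "(") catch error:badarg -> badarg end.

%% With report_errors, the same failure is returned as a value.
{error, {compile, {ErrString, Position}}} =
    re:run("abc", "(", [report_errors]).
true = is_list(ErrString) andalso is_integer(Position).

%% re:compile/1 reports the error string and position as well.
{error, {_, _}} = re:compile("(").
```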

	      The options relevant for execution are:

		anchored:
		  Limits re:run/3 to matching at the first matching position.
		  If a pattern was compiled with anchored, or turned out to be
		  anchored by virtue of its contents, it cannot be made unan-
		  chored at matching time, hence there is no unanchored
		  option.

		global:
		  Implements global (repetitive) search (the g flag in Perl).
		  Each	match  is returned as a	separate list()	containing the
		  specific match as well as any	matching subexpressions	(or as
		  specified  by	 the capture option). The Captured part	of the
		  return value will hence be a list()  of  list()s  when  this
		  option is given.

		  The interaction of the global option with a regular expres-
		  sion which matches an empty string surprises some users.
		  When the global option is given, re:run/3 handles empty
		  matches in the same way as Perl: a zero-length match at any
		  point is retried with the options [anchored,
		  notempty_atstart] as well. If that search gives a result of
		  length > 0, the result is included. For example, consider
		  matching the subject "cat" against the regular expression
		  "(|at)" with the global option.


		  The following	matching will be performed:

		  At offset 0:
		    The	 regexp	(|at) will first match at the initial position
		    of the string cat, giving  the  result  set	 [{0,0},{0,0}]
		    (the  second  {0,0}	 is due	to the subexpression marked by
		    the	parentheses). As the length of	the  match  is	0,  we
		    don't advance to the next position yet.

		  At offset 0 with [anchored, notempty_atstart]:
		    The search is retried with the options [anchored,
		    notempty_atstart] at the same position, which does not
		    give any result of length > 0, so the search position is
		    now advanced to the next character (a).

		  At offset 1:
		    This time, the search results in [{1,0},{1,0}], so this
		    search will also be repeated with the extra options.

		  At offset 1 with [anchored, notempty_atstart]:
		    Now the at alternative is found and the result will be
		    [{1,2},{1,2}]. The result is added to the list of results
		    and the position in the search string is advanced two
		    steps.

		  At offset 3:
		    The search now once again matches the empty string, giving
		    [{3,0},{3,0}].

		  At offset 3 with [anchored, notempty_atstart]:
		    This will give no result of length > 0 and we are at the
		    last position, so the global search is complete.

		  The result of the call is
		  {match,[[{0,0},{0,0}],[{1,0},{1,0}],[{1,2},{1,2}],[{3,0},{3,0}]]}.
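The walkthrough above can be reproduced in the shell. The sketch below assumes the call is made with the subject "cat", the pattern "(|at)" and only the global option, which is what the steps describe:

```erlang
%% One inner list per match; each holds the whole match and subpattern 1.
{match, [[{0,0},{0,0}],
         [{1,0},{1,0}],
         [{1,2},{1,2}],
         [{3,0},{3,0}]]} = re:run("cat", "(|at)", [global]).
```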


		notempty:
		  An empty string is not considered to be a valid match if
		  this option is given. If there are alternatives in the pat-
		  tern, they are tried. If all the alternatives match the
		  empty string, the entire match fails. For example, if the
		  pattern a?b? is applied to a string not beginning with "a"
		  or "b", it would normally match the empty string at the
		  start of the subject. With the notempty option, this match
		  is not valid, so re:run/3 searches further into the string
		  for occurrences of "a" or "b".

		notempty_atstart:
		  This is like notempty, except that an empty string match
		  that is not at the start of the subject is permitted.	If the
		  pattern is anchored, such a match can	occur only if the pat-
		  tern contains	\K.

		  Perl	 has   no   direct   equivalent	  of	notempty    or
		  notempty_atstart,  but it does make a	special	case of	a pat-
		  tern match of	the empty string within	its split()  function,
		  and  when  using  the	/g modifier. It	is possible to emulate
		  Perl's behavior after	matching a null	string by first	trying
		  the match again at the same offset with notempty_atstart and
		  anchored, and	then, if that fails, by	advancing the starting
		  offset (see below) and trying	an ordinary match again.
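The effect of notempty and notempty_atstart can be sketched with a pattern that is able to match the empty string. The pattern "a?b?" and the subjects are assumptions chosen for illustration:

```erlang
%% By default the empty string matches at offset 0.
{match, [{0,0}]} = re:run("xab", "a?b?").

%% notempty rejects empty matches, so the search moves on to "ab".
{match, [{1,2}]} = re:run("xab", "a?b?", [notempty]).

%% With no "a" or "b" in the subject, only empty matches are possible.
nomatch          = re:run("xy", "a?b?", [notempty]).

%% notempty_atstart only rejects the empty match at the start offset.
{match, [{1,0}]} = re:run("xy", "a?b?", [notempty_atstart]).
```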

		notbol:
		  This option specifies that the first character of the sub-
		  ject string is not the beginning of a line, so the circum-
		  flex	metacharacter should not match before it. Setting this
		  without multiline (at	compile	time) causes circumflex	 never
		  to  match. This option only affects the behavior of the cir-
		  cumflex metacharacter. It does not affect \A.

		noteol:
		  This option specifies that the end of the subject string is
		  not  the  end	 of a line, so the dollar metacharacter	should
		  not match it nor (except in multiline	mode) a	newline	 imme-
		  diately  before  it. Setting this without multiline (at com-
		  pile time) causes dollar never to match. This	option affects
		  only	the  behavior of the dollar metacharacter. It does not
		  affect \Z or \z.
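A short sketch of notbol and noteol, using made-up subjects and patterns. It also shows that \A and \z are not affected:

```erlang
%% notbol: ^ no longer matches at the start of the subject ...
{match, [{0,1}]} = re:run("abc", "^a").
nomatch          = re:run("abc", "^a", [notbol]).
%% ... but \A is unaffected.
{match, [{0,1}]} = re:run("abc", "\\Aa", [notbol]).

%% noteol: $ no longer matches at the end of the subject ...
{match, [{2,1}]} = re:run("abc", "c$").
nomatch          = re:run("abc", "c$", [noteol]).
%% ... but \z is unaffected.
{match, [{2,1}]} = re:run("abc", "c\\z", [noteol]).
```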

		report_errors:
		  This option gives better control of the error handling in
		  re:run/3. When it is given, compilation errors (if the regu-
		  lar expression isn't already compiled) as well  as  run-time
		  errors are explicitly	returned as an error tuple.

		  The possible run-time	errors are:

		  match_limit:
		    The PCRE library sets a limit on how many times the inter-
		    nal match function can be called. The default value for
		    this is 10000000 in the library compiled for Erlang. If
		    {error, match_limit} is returned, it means that the execu-
		    tion of the regular expression has reached this limit.
		    Normally this is to be regarded as a nomatch, which is the
		    default return value when this happens, but by specifying
		    report_errors, you will get informed when the match fails
		    due to too many internal calls.

		  match_limit_recursion:
		    This error is very similar to match_limit, but occurs when
		    the internal match function of PCRE is "recursively"
		    called more times than the "match_limit_recursion" limit,
		    which is by default 10000000 as well. Note that as long as
		    the match_limit and match_limit_recursion values are kept
		    at the default values, the match_limit_recursion error
		    cannot occur, as the match_limit error will occur before
		    that (each recursive call is also a call, but not vice
		    versa). Both limits can however be changed, either by set-
		    ting limits directly in the regular expression string (see
		    reference section below) or by giving options to re:run/3.

		  It is important to understand that what is referred to as
		  "recursion" when limiting matches is not actually recursion
		  on the C stack of the Erlang machine, nor is it recursion
		  on the Erlang process stack. The version of PCRE compiled
		  into the Erlang VM uses machine "heap" memory to store
		  values that need to be kept over recursion in regular
		  expression matches.

		{match_limit, integer() >= 0}:
		  This	option	limits	the  execution	time  of a match in an
		  implementation-specific way. It is described in the  follow-
		  ing way by the PCRE documentation:

		The match_limit	field provides a means of preventing PCRE from using
		up a vast amount of resources when running patterns that are not going
		to match, but which have a very	large number of	possibilities in their
		search trees. The classic example is a pattern that uses nested
		unlimited repeats.

		Internally, pcre_exec()	uses a function	called match(),	which it calls
		repeatedly (sometimes recursively). The	limit set by match_limit is
		imposed	on the number of times this function is	called during a	match,
		which has the effect of	limiting the amount of backtracking that can
		take place. For	patterns that are not anchored,	the count restarts
		from zero for each position in the subject string.

		  This means that runaway regular expression matches can fail
		  faster if the limit is lowered using this option. The
		  default value compiled into the Erlang virtual machine is
		  10000000.

		This option does in no way affect the execution of the Erlang
		virtual machine in terms of "long running BIFs". re:run
		always gives control back to the scheduler of Erlang processes
		at intervals that ensure the real-time properties of the
		Erlang system.

		{match_limit_recursion, integer() >= 0}:
		  This option limits the execution time	and memory consumption
		  of  a	 match in an implementation-specific way, very similar
		  to match_limit. It is	described in the following way by  the
		  PCRE documentation:

		The match_limit_recursion field is similar to match_limit, but instead
		of limiting the total number of times that match() is called, it
		limits the depth of recursion. The recursion depth is a smaller number
		than the total number of calls, because not all calls to match() are
		recursive. This limit is of use only if it is set smaller than
		match_limit.

		Limiting the recursion depth limits the amount of machine stack that
		can be used, or, when PCRE has been compiled to use memory on the heap
		instead of the stack, the amount of heap memory that can be used.

		  The Erlang virtual machine uses a PCRE library where heap
		  memory is used when regular expression match recursion hap-
		  pens, which is why this limits the usage of machine heap,
		  not C stack.

		  Specifying a lower value may result  in  matches  with  deep
		  recursion failing, when they should actually have matched:

		1> re:run("aaaaaaaaaaaaaz","(a+)*z").
		2> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5}]).
		3> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5},report_errors]).
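Code that lowers the limit usually has to tell a genuine nomatch apart from a match that was cut short. One way to do that is sketched below. RunLimited is a hypothetical helper name, and which outcome a given limit produces depends on the pattern and on the PCRE build, so no result is asserted here:

```erlang
%% Hypothetical helper: run with a recursion limit and name the outcome.
%% report_errors makes an exceeded limit visible as {error, Reason}
%% instead of a plain nomatch.
RunLimited = fun(Subject, RE, Limit) ->
    Opts = [{match_limit_recursion, Limit}, report_errors],
    case re:run(Subject, RE, Opts) of
        {match, Captured} -> {matched, Captured};
        nomatch           -> no_match;
        {error, Reason}   -> {not_completed, Reason}
    end
end.

RunLimited("aaaaaaaaaaaaaz", "(a+)*z", 5).
```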

		  This option, as well as the match_limit option, should only
		  be used in very rare cases. Understanding of the PCRE
		  library internals is recommended before tampering with these
		  limits.

		{offset, integer() >= 0}:
		  Start	matching at the	offset (position) given	in the subject
		  string.  The	offset	is  zero-based,	so that	the default is
		  {offset,0} (all of the subject string).
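A minimal sketch of the offset option, with a made-up subject. Note that the reported index is still relative to the start of the whole subject:

```erlang
%% Default: matching starts at offset 0.
{match, [{0,3}]} = re:run("abcabc", "abc").

%% With {offset,1} the first "abc" can no longer be matched from its start.
{match, [{3,3}]} = re:run("abcabc", "abc", [{offset, 1}]).
```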

		{newline, NLSpec}:
		  Override the default definition of a newline in the  subject
		  string, which	is LF (ASCII 10) in Erlang.

		  cr:
		    Newline is indicated by a single character CR (ASCII 13).

		  lf:
		    Newline is indicated by a single character LF (ASCII 10),
		    the default.

		  crlf:
		    Newline is indicated by the two-character CRLF (ASCII 13
		    followed by ASCII 10) sequence.

		  anycrlf:
		    Any of the three preceding sequences should be recognized.

		  any:
		    Any of the newline sequences above, plus the Unicode
		    sequences VT (vertical tab, U+000B), FF (formfeed,
		    U+000C), NEL (next line, U+0085), LS (line separator,
		    U+2028), and PS (paragraph separator, U+2029).

		bsr_anycrlf:
		  Specifies that \R is to match only the cr, lf or crlf
		  sequences, not the Unicode-specific newline characters
		  (overrides the compilation option).

		bsr_unicode:
		  Specifies that \R is to match all the Unicode newline
		  characters (including crlf etc., the default; overrides the
		  compilation option).

		{capture, ValueSpec}/{capture, ValueSpec, Type}:
		  Specifies which captured substrings are returned and in what
		  format. By default, re:run/3 captures	all  of	 the  matching
		  part	of  the	substring as well as all capturing subpatterns
		  (all of the pattern is automatically captured). The  default
		  return type is (zero-based) indexes of the captured parts of
		  the string, given as {Offset,Length} pairs (the  index  Type
		  of capturing).

		  As an example of the default behavior, matching the subject
		  "ABCabcdABC" against the regular expression "abcd" returns,
		  as the first and only captured string, the matching part of
		  the subject ("abcd" in the middle) as an index pair {3,4},
		  where character positions are zero-based, just as in
		  offsets. The return value of that call is {match,[{3,4}]}.

		  Another (and quite common) case is where the regular expres-
		  sion matches all of the subject, as with ".*abcd.*" and the
		  same subject. The return value then points out all of the
		  string, beginning at index 0 and being 10 characters long:
		  {match,[{0,10}]}.

		  If the regular expression contains capturing subpatterns,
		  as in ".*(abcd).*", all of the matched subject is captured,
		  as well as the captured substrings, giving
		  {match,[{0,10},{3,4}]}. The complete matching pattern always
		  gives the first return value in the list, and the rest of
		  the subpatterns are added in the order they occurred in the
		  regular expression.
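The three default-capture cases above can be sketched as shell expressions. The subject "ABCabcdABC" and the patterns "abcd", ".*abcd.*" and ".*(abcd).*" are assumed to be the ones the description refers to:

```erlang
Subject = "ABCabcdABC".

%% Only the matching part, as an {Offset,Length} pair.
{match, [{3,4}]}         = re:run(Subject, "abcd", []).

%% The pattern covers the whole subject.
{match, [{0,10}]}        = re:run(Subject, ".*abcd.*", []).

%% Whole match first, then the capturing subpattern.
{match, [{0,10}, {3,4}]} = re:run(Subject, ".*(abcd).*", []).
```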

		  The capture tuple is built up	as follows:

		  ValueSpec:
		    Specifies which captured (sub)patterns are to be returned.
		    The ValueSpec can either be an atom describing a prede-
		    fined set of return values, or a list containing either
		    the indexes or the names of specific subpatterns to
		    return.

		    The	predefined sets	of subpatterns are:

		    all:
		      All captured subpatterns including the complete matching
		      string. This is the default.

		    all_names:
		      All named subpatterns in the regular expression, as if a
		      list() of all the names in alphabetical order was given.
		      The list of all names can also be retrieved with the
		      inspect/2 function.

		    first:
		      Only the first captured subpattern, which is always the
		      complete matching part of the subject. All explicitly
		      captured subpatterns are discarded.

		    all_but_first:
		      All but the first matching subpattern, i.e. all explic-
		      itly captured subpatterns, but not the complete matching
		      part of the subject string. This is useful if the regu-
		      lar expression as a whole matches a large part of the
		      subject, but the part you're interested in is in an
		      explicitly captured subpattern. If the return type is
		      list or binary, not returning subpatterns you're not
		      interested in is a good way to optimize.

		    none:
		      Do not return matching subpatterns at all, yielding the
		      single atom match as the return value of the function
		      when matching successfully instead of the {match,
		      list()} return. Specifying an empty list gives the same
		      behavior.

		    The value list is a list of indexes for the subpatterns to
		    return, where index 0 is for all of the pattern, and 1 is
		    for the first explicit capturing subpattern in the regular
		    expression, and so forth. When using named captured sub-
		    patterns (see below) in the regular expression, one can
		    use atom()s or string()s to specify the subpatterns to be
		    returned. For example, consider the regular expression
		    ".*(abcd).*" matched against the string "ABCabcdABC",
		    capturing only the "abcd" part (the first explicit
		    subpattern) with the option {capture,[1]}. The call yields
		    {match,[{3,4}]}, as the first explicitly captured
		    subpattern is "(abcd)", matching "abcd" in the subject, at
		    (zero-based) position 3, of length 4.

		    Now consider the same regular expression, but with the
		    subpattern explicitly named 'FOO': ".*(?<FOO>abcd).*".
		    With this expression, we could still give the index of the
		    subpattern, {capture,[1]}, giving the same result as
		    before. But, since the subpattern is named, we can also
		    specify its name in the value list, {capture,['FOO']},
		    which would yield the same result as the earlier examples,
		    namely {match,[{3,4}]}.

		    The values list might specify indexes or names not present
		    in the regular expression, in which case the return values
		    vary depending on the type. If the type is index, the
		    tuple {-1,0} is returned for values having no correspond-
		    ing subpattern in the regexp, but for the other types
		    (binary and list), the values are the empty binary or list
		    respectively.

		  Type:
		    Optionally specifies how captured substrings are to be
		    returned. If omitted, the default of index	is  used.  The
		    Type can be	one of the following:

		    index:
		      Return captured substrings as pairs of byte indexes into
		      the subject string and length of the matching string in
		      the subject (as if the subject string was flattened with
		      iolist_to_binary/1 or unicode:characters_to_binary/2
		      prior to matching). Note that the unicode option results
		      in byte-oriented indexes in a (possibly virtual) UTF-8
		      encoded binary. A byte index tuple {0,2} might therefore
		      represent one or two characters when unicode is in
		      effect. This might seem counter-intuitive, but has been
		      deemed the most effective and useful way to do it. To
		      return lists instead might result in simpler code if
		      that is desired. This return type is the default.

		    list:
		      Return matching substrings as lists of characters
		      (Erlang string()s). If the unicode option is used in
		      combination with the \C sequence in the regular  expres-
		      sion,  a	captured subpattern can	contain	bytes that are
		      not valid	UTF-8 (\C matches bytes	regardless of  charac-
		      ter  encoding).  In  that	 case  the  list capturing may
		      result in	the same types of tuples that  unicode:charac-
		      ters_to_list/2  can return, namely three-tuples with the
		      tag incomplete  or  error,  the  successfully  converted
		      characters  and the invalid UTF-8	tail of	the conversion
		      as a binary. The best strategy is	to avoid using the  \C
		      sequence when capturing lists.

                    binary :
                      Return matching substrings as binaries. If the unicode
                      option is used, these binaries are in UTF-8. If the \C
                      sequence is used together with unicode, the binaries may
                      be invalid UTF-8.
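                  As a minimal sketch of the three return types under the
                  unicode option, the hypothetical module below (the name
                  re_capture_types_demo is illustrative, not part of OTP)
                  matches one character whose UTF-8 encoding is two bytes
                  long. It assumes the {capture, all, Type} form of the
                  capture option.

```erlang
-module(re_capture_types_demo).
-export([as_index/0, as_list/0, as_binary/0]).

%% The subject is the single character U+0416, given as a Unicode charlist.
%% Encoded as UTF-8 it occupies two bytes (208, 150).
subject() -> [16#416].

%% index (the default): byte offset and byte length in the UTF-8 encoding,
%% so one character is reported as {0,2}.
as_index() -> re:run(subject(), ".", [unicode]).

%% list: the capture comes back as a list of code points.
as_list() -> re:run(subject(), ".", [unicode, {capture, all, list}]).

%% binary: the capture comes back as a UTF-8 encoded binary.
as_binary() -> re:run(subject(), ".", [unicode, {capture, all, binary}]).
```

                  The index form is the only one that exposes the byte
                  length, which is why a single character shows up with
                  length 2 here.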

		  In general, subpatterns that were not	assigned  a  value  in
		  the  match  are  returned  as	 the tuple {-1,0} when type is
		  index. Unassigned subpatterns	 are  returned	as  the	 empty
		  binary  or  list, respectively, for other return types. Con-
		  sider	the regular expression:


                  There are three explicitly capturing subpatterns, where the
                  opening parenthesis position determines the order in the
                  result, hence ((?<FOO>abdd)|a(..d)) is subpattern index 1,
                  (?<FOO>abdd) is subpattern index 2 and (..d) is subpattern
                  index 3. When matched against the following string:


                  the subpattern at index 2 won't match, as "abdd" is not
                  present in the string, but the complete pattern matches (due
                  to the alternative a(..d)). The subpattern at index 2 is
                  therefore unassigned and the default return value will be:


		  Setting the capture Type to binary would give	the following:


                  where the empty binary (<<>>) represents the unassigned
                  subpattern. In the binary case, some information about the
                  matching is therefore lost: the <<>> might just as well be
                  a captured empty string.

                  If differentiation between empty matches and non-existing
                  subpatterns is necessary, use the type index and do the
                  conversion to the final type in Erlang code.
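                  The unassigned-subpattern behaviour described above can be
                  sketched as follows. The full pattern (the three
                  subpatterns wrapped in a leading and trailing .*) and the
                  subject string are illustrative assumptions chosen to be
                  consistent with the description, and the module name is
                  hypothetical.

```erlang
-module(re_unassigned_demo).
-export([by_index/0, by_binary/0]).

%% Assumed pattern: subpattern 1 is the whole alternation, subpattern 2 is the
%% named group FOO, subpattern 3 is (..d).
pattern() -> ".*((?<FOO>abdd)|a(..d)).*".

%% Assumed subject: contains "abcd" but not "abdd", so FOO stays unassigned.
subject() -> "ABCabcdABC".

%% Default type (index): the unassigned subpattern is reported as {-1,0}.
by_index() -> re:run(subject(), pattern()).

%% Type binary: the unassigned subpattern is reported as <<>>, which cannot be
%% told apart from a subpattern that matched the empty string.
by_binary() -> re:run(subject(), pattern(), [{capture, all, binary}]).
```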

		  When	the  option global is given, the capture specification
		  affects each match separately, so that:


		  gives	the result:


	      The options solely affecting the compilation step	are  described
	      in the re:compile/2 function.
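              A minimal sketch of how a capture specification combines with
              the global option, as described above: each match contributes
              its own list of captured values. The subject, pattern and
              module name are illustrative assumptions.

```erlang
-module(re_global_capture_demo).
-export([first_group_per_match/0]).

%% "c(a|b)" matches twice in "cacb". The capture specification [1] is applied
%% to each match separately, so the result holds one single-element list per
%% match.
first_group_per_match() ->
    re:run("cacb", "c(a|b)", [global, {capture, [1], list}]).
```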

       replace(Subject,	RE, Replacement) -> iodata() | unicode:charlist()


		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata()
		 Replacement = iodata()	| unicode:charlist()

	      The same as replace(Subject,RE,Replacement,[]).

       replace(Subject,	RE, Replacement, Options) ->
		  iodata() | unicode:charlist()


		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata() |	unicode:charlist()
		 Replacement = iodata()	| unicode:charlist()
		 Options = [Option]
		 Option	= anchored
			| global
			| notbol
			| noteol
			| notempty
			| notempty_atstart
			| {offset, integer() >=	0}
			| {newline, NLSpec}
			| bsr_anycrlf
			| {match_limit,	integer() >= 0}
			| {match_limit_recursion, integer() >= 0}
			| bsr_unicode
			| {return, ReturnType}
			| CompileOpt
		 ReturnType = iodata | list | binary
		 CompileOpt = compile_option()
		 NLSpec	= cr | crlf | lf | anycrlf | any

	      Replaces	the  matched  part of the Subject string with the con-
	      tents of Replacement.

	      The permissible options are the same  as	for  re:run/3,	except
	      that  the	 capture  option  is  not  allowed. Instead a {return,
	      ReturnType} is present. The default return type is iodata,  con-
	      structed	in a way to minimize copying. The iodata result	can be
	      used directly in	many  I/O-operations.  If  a  flat  list()  is
	      desired,	specify	 {return,  list} and if	a binary is preferred,
	      specify {return, binary}.

	      As in the	re:run/3 function, an mp() compiled with  the  unicode
	      option  requires the Subject to be a Unicode charlist(). If com-
	      pilation is done implicitly and the unicode  compilation	option
	      is  given	 to this function, both	the regular expression and the
	      Subject should be	given as valid Unicode charlist()s.

              The replacement string can contain the special character &,
              which inserts the whole matching expression in the result, and
              the special sequence \N (where N is an integer > 0), \gN or
              \g{N}, which inserts subexpression number N in the result. If
              no subexpression with that number is generated by the regular
              expression, nothing is inserted.

              To insert an & or \ in the result, precede it with a \. Note
              that Erlang already gives a special meaning to \ in literal
              strings, so a single \ has to be written as "\\" and therefore a
              double \ as "\\\\". Example:
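              A minimal sketch of the three replacement forms, using
              illustrative inputs and a hypothetical module name:

```erlang
-module(re_replace_demo).
-export([whole_match/0, literal_ampersand/0, swap_groups/0]).

%% & inserts the whole matched text, so "c" becomes "[c]".
whole_match() ->
    re:replace("abcd", "c", "[&]", [{return, list}]).

%% \& (written "\\&" in an Erlang literal) inserts a literal ampersand.
literal_ampersand() ->
    re:replace("abcd", "c", "[\\&]", [{return, list}]).

%% \1 and \2 insert the first and second subexpression; here they are swapped.
swap_groups() ->
    re:replace("abcd", "(b)(c)", "\\2\\1", [{return, list}]).
```

              The {return, list} option is used only to make the results
              easy to read; with the default iodata return type the same
              content comes back as nested iodata.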








              As with re:run/3, compilation errors raise the badarg exception;
              re:compile/2 can be used to get more information about the
              error.
       split(Subject, RE) -> SplitList


		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata()
		 SplitList = [iodata() | unicode:charlist()]

	      The same as split(Subject,RE,[]).

       split(Subject, RE, Options) -> SplitList


		 Subject = iodata() | unicode:charlist()
		 RE = mp() | iodata() |	unicode:charlist()
		 Options = [Option]
		 Option	= anchored
			| notbol
			| noteol
			| notempty
			| notempty_atstart
			| {offset, integer() >=	0}
			| {newline, nl_spec()}
			| {match_limit,	integer() >= 0}
			| {match_limit_recursion, integer() >= 0}
			| bsr_anycrlf
			| bsr_unicode
			| {return, ReturnType}
			| {parts, NumParts}
			| group
			| trim
			| CompileOpt
		 NumParts = integer() >= 0 | infinity
		 ReturnType = iodata | list | binary
		 CompileOpt = compile_option()
		   See compile/2 above.
		 SplitList = [RetData] | [GroupedRetData]
		 GroupedRetData	= [RetData]
		 RetData = iodata() | unicode:charlist() | binary() | list()

	      This function splits the input  into  parts  by  finding	tokens
	      according	to the regular expression supplied.

	      The splitting is done basically by running a global regexp match
	      and dividing the initial string wherever	a  match  occurs.  The
	      matching part of the string is removed from the output.

	      As  in  the re:run/3 function, an	mp() compiled with the unicode
	      option requires the Subject to be	a Unicode charlist(). If  com-
	      pilation	is  done implicitly and	the unicode compilation	option
	      is given to this function, both the regular expression  and  the
	      Subject should be	given as valid Unicode charlist()s.

              The result is given as a list of "strings", in the datatype
              selected by the return option (default iodata).

	      If subexpressions	are  given  in	the  regular  expression,  the
	      matching	subexpressions	are  returned in the resulting list as
	      well. An example:


	      will yield the result:




	      will yield


	      The text matching	the subexpression (marked by  the  parentheses
	      in  the  regexp)	is  inserted  in  the result list where	it was
	      found. In	effect this means that concatenating the result	 of  a
	      split  where  the	 whole regexp is a single subexpression	(as in
	      the example above) will always result in the original string.
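              The effect of a capturing subexpression on the result list can
              be sketched as follows. The subject "Erlang" is the one the
              surrounding text refers to; the patterns and the module name
              are illustrative assumptions.

```erlang
-module(re_split_demo).
-export([plain/0, with_subexpression/0, rejoined/0]).

%% Without a subexpression the matched separators are dropped.
plain() ->
    re:split("Erlang", "[ln]", [{return, list}]).

%% With the whole regexp as one subexpression, each matched separator is
%% inserted in the result where it was found.
with_subexpression() ->
    re:split("Erlang", "([ln])", [{return, list}]).

%% Concatenating that result gives back the original string.
rejoined() ->
    lists:append(with_subexpression()).
```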

	      As there is no matching subexpression for	the last part  in  the
	      example (the "g"), there is nothing inserted after that. To make
	      the group	of strings and the parts matching  the	subexpressions
	      more  obvious,  one  might  use  the  group option, which	groups
	      together the part	of the subject string with the parts  matching
	      the subexpressions when the string was split:




	      Here  the	regular	expression matched first the "l", causing "Er"
	      to be the	first part in the result. When the regular  expression
	      matched,	the  (only) subexpression was bound to the "l",	so the
	      "l" is inserted in the group together with "Er". The next	 match
	      is  of  the  "n",	making "a" the next part to be returned. Since
	      the subexpression	is bound to the	substring "n"  in  this	 case,
	      the  "n" is inserted into	this group. The	last group consists of
	      the rest of the string, as no more matches are found.
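              The grouping walked through above can be sketched as follows,
              assuming the subject "Erlang" and the pattern "([ln])"; the
              module name is hypothetical.

```erlang
-module(re_split_group_demo).
-export([grouped/0]).

%% Each sublist starts with a part of the subject and is followed by the text
%% bound to the subexpression for the match that ended that part. The last
%% part has no following match, so its sublist holds only the part itself.
grouped() ->
    re:split("Erlang", "([ln])", [{return, list}, group]).
```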

	      By default,  all	parts  of  the	string,	 including  the	 empty
	      strings, are returned from the function. For example:


	      will return:


	      since the	matching of the	"g" in the end of the string leaves an
	      empty rest which is also returned. This behaviour	 differs  from
	      the default behaviour of the split function in Perl, where empty
	      strings at the end are by	default	removed. To get	the "trimming"
	      default behavior of Perl,	specify	trim as	an option:


	      The result will be:


              The "trim" option in effect says: "give me as many parts as
              possible except the empty ones", which might be useful in some
              circumstances. You can also specify how many parts you want, by
              specifying {parts,N}:


	      This will	give:


              Note that the last part is "ang", not "an", as we only specified
              splitting into two parts, and the splitting stops when enough
              parts are given, which is why the result differs from that of
              trim.
	      More than	three parts are	not possible with this indata, so


	      will  give the same result as the	default, which is to be	viewed
	      as "an infinite number of	parts".

	      Specifying 0 as the number of parts gives	the same effect	as the
	      option trim. If subexpressions are captured, empty subexpression
	      matches at the end are also stripped from	the result if trim  or
	      {parts,0}	is specified.
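              The default, trim and {parts,N} behaviours discussed above can
              be compared side by side. This is a sketch assuming the subject
              "Erlang" and the pattern "[lg]", which leaves an empty part at
              the end; the module name is hypothetical.

```erlang
-module(re_split_parts_demo).
-export([default/0, trimmed/0, parts/1]).

%% Default: the empty part after the final "g" is kept.
default() ->
    re:split("Erlang", "[lg]", [{return, list}]).

%% trim: empty parts at the end are dropped.
trimmed() ->
    re:split("Erlang", "[lg]", [{return, list}, trim]).

%% {parts,N}: N is a non-negative integer or the atom infinity.
parts(N) ->
    re:split("Erlang", "[lg]", [{return, list}, {parts, N}]).
```

              Calling parts(2) stops splitting after the first match, while
              parts(4) asks for more parts than the subject can give and so
              behaves like the default.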

	      If  you  are  familiar with Perl,	the trim behaviour corresponds
	      exactly to the Perl default, the {parts,N} where N is a positive
	      integer  corresponds  exactly to the Perl	behaviour with a posi-
	      tive numerical third parameter  and  the	default	 behaviour  of
	      re:split/3  corresponds to that when the Perl routine is given a
	      negative integer as the third parameter.

              Summary of options not previously described for the re:run/3
              function:

                {return,ReturnType} :
                  Specifies how the parts of the original string are presented
                  in the result list. The possible types are:

                    iodata :
                      The variant of iodata() that gives the least copying of
                      data with the current implementation (often a binary,
                      but don't depend on it).

                    binary :
                      All parts returned as binaries.

                    list :
                      All parts returned as lists of characters ("strings").

                group :
                  Groups together the part of the string with the parts of the
                  string matching the subexpressions of the regexp.

                  The return value from the function will in this case be a
                  list() of list()s. Each sublist begins with the string
                  picked out of the subject string, followed by the parts
                  matching each of the subexpressions in order of occurrence
                  in the regular expression.

                {parts,N} :
                  Specifies the number of parts the subject string is to be
                  split into.

                  The number of parts should be a positive integer for a
                  specific maximum on the number of parts and infinity for the
                  maximum number of parts possible (the default). Specifying
                  {parts,0} gives as many parts as possible disregarding empty
                  parts at the end, the same as specifying trim.

                trim :
                  Specifies that empty parts at the end of the result list are
                  to be disregarded. The same as specifying {parts,0}. This
                  corresponds to the default behaviour of the split built-in
                  function in Perl.

       The  following  sections	 contain  reference  material  for the regular
       expressions used	by this	module.	The regular  expression	 reference  is
       based  on  the  PCRE  documentation, with changes in cases where	the re
       module behaves differently to the PCRE library.

       The syntax and semantics	of the regular expressions that	are  supported
       by  PCRE	 are described in detail below.	Perl's regular expressions are
       described in its	own documentation, and regular expressions in  general
       are  covered in a number	of books, some of which	have copious examples.
       Jeffrey	Friedl's  "Mastering  Regular	Expressions",	published   by
       O'Reilly,  covers regular expressions in	great detail. This description
       of PCRE's regular expressions is	intended as reference material.

       The reference material is divided into the following sections:

	 * Special start-of-pattern items

	 * Characters and metacharacters

	 * Backslash

	 * Circumflex and dollar

	 * Full	stop (period, dot) and \N

	 * Matching a single data unit

	 * Square brackets and character classes

	 * POSIX character classes

	 * Vertical bar

	 * Internal option setting

	 * Subpatterns

	 * Duplicate subpattern	numbers

	 * Named subpatterns

	 * Repetition

	 * Atomic grouping and possessive quantifiers

	 * Back	references

	 * Assertions

	 * Conditional subpatterns

	 * Comments

	 * Recursive patterns

	 * Subpatterns as subroutines

	 * Oniguruma subroutine	syntax

	 * Backtracking	control

       A number	of options that	can be passed to re:compile/2 can also be  set
       by special items	at the start of	a pattern. These are not Perl-compati-
       ble, but	are provided to	make these options accessible to pattern writ-
       ers  who	are not	able to	change the program that	processes the pattern.
       Any number of these items may appear, but they  must  all  be  together
       right  at  the  start of	the pattern string, and	the letters must be in
       upper case.

       UTF support

       Unicode support is basically UTF-8 based. To use Unicode characters,
       you either call re:compile/2 or re:run/3 with the unicode option, or
       the pattern must start with one of these special sequences:

         (*UTF8)
         (*UTF)

       Both sequences give the same effect: the input string is interpreted as
       UTF-8. Note that with these instructions, the automatic conversion of
       lists to UTF-8 is not performed by the re functions, which is why using
       these sequences is not recommended. Add the unicode option when running
       re:compile/2 instead.

       Some applications that allow their users	to supply patterns may wish to
       restrict	 them  to  non-UTF data	for security reasons. If the never_utf
       option is set at	compile	time, (*UTF) etc. are not allowed,  and	 their
       appearance causes an error.

       Unicode property	support

       Another special sequence that may appear at the start of a pattern is

         (*UCP)

       This has	the same effect	as setting the ucp option: it causes sequences
       such as \d and \w to use	 Unicode  properties  to  determine  character
       types,  instead of recognizing only characters with codes less than 256
       via a lookup table.

       Disabling start-up optimizations

       If a pattern starts with (*NO_START_OPT), it has the same effect as
       setting the no_start_optimize option at compile time.

       Newline conventions

       PCRE supports five different conventions for indicating line breaks in
       strings: a single CR (carriage return) character, a single LF
       (linefeed) character, the two-character sequence CRLF, any of the three
       preceding, or any Unicode newline sequence.

       It is also possible to specify a newline convention by starting a
       pattern string with one of the following five sequences:

         (*CR) :
           carriage return

         (*LF) :
           linefeed

         (*CRLF) :
           carriage return, followed by linefeed

         (*ANYCRLF) :
           any of the three above

         (*ANY) :
           all Unicode newline sequences

       These override the default and the options given to re:compile/2. For
       example, the pattern:

         (*CR)a.b

       changes the convention to CR. That pattern matches "a\nb" because LF is
       no longer a newline. If more than one of them is present, the last one
       is used.
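       As a minimal sketch of how the newline convention changes what the dot
       matches, the hypothetical module below runs the same subject with the
       default convention, with the in-pattern (*CR) item, and with the
       equivalent {newline, cr} option. It assumes the default convention is
       LF, so that the dot does not match "\n" unless the convention is
       changed.

```erlang
-module(re_newline_demo).
-export([default_convention/0, cr_in_pattern/0, cr_as_option/0]).

%% The subject is "a", a linefeed, "b".
subject() -> "a\nb".

%% Assuming LF is the default newline: the dot does not match the linefeed.
default_convention() -> re:run(subject(), "a.b").

%% (*CR) makes CR the only newline, so the dot may match the linefeed.
cr_in_pattern() -> re:run(subject(), "(*CR)a.b").

%% The same convention given as an option instead of in the pattern.
cr_as_option() -> re:run(subject(), "a.b", [{newline, cr}]).
```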

       The newline convention affects where the	circumflex and	dollar	asser-
       tions are true. It also affects the interpretation of the dot metachar-
       acter when dotall is not	set, and the behaviour of \N. However, it does
       not affect what the \R escape sequence matches. By default, this	is any
       Unicode newline sequence, for Perl compatibility. However, this can  be
       changed;	 see  the  description	of \R in the section entitled "Newline
       sequences" below. A change of \R	setting	can be combined	with a	change
       of newline convention.

       Setting match and recursion limits

       The  caller  of	re:run/3  can  set  a limit on the number of times the
       internal	match()	function is called and on the maximum depth of	recur-
       sive calls. These facilities are	provided to catch runaway matches that
       are provoked by patterns	with huge matching trees (a typical example is
       a  pattern  with	 nested	unlimited repeats) and to avoid	running	out of
       system stack by too  much  recursion.  When  one	 of  these  limits  is
       reached,	 pcre_exec() gives an error return. The	limits can also	be set
       by items at the start of the pattern of the form

         (*LIMIT_MATCH=d)
         (*LIMIT_RECURSION=d)

       where d is any number of	decimal	digits.	However, the value of the set-
       ting  must  be less than	the value set by the caller of re:run/3	for it
       to have any effect. In other words, the pattern writer  can  lower  the
       limit  set  by  the programmer, but not raise it. If there is more than
       one setting of one of these limits, the lower value is used.

       The current default value for both limits is 10000000 in the Erlang
       VM. Note that the recursion limit does not actually affect the stack
       depth of the VM, as PCRE for Erlang is compiled in such a way that the
       match function never does recursion on the "C-stack".
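       A rough sketch of a caller-side match limit follows. It assumes that
       the report_errors option of re:run/3 is available in the installed OTP
       release, so that hitting the limit is reported as an error tuple
       instead of nomatch. The pattern, the subject and the module name are
       illustrative.

```erlang
-module(re_limit_demo).
-export([limited/0]).

%% "(a+)*zz" contains nested unlimited repeats. The subject has many a's but
%% only one z, so a match attempt has a very large tree to explore before it
%% can fail. A small match_limit cuts that search short.
limited() ->
    Subject = lists:duplicate(30, $a) ++ "z",
    %% report_errors (assumed available) turns the limit hit into
    %% {error, match_limit} rather than a plain nomatch.
    re:run(Subject, "(a+)*zz", [{match_limit, 100}, report_errors]).
```

       A limit this low is only for demonstration; a realistic value depends
       on the patterns and subjects the application expects.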

       A  regular  expression  is  a pattern that is matched against a subject
       string from left	to right. Most characters stand	for  themselves	 in  a
       pattern,	 and  match  the corresponding characters in the subject. As a
       trivial example,	the pattern

       The quick brown fox

       matches a portion of a subject string that is identical to itself. When
       caseless	 matching  is  specified  (the	caseless  option), letters are
       matched independently of	case.
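       A minimal sketch of literal matching with and without the caseless
       option; the subject string and module name are illustrative.

```erlang
-module(re_caseless_demo).
-export([exact/0, caseless/0]).

%% Same letters as the pattern, but in a different case.
subject() -> "the QUICK brown FOX".

%% Characters stand for themselves, so a case difference prevents a match.
exact() -> re:run(subject(), "The quick brown fox").

%% With caseless, letters match independently of case.
caseless() -> re:run(subject(), "The quick brown fox", [caseless]).
```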

       The power of regular expressions	comes  from  the  ability  to  include
       alternatives  and  repetitions in the pattern. These are	encoded	in the
       pattern by the use of metacharacters, which do not stand	for themselves
       but instead are interpreted in some special way.

       There  are  two different sets of metacharacters: those that are	recog-
       nized anywhere in the pattern except within square brackets, and	 those
       that  are  recognized  within square brackets. Outside square brackets,
       the metacharacters are as follows:

         \ :
           general escape character with several uses

         ^ :
           assert start of string (or line, in multiline mode)

         $ :
           assert end of string (or line, in multiline mode)

         . :
           match any character except newline (by default)

         [ :
           start character class definition

         | :
           start of alternative branch

         ( :
           start subpattern

         ) :
           end subpattern

         ? :
           extends the meaning of (, also 0 or 1 quantifier, also quantifier
           minimizer

         * :
           0 or more quantifier

         + :
           1 or more quantifier, also "possessive quantifier"

         { :
           start min/max quantifier

       Part  of	 a  pattern  that is in	square brackets	is called a "character
       class". In a character class the	only metacharacters are:

         \ :
           general escape character

         ^ :
           negate the class, but only if the first character

         - :
           indicates character range

         [ :
           POSIX character class (only if followed by POSIX syntax)

         ] :
           terminates the character class

       The following sections describe the use of each of the  metacharacters.

       The backslash character has several uses. Firstly, if it	is followed by
       a character that	is not a number	or a letter, it	takes away any special
       meaning	that  character	 may  have. This use of	backslash as an	escape
       character applies both inside and outside character classes.

       For example, if you want	to match a * character,	you write  \*  in  the
       pattern.	 This  escaping	 action	 applies  whether or not the following
       character would otherwise be interpreted	as a metacharacter, so	it  is
       always  safe  to	 precede  a non-alphanumeric with backslash to specify
       that it stands for itself. In particular, if you	want to	match a	 back-
       slash, you write	\\.
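       Because Erlang string literals also use the backslash as an escape
       character, each backslash in a pattern is written twice in source
       code. A minimal sketch with illustrative subjects and a hypothetical
       module name:

```erlang
-module(re_escape_demo).
-export([literal_star/0, literal_backslash/0]).

%% The pattern \* is written "\\*" in an Erlang literal and matches the
%% asterisk at offset 1 of "2*3".
literal_star() -> re:run("2*3", "\\*").

%% The pattern \\ is written "\\\\" in an Erlang literal. The subject "a\\b"
%% is the three characters a, backslash, b.
literal_backslash() -> re:run("a\\b", "\\\\").
```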

       In  unicode mode, only ASCII numbers and	letters	have any special mean-
       ing after a backslash. All other	characters (in particular, those whose
       codepoints are greater than 127)	are treated as literals.

       If  a  pattern is compiled with the extended option, white space	in the
       pattern (other than in a	character class) and characters	 between  a  #
       outside a character class and the next newline are ignored. An escaping
       backslash can be	used to	include	a white	space or # character  as  part
       of the pattern.
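       A minimal sketch of the extended option, showing ignored white space,
       a comment introduced by #, and an escaped space that stays part of the
       pattern. Inputs and module name are illustrative.

```erlang
-module(re_extended_demo).
-export([without_extended/0, with_extended/0, escaped_space/0]).

%% Without extended, the spaces in the pattern must appear in the subject.
without_extended() -> re:run("abc", "a b c").

%% With extended, white space and the # comment are ignored.
with_extended() -> re:run("abc", "a b c # trailing comment", [extended]).

%% An escaping backslash keeps the space as part of the pattern.
escaped_space() -> re:run("a b", "a\\ b", [extended]).
```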

       If  you	want  to remove	the special meaning from a sequence of charac-
       ters, you can do	so by putting them between \Q and \E. This is  differ-
       ent  from  Perl	in  that  $  and  @ are	handled	as literals in \Q...\E
       sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-
       tion. Note the following	examples:

	 Pattern	   PCRE	matches	  Perl matches

	 \Qabc$xyz\E	   abc$xyz	  abc followed by the contents of $xyz
	 \Qabc\$xyz\E	   abc\$xyz	  abc\$xyz
	 \Qabc\E\$\Qxyz\E  abc$xyz	  abc$xyz

       The  \Q...\E  sequence  is recognized both inside and outside character
       classes.	An isolated \E that is not preceded by \Q is ignored. If \Q is
       not  followed  by  \E  later in the pattern, the	literal	interpretation
       continues to the	end of the pattern (that is,  \E  is  assumed  at  the
       end).  If  the  isolated	\Q is inside a character class,	this causes an
       error, because the character class is not terminated.
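       The three rows of the table above can be checked with a short sketch.
       The subjects are the strings from the "PCRE matches" column; the
       module name is hypothetical.

```erlang
-module(re_quote_demo).
-export([row1/0, row2/0, row3/0]).

%% \Qabc$xyz\E : the $ is literal inside \Q...\E.
row1() -> re:run("abc$xyz", "\\Qabc$xyz\\E").

%% \Qabc\$xyz\E : the backslash is literal too, so the subject needs one.
row2() -> re:run("abc\\$xyz", "\\Qabc\\$xyz\\E").

%% \Qabc\E\$\Qxyz\E : the $ is escaped outside the quoted parts.
row3() -> re:run("abc$xyz", "\\Qabc\\E\\$\\Qxyz\\E").
```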

       Non-printing characters

       A second	use of backslash provides a way	of encoding non-printing char-
       acters  in patterns in a	visible	manner.	There is no restriction	on the
       appearance of non-printing characters, apart from the binary zero  that
       terminates  a  pattern,	but  when  a pattern is	being prepared by text
       editing,	it is  often  easier  to  use  one  of	the  following	escape
       sequences than the binary character it represents:

         \a :
           alarm, that is, the BEL character (hex 07)

         \cx :
           "control-x", where x is any ASCII character

         \e :
           escape (hex 1B)

         \f :
           form feed (hex 0C)

         \n :
           linefeed (hex 0A)

         \r :
           carriage return (hex 0D)

         \t :
           tab (hex 09)

         \ddd :
           character with octal code ddd, or back reference

         \xhh :
           character with hex code hh

         \x{hhh..} :
           character with hex code hhh..

       The  precise effect of \cx on ASCII characters is as follows: if	x is a
       lower case letter, it is	converted to upper case. Then  bit  6  of  the
       character (hex 40) is inverted. Thus \cA	to \cZ become hex 01 to	hex 1A
       (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is	7B), and  \c;  becomes
       hex  7B (; is 3B). If the data item (byte or 16-bit value) following \c
       has a value greater than	127, a compile-time error occurs.  This	 locks
       out non-ASCII characters	in all modes.

       The  \c	facility  was designed for use with ASCII characters, but with
       the extension to	Unicode	it is even less	useful than it once was.

       By default, after \x, from zero to  two	hexadecimal  digits  are  read
       (letters	can be in upper	or lower case).	Any number of hexadecimal dig-
       its may appear between \x{ and }, but the character code	is constrained
       as follows:

	 8-bit non-Unicode mode:
	   less	than 0x100

	 8-bit UTF-8 mode:
	   less	than 0x10ffff and a valid codepoint

       Invalid	Unicode	 codepoints  are  the  range 0xd800 to 0xdfff (the so-
       called "surrogate" codepoints), and 0xffef.

       If characters other than hexadecimal digits appear between \x{ and },
       or if there is no terminating }, this form of escape is not recognized.
       Instead, the initial \x will be interpreted as a basic hexadecimal
       escape, with no following digits, giving a character whose value is
       zero.
       Characters whose	value is less than 256 can be defined by either	of the
       two  syntaxes  for  \x. There is	no difference in the way they are han-
       dled. For example, \xdc is exactly the same as \x{dc}.
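       A minimal sketch of the two \x syntaxes and of the codepoint
       constraint in non-Unicode mode. Inputs and module name are
       illustrative.

```erlang
-module(re_hex_demo).
-export([two_digit/0, braced/0, braced_unicode/0, too_large/0]).

%% \x41 and \x{41} both denote the character "A".
two_digit() -> re:run("A", "\\x41").
braced() -> re:run("A", "\\x{41}").

%% With unicode, \x{416} matches U+0416, which is two bytes long in UTF-8.
braced_unicode() -> re:run([16#416], "\\x{416}", [unicode]).

%% Without unicode the value must be less than 0x100, so compilation fails.
too_large() -> re:compile("\\x{416}").
```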

       After \0	up to two further octal	digits are read. If  there  are	 fewer
       than  two  digits,  just	 those	that  are  present  are	used. Thus the
       sequence	\0\x\07	specifies two binary zeros followed by a BEL character
       (code  value 7).	Make sure you supply two digits	after the initial zero
       if the pattern character	that follows is	itself an octal	digit.

       The handling of a backslash followed by a digit other than 0 is compli-
       cated.  Outside a character class, PCRE reads it	and any	following dig-
       its as a	decimal	number.	If the number is less than  10,	 or  if	 there
       have been at least that many previous capturing left parentheses	in the
       expression, the entire  sequence	 is  taken  as	a  back	 reference.  A
       description  of how this	works is given later, following	the discussion
       of parenthesized	subpatterns.

       Inside a	character class, or if the decimal number is  greater  than  9
       and  there have not been	that many capturing subpatterns, PCRE re-reads
       up to three octal digits	following the backslash, and uses them to gen-
       erate a data character. Any subsequent digits stand for themselves. The
       value of	the character is constrained in	the  same  way	as  characters
       specified in hexadecimal. For example:

         \040 :
           is another way of writing an ASCII space

         \40 :
           is the same, provided there are fewer than 40 previous capturing
           subpatterns

         \7 :
           is always a back reference

         \11 :
           might be a back reference, or another way of writing a tab

         \011 :
           is always a tab

         \0113 :
           is a tab followed by the character "3"

         \113 :
           might be a back reference, otherwise the character with octal code
           113

         \377 :
           might be a back reference, otherwise the value 255 (decimal)

         \81 :
           is either a back reference, or a binary zero followed by the two
           characters "8" and "1"

       Note that octal values of 100 or	greater	must not be  introduced	 by  a
       leading zero, because no	more than three	octal digits are ever read.
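       The difference between an octal escape and a back reference can be
       sketched as follows. The subjects and the module name are
       illustrative assumptions.

```erlang
-module(re_octal_demo).
-export([octal_space/0, octal_tab/0, ambiguous_tab/0, back_reference/0]).

%% \040 is an ASCII space.
octal_space() -> re:run(" ", "\\040").

%% \011 always denotes a tab because of the leading zero.
octal_tab() -> re:run("\t", "\\011").

%% \11 is read as octal here, as the pattern has fewer than 11 capturing
%% subpatterns.
ambiguous_tab() -> re:run("\t", "\\11").

%% \1 is a back reference to the first capturing subpattern.
back_reference() -> re:run("abab", "(ab)\\1").
```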

       All the sequences that define a single character	value can be used both
       inside and outside character classes. In	addition, inside  a  character
       class, \b is interpreted	as the backspace character (hex	08).

       \N  is not allowed in a character class.	\B, \R,	and \X are not special
       inside a	character class. Like  other  unrecognized  escape  sequences,
       they are	treated	as the literal characters "B", "R", and	"X". Outside a
       character class,	these sequences	have different meanings.

       Unsupported escape sequences

       In Perl,	the sequences \l, \L, \u, and \U are recognized	by its	string
       handler	and used to modify the case of following characters. PCRE does
       not support these escape	sequences.

       Absolute	and relative back references

       The sequence \g followed	by an unsigned or a negative  number,  option-
       ally  enclosed  in braces, is an	absolute or relative back reference. A
       named back reference can	be coded as \g{name}. Back references are dis-
       cussed later, following the discussion of parenthesized subpatterns.

       Absolute	and relative subroutine	calls

       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
       name or a number	enclosed either	in angle brackets or single quotes, is
       an  alternative	syntax for referencing a subpattern as a "subroutine".
       Details are discussed  later.  Note  that  \g{...}  (Perl  syntax)  and
       \g<...>	(Oniguruma  syntax)  are  not synonymous. The former is	a back
       reference; the latter is	a subroutine call.

       Generic character types

       Another use of backslash	is for specifying generic character types:

         \d :
           any decimal digit

         \D :
           any character that is not a decimal digit

         \h :
           any horizontal white space character

         \H :
           any character that is not a horizontal white space character

         \s :
           any white space character

         \S :
           any character that is not a white space character

         \v :
           any vertical white space character

         \V :
           any character that is not a vertical white space character

         \w :
           any "word" character

         \W :
           any "non-word" character

       There is	also the single	sequence \N, which matches a non-newline char-
       acter.  This  is	 the  same as the "." metacharacter when dotall	is not
       set. Perl also uses \N to match characters by name; PCRE	does not  sup-
       port this.

       Each pair of lower and upper case escape sequences partitions the
       complete set of characters into two disjoint sets. Any given character
       matches one, and only one, of each pair. The sequences can appear both
       inside and outside character classes. They each match one character of
       the appropriate type. If the current matching point is at the end of
       the subject string, all of them fail, because there is no character to
       match.
       For  compatibility  with	Perl, \s does not match	the VT character (code
       11). This makes it different from the POSIX "space" class. The \s char-
       acters  are  HT (9), LF (10), FF	(12), CR (13), and space (32). If "use
       locale;"	is included in a Perl script, \s may match the	VT  character.
       In PCRE,	it never does.

       A  "word"  character is an underscore or	any character that is a	letter
       or digit. By default, the definition of	letters	 and  digits  is  con-
       trolled	by  PCRE's  low-valued character tables, in Erlang's case (and
       without the unicode option), the	ISO-Latin-1 character set.

       By default, in unicode mode, characters with values greater  than  255,
       i.e.  all characters outside the	ISO-Latin-1 character set, never match
       \d, \s, or \w, and always match \D, \S, and \W. These sequences	retain
       their  original	meanings from before UTF support was available,	mainly
       for efficiency reasons. However,	if the ucp option is set,  the	behav-
       iour  is	changed	so that	Unicode	properties are used to determine char-
       acter types, as follows:

         \d :
           any character that \p{Nd} matches (decimal digit)

         \s :
           any character that \p{Z} matches, plus HT, LF, FF, CR

         \w :
           any character that \p{L} or \p{N} matches, plus underscore

       The upper case escapes match the inverse sets of characters. Note that
       \d matches only decimal digits, whereas \w matches any Unicode digit,
       as well as any Unicode letter, and underscore. Note also that ucp
       affects \b and \B, because they are defined in terms of \w and \W.
       Matching these sequences is noticeably slower when ucp is set.
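       A minimal sketch of the effect of ucp on \d, using the Arabic-Indic
       digit three (U+0663) as an illustrative subject. The module name is
       hypothetical.

```erlang
-module(re_ucp_demo).
-export([without_ucp/0, with_ucp/0]).

%% U+0663 is a decimal digit (category Nd) with a codepoint above 255.
subject() -> [16#663].

%% In unicode mode without ucp, \d never matches characters above 255.
without_ucp() -> re:run(subject(), "\\d", [unicode]).

%% With ucp, \d uses Unicode properties. The match is two bytes long because
%% indexes refer to the UTF-8 encoding.
with_ucp() -> re:run(subject(), "\\d", [unicode, ucp]).
```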

       The sequences \h, \H, \v, and \V	are features that were added  to  Perl
       at  release  5.10. In contrast to the other sequences, which match only
       ASCII characters	by default, these  always  match  certain  high-valued
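       codepoints, whether or not ucp is set. The horizontal space characters
       are:

         U+0009 :
           Horizontal tab (HT)

         U+0020 :
           Space

         U+00A0 :
           Non-break space

         U+1680 :
           Ogham space mark

         U+180E :
           Mongolian vowel separator

         U+2000 :
           En quad

         U+2001 :
           Em quad

         U+2002 :
           En space

         U+2003 :
           Em space

         U+2004 :
           Three-per-em space

         U+2005 :
           Four-per-em space

         U+2006 :
           Six-per-em space

         U+2007 :
           Figure space

         U+2008 :
           Punctuation space

         U+2009 :
           Thin space

         U+200A :
           Hair space

         U+202F :
           Narrow no-break space

         U+205F :
           Medium mathematical space

         U+3000 :
           Ideographic space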

       The vertical space characters are:

         U+000A :
           Linefeed (LF)

         U+000B :
           Vertical tab (VT)

         U+000C :
           Form feed (FF)

         U+000D :
           Carriage return (CR)

         U+0085 :
           Next line (NEL)

         U+2028 :
           Line separator

         U+2029 :
           Paragraph separator

       In 8-bit, non-UTF-8 mode, only the characters with codepoints less than
       256 are relevant.

       Newline sequences

       Outside a character class, by default, the escape sequence \R matches
       any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the
       following:

         (?>\r\n|\n|\x0b|\f|\r|\x85)

       This is an example of an "atomic group", details of which are given
       below.
       This particular group matches either the	two-character sequence CR fol-
       lowed  by LF, or	one of the single characters LF	(linefeed, U+000A), VT
       (vertical tab, U+000B), FF (form	feed, U+000C),	CR  (carriage  return,
       U+000D),	 or  NEL  (next	 line,	U+0085). The two-character sequence is
       treated as a single unit	that cannot be split.

       In Unicode mode,	two additional characters whose	codepoints are greater
       than 255	are added: LS (line separator, U+2028) and PS (paragraph sepa-
       rator, U+2029). Unicode character property support is  not  needed  for
       these characters	to be recognized.

       It is possible to restrict \R to	match only CR, LF, or CRLF (instead of
       the complete set	 of  Unicode  line  endings)  by  setting  the	option
       bsr_anycrlf either at compile time or when the pattern is matched. (BSR
       is an abbreviation for "backslash R".) This can	be  made  the  default
       when  PCRE  is  built;  if this is the case, the	other behaviour	can be
       requested via the bsr_unicode option. It	is also	 possible  to  specify
       these settings by starting a pattern string with one of the following
       sequences:

           (*BSR_ANYCRLF)  CR, LF, or CRLF only
           (*BSR_UNICODE)  any Unicode newline sequence

       These override the default and the options given	to the compiling func-
       tion, but they can themselves be	 overridden  by	 options  given	 to  a
       matching	 function.  Note  that	these  special settings, which are not
       Perl-compatible,	are recognized only at the very	start  of  a  pattern,
       and  that  they	must  be  in  upper  case. If more than	one of them is
       present,	the last one is	used. They can be combined with	 a  change  of
       newline convention; for example, a pattern can start with:

           (*ANY)(*BSR_ANYCRLF)

       They  can  also	be combined with the (*UTF8), (*UTF) or	(*UCP) special
       sequences. Inside a character class, \R is treated as  an  unrecognized
       escape sequence,	and so matches the letter "R" by default.
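       A short sketch of \R used to split a subject into lines, with and
       without the bsr_anycrlf restriction. The module and function names are
       made up for illustration.

```erlang
-module(newline_demo).
-export([lines/1, lines_anycrlf/1]).

%% Illustrative helpers (not part of the re module).
%% Split on every newline sequence that \R recognizes by default.
%% A CR immediately followed by LF counts as one separator.
lines(Subject) ->
    re:split(Subject, "\\R", [{return, list}]).

%% Same, but with \R restricted to CR, LF, and CRLF.
lines_anycrlf(Subject) ->
    re:split(Subject, "\\R", [bsr_anycrlf, {return, list}]).
```

       With a subject that contains a vertical tab (code 11), lines/1 is
       expected to split at that character, whereas lines_anycrlf/1 should
       return the subject unsplit.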

       Unicode character properties

       Three  additional  escape sequences that	match characters with specific
       properties are available. When in 8-bit non-UTF-8 mode, these sequences
       are  of	course limited to testing characters whose codepoints are less
       than 256, but they do work in this mode. The extra escape sequences
       are:

           \p{xx}  a character with the xx property

           \P{xx}  a character without the xx property

           \X      a Unicode extended grapheme cluster

       The  property  names represented	by xx above are	limited	to the Unicode
       script names, the general category properties, "Any", which matches any
       character   (including  newline),  and  some  special  PCRE  properties
       (described in the next section).	Other Perl properties such as "InMusi-
       calSymbols" are not currently supported by PCRE.	Note that \P{Any} does
       not match any characters, so always causes a match failure.

       Sets of Unicode characters are defined as belonging to certain scripts.
       A  character from one of	these sets can be matched using	a script name.
       For example:

           \p{Greek}
           \P{Han}

       Those that are not part of an identified	script are lumped together  as
       "Common". The current list of scripts is:

	 * Arabic

	 * Armenian

	 * Avestan

	 * Balinese

	 * Bamum

	 * Batak

	 * Bengali

	 * Bopomofo

	 * Braille

	 * Buginese

	 * Buhid

	 * Canadian_Aboriginal

	 * Carian

	 * Chakma

	 * Cham

	 * Cherokee

	 * Common

	 * Coptic

	 * Cuneiform

	 * Cypriot

	 * Cyrillic

	 * Deseret

	 * Devanagari

	 * Egyptian_Hieroglyphs

	 * Ethiopic

	 * Georgian

	 * Glagolitic

	 * Gothic

	 * Greek

	 * Gujarati

	 * Gurmukhi

	 * Han

	 * Hangul

	 * Hanunoo

	 * Hebrew

	 * Hiragana

	 * Imperial_Aramaic

	 * Inherited

	 * Inscriptional_Pahlavi

	 * Inscriptional_Parthian

	 * Javanese

	 * Kaithi

	 * Kannada

	 * Katakana

	 * Kayah_Li

	 * Kharoshthi

	 * Khmer

	 * Lao

	 * Latin

	 * Lepcha

	 * Limbu

	 * Linear_B

	 * Lisu

	 * Lycian

	 * Lydian

	 * Malayalam

	 * Mandaic

	 * Meetei_Mayek

	 * Meroitic_Cursive

	 * Meroitic_Hieroglyphs

	 * Miao

	 * Mongolian

	 * Myanmar

	 * New_Tai_Lue

	 * Nko

	 * Ogham

	 * Old_Italic

	 * Old_Persian

	 * Oriya

	 * Old_South_Arabian

	 * Old_Turkic

	 * Ol_Chiki

	 * Osmanya

	 * Phags_Pa

	 * Phoenician

	 * Rejang

	 * Runic

	 * Samaritan

	 * Saurashtra

	 * Sharada

	 * Shavian

	 * Sinhala

	 * Sora_Sompeng

	 * Sundanese

	 * Syloti_Nagri

	 * Syriac

	 * Tagalog

	 * Tagbanwa

	 * Tai_Le

	 * Tai_Tham

	 * Tai_Viet

	 * Takri

	 * Tamil

	 * Telugu

	 * Thaana

	 * Thai

	 * Tibetan

	 * Tifinagh

	 * Ugaritic

	 * Vai

	 * Yi

       Each character has exactly one Unicode general category property, spec-
       ified by	a two-letter abbreviation. For compatibility with Perl,	 nega-
       tion  can  be  specified	 by including a	circumflex between the opening
       brace and the property name. For example, \p{^Lu} is the same as
       \P{Lu}.

       If only one letter is specified with \p or \P, it includes all the gen-
       eral category properties	that start with	that letter. In	this case,  in
       the  absence of negation, the curly brackets in the escape sequence are
       optional; these two examples have the same effect:

	 * \p{L}

	 * \pL

       The following general category property codes are supported:

           C   Other
           Cc  Control
           Cf  Format
           Cn  Unassigned
           Co  Private use
           Cs  Surrogate

           L   Letter
           Ll  Lower case letter
           Lm  Modifier letter
           Lo  Other letter
           Lt  Title case letter
           Lu  Upper case letter

           M   Mark
           Mc  Spacing mark
           Me  Enclosing mark
           Mn  Non-spacing mark

           N   Number
           Nd  Decimal number
           Nl  Letter number
           No  Other number

           P   Punctuation
           Pc  Connector punctuation
           Pd  Dash punctuation
           Pe  Close punctuation
           Pf  Final punctuation
           Pi  Initial punctuation
           Po  Other punctuation
           Ps  Open punctuation

           S   Symbol
           Sc  Currency symbol
           Sk  Modifier symbol
           Sm  Mathematical symbol
           So  Other symbol

           Z   Separator
           Zl  Line separator
           Zp  Paragraph separator
           Zs  Space separator

       The special property L& is also supported: it matches a character  that
       has  the	 Lu,  Ll, or Lt	property, in other words, a letter that	is not
       classified as a modifier	or "other".

       The Cs (Surrogate) property applies only	to  characters	in  the	 range
       U+D800  to U+DFFF. Such characters are not valid	in Unicode strings and
       so cannot be tested by PCRE. Perl does not support the Cs property.

       The long	synonyms for  property	names  that  Perl  supports  (such  as
       \p{Letter})  are	 not  supported	by PCRE, nor is	it permitted to	prefix
       any of these properties with "Is".

       No character that is in the Unicode table has the Cn (unassigned) prop-
       erty.  Instead, this property is	assumed	for any	code point that	is not
       in the Unicode table.

       Specifying caseless matching does not affect  these  escape  sequences.
       For  example,  \p{Lu}  always  matches only upper case letters. This is
       different from the behaviour of current versions	of Perl.

       Matching	characters by Unicode property is not fast, because  PCRE  has
       to  do  a  multistage table lookup in order to find a character's prop-
       erty. That is why the traditional escape	sequences such as \d and \w do
       not use Unicode properties in PCRE by default, though you can make them
       do so by	setting	the ucp	option or by starting the pattern with (*UCP).
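       The property escapes described in this section can be tried out with a
       sketch such as the following. The module prop_demo and the function
       has_prop/2 are hypothetical names; the property is passed as a string
       and is assumed to be a valid script name or general category code.

```erlang
-module(prop_demo).
-export([has_prop/2]).

%% Illustrative helper (not part of the re module).
%% True if the code point CP matches \p{Prop}. Prop is a string such as
%% "Greek", "Lu", "L", or "^Lu" (the negated form).
has_prop(CP, Prop) ->
    Pattern = "^\\p{" ++ Prop ++ "}$",
    re:run([CP], Pattern, [unicode, {capture, none}]) =:= match.
```

       For example, prop_demo:has_prop(16#3B1, "Greek") tests the Greek small
       letter alpha against its script, and prop_demo:has_prop($a, "^Lu")
       exercises the Perl-compatible negation syntax.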

       Extended	grapheme clusters

       The  \X	escape	matches	 any number of Unicode characters that form an
       "extended grapheme cluster", and	treats the sequence as an atomic group
       (see below). Up to and including	release	8.31, PCRE matched an earlier,
       simpler definition that was equivalent to

           (?>\PM\pM*)

       That is,	it matched a character without the "mark"  property,  followed
       by  zero	 or  more characters with the "mark" property. Characters with
       the "mark" property are typically non-spacing accents that  affect  the
       preceding character.

       This  simple definition was extended in Unicode to include more compli-
       cated kinds of composite	character by giving each character a  grapheme
       breaking	 property,  and	 creating  rules  that use these properties to
       define the boundaries of	extended grapheme  clusters.  In  releases  of
       PCRE later than 8.31, \X	matches	one of these clusters.

       \X always matches at least one character. Then it decides whether to
       add additional characters according to the following rules for ending a
       cluster:

       1. End at the end of the subject string.

       2. Do not end between CR and LF; otherwise end after any control
          character.

       3. Do not break Hangul (a Korean script) syllable sequences. Hangul
          characters are of five types: L, V, T, LV, and LVT. An L character
          may be followed by an L, V, LV, or LVT character; an LV or V
          character may be followed by a V or T character; an LVT or T
          character may be followed only by a T character.

       4. Do not end before extending characters or spacing marks. Characters
          with the "mark" property always have the "extend" grapheme breaking
          property.

       5. Do not end after prepend characters.

       6. Otherwise, end the cluster.
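       A sketch of how \X might be used to cut a subject into extended
       grapheme clusters. The module and function names are illustrative, and
       the subject is assumed to be a list of code points.

```erlang
-module(grapheme_demo).
-export([clusters/1]).

%% Illustrative helper (not part of the re module).
%% Return the extended grapheme clusters of Chars, in order, by matching
%% \X repeatedly with the global option.
clusters(Chars) ->
    case re:run(Chars, "\\X", [unicode, global, {capture, first, list}]) of
        {match, Matches} -> [Cluster || [Cluster] <- Matches];
        nomatch -> []
    end.
```

       Given the letter "e" followed by a combining acute accent (U+0301) and
       then "x", the function is expected to return two clusters: the
       accented letter as one unit, and "x".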

       PCRE's additional properties

       As well as the standard Unicode properties described above,  PCRE  sup-
       ports  four  more  that	make it	possible to convert traditional	escape
       sequences such as \w and	\s and POSIX character classes to use  Unicode
       properties. PCRE uses these non-standard, non-Perl properties
       internally when the ucp option is set. However, they may also be used
       explicitly. These properties are:

           Xan  Any alphanumeric character

           Xps  Any POSIX space character

           Xsp  Any Perl space character

           Xwd  Any Perl "word" character

       Xan  matches  characters	that have either the L (letter)	or the N (num-
       ber) property. Xps matches the characters tab, linefeed,	vertical  tab,
       form  feed,  or carriage	return,	and any	other character	that has the Z
       (separator) property. Xsp is the	same as	Xps, except that vertical  tab
       is excluded. Xwd	matches	the same characters as Xan, plus underscore.

       There  is another non-standard property,	Xuc, which matches any charac-
       ter that	can be represented by a	Universal Character Name  in  C++  and
       other  programming  languages.  These are the characters	$, @, `	(grave
       accent),	and all	characters with	Unicode	code points  greater  than  or
       equal  to U+00A0, except	for the	surrogates U+D800 to U+DFFF. Note that
       most base (ASCII) characters are	excluded. (Universal  Character	 Names
       are  of	the  form \uHHHH or \UHHHHHHHH where H is a hexadecimal	digit.
       Note that the Xuc property does not match these sequences but the char-
       acters that they	represent.)

       Resetting the match start

       The  escape sequence \K causes any previously matched characters	not to
       be included in the final matched sequence. For example, the pattern:

           foo\Kbar

       matches "foobar", but reports that it has matched "bar".	 This  feature
       is  similar  to	a  lookbehind assertion	(described below). However, in
       this case, the part of the subject before the real match	does not  have
       to  be of fixed length, as lookbehind assertions	do. The	use of \K does
       not interfere with the setting of  captured  substrings.	 For  example,
       when the pattern

           (foo)\Kbar

       matches "foobar", the first substring is	still set to "foo".

       Perl  documents	that  the  use	of  \K	within assertions is "not well
       defined". In PCRE, \K is	acted upon  when  it  occurs  inside  positive
       assertions, but is ignored in negative assertions.
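       The effect of \K on the reported match and on captured substrings can
       be observed with a small sketch. The module k_demo and its function are
       hypothetical; the subject "foobar" is fixed for brevity.

```erlang
-module(k_demo).
-export([run/1]).

%% Illustrative helper (not part of the re module).
%% Match Pattern against "foobar" and return the whole match followed by
%% all captured substrings, as strings.
run(Pattern) ->
    re:run("foobar", Pattern, [{capture, all, list}]).
```

       Calling k_demo:run("foo\\Kbar") should report only "bar" as the
       matched text, and k_demo:run("(foo)\\Kbar") should report "bar"
       together with the first captured substring "foo".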

       Simple assertions

       The  final use of backslash is for certain simple assertions. An	asser-
       tion specifies a	condition that has to be met at	a particular point  in
       a  match, without consuming any characters from the subject string. The
       use of subpatterns for more complicated assertions is described	below.
       The backslashed assertions are:

           \b  matches at a word boundary

           \B  matches when not at a word boundary

           \A  matches at the start of the subject

           \Z  matches at the end of the subject; also matches before a
               newline at the end of the subject

           \z  matches only at the end of the subject

           \G  matches at the first matching position in the subject

       Inside a	character class, \b has	a different meaning;  it  matches  the
       backspace  character.  If  any  other  of these assertions appears in a
       character class,	by default it matches the corresponding	literal	 char-
       acter (for example, \B matches the letter B).

       A  word	boundary is a position in the subject string where the current
       character and the previous character do not both	match \w or  \W	 (i.e.
       one  matches  \w	 and the other matches \W), or the start or end	of the
       string if the first or last character matches \w,  respectively.	 In  a
       UTF  mode,  the meanings	of \w and \W can be changed by setting the ucp
       option. When this is done, it also affects \b and \B. Neither PCRE  nor
       Perl has	a separate "start of word" or "end of word" metasequence. How-
       ever, whatever follows \b normally determines which it is. For example,
       the fragment \ba	matches	"a" at the start of a word.

       The  \A,	 \Z,  and \z assertions	differ from the	traditional circumflex
       and dollar (described in	the next section) in that they only ever match
       at  the	very start and end of the subject string, whatever options are
       set. Thus, they are independent of multiline mode. These	 three	asser-
       tions  are  not	affected by the	notbol or noteol options, which	affect
       only the	behaviour of the circumflex and	 dollar	 metacharacters.  How-
       ever,  if  the startoffset argument of re:run/3 is non-zero, indicating
       that matching is	to start at a point other than the  beginning  of  the
       subject,	 \A  can never match. The difference between \Z	and \z is that
       \Z matches before a newline at the end of the string as well as at  the
       very end, whereas \z matches only at the	end.
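       The following sketch illustrates \b and the difference between \Z and
       \z. The module and function names are illustrative, and Word is
       assumed to contain no regular expression metacharacters.

```erlang
-module(anchor_demo).
-export([replace_word/3, matches/2]).

%% Illustrative helpers (not part of the re module).
%% Replace whole-word occurrences of Word in Subject with New. The \b
%% assertions prevent matches inside longer words.
replace_word(Subject, Word, New) ->
    Pattern = "\\b" ++ Word ++ "\\b",
    re:replace(Subject, Pattern, New, [global, {return, list}]).

%% True if Pattern matches somewhere in Subject.
matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.
```

       With the subject "abc\n", the pattern "abc\\Z" is expected to match
       because \Z tolerates a final newline, whereas "abc\\z" is expected to
       fail.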

       The  \G assertion is true only when the current matching	position is at
       the start point of the match, as	specified by the startoffset  argument
       of  re:run/3.  It differs from \A when the value	of startoffset is non-
       zero. By	calling	re:run/3 multiple times	 with  appropriate  arguments,
       you  can	 mimic Perl's /g option, and it	is in this kind	of implementa-
       tion where \G can be useful.

       Note, however, that PCRE's interpretation of \G,	as the	start  of  the
       current match, is subtly	different from Perl's, which defines it	as the
       end of the previous match. In Perl, these can  be  different  when  the
       previously  matched  string was empty. Because PCRE does	just one match
       at a time, it cannot reproduce this behaviour.

       If all the alternatives of a pattern begin with \G, the	expression  is
       anchored	to the starting	match position,	and the	"anchored" flag	is set
       in the compiled regular expression.
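       A sketch contrasting \G with \A when matching starts partway through
       the subject. It assumes the start offset is supplied through the
       {offset, N} option of re:run/3; the module and function names are
       hypothetical.

```erlang
-module(g_demo).
-export([at_offset/3]).

%% Illustrative helper (not part of the re module).
%% True if Pattern matches Subject when matching starts at byte Offset.
at_offset(Subject, Pattern, Offset) ->
    Opts = [{offset, Offset}, {capture, none}],
    re:run(Subject, Pattern, Opts) =:= match.
```

       For the subject "aab" and offset 2, the pattern "\\Gb" is expected to
       match because the start point of the match is at "b", while "\\Ab"
       cannot match because position 2 is not the start of the subject.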

       The circumflex and dollar  metacharacters  are  zero-width  assertions.
       That  is,  they test for	a particular condition being true without con-
       suming any characters from the subject string.

       Outside a character class, in the default matching mode,	the circumflex
       character  is  an  assertion  that is true only if the current matching
       point is	at the start of	the subject string. If the  startoffset	 argu-
       ment  of	re:run/3 is non-zero, circumflex can never match if the	multi-
       line option is unset. Inside  a	character  class,  circumflex  has  an
       entirely	different meaning (see below).

       Circumflex  need	 not be	the first character of the pattern if a	number
       of alternatives are involved, but it should be the first	thing in  each
       alternative  in	which  it appears if the pattern is ever to match that
       branch. If all possible alternatives start with a circumflex, that  is,
       if  the	pattern	 is constrained	to match only at the start of the sub-
       ject, it	is said	to be an "anchored" pattern.  (There  are  also	 other
       constructs that can cause a pattern to be anchored.)

       The  dollar  character is an assertion that is true only	if the current
       matching	point is at the	end of	the  subject  string,  or  immediately
       before  a newline at the	end of the string (by default).	Note, however,
       that it does not	actually match the newline. Dollar  need  not  be  the
       last character of the pattern if	a number of alternatives are involved,
       but it should be	the last item in any branch in which it	appears.  Dol-
       lar has no special meaning in a character class.

       The  meaning  of	 dollar	 can be	changed	so that	it matches only	at the
       very end	of the string, by setting the dollar_endonly option at compile
       time. This does not affect the \Z assertion.

       The meanings of the circumflex and dollar characters are	changed	if the
       multiline option	is set.	When this is the case,	a  circumflex  matches
       immediately after internal newlines as well as at the start of the sub-
       ject string. It does not	match after a newline that ends	the string.  A
       dollar  matches	before	any  newlines in the string, as	well as	at the
       very end, when multiline is set. When newline is specified as the
       two-character sequence CRLF, isolated CR and LF characters do not
       indicate newlines.

       For example, the	pattern	/^abc$/	matches	the subject string  "def\nabc"
       (where  \n  represents a	newline) in multiline mode, but	not otherwise.
       Consequently, patterns that are anchored	in single  line	 mode  because
       all  branches  start  with  ^ are not anchored in multiline mode, and a
       match for circumflex is	possible  when	the  startoffset  argument  of
       re:run/3	is non-zero. The dollar_endonly	option is ignored if multiline
       is set.

       Note that the sequences \A, \Z, and \z can be used to match  the	 start
       and  end	of the subject in both modes, and if all branches of a pattern
       start with \A it	is always anchored, whether or not multiline is	set.
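       The interaction of circumflex and dollar with the multiline and
       dollar_endonly options can be checked with a sketch like this one. The
       module and function names are illustrative only.

```erlang
-module(multiline_demo).
-export([matches/3]).

%% Illustrative helper (not part of the re module).
%% True if Pattern matches somewhere in Subject under the options Opts.
matches(Subject, Pattern, Opts) ->
    re:run(Subject, Pattern, [{capture, none} | Opts]) =:= match.
```

       For example, multiline_demo:matches("def\nabc", "^abc$", []) is
       expected to be false, and the same call with [multiline] is expected
       to be true.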

       Outside a character class, a dot	in the pattern matches any one charac-
       ter  in	the subject string except (by default) a character that	signi-
       fies the	end of a line.

       When a line ending is defined as	a single character, dot	never  matches
       that  character;	when the two-character sequence	CRLF is	used, dot does
       not match CR if it is immediately followed  by  LF,  but	 otherwise  it
       matches	all characters (including isolated CRs and LFs). When any Uni-
       code line endings are being recognized, dot does	not match CR or	LF  or
       any of the other	line ending characters.

       The  behaviour  of  dot	with regard to newlines	can be changed.	If the
       dotall option is	set, a dot matches any one character,  without	excep-
       tion.  If  the  two-character  sequence	CRLF is	present	in the subject
       string, it takes	two dots to match it.

       The handling of dot is entirely independent of the handling of  circum-
       flex  and  dollar,  the	only relationship being	that they both involve
       newlines. Dot has no special meaning in a character class.

       The escape sequence \N behaves like a dot, except that it is not
       affected by the dotall option. In other words, it matches any
       character except	one that signifies the end of a	line. Perl  also  uses
       \N to match characters by name; PCRE does not support this.
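       A brief sketch of how dot, the dotall option, and \N treat a newline,
       assuming the default newline convention of a single LF. The module and
       function names are hypothetical.

```erlang
-module(dot_demo).
-export([first/3]).

%% Illustrative helper (not part of the re module).
%% Return the first match of Pattern in Subject as a string, or nomatch.
first(Subject, Pattern, Opts) ->
    case re:run(Subject, Pattern, [{capture, first, list} | Opts]) of
        {match, [Matched]} -> Matched;
        nomatch -> nomatch
    end.
```

       With the subject "abc\ndef", the pattern ".+" is expected to stop at
       the newline unless dotall is given, whereas "\\N+" should stop at the
       newline even when dotall is set.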

       Outside	a character class, the escape sequence \C matches any one data
       unit, whether or	not a UTF mode is set. One  data  unit	is  one	 byte.
       Unlike  a dot, \C always	matches	line-ending characters.	The feature is
       provided	in Perl	in order to match individual bytes in UTF-8 mode,  but
       it is unclear how it can	usefully be used. Because \C breaks up charac-
       ters into individual data units,	matching one unit with	\C  in	a  UTF
       mode  means  that the rest of the string	may start with a malformed UTF
       character. This has undefined results, because PCRE assumes that	it  is
       dealing with valid UTF strings.

       PCRE  does  not	allow \C to appear in lookbehind assertions (described
       below) in a UTF mode, because this would	make it	impossible  to	calcu-
       late the	length of the lookbehind.

       In general, the \C escape sequence is best avoided. However, one	way of
       using it	that avoids the	problem	of malformed UTF characters is to  use
       a  lookahead to check the length	of the next character, as in this pat-
       tern, which could be used with a	UTF-8 string (ignore white  space  and
       line breaks):

         (?| (?=[\x00-\x7f])(\C) |
             (?=[\x80-\x{7ff}])(\C)(\C) |
             (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
             (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))

       A  group	 that starts with (?| resets the capturing parentheses numbers
       in each alternative (see	"Duplicate  Subpattern	Numbers"  below).  The
       assertions  at  the start of each branch	check the next UTF-8 character
       for values whose	encoding uses 1, 2, 3, or 4 bytes,  respectively.  The
       character's  individual bytes are then captured by the appropriate num-
       ber of groups.

       An opening square bracket introduces a character	class, terminated by a
       closing square bracket. A closing square	bracket	on its own is not spe-
       cial by default.	However, if the	PCRE_JAVASCRIPT_COMPAT option is  set,
       a lone closing square bracket causes a compile-time error. If a closing
       square bracket is required as a member of the class, it should  be  the
       first  data  character  in  the	class (after an	initial	circumflex, if
       present)	or escaped with	a backslash.

       A character class matches a single character in the subject. In	a  UTF
       mode,  the  character  may  be  more than one data unit long. A matched
       character must be in the	set of characters defined by the class,	unless
       the  first  character in	the class definition is	a circumflex, in which
       case the	subject	character must not be in the set defined by the	class.
       If  a  circumflex is actually required as a member of the class,	ensure
       it is not the first character, or escape	it with	a backslash.

       For example, the	character class	[aeiou]	matches	any lower case	vowel,
       while  [^aeiou]	matches	 any character that is not a lower case	vowel.
       Note that a circumflex is just a	convenient notation for	specifying the
       characters  that	 are in	the class by enumerating those that are	not. A
       class that starts with a	circumflex is not an assertion;	it still  con-
       sumes  a	 character  from the subject string, and therefore it fails if
       the current pointer is at the end of the	string.

       In UTF-8 mode, characters with values greater than 255 can be included
       in a class as a literal string of data units, or by using the \x{
       escaping mechanism.

       When caseless matching is set, any letters in a	class  represent  both
       their  upper  case  and lower case versions, so for example, a caseless
       [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not
       match  "A", whereas a caseful version would. In a UTF mode, PCRE	always
       understands the concept of case for characters whose  values  are  less
       than  256, so caseless matching is always possible. For characters with
       higher values, the concept of case is supported	if  PCRE  is  compiled
       with  Unicode  property	support, but not otherwise. If you want	to use
       caseless	matching in a UTF mode for characters 256 and above, you  must
       ensure  that  PCRE is compiled with Unicode property support as well as
       with UTF	support.

       Characters that might indicate line breaks are never treated in any
       special way when matching character classes, whatever line-ending
       sequence is in use, and whatever setting of the dotall and multiline
       options is used. A class such as [^a] always matches one of these
       characters.

       The minus (hyphen) character can	be used	to specify a range of  charac-
       ters  in	 a  character  class.  For  example,  [d-m] matches any	letter
       between d and m,	inclusive. If a	 minus	character  is  required	 in  a
       class,  it  must	 be  escaped  with a backslash or appear in a position
       where it	cannot be interpreted as indicating a range, typically as  the
       first or	last character in the class.

       It is not possible to have the literal character	"]" as the end charac-
       ter of a	range. A pattern such as [W-]46] is interpreted	as a class  of
       two  characters ("W" and	"-") followed by a literal string "46]", so it
       would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a
       backslash  it is	interpreted as the end of range, so [W-\]46] is	inter-
       preted as a class containing a range followed by	two other  characters.
       The  octal or hexadecimal representation	of "]" can also	be used	to end
       a range.
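       The class and range rules above, including the treatment of "]", can
       be checked with a sketch such as the following. The module class_demo
       and the function whole_match/2 are illustrative names; the helper
       anchors the pattern so that it must match the entire subject.

```erlang
-module(class_demo).
-export([whole_match/2]).

%% Illustrative helper (not part of the re module).
%% True if Pattern matches all of Subject. The pattern is wrapped in a
%% non-capturing group and anchored at both ends.
whole_match(Subject, Pattern) ->
    Anchored = "^(?:" ++ Pattern ++ ")$",
    re:run(Subject, Anchored, [{capture, none}]) =:= match.
```

       Here class_demo:whole_match("W46]", "[W-]46]") is expected to be true,
       since the class ends at the first "]". With the escaped form
       "[W-\\]46]" the whole expression is a single class, so it should match
       "X" but not "W46]".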

       Ranges operate in the collating sequence	of character values. They  can
       also   be  used	for  characters	 specified  numerically,  for  example
       [\000-\037]. Ranges can include any characters that are valid  for  the
       current mode.

       If a range that includes	letters	is used	when caseless matching is set,
       it matches the letters in either	case. For example, [W-c] is equivalent
       to  [][\\^_`wxyzabc],  matched  caselessly,  and	 in a non-UTF mode, if
       character tables	for a French locale are	in  use,  [\xc8-\xcb]  matches
       accented	 E  characters	in both	cases. In UTF modes, PCRE supports the
       concept of case for characters with values greater than 255  only  when
       it is compiled with Unicode property support.

       The  character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
       \w, and \W may appear in	a character class, and add the characters that
       they  match to the class. For example, [\dABCDEF] matches any hexadeci-
       mal digit. In UTF modes,	the ucp	option affects the meanings of \d, \s,
       \w and their upper case partners, just as it does when they appear out-
       side a character	class, as described in the section  entitled  "Generic
       character  types" above.	The escape sequence \b has a different meaning
       inside a	character class;  it  matches  the  backspace  character.  The
       sequences  \B, \N, \R, and \X are not special inside a character	class.
       Like any	other unrecognized escape sequences, they are treated  as  the
       literal characters "B", "N", "R", and "X".

       A  circumflex  can  conveniently	 be used with the upper	case character
       types to	specify	a more restricted set of characters than the  matching
       lower  case  type.  For example,	the class [^\W_] matches any letter or
       digit, but not underscore, whereas [\w] includes	underscore. A positive
       character class should be read as "something OR something OR ..." and a
       negative	class as "NOT something	AND NOT	something AND NOT ...".
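       As a sketch of character types used inside classes, the following
       helper keeps only those characters of a subject that a given class
       matches. The module and function names are made up for illustration,
       and the subject is assumed to contain only characters below 256.

```erlang
-module(class_filter_demo).
-export([keep/2]).

%% Illustrative helper (not part of the re module).
%% Return the characters of Subject that match Class, tested one at a time.
keep(Subject, Class) ->
    [C || C <- Subject,
          re:run([C], Class, [{capture, none}]) =:= match].
```

       For example, class_filter_demo:keep("a_1-Z", "[^\\W_]") is expected to
       drop both the underscore and the hyphen, while the class "[\\w]"
       should keep the underscore.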

       The only	metacharacters that are	recognized in  character  classes  are
       backslash,  hyphen  (only  where	 it can	be interpreted as specifying a
       range), circumflex (only	at the start), opening	square	bracket	 (only
       when  it	can be interpreted as introducing a POSIX class	name - see the
       next section), and the terminating  closing  square  bracket.  However,
       escaping	other non-alphanumeric characters does no harm.

       Perl supports the POSIX notation	for character classes. This uses names
       enclosed	by [: and :] within the	enclosing square brackets.  PCRE  also
       supports this notation. For example,

           [01[:alpha:]%]

       matches "0", "1", any alphabetic character, or "%". The supported class
       names are:

           alnum   letters and digits
           alpha   letters
           ascii   character codes 0 - 127
           blank   space or tab only
           cntrl   control characters
           digit   decimal digits (same as \d)
           graph   printing characters, excluding space
           lower   lower case letters
           print   printing characters, including space
           punct   printing characters, excluding letters and digits and space
           space   whitespace (not quite the same as \s)
           upper   upper case letters
           word    "word" characters (same as \w)
           xdigit  hexadecimal digits

       The "space" characters are HT (9), LF (10), VT (11), FF (12), CR	 (13),
       and  space  (32). Notice	that this list includes	the VT character (code
       11). This makes "space" different to \s,	which does not include VT (for
       Perl compatibility).

       The  name  "word"  is  a	Perl extension,	and "blank" is a GNU extension
       from Perl 5.8. Another Perl extension is	negation, which	 is  indicated
       by a ^ character after the colon. For example,

           [12[:^digit:]]

       matches	"1", "2", or any non-digit. PCRE (and Perl) also recognize the
       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
       these are not supported,	and an error is	given if they are encountered.
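       POSIX class names, including the negated form, can be exercised with a
       sketch like the one below. The module posix_demo and the function
       strip/2 are hypothetical.

```erlang
-module(posix_demo).
-export([strip/2]).

%% Illustrative helper (not part of the re module).
%% Remove every character of Subject that matches the class Class.
strip(Subject, Class) ->
    re:replace(Subject, Class, "", [global, {return, list}]).
```

       For example, posix_demo:strip("1x23", "[12[:^digit:]]") is expected to
       remove "1", "2", and the non-digit "x", leaving "3".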

       By default, in UTF modes, characters with values greater than 255 do
       not match any of the POSIX character classes. However, if the ucp
       option is set, some of the classes are changed so that Unicode
       character properties are used. This is achieved by replacing the POSIX
       classes by other sequences, as follows:

           [:alnum:]  becomes \p{Xan}

           [:alpha:]  becomes \p{L}

           [:blank:]  becomes \h

           [:digit:]  becomes \p{Nd}

           [:lower:]  becomes \p{Ll}

           [:space:]  becomes \p{Xps}

           [:upper:]  becomes \p{Lu}

           [:word:]   becomes \p{Xwd}

       Negated versions, such as [:^alpha:] use	\P instead of  \p.  The	 other
       POSIX classes are unchanged, and	match only characters with code	points
       less than 256.

       Vertical	bar characters are used	to separate alternative	patterns.  For
       example, the pattern

           gilbert|sullivan

       matches	either "gilbert" or "sullivan".	Any number of alternatives may
       appear, and an empty  alternative  is  permitted	 (matching  the	 empty
       string).	The matching process tries each	alternative in turn, from left
       to right, and the first one that	succeeds is used. If the  alternatives
       are  within a subpattern	(defined below), "succeeds" means matching the
       rest of the main	pattern	as well	as the alternative in the  subpattern.
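       The left-to-right order in which alternatives are tried can be seen in
       a sketch such as this. The module and function names are illustrative.

```erlang
-module(alt_demo).
-export([first_match/2]).

%% Illustrative helper (not part of the re module).
%% Return the text of the first match of Pattern in Subject, or nomatch.
first_match(Subject, Pattern) ->
    case re:run(Subject, Pattern, [{capture, first, list}]) of
        {match, [Matched]} -> Matched;
        nomatch -> nomatch
    end.
```

       With the subject "abcd", the pattern "ab|abcd" is expected to yield
       "ab" because the first alternative succeeds. In "(?:ab|abcd)$" the
       first alternative cannot be followed by the end of the subject, so the
       second one should be used and "abcd" returned.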

       The  settings  of the caseless, multiline, dotall, and extended options
       (which are Perl-compatible) can be changed from within the pattern by a
       sequence	 of  Perl  option  letters  enclosed between "(?" and ")". The
       option letters are

           i  for caseless

           m  for multiline

           s  for dotall

           x  for extended

       For example, (?im) sets caseless, multiline matching. It	is also	possi-
       ble to unset these options by preceding the letter with a hyphen, and a
       combined	setting	and unsetting such as (?im-sx),	 which	sets  caseless
       and  multiline  while unsetting dotall and extended, is also permitted.
       If a letter appears both before and after the hyphen, the option is
       unset.

       The  PCRE-specific options dupnames, ungreedy, and extra	can be changed
       in the same way as the Perl-compatible options by using the  characters
       J, U and	X respectively.

       When  one  of  these  option  changes occurs at top level (that is, not
       inside subpattern parentheses), the change applies to the remainder  of
       the pattern that	follows. If the	change is placed right at the start of
       a pattern, PCRE extracts	it into	the global options.

       An option change	within a subpattern (see below for  a  description  of
       subpatterns) affects only that part of the subpattern that follows it,
       so

       (a(?i)b)c

       matches abc and aBc and no other	 strings  (assuming  caseless  is  not
       used). By this means, options can be made to have different settings in
       different parts of the pattern. Any changes made	in one alternative  do
       carry on into subsequent branches within the same subpattern. For
       example:

       (a(?i)b|c)

       matches "ab", "aB", "c",	and "C", even though  when  matching  "C"  the
       first  branch  is  abandoned before the option setting. This is because
       the effects of option settings happen at	compile	time. There  would  be
       some very weird behaviour otherwise.
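
       The scope of an inline option change can be observed with a small
       helper. This is a sketch under stated assumptions: the module name
       re_inline_opt_demo and the function matches/2 are hypothetical, and the
       patterns are anchored with ^ and $ only so that the whole subject must
       match.

```erlang
-module(re_inline_opt_demo).
-export([matches/2]).

%% True if Pattern matches Subject, false otherwise.
%% {capture, none} makes re:run/3 return the bare atoms match | nomatch.
matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.
```

       With this helper, matches("aBc", "^(a(?i)b)c$") is true while
       matches("abC", "^(a(?i)b)c$") is false, because the caseless setting
       ends with the subpattern. matches("C", "^(a(?i)b|c)$") is true, because
       the setting made in the first branch carries on into the second.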

       Note:  There  are  other	 PCRE-specific	options	that can be set	by the
       application when	the compiling or matching  functions  are  called.  In
       some  cases  the	 pattern can contain special leading sequences such as
       (*CRLF) to override what	the application	 has  set  or  what  has  been
       defaulted.   Details   are  given  in  the  section  entitled  "Newline
       sequences" above.  There	 are  also  the	 (*UTF8)  and  (*UCP)  leading
       sequences  that can be used to set UTF and Unicode property modes; they
       are equivalent to setting the unicode  and  the	ucp  options,  respec-
       tively.	The (*UTF) sequence is a generic version that can be used with
       any of the libraries. However, the application can  set	the  never_utf
       option, which locks out the use of the (*UTF) sequences.

       Subpatterns are delimited by parentheses	(round brackets), which	can be
       nested. Turning part of a pattern into a	subpattern does	two things:

       1. It localizes a set of alternatives. For example, the pattern

       cat(aract|erpillar|)

       matches "cataract", "caterpillar", or "cat". Without  the  parentheses,
       it would	match "cataract", "erpillar" or	an empty string.

       2.  It  sets  up	 the  subpattern as a capturing	subpattern. This means
       that, when the complete pattern matches,	that portion  of  the  subject
       string that matched the subpattern is passed back to the	caller via the
       return value of re:run/3.

       Opening parentheses are counted from left to right (starting from 1) to
       obtain numbers for the capturing subpatterns. For example, if the string
       "the red	king" is matched against the pattern

       the ((red|white)	(king|queen))

       the captured substrings are "red	king", "red", and "king", and are num-
       bered 1,	2, and 3, respectively.

       The  fact  that	plain  parentheses  fulfil two functions is not	always
       helpful.	There are often	times when a grouping subpattern  is  required
       without	a capturing requirement. If an opening parenthesis is followed
       by a question mark and a	colon, the subpattern does not do any  captur-
       ing,  and  is  not  counted when	computing the number of	any subsequent
       capturing subpatterns. For example, if the string "the white queen"  is
       matched against the pattern

       the ((?:red|white) (king|queen))

       the captured substrings are "white queen" and "queen", and are numbered
       1 and 2.	The maximum number of capturing	subpatterns is 65535.
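
       The numbering of capturing and non-capturing subpatterns can be
       inspected through the capture specification of re:run/3. A minimal
       sketch, in which the module name re_capture_demo and the helper
       groups/2 are hypothetical names used only for illustration:

```erlang
-module(re_capture_demo).
-export([groups/2]).

%% Return the captured substrings, excluding the whole match, as strings
%% in subpattern-number order, or nomatch if the pattern does not match.
groups(Subject, Pattern) ->
    case re:run(Subject, Pattern, [{capture, all_but_first, list}]) of
        {match, Groups} -> Groups;
        nomatch -> nomatch
    end.
```

       groups("the red king", "the ((red|white) (king|queen))") returns
       ["red king","red","king"], whereas
       groups("the white queen", "the ((?:red|white) (king|queen))") returns
       ["white queen","queen"], since the (?: group is not counted.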

       As a convenient shorthand, if any option	settings are required  at  the
       start  of  a  non-capturing  subpattern,	 the option letters may	appear
       between the "?" and the ":". Thus the two patterns

	 * (?i:saturday|sunday)

	 * (?:(?i)saturday|sunday)

       match exactly the same set of strings. Because alternative branches are
       tried  from  left  to right, and	options	are not	reset until the	end of
       the subpattern is reached, an option setting in one branch does	affect
       subsequent  branches,  so  the above patterns match "SUNDAY" as well as
       "Saturday".

       Perl 5.10 introduced a feature whereby each alternative in a subpattern
       uses  the same numbers for its capturing	parentheses. Such a subpattern
       starts with (?| and is itself a non-capturing subpattern. For  example,
       consider this pattern:

       (?|(Sat)ur|(Sun))day

       Because	the two	alternatives are inside	a (?| group, both sets of cap-
       turing parentheses are numbered one. Thus, when	the  pattern  matches,
       you  can	 look  at captured substring number one, whichever alternative
       matched.	This construct is useful when you want to  capture  part,  but
       not all,	of one of a number of alternatives. Inside a (?| group,	paren-
       theses are numbered as usual, but the number is reset at	the  start  of
       each  branch.  The numbers of any capturing parentheses that follow the
       subpattern start	after the highest number used in any branch. The  fol-
       lowing example is taken from the	Perl documentation. The	numbers	under-
       neath show in which buffer the captured content will be stored.

	 # before  ---------------branch-reset----------- after
	 / ( a )  (?| x	( y ) z	| (p (q) r) | (t) u (v)	) ( z )	/x
	 # 1		2	  2  3	      2	    3	  4

       A back reference	to a numbered subpattern uses the  most	 recent	 value
       that  is	 set  for that number by any subpattern. The following pattern
       matches "abcabc" or "defdef":

       (?|(abc)|(def))\1

       In contrast, a subroutine call to a numbered subpattern	always	refers
       to  the	first  one in the pattern with the given number. The following
       pattern matches "abcabc" or "defabc":

       (?|(abc)|(def))(?1)

       If a condition test for a subpattern's having matched refers to a  non-
       unique  number, the test	is true	if any of the subpatterns of that num-
       ber have	matched.

       An alternative approach to using	this "branch reset" feature is to  use
       duplicate named subpatterns, as described in the	next section.
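
       The effect of a (?| group on numbering, back references, and subroutine
       calls can be sketched as follows. The module name re_branch_reset_demo
       and the helpers group1/2 and matches/2 are hypothetical; note that a
       backslash in a pattern must be doubled inside an Erlang string literal.

```erlang
-module(re_branch_reset_demo).
-export([group1/2, matches/2]).

%% Return the text captured by subpattern number 1. An unset subpattern
%% is returned as the empty string when the capture type is list.
group1(Subject, Pattern) ->
    case re:run(Subject, Pattern, [{capture, [1], list}]) of
        {match, [Text]} -> Text;
        nomatch -> nomatch
    end.

%% True if Pattern matches somewhere in Subject.
matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.
```

       group1("Sunday", "(?|(Sat)ur|(Sun))day") returns "Sun", whereas with an
       ordinary group, group1("Sunday", "(?:(Sat)ur|(Sun))day") returns "",
       because "Sun" is then captured as number 2. Likewise,
       matches("defdef", "^(?|(abc)|(def))\\1$") is true, while
       matches("defdef", "^(?|(abc)|(def))(?1)$") is false, since (?1) always
       calls the first subpattern numbered 1, which is (abc).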

       Identifying  capturing  parentheses  by number is simple, but it	can be
       very hard to keep track of the numbers in complicated  regular  expres-
       sions.  Furthermore,  if	 an  expression	 is  modified, the numbers may
       change. To help with this difficulty, PCRE supports the naming of  sub-
       patterns. This feature was not added to Perl until release 5.10.	Python
       had the feature earlier,	and PCRE introduced it at release  4.0,	 using
       the  Python syntax. PCRE	now supports both the Perl and the Python syn-
       tax. Perl allows	identically numbered  subpatterns  to  have  different
       names, but PCRE does not.

       In  PCRE,  a subpattern can be named in one of three ways: (?<name>...)
       or (?'name'...) as in Perl, or (?P<name>...) as in  Python.  References
       to  capturing parentheses from other parts of the pattern, such as back
       references, recursion, and conditions, can be made by name as  well  as
       by number.

       Names  consist  of  up  to  32 alphanumeric characters and underscores.
       Named capturing parentheses are still  allocated	 numbers  as  well  as
       names, exactly as if the	names were not present.	The capture specifica-
       tion to re:run/3	can use	named values if	they are present in the	 regu-
       lar expression.

       By  default, a name must	be unique within a pattern, but	it is possible
       to relax	this constraint	by setting  the	 dupnames  option  at  compile
       time.  (Duplicate  names	are also always	permitted for subpatterns with
       the same	number,	set up as described in the previous  section.)	Dupli-
       cate  names  can	 be useful for patterns	where only one instance	of the
       named parentheses can match. Suppose you	want to	match the  name	 of  a
       weekday,	 either	as a 3-letter abbreviation or as the full name,	and in
       both cases you want to extract the abbreviation.	This pattern (ignoring
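       the line breaks) does the job:

       (?<DN>Mon|Fri|Sun)(?:day)?|
       (?<DN>Tue)(?:sday)?|
       (?<DN>Wed)(?:nesday)?|
       (?<DN>Thu)(?:rsday)?|
       (?<DN>Sat)(?:urday)?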

       There  are  five	capturing substrings, but only one is ever set after a
       match. (An alternative way of solving this problem is to	use a  "branch
       reset" subpattern, as described in the previous section.)

       If the names of capturing named subpatterns are not unique, the first
       matching occurrence (counted from left to right in the subject) is
       returned from re:run/3, if the name is specified in the values part of
       the capture statement. The all_names capturing value matches all of the
       names in the same way.

       Warning:	You cannot use different names to distinguish between two sub-
       patterns	with the same number because PCRE uses only the	 numbers  when
       matching. For this reason, an error is given at compile time if differ-
       ent names are given to subpatterns with the same	number.	 However,  you
       can  give  the same name	to subpatterns with the	same number, even when
       dupnames	is not set.
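
       Named subpatterns can be requested by name in the capture specification
       of re:run/3. The sketch below is illustrative only: the module name
       re_named_demo, the functions date_parts/1 and day_abbrev/1, and the
       subpattern names year and month are hypothetical. The weekday pattern
       uses the name DN several times, so it is compiled with the dupnames
       option.

```erlang
-module(re_named_demo).
-export([date_parts/1, day_abbrev/1]).

%% Extract two uniquely named subpatterns by name.
date_parts(Subject) ->
    re:run(Subject, "(?<year>\\d{4})-(?<month>\\d{2})",
           [{capture, [year, month], list}]).

%% Extract the 3-letter weekday abbreviation. The name DN is reused in
%% every alternative, which requires the dupnames option.
day_abbrev(Subject) ->
    Pattern = "(?<DN>Mon|Fri|Sun)(?:day)?|"
              "(?<DN>Tue)(?:sday)?|"
              "(?<DN>Wed)(?:nesday)?|"
              "(?<DN>Thu)(?:rsday)?|"
              "(?<DN>Sat)(?:urday)?",
    re:run(Subject, Pattern, [dupnames, {capture, ['DN'], list}]).
```

       Under these assumptions, date_parts("released 2024-05") returns
       {match,["2024","05"]}, and day_abbrev("Thursday") returns
       {match,["Thu"]}.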

       Repetition is specified by quantifiers, which can  follow  any  of  the
       following items:

	 * a literal data character

	 * the dot metacharacter

	 * the \C escape sequence

	 * the \X escape sequence

	 * the \R escape sequence

	 * an escape such as \d	or \pL that matches a single character

	 * a character class

	 * a back reference (see next section)

	 * a parenthesized subpattern (including assertions)

	 * a subroutine	call to	a subpattern (recursive	or otherwise)

       The  general repetition quantifier specifies a minimum and maximum num-
       ber of permitted	matches, by giving the two numbers in  curly  brackets
       (braces),  separated  by	 a comma. The numbers must be less than	65536,
       and the first must be less than or equal to the second. For example:

       z{2,4}

       matches "zz", "zzz", or "zzzz". A closing brace on its  own  is	not  a
       special	character.  If	the second number is omitted, but the comma is
       present,	there is no upper limit; if the	second number  and  the	 comma
       are  both omitted, the quantifier specifies an exact number of required
       matches. Thus

       [aeiou]{3,}

       matches at least 3 successive vowels, but may match many more, while

       \d{8}

       matches exactly 8 digits. An opening curly bracket that	appears	 in  a
       position	 where a quantifier is not allowed, or one that	does not match
       the syntax of a quantifier, is taken as a literal character. For	 exam-
       ple, {,6} is not	a quantifier, but a literal string of four characters.
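
       The three forms of the general quantifier can be tried out as follows.
       This is a minimal sketch; the module name re_quant_demo and the helper
       first_match/2 are hypothetical names used only for illustration.

```erlang
-module(re_quant_demo).
-export([first_match/2]).

%% Return the text of the first match of Pattern in Subject as a string,
%% or the atom nomatch when the pattern does not match anywhere.
first_match(Subject, Pattern) ->
    case re:run(Subject, Pattern, [{capture, first, list}]) of
        {match, [Text]} -> Text;
        nomatch -> nomatch
    end.
```

       first_match("zzzzz", "z{2,4}") returns "zzzz" (at most four are taken),
       first_match("say aaeeii", "[aeiou]{3,}") returns "aaeeii", and
       first_match("id 123456789", "\\d{8}") returns "12345678".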

       In Unicode mode,	quantifiers apply to characters	rather than  to	 indi-
       vidual  data  units.  Thus, for example,	\x{100}{2} matches two charac-
       ters, each of which is represented by a two-byte	sequence  in  a	 UTF-8
       string.	Similarly, \X{3} matches three Unicode extended	grapheme clus-
       ters, each of which may be several data units long (and they may	be  of
       different lengths).

       The quantifier {0} is permitted,	causing	the expression to behave as if
       the previous item and the quantifier were not present. This may be use-
       ful  for	 subpatterns that are referenced as subroutines	from elsewhere
       in the pattern (but see also the	section	entitled "Defining subpatterns
       for  use	 by  reference only" below). Items other than subpatterns that
       have a {0} quantifier are omitted from the compiled pattern.

       For convenience,	the three most common quantifiers have	single-charac-
       ter abbreviations:

           *  is equivalent to {0,}

           +  is equivalent to {1,}

           ?  is equivalent to {0,1}

       It  is  possible	 to construct infinite loops by	following a subpattern
       that can	match no characters with a quantifier that has no upper	limit,
       for example:

       (a?)*

       Earlier versions	of Perl	and PCRE used to give an error at compile time
       for such	patterns. However, because there are cases where this  can  be
       useful,	such  patterns	are now	accepted, but if any repetition	of the
       subpattern does in fact match no characters, the loop is forcibly
       broken.

       By  default,  the quantifiers are "greedy", that	is, they match as much
       as possible (up to the maximum  number  of  permitted  times),  without
       causing	the  rest of the pattern to fail. The classic example of where
       this gives problems is in trying	to match comments in C programs. These
       appear  between	/*  and	 */ and	within the comment, individual * and /
       characters may appear. An attempt to match C comments by applying the
       pattern

       /\*.*\*/

       to the string

       /* first	comment	*/ not comment /* second comment */

       fails,  because it matches the entire string owing to the greediness of
       the .* item.

       However,	if a quantifier	is followed by a question mark,	it  ceases  to
       be greedy, and instead matches the minimum number of times possible, so
       the pattern

       /\*.*?\*/

       does the	right thing with the C comments. The meaning  of  the  various
       quantifiers  is	not  otherwise	changed,  just the preferred number of
       matches.	Do not confuse this use	of question mark with  its  use	 as  a
       quantifier  in its own right. Because it	has two	uses, it can sometimes
       appear doubled, as in

       \d??\d

       which matches one digit by preference, but can match two	if that	is the
       only way	the rest of the	pattern	matches.

       If  the	ungreedy  option  is  set  (an option that is not available in
       Perl), the quantifiers are not greedy by	default, but  individual  ones
       can  be	made  greedy  by following them	with a question	mark. In other
       words, it inverts the default behaviour.
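
       The difference between greedy and non-greedy matching on the C comment
       subject can be sketched like this. The module name re_greedy_demo and
       the function comments/2 are hypothetical; global, ungreedy, and
       {capture, first, list} are options of this module.

```erlang
-module(re_greedy_demo).
-export([comments/2]).

%% Return every non-overlapping match of Pattern in a fixed subject that
%% contains two C comments. Options are extra re:run/3 options.
comments(Pattern, Options) ->
    Subject = "/* first comment */ not comment /* second comment */",
    {match, Matches} =
        re:run(Subject, Pattern, [global, {capture, first, list} | Options]),
    [Text || [Text] <- Matches].
```

       comments("/\\*.*\\*/", []) returns a single element holding the entire
       subject, whereas comments("/\\*.*?\\*/", []) returns
       ["/* first comment */","/* second comment */"]. Passing [ungreedy] with
       the first pattern gives the same two comments, because the option
       inverts the default greediness.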

       When a parenthesized subpattern is quantified  with  a  minimum	repeat
       count  that is greater than 1 or	with a limited maximum,	more memory is
       required	for the	compiled pattern, in proportion	to  the	 size  of  the
       minimum or maximum.

       If  a pattern starts with .* or .{0,} and the dotall option (equivalent
       to Perl's /s) is	set, thus allowing the dot to match newlines, the pat-
       tern  is	 implicitly  anchored,	because	whatever follows will be tried
       against every character position	in the subject string, so there	is  no
       point  in  retrying  the	overall	match at any position after the	first.
       PCRE normally treats such a pattern as though it	were preceded by \A.

       In cases	where it is known that the subject  string  contains  no  new-
       lines, it is worth setting dotall in order to obtain this optimization,
       or alternatively	using ^	to indicate anchoring explicitly.

       However,	there are some cases where the optimization  cannot  be	 used.
       When  .*	is inside capturing parentheses	that are the subject of	a back
       reference elsewhere in the pattern, a match at the start	may fail where
       a later one succeeds. Consider, for example:

       (.*)abc\1

       If  the subject is "xyz123abc123" the match point is the	fourth charac-
       ter. For	this reason, such a pattern is not implicitly anchored.

       Another case where implicit anchoring is	not applied is when the	 lead-
       ing  .* is inside an atomic group. Once again, a	match at the start may
       fail where a later one succeeds. Consider this pattern:

       (?>.*?a)b

       It matches "ab" in the subject "aab". The use of the backtracking
       control verbs (*PRUNE) and (*SKIP) also disables this optimization.

       When a capturing	subpattern is repeated,	the value captured is the sub-
       string that matched the final iteration. For example, after

       (tweedle[dume]{3}\s*)+

       has matched "tweedledum tweedledee" the value of	the captured substring
       is  "tweedledee".  However,  if there are nested	capturing subpatterns,
       the corresponding captured values may have been set in previous
       iterations. For example, after

       (a|(b))+

       matches "aba" the value of the second captured substring	is "b".

       With  both  maximizing ("greedy") and minimizing	("ungreedy" or "lazy")
       repetition, failure of what follows normally causes the	repeated  item
       to  be  re-evaluated to see if a	different number of repeats allows the
       rest of the pattern to match. Sometimes it is useful to	prevent	 this,
       either to change the nature of the match, or to cause it to fail earlier
       than it otherwise might,	when the author	of the pattern knows there  is
       no point	in carrying on.

       Consider, for example, the pattern \d+foo when applied to the subject
       line

       123456bar

       After matching all 6 digits and then failing to match "foo", the	normal
       action  of  the matcher is to try again with only 5 digits matching the
       \d+ item, and then with	4,  and	 so  on,  before  ultimately  failing.
       "Atomic	grouping"  (a  term taken from Jeffrey Friedl's	book) provides
       the means for specifying	that once a subpattern has matched, it is  not
       to be re-evaluated in this way.

       If  we  use atomic grouping for the previous example, the matcher gives
       up immediately on failing to match "foo"	the first time.	 The  notation
       is a kind of special parenthesis, starting with (?> as in this example:

       (?>\d+)foo

       This kind of parenthesis	"locks up" the part of the pattern it contains
       once  it	 has  matched,	and a failure further into the pattern is pre-
       vented from backtracking	into it.  Backtracking	past  it  to  previous
       items, however, works as	normal.

       An  alternative	description  is	that a subpattern of this type matches
       the string of characters	that an	 identical  standalone	pattern	 would
       match, if anchored at the current point in the subject string.

       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
       such as the above example can be	thought	of as a	maximizing repeat that
       must  swallow  everything  it can. So, while both \d+ and \d+? are pre-
       pared to	adjust the number of digits they match in order	 to  make  the
       rest of the pattern match, (?>\d+) can only match an entire sequence of
       digits.

       Atomic groups in	general	can of course contain arbitrarily  complicated
       subpatterns,  and  can  be  nested. However, when the subpattern	for an
       atomic group is just a single repeated item, as in the example above, a
       simpler	notation,  called  a "possessive quantifier" can be used. This
       consists	of an additional + character  following	 a  quantifier.	 Using
       this notation, the previous example can be rewritten as

       \d++foo

       Note that a possessive quantifier can be used with an entire group, for
       example:

       (abc|xyz){2,3}+

       Possessive quantifiers are always greedy; the setting of	 the  ungreedy
       option is ignored. They are a convenient	notation for the simpler forms
       of atomic group.	However, there is no difference	in the	meaning	 of  a
       possessive quantifier and the equivalent	atomic group, though there may
       be a performance difference; possessive quantifiers should be  slightly
       faster.

       The  possessive	quantifier syntax is an	extension to the Perl 5.8 syn-
       tax. Jeffrey Friedl originated the idea (and the	 name)	in  the	 first
       edition of his book. Mike McCloskey liked it, so	implemented it when he
       built Sun's Java	package, and PCRE copied it from there.	It  ultimately
       found its way into Perl at release 5.10.

       PCRE has	an optimization	that automatically "possessifies" certain sim-
       ple pattern constructs. For example, the	sequence  A+B  is  treated  as
       A++B  because  there is no point	in backtracking	into a sequence	of A's
       when B must follow.
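
       Atomic groups and possessive quantifiers change which strings match
       only when backtracking into the repeated item would otherwise be
       needed. A minimal sketch of that difference, in which the module name
       re_atomic_demo and the helper matches/2 are hypothetical:

```erlang
-module(re_atomic_demo).
-export([matches/2]).

%% True if Pattern matches somewhere in Subject.
matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.
```

       matches("1234", "\\d+\\d") is true, because \d+ gives back one digit
       for the final \d. matches("1234", "(?>\\d+)\\d") and
       matches("1234", "\\d++\\d") are both false, because the atomic group
       and the possessive quantifier keep every digit they matched.
       matches("123456foo", "(?>\\d+)foo") is still true, since no
       backtracking is needed there.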

       When a pattern contains an unlimited repeat inside  a  subpattern  that
       can  itself  be	repeated  an  unlimited	number of times, the use of an
       atomic group is the only	way to avoid some  failing  matches  taking  a
       very long time indeed. The pattern

       (\D+|<\d+>)*[!?]

       matches	an  unlimited number of	substrings that	either consist of non-
       digits, or digits enclosed in <>, followed by either ! or  ?.  When  it
       matches,	it runs	quickly. However, if it	is applied to


       it  takes  a  long  time	 before	reporting failure. This	is because the
       string can be divided between the internal \D+ repeat and the  external
       *  repeat  in  a	 large	number of ways,	and all	have to	be tried. (The
       example uses [!?] rather	than a single character	at  the	 end,  because
       both  PCRE  and	Perl have an optimization that allows for fast failure
       when a single character is used.	They remember the last single  charac-
       ter  that  is required for a match, and fail early if it	is not present
       in the string.) If the pattern is changed so that  it  uses  an	atomic
       group, like this:

       ((?>\D+)|<\d+>)*[!?]

       sequences  of non-digits	cannot be broken, and failure happens quickly.

       Outside a character class, a backslash followed by a digit greater than
       0 (and possibly further digits) is a back reference to a	capturing sub-
       pattern earlier (that is, to its	left) in the pattern,  provided	 there
       have been that many previous capturing left parentheses.

       However,	if the decimal number following	the backslash is less than 10,
       it is always taken as a back reference, and causes  an  error  only  if
       there  are  not that many capturing left	parentheses in the entire pat-
       tern. In	other words, the parentheses that are referenced need  not  be
       to  the left of the reference for numbers less than 10. A "forward back
       reference" of this type can make	sense when a  repetition  is  involved
       and the subpattern to the right has participated in an earlier
       iteration.

       It is not possible to have a numerical "forward back  reference"	 to  a
       subpattern  whose  number  is  10  or  more using this syntax because a
       sequence	such as	\50 is interpreted as a	character  defined  in	octal.
       See the subsection entitled "Non-printing characters" above for further
       details of the handling of digits following a backslash.	 There	is  no
       such  problem  when named parentheses are used. A back reference	to any
       subpattern is possible using named parentheses (see below).

       Another way of avoiding the ambiguity inherent in  the  use  of	digits
       following  a  backslash	is  to use the \g escape sequence. This	escape
       must be followed	by an unsigned number or a negative number, optionally
       enclosed	in braces. These examples are all identical:

	 * (ring), \1

	 * (ring), \g1

	 * (ring), \g{1}

       An  unsigned number specifies an	absolute reference without the ambigu-
       ity that	is present in the older	syntax.	It is also useful when literal
       digits follow the reference. A negative number is a relative reference.
       Consider this example:

       (abc(def)ghi)\g{-1}

       The sequence \g{-1} is a reference to the most recently started
       capturing subpattern before \g, that is, it is equivalent to \2 in this
       example. Similarly, \g{-2} would be equivalent to \1. The use of relative
       references  can	be helpful in long patterns, and also in patterns that
       are created by  joining	together  fragments  that  contain  references
       within themselves.

       A  back	reference matches whatever actually matched the	capturing sub-
       pattern in the current subject string, rather  than  anything  matching
       the subpattern itself (see "Subpatterns as subroutines" below for a way
       of doing	that). So the pattern

       (sens|respons)e and \1ibility

       matches "sense and sensibility" and "response and responsibility",  but
       not  "sense and responsibility".	If caseful matching is in force	at the
       time of the back reference, the case of letters is relevant. For
       example,

       ((?i)rah)\s+\1

       matches	"rah  rah"  and	 "RAH RAH", but	not "RAH rah", even though the
       original	capturing subpattern is	matched	caselessly.
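
       The behaviour of numbered and relative back references can be checked
       from Erlang. This is a sketch under stated assumptions: the module name
       re_backref_demo and the helper matches/2 are hypothetical, and every
       backslash of the pattern is doubled because it is written as an Erlang
       string literal.

```erlang
-module(re_backref_demo).
-export([matches/2]).

%% True if Pattern matches somewhere in Subject.
matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.
```

       With Pattern = "(sens|respons)e and \\1ibility", the subject
       "sense and sensibility" matches and "sense and responsibility" does
       not. With "((?i)rah)\\s+\\1", "RAH RAH" matches but "RAH rah" does not,
       because the back reference itself is matched casefully. With
       "(abc(def)ghi)\\g{-1}", "abcdefghidef" matches, since \g{-1} refers to
       (def).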

       There are several different ways	of writing back	 references  to	 named
       subpatterns.  The  .NET syntax \k{name} and the Perl syntax \k<name> or
       \k'name'	are supported, as is the Python	syntax (?P=name). Perl	5.10's
       unified back reference syntax, in which \g can be used for both numeric
       and named references, is	also supported.	We  could  rewrite  the	 above
       example in any of the following ways:

	 * (?<p1>(?i)rah)\s+\k<p1>

	 * (?'p1'(?i)rah)\s+\k{p1}

	 * (?P<p1>(?i)rah)\s+(?P=p1)

	 * (?<p1>(?i)rah)\s+\g{p1}

       A  subpattern  that  is	referenced  by	name may appear	in the pattern
       before or after the reference.

       There may be more than one back reference to the	same subpattern. If  a
       subpattern  has	not actually been used in a particular match, any back
       references to it always fail. For example, the pattern

       (a|(bc))\2

       always fails if it starts to match "a" rather than "bc".	Because	 there
       may  be	many  capturing	parentheses in a pattern, all digits following
       the backslash are taken as part of a potential back  reference  number.
       If the pattern continues	with a digit character,	some delimiter must be
       used to terminate the back reference. If	the extended  option  is  set,
       this  can  be  whitespace.  Otherwise  an empty comment (see "Comments"
       below) can be used.

       Recursive back references

       A back reference	that occurs inside the parentheses to which it	refers
       fails  when  the	subpattern is first used, so, for example, (a\1) never
       matches. However, such references can be useful inside repeated
       subpatterns. For example, the pattern

       (a|b\1)+

       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
       ation of	the subpattern,	 the  back  reference  matches	the  character
       string  corresponding  to  the previous iteration. In order for this to
       work, the pattern must be such that the first iteration does  not  need
       to  match the back reference. This can be done using alternation, as in
       the example above, or by	a quantifier with a minimum of zero.

       Back references of this type cause the group that they reference	to  be
       treated	as  an	atomic group. Once the whole group has been matched, a
       subsequent matching failure cannot cause	backtracking into  the	middle
       of the group.

       An  assertion  is  a  test on the characters following or preceding the
       current matching	point that does	not actually consume  any  characters.
       The  simple  assertions	coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are
       described above.

       More complicated	assertions are coded as	 subpatterns.  There  are  two
       kinds:  those  that  look  ahead	of the current position	in the subject
       string, and those that look  behind  it.	 An  assertion	subpattern  is
       matched	in  the	 normal	way, except that it does not cause the current
       matching	position to be changed.

       Assertion subpatterns are not capturing subpatterns. If such an	asser-
       tion  contains  capturing  subpatterns within it, these are counted for
       the purposes of numbering the capturing subpatterns in the  whole  pat-
       tern.  However,	substring  capturing  is carried out only for positive
       assertions. (Perl sometimes, but	not always, does do capturing in nega-
       tive assertions.)

       For  compatibility  with	 Perl,	assertion subpatterns may be repeated;
       though it makes no sense	to assert the same thing  several  times,  the
       side  effect  of	 capturing  parentheses	may occasionally be useful. In
       practice, there are only three cases:

         * If the quantifier is {0}, the assertion is never obeyed during
           matching. However, it may contain internal capturing parenthesized
           groups that are called from elsewhere via the subroutine mechanism.

         * If the quantifier is {0,n} where n is greater than zero, it is
           treated as if it were {0,1}. At run time, the rest of the pattern
           match is tried with and without the assertion, the order depending
           on the greediness of the quantifier.

         * If the minimum repetition is greater than zero, the quantifier is
           ignored. The assertion is obeyed just once when encountered during
           matching.

       Lookahead assertions

       Lookahead assertions start with (?= for positive	assertions and (?! for
       negative assertions. For example,

       \w+(?=;)

       matches a word followed by a semicolon, but does not include the
       semicolon in the match, and

       foo(?!bar)

       matches any occurrence of "foo" that is not  followed  by  "bar".  Note
       that the apparently similar pattern

       (?!foo)bar

       does  not  find	an  occurrence	of "bar" that is preceded by something
       other than "foo"; it finds any occurrence of "bar" whatsoever,  because
       the assertion (?!foo) is	always true when the next three	characters are
       "bar". A	lookbehind assertion is	needed to achieve the other effect.

       If you want to force a matching failure at some point in	a pattern, the
       most  convenient	 way  to  do  it  is with (?!) because an empty	string
       always matches, so an assertion that requires there not to be an	 empty
       string  must always fail. The backtracking control verb (*FAIL) or (*F)
       is a synonym for	(?!).
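
       Because lookahead assertions consume no characters, their effect is
       easiest to see in the offsets that re:run/3 reports. A minimal sketch,
       in which the module name re_lookahead_demo and the helper first_index/2
       are hypothetical:

```erlang
-module(re_lookahead_demo).
-export([first_index/2]).

%% Return {Start, Length} of the first match of Pattern in Subject
%% (Start is a zero-based offset), or nomatch.
first_index(Subject, Pattern) ->
    case re:run(Subject, Pattern, [{capture, first, index}]) of
        {match, [{Start, Len}]} -> {Start, Len};
        nomatch -> nomatch
    end.
```

       first_index("foo; bar", "\\w+(?=;)") returns {0,3}, so the semicolon is
       not part of the match. first_index("foobar foobaz", "foo(?!bar)")
       returns {7,3}, the second "foo". first_index("foobar", "(?!foo)bar")
       returns {3,3}, showing that this pattern finds "bar" even though it is
       preceded by "foo".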

       Lookbehind assertions

       Lookbehind assertions start with	(?<= for positive assertions and  (?<!
       for negative assertions. For example,

       (?<!foo)bar

       does  find  an  occurrence  of "bar" that is not	preceded by "foo". The
       contents	of a lookbehind	assertion are restricted  such	that  all  the
       strings it matches must have a fixed length. However, if	there are sev-
       eral top-level alternatives, they do not	all  have  to  have  the  same
       fixed length. Thus

       (?<=bullock|donkey)

       is permitted, but

       (?<!dogs?|cats?)

       causes  an  error at compile time. Branches that	match different	length
       strings are permitted only at the top level of a	lookbehind  assertion.
       This is an extension compared with Perl,	which requires all branches to
       match the same length of string. An assertion such as

       (?<=ab(c|de))

       is not permitted, because its single top-level  branch  can  match  two
       different lengths, but it is acceptable to PCRE if rewritten to use two
       top-level branches:

       (?<=abc|abde)

       In some cases, the escape sequence \K (see above) can be	 used  instead
       of a lookbehind assertion to get	round the fixed-length restriction.

       The  implementation  of lookbehind assertions is, for each alternative,
       to temporarily move the current position	back by	the fixed  length  and
       then try	to match. If there are insufficient characters before the cur-
       rent position, the assertion fails.

       In a UTF	mode, PCRE does	not allow the \C escape	(which matches a  sin-
       gle  data  unit even in a UTF mode) to appear in	lookbehind assertions,
       because it makes	it impossible to calculate the length of  the  lookbe-
       hind.  The \X and \R escapes, which can match different numbers of data
       units, are also not permitted.

       "Subroutine" calls (see below) such as (?2) or (?&X) are	 permitted  in
       lookbehinds,  as	 long as the subpattern	matches	a fixed-length string.
       Recursion, however, is not supported.

       Possessive quantifiers can  be  used  in	 conjunction  with  lookbehind
       assertions to specify efficient matching	of fixed-length	strings	at the
       end of subject strings. Consider a simple pattern such as

       abcd$

       when applied to a long string that does	not  match.  Because  matching
       proceeds	from left to right, PCRE will look for each "a"	in the subject
       and then	see if what follows matches the	rest of	the  pattern.  If  the
       pattern is specified as

       ^.*abcd$

       the  initial .* matches the entire string at first, but when this fails
       (because	there is no following "a"), it backtracks to match all but the
       last  character,	 then all but the last two characters, and so on. Once
       again the search	for "a"	covers the entire string, from right to	 left,
       so we are no better off. However, if the pattern is written as

       ^.*+(?<=abcd)

       there  can  be  no backtracking for the .*+ item; it can	match only the
       entire string. The subsequent lookbehind	assertion does a  single  test
       on  the last four characters. If	it fails, the match fails immediately.
       For long	strings, this approach makes a significant difference  to  the
       processing time.
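
       The possessive form can be tried from the Erlang shell as sketched
       here. The subjects and variable names are illustrative only, and the
       example shows the matching result rather than the timing difference.

```erlang
%% Illustrative names and subjects.
PqOpts = [{capture, first, list}].
PqPattern = "^.*+(?<=abcd)".

%% .*+ swallows the whole subject, then one lookbehind test checks the end.
PqHit = re:run("xxxxabcd", PqPattern, PqOpts).
%% -> {match,["xxxxabcd"]}

%% The last four characters are "bcde", so the single test fails and
%% there is nothing to backtrack into.
PqMiss = re:run("xxxxabcde", PqPattern, PqOpts).
%% -> nomatch
```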

       Using multiple assertions

       Several assertions (of any sort) may occur in succession. For example,

       (?<=\d{3})(?<!999)foo

       matches	"foo" preceded by three	digits that are	not "999". Notice that
       each of the assertions is applied independently at the  same  point  in
       the  subject  string.  First  there  is a check that the	previous three
       characters are all digits, and then there is  a	check  that  the  same
       three characters are not "999". This pattern does not match "foo"
       preceded by six characters, the first three of which are digits and
       the last three of which are not "999". For example, it doesn't match
       "123abcfoo". A pattern to do that is

       (?<=\d{3}...)(?<!999)foo

       This time the first assertion looks at the  preceding  six  characters,
       checking	that the first three are digits, and then the second assertion
       checks that the preceding three characters are not "999".

       Assertions can be nested in any combination. For example,

       (?<=(?<!foo)bar)baz

       matches an occurrence of	"baz" that is preceded by "bar"	which in  turn
       is not preceded by "foo", while

       (?<=\d{3}(?!999)...)foo

       is  another pattern that	matches	"foo" preceded by three	digits and any
       three characters	that are not "999".
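
       The following shell expressions are a small sketch of successive and
       nested assertions. The patterns are written as Erlang string literals
       (hence the doubled backslashes); the subjects and variable names are
       assumptions chosen for illustration.

```erlang
%% Illustrative names and subjects.
MaOpts = [{capture, first, list}].
ThreeDigits = "(?<=\\d{3})(?<!999)foo".
SixChars    = "(?<=\\d{3}...)(?<!999)foo".
Nested      = "(?<=(?<!foo)bar)baz".

%% Both assertions look at the same three preceding characters.
Ma1 = re:run("123foo", ThreeDigits, MaOpts).     %% -> {match,["foo"]}
Ma2 = re:run("999foo", ThreeDigits, MaOpts).     %% -> nomatch
Ma3 = re:run("123abcfoo", ThreeDigits, MaOpts).  %% -> nomatch

%% Here the first assertion spans six characters, the second only three.
Ma4 = re:run("123abcfoo", SixChars, MaOpts).     %% -> {match,["foo"]}

%% "baz" preceded by "bar" that is not itself preceded by "foo".
Ma5 = re:run("xbarbaz", Nested, MaOpts).         %% -> {match,["baz"]}
Ma6 = re:run("foobarbaz", Nested, MaOpts).       %% -> nomatch
```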

       It is possible to cause the matching process to obey a subpattern  con-
       ditionally  or to choose	between	two alternative	subpatterns, depending
       on the result of	an assertion, or whether a specific capturing  subpat-
       tern  has  already  been	matched. The two possible forms	of conditional
       subpattern are:

	 * (?(condition)yes-pattern)

	 * (?(condition)yes-pattern|no-pattern)

       If the condition	is satisfied, the yes-pattern is used;	otherwise  the
       no-pattern  (if	present)  is used. If there are	more than two alterna-
       tives in	the subpattern,	a compile-time error occurs. Each of  the  two
       alternatives may	itself contain nested subpatterns of any form, includ-
       ing  conditional	 subpatterns;  the  restriction	 to  two  alternatives
       applies only at the level of the	condition. This	pattern	fragment is an
       example where the alternatives are complex:

       (?(1) (A|B|C) | (D | (?(2)E|F) |	E) )

       There are four kinds of condition: references  to  subpatterns,	refer-
       ences to	recursion, a pseudo-condition called DEFINE, and assertions.

       Checking	for a used subpattern by number

       If  the	text between the parentheses consists of a sequence of digits,
       the condition is	true if	a capturing subpattern of that number has pre-
       viously	matched.  If  there is more than one capturing subpattern with
       the same	number (see the	earlier	 section  about	 duplicate  subpattern
       numbers),  the condition	is true	if any of them have matched. An	alter-
       native notation is to precede the digits	with a plus or minus sign.  In
       this  case, the subpattern number is relative rather than absolute. The
       most recently opened parentheses	can be referenced by (?(-1), the  next
       most  recent  by	(?(-2),	and so on. Inside loops	it can also make sense
       to refer	to subsequent groups. The next parentheses to be opened	can be
       referenced  as (?(+1), and so on. (The value zero in any	of these forms
       is not used; it provokes	a compile-time error.)

       Consider	the following pattern, which contains  non-significant	white-
       space  to  make	it  more  readable (assume the extended	option)	and to
       divide it into three parts for ease of discussion:

       ( \( )? [^()]+ (?(1) \) )

       The first part matches an optional opening  parenthesis,	 and  if  that
       character is present, sets it as	the first captured substring. The sec-
       ond part	matches	one or more characters that are	not  parentheses.  The
       third part is a conditional subpattern that tests whether the first
       set of parentheses matched. If they did, that is, if the subject
       started with an opening parenthesis, the condition is true, and so
       the yes-pattern is executed and a closing parenthesis is	required. Oth-
       erwise,	since  no-pattern is not present, the subpattern matches noth-
       ing. In other words, this pattern matches a sequence  of	 non-parenthe-
       ses, optionally enclosed	in parentheses.
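
       A hedged sketch of this conditional pattern in the Erlang shell
       follows. The extended option makes the whitespace in the pattern
       insignificant; the subjects and variable names are illustrative.

```erlang
%% Illustrative names and subjects.
CondOpts = [extended, {capture, first, list}].
CondPattern = "( \\( )? [^()]+ (?(1) \\) )".

%% Group 1 matched "(", so the closing parenthesis is required.
Cond1 = re:run("(abc)", CondPattern, CondOpts).  %% -> {match,["(abc)"]}

%% Group 1 did not participate, so the conditional matches nothing.
Cond2 = re:run("abc", CondPattern, CondOpts).    %% -> {match,["abc"]}

%% The pattern is unanchored: the attempt at "(" fails for lack of ")",
%% and the next attempt, starting at "a", succeeds without group 1.
Cond3 = re:run("(abc", CondPattern, CondOpts).   %% -> {match,["abc"]}
```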

       If  you	were  embedding	 this pattern in a larger one, you could use a
       relative	reference:

       ...other	stuff... ( \( )? [^()]+	(?(-1) \) ) ...

       This makes the fragment independent of the parentheses in the larger
       pattern.

       Checking	for a used subpattern by name

       Perl  uses  the	syntax	(?(<name>)...) or (?('name')...) to test for a
       used subpattern by name.	For compatibility  with	 earlier  versions  of
       PCRE,  which  had this facility before Perl, the	syntax (?(name)...) is
       also recognized.	However, there is a possible ambiguity with this  syn-
       tax,  because  subpattern  names	 may  consist entirely of digits. PCRE
       looks first for a named subpattern; if it cannot	find one and the  name
       consists	 entirely  of digits, PCRE looks for a subpattern of that num-
       ber, which must be greater than zero. Using subpattern names that  con-
       sist entirely of	digits is not recommended.

       Rewriting the above example to use a named subpattern gives this:

       (?<OPEN>	\( )? [^()]+ (?(<OPEN>)	\) )

       If  the	name used in a condition of this kind is a duplicate, the test
       is applied to all subpatterns of	the same name, and is true if any  one
       of them has matched.

       Checking	for pattern recursion

       If the condition	is the string (R), and there is	no subpattern with the
       name R, the condition is	true if	a recursive call to the	whole  pattern
       or any subpattern has been made.	If digits or a name preceded by	amper-
       sand follow the letter R, for example:

       (?(R3)...) or (?(R&name)...)

       the condition is	true if	the most recent	recursion is into a subpattern
       whose number or name is given. This condition does not check the	entire
       recursion stack.	If the name used in a condition	 of  this  kind	 is  a
       duplicate, the test is applied to all subpatterns of the	same name, and
       is true if any one of them is the most recent recursion.

       At "top level", all these recursion test	conditions are false. The syn-
       tax for recursive patterns is described below.

       Defining	subpatterns for	use by reference only

       If  the	condition  is  the string (DEFINE), and	there is no subpattern
       with the	name DEFINE, the condition is  always  false.  In  this	 case,
       there  may  be  only  one  alternative  in the subpattern. It is	always
       skipped if control reaches this point  in  the  pattern;	 the  idea  of
       DEFINE  is that it can be used to define	"subroutines" that can be ref-
       erenced from elsewhere. (The use	of subroutines	is  described  below.)
       For example, a pattern to match an IPv4 address could be written like
       this (ignore whitespace and line breaks):

       (?(DEFINE)  (?<byte>  2[0-4]\d  |  25[0-5]  |  1\d\d  |	[1-9]?\d) ) \b
       (?&byte)	(\.(?&byte)){3}	\b

       The first part of the pattern is a DEFINE group inside which another
       group  named "byte" is defined. This matches an individual component of
       an IPv4 address (a number less than 256). When  matching	 takes	place,
       this  part  of  the pattern is skipped because DEFINE acts like a false
       condition. The rest of the pattern uses references to the  named	 group
       to  match the four dot-separated	components of an IPv4 address, insist-
       ing on a	word boundary at each end.
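
       The DEFINE pattern above can be exercised from the Erlang shell as
       sketched below. The subjects and variable names are assumptions made
       for illustration; the pattern is split over two adjacent string
       literals, which Erlang concatenates.

```erlang
%% Illustrative names and subjects.
DefOpts = [extended, {capture, first, list}].
DefPattern = "(?(DEFINE) (?<byte> 2[0-4]\\d | 25[0-5] | 1\\d\\d | [1-9]?\\d) ) "
             "\\b (?&byte) (\\.(?&byte)){3} \\b".

%% The DEFINE group is skipped; (?&byte) is called four times.
Def1 = re:run("host 10.0.0.1 up", DefPattern, DefOpts).
%% -> {match,["10.0.0.1"]}

%% 256 is not a valid component, and the word boundaries prevent a match
%% from starting in the middle of the number.
Def2 = re:run("256.1.1.1", DefPattern, DefOpts).
%% -> nomatch
```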

       Assertion conditions

       If the condition	is not in any of the above  formats,  it  must	be  an
       assertion.  This	 may be	a positive or negative lookahead or lookbehind
       assertion. Consider  this  pattern,  again  containing  non-significant
       whitespace, and with the	two alternatives on the	second line:

       (?(?=[^a-z]*[a-z])
       \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

       The  condition  is  a  positive	lookahead  assertion  that  matches an
       optional	sequence of non-letters	followed by a letter. In other	words,
       it  tests  for the presence of at least one letter in the subject. If a
       letter is found,	the subject is matched against the first  alternative;
       otherwise  it  is  matched  against  the	 second.  This pattern matches
       strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
       letters and dd are digits.
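
       As an illustration, the assertion condition can be tried in the
       Erlang shell. This sketch writes the pattern without whitespace, so
       the extended option is not needed; the subjects and variable names
       are made up.

```erlang
%% Illustrative names and subjects.
AcOpts = [{capture, first, list}].
AcPattern = "(?(?=[^a-z]*[a-z])\\d{2}-[a-z]{3}-\\d{2}|\\d{2}-\\d{2}-\\d{2})".

%% A letter lies ahead, so the first alternative is used.
Ac1 = re:run("12-abc-34", AcPattern, AcOpts).  %% -> {match,["12-abc-34"]}

%% No letter lies ahead, so the second alternative is used.
Ac2 = re:run("12-34-56", AcPattern, AcOpts).   %% -> {match,["12-34-56"]}

%% The trailing "x" makes the condition true at every digit position, so
%% only the first alternative is tried there, and it cannot match.
Ac3 = re:run("12-34-56x", AcPattern, AcOpts).  %% -> nomatch
```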

       There are two ways of including comments	in patterns that are processed
       by PCRE.	In both	cases, the start of the	comment	must not be in a char-
       acter class, nor	in the middle of any other sequence of related charac-
       ters such as (?:	or a subpattern	name or	number.	 The  characters  that
       make up a comment play no part in the pattern matching.

       The  sequence (?# marks the start of a comment that continues up	to the
       next closing parenthesis. Nested	parentheses are	not permitted. If  the
       extended option is set, an unescaped # character also introduces a
       comment,	which in this case continues to	 immediately  after  the  next
       newline	character  or character	sequence in the	pattern. Which charac-
       ters are	interpreted as newlines	is controlled by the options passed to
       a  compiling function or	by a special sequence at the start of the pat-
       tern, as	described in the section entitled "Newline conventions"	above.
       Note that the end of this type of comment is a literal newline sequence
       in the pattern; escape sequences	that happen to represent a newline  do
       not count. For example, consider	this pattern when extended is set, and
       the default newline convention is in force:

       abc #comment \n still comment

       On encountering the # character,	pcre_compile()	skips  along,  looking
       for  a newline in the pattern. The sequence \n is still literal at this
       stage, so it does not terminate the comment. Only an  actual  character
       with the	code value 0x0a	(the default newline) does so.
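
       The difference between an escape sequence and a literal newline
       inside a # comment can be sketched in the Erlang shell as follows.
       The subject and variable names are illustrative. In the Erlang
       string literal, "\\n" produces a backslash followed by the letter n
       in the pattern, while "\n" produces a real newline character.

```erlang
%% Illustrative names and subject.
CmOpts = [extended, {capture, first, list}].

%% The \n escape (written "\\n") is still part of the comment; only the
%% real newline (written "\n") ends it, so the pattern reduces to abcdef.
CmPattern = "abc #comment \\n still comment\ndef".
Cm1 = re:run("abcdef", CmPattern, CmOpts).
%% -> {match,["abcdef"]}

%% The (?# ... ) form works without the extended option.
Cm2 = re:run("abcdef", "abc(?# inline comment)def", [{capture, first, list}]).
%% -> {match,["abcdef"]}
```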

       Consider	 the problem of	matching a string in parentheses, allowing for
       unlimited nested	parentheses. Without the use of	 recursion,  the  best
       that  can  be  done  is	to use a pattern that matches up to some fixed
       depth of nesting. It is not possible to handle an arbitrary nesting
       depth.

       For some	time, Perl has provided	a facility that	allows regular expres-
       sions to	recurse	(amongst other things).	It does	this by	 interpolating
       Perl  code in the expression at run time, and the code can refer	to the
       expression itself. A Perl pattern using code interpolation to solve the
       parentheses problem can be created like this:

       $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

       The (?p{...}) item interpolates Perl code at run	time, and in this case
       refers recursively to the pattern in which it appears.

       Obviously, PCRE cannot support the interpolation	of Perl	code. Instead,
       it  supports  special  syntax  for recursion of the entire pattern, and
       also for	individual subpattern recursion.  After	 its  introduction  in
       PCRE  and  Python,  this	 kind of recursion was subsequently introduced
       into Perl at release 5.10.

       A special item that consists of (? followed by a	 number	 greater  than
       zero  and  a  closing parenthesis is a recursive	subroutine call	of the
       subpattern of the given number, provided	that  it  occurs  inside  that
       subpattern.  (If	 not,  it is a non-recursive subroutine	call, which is
       described in the	next section.) The special item	 (?R)  or  (?0)	 is  a
       recursive call of the entire regular expression.

       This  PCRE  pattern  solves  the	nested parentheses problem (assume the
       extended	option is set so that whitespace is ignored):

       \( ( [^()]++ | (?R) )* \)

       First it	matches	an opening parenthesis.	Then it	matches	any number  of
       substrings  which  can  either  be  a sequence of non-parentheses, or a
       recursive match of the pattern itself (that is, a  correctly  parenthe-
       sized  substring). Finally there	is a closing parenthesis. Note the use
       of a possessive quantifier to avoid backtracking into sequences of
       non-parentheses.

       If  this	 were  part of a larger	pattern, you would not want to recurse
       the entire pattern, so instead you could	use this:

       ( \( ( [^()]++ |	(?1) )*	\) )

       We have put the pattern into parentheses, and caused the	 recursion  to
       refer to	them instead of	the whole pattern.
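
       A short sketch of the whole-pattern recursion in the Erlang shell is
       given here, with illustrative subjects and variable names. The
       extended option is set so that the whitespace in the pattern is
       ignored.

```erlang
%% Illustrative names and subjects.
RecOpts = [extended, {capture, first, list}].
RecPattern = "\\( ( [^()]++ | (?R) )* \\)".

%% The inner "(b)" is matched by the recursive call (?R).
Rec1 = re:run("x (a(b)c) y", RecPattern, RecOpts).
%% -> {match,["(a(b)c)"]}

%% Unbalanced parentheses: no starting position yields a match.
Rec2 = re:run("(a(b", RecPattern, RecOpts).
%% -> nomatch
```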

       In  a  larger  pattern,	keeping	 track	of  parenthesis	numbers	can be
       tricky. This is made easier by the use of relative references.  Instead
       of (?1) in the pattern above you	can write (?-2)	to refer to the	second
       most recently opened parentheses	 preceding  the	 recursion.  In	 other
       words,  a  negative  number counts capturing parentheses	leftwards from
       the point at which it is	encountered.

       It is also possible to refer to	subsequently  opened  parentheses,  by
       writing	references  such  as (?+2). However, these cannot be recursive
       because the reference is	not inside the	parentheses  that  are	refer-
       enced.  They are	always non-recursive subroutine	calls, as described in
       the next	section.

       An alternative approach is to use named parentheses instead.  The  Perl
       syntax  for  this  is (?&name); PCRE's earlier syntax (?P>name) is also
       supported. We could rewrite the above example as	follows:

       (?<pn> \( ( [^()]++ | (?&pn) )* \) )

       If there	is more	than one subpattern with the same name,	 the  earliest
       one is used.

       This  particular	 example pattern that we have been looking at contains
       nested unlimited	repeats, and so	the use	of a possessive	quantifier for
       matching	strings	of non-parentheses is important	when applying the pat-
       tern to strings that do not match. For example, when  this  pattern  is
       applied to


       it  yields  "no	match" quickly.	However, if a possessive quantifier is
       not used, the match runs	for a very long	time indeed because there  are
       so  many	 different  ways the + and * repeats can carve up the subject,
       and all have to be tested before	failure	can be reported.

       At the end of a match, the values of capturing  parentheses  are	 those
       from the outermost level. If the pattern above is matched against

       (ab(cd)ef)

       the  value  for	the  inner capturing parentheses (numbered 2) is "ef",
       which is	the last value taken on	at the top level. If a capturing  sub-
       pattern	is  not	 matched at the	top level, its final captured value is
       unset, even if it was (temporarily) set at a deeper  level  during  the
       matching	process.
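
       The captured values can be inspected from the Erlang shell, as in
       this sketch. It uses the numbered form of the pattern shown earlier
       and asks for all captures; the variable names are illustrative.

```erlang
%% Illustrative names; the pattern is the numbered-recursion form.
CapPattern = "( \\( ( [^()]++ | (?1) )* \\) )".
Cap1 = re:run("(ab(cd)ef)", CapPattern, [extended, {capture, all, list}]).
%% Whole match, group 1, and group 2 as last set at the outermost level:
%% -> {match,["(ab(cd)ef)","(ab(cd)ef)","ef"]}
```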

       Do  not	confuse	 the (?R) item with the	condition (R), which tests for
       recursion. Consider this	pattern, which matches text in angle brackets,
       allowing	 for  arbitrary	 nesting.  Only	 digits	 are allowed in	nested
       brackets	(that is, when recursing), whereas any characters are  permit-
       ted at the outer	level.

       < (?: (?(R) \d++	| [^<>]*+) | (?R)) * >

       In  this	 pattern, (?(R)	is the start of	a conditional subpattern, with
       two different alternatives for the recursive and	 non-recursive	cases.
       The (?R)	item is	the actual recursive call.
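
       The following shell expressions sketch how the (R) condition
       restricts what may appear inside nested brackets. The subjects and
       variable names are assumptions chosen for illustration.

```erlang
%% Illustrative names and subjects.
AngOpts = [extended, {capture, first, list}].
AngPattern = "< (?: (?(R) \\d++ | [^<>]*+) | (?R)) * >".

%% Digits inside the nested brackets are accepted.
Ang1 = re:run("<abc<123>def>", AngPattern, AngOpts).
%% -> {match,["<abc<123>def>"]}

%% Letters are not allowed when recursing, so the attempt at the first "<"
%% fails; the unanchored search then matches "<def>" on its own, where it
%% is at the outer level.
Ang2 = re:run("<abc<def>>", AngPattern, AngOpts).
%% -> {match,["<def>"]}
```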

       Differences in recursion	processing between PCRE	and Perl

       Recursion  processing  in PCRE differs from Perl	in two important ways.
       In PCRE (like Python, but unlike	Perl), a recursive subpattern call  is
       always treated as an atomic group. That is, once	it has matched some of
       the subject string, it is never re-entered, even	if it contains untried
       alternatives  and  there	 is a subsequent matching failure. This	can be
       illustrated by the following pattern, which purports to match a
       palindromic string that contains an odd number of characters (for
       example, "a", "aba", "abcba", "abcdcba"):

       ^(.|(.)(?1)\2)$

       The idea	is that	it either matches a single character, or two identical
       characters  surrounding	a sub-palindrome. In Perl, this	pattern	works;
       in PCRE it does not if the subject is longer than three characters.
       Consider	the subject string "abcba":

       At  the	top level, the first character is matched, but as it is	not at
       the end of the string, the first	alternative fails; the second alterna-
       tive is taken and the recursion kicks in. The recursive call to subpat-
       tern 1 successfully matches the next character ("b").  (Note  that  the
       beginning and end of line tests are not part of the recursion).

       Back  at	 the top level,	the next character ("c") is compared with what
       subpattern 2 matched, which was "a". This fails.	Because	the  recursion
       is  treated  as	an atomic group, there are now no backtracking points,
       and so the entire match fails. (Perl is able, at	 this  point,  to  re-
       enter  the  recursion  and try the second alternative.) However,	if the
       pattern is written with the alternatives in the other order, things are
       different:

       ^((.)(?1)\2|.)$

       This  time,  the	recursing alternative is tried first, and continues to
       recurse until it	runs out of characters,	at which point	the  recursion
       fails.  But  this  time	we  do	have another alternative to try	at the
       higher level. That is the big difference:  in  the  previous  case  the
       remaining alternative is at a deeper recursion level, which PCRE cannot
       use.
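
       As a small sketch, the reordered form (recursing alternative first)
       can be checked in the Erlang shell. The variable names and subjects
       are illustrative. Results for other subjects may depend on the PCRE
       version that the Erlang release is built on, so only these two cases
       are shown.

```erlang
%% Illustrative names and subjects.
PalOpts = [{capture, first, list}].
PalPattern = "^((.)(?1)\\2|.)$".

%% Odd-length palindrome from the text.
Pal1 = re:run("abcba", PalPattern, PalOpts).  %% -> {match,["abcba"]}

%% Not a palindrome.
Pal2 = re:run("abcda", PalPattern, PalOpts).  %% -> nomatch
```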

       To change the pattern so	that it	matches	all palindromic	 strings,  not
       just  those  with an odd	number of characters, it is tempting to	change
       the pattern to this:

       ^((.)(?1)\2|.?)$

       Again, this works in Perl, but not in PCRE, and for  the	 same  reason.
       When  a	deeper	recursion has matched a	single character, it cannot be
       entered again in	order to match an empty	string.	 The  solution	is  to
       separate the two cases, and write out the odd and even cases as
       alternatives at the higher level:

       ^(?:((.)(?1)\2|)|((.)(?3)\4|.))

       If you want to match typical palindromic	phrases, the  pattern  has  to
       ignore all non-word characters, which can be done like this:

       ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$

       If  run	with the caseless option, this pattern matches phrases such as
       "A man, a plan, a canal:	Panama!" and it	works well in  both  PCRE  and
       Perl.  Note the use of the possessive quantifier	*+ to avoid backtrack-
       ing into	sequences of non-word characters. Without this,	PCRE  takes  a
       great  deal  longer  (ten  times	or more) to match typical phrases, and
       Perl takes so long that you think it has	gone into a loop.

       WARNING:	The palindrome-matching	patterns above work only if  the  sub-
       ject  string  does not start with a palindrome that is shorter than the
       entire string. For example, although "abcba" is correctly  matched,  if
       the  subject  is	"ababa", PCRE finds the	palindrome "aba" at the	start,
       then fails at top level because the end of the string does not  follow.
       Once  again, it cannot jump back	into the recursion to try other	alter-
       natives,	so the entire match fails.

       The second way in which PCRE and Perl differ in their recursion
       processing is in the handling of captured values. In Perl, when a
       subpattern is called recursively or as a subroutine (see the next
       section), it has no access to any values that were captured outside
       the recursion, whereas in PCRE these values can be referenced.
       Consider this pattern:

       ^(.)(\1|a(?2))

       In  PCRE,  this	pattern	matches	"bab". The first capturing parentheses
       match "b", then in the second group, when the back reference  \1	 fails
       to  match "b", the second alternative matches "a" and then recurses. In
       the recursion, \1 does now match	"b" and	so the whole  match  succeeds.
       In  Perl,  the pattern fails to match because inside the	recursive call
       \1 cannot access	the externally set value.
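
       A minimal shell sketch of this behaviour follows; the variable names
       are illustrative. All captures are requested so that the value set
       outside the recursion is visible.

```erlang
%% Illustrative names.
BrOpts = [{capture, all, list}].
Br1 = re:run("bab", "^(.)(\\1|a(?2))", BrOpts).
%% Inside the call (?2), \1 still refers to the "b" captured outside it:
%% -> {match,["bab","b","ab"]}
```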

       If the syntax for a recursive subpattern	call (either by	number	or  by
       name)  is  used outside the parentheses to which	it refers, it operates
       like a subroutine in a programming language. The	called subpattern  may
       be  defined  before or after the	reference. A numbered reference	can be
       absolute	or relative, as	in these examples:

	 * (...(absolute)...)...(?2)...

	 * (...(relative)...)...(?-1)...

	 * (...(?+1)...(relative)...

       An earlier example pointed out that the pattern

       (sens|respons)e and \1ibility

       matches "sense and sensibility" and "response and responsibility",  but
       not "sense and responsibility". If instead the pattern

       (sens|respons)e and (?1)ibility

       is  used, it does match "sense and responsibility" as well as the other
       two strings. Another example is given in the discussion of DEFINE
       above.
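
       The difference between the back reference and the subroutine call
       can be sketched in the Erlang shell like this (the variable names
       are illustrative):

```erlang
%% Illustrative names.
SubOpts = [{capture, first, list}].
SubSubject = "sense and responsibility".

%% \1 must repeat the text that group 1 matched ("sens").
Sub1 = re:run(SubSubject, "(sens|respons)e and \\1ibility", SubOpts).
%% -> nomatch

%% (?1) re-runs the subpattern itself, so "respons" is acceptable.
Sub2 = re:run(SubSubject, "(sens|respons)e and (?1)ibility", SubOpts).
%% -> {match,["sense and responsibility"]}
```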

       All  subroutine	calls, whether recursive or not, are always treated as
       atomic groups. That is, once a subroutine has matched some of the  sub-
       ject string, it is never	re-entered, even if it contains	untried	alter-
       natives and there is  a	subsequent  matching  failure.	Any  capturing
       parentheses  that  are  set  during the subroutine call revert to their
       previous	values afterwards.

       Processing options such as case-independence are	fixed when  a  subpat-
       tern  is	defined, so if it is used as a subroutine, such	options	cannot
       be changed for different calls. For example, consider this pattern:

       (abc)(?i:(?-1))

       It matches "abcabc". It does not	match "abcABC" because the  change  of
       processing option does not affect the called subpattern.
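
       This can be checked with a short shell sketch. The pattern is the
       relative subroutine call wrapped in a case-insensitive group; the
       variable names are illustrative.

```erlang
%% Illustrative names.
OptOpts = [{capture, first, list}].
OptPattern = "(abc)(?i:(?-1))".

Opt1 = re:run("abcabc", OptPattern, OptOpts).  %% -> {match,["abcabc"]}

%% The called subpattern keeps the case-sensitive setting it was
%% defined with, so "ABC" is rejected.
Opt2 = re:run("abcABC", OptPattern, OptOpts).  %% -> nomatch
```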

       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
       name or a number	enclosed either	in angle brackets or single quotes, is
       an  alternative	syntax	for  referencing a subpattern as a subroutine,
       possibly	recursively. Here are two of the examples used above,  rewrit-
       ten using this syntax:

       (?<pn> \( ( (?>[^()]+) |	\g<pn> )* \) )

       (sens|respons)e and \g'1'ibility

       PCRE  supports  an extension to Oniguruma: if a number is preceded by a
       plus or a minus sign it is taken as a relative reference. For example:

       (abc)(?i:\g<-1>)

       Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not
       synonymous. The former is a back reference; the latter is a subroutine
       call.

       Perl 5.10 introduced a number of	"Special Backtracking Control  Verbs",
       which  are  still  described in the Perl	documentation as "experimental
       and subject to change or	removal	in a future version of Perl". It  goes
       on  to  say:  "Their  usage in production code should be	noted to avoid
       problems	during upgrades." The same remarks apply to the	PCRE  features
       described in this section.

       The  new	verbs make use of what was previously invalid syntax: an open-
       ing parenthesis followed	by an asterisk.	They are generally of the form
       (*VERB)	or  (*VERB:NAME). Some may take	either form, possibly behaving
       differently depending on	whether	or not a name is present.  A  name  is
       any sequence of characters that does not	include	a closing parenthesis.
       The maximum length of name is 255 in the	8-bit library and 65535	in the
       16-bit  and  32-bit  libraries.	If  the	name is	empty, that is,	if the
       closing parenthesis immediately follows the colon, the effect is	as  if
       the colon were not there. Any number of these verbs may occur in a
       pattern.

       The behaviour of	these verbs in repeated	 groups,  assertions,  and  in
       subpatterns called as subroutines (whether or not recursively) is docu-
       mented below.

       Optimizations that affect backtracking verbs

       PCRE contains some optimizations	that are used to speed up matching  by
       running some checks at the start	of each	match attempt. For example, it
       may know	the minimum length of matching subject,	or that	 a  particular
       character must be present. When one of these optimizations bypasses the
       running of a match,  any	 included  backtracking	 verbs	will  not,  of
       course, be processed. You can suppress the start-of-match optimizations
       by setting the no_start_optimize	option when  calling  re:compile/2  or
       re:run/3, or by starting	the pattern with (*NO_START_OPT).

       Experiments  with  Perl	suggest	that it	too has	similar	optimizations,
       sometimes leading to anomalous results.

       Verbs that act immediately

       The following verbs act as soon as they are encountered.	They  may  not
       be followed by a name.

       (*ACCEPT)

       This  verb causes the match to end successfully,	skipping the remainder
       of the pattern. However,	when it	is inside a subpattern that is	called
       as  a  subroutine, only that subpattern is ended	successfully. Matching
       then continues at the outer level. If (*ACCEPT) is triggered in a
       positive assertion, the assertion succeeds; in a negative assertion,
       the assertion fails.

       If (*ACCEPT) is inside capturing parentheses, the data so far is
       captured. For example:

       A((?:A|B(*ACCEPT)|C)D)

       This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
       tured by	the outer parentheses.
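
       As a sketch, the capture behaviour of (*ACCEPT) can be observed in
       the Erlang shell by requesting all captures. The variable names are
       illustrative.

```erlang
%% Illustrative names.
AccOpts = [{capture, all, list}].
AccPattern = "A((?:A|B(*ACCEPT)|C)D)".

%% (*ACCEPT) ends the match after "B"; the outer group captures "B".
Acc1 = re:run("AB", AccPattern, AccOpts).   %% -> {match,["AB","B"]}

%% Without reaching (*ACCEPT), the trailing "D" is required.
Acc2 = re:run("ACD", AccPattern, AccOpts).  %% -> {match,["ACD","CD"]}
```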

       (*FAIL) or (*F)

       This verb causes	a matching failure, forcing backtracking to occur.  It
       is  equivalent to (?!) but easier to read. The Perl documentation notes
       that it is probably useful only when combined  with  (?{})  or  (??{}).
       Those  are,  of course, Perl features that are not present in PCRE. The
       nearest equivalent is the callout feature, as for example in this
       pattern:

       a+(?C)(*FAIL)

       A  match	 with the string "aaaa"	always fails, but the callout is taken
       before each backtrack happens (in this example, 10 times).

       Recording which path was	taken

       There is	one verb whose main purpose  is	 to  track  how	 a  match  was
       arrived	at,  though  it	 also  has a secondary use in conjunction with
       advancing the match starting point (see (*SKIP) below).

       In Erlang, there is no interface to retrieve a mark with re:run/{2,3},
       so only the secondary purpose is relevant to the Erlang programmer.

       The rest of this section is therefore deliberately not adapted for
       reading by the Erlang programmer; however, the examples might help in
       understanding NAMES as they can be used by (*SKIP).

       (*MARK:NAME) or (*:NAME)

       A  name	is  always  required  with  this  verb.	 There	may be as many
       instances of (*MARK) as you like	in a pattern, and their	names  do  not
       have to be unique.

       When  a	match succeeds,	the name of the	last-encountered (*MARK:NAME),
       (*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed  back  to
       the  caller  as	described  in  the  section  entitled  "Extra data for
       pcre_exec()" in the  pcreapi  documentation.  Here  is  an  example  of
       pcretest	 output, where the /K modifier requests	the retrieval and out-
       putting of (*MARK) data:

           re> /X(*MARK:A)Y|X(*MARK:B)Z/K
         data> XY
          0: XY
         MK: A
         data> XZ
          0: XZ
         MK: B

       The (*MARK) name	is tagged with "MK:" in	this output, and in this exam-
       ple  it indicates which of the two alternatives matched.	This is	a more
       efficient way of	obtaining this information than	putting	each  alterna-
       tive in its own capturing parentheses.

       If  a  verb  with a name	is encountered in a positive assertion that is
       true, the name is recorded and passed back if it is the
       last-encountered. This does not happen for negative assertions or
       failing positive assertions.

       After a partial match or	a failed match,	the last encountered  name  in
       the entire match	process	is returned. For example:

	   re> /X(*MARK:A)Y|X(*MARK:B)Z/K
	 data> XP
	 No match, mark	= B

       Note  that  in  this  unanchored	 example the mark is retained from the
       match attempt that started at the letter	"X" in the subject. Subsequent
       match attempts starting at "P" and then with an empty string do not get
       as far as the (*MARK) item, but nevertheless do not reset it.

       Verbs that act after backtracking

       The following verbs do nothing when they	are encountered. Matching con-
       tinues  with what follows, but if there is no subsequent	match, causing
       a backtrack to the verb,	a failure is  forced.  That  is,  backtracking
       cannot  pass  to	the left of the	verb. However, when one	of these verbs
       appears inside an atomic	group or an assertion that is true, its	effect
       is  confined  to	 that  group, because once the group has been matched,
       there is	never any backtracking into it.	In this	situation,  backtrack-
       ing  can	 "jump	back" to the left of the entire	atomic group or	asser-
       tion. (Remember also, as	stated	above,	that  this  localization  also
       applies in subroutine calls.)

       These  verbs  differ  in	exactly	what kind of failure occurs when back-
       tracking	reaches	them. The behaviour described below  is	 what  happens
       when the verb is not in a subroutine or an assertion. Subsequent
       sections cover these special cases.

       (*COMMIT)

       This verb, which	may not	be followed by a name, causes the whole	 match
       to fail outright	if there is a later matching failure that causes back-
       tracking	to reach it. Even if the pattern  is  unanchored,  no  further
       attempts	to find	a match	by advancing the starting point	take place. If
       (*COMMIT) is the	only backtracking verb that is	encountered,  once  it
       has  been  passed  re:run/{2,3}	is committed to	finding	a match	at the
       current starting point, or not at all. For example:

       a+(*COMMIT)b

       This matches "xxaab" but	not "aacaab". It can be	thought	of as  a  kind
       of dynamic anchor, or "I've started, so I must finish." The name	of the
       most recently passed (*MARK) in the path	is passed back when  (*COMMIT)
       forces a	match failure.
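
       The effect of (*COMMIT) can be sketched in the Erlang shell by
       comparing the pattern with and without the verb. The variable names
       are illustrative; the subjects are the ones discussed in the text.

```erlang
%% Illustrative names.
CoOpts = [{capture, first, list}].

%% Attempts at "x" fail before (*COMMIT) is reached, so the search goes on.
Co1 = re:run("xxaab", "a+(*COMMIT)b", CoOpts).   %% -> {match,["aab"]}

%% The attempt at the first "a" passes (*COMMIT) and then fails on "c",
%% so no later starting point is tried.
Co2 = re:run("aacaab", "a+(*COMMIT)b", CoOpts).  %% -> nomatch

%% Without the verb, the later "aab" is found.
Co3 = re:run("aacaab", "a+b", CoOpts).           %% -> {match,["aab"]}
```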

       If  there  is more than one backtracking	verb in	a pattern, a different
       one that	follows	(*COMMIT) may be triggered first,  so  merely  passing
       (*COMMIT) during	a match	does not always	guarantee that a match must be
       at this starting	point.

       Note that (*COMMIT) at the start	of a pattern is	not  the  same	as  an
       anchor,	unless	PCRE's start-of-match optimizations are	turned off, as
       shown in	this example:

         1> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list}]).
         {match,["abc"]}
         2> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list},no_start_optimize]).
         nomatch

       PCRE knows that any match must start  with  "a",	 so  the  optimization
       skips  along the	subject	to "a" before running the first	match attempt,
       which succeeds. When the	optimization is	disabled by the	no_start_opti-
       mize  option, the match starts at "x" and so the	(*COMMIT) causes it to
       fail without trying any other starting points.

       (*PRUNE)	or (*PRUNE:NAME)

       This verb causes	the match to fail at the current starting position  in
       the subject if there is a later matching	failure	that causes backtrack-
       ing to reach it.	If the pattern is unanchored, the  normal  "bumpalong"
       advance	to  the	next starting character	then happens. Backtracking can
       occur as	usual to the left of (*PRUNE), before it is reached,  or  when
       matching	 to  the  right	 of  (*PRUNE), but if there is no match	to the
       right, backtracking cannot cross	(*PRUNE). In simple cases, the use  of
       (*PRUNE)	 is just an alternative	to an atomic group or possessive quan-
       tifier, but there are some uses of (*PRUNE) that	cannot be expressed in
       any other way. In an anchored pattern (*PRUNE) has the same effect as
       (*COMMIT).
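
       The following sketch contrasts (*PRUNE) with (*COMMIT) on one subject.
       The subject and patterns are chosen for illustration only and the
       variable names are arbitrary.

```erlang
%% Illustration only; variable names are arbitrary.
Opts = [{capture, first, list}].

%% The attempts starting at the first and second "a" both pass (*PRUNE) and
%% then fail on "c". Each failure abandons only that starting position, so
%% the attempt starting at the third "a" can succeed.
Pruned = re:run("aacaab", "a+(*PRUNE)b", Opts).
%% Pruned is {match,["aab"]}

%% (*COMMIT) gives up after the first failed attempt.
Committed = re:run("aacaab", "a+(*COMMIT)b", Opts).
%% Committed is nomatch

%% With the anchored option there is no bumpalong, so (*PRUNE) ends up with
%% the same result as (*COMMIT).
AnchoredPrune = re:run("aacaab", "a+(*PRUNE)b", [anchored | Opts]).
%% AnchoredPrune is nomatch
```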

       The behaviour of (*PRUNE:NAME) is not the same as
       (*MARK:NAME)(*PRUNE). It is like (*MARK:NAME) in that the name is
       remembered for passing back to the caller. However, (*SKIP:NAME)
       searches only for names set with (*MARK).

       The fact that (*PRUNE:NAME) remembers the name is useless to the Erlang
       programmer, as names cannot be retrieved.

       (*SKIP)

       This verb, when given without a name, is like (*PRUNE), except that if
       the  pattern  is	unanchored, the	"bumpalong" advance is not to the next
       character, but to the position in the subject where (*SKIP) was encoun-
       tered.  (*SKIP)	signifies that whatever	text was matched leading up to
       it cannot be part of a successful match. Consider:

       a+(*SKIP)b

       If the subject is "aaaac...", after the first match attempt fails
       (starting  at  the  first  character in the string), the	starting point
       skips on to start the next attempt at "c". Note that a possessive
       quantifier does not have the same effect as this example; although it
       would suppress backtracking during the first match attempt, the second
       attempt would start at the second character instead of skipping on to
       "c".

       (*SKIP:NAME)

       When (*SKIP) has	an associated name, its	behaviour is modified. When it
       is triggered, the previous path through the pattern is searched for the
       most recent (*MARK) that	has the	 same  name.  If  one  is  found,  the
       "bumpalong" advance is to the subject position that corresponds to that
       (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with
       a matching name is found, the (*SKIP) is	ignored.

       Note  that (*SKIP:NAME) searches	only for names set by (*MARK:NAME). It
       ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME).
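
       The difference between the bumpalong rules can be made visible with a
       pattern that has a second alternative which only matches inside the
       skipped text. The subject and patterns below are assumptions chosen for
       this sketch; they are not taken from the PCRE documentation.

```erlang
%% Illustration only; variable names are arbitrary.
Opts = [{capture, first, list}].

%% (*PRUNE): the attempt at offset 0 fails after "aaa", the bumpalong moves
%% one character at a time, and "ab" is found at offset 2.
WithPrune = re:run("aaab", "aaa(*PRUNE)c|ab", Opts).
%% WithPrune is {match,["ab"]}

%% (*SKIP): the next attempt starts at offset 3, where (*SKIP) was
%% encountered, so the "ab" at offset 2 is skipped over.
WithSkip = re:run("aaab", "aaa(*SKIP)c|ab", Opts).
%% WithSkip is nomatch

%% (*SKIP:m): the next attempt starts at offset 1, the position recorded by
%% (*MARK:m), so the "ab" at offset 2 is found again.
WithNamedSkip = re:run("aaab", "a(*MARK:m)aa(*SKIP:m)c|ab", Opts).
%% WithNamedSkip is {match,["ab"]}
```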

       (*THEN) or (*THEN:NAME)

       This verb causes	a skip to the next innermost  alternative  when	 back-
       tracking	 reaches  it.  That  is,  it  cancels any further backtracking
       within the current alternative. Its name	 comes	from  the  observation
       that it can be used for a pattern-based if-then-else block:

       ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...

       If  the COND1 pattern matches, FOO is tried (and	possibly further items
       after the end of	the group if FOO succeeds); on	failure,  the  matcher
       skips  to  the second alternative and tries COND2, without backtracking
       into COND1. If that succeeds and	BAR fails, COND3 is tried.  If	subse-
       quently	BAZ fails, there are no	more alternatives, so there is a back-
       track to	whatever came before the  entire  group.  If  (*THEN)  is  not
       inside an alternation, it acts like (*PRUNE).
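
       A minimal sketch of both effects of (*THEN), the jump to the next
       alternative and the cancelled backtracking within the current one. The
       subjects and patterns are illustrative assumptions.

```erlang
%% Illustration only; variable names are arbitrary.
Opts = [{capture, first, list}].

%% First alternative: "a+" takes "aa", (*THEN) is passed, and "ab" fails on
%% "c". Backtracking reaches (*THEN), so the second alternative is tried
%% and matches.
NextAlternative = re:run("aac", "a+(*THEN)ab|aac", Opts).
%% NextAlternative is {match,["aac"]}

%% Without (*THEN), "a+" gives back one "a" and the first alternative
%% matches.
Plain = re:run("aab", "a+ab|aac", Opts).
%% Plain is {match,["aab"]}

%% With (*THEN), "a+" may not give back an "a", and the second alternative
%% does not match at any starting position either.
NoBacktrack = re:run("aab", "a+(*THEN)ab|aac", Opts).
%% NoBacktrack is nomatch
```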

       The behaviour of (*THEN:NAME) is not the same as
       (*MARK:NAME)(*THEN). It is like (*MARK:NAME) in that the name is
       remembered for passing back to the caller. However, (*SKIP:NAME)
       searches only for names set with (*MARK).

       The fact that (*THEN:NAME) remembers the name is useless to the Erlang
       programmer, as names cannot be retrieved.

       A  subpattern that does not contain a | character is just a part	of the
       enclosing alternative; it is not	a nested  alternation  with  only  one
       alternative.  The effect	of (*THEN) extends beyond such a subpattern to
       the enclosing alternative. Consider this	pattern, where A, B, etc.  are
       complex pattern fragments that do not contain any | characters at this
       level:

       A (B(*THEN)C) | D

       If A and	B are matched, but there is a failure in C, matching does  not
       backtrack into A; instead it moves to the next alternative, that	is, D.
       However,	if the subpattern containing (*THEN) is	given an  alternative,
       it behaves differently:

       A (B(*THEN)C | (*FAIL)) | D

       The  effect of (*THEN) is now confined to the inner subpattern. After a
       failure in C, matching moves to (*FAIL),	which causes the whole subpat-
       tern  to	 fail  because	there are no more alternatives to try. In this
       case, matching does now backtrack into A.
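
       To make the two cases concrete, the sketch below substitutes simple
       fragments for the placeholders: A is a+?, B is a, C is b and D is z.
       These substitutions are assumptions made for illustration only.

```erlang
%% Illustration only; variable names are arbitrary.
Opts = [{capture, first, list}].

%% The group holding (*THEN) has no alternative of its own. After "b" fails
%% in the attempt at offset 0, matching moves straight to "z" without
%% letting "a+?" take another "a". A match is found only from offset 1.
NoInnerAlternative = re:run("aaab", "a+?(a(*THEN)b)|z", Opts).
%% NoInnerAlternative is {match,["aab"]}

%% With "|(*FAIL)" added, (*THEN) is confined to the group. The group fails
%% as a whole, "a+?" is extended, and the match starts at offset 0.
InnerAlternative = re:run("aaab", "a+?(a(*THEN)b|(*FAIL))|z", Opts).
%% InnerAlternative is {match,["aaab"]}
```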

       Note that a conditional subpattern is  not  considered  as  having  two
       alternatives,  because  only  one  is  ever used. In other words, the |
       character in a conditional subpattern has a different meaning. Ignoring
       white space, consider:

       ^.*? (?(?=a) a |	b(*THEN)c )

       If  the	subject	 is  "ba", this	pattern	does not match.	Because	.*? is
       ungreedy, it initially matches zero  characters.	 The  condition	 (?=a)
       then  fails,  the  character  "b"  is  matched, but "c" is not. At this
       point, matching does not	backtrack to .*? as might perhaps be  expected
       from  the  presence  of	the | character. The conditional subpattern is
       part of the single alternative that comprises the whole pattern,	and so
       the  match  fails.  (If	there was a backtrack into .*?,	allowing it to
       match "b", the match would succeed.)
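
       The conditional example can be tried as follows. The pattern is written
       without white space so that the extended option is not needed; the
       second call, without (*THEN), is an added comparison and not part of
       the original example.

```erlang
%% Illustration only; variable names are arbitrary.
Opts = [{capture, first, list}].

%% "b" matches, "c" does not, and (*THEN) prevents a backtrack into ".*?".
WithThen = re:run("ba", "^.*?(?(?=a)a|b(*THEN)c)", Opts).
%% WithThen is nomatch

%% Without (*THEN), ".*?" is extended to "b", the condition (?=a) then
%% holds, and "a" matches.
WithoutThen = re:run("ba", "^.*?(?(?=a)a|bc)", Opts).
%% WithoutThen is {match,["ba"]}
```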

       The verbs just described	provide	four different "strengths" of  control
       when subsequent matching	fails. (*THEN) is the weakest, carrying	on the
       match at	the next alternative. (*PRUNE) comes next, failing  the	 match
       at  the	current	starting position, but allowing	an advance to the next
       character (for an unanchored pattern). (*SKIP) is similar, except  that
       the advance may be more than one	character. (*COMMIT) is	the strongest,
       causing the entire match	to fail.

       More than one backtracking verb

       If more than one backtracking verb is present in a pattern, the one
       that is backtracked onto first acts. For example, consider this
       pattern, where A, B, etc. are complex pattern fragments:

       (A(*COMMIT)B(*THEN)C|ABD)

       If A matches but B fails, the backtrack to (*COMMIT) causes the entire
       match to	fail. However, if A and	B match, but C fails, the backtrack to
       (*THEN) causes the next alternative (ABD) to be tried.  This  behaviour
       is consistent, but is not always the same as Perl's. It means that if
       two or more backtracking verbs appear in succession, all but the last
       of them have no effect. Consider this example:

       ...(*COMMIT)(*PRUNE)...

       If there is a matching failure to the right, backtracking onto (*PRUNE)
       causes it to be triggered, and its action is taken. There can never be a
       backtrack onto (*COMMIT).
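
       A small sketch of these rules, using single letters in place of the
       fragments A, B, C and D. The subjects are assumptions chosen so that
       each verb is the first one backtracked onto in turn.

```erlang
%% Illustration only; variable names are arbitrary.
Opts = [{capture, first, list}].

%% "a" and "b" match and "c" fails: the backtrack reaches (*THEN) first, so
%% the second alternative is tried.
ThenActs = re:run("abd", "(a(*COMMIT)b(*THEN)c|abd)", Opts).
%% ThenActs is {match,["abd"]}

%% "a" matches and "b" fails: the backtrack reaches (*COMMIT), so the "abd"
%% further along the subject is never tried.
CommitActs = re:run("axabd", "(a(*COMMIT)b(*THEN)c|abd)", Opts).
%% CommitActs is nomatch

%% Verbs in succession: only (*PRUNE) can be backtracked onto, so the
%% pattern behaves as if (*COMMIT) were absent.
PruneActs = re:run("aacaab", "a+(*COMMIT)(*PRUNE)b", Opts).
%% PruneActs is {match,["aab"]}
```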

       Backtracking verbs in repeated groups

       PCRE differs from Perl in its handling of backtracking verbs in
       repeated groups. For example, consider:

       (a(*COMMIT)b)+ac

       If the subject is "abac", Perl matches, but PCRE fails because the
       (*COMMIT) in the	second repeat of the group acts.
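
       A hedged sketch of the repeated-group case with the subject "abac". The
       second call, which drops the verb, is an added comparison.

```erlang
%% Illustration only; variable names are arbitrary.
Opts = [{capture, first, list}].

%% The group matches "ab" once. Its second iteration matches "a", passes
%% (*COMMIT) and fails on "c", which ends the whole match.
Repeated = re:run("abac", "(a(*COMMIT)b)+ac", Opts).
%% Repeated is nomatch

%% Without the verb, the second iteration is simply abandoned and "ac"
%% matches after the first "ab".
PlainRepeat = re:run("abac", "(ab)+ac", Opts).
%% PlainRepeat is {match,["abac"]}
```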

       Backtracking verbs in assertions

       (*FAIL) in an assertion has its normal effect: it forces an immediate
       backtrack.

       (*ACCEPT) in a positive assertion causes	the assertion to succeed with-
       out  any	 further processing. In	a negative assertion, (*ACCEPT)	causes
       the assertion to	fail without any further processing.
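
       The effect of (*ACCEPT) in assertions can be sketched as below. The
       subject and patterns are assumptions made for illustration, and the
       calls without (*ACCEPT) are included only for comparison.

```erlang
%% Illustration only; variable names are arbitrary.
Opts = [{capture, first, list}].

%% Positive assertion: "b" matches and (*ACCEPT) ends the assertion
%% successfully, so the "z" that follows in the pattern is never compared.
Positive = re:run("abxz", "a(?=b(*ACCEPT)z)", Opts).
%% Positive is {match,["a"]}

%% Without (*ACCEPT) the assertion needs "bz" after "a" and fails.
PositivePlain = re:run("abxz", "a(?=bz)", Opts).
%% PositivePlain is nomatch

%% Negative assertion: (*ACCEPT) makes the assertion fail at once.
Negative = re:run("abxz", "a(?!b(*ACCEPT)z)", Opts).
%% Negative is nomatch

%% Without (*ACCEPT), "bz" is not found after "a", so the negative
%% assertion holds.
NegativePlain = re:run("abxz", "a(?!bz)", Opts).
%% NegativePlain is {match,["a"]}
```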

       The other backtracking verbs are	not treated specially if  they	appear
       in  a  positive	assertion.  In	particular,  (*THEN) skips to the next
       alternative in the innermost enclosing  group  that  has	 alternations,
       whether or not this is within the assertion.

       Negative	 assertions  are,  however, different, in order	to ensure that
       changing	a positive assertion into a  negative  assertion  changes  its
       result. Backtracking into (*COMMIT), (*SKIP), or	(*PRUNE) causes	a neg-
       ative assertion to be true, without considering any further alternative
       branches	 in the	assertion. Backtracking	into (*THEN) causes it to skip
       to the next enclosing alternative within	the assertion (the normal  be-
       haviour),  but  if  the	assertion  does	 not have such an alternative,
       (*THEN) behaves like (*PRUNE).

       Backtracking verbs in subroutines

       These behaviours	occur whether or not the subpattern is	called	recur-
       sively. Perl's treatment	of subroutines is different in some cases.

       (*FAIL)	in  a subpattern called	as a subroutine	has its	normal effect:
       it forces an immediate backtrack.

       (*ACCEPT) in a subpattern called	as a subroutine	causes the  subroutine
       match  to succeed without any further processing. Matching then contin-
       ues after the subroutine	call.

       (*COMMIT), (*SKIP), and (*PRUNE)	in a subpattern	called as a subroutine
       cause the subroutine match to fail.

       (*THEN)	skips to the next alternative in the innermost enclosing group
       within the subpattern that has alternatives. If there is	no such	 group
       within the subpattern, (*THEN) causes the subroutine match to fail.
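
       As a sketch of how a verb is localized by a subroutine call, the calls
       below use the same fragment once inline and once through (?1). The
       subject, the patterns and the use of a (?(DEFINE)...) group are
       assumptions made for this illustration.

```erlang
%% Illustration only; variable names are arbitrary.
Opts = [{capture, first, list}].

%% Inline: "a" matches, (*COMMIT) is passed, "b" fails on "c", and the
%% backtrack onto (*COMMIT) fails the whole match.
Inline = re:run("acb", "(?:a(*COMMIT)b|ac)", Opts).
%% Inline is nomatch

%% Called as a subroutine: the same backtrack only fails the call (?1), so
%% the alternative "ac" is tried next and matches.
Called = re:run("acb", "(?(DEFINE)(a(*COMMIT)b))(?:(?1)|ac)", Opts).
%% Called is {match,["ac"]}
```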

Ericsson AB			  stdlib 2.4				 re(3)

