regex(5)	      Standards, Environments, and Macros	      regex(5)

NAME
       regex  -	internationalized basic	and extended regular expression	match-
       ing

DESCRIPTION
       Regular Expressions  (REs)  provide  a  mechanism  to  select  specific
       strings	from a set of character	strings. The Internationalized Regular
       Expressions described below differ from the Simple Regular  Expressions
       described on the	regexp(5) manual page in the following ways:

	  o  both Basic	and Extended Regular Expressions are supported

	  o  the  Internationalization	features--character class, equivalence
	     class, and	multi-character	collation--are supported.

       The Basic Regular Expression  (BRE)  notation  and  construction	 rules
       described   in  the   BASIC  REGULAR  EXPRESSIONS section apply to most
       utilities supporting regular expressions. Some utilities, instead, sup-
       port  the Extended Regular Expressions (ERE) described in the  EXTENDED
       REGULAR EXPRESSIONS section; any	exceptions for both cases are noted in
       the  descriptions  of the specific utilities using regular expressions.
       Both BREs and EREs are  supported by the	 Regular  Expression  Matching
       interfaces regcomp(3C) and  regexec(3C).
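
       As a rough illustration of how the two interfaces named above fit
       together, the following C sketch compiles a pattern with regcomp() and
       tests it with regexec(). The helper name re_search and its return
       convention are assumptions made for this example only; REG_EXTENDED
       selects ERE syntax and its absence selects BRE syntax.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of regcomp(3C) or regexec(3C)).
 * Returns 1 if `pattern` matches somewhere in `text`, 0 if it does not,
 * and -1 if the pattern fails to compile.
 * Pass REG_EXTENDED in `cflags` for an ERE, or 0 for a BRE.
 */
int re_search(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* REG_NOSUB: only a yes/no answer is wanted, not match offsets. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

       For instance, re_search("ab*c", "xxabbbc", 0) would be expected to
       return 1, while re_search("\\(a", "a", 0) would be expected to return
       -1 because the \( has no matching \).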

BASIC REGULAR EXPRESSIONS
   BREs	Matching a Single Character
       A  BRE ordinary character, a special character preceded by a backslash,
       or a period matches a single character. A bracket expression matches  a
       single character	or a single collating element.	See RE Bracket Expres-
       sion, below.

   BRE Ordinary	Characters
       An ordinary character is	a BRE that matches itself:  any	 character  in
       the  supported  character  set,	except	for the	BRE special characters
       listed in BRE Special Characters, below.

       The interpretation of an	ordinary character preceded by a backslash (\)
       is undefined, except for:

       1. the characters ), (, {, and }

       2. the  digits 1	to 9 inclusive (see BREs Matching Multiple Characters,
	  below)

       3. a character inside a bracket expression.

   BRE Special Characters
       A BRE special character has special  properties	in  certain  contexts.
       Outside those contexts, or when preceded	by a backslash,	such a charac-
       ter will	be a BRE that matches the special character  itself.  The  BRE
       special	characters  and	 the contexts in which they have their special
       meaning are:

       . [ \ The period, left-bracket, and backslash are special  except  when
	     used  in a	bracket	expression (see	RE Bracket Expression, below).
	     An	expression containing a	[ that is not preceded by a  backslash
	     and  is  not  part	 of  a	bracket	 expression produces undefined
	     results.

       *     The asterisk is special except when used:

		o  in a	bracket	expression

		o  as the first	character of an	entire BRE (after  an  initial
		   ^, if any)

		o  as the first	character of a subexpression (after an initial
		   ^, if any); see BREs	Matching Multiple Characters, below.

       ^     The circumflex is special when used:

		o  as an anchor	(see BRE Expression Anchoring, below).

		o  as the first	character of  a	 bracket  expression  (see  RE
		   Bracket Expression, below).

       $     The dollar	sign is	special	when used as an	anchor.

   Periods in BREs
       A  period  (.),	when  used outside a bracket expression, is a BRE that
       matches any character in	the supported character	set except NUL.

   RE Bracket Expression
       A bracket expression (an	expression enclosed in square brackets,	[]) is
       an  RE  that  matches  a	single collating element contained in the non-
       empty set of collating elements represented by the bracket  expression.

       The following rules and definitions apply to bracket expressions:

       1. A  bracket expression	is either a matching list expression or	a non-
	  matching list	expression. It consists	of one	or  more  expressions:
	  collating  elements, collating symbols, equivalence classes, charac-
	  ter classes, or range	 expressions  (see  rule  7  below).  Portable
	  applications	must not use range expressions,	even though all	imple-
	  mentations support them. The right-bracket  (])  loses  its  special
	  meaning  and	represents itself in a bracket expression if it	occurs
	  first	in the list (after an initial circumflex (^), if any).	Other-
	  wise,	 it  terminates	the bracket expression,	unless it appears in a
	  collating symbol (such as [.].]) or is the ending right-bracket  for
	  a  collating symbol, equivalence class, or character class. The spe-
	  cial characters:

	       .   *   [   \

	  (period, asterisk, left-bracket and  backslash,  respectively)  lose
	  their	special	meaning	within a bracket expression.

	  The character	sequences:

	       [.   [=	  [:

	  (left-bracket	 followed by a period, equals-sign, or colon) are spe-
	  cial inside a	bracket	expression and are used	to  delimit  collating
	  symbols,  equivalence	class expressions, and character class expres-
	  sions. These symbols must be followed	by a valid expression and  the
	  matching terminating sequence	.], =] or :], as described in the fol-
	  lowing items.

       2. A matching list expression specifies a list that matches any one  of
	  the  expressions represented in the list. The	first character	in the
          list must not be the circumflex. For example, [abc] is an RE that
          matches any of the characters a, b, or c.

       3. A  non-matching  list	 expression  begins with a circumflex (^), and
	  specifies a list that	matches	any  character	or  collating  element
	  except  for the expressions represented in the list after the	 lead-
	  ing circumflex.  For example,	[^abc] is an RE	that matches any char-
	  acter	 or  collating	element	except the characters  a, b, or	c. The
	  circumflex will have this special meaning only when it occurs	 first
	  in the list, immediately following the left-bracket.

       4. A  collating	symbol is a collating element enclosed within bracket-
	  period ([..])	delimiters. Multi-character collating elements must be
	  represented as collating symbols when	it is necessary	to distinguish
	  them from a list of the  individual  characters  that	 make  up  the
	  multi-character collating element. For example, if the string	 ch is
	  a collating element in the current collation sequence	with the asso-
	  ciated  collating  symbol  <ch>,  the	 expression  [[.ch.]]  will be
	  treated as an	RE matching the	character  sequence   ch,  while  [ch]
	  will	be  treated as an RE matching  c or  h.	Collating symbols will
	  be recognized	only inside bracket expressions. This implies that the
	  RE   [[.ch.]]*c  matches  the	first to fifth character in the	string
	  chchch. If the string	is not a collating element in the current col-
	  lating sequence definition, or if the	collating element has no char-
	  acters associated with it, the symbol	will be	treated	as an  invalid
	  expression.

       5. An equivalence class expression represents the set of collating
          elements belonging to an equivalence class. Only primary equivalence
          classes will be recognized. The class is expressed by enclosing any
          one of the collating elements in the equivalence class within
          bracket-equal ([==]) delimiters. For example, if a, à, and â belong
          to the same equivalence class, then [[=a=]b], [[=à=]b], and [[=â=]b]
          will each be equivalent to [aàâb]. If the collating element does not
          belong to an equivalence class, the equivalence class expression
          will be treated as a collating symbol.

       6. A  character	class  expression  represents  the  set	 of characters
	  belonging to a character class, as defined in	the  LC_CTYPE category
	  in  the  current locale. All character classes specified in the cur-
	  rent locale will be recognized.  A  character	 class	expression  is
	  expressed  as	 a  character class name enclosed within bracket-colon
	  ([::]) delimiters.

	  The following	character  class  expressions  are  supported  in  all
	  locales:

	  [:alnum:]	   [:cntrl:]	   [:lower:]	    [:space:]
	  [:alpha:]	   [:digit:]	   [:print:]	    [:upper:]
	  [:blank:]	   [:graph:]	   [:punct:]	    [:xdigit:]

	  In addition, character class expressions of the form:

			[:name:]

	  are  recognized  in  those  locales  where the name keyword has been
	  given	a charclass definition in the  LC_CTYPE	category.

       7. A range expression represents	the set	 of  collating	elements  that
	  fall	between	two elements in	the current collation sequence,	inclu-
	  sively. It is	expressed as the starting point	and the	 ending	 point
	  separated by a hyphen	(-).

	  Range	 expressions must not be used in portable applications because
	  their	behavior is dependent on the collating sequence.  Ranges  will
	  be  treated according	to the current collating sequence, and include
	  such characters that fall within the range based on  that  collating
	  sequence,  regardless	of character values. This, however, means that
          the interpretation will differ depending on collating sequence. If,
          for instance, one collating sequence defines ä as a variant of a,
          while another defines it as a letter following z, then the
          expression [ä-z] is valid in the first language and invalid in the
          second.

	  In the following, all	examples assume	the collation sequence	speci-
	  fied	for  the  POSIX	 locale,  unless another collation sequence is
	  specifically defined.

	  The starting range point and the ending range	point must be  a  col-
	  lating  element or collating symbol. An equivalence class expression
	  used as a starting or	ending point of	a  range  expression  produces
	  unspecified  results.	 An  equivalence  class	 can  be used portably
	  within a bracket expression, but only	outside	the range.  For	 exam-
	  ple,	the  unspecified  expression  [[=e=]-f]	 should	 be  given  as
	  [[=e=]e-f]. The ending range point must collate equal	to  or	higher
	  than	the  starting  range  point; otherwise,	the expression will be
	  treated as invalid. The order	used is	the order in which the collat-
	  ing elements are specified in	the current collation definition. One-
	  to-many mappings (see	 locale(5)) will not be	performed.  For	 exam-
	  ple,	assuming  that	the character eszet is placed in the collation
	  sequence after  r and	 s, but	before t, and  that  it	 maps  to  the
	  sequence  ss	for  collation	purposes,  then	 the  expression [r-s]
          matches only r and s, but the expression [s-t] matches s, eszet, or
          t.

	  The interpretation of	range expressions where	the ending range point
	  is also the starting range point of a	 subsequent  range  expression
	  (for instance	[a-m-o]) is undefined.

	  The  hyphen  character  will be treated as itself if it occurs first
	  (after an initial ^, if any) or last in the list, or	as  an	ending
	  range	 point	in  a  range  expression. As examples, the expressions
	  [-ac]	and [ac-] are equivalent and match any of the  characters   a,
	  c,  or  -; [^-ac] and	[^ac-] are equivalent and match	any characters
	  except a, c, or -; the expression [%--]  matches any of the  charac-
	  ters	between	% and -	inclusive; the expression [--@]	matches	any of
	  the characters between - and @ inclusive; and	the expression	[a--@]
	  is  invalid, because the letter  a follows the symbol	- in the POSIX
	  locale. To use a hyphen as the starting range	point, it must	either
	  come	first in the bracket expression	or be specified	as a collating
	  symbol, for  example:	 [][.-.]-0],  which  matches  either  a	 right
	  bracket or any character  or collating element that collates between
	  hyphen and 0,	inclusive.

	  If a bracket expression must specify both - and ],  the  ]  must  be
	  placed first (after the ^, if	any) and the - last within the bracket
	  expression.

       Note:  Latin-1 characters such as à or â are not printable in some
       locales, for example, the ja locale.
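
       The bracket expression rules above can be tried out with a short C
       sketch. The helper bracket_matches and the BX_ constant names are
       hypothetical, chosen for this example. The patterns deliberately avoid
       range expressions and equivalence classes, whose results depend on the
       collating sequence; a C program that never calls setlocale() runs in
       the POSIX locale.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Bracket expressions modelled on the rules above (names are illustrative). */
const char *const BX_MATCHING_LIST    = "[abc]";       /* a, b, or c           */
const char *const BX_NONMATCHING_LIST = "[^abc]";      /* anything but a, b, c */
const char *const BX_DIGIT_CLASS      = "[[:digit:]]"; /* character class      */
const char *const BX_BRACKET_HYPHEN   = "[]-]";        /* ] first, - last      */
const char *const BX_LITERAL_SPECIALS = "[*.]";        /* * and . are literal  */

/*
 * Hypothetical helper: returns 1 if the BRE `pattern` matches somewhere
 * in `text`, 0 if it does not, and -1 if the pattern does not compile.
 */
int bracket_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, REG_NOSUB) != 0)   /* no REG_EXTENDED: BRE */
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

       With these definitions, bracket_matches(BX_BRACKET_HYPHEN, "-") and
       bracket_matches(BX_BRACKET_HYPHEN, "]") would both be expected to
       return 1, since the right-bracket is placed first and the hyphen last.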

   BREs	Matching Multiple Characters
       The  following  rules  can  be used to construct	BREs matching multiple
       characters from BREs matching a single character:

       1. The concatenation of BREs matches the	concatenation of  the  strings
	  matched by each component of the BRE.

       2. A subexpression can be defined within a BRE by enclosing it between
          the character pairs \( and \). Such a subexpression matches
          whatever it would have matched without the \( and \), except that
          anchoring within subexpressions is optional behavior; see BRE
          Expression Anchoring, below. Subexpressions can be arbitrarily
          nested.

       3. The back-reference expression \n matches the same (possibly empty)
          string of characters as was matched by a subexpression enclosed
          between \( and \) preceding the \n. The character n must be a digit
          from 1 to 9 inclusive, specifying the nth subexpression (the one
          that begins with the nth \( and ends with the corresponding paired
          \)). The expression is invalid if fewer than n subexpressions
          precede the \n. For example, the expression ^\(.*\)\1$ matches a
          line consisting of two adjacent appearances of the same string, and
          the expression \(a\)*\1 fails to match a. The limit of nine
          back-references to subexpressions in the RE is based on the use of
          a single digit identifier. This does not imply that only nine
          subexpressions are allowed in REs. The following is a valid BRE
          with ten subexpressions:

	  \(\(\(ab\)*c\)*d\)\(ef\)*\(gh\)\{2\}\(ij\)*\(kl\)*\(mn\)*\(op\)*\(qr\)*

       4. When	a  BRE matching	a single character, a subexpression or a back-
	  reference  is	 followed  by  the  special  character	asterisk  (*),
	  together with	that asterisk it matches what zero or more consecutive
	  occurrences of the BRE would match. For example,  [ab]* and [ab][ab]
	  are equivalent when matching the string  ab.

       5. When	a BRE matching a single	character, a subexpression, or a back-
	  reference is followed	by an interval expression of the format	\{m\},
	  \{m,\} or \{m,n\}, together with that	interval expression it matches
	  what repeated	consecutive occurrences	of the BRE  would  match.  The
	  values  of m and n will be decimal integers in the range 0 <=	m <= n
	  <= {RE_DUP_MAX}, where m specifies the exact or  minimum  number  of
	  occurrences  and  n specifies	the maximum number of occurrences. The
	  expression \{m\} matches exactly m occurrences of the	preceding BRE,
	  \{m,\} matches at least m occurrences	and \{m,n\} matches any	number
	  of occurrences between m and n, inclusive.

	  For example, in the string  abababccccccd, the BRE c\{3\} is matched
	  by  characters seven to nine,	the BRE	\(ab\)\{4,\} is	not matched at
	  all and the BRE c\{1,3\}d is matched by characters ten to  thirteen.

       Multiple adjacent duplication symbols (* and intervals) produce
       undefined results.
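
       The character positions quoted in the interval and back-reference
       examples above can be checked with the match offsets that regexec()
       reports. The sketch below is one possible way to do so; the helper
       name bre_span is an assumption of this example, and it assumes a
       single-byte character set so that byte offsets equal character
       positions.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: finds the first match of the BRE `pattern` in `text`.
 * On success returns 1 and stores the 1-based positions of the first and
 * last matched characters (the numbering used in the text above).
 * Returns 0 if there is no match and -1 if the pattern does not compile.
 * An empty match yields *last == *first - 1.
 */
int bre_span(const char *pattern, const char *text, int *first, int *last)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(pattern != NULL && text != NULL && first != NULL && last != NULL);

    if (regcomp(&re, pattern, 0) != 0)     /* cflags 0 selects BRE syntax */
        return -1;

    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;

    *first = (int)m[0].rm_so + 1;   /* rm_so is the 0-based start offset  */
    *last  = (int)m[0].rm_eo;       /* rm_eo is 0-based, one past the end */
    return 1;
}
```

       Called with the BRE c\{1,3\}d and the string abababccccccd, bre_span
       would be expected to report positions ten to thirteen, in line with
       the example in rule 5.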

   BRE Precedence
       The order of precedence is as shown in the following table:

       +----------------------------------+-----------------------------+
       |               BRE Precedence (from high to low)                |
       +----------------------------------+-----------------------------+
       |collation-related bracket symbols | [= =]  [: :]  [. .]         |
       |escaped characters                | \<special character>        |
       |bracket expression                | [ ]                         |
       |subexpressions/back-references    | \( \) \n                    |
       |single-character-BRE duplication  | * \{m,n\}                   |
       |concatenation                     |                             |
       |anchoring                         | ^  $                        |
       +----------------------------------+-----------------------------+

   BRE Expression Anchoring
       A BRE can be limited to matching	strings	that begin or end a line; this
       is  called anchoring. The circumflex and	dollar sign special characters
       will be considered BRE anchors in the following contexts:

       1. A circumflex ( ^ ) is	an anchor when used as the first character  of
	  an  entire BRE. The implementation may treat circumflex as an	anchor
	  when used as the first character of a	subexpression. The  circumflex
	  will	anchor	the  expression	 to  the  beginning  of	a string; only
	  sequences starting at	the  first  character  of  a  string  will  be
	  matched  by  the  BRE.  For  example,	the BRE	^ab matches  ab	in the
	  string  abcdef, but fails to match in	the string  cdefab. A portable
	  BRE  must  escape a leading circumflex in a subexpression to match a
	  literal circumflex.

       2. A dollar sign	( $ ) is an anchor when	used as	the last character  of
	  an  entire  BRE.  The	 implementation	 may treat a dollar sign as an
	  anchor when used as the last character of a subexpression. The  dol-
	  lar  sign  will anchor the expression	to the end of the string being
	  matched; the dollar sign can be said to match	the end-of-string fol-
	  lowing the last character.

       3. A  BRE  anchored  by both ^ and $ matches only an entire string. For
	  example, the	BRE   ^abcdef$	matches	 strings  consisting  only  of
	  abcdef.

       4. ^ and	$ are not special in subexpressions.

       Note:   The  Solaris  implementation  does not support anchoring	in BRE
       subexpressions.
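
       A minimal C sketch of the top-level anchoring rules follows. The
       helper name bre_matches is hypothetical. Only anchors at the start or
       end of an entire BRE are exercised, because anchoring inside
       subexpressions is optional behavior and differs between
       implementations.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if the BRE `pattern` matches somewhere
 * in `text`, 0 if it does not, and -1 if the pattern does not compile.
 */
int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, REG_NOSUB) != 0)   /* BRE: no REG_EXTENDED */
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

       For example, bre_matches("^ab", "abcdef") would be expected to return
       1 and bre_matches("^ab", "cdefab") to return 0. A circumflex that is
       neither first in the BRE nor first in a bracket expression is outside
       its special contexts, so bre_matches("a^b", "a^b") would be expected
       to return 1.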

EXTENDED REGULAR EXPRESSIONS
       The rules specified for BREs apply to Extended Regular Expressions
       (EREs) with the following exceptions:

	  o  The  characters   |,  +,  and  ? have special meaning, as defined
	     below.

	  o  The { and } characters, when used as  the	duplication  operator,
	     are not preceded by backslashes.  The constructs \{ and \}	simply
	     match the characters { and	}, respectively.

	  o  The back reference	operator is not	supported.

	  o  Anchoring (^$) is supported in subexpressions.
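
       The syntax differences listed above can be observed by compiling the
       same pattern text both ways. In this sketch the helper name matches_as
       is an assumption; a nonzero third argument requests ERE syntax through
       REG_EXTENDED. Some implementations accept extensions beyond what is
       described here, so the results are what this page leads one to expect
       rather than a guarantee.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compiles `pattern` as an ERE when `extended` is
 * nonzero and as a BRE otherwise. Returns 1 if it matches somewhere in
 * `text`, 0 if it does not, and -1 if it does not compile.
 */
int matches_as(const char *pattern, const char *text, int extended)
{
    regex_t re;
    int cflags = REG_NOSUB | (extended ? REG_EXTENDED : 0);
    int rc;

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

       With this helper, the interval is written a\{2\} when the third
       argument is 0 and a{2} when it is 1, and both forms would be expected
       to match the string baab. The pattern ab+c would be expected to match
       abbc only as an ERE, since the plus-sign is an ordinary character in a
       BRE.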

   EREs	Matching a Single Character
       An ERE ordinary character, a special character preceded by a backslash,
       or  a period matches a single character.	A bracket expression matches a
       single character	or a single collating element. An  ERE matching	a sin-
       gle character enclosed in parentheses matches the same as the ERE with-
       out parentheses would have matched.

   ERE Ordinary	Characters
       An ordinary character is	an ERE that matches itself. An ordinary	 char-
       acter  is  any character	in the supported character set,	except for the
       ERE special characters listed in	 ERE Special  Characters  below.   The
       interpretation  of an ordinary character	preceded by a backslash	(\) is
       undefined.

   ERE Special Characters
       An ERE special character	has special properties	in  certain  contexts.
       Outside those contexts, or when preceded	by a backslash,	such a charac-
       ter is an ERE that matches the special character	itself.	 The  extended
       regular	expression  special  characters	and the	contexts in which they
       have their special meaning are:

       . [ \ (
	     The period, left-bracket,	backslash,  and	 left-parenthesis  are
	     special except when used in a bracket expression (see  RE Bracket
	     Expression, above). Outside a bracket expression,	a  left-paren-
	     thesis immediately	followed by a right-parenthesis	produces unde-
	     fined results.

       )     The right-parenthesis is special when matched  with  a  preceding
	     left-parenthesis, both outside a bracket expression.

       * + ? {
	     The  asterisk,  plus-sign,	question-mark, and left-brace are spe-
	     cial except when used in a	bracket	 expression  (see  RE  Bracket
	     Expression, above).   Any of the following	uses produce undefined
	     results:

		o  if these characters appear first in an ERE, or  immediately
		   following a vertical-line, circumflex or left-parenthesis

		o  if a	left-brace is not part of a valid interval expression.

       |     The vertical-line is  special  except  when  used	in  a  bracket
	     expression	 (see  RE  Bracket Expression, above). A vertical-line
	     appearing first or	last in	an ERE,	 or  immediately  following  a
	     vertical-line  or	a left-parenthesis, or immediately preceding a
	     right-parenthesis,	produces undefined results.

       ^     The circumflex is special when used:

		o  as an anchor	(see  ERE Expression Anchoring,	below).

		o  as the first	character of a	bracket	 expression  (see   RE
		   Bracket Expression, above).

       $     The dollar	sign is	special	when used as an	anchor.

   Periods in EREs
       A  period  (.),	when used outside a bracket expression,	is an ERE that
       matches any character in	the supported character	set except NUL.

   ERE Bracket Expression
       The rules for ERE Bracket Expressions are the same as for Basic Regular
       Expressions; see RE Bracket Expression, above.

   EREs	Matching Multiple Characters
       The  following  rules  will be used to construct	EREs matching multiple
       characters from EREs matching a single character:

       1. A  concatenation of EREs matches the concatenation of	the  character
	  sequences  matched by	each component of the ERE.  A concatenation of
	  EREs enclosed	in  parentheses	 matches  whatever  the	 concatenation
	  without the parentheses  matches.  For example, both the ERE	cd and
	  the ERE  (cd)	are matched  by	the third and fourth character of  the
	  string  abcdefabcdef.

       2. When an ERE matching a single character or an ERE enclosed in
          parentheses is followed by the special character plus-sign (+),
          together with that plus-sign it matches what one or more
          consecutive occurrences of the ERE would match. For example, the
          ERE b+(bc) matches the fourth to seventh characters in the string
          acabbbcde; [ab]+ and [ab][ab]* are equivalent.

       3. When an ERE matching a single	character or an	ERE enclosed in	paren-
	  theses  is  followed by the special character	asterisk (*), together
	  with that asterisk it	matches	what zero or more  consecutive	occur-
	  rences of the	ERE would match. For example, the ERE  b*c matches the
	  first	character in the string	cabbbcde, and the  ERE	 b*cd  matches
	  the  third to	seventh	characters  in the string  cabbbcdebbbbbbcdbc.
	  And,	[ab]* and  [ab][ab] are	equivalent  when matching  the	string
	  ab.

       4. When an ERE matching a single	character or an	ERE enclosed in	paren-
	  theses is followed  by  the  special	character  question-mark  (?),
	  together with	that question-mark it matches what zero	or one consec-
	  utive	occurrences of the ERE would match. For	example, the ERE   b?c
	  matches the second character in the string acabbbcde.

       5. When an ERE matching a single	character or an	ERE enclosed in	paren-
	  theses is followed by	an interval expression of the format {m}, {m,}
	  or  {m,n},  together	with  that interval expression it matches what
	  repeated consecutive occurrences of the ERE would match. The	values
	  of  m	 and  n	 will  be decimal integers in the range	0 <= m <= n <=
	  {RE_DUP_MAX},	where m	specifies  the	exact  or  minimum  number  of
	  occurrences  and  n specifies	the maximum number of occurrences. The
	  expression {m} matches exactly m occurrences of the  preceding  ERE,
	  {m,}	matches	at least m occurrences and {m,n} matches any number of
	  occurrences between m	and n, inclusive.

	  For example, in the string  abababccccccd the	ERE c{3} is matched by
	  characters  seven to nine and	the ERE	(ab){2,} is matched by charac-
	  ters one to six.

       Multiple adjacent duplication symbols (+, *, ? and intervals) produce
       undefined results.
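
       The worked examples in rules 2 to 5 above lend themselves to a small
       table-driven check. The sketch below records each ERE, its subject
       string, and the character positions the text gives, then compares them
       with what regexec() reports. The struct, table, and function names are
       all assumptions of this example, and a single-byte character set is
       assumed.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* One worked example: the ERE, the subject string, and the 1-based
 * positions of the first and last characters said to be matched. */
struct ere_example {
    const char *pattern;
    const char *text;
    int first;
    int last;
};

/* The examples quoted in rules 2 to 5 above. */
const struct ere_example ERE_EXAMPLES[] = {
    { "b+(bc)",   "acabbbcde",          4, 7 },
    { "b*c",      "cabbbcde",           1, 1 },
    { "b*cd",     "cabbbcdebbbbbbcdbc", 3, 7 },
    { "b?c",      "acabbbcde",          2, 2 },
    { "c{3}",     "abababccccccd",      7, 9 },
    { "(ab){2,}", "abababccccccd",      1, 6 },
};

/*
 * Hypothetical checker: returns the number of examples whose first match,
 * as reported by regexec(), differs from the positions in the table, or
 * whose pattern fails to compile or to match at all.
 */
int count_ere_mismatches(const struct ere_example *ex, size_t n)
{
    size_t i;
    int bad = 0;

    assert(ex != NULL);

    for (i = 0; i < n; i++) {
        regex_t re;
        regmatch_t m[1];

        if (regcomp(&re, ex[i].pattern, REG_EXTENDED) != 0) {
            bad++;
            continue;
        }
        if (regexec(&re, ex[i].text, 1, m, 0) != 0
            || (int)m[0].rm_so + 1 != ex[i].first   /* 0-based start   */
            || (int)m[0].rm_eo != ex[i].last)       /* one past the end */
            bad++;
        regfree(&re);
    }
    return bad;
}
```

       On an implementation that reports the leftmost, longest match, the
       call count_ere_mismatches(ERE_EXAMPLES, sizeof ERE_EXAMPLES / sizeof
       ERE_EXAMPLES[0]) would be expected to return 0.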

   ERE Alternation
       Two  EREs  separated by the special character vertical-line (|) match a
       string that is matched  by  either.  For	 example,  the	ERE  a((bc)|d)
       matches the string abc and the string ad. Single	characters, or expres-
       sions matching single characters, separated by  the  vertical  bar  and
       enclosed	 in  parentheses,  will	be treated as an ERE matching a	single
       character.

   ERE Precedence
       The order of precedence will be as shown	in the following table:

       +----------------------------------+-----------------------------+
       |               ERE Precedence (from high to low)                |
       +----------------------------------+-----------------------------+
       |collation-related bracket symbols | [= =]  [: :]  [. .]         |
       |escaped characters                | \<special character>        |
       |bracket expression                | [ ]                         |
       |grouping                          | ( )                         |
       |single-character-ERE duplication  | * + ? {m,n}                 |
       |concatenation                     |                             |
       |anchoring                         | ^  $                        |
       |alternation                       | |                           |
       +----------------------------------+-----------------------------+

       For example, the	ERE abba|cde matches either the	 string	 abba  or  the
       string	cde  (rather  than the string  abbade or  abbcde, because con-
       catenation has a	higher order of	precedence than	alternation).
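
       The effect of alternation having the lowest precedence can be seen by
       asking whether an ERE matches a string in its entirety. The following
       sketch does this by comparing the reported match offsets with the
       length of the string; the helper name ere_matches_whole is
       hypothetical, and the approach assumes the implementation reports the
       leftmost, longest match.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if the first match of the ERE `pattern`
 * covers all of `text`, 0 if it does not (or there is no match), and -1
 * if the pattern does not compile.
 */
int ere_matches_whole(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;

    /* The match is the whole string if it starts at 0 and ends at the end. */
    return m[0].rm_so == 0 && (size_t)m[0].rm_eo == strlen(text);
}
```

       Under these assumptions, ere_matches_whole("abba|cde", "abba") and
       ere_matches_whole("abba|cde", "cde") would be expected to return 1,
       while the strings abbade and abbcde would give 0.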

   ERE Expression Anchoring
       An ERE can be limited to	matching strings that begin  or	 end  a	 line;
       this  is	called anchoring. The circumflex and dollar sign special char-
       acters are considered ERE anchors when used anywhere outside a  bracket
       expression. This	has the	following effects:

       1. A circumflex (^) outside a bracket expression	anchors	the expression
	  or subexpression it begins to	the beginning of  a  string;  such  an
	  expression  or  subexpression	 can match only	a sequence starting at
	  the first character of a string. For example,	the EREs ^ab and (^ab)
          match ab in the string abcdef, but fail to match in the string
          cdefab, and the ERE a^b is valid, but can never match because the a
          prevents the expression ^b from matching starting at the first
          character.

       2. A dollar sign	( $ ) outside a	bracket	expression anchors the expres-
	  sion	or  subexpression   it	ends  to  the end of a string; such an
	  expression or	subexpression  can match only a	sequence ending	at the
	  last	character  of  a  string.  For example,	the EREs ef$ and (ef$)
          match ef in the string abcdef, but fail to match in the string
          cdefab, and the ERE e$f is valid, but can never match because the f
          prevents the expression e$ from matching ending at the last
          character.
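
       The ERE anchoring examples above, including the valid but unmatchable
       EREs a^b and e$f, can be exercised with a sketch along these lines.
       The helper name ere_matches is an assumption of this example; it
       returns -1 for a pattern that does not compile, which keeps an invalid
       ERE distinguishable from one that is valid but never matches.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if the ERE `pattern` matches somewhere
 * in `text`, 0 if it compiles but does not match, and -1 if it does not
 * compile.
 */
int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

       Here ere_matches("(^ab)", "abcdef") would be expected to return 1,
       and ere_matches("a^b", "a^b") to return 0 rather than -1, because the
       circumflex is an anchor anywhere outside a bracket expression in an
       ERE.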

SEE ALSO
       localedef(1),  regcomp(3C),  attributes(5), environ(5), locale(5), reg-
       exp(5)

SunOS 5.9			  12 Jul 1999			      regex(5)
