lex(1)				 User Commands				lex(1)

NAME
       lex - generate programs for lexical tasks

SYNOPSIS
       lex [-cntv] [-e | -w] [-V -Q [y | n]] [file...]

DESCRIPTION
       The  lex  utility generates C programs to be used in lexical processing
       of character input, and that can	be used	as an interface	to yacc. The C
       programs	 are  generated	 from lex source code and conform to the ISO C
       standard. Usually, the lex utility writes the program it	 generates  to
       the  file  lex.yy.c. The	state of this file is unspecified if lex exits
       with a non-zero exit status. See	EXTENDED DESCRIPTION  for  a  complete
       description of the lex input language.

OPTIONS
       The following options are supported:

       -c	       Indicates C-language action (default option).

       -e	       Generates  a  program  that  can	 handle	EUC characters
		       (cannot be used with the	-w  option).  yytext[]	is  of
		       type unsigned char[].

       -n	       Suppresses  the	summary	 of statistics usually written
		       with the	-v option. If no table sizes are specified  in
		       the lex source code and the -v option is	not specified,
		       then -n is implied.

       -t	       Writes the resulting program to standard	output instead
		       of lex.yy.c.

       -v	       Writes  a summary of lex	statistics to the standard er-
		       ror. (See the discussion	of lex table sizes  under  the
		       heading	Definitions in lex.) If	table sizes are	speci-
		       fied in the lex source code, and	if the	-n  option  is
		       not specified, the -v option may	be enabled.

       -w	       Generates  a  program  that  can	 handle	EUC characters
		       (cannot be used with the	-e option). Unlike the -e  op-
		       tion, yytext[] is of type wchar_t[].

       -V	       Prints out version information on standard error.

       -Q[y|n]	       Prints  out version information to output file lex.yy.c
		       by using	-Qy. The -Qn option does not print out version
		       information and is the default.

OPERANDS
       The following operand is supported:

       file	A  pathname  of	 an  input file. If more than one such file is
		specified, all files will be concatenated to produce a	single
		lex  program.  If no file operands are specified, or if	a file
		operand	is -, the standard input will be used.

OUTPUT
       The lex output files are described below.

       If the -t option	is specified, the text file of C source	code output of
       lex will	be written to standard output.

       If the -t option is specified, informational, error, and warning messages
       concerning the contents of lex source code input	will be	written	to the
       standard	error.

       If the -t option	is not specified:

       1.  Informational, error, and warning messages concerning the contents of
	   lex source code input will be written to either the standard	output
	   or standard error.

       2.  If  the  -v option is specified and the -n option is	not specified,
	   lex statistics will also be written to standard error.  These  sta-
	   tistics may also be generated if table sizes	are specified with a %
	   operator in the Definitions in lex section (see  EXTENDED  DESCRIP-
	   TION), as long as the -n option is not specified.

   Output Files
       A text file containing C	source code will be written to lex.yy.c, or to
       the standard output if the -t option is present.

EXTENDED DESCRIPTION
       Each input file contains lex source code, which is a table  of  regular
       expressions  with  corresponding actions in the form of C program frag-
       ments.

       When lex.yy.c is	compiled and linked with the lex  library  (using  the
       -l l operand with c89 or	cc), the resulting program reads character in-
       put from	the standard input and partitions it into strings  that	 match
       the given expressions.

       When an expression is matched, these actions will occur:

	 o  The	input string that was matched is left in yytext	as a null-ter-
	    minated string; yytext is either an	external character array or  a
	    pointer to a character string. As explained	in Definitions in lex,
	    the	type can be explicitly selected	using the %array  or  %pointer
	    declarations, but the default is %array.

         o  The  external  int  yyleng  is  set  to the length of the matching
            string.

         o  The expression's corresponding program fragment, or action, is ex-
            ecuted.

       During  pattern matching, lex searches the set of patterns for the sin-
       gle longest possible match. Among rules that match the same  number  of
       characters, the rule given first	will be	chosen.
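
       The selection rule above can be modelled in a few lines  of  C.  This
       is  only  a sketch: pick_rule and its array of per-rule match lengths
       are hypothetical names invented for illustration and are not part  of
       lex or of the code it generates.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of lex: match_len[i] is the number of
 * characters rule i would match at the current input position (0 means
 * the rule does not match). Rules are listed in source order.
 * Returns the index of the rule chosen, or -1 if no rule matches.
 */
int pick_rule(const size_t match_len[], size_t nrules)
{
    int best = -1;
    size_t best_len = 0;

    for (size_t i = 0; i < nrules; i++) {
        /* Only a strictly longer match replaces the current choice,
           so among equal lengths the earlier rule is kept. */
        if (match_len[i] > best_len) {
            best_len = match_len[i];
            best = (int)i;
        }
    }
    return best;
}
```

       For instance, if a keyword rule and an identifier rule both match the
       two  characters  "if", the lengths tie and whichever rule appears first
       in the source is selected.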

       The general format of lex source is:

       Definitions
       %%
       Rules
       %%
       User Subroutines

       The  first  %%  is required to mark the beginning of the	rules (regular
       expressions and actions); the second %% is required only	if  user  sub-
       routines	follow.

       Any line	in the Definitions in lex section beginning with a blank char-
       acter will be assumed to	be a C program fragment	and will be copied  to
       the  external definition	area of	the lex.yy.c file. Similarly, anything
       in the Definitions in lex section included between delimiter lines con-
       taining	only  %{  and %} will also be copied unchanged to the external
       definition area of the lex.yy.c file.

       Any such	input (beginning with a	blank character	or within  %{  and  %}
       delimiter lines)	appearing at the beginning of the Rules	section	before
       any rules are specified will be written to lex.yy.c after the  declara-
       tions  of variables for the yylex function and before the first line of
       code in yylex. Thus, user variables local  to  yylex  can  be  declared
       here, as	well as	application code to execute upon entry to yylex.

       The  action  taken  by lex when encountering any	input beginning	with a
       blank character or within %{ and	%} delimiter lines  appearing  in  the
       Rules  section  but  coming  after  one or more rules is	undefined. The
       presence	of such	input may result in an	erroneous  definition  of  the
       yylex function.

   Definitions in lex
       Definitions  in	lex  appear before the first %%	delimiter. Any line in
       this section not	contained between %{ and %} lines  and	not  beginning
       with  a blank character is assumed to define a lex substitution string.
       The format of these lines is:

       name   substitute

       If a name does not meet the requirements	for identifiers	in the	ISO  C
       standard,  the  result is undefined. The	string substitute will replace
       the string { name } when	it is used in a	rule. The name string is  rec-
       ognized	in  this context only when the braces are provided and when it
       does not	appear within a	bracket	expression or within double-quotes.

       In the Definitions in lex section, any line beginning with a % (percent
       sign) character and followed by an alphanumeric word beginning with ei-
       ther s or S defines a set of start conditions. Any line beginning  with
       a  %  followed  by a word beginning with	either x or X defines a	set of
       exclusive start conditions. When	the  generated	scanner	 is  in	 a  %s
       state,  patterns	 with  no state	specified will be also active; in a %x
       state, such patterns will not be	active.	The rest of  the  line,	 after
       the  first  word, is considered to be one or more blank-character-sepa-
       rated names of start conditions.	Start condition	names are  constructed
       in  the	same  way as definition	names. Start conditions	can be used to
       restrict	the matching of	regular	expressions to one or more  states  as
       described in Regular expressions	in lex.
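
       The difference between %s and %x states can be  stated  as  a  small
       predicate.  The  sketch  below is a model only; struct start_cond, the
       bitmask encoding, and rule_active are assumptions made for this  exam-
       ple and do not correspond to names in generated scanners.

```c
#include <stdbool.h>

/* Hypothetical model of a start condition: a small id plus its kind. */
struct start_cond {
    unsigned id;     /* 0 stands for INITIAL */
    bool exclusive;  /* true for %x; false for %s and for INITIAL */
};

/*
 * rule_states is a bitmask of the start conditions named in <...> in
 * front of a rule's pattern; 0 means the rule names no start condition.
 * Returns true if the rule takes part in matching in the current state.
 */
bool rule_active(unsigned rule_states, struct start_cond current)
{
    if (rule_states == 0)
        return !current.exclusive;   /* unmarked rules: inactive only in %x states */
    return (rule_states & (1u << current.id)) != 0;
}
```

       Under  this  model  an  unmarked  rule  is active in INITIAL and in any
       %s state, and a rule marked with a state is active only in that state.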

       Implementations	accept	either of the following	two mutually exclusive
       declarations in the Definitions in lex section:

       %array	       Declare the type	of  yytext  to	be  a  null-terminated
		       character array.

       %pointer	       Declare	the  type of yytext to be a pointer to a null-
		       terminated character string.

       Note: When using	the %pointer option, you may not also use  the	yyless
       function	to alter yytext.

       %array  is  the  default. If %array is specified (or neither %array nor
       %pointer is specified), then the correct way to make an external refer-
       ence to yytext is with a declaration of the form:

              extern char yytext[];

       If %pointer is specified, then the correct external reference is of the
       form:

	      extern char *yytext;

       lex will	accept declarations in the Definitions in lex section for set-
       ting  certain  internal	table sizes. The declarations are shown	in the
       following table.

              Table Size Declaration in lex

       | Declaration   Description                          Default |
       |    %p n       Number of positions                  2500    |
       |    %n n       Number of states                     500     |
       |    %a n       Number of transitions                2000    |
       |    %e n       Number of parse tree nodes           1000    |
       |    %k n       Number of packed character classes   10000   |
       |    %o n       Size of the output array             3000    |

       Programs	generated by lex need either the -e or -w option to handle in-
       put  that  contains EUC characters from supplementary codesets. If nei-
       ther of these options is	specified, yytext is of	the type  char[],  and
       the generated program can handle	only ASCII characters.

       When  the  -e option is used, yytext is of the type unsigned char[] and
       yyleng gives the	total number of	bytes in the matched string. With this
       option,	the  macros input(), unput(c), and output(c) should do a byte-
       based I/O in the	same way as with the regular ASCII lex.	Two more vari-
       ables  are available with the -e	option,	yywtext	and yywleng, which be-
       have the	same as	yytext and yyleng would	under the -w option.

       When the	-w option is used, yytext is of	the type wchar_t[] and	yyleng
       gives  the  total  number  of characters	in the matched string.	If you
       supply your own input(),	unput(c), or output(c) macros  with  this  op-
       tion,  they  must  return  or accept EUC	characters in the form of wide
       character (wchar_t). This allows	a  different  interface	 between  your
       program and the lex internals, to expedite some programs.

   Rules in lex
       The Rules in lex	source files are a table in which the left column con-
       tains regular expressions and the right column contains actions (C pro-
       gram fragments) to be executed when the expressions are recognized.

       ERE action
       ERE action

       The  extended  regular  expression (ERE) portion of a row will be sepa-
       rated from action by one or more blank characters. A regular expression
       containing  blank  characters  is recognized under one of the following
       conditions:

	 o  The	entire expression appears within double-quotes.

         o  The blank characters appear within double-quotes or square brackets.

	 o  Each blank character is preceded by	a backslash character.

   User	Subroutines in lex
       Anything	 in  the  user	subroutines section will be copied to lex.yy.c
       following yylex.

   Regular Expressions in lex
       The lex utility supports the set of Extended Regular Expressions (EREs)
       described  in  regex(5)  with the following additions and exceptions to
       the syntax:


       "..."

           Any string enclosed in double-quotes will represent the  characters
	   within  the	double-quotes as themselves, except that backslash es-
	   capes (which	appear in the following	 table)	 are  recognized.  Any
	   backslash-escape  sequence  is terminated by	the closing quote. For
	   example, "\01""1" represents	a single string:  the  octal  value  1
	   followed by the character 1.


       <state>r
       <state1, state2, ...>r

	   The	regular	 expression r will be matched only when	the program is
	   in one of the start conditions indicated by state, state1,  and  so
	   forth. For more information,	see Actions in lex. As an exception to
	   the typographical conventions of the	rest of	this document, in this
	   case	<state>	does not represent a metavariable, but the literal an-
	   gle-bracket characters surrounding a	symbol.	The start condition is
	   recognized as such only at the beginning of a regular expression.


       r/x

           The  regular expression r will be matched only if it is followed by
	   an occurrence of regular expression x. The token returned in	yytext
	   will	 only match r. If the trailing portion of r matches the	begin-
	   ning	of x, the result is unspecified. The r expression  cannot  in-
	   clude  further trailing context or the $ (match-end-of-line)	opera-
	   tor;	x cannot include the ^ (match-beginning-of-line) operator, nor
	   trailing  context, nor the $	operator. That is, only	one occurrence
	   of trailing context is allowed in a lex regular expression, and the
	   ^ operator only can be used at the beginning	of such	an expression.
	   A further restriction  is  that  the	 trailing-context  operator  /
	   (slash) cannot be grouped within parentheses.


       {name}

           When  name  is one of the substitution symbols from the Definitions
	   section, the	string,	including the enclosing	braces,	 will  be  re-
	   placed  by  the  substitute	value.	The  substitute	 value will be
	   treated in the extended regular expression as if it	were  enclosed
	   in  parentheses. No substitution will occur if {name} occurs	within
	   a bracket expression	or within double-quotes.
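
       The effect of the implied parentheses can be seen by expanding a defi-
       nition  by  hand. The function below is a rough sketch of that textual
       substitution; expand_definition is a hypothetical helper  written  for
       this  example,  and  unlike lex it does not skip occurrences of {name}
       inside bracket expressions or double-quotes.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of lex: copy pattern to out, replacing
 * every "{name}" with "(substitute)". Returns 0 on success, or -1 if
 * the name is too long or out is too small.
 */
int expand_definition(const char *pattern, const char *name,
                      const char *substitute, char *out, size_t outsz)
{
    char key[64];
    size_t used = 0;
    int klen;

    if (outsz == 0)
        return -1;
    klen = snprintf(key, sizeof key, "{%s}", name);
    if (klen < 0 || (size_t)klen >= sizeof key)
        return -1;

    while (*pattern != '\0') {
        int n;

        if (strncmp(pattern, key, (size_t)klen) == 0) {
            /* The substitute is wrapped in parentheses, as the text describes. */
            n = snprintf(out + used, outsz - used, "(%s)", substitute);
            pattern += klen;
        } else {
            n = snprintf(out + used, outsz - used, "%c", *pattern);
            pattern++;
        }
        if (n < 0 || (size_t)n >= outsz - used)
            return -1;
        used += (size_t)n;
    }
    out[used] = '\0';
    return 0;
}
```

       With a definition AB whose substitute is a|b, the  rule  pattern  {AB}c
       expands  to  (a|b)c  rather than a|bc, so the alternation stays confined
       to the definition.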

       Within an ERE, a	backslash character (\\, \a, \b, \f, \n, \r,  \t,  \v)
       is  considered to begin an escape sequence. In addition,	the escape se-
       quences in the following	table will be recognized.

       A literal newline character cannot occur	within an ERE; the escape  se-
       quence \n can be	used to	represent a newline character. A newline char-
       acter cannot be matched by a period operator.

       Escape Sequences in lex

       |Escape Sequence  Description                      Meaning                         |
       |\digits          A backslash character followed   The character whose encoding    |
       |                 by the longest sequence of one,  is represented by the one-,     |
       |                 two, or three octal-digit        two-, or three-digit octal      |
       |                 characters (01234567). If all    integer. Multi-byte characters  |
       |                 of the digits are 0 (that is,    require multiple, concatenated  |
       |                 representation of the NUL        escape sequences of this type,  |
       |                 character), the behavior is      including the leading \ for     |
       |                 undefined.                       each byte.                      |
       |\xdigits         A backslash character followed   The character whose encoding    |
       |                 by the longest sequence of       is represented by the           |
       |                 hexadecimal-digit characters     hexadecimal integer.            |
       |                 (0123456789abcdefABCDEF). If all                                 |
       |                 of the digits are 0 (that is,                                    |
       |                 representation of the NUL                                        |
       |                 character), the behavior is                                      |
       |                 undefined.                                                       |
       |\c               A backslash character followed   The character c, unchanged.     |
       |                 by any character not described                                   |
       |                 in this table. (\\, \a, \b, \f,                                  |
       |                 \n, \r, \t, \v).                                                 |
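
       As  a  rough illustration of the table, the function below decodes one
       escape sequence from a C string. It is a sketch under assumptions  and
       not  how  lex itself is implemented: decode_escape is a hypothetical
       name, the caller must pass a string that starts with a backslash  fol-
       lowed  by  at  least  one character, and overflow of long hexadecimal
       sequences is not checked.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of lex. s[0] must be a backslash and
 * s[1] must not be the terminating NUL. Returns the decoded character
 * value and stores the number of source characters consumed in *used.
 */
int decode_escape(const char *s, size_t *used)
{
    size_t i = 1;
    int value = 0;

    if (s[i] >= '0' && s[i] <= '7') {
        /* \digits: the longest run of one, two or three octal digits */
        size_t n = 0;
        while (n < 3 && s[i] >= '0' && s[i] <= '7') {
            value = value * 8 + (s[i] - '0');
            i++;
            n++;
        }
    } else if (s[i] == 'x' && isxdigit((unsigned char)s[i + 1])) {
        /* \xdigits: the longest run of hexadecimal digits */
        i++;
        while (isxdigit((unsigned char)s[i])) {
            int c = (unsigned char)s[i];
            int d = isdigit(c) ? c - '0' : tolower(c) - 'a' + 10;
            value = value * 16 + d;
            i++;
        }
    } else {
        switch (s[i]) {
        case 'a': value = '\a'; break;
        case 'b': value = '\b'; break;
        case 'f': value = '\f'; break;
        case 'n': value = '\n'; break;
        case 'r': value = '\r'; break;
        case 't': value = '\t'; break;
        case 'v': value = '\v'; break;
        default:  value = (unsigned char)s[i]; break;  /* \c: c itself */
        }
        i++;
    }
    *used = i;
    return value;
}
```

       Because at most three octal digits are taken, the input  \0123  decodes
       to  the octal value 012 and leaves the final 3 as an ordinary charac-
       ter.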

       The order of precedence given to	extended regular expressions  for  lex
       is as shown in the following table, from	high to	low.

       Note:	The  escaped characters	entry is not meant to imply that these
		are operators, but they	are included  in  the  table  to  show
		their  relationships  to  the true operators. The start	condi-
		tion, trailing context and anchoring notations have been omit-
		ted  from  the table because of	the placement restrictions de-
		scribed	in this	section; they can only appear at the beginning
		or ending of an	ERE.

       |		     ERE Precedence in lex			|
       |collation-related bracket symbols   [= =]  [: :]  [. .]		|
       |escaped	characters		    \<special character>	|
       |bracket	expression		    [ ]				|
       |quoting				    "..."			|
       |grouping			    ()				|
       |definition			    {name}			|
       |single-character RE duplication	    * +	?			|
       |concatenation							|
       |interval expression		    {m,n}			|
       |alternation			    |				|

       The  ERE	anchoring operators (^ and $) do not appear in the table. With
       lex regular expressions,	these operators	are restricted in  their  use:
       the  ^  operator	can only be used at the	beginning of an	entire regular
       expression, and the $ operator only at the end. The operators apply  to
       the   entire   regular  expression.  Thus,  for	example,  the  pattern
       (^abc)|(def$) is	undefined; it can instead be written as	 two  separate
       rules,  one  with  the regular expression ^abc and one with def$, which
       share a common action via the special | action (see below). If the pat-
       tern  were  written ^abc|def$, it would match either of abc or def on a
       line by itself.

       Unlike the general ERE rules, embedded anchoring is not allowed by most
       historical  lex implementations. An example of embedded anchoring would
       be for patterns such as (^)foo($) to match foo when it exists as a com-
       plete  word. This functionality can be obtained using existing lex fea-
       tures:

       ^foo/[ \n]      |
       " foo"/[ \n]    /* found foo as a separate word */

       Notice also that $ is a form of trailing context (it is  equivalent  to
       /\n) and as such cannot be used with regular expressions containing an-
       other instance of the operator (see the preceding discussion of  trail-
       ing context).

       The  additional regular expressions trailing-context operator / (slash)
       can be used as an ordinary character if presented within	double-quotes,
       "/";  preceded by a backslash, \/; or within a bracket expression, [/].
       The start-condition < and > operators are special only in a start  con-
       dition at the beginning of a regular expression;	elsewhere in the regu-
       lar expression they are treated as ordinary characters.

       The following examples clarify the differences between lex regular  ex-
       pressions and regular expressions appearing elsewhere in	this document.
       For regular expressions of the form r/x,	the string matching r  is  al-
       ways  returned; confusion may arise when	the beginning of x matches the
       trailing	portion	of r. For example, given the regular expression	a*b/cc
       and  the	 input	aaabcc,	 yytext	 would contain the string aaab on this
       match. But given	the regular expression x*/xy and the input  xxxy,  the
       token  xxx,  not	 xx,  is  returned by some implementations because xxx
       matches x*.

       In the rule ab*/bc, the b* at the end of	r will extend r's  match  into
       the beginning of	the trailing context, so the result is unspecified. If
       this rule were ab/bc, however, the rule matches the text	ab when	it  is
       followed	 by the	text bc. In this latter	case, the matching of r	cannot
       extend into the beginning of x, so the result is	specified.

   Actions in lex
       The action to be	taken when an ERE is matched can be a C	program	 frag-
       ment  or	 the special actions described below; the program fragment can
       contain one or more C statements, and can also include special actions.
       The  empty  C statement ; is a valid action; any	string in the lex.yy.c
       input that matches the pattern portion of such a	 rule  is  effectively
       ignored or skipped. However, the	absence	of an action is	not valid, and
       the action lex takes in such a condition	is undefined.

       The specification for an	action,	including C statements and special ac-
       tions, can extend across	several	lines if enclosed in braces:

       ERE <one	or more	blanks>	{ program statement
       program statement }

       The  default action when	a string in the	input to a lex.yy.c program is
       not matched by any expression is	to copy	the string to the output.  Be-
       cause the default behavior of a program generated by lex	is to read the
       input and copy it to the	output,	a minimal lex source program that  has
       just  %%	generates a C program that simply copies the input to the out-
       put unchanged.
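
       In  effect,  a  scanner built from a source file containing only %% be-
       haves like the following copy loop. This is an  analogy  written  for
       this  page,  not  the  code  lex emits; copy_stream is a hypothetical
       name.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for the behaviour of a scanner with no rules:
 * copy every byte from in to out. Returns 0 on success, or -1 on a
 * read or write error.
 */
int copy_stream(FILE *in, FILE *out)
{
    int c;

    while ((c = getc(in)) != EOF) {
        if (putc(c, out) == EOF)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}
```

       Calling copy_stream(stdin, stdout) from a small main gives roughly  the
       same observable result as running such a scanner.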

       Four special actions are	available:

       |       ECHO;	  REJECT;      BEGIN

       |               The action | means that the action for the next rule is
                       the  action  for  this rule. Unlike the other three ac-
                       tions, | cannot be enclosed in braces or be  semicolon-
                       terminated.  It  must be specified alone, with no other
                       actions.

       ECHO;	       Writes the contents of the string yytext	on the output.

       REJECT;	       Usually only a single expression	is matched by a	 given
		       string in the input. REJECT means "continue to the next
		       expression that matches the current input," and	causes
		       whatever	 rule  was the second choice after the current
		       rule to be executed for the same	input. Thus,  multiple
		       rules  can be matched and executed for one input	string
		       or overlapping input strings. For  example,  given  the
		       regular	expressions xyz	and xy and the input xyz, usu-
		       ally only the regular expression	xyz would  match.  The
		       next  attempted	match would start after	z. If the last
		       action in the xyz rule is REJECT, both this  rule  and
		       the xy rule would be executed. The REJECT action	may be
		       implemented in such a fashion that flow of control does
		       not  continue  after  it, as if it were equivalent to a
		       goto to another part of yylex. The use  of  REJECT  may
		       result in somewhat larger and slower scanners.

       BEGIN	       The action:

		       BEGIN newstate;

		       switches	 the  state  (start condition) to newstate. If
		       the string newstate has not been	declared previously as
		       a  start	 condition  in the Definitions in lex section,
		       the results are unspecified. The	initial	state is indi-
		       cated by	the digit 0 or the token INITIAL.

       The functions or	macros described below are accessible to user code in-
       cluded in the lex input.	It is unspecified whether they appear in the C
       code  output of lex, or are accessible only through the -l l operand to
       c89 or cc (the lex library).

       int yylex(void)	       Performs	lexical	analysis on the	input; this is
			       the primary function generated by the lex util-
			       ity. The	function returns zero when the end  of
			       input is	reached; otherwise it returns non-zero
			       values (tokens) determined by the actions  that
			       are selected.

       int yymore(void)        When called, indicates that when the next input
                               string is recognized, it is to be  appended  to
                               the current value of yytext rather than replac-
                               ing it; the value in yyleng is adjusted accord-
                               ingly.

       int yyless(int n)       Retains  n  initial  characters in yytext, NUL-
                               terminated, and treats the remaining characters
                               as  if  they  had  not  been read; the value in
                               yyleng is adjusted accordingly.

       int input(void)	       Returns the next	character from the  input,  or
			       zero  on	end-of-file. It	obtains	input from the
			       stream pointer yyin, although possibly  via  an
			       intermediate  buffer.  Thus,  once scanning has
			       begun, the effect of altering the value of yyin
			       is  undefined.  The  character  read is removed
			       from the	input stream of	 the  scanner  without
			       any processing by the scanner.

       int unput(int c)        Returns  the  character  c to the input; yytext
                               and yyleng are undefined until the next expres-
                               sion  is matched. The result of using unput for
                               more characters than have been input is unspec-
                               ified.
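
       The bookkeeping that yymore and yyless perform on yytext and yyleng can
       be modelled with a small structure. Everything  below  is  a  sketch:
       struct  scan  and  the  sketch_  functions  are hypothetical stand-ins
       written for this page, and sketch_match assumes the caller never asks
       for more characters than remain in the input.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the scanner state touched by yymore and yyless. */
struct scan {
    const char *input;  /* whole input, NUL-terminated */
    size_t pos;         /* offset of the next unread character */
    char text[64];      /* stand-in for yytext */
    size_t leng;        /* stand-in for yyleng */
    int more;           /* set by sketch_yymore, cleared by the next match */
};

/* Pretend that some rule matched the next len characters of the input. */
void sketch_match(struct scan *s, size_t len)
{
    /* After yymore the new text is appended; otherwise it replaces. */
    size_t start = s->more ? s->leng : 0;

    if (start + len >= sizeof s->text)
        len = sizeof s->text - 1 - start;   /* keep the sketch in bounds */
    memcpy(s->text + start, s->input + s->pos, len);
    s->leng = start + len;
    s->text[s->leng] = '\0';
    s->pos += len;
    s->more = 0;
}

/* Model of yymore: the next match is appended to the current text. */
void sketch_yymore(struct scan *s)
{
    s->more = 1;
}

/* Model of yyless: keep the first n characters, push the rest back. */
void sketch_yyless(struct scan *s, size_t n)
{
    if (n >= s->leng)
        return;
    s->pos -= s->leng - n;
    s->leng = n;
    s->text[n] = '\0';
}
```

       In  this model, matching six characters and then keeping three of them
       leaves the other three to be scanned again by the next match.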

       The  following  functions  appear  only  in  the lex library accessible
       through the -l l operand; they can therefore be redefined by a portable
       application:
       int yywrap(void)

	   Called  by yylex at end-of-file; the	default	yywrap always will re-
	   turn	1. If the application requires yylex  to  continue  processing
	   with	 another  source  of input, then the application can include a
	   function yywrap, which associates another file  with	 the  external
	   variable FILE *yyin and will	return a value of zero.

       int main(int argc, char *argv[])

	   Calls  yylex	to perform lexical analysis, then exits. The user code
	   can contain main to perform application-specific operations,	 call-
	   ing yylex as	applicable.
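
       An  application-supplied  yywrap  might  look something like the sketch
       below, which moves on to the next file in a list. The list and the set-
       ter  set_input_files  are  hypothetical; yyin is defined here only so
       that the fragment compiles on its own, whereas in a real  scanner  it
       is provided by the generated code and would be declared extern.

```c
#include <stdio.h>

/* Defined here only to keep the sketch self-contained; a real scanner
   would use the yyin supplied by the generated lex.yy.c. */
FILE *yyin;

/* Hypothetical NULL-terminated list of files still to be scanned. */
static const char **next_file;

void set_input_files(const char **paths)
{
    next_file = paths;
}

/* Returns 0 after switching yyin to another file, or 1 when no input
   is left, matching the convention described above. */
int yywrap(void)
{
    while (next_file != NULL && *next_file != NULL) {
        FILE *f = fopen(*next_file++, "r");

        if (f != NULL) {
            if (yyin != NULL && yyin != stdin)
                fclose(yyin);
            yyin = f;
            return 0;   /* yylex continues with the new file */
        }
        /* Files that cannot be opened are skipped in this sketch. */
    }
    return 1;           /* nothing left: yylex returns zero */
}
```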

       The  reason  for  breaking  these functions into two lists is that only
       those functions in libl.a can be reliably redefined by a  portable  ap-
       plication.

       Except  for input, unput	and main, all external and static names	gener-
       ated by lex begin with the prefix yy or YY.

USAGE
       Portable applications are warned that in the Rules in lex  section,  an
       ERE  without  an	 action	is not acceptable, but need not	be detected as
       erroneous by lex. This may result in compilation	or run-time errors.

       The purpose of input is to take characters off  the  input  stream  and
       discard  them as far as the lexical analysis is concerned. A common use
       is to discard the body of a comment once the beginning of a comment  is
       recognized.
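
       The comment-skipping idiom can be sketched as follows. To keep the ex-
       ample  self-contained,  sketch_input reads from a string and stands in
       for the real input; set_source, remaining, and skip_comment are  like-
       wise  hypothetical  names.  In an actual lex action the loop would call
       input directly.

```c
static const char *src = "";   /* stand-in for the scanner's input stream */

void set_source(const char *s)
{
    src = s;
}

const char *remaining(void)
{
    return src;
}

/* Stand-in for input(): the next character, or zero at end of input. */
static int sketch_input(void)
{
    return *src != '\0' ? (unsigned char)*src++ : 0;
}

/*
 * Intended to run after a rule has matched the opening delimiter of a
 * C-style comment. Discards characters up to and including the closing
 * delimiter. Returns 1 if the comment was closed, 0 if input ran out.
 */
int skip_comment(void)
{
    int c;
    int prev = 0;

    while ((c = sketch_input()) != 0) {
        if (prev == '*' && c == '/')
            return 1;
        prev = c;
    }
    return 0;
}
```

       The  characters  consumed this way never appear in yytext, which is the
       point of using input for this purpose.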

       The lex utility is not fully internationalized in its treatment of reg-
       ular expressions	in the lex source code or generated lexical  analyzer.
       It would	seem desirable to have the lexical analyzer interpret the reg-
       ular expressions	given in the lex source	according to  the  environment
       specified when the lexical analyzer is executed,	but this is not	possi-
       ble with	the current lex	technology. Furthermore, the  very  nature  of
       the lexical analyzers produced by lex must be closely tied to the lexi-
       cal requirements	of the input language being described, which will fre-
       quently	be  locale-specific  anyway. (For example, writing an analyzer
       that is used for	French text will not automatically be useful for  pro-
       cessing other languages.)

EXAMPLES
       Example 1: Using lex

       The following is	an example of a	lex program that implements a rudimen-
       tary scanner for	a Pascal-like syntax:

       %{
       /* need this for the calls to atoi() and atof() below */
       #include <stdlib.h>
       /* need this for printf(), fopen() and stdin below */
       #include <stdio.h>
       %}

       DIGIT    [0-9]
       ID       [a-z][a-z0-9]*

       %%

       {DIGIT}+                  {
                                 printf("An integer: %s (%d)\n", yytext,
                                        atoi(yytext));
                                 }

       {DIGIT}+"."{DIGIT}*       {
                                 printf("A float: %s (%g)\n", yytext,
                                        atof(yytext));
                                 }

       if|then|begin|end|procedure|function       {
                                 printf("A keyword: %s\n", yytext);
                                 }

       {ID}                      printf("An identifier: %s\n", yytext);

       "+"|"-"|"*"|"/"           printf("An operator: %s\n", yytext);

       "{"[^}\n]*"}"             ;  /* eat up one-line comments */

       [ \t\n]+                  ;  /* eat up white space */

       .                         printf("Unrecognized character: %s\n", yytext);

       %%

       int main(int argc, char *argv[])
       {
                                 ++argv, --argc;  /* skip over program name */
                                 if (argc > 0)
                                         yyin = fopen(argv[0], "r");
                                 else
                                         yyin = stdin;

                                 yylex();
                                 return 0;
       }

ENVIRONMENT VARIABLES
       See environ(5) for descriptions of the following environment  variables
       that  affect  the execution of lex: LANG, LC_ALL, LC_COLLATE, LC_CTYPE,

EXIT STATUS
       The following exit values are returned:

       0	Successful completion.

       >0	An error occurred.

ATTRIBUTES
       See attributes(5) for descriptions of the following attributes:

       |      ATTRIBUTE	TYPE	     |	    ATTRIBUTE VALUE	   |
       |Availability		     |SUNWbtool			   |
       |Interface Stability	     |Standard			   |

SEE ALSO
       yacc(1), attributes(5), environ(5), regex(5), standards(5)

NOTES
       If routines such as yyback(), yywrap(), and yylock() in .l (ell)  files
       are  to be external C functions,	the command line to compile a C++ pro-
       gram must define	the __EXTERN_C__ macro.	For example:

       example%	 CC -D__EXTERN_C__ ... file

SunOS 5.10			  22 Aug 1997				lex(1)

