HOP::Lexer::Article(3)	User Contributed Perl Documentation	HOP::Lexer::Article(3)

NAME
       Lexing Without Grammars:	 When Regular Expressions Suck

ABSTRACT
       Perl is famed for its text processing capabilities.  However, sometimes
       the data you want to process is too complicated for regular expressions
       and you reach for a parser for your HTML, RTF, or other common format.
       This article discusses what to do when there is no pre-defined parser
       for your format, but the text you need to work with is still too
       complicated for regular expressions.

ORIGINAL PUBLICATION
       This article was	originally published by	O'Reilly in January, 2006 at
       <http://www.perl.com/pub/a/2006/01/05/parsing.html>.  Reproduced	with
       permission.

"s/(?<!SHOOTING	YOURSELF IN THE	)FOOT/HEAD/g"
       Most of us have tried at	one time or another to use regular expressions
       to do things we shouldn't.  Parse HTML, obfuscate code, wash dishes,
       etc.  This is referred to by the	technical term "showing	off".  I've
       done it too:

	$html =~ s{
		     (<a\s(?:[^>](?!href))*href\s*)
		     (&(&[^;]+;)?(?:.(?!\3))+(?:\3)?)
		     ([^>]+>)
		  }
		  {$1 .	decode_entities($2) .  $4}gsexi;

       I was strutting like a peacock when I wrote that, followed quickly by
       eating crow when	I ran it.  I never did get that	working	right.	I'm
       still not sure what I was trying	to do.	That was the regular
       expression which	forced me to learn how to use "HTML::TokeParser".
       More importantly, that was the regular expression which taught me how
       difficult regular expressions can be.

   The Problem with Regular Expressions
       Let's look at that regex	again:

	/(<a\s(?:[^>](?!href))*href\s*)(&(&[^;]+;)?(?:.(?!\3))+(?:\3)?)([^>]+>)/

       Do you know what that matches?  Exactly?  Are you sure?  Even if it works,
       how easily can you modify it?

       If you don't know what it was trying to do (and to be fair, don't
       forget it's broken), how	long did you spend trying to figure it out?
       When's the last time a single line of code gave you such	fits?

       The problem, of course, is that this regular expression is trying to do
       far more	work than a single line	of code	is likely to do.  When faced
       with a regular expression like that, there are a	few things I like to
       do.

       o   Document them carefully.

       o   Use the /x switch so	I can expand them over several lines.

       o   Possibly encapsulate	them in	a subroutine.

       Sometimes, though, there's a fourth option:  lexing.

   Lexing
       When developing code, we	typically take a problem and break it down
       into a series of	smaller	problems which are easier to solve.  Regular
       expressions are code and	they too can be	broken down into a series of
       smaller problems	which are easier to solve.  One	technique is to	use
       lexing to facilitate this.

       Lexing is the act of taking data, breaking it down into discrete tokens
       and assigning meaning to those tokens.  There's a bit of fudging in
       that statement, but it pretty much covers the basics.

       Lexing is typically followed by parsing whereby the tokens are then
       converted into something	more useful.  Parsing is frequently handled by
       having some tool	which applies a	well-defined grammar to	the lexed
       tokens.

       Sometimes well-defined grammars are not practical for extracting and
       reporting information.  There might not be a grammar available for a
       company's ad-hoc log file format.  Other times we might find it easier
       to process the tokens manually than to spend the time writing a
       grammar.  And still other times we are only interested in part of the
       data we've lexed, not all of it.  All three of these reasons apply to
       the following problem.

   Parsing SQL
       Recently	on Perlmonks (http://perlmonks.org/index.pl?node_id=472684),
       someone had the following SQL:

	 select	the_date as "date",
	 round(months_between(first_date,second_date),0) months_old
	 ,product,extract(year from the_date) year
	 ,case
	   when	a=b then 'c'
	   else	'd'
	   end tough_one
	 from ...
	 where ...

       What they needed	from that SQL was the alias for	each column.  In this
       case, those would be "date", "months_old", "product", "year",
       "tough_one."  Of	course,	they mentioned that this was only one example.
       There's actually	plenty of generated SQL, all with subtle variations on
       how the columns are aliased so this is not a trivial task.  What's
       interesting about this, though, is that we don't	give a fig about
       anything	except the column aliases.  The	rest of	the text is merely
       there to	help us	find those aliases.

       Our first thought might be to try to parse this with "SQL::Statement".
       As it turns out, this module does not handle "CASE" statements.  Thus,
       we're left with figuring out how to patch "SQL::Statement", submitting
       said patch, and hoping it gets accepted and released in a timely
       fashion.  (Note that "SQL::Statement" uses "SQL::Parser", so the latter
       is also not an option.)

       Second, a number	of us have worked in environments where	problems have
       to be solved in production now but we still have	to wait	three weeks
       for the necessary modules to be installed, if they'll be	approved at
       all.

       The most	important reason, though, is even if "SQL::Statement" could
       handle this problem, this would be an awfully short article if you used
       it instead of a lexer.

   Lexing Basics
       As mentioned earlier, lexing is essentially the task of analyzing data
       and breaking it down into a series of easy-to-use tokens.  While	the
       data may	be in other forms, usually this	means analyzing	strings.  To
       give a trivial example, consider	the following:

	x = (3 + 2) / y

       When lexed, we might get	a series of tokens like	the following:

	my @tokens = (
	  [ VAR	=> 'x' ],
	  [ OP	=> '=' ],
	  [ OP	=> '(' ],
	  [ INT	=> '3' ],
	  [ OP	=> '+' ],
	  [ INT	=> '2' ],
	  [ OP	=> ')' ],
	  [ OP	=> '/' ],
	  [ VAR	=> 'y' ],
	);

       With a proper grammar, we could then read this series of tokens and
       take actions based upon their values, such as building a simple
       language interpreter or translating this code into another programming
       language.
       Even without a grammar we can find these	tokens useful as we'll see
       with the	SQL example.
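
       To make that concrete, here is a minimal sketch (not from the original
       article) of how such tokens might be produced with a single
       \G-anchored regular expression:

	 my $expr = 'x = (3 + 2) / y';
	 my @tokens;
	 while ( $expr =~ /\G \s* (?: ([a-z]\w*) | (\d+) | ([-+*\/=()]) )/gcx ) {
	     push @tokens, defined $1 ? [ VAR => $1 ]
			 : defined $2 ? [ INT => $2 ]
			 :              [ OP  => $3 ];
	 }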

   Identifying tokens
       The first step in building a lexer is identifying the tokens you	wish
       to parse.  Let's	take another look at the SQL.

	 select	the_date as "date",
	 round(months_between(first_date,second_date),0) months_old
	 ,product,extract(year from the_date) year
	 ,case
	   when	a=b then 'c'
	     else 'd'
	   end tough_one
	 from ...
	 where ...

       We really don't care about anything after the "from" keyword.  Looking
       at this more closely, we see that everything we do care about comes
       immediately before a comma or the "from" keyword.  However, splitting
       on commas isn't enough, as some commas are embedded in function
       parentheses.

       The first thing we need to do is	to identify the	various	things we can
       match with simple regular expressions.

       These "things" appear to	be parentheses,	commas,	operators, keywords
       and random text.	 A first pass at it might look something like this:

	 my $lparen  = qr/\(/;
	 my $rparen  = qr/\)/;
	 my $keyword = qr/(?i:select|from|as)/;	# this is all this problem needs
	 my $comma   = qr/,/;
	 my $text    = qr/(?:\w+|'\w+'|"\w+")/;
	 my $op	     = qr{[-=+*/<>]};

       The text	matching is somewhat naive and we might	want "Regexp::Common"
       for some	of the regular expressions, but	for now	we'll keep this
       simple.
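
       For instance, "Regexp::Common" ships with a ready-made pattern for
       quoted strings that could stand in for our hand-rolled alternatives (a
       sketch of the idea, not what the final code below uses):

	 use Regexp::Common;

	 # $RE{quoted} matches '...', "..." and `...` strings
	 my $text = qr/(?:\w+|$RE{quoted})/;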

       The operators are a bit more involved as	we'll assume that some SQL
       might have math statements embedded in them.

       Then we create the actual lexer.	 One way to do this is to make our own
       lexer.  It might	look something like this:

	 sub lexer {
	     my	$sql = shift;
	     return sub	{
		 LEXER:	{
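		     # \G anchors each match where the previous one left off;
		     # the /c modifier keeps that position even when a match
		     # fails, so the next alternative can try from there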
		     return ['KEYWORD',	$1] if $sql =~ /\G ($keyword) /gcx;
		     return ['COMMA',	''] if $sql =~ /\G ($comma)   /gcx;
		     return ['OP',	$1] if $sql =~ /\G ($op)      /gcx;
		     return ['PAREN',	 1] if $sql =~ /\G $lparen    /gcx;
		     return ['PAREN',	-1] if $sql =~ /\G $rparen    /gcx;
		     return ['TEXT',	$1] if $sql =~ /\G ($text)    /gcx;
		     redo LEXER		    if $sql =~ /\G \s+	      /gcx;
		 }
	     };
	 }
	 my $lexer = lexer($sql);
	 while (defined	(my $token = $lexer->())) {
	     # do something with the token
	 }

       Without going into the details of how that works, it's fair to say that
       this is not the best solution.  By looking at the original post this
       came from (http://perlmonks.org/index.pl?node_id=472701), we find that
       we need to make two passes through the data to extract what we want.
       Why this is the case is an exercise left for the reader.

       To make this simpler, we're going to use	the "HOP::Lexer" module	from
       the CPAN.  This module, described by Mark Jason Dominus in his book
       "Higher Order Perl", makes creating lexers a rather trivial task	and
       makes them a bit	more powerful than what	we have	above.	Here's our
       code:

	 use HOP::Lexer	'make_lexer';
	 my @sql = $sql;
	 my $lexer = make_lexer(
	     sub { shift @sql },
	     [ 'KEYWORD', qr/(?i:select|from|as)/	   ],
	     [ 'COMMA',	  qr/,/				   ],
	     [ 'OP',	  qr{[-=+*/]}			   ],
	     [ 'PAREN',	  qr/\(/,      sub { [shift,  1] } ],
	     [ 'PAREN',	  qr/\)/,      sub { [shift, -1] } ],
	     [ 'TEXT',	  qr/(?:\w+|'\w+'|"\w+")/, \&text  ],
	     [ 'SPACE',	  qr/\s*/,     sub {}		   ],
	 );

	 sub text {
	     my	($label, $value) = @_;
	     $value =~ s/^["']//;
	     $value =~ s/["']$//;
	     return [ $label, $value ];
	 }

       This certainly doesn't look any easier to read, but bear	with me.

       The "make_lexer"	subroutine takes as its	first argument an iterator
       which returns the text to match every time it's called.	In our case,
       we only have one	snippet	of text	to match, so we	merely shift it	off of
       an array.  If we	were reading lines from	a log file, the	iterator would
       be quite	handy.
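
       For instance, a lexer fed from a file might look like this (a sketch;
       the log file name and token definitions are made up):

	 use HOP::Lexer 'make_lexer';

	 open my $fh, '<', 'queries.log' or die "Can't open queries.log: $!";
	 my $log_lexer = make_lexer(
	     sub { scalar <$fh> },           # returns undef at EOF
	     [ 'WORD',  qr/\w+/            ],
	     [ 'SPACE', qr/\s+/,   sub {}  ],
	 );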

       After the first argument, we have a series of array references.  Each
       reference takes two mandatory arguments and one optional one:

	 [ $label, $pattern, $optional_subroutine ]

       The $label will be used as the name of the token.  The pattern should
       match whatever the label	is identifying.	 The third argument, a
       subroutine reference, takes as arguments	the label and the text the
       label matched and returns whatever you wish for a token.	 We'll get to
       that in a moment.  First, let's consider	how we typically use the
       "make_lexer" subroutine.

	 [ 'KEYWORD', qr/(?i:select|from|as)/ ],

       An example of how we might transform the	data before making the token
       is as follows:

	 [ 'TEXT', qr/(?:\w+|'\w+'|"\w+")/, \&text  ],

       As mentioned previously,	our regular expression might be	naive, but
       we'll leave that	for now	and focus on the &text subroutine.

	 sub text {
	     my	($label, $value) = @_;
	     $value =~ s/^["']//;
	     $value =~ s/["']$//;
	     return [ $label, $value ];
	 }

       This says "take the label and the value,	strip leading and trailing
       quotes from the value and return	them in	an array reference".

       To strip	the whitespace,	something we don't care	about, we simply
       return nothing:

	[ 'SPACE', qr/\s*/, sub	{} ],

       Now that	we have	our lexer, let's put it	to work.  Remember that	we had
       decided that column aliases were	the "TEXT" not in parentheses but
       immediately prior to commas or the "from" keyword.  But how do we know
       if we're	inside of parentheses?	We're going to cheat a little bit:

	 [ 'PAREN', qr/\(/, sub	{ [shift,  1] }	],
	 [ 'PAREN', qr/\)/, sub	{ [shift, -1] }	],

       With that, we can add one whenever we reach an opening parenthesis and
       subtract one when we reach a closing one.  Whenever the running total
       is zero, we know that we're outside of parentheses.

       We can get the tokens by	repeatedly calling the $lexer iterator.

	 while ( defined( my $token = $lexer->() ) ) { ... }

       And the tokens would look like this:

	 [  'KEYWORD',	    'select' ]
	 [  'TEXT',	  'the_date' ]
	 [  'KEYWORD',		'as' ]
	 [  'TEXT',	      'date' ]
	 [  'COMMA',		 ',' ]
	 [  'TEXT',	     'round' ]
	 [  'PAREN',		   1 ]
	 [  'TEXT', 'months_between' ]
	 [  'PAREN',		   1 ]

       And so on ...

       Here's how we process the tokens:

	  1:  my $inside_parens	= 0;
	  2:  while ( defined (my $token = $lexer->()) ) {
	  3:	  my ($label, $value) =	@$token;
	  4:	  $inside_parens += $value if 'PAREN' eq $label;
	  5:	  next if $inside_parens || 'TEXT' ne $label;
	  6:	  if (defined (my $next	= $lexer->('peek'))) {
	  7:	      my ($next_label, $next_value) = @$next;
	  8:	      if ('COMMA' eq $next_label) {
	  9:		  print	"$value\n";
	 10:	      }
	 11:	      elsif ('KEYWORD' eq $next_label && 'from'	eq $next_value)	{
	 12:		  print	"$value\n";
	 13:		  last;	# we're	done
	 14:	      }
	 15:	  }
	 16:  }

       This is pretty straightforward, but there are some tricky bits.	Each
       token is	a two element array reference, so line three makes the label
       and value fairly	explicit.  Lines four and five use the "cheat" we
       mentioned for handling parentheses.  Five also skips anything which
       isn't text and therefore	cannot be a column alias.

       Line six is a bit odd.  With "HOP::Lexer", passing the string "peek"
       to the lexer will return	the next token without actually	advancing the
       $lexer iterator.	 From there, it's straightforward logic	to find	out if
       the value we have is a column alias which matches our criteria.
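
       In other words (a tiny illustration, not part of the original code):

	 my $upcoming = $lexer->('peek');  # look at the next token ...
	 my $token    = $lexer->();        # ... then actually consume it
	 # $upcoming and $token are the same token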

       Putting all of this together we have:

	 #!/usr/bin/perl

	 use strict;
	 use warnings;
	 use HOP::Lexer	'make_lexer';

	 my $sql = <<END_SQL;
	 select	the_date as "date",
	 round(months_between(first_date,second_date),0) months_old
	 ,product,extract(year from the_date) year
	 ,case
	   when	a=b then 'c'
	     else 'd'
	       end tough_one
	       from XXX
	 END_SQL

	 my @sql = $sql;
	 my $lexer = make_lexer(
	     sub { shift @sql },
	     [ 'KEYWORD', qr/(?i:select|from|as)/	   ],
	     [ 'COMMA',	  qr/,/				   ],
	     [ 'OP',	  qr{[-=+*/]}			   ],
	     [ 'PAREN',	  qr/\(/,      sub { [shift,  1] } ],
	     [ 'PAREN',	  qr/\)/,      sub { [shift, -1] } ],
	     [ 'TEXT',	  qr/(?:\w+|'\w+'|"\w+")/, \&text  ],
	     [ 'SPACE',	  qr/\s*/,     sub {}		   ],
	 );

	 sub text {
	     my	( $label, $value ) = @_;
	     $value =~ s/^["']//;
	     $value =~ s/["']$//;
	     return [ $label, $value ];
	 }

	 my $inside_parens = 0;
	 while ( defined ( my $token = $lexer->() ) ) {
	     my	( $label, $value ) = @$token;
	     $inside_parens += $value if 'PAREN' eq $label;
	     next if $inside_parens || 'TEXT' ne $label;
	     if	( defined ( my $next = $lexer->('peek')	) ) {
		 my ( $next_label, $next_value ) = @$next;
		 if ( 'COMMA' eq $next_label ) {
		     print "$value\n";
		 }
		 elsif ( 'KEYWORD' eq $next_label && 'from' eq $next_value ) {
		     print "$value\n";
		     last; # we're done
		 }
	     }
	 }

       And that	prints out the column aliases.

	 date
	 months_old
	 product
	 year
	 tough_one

       So are we done?  No, probably not.  What we really need now are many
       other examples of the SQL generated in the first problem statement.
       Maybe the &text subroutine is naive.  Maybe there are other operators
       we forgot.  Maybe there are floating point numbers embedded in the SQL.
       When we are forced to lex data by hand, fine-tuning the lexer to match
       our actual data can take a few tries.
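
       If floating point numbers did turn up, for example, we might add a
       token for them ahead of 'TEXT' (an illustrative guess, not tested
       against real data):

	 [ 'NUMBER', qr/\d+(?:\.\d+)?/ ],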

       It's also worth noting that precedence matters here.  Each array
       reference passed to &make_lexer is tried in the order it's passed.  If
       we passed the 'TEXT' array reference before the 'KEYWORD' array
       reference, the 'TEXT' regular expression would match keywords before
       the 'KEYWORD' pattern ever saw them, thus generating spurious results.
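
       To see the failure mode, imagine swapping those first two definitions
       (a deliberately broken sketch):

	 use HOP::Lexer 'make_lexer';

	 my @sql = ('select foo from bar');
	 my $bad_lexer = make_lexer(
	     sub { shift @sql },
	     [ 'TEXT',    qr/(?:\w+|'\w+'|"\w+")/ ],  # also matches keywords!
	     [ 'KEYWORD', qr/(?i:select|from|as)/ ],  # never gets a chance
	     [ 'SPACE',   qr/\s*/,     sub {}     ],
	 );
	 # Every word, including 'select' and 'from', now comes back
	 # labeled TEXT, and the alias-hunting logic above falls apart.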

       Happy lexing!

perl v5.24.1			  2017-07-03		HOP::Lexer::Article(3)
