Skip site navigation (1)Skip section navigation (2)

FreeBSD Manual Pages

  
 
  

home | help
Marpa::XS::Advanced::MUsersContributed Perl DocuMarpa::XS::Advanced::Models(3)

NAME
       Marpa::XS::Advanced::Models - Other Input Models

ABOUT THIS DOCUMENT
       The alternative input models described in this document are an advanced
       technique.  If you are starting out with	Marpa, you probably want to
       ignore this document.  If you are an experienced	Marpa user, it is
       still safe to ignore this document, but you might find the
       possibilities it	discusses interesting.

DESCRIPTION
   What	is an alternative input	model?
       An alternative input model is anything that is not the default, token-
       stream model.  More helpfully, Marpa allows variable-length tokens and
       ambiguous tokens, and an	alternative input model	is any input model
       which

       o   Allows a token whose	length is not exactly 1, or

       o   Allows locations which have more than one token.

       To do either of these things, a user must use the recognizer's
       "alternative" method.  In other words, if you are not using the
       recognizer's "alternative" method call, you are not using an
       alternative input method.

       Many concepts, such as parsing location,	parse exhaustion, and the end
       of parsing, are somewhat	more complicated when alternative input	models
       are involved.  These concepts were explained in the main	document for
       the recognizer on the assumption	that the default input model was in
       use.  This document revises those explanations as necessary to take
       into account the	alternative input models.

   Token streams
       Marpa's default input model is the traditional one -- a token stream.
       Token streams are very standard in parsing applications -- so much so
       that most texts do not take the trouble of defining the term.  A	token
       stream is input structured as a sequence	of tokens, where each token
       occupies	one location and every location	has a token.  In the token
       stream model, all tokens	are of the same	length.

       Conventionally, all tokens are of length	1, and the token stream	starts
       at location 0.  Following this convention, the Nth token	would start at
       location	N-1 and	end at location	N.  For	example, the first token would
       start at	location 0 and end at location 1.  The second token would
       start at	location 1 and end at location 2.

   Earlemes
       For most	parsers, position is location in a token stream.  To deal with
       variable-length and overlapping tokens, Marpa needs a more flexible
       idea of location.

       Marpa's tracks position in terms	of earlemes.  Earlemes are named after
       Jay Earley, the inventor	of the first algorithm in Marpa's lineage.
       Every token has a start earleme and an end earleme.

       The token stream	model may also be called the token-per-earleme model.
       In the token stream model, token	location and earleme location are
       exactly identical an one-to-one basis.  If a user's application uses
       the token stream	model, he can ignore earlemes, and think entirely in
       terms of	tokens and position in a token stream.	Because	of this, the
       main Marpa documents speak simply of the	"location" in the parse.

   The furthest	earleme
       The furthest earleme is the last	earleme	at which a token ends.	In the
       default input model, the	furthest earleme is not	an important concept.
       In the default input model, the furthest	earleme	and the	current
       earleme are always the same.

       In alternative input models, tokens may be longer than 1	earleme, and
       the furthest earleme and	the current earleme may	be far apart.  This
       becomes an issue	when parsing is	finished.  Alternative input models
       use the recognizer's "end_input"	method to ensure that processing of
       input catches up	to the furthest	earleme.

METHODS
   alternative
	   defined $recce->alternative(	'a', 42, 1 )
	       or return 'First	alternative failed';

       The "alternative" method	is the most general method for reading input,
       and is used in alternative input	models.	 It takes three	arguments,
       only the	first of which is required.

       The first two arguments are similar to those of the recognizer's	"read"
       method: the token type and the token value.  As with the	"read" method,
       the token value is optional and,	if omitted, defaults to	a Perl
       "undef".

       The third argument to the "alternative" method is a token length.  If
       omitted,	token length defaults to 1, which is the correct value for the
       token stream model.  Its	value can be any integer greater than zero.
       Marpa does not allow zero length	tokens in any input model.

       Unlike the "read" method, the "alternative" method does not advance the
       current location	(current earleme) on each call.	 This allows the
       application to read several tokens at the same earleme.	This is	how
       ambiguous input is created.  To advance the current earleme when	input
       is read using the "alternative" method, the "complete_earleme" method
       must be called.

       On success, "alternative" returns the current earleme.  On failures,
       other than the rejection	of a token in interactive mode,	"alternative"
       throws an exception.  If	the token is rejected, "alternative" returns a
       Perl "undef".

   current_earleme
	   $current_earleme = $recce->current_earleme();

       Returns the current parse location, also	known as the current earleme.
       Not often needed.

   earleme_complete
	   $recce->earleme_complete();

       Processes all tokens at the current earleme and advances	the current
       earleme by 1.  Returns the number of terminals acceptable at the	new
       current earleme.	 Note that, in alternative input models, a return
       value of	0 does not necessarily indicate	an exhausted parser.

       When reading input using	the "alternative" method, "earleme_complete"
       is used to move forward in the input stream.  All tokens	read using the
       "alternative" method are	read at	the same location -- the current
       earleme.	 A "earleme_complete" call causes the tokens read using	the
       "alternative" method to be processed, and the current earleme to	be
       advanced.

       "earleme_complete" may be called	even if	the "alternative" method has
       been not	called since the last call to "earleme_complete".  This	will
       create an earleme with no tokens, which is often	useful.

   exhausted
	       $recce->exhausted() and die 'Recognizer exhausted';

       The "exhausted" method returns a	Perl true if parsing in	a recognizer
       is exhausted, and a Perl	false otherwise.  Parsing is exhausted when
       the recognizer will not accept any further input.  This method is
       seldom used.  For most applications, the	return values of the other
       methods provide sufficient information about the	parse status, and
       there is	no need	to specifically	test for parse exhaustion.

       In the default input model, a recognizer	was exhausted if the "read"
       method returned 0, indicating that no tokens would be accepted at the
       new current earleme.  In	alternative input models, "earleme_complete"
       will always return 0 when the parser is exhausted, but the converse is
       not always true:	"earleme_complete" may return 0	even in	some cases
       when the	parser is not exhausted.

       When "earleme_complete" returns 0, no input will	be accepted at the
       current earleme.	 But in	input models which allow tokens	longer than 1,
       it is possible for input	to be accepted at earlemes after the current
       earleme,	even if	no input is accepted at	the current earleme.

   end_input
	   $recce->end_input();

       Indicates that input is finished.  Calling "end_input" is not necessary
       or useful in the	default	input model, because in	the default input
       model no	token has a length greater than	1.

       The "end_input" method takes no arguments.  The "end_input" method
       returns a Perl true value on success.  On failure, it throws an
       exception.

       In alternative input models, calling the	"earleme_complete" method once
       input is	finished does not ensure that all input	has been processed.
       The "earleme_complete" method completes the current earleme, but	in
       alternative models, tokens may extend well past the current earleme.
       The "end_input" method ensures that all input is	processed.

       Calling "end_input" multiple times on the same recognizer object	is
       harmless, but useless.  The second and subsequent calls will return a
       Perl true, but will have	no effect.

ALTERNATIVE MODELS AND THE "read" METHOD
       A recognizer can	mix calls to its "read"	method with calls to its
       "alternative" method.  The "read" method	has the	same effect as a
       single call to the "alternative"	method,	followed immediately by	a call
       of the "earleme_complete" method.  More precisely, the "read" method is
       equivalent to

	   sub Marpa::XS::Recognizer::read {
	       my $recce = shift;
	       return defined $recce->alternative(@_) ?	$recce->earleme_complete() : undef;
	   }

AMBIGUOUS LEXING
       Marpa allows ambiguous tokens.  Several Marpa tokens can	start at a
       single parsing location.	 Ambiguous tokens can be of various lengths.
       Tokens can also overlap.

       Potentially ambiguous lexing occurs when	more than one token starts at
       a single	earleme.  When potentially ambiguous lexing occurs, it becomes
       possible	for there to be	more than one sequence of tokens.

       An actual lexical ambiguity only	occurs if more than one	of the
       potential token sequences is consistent with the	grammar.  If there is
       no actual lexical ambiguity, Marpa will use the only token choice that
       is consistent with the grammar.

       When lexing is actually ambiguous, Marpa	will use all the alternatives
       consistent with the grammar.  When the lexing in	a parse	is actually
       ambiguous, the parse will be ambiguous.	From the point of view of
       Marpa's semantics, ambiguities caused by	lexing look the	same as
       ambiguities caused by an	ambiguous grammar.

       In the standard terminology, if a grammar produces more than one	parse
       tree for	any input, then	that grammar must be ambiguous.	 In Marpa this
       is not strictly true.  In Marpa,	if the input is	ambiguous, even	an
       unambiguous grammar can produce than one	parse.

DUPLICATE TOKENS
       A duplicate token is a token of the same	type and the same length as
       another that was	read at	the same earleme.  Duplicate tokens are
       impossible in the default, token-stream,	model.	This is	because	in the
       token-stream model only one token can be	read at	each earleme.

       In alternative models, more than	one token may be read at an earleme,
       and duplicates are possible.  Marpa detects duplicate tokens and	treats
       them as "hard errors" --	Marpa throws an	exception when it sees a
       duplicate token.	 Marpa's assumption is that duplicate tokens indicate
       an error	at the application level.

       An application can retry	input after a duplicate	token, if it catches
       the exception.  In the future, if recovery from duplicate tokens	is
       found to	be a useful technique, Marpa may provide an option to change
       its behavior, so	that a soft failure is returned	when there is a
       duplicate token.

EARLEMES: THE DETAILS
       While scanning, Marpa keeps track of the	current	earleme.  Earlemes in
       a parse start at	earleme	0 and increase numerically.  The earleme
       immediately following earleme 0 is earleme 1, the earleme immediately
       following earleme 1 is earleme 2, and so	on.  The earleme immediately
       following earleme N is always earleme N+1.

       Distance	in the earleme stream is what you would	intuitively expect it
       to be.  The distance between earleme X and earleme Y is the absolute
       value of	the difference between X and Y,	|X-Y|.	The distance from
       earleme 3 to earleme 6, for example, is 3 earlemes.

       Whenever	a token	is given to Marpa to be	scanned, it starts at the
       current earleme.	 In addition to	the type and value of the token, Marpa
       must be told token's length in earlemes.	 The length of a Marpa token
       must be greater than zero.

       This earleme length will	become the distance from the start of the
       token to	the end	of the token.  If the length of	the token is L,	and
       the number of the current earleme is C, the end of the token will be at
       the earleme whose number	is C+L.

THE CHARACTER-PER-EARLEME MODEL
       Many different models of	the relationship between tokens	and earlemes
       are possible, but two are particularly important.  One is the one-
       token-per-earleme model,	which is the default, and which	has already
       been described.	The other is the one-character-per-earleme model.

       In the one-character-per-earleme	model, every character will be treated
       as being	exactly	one earleme in length.	If a token is more than	one
       character in length, that token will span earlemes.  When the lexing is
       ambiguous, tokens may overlap.

       When a one-character-per-earleme	model of input is used,	there may be
       many earlemes at	which no tokens	start.	For example, in	a
       straightforward character-per-earleme implementation of a grammar for a
       language	which allows comments, no tokens will start at any earlemes
       which corresponds to character locations	inside a comment.

OTHER INPUT MODELS
       So far only the token-per-earleme and character-per-earleme models have
       seen any	real use in Marpa programs.  But other models are certainly
       possible.  Using	earlemes, you can structure your input in almost any
       way you like.

       There are only three restrictions:

       1.  Scanning always starts at earleme 0.

       2.  All tokens starting at earleme N must be scanned before any tokens
	   starting at earleme N+1.  In	other words, the tokens	must be
	   scanned in non-decreasing order by start earleme.

       3.  Every token must have a length, in earlemes,	which is greater than
	   zero.  In other words, token	length can never be zero or negative.

COPYRIGHT AND LICENSE
	 Copyright 2012	Jeffrey	Kegler
	 This file is part of Marpa::XS.  Marpa::XS is free software: you can
	 redistribute it and/or	modify it under	the terms of the GNU Lesser
	 General Public	License	as published by	the Free Software Foundation,
	 either	version	3 of the License, or (at your option) any later	version.

	 Marpa::XS is distributed in the hope that it will be useful,
	 but WITHOUT ANY WARRANTY; without even	the implied warranty of
	 MERCHANTABILITY or FITNESS FOR	A PARTICULAR PURPOSE.  See the GNU
	 Lesser	General	Public License for more	details.

	 You should have received a copy of the	GNU Lesser
	 General Public	License	along with Marpa::XS.  If not, see
	 http://www.gnu.org/licenses/.

perl v5.32.0			  2020-08-27	Marpa::XS::Advanced::Models(3)

NAME | ABOUT THIS DOCUMENT | DESCRIPTION | METHODS | ALTERNATIVE MODELS AND THE "read" METHOD | AMBIGUOUS LEXING | DUPLICATE TOKENS | EARLEMES: THE DETAILS | THE CHARACTER-PER-EARLEME MODEL | OTHER INPUT MODELS | COPYRIGHT AND LICENSE

Want to link to this manual page? Use this URL:
<https://www.freebsd.org/cgi/man.cgi?query=Marpa::XS::Advanced::Models&sektion=3&manpath=FreeBSD+12.2-RELEASE+and+Ports>

home | help