bt_language(3)			    btparse			bt_language(3)

       bt_language - the BibTeX	data language, as recognized by	btparse

	  # Lexical grammar, mode 1: top-level
	  AT			\@
	  NEWLINE		\n
	  COMMENT		\%~[\n]*\n
	  WHITESPACE		[\ \r\t]+
	  JUNK			~[\@\n\ \r\t]+

	  # Lexical grammar, mode 2: in-entry
	  NEWLINE		\n
	  COMMENT		\%~[\n]*\n
	  WHITESPACE		[\ \r\t]+
	  NUMBER		[0-9]+
	  NAME			[a-z0-9\!\$\&\*\+\-\.\/\:\;\<\>\?\[\]\^\_\`\|]+
	  LBRACE		\{
	  RBRACE		\}
	  LPAREN		\(
	  RPAREN		\)
	  EQUALS		=
	  HASH			\#
	  COMMA			,
	  QUOTE			\"

	  # Lexical grammar, mode 3: strings
	  # (very hairy	-- see text)

	  # Syntactic grammar:
	  bibfile : ( entry )*

	  entry	: AT NAME body

	  body : STRING			   # for comment entries
	       | ENTRY_OPEN contents ENTRY_CLOSE

	  contents : ( NAME | NUMBER ) COMMA fields   #	for regular entries
		   | fields		   # for macro definition entries
		   | value		   # for preamble entries

	  fields : field { COMMA fields	}

	  field	: NAME EQUALS value

	  value	: simple_value ( HASH simple_value )*

	  simple_value : STRING
		       | NUMBER
		       | NAME

       One of the problems with	BibTeX is that there is	no formal specifica-
       tion of the language.  This means that users exploring the arcane cor-
       ners of the language are	largely	on their own, and programmers imple-
       menting their own parsers are completely	on their own---except for ob-
       serving the behaviour of	the original implementation.

       Other parser implementors (Nelson Beebe of "bibclean" fame, in particu-
       lar) have taken the trouble to explain the language accepted by their
       parser, and in that spirit the following	is presented.

       If you are unfamiliar with the arcana of regular and context-free
       languages, you will not have an easy time understanding this.  This is
       not an introduction to the BibTeX language; any LaTeX book would be
       more suitable for learning the data language itself.

       The lexical scanner has three distinct modes: top-level,	in-entry, and
       string.	Roughly	speaking, top-level is the initial mode; we enter in-
       entry mode on seeing an "@" at top-level; and on	seeing the "}" or ")"
       that ends the entry, we return to top-level.  We	enter string mode on
       seeing a	""" or non-entry-delimiting "{"	from in-entry mode.  Note that
       the lexical language is both non-regular	(because braces	must balance)
       and context-sensitive (because "{" can mean different things depending
       on its syntactic	context).  That	said, we will use regular expressions
       to describe the lexical elements, because they are the starting point
       used by the lexical scanner itself.  The	rest of	the lexical grammar
       will be informally explained in the text.

       From top-level, the following tokens are	recognized according to	the
       regular expressions on the right:

	  AT			\@
	  NEWLINE		\n
	  COMMENT		\%~[\n]*\n
	  WHITESPACE		[\ \r\t]+
	  JUNK			~[\@\n\ \r\t]+

       (Note that this is PCCTS regular expression syntax, which should be
       fairly familiar to users of other regex engines.  One oddity is that a
       character class is negated as "~[...]" rather than "[^...]".)

       On seeing "at" at top-level, we enter in-entry mode.  Whitespace, junk,
       newlines, and comments are all skipped, with the	latter two increment-
       ing a line counter.  (Junk is explicitly	recognized to allow for	"bib-
       tex"'s "implicit	comment" scheme.)
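
       As a rough illustration only (btparse's real scanner is generated by
       PCCTS, not written this way), the top-level rules can be transcribed
       into Python's "re" syntax, with "~[...]" rewritten as "[^...]".  The
       function name "skip_to_entry" and its return shape are hypothetical.

```python
import re

# Top-level token rules, transcribed from PCCTS syntax to Python's "re"
# syntax ("~[...]" becomes "[^...]").  Alternatives are tried in order.
TOP_LEVEL = re.compile(
    r"(?P<AT>@)"
    r"|(?P<NEWLINE>\n)"
    r"|(?P<COMMENT>%[^\n]*\n)"
    r"|(?P<WHITESPACE>[ \r\t]+)"
    r"|(?P<JUNK>[^@\n \r\t]+)"
)


def skip_to_entry(text, pos=0, line=1):
    """Skip top-level input until an "@" is found (hypothetical helper).

    Returns (position just past the "@", current line number), or
    (None, line) if the input ends before any "@" is seen.
    """
    while pos < len(text):
        # Every character falls into one of the five classes above,
        # so this match always succeeds.
        m = TOP_LEVEL.match(text, pos)
        pos = m.end()
        if m.lastgroup == "AT":
            return pos, line
        if m.lastgroup in ("NEWLINE", "COMMENT"):
            line += 1  # each of these consumes exactly one newline
        # WHITESPACE and JUNK are skipped without further action
    return None, line
```

       Everything before the "@" is discarded, which is what lets free text
       between entries act as an implicit comment.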

       From in-entry mode, we recognize newline, comment, and whitespace
       identically to top-level mode.  In addition, the following tokens are
       recognized:

	  NUMBER		[0-9]+
	  NAME			[a-z0-9\!\$\&\*\+\-\.\/\:\;\<\>\?\[\]\^\_\`\|]+
	  LBRACE		\{
	  RBRACE		\}
	  LPAREN		\(
	  RPAREN		\)
	  EQUALS		=
	  HASH			\#
	  COMMA			,
	  QUOTE			\"

       At this point, the lexical scanner starts to sound suspiciously like a
       context-free grammar, rather than a collection of independent regular
       expressions.  However, it is necessary to keep this complexity in the
       scanner because certain characters ("{" and "(" in particular) have
       very different lexical meanings depending on the	tokens that have pre-
       ceded them in the input stream.

       In particular, "{" and "(" are treated as "entry	openers" if they fol-
       low one "at" and	one "name" token, unless the value of the "name" token
       is "comment".  (Note the	switch from top-level to in-entry between the
       two tokens.)  In	the @comment case, the delimiter is considered as
       starting	a string, and we enter string mode.  Otherwise,	the delimiter
       is saved, and when we see a corresponding "}" or	")" it is considered
       an "entry closer".  (Braces are balanced	for free here because the
       string lexer takes care of counting brace-depth.)

       Anywhere	else, "{" is considered	as starting a string, and we enter
       string mode.  """ always	starts a string, regardless of context.	 The
       other tokens ("name", "number", "equals", "hash", and "comma") are rec-
       ognized unconditionally.
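
       The context-dependent treatment of "{" and "(" described above might
       be sketched as follows.  The token representation (a list of
       (type, text) pairs), the function name, and the case-insensitive
       comparison against "comment" are all assumptions made for the sake
       of the example.

```python
def classify_delimiter(prev_tokens, delim):
    """Classify "{" or "(" given the tokens that precede it in the entry.

    prev_tokens is a list of (type, text) pairs, for example
    [("AT", "@"), ("NAME", "book")].  Returns "entry_open",
    "string_start", or None where the description above is silent.
    """
    after_entry_type = (
        len(prev_tokens) == 2
        and prev_tokens[0][0] == "AT"
        and prev_tokens[1][0] == "NAME"
    )
    if after_entry_type:
        # Assumption: the entry type is compared case-insensitively.
        if prev_tokens[1][1].lower() == "comment":
            return "string_start"
        return "entry_open"
    if delim == "{":
        return "string_start"
    return None  # "(" anywhere else is not covered by the description


print(classify_delimiter([("AT", "@"), ("NAME", "book")], "{"))
```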

       Note that "name"	is a catch-all token used for entry types, citation
       keys, field names, and macro names; because BibTeX has slightly differ-
       ent (largely undocumented) rules	for these various elements, a bit of
       trickery	is needed to make things work.	As a starting point, consider
       BibTeX's	definition of what's allowed for an entry key: a sequence of
       any characters except

	  " # %	' ( ) ,	= { }

       plus space.  There are a	couple of problems with	this scheme.  First,
       without specifying the character	set from which those "magic 10"	char-
       acters are drawn, it's a	bit hard to know just what is allowed.	Sec-
       ond, allowing "@" characters could lead to confusing BibTeX syntax (it
       doesn't confuse BibTeX, but it might confuse a human reader).  Finally,
       allowing	certain	characters that	are special to TeX means that BibTeX
       can generate bogus TeX code: try	putting	a backslash ("\") or tilde
       ("~") in	a citation key.	 (This last exception is rather	specific to
       the "generating (La)TeX code from a BibTeX database" application, but
       since that's the	major application for BibTeX databases,	then it	will
       presumably be the major application for btparse,	at least initially.
       Thus, it	makes sense to pay attention to	this problem.)

       In btparse, then, a name	is defined as any sequence of letters, digits,
       underscores, and	the following characters:

	  ! $ &	* + - .	/ : ; <	> ? [ ]	^ _ ` |

       This list was derived by	removing BibTeX's "magic 10" from the set of
       printable 7-bit ASCII characters	(32-126), and then further removing
       "@", "\", and "~".  This	means that btparse disallows some of the
       weirder entry keys that BibTeX would accept, such as "\foo@bar",	but
       still allows a string with initial digits.  In fact, from the above
       definition it appears that btparse would	accept a string	of all digits
       as a "name;" this is not	the case, though, as the lexical scanner rec-
       ognizes such a digit string as a	number first.  There are two problems
       here: BibTeX entry keys may in fact be entirely numeric,	and field
       names may not begin with	a digit.  (Those are two of the	not-so-obvious
       differences in BibTeX's handling	of keys	and field names.)  The tricks
       used to deal with these problems	are implemented	in the parser rather
       than the lexical scanner, so are described in "SYNTACTIC GRAMMAR" below.
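
       The derivation of the name character set, and the "number first"
       rule, can be checked with a short Python sketch.  The names used
       here ("MAGIC_10", "classify") are hypothetical, and accepting both
       upper- and lowercase letters is an assumption based on the prose
       description ("letters") rather than on the literal "[a-z...]" class.

```python
import re
import string

# BibTeX's "magic 10": characters never allowed in an entry key.
MAGIC_10 = set("\"#%'(),={}")

# Printable 7-bit ASCII without the space character (codes 33-126).
printable = {chr(c) for c in range(33, 127)}

# Remove the magic 10, then "@", "\" and "~".
allowed = printable - MAGIC_10 - set("@\\~")

# What remains besides letters and digits is the punctuation list.
punct = sorted(allowed - set(string.ascii_letters) - set(string.digits))
print(" ".join(punct))

NUMBER = re.compile(r"[0-9]+")
NAME = re.compile("[A-Za-z0-9" + re.escape("".join(punct)) + "]+")


def classify(word):
    """Return "NUMBER" or "NAME" for a whole word, trying NUMBER first."""
    if NUMBER.fullmatch(word):
        return "NUMBER"
    if NAME.fullmatch(word):
        return "NAME"
    return None
```

       Because "NUMBER" is tried first, an all-digit word never reaches the
       "NAME" rule, while a word with leading digits followed by other name
       characters is still a name.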

       The string lexer	recognizes "lbrace", "rbrace", "lparen", and "rparen"
       tokens in order to count	brace- or parenthesis-depth.  This is neces-
       sary so it knows	when to	accept a string	delimited by braces or paren-
       theses.	(Note that a parenthesis-delimited string is only allowed af-
       ter @comment---this is not a normal BibTeX construct.)  In addition, it
       converts	each non-space whitespace character (newline, carriage-return,
       and tab)	to a single space.  (Sequences of whitespace are not col-
       lapsed; that's the domain of string post-processing, which is well re-
       moved from the scanner or parser.)  Finally, it accepts """ to delimit
       quote-delimited strings.	 Apart from those restrictions,	the string
       lexer accepts anything up to the	end-of-string delimiter.
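
       For the brace-delimited case, the depth counting and whitespace
       conversion could look roughly like this.  It is a simplified sketch,
       not btparse's string lexer: quote- and parenthesis-delimited strings
       are left out, and returning the contents without the outer braces is
       a choice made here for illustration.

```python
def scan_braced_string(text, pos):
    """Scan a brace-delimited string whose opening "{" is at text[pos].

    Returns (contents without the outer braces, position just past the
    closing "}").  Raises ValueError if the braces never balance.
    """
    if text[pos] != "{":
        raise ValueError("expected '{' at position %d" % pos)
    depth = 0
    out = []
    for i in range(pos, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # out[0] is the opening brace; drop it from the result
                return "".join(out)[1:], i + 1
        # Newline, carriage return and tab each become a single space;
        # runs of whitespace are deliberately not collapsed here.
        out.append(" " if ch in "\n\r\t" else ch)
    raise ValueError("unbalanced braces")


print(scan_braced_string("{a {b}\n\tc} rest", 0))
```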

       (The language used to describe the grammar here is the extended Backus-
       Naur Form (EBNF)	used by	PCCTS.	Terminals are represented by uppercase
       strings,	non-terminals by lowercase strings; terminal names are the
       same as those given in the lexical grammar above.  "( foo )*" means
       zero or more repetitions	of the "foo" production, and "{	foo }" means
       an optional "foo".)

       A file is just a	sequence of zero or more entries:

	  bibfile : ( entry )*

       An entry	is an at-sign, a name (the "entry type"), and the entry	body:

	  entry	: AT NAME body

       A body is either	a string (this alternative is only tried if the	entry
       type is "comment") or the entry contents:

	  body : STRING			   # for comment entries
	       | ENTRY_OPEN contents ENTRY_CLOSE

       ("ENTRY_OPEN" and "ENTRY_CLOSE" are either "{" and "}" or "(" and ")",
       depending on what is seen in the input for a particular entry.)

       There are three possible	productions for	the "contents" non-terminal.
       Only one	applies	to any given entry, depending on the entry metatype
       (which in turn depends on the entry type).  Currently, btparse supports
       four entry metatypes: comment, preamble,	macro definition, and regular.
       The first two correspond	to @comment and	@preamble entries; "macro def-
       inition"	is for @string entries;	and "regular" is for all other entry
       types.  (The library will be extended to	handle @modify and @alias en-
       try types, and corresponding "modify" and "alias" metatypes, when Bib-
       TeX 1.0 is released and the exact syntax	is known.)  The	"metatype"
       concept is necessary so that all	entry types that aren't	specifically
       recognized fall into the	"regular" metatype.  It's also convenient not
       to have to "strcmp" the entry type all the time.

	  contents : ( NAME | NUMBER ) COMMA fields	# for regular entries
		   | fields		   # for macro definition entries
		   | value		   # for preamble entries

       Note that the entry key is not just a "NAME", but "( NAME | NUMBER )".
       This is necessary because BibTeX allows all-numeric entry keys, but
       btparse's lexical scanner recognizes such digit strings as "NUMBER"
       tokens.

       "fields"	is a comma-separated list of fields, with an optional single
       trailing	comma:

	  fields : field { COMMA fields	}

       A "field" is a single "field = value" assignment:

	  field	: NAME EQUALS value

       Note that "NAME"	here is	a restricted version of	the "name" token de-
       scribed in "LEXICAL GRAMMAR" above.  Any	"name" token will be accepted
       by the parser, but it is	immediately checked to ensure that it doesn't
       begin with a digit; if so, an artificial	syntax error is	triggered.
       (This is	for compatibility with BibTeX, which doesn't allow field names
       to start	with a digit.)
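
       The two parser-level tricks (accepting a number as an entry key, and
       rejecting a field name with a leading digit) amount to checks like
       the following.  Tokens are modelled as (type, text) pairs and the
       function names are hypothetical; btparse itself is written in C and
       does not necessarily structure the checks this way.

```python
DIGITS = "0123456789"


def entry_key(token):
    """Accept a NAME or NUMBER token as an entry key and return its text."""
    kind, text = token
    if kind not in ("NAME", "NUMBER"):
        raise SyntaxError("expected entry key, got %s" % kind)
    return text


def field_name(token):
    """Accept a NAME token as a field name unless it begins with a digit."""
    kind, text = token
    if kind != "NAME":
        raise SyntaxError("expected field name, got %s" % kind)
    if text[0] in DIGITS:
        # The "artificial" syntax error: the lexer accepted the token,
        # but BibTeX does not allow such a field name.
        raise SyntaxError("field name may not begin with a digit: %r" % text)
    return text


print(entry_key(("NUMBER", "1984")), field_name(("NAME", "title")))
```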

       A "value" is a series of	simple values joined by	'#' characters:

	  value	: simple_value ( HASH simple_value )*

       A simple	value is a string, number, or name (for	macro invocations):

	  simple_value : STRING
		       | NUMBER
		       | NAME
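
       The "value" and "simple_value" productions translate directly into
       a small recursive-descent routine.  This is a sketch over an assumed
       token list of (type, text) pairs, not the PCCTS-generated parser;
       the name "parse_value" is hypothetical.

```python
def parse_value(tokens, pos=0):
    """Parse  value : simple_value ( HASH simple_value )*

    tokens is a list of (type, text) pairs.  Returns (list of simple
    value tokens, position of the first token after the value).
    Raises SyntaxError on malformed input.
    """
    simple_types = ("STRING", "NUMBER", "NAME")

    def simple_value(p):
        if p >= len(tokens) or tokens[p][0] not in simple_types:
            raise SyntaxError("expected STRING, NUMBER or NAME at token %d" % p)
        return tokens[p], p + 1

    first, pos = simple_value(pos)
    parts = [first]
    # ( HASH simple_value )* : keep going while the next token is "#"
    while pos < len(tokens) and tokens[pos][0] == "HASH":
        part, pos = simple_value(pos + 1)
        parts.append(part)
    return parts, pos


tokens = [("NAME", "jan"), ("HASH", "#"), ("STRING", " 1, "),
          ("HASH", "#"), ("NUMBER", "1999"), ("COMMA", ",")]
print(parse_value(tokens))
```

       Parsing stops at the first token that cannot continue the value (the
       "COMMA" in the usage above), leaving it for the caller to handle.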


       Greg Ward

btparse, version 0.34		  2003-10-25			bt_language(3)

