bt_split_names(3)		    btparse		     bt_split_names(3)

       bt_split_names -	splitting up BibTeX names and lists of names

	  bt_stringlist	* bt_split_list	(char *	  string,
					 char *	  delim,
					 char *	  filename,
					 int	  line,
					 char *	  description);
	  void bt_free_list (bt_stringlist *list);
	  bt_name * bt_split_name (char	*  name,
				   char	*  filename,
				   int	   line,
				   int	   name_num);
	  void bt_free_name (bt_name * name);

       When BibTeX files are used for their original purpose---bibliographic
       entries describing scholarly publications---processing lists of names
       (authors	and editors mostly) becomes important.	Although such name-
       processing is outside the general-purpose database domain of most of
       the btparse library, these splitting functions are provided as a	con-
       cession to reality: most	BibTeX data files use the BibTeX conventions
       for author names, and a library to process that data ought to be	capa-
       ble of processing the names.

       Name-processing comes in	two stages: first, split up a list of names
       into individual strings;	second,	split up each name into	"parts"
       (first, von, last, and jr).  The	first is actually quite	general: you
       could pick a delimiter (such as 'and', used for lists of	names) and use
       it to divide any	string into substrings.	 "bt_split_list()" could then
       be called to break up the original string and extract the substrings.
       "bt_split_name()", however, is quite specific to	four-part author names
       written using BibTeX conventions.  (These conventions are described in-
       formally	in any BibTeX documentation; the description you will find
       here is more formal and algorithmic---and thus harder to	understand.)

       See bt_format_names for information on turning split-up names back into
       strings in a variety of ways.

	      bt_stringlist * bt_split_list (char *   string,
					     char *   delim,
					     char *   filename,
					     int      line,
					     char *   description)

	   Splits "string" into	substrings delimited by	"delim"	(a fixed
	   string).  The splitting is done according to	the rules used by Bib-
	   TeX for splitting up	a list of names, in particular:

           *   delimiters at the beginning or end of the string are not treated
               as delimiters (they remain part of the first or last substring)

	   *   delimiters must be surrounded by	whitespace

	   *   matching	of delimiters is case insensitive

	   *   delimiters at non-zero brace depth are ignored

	   For instance, if the	delimiter is "and", then the string

	      Candy and	Apples AnD {Green Eggs and Ham}

	   splits into three substrings: "Candy", "Apples", and	"{Green	Eggs
	   and Ham}".
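
           The four rules above can be illustrated with a small, self-
           contained sketch.  This is not btparse's implementation:
           "sketch_split()" and "equal_nocase()" are hypothetical helpers
           written for this example, and the sketch assumes exactly one
           whitespace character on each side of a delimiter.  It stores NULL
           for an empty substring between successive delimiters and prints
           no warnings.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive comparison of the first n characters of a and b. */
static int equal_nocase(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}

/*
 * Hypothetical helper, not part of btparse.  Splits buf in place and
 * stores up to max substring pointers in items.  Returns the number of
 * substrings found (which may exceed max).
 */
int sketch_split(char *buf, const char *delim, char **items, int max)
{
    size_t len, dlen, start = 0;
    int depth = 0, n = 0;

    assert(buf != NULL && delim != NULL && delim[0] != '\0');
    len = strlen(buf);
    dlen = strlen(delim);

    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '{')
            depth++;
        else if (buf[i] == '}' && depth > 0)
            depth--;

        /* A delimiter counts only at brace depth zero with whitespace on
           both sides; that also rules out the two ends of the string. */
        if (depth == 0 && i > 0 && i + dlen < len
            && isspace((unsigned char)buf[i - 1])
            && isspace((unsigned char)buf[i + dlen])
            && equal_nocase(buf + i, delim, dlen)) {
            buf[i - 1] = '\0';              /* terminate the substring */
            if (n < max)                    /* empty substring -> NULL  */
                items[n] = (i > start) ? buf + start : NULL;
            n++;
            start = i + dlen + 1;           /* skip delimiter and space */
            i = start - 1;                  /* loop increment lands on start */
        }
    }
    if (n < max)
        items[n] = buf + start;             /* final substring */
    return n + 1;
}
```

           Because the buffer is modified in place, the caller must pass a
           writable copy of the string (for instance a "char" array), not a
           string literal.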

	   If there are	extra delimiters at the	extremities of the
	   string---say, an "and" at the beginning of the string---then	they
	   are included	in the first/last string; no warning is	currently
	   printed, but	this may change.  Successive delimiters	("and and")
	   result in a warning and a NULL string being added to	the list of
	   substrings.	For instance, the string

	      and Joe Q. Blow and and Smith, Jr., John

           would split into three substrings: "and Joe Q. Blow", NULL (a null
           pointer, not the string "NULL"), and "Smith, Jr., John".

	   (If these rules seem	somewhat odd, don't blame me: I	just imple-
	   mented BibTeX's observed behaviour and added	warning	messages for
	   one of the more obvious and easily-detected mistakes.)

	   The substrings are returned as a "bt_stringlist" structure:

              typedef struct
              {
                 char *  string;
                 int     num_items;
                 char ** items;
              } bt_stringlist;

	   There is currently no elegant interface to this structure: you just
	   have	to poke	around in it yourself.	The fields are:

           string
               a copy of the "string" parameter passed to "bt_split_list()",
	       but with	NUL characters replacing the space after each sub-
	       string.	(This is safe because delimiters must be surrounded by
	       whitespace, which means that each substring is followed by
	       whitespace which	is not part of the substring.)	You probably
	       shouldn't fiddle	with "string"; it's just there so that
	       "bt_free_list()"	has something to "free()".

           num_items
               the number of substrings found in the string passed to
               "bt_split_list()".

           items
               an array of "num_items" pointers into "string".  For instance,
	       "items[1]" points to the	second substring.  Since "string" has
	       been mangled with NUL characters, it is safe to treat
	       "items[i]" as a regular C string.

	       "filename", "line", and "description" are all used for generat-
	       ing warning messages.  "filename" and "line" simply describe
	       where the string	came from, and "description" is	a brief	(one
	       word) description of the	substrings.  For instance, if you are
	       splitting a list	of names, supply "name"	for "descrip-
	       tion"---that way, warnings will refer to	"name X" rather	than
	       "substring x".

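           Since there is no accessor interface, walking the list by hand
           might look like the sketch below.  To keep the example self-
           contained, it declares a local copy of the typedef shown above;
           real code would take the definition from the library's header
           instead.  "print_items()" is a hypothetical helper, not a btparse
           function.  The NULL check matters because successive delimiters
           leave a NULL entry in "items".

```c
#include <assert.h>
#include <stdio.h>

/* Local mirror of the typedef shown above, for illustration only. */
typedef struct
{
    char *  string;
    int     num_items;
    char ** items;
} bt_stringlist;

/*
 * Hypothetical helper: prints every substring in list to out and
 * returns the number of non-NULL substrings.
 */
int print_items(FILE *out, const bt_stringlist *list)
{
    int printed = 0;

    assert(out != NULL && list != NULL);
    for (int i = 0; i < list->num_items; i++) {
        if (list->items[i] == NULL) {
            /* An empty substring, as left by successive delimiters. */
            fprintf(out, "%d: (empty)\n", i);
            continue;
        }
        fprintf(out, "%d: %s\n", i, list->items[i]);
        printed++;
    }
    return printed;
}
```
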
	      void bt_free_list	(bt_stringlist *list)

	   Frees a "bt_stringlist" structure as	returned by "bt_split_list()".
	   That	is, it frees the copy of the string you	passed to
	   "bt_split_list()", and then frees the structure itself.

	      bt_name *	bt_split_name (char *  name,
				       char *  filename,
				       int     line,
				       int     name_num)

	   Splits a single BibTeX-style	author name into four parts: first,
           von, last, and jr.  This handles most names in the style of the
           major Western European languages, but not all of them.  (Alas!)

	   A name is split by first dividing into tokens; tokens are separated
	   by whitespace or commas at brace-level zero.	 Thus the name

	      van der Graaf, Horace Q.

	   has five tokens, whereas the	name

	      {Foo, Bar, and Sons}

	   consists of a single	token.

	   How tokens are divided into parts depends on	the form of the	name.
	   If the name has no commas at	brace-level zero (as in	the second ex-
	   ample), then	it is assumed to be in either "first last" or "first
	   von last" form.  If there are no tokens that	start with a lower-
	   case	letter,	then "first last" form is assumed: the final token is
	   the last name, and all other	tokens form the	first name.  Other-
	   wise, the earliest contiguous sequence of tokens with initial
	   lower-case letters is taken as the `von' part; if this sequence in-
	   cludes the final token, then	a warning is printed and the final to-
	   ken is forced to be the `last' part.

	   If a	name has a single comma, then it is assumed to be in "von
	   last, first"	form.  A leading sequence of tokens with initial
	   lower-case letters, if any, forms the `von' part; tokens between
	   the `von' and the comma form	the `last' part; tokens	following the
           comma form the `first' part.  Again, if no tokens follow a leading
           sequence of lowercase tokens, a warning is printed
	   and the token immediately preceding the comma is taken to be	the
	   `last' part.

	   If a	name has more than two commas, a warning is printed and	the
	   name	is treated as though only the first two	commas were present.

	   Finally, if a name has two commas, it is assumed to be in "von
	   last, jr, first" form.  (This is the	only way to represent a	name
	   with	a `jr' part.)  The parsing of the name is the same as for a
	   one-comma name, except that tokens between the two commas are taken
	   to be the `jr' part.

	   The one case	not properly handled by	BibTeX name conventions	is a
	   name	with a 'jr' part not separated from the	last name by a comma;
	   for example:

	      Henry Ford Jr.
	      George Herbert Walker Bush III

	   Both	of these would be incorrectly interpreted by both BibTeX and
	   bt_split_name(): the	"Jr." or "III" token would be taken as the
           last name, and the other tokens as a two- or four-part first
	   name.  The workaround is to shoehorn	the 'jr' into the last name:

	      Henry {Ford Jr.}
	      George Herbert Walker {Bush III}

	   but this will make it impossible to extract the last	name on	its
	   own,	e.g. to	generate "author-year" style citations.	 This design
	   flaw	may be fixed in	a future version of btparse.

	   The split-up	name is	returned as a "bt_name"	structure:

              typedef struct
              {
                 bt_stringlist * tokens;
                 char ** parts[BT_MAX_NAMEPARTS];
                 int     part_len[BT_MAX_NAMEPARTS];
              } bt_name;

	   Again, there's no nice interface to this structure; you'll just
	   have	to access the fields individually.  They are:

           tokens
               the name, broken down into a flat list of tokens.  See above
	       for a description of the	"bt_stringlist"	structure.

           parts
               an array of arrays of pointers into the token list.  The major
	       dimension of this beast is the "name part;" you should index
	       this dimension using the	"bt_namepart" enum.  For instance,
	       "parts[BTN_LAST]" is an array of	pointers to the	tokens com-
	       prising the last	name; "parts[BTN_LAST][1]" is a	"char *": the
	       second token of the 'last' part;	and "parts[BTN_LAST][1][0]" is
	       the first character of the second token of the 'last' part.

           part_len
               the length, in tokens, of each part.  For instance, you might
	       loop over all tokens in the 'first' part	as follows (assuming
	       "name" is a "bt_name *" returned	by "bt_split_name()"):

		  for (i = 0; i	< name->part_len[BTN_FIRST]; i++)
		     printf ("token %d of first	name: %s\n",
			     i,	name->parts[BTN_FIRST][i]);

	      void bt_free_name	(bt_name * name)

	   Frees the "bt_name" structure created by "bt_split_name()" (includ-
	   ing the "bt_stringlist" structure inside the	"bt_name").

       btparse,	bt_format_names

       Greg Ward

btparse, version 0.34		  2003-10-25		     bt_split_names(3)

