Text::BibTeX::Name(3) User Contributed Perl Documentation Text::BibTeX::Name(3)

       Text::BibTeX::Name - interface to BibTeX-style author names

	  use Text::BibTeX::Name;

	  $name	= Text::BibTeX::Name->new();
	  $name->split('J. Random Hacker');
	  # or:
	  $name	= Text::BibTeX::Name->new('J. Random Hacker');

	  @firstname_tokens = $name->part ('first');
	  $lastname = join (' ', $name->part ('last'));

	  $format = Text::BibTeX::NameFormat->new();
	  # ...customize $format...
	  $formatted = $name->format ($format);

       "Text::BibTeX::Name" provides an	abstraction for	BibTeX-style names and
       some basic operations on	them.  A name, in the BibTeX world, consists
       of a list of tokens which are divided amongst four parts: `first',
       `von', `last', and `jr'.

       Tokens are separated by whitespace or commas at brace-level zero.  Thus
       the name

	  van der Graaf, Horace	Q.

       has five	tokens,	whereas	the name

	  {Foo,	Bar, and Sons}

       consists	of a single token.  Skip down to "EXAMPLES" for	more examples,
       or read on if you want to know the exact	details	of how names are split
       into tokens and parts.
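
       As a rough illustration of the tokenizing rule (not the actual C
       implementation), the following pure-Perl sketch splits a string on
       whitespace and commas that occur outside braces.  The helper name
       tokenize_name is made up for this example and is not part of
       Text::BibTeX.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::BibTeX): split a name into
# tokens on whitespace or commas, but only at brace-level zero.
sub tokenize_name {
    my ($name) = @_;
    my @tokens;
    my $current = '';
    my $depth   = 0;
    for my $char (split //, $name) {
        if ($char eq '{') {
            $depth++;
        }
        elsif ($char eq '}' && $depth > 0) {
            $depth--;
        }
        if ($depth == 0 && $char =~ /[\s,]/) {
            # Separator outside braces: finish the current token, if any.
            push @tokens, $current if length $current;
            $current = '';
        }
        else {
            $current .= $char;
        }
    }
    push @tokens, $current if length $current;
    return @tokens;
}

my @graaf = tokenize_name('van der Graaf, Horace Q.');
my @firm  = tokenize_name('{Foo, Bar, and Sons}');
print scalar(@graaf), " tokens: @graaf\n";
print scalar(@firm),  " token: @firm\n";
```

       Note that the sketch discards the commas themselves; deciding which
       part each token belongs to needs the comma positions as well.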

       How tokens are divided into parts depends on the	form of	the name.  If
       the name	has no commas at brace-level zero (as in the second example),
       then it is assumed to be	in either "first last" or "first von last"
       form.  If there are no tokens that start	with a lower-case letter, then
       "first last" form is assumed: the final token is	the last name, and all
       other tokens form the first name.  Otherwise, the earliest contiguous
       sequence	of tokens with initial lower-case letters is taken as the
       `von' part; if this sequence includes the final token, then a warning
       is printed and the final	token is forced	to be the `last' part.
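
       The comma-less rule can be sketched in pure Perl as follows.  This
       is only an approximation: assign_no_comma is a hypothetical helper,
       and it treats a token as lower-case when its first character is an
       ASCII lower-case letter, which is simpler than what the real
       splitting code may do.

```perl
use strict;
use warnings;

# Hypothetical helper: assign the tokens of a comma-less name to parts.
# Simplification: a token counts as "lower-case" if its first character
# is an ASCII lower-case letter.
sub assign_no_comma {
    my @tokens = @_;
    my %part = (first => [], von => [], last => [], jr => []);
    return \%part unless @tokens;

    # Locate the earliest contiguous run of lower-case tokens.
    my ($start, $end);
    for my $i (0 .. $#tokens) {
        if ($tokens[$i] =~ /^[a-z]/) {
            $start = $i unless defined $start;
            $end   = $i;
        }
        elsif (defined $start) {
            last;
        }
    }

    if (!defined $start) {
        # "first last" form: the final token is the last name.
        $part{last}  = [ $tokens[-1] ];
        $part{first} = [ @tokens[0 .. $#tokens - 1] ];
        return \%part;
    }

    if ($end == $#tokens) {
        # The von run swallowed the final token: force it into `last'.
        warn "no capitalized token after the von part\n";
        $end--;
    }
    $part{first} = [ @tokens[0 .. $start - 1] ];
    $part{von}   = [ @tokens[$start .. $end] ];
    $part{last}  = [ @tokens[$end + 1 .. $#tokens] ];
    return \%part;
}

my $parts = assign_no_comma(qw(Ludwig van Beethoven));
print "$_ => (@{ $parts->{$_} })\n" for qw(first von last);
```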

       If a name has a single comma, then it is	assumed	to be in "von last,
       first" form.  A leading sequence	of tokens with initial lower-case
       letters,	if any,	forms the `von'	part; tokens between the `von' and the
       comma form the `last' part; tokens following the	comma form the `first'
       part.  Again, if	there are no tokens following a	leading	sequence of
       lowercase tokens, a warning is printed and the token immediately
       preceding the comma is taken to be the `last' part.

       If a name has more than two commas, a warning is	printed	and the	name
       is treated as though only the first two commas were present.

       Finally,	if a name has two commas, it is	assumed	to be in "von last,
       jr, first" form.	 (This is the only way to represent a name with	a `jr'
       part.)  The parsing of the name is the same as for a one-comma name,
       except that tokens between the two commas are taken to be the `jr'
       part.
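
       A minimal sketch of the comma rules, assuming the name has already
       been cut into comma-separated sections of tokens.  The helper
       assign_with_commas is hypothetical, and merging the sections after
       a third or later comma into the `first' part is one reading of
       "treated as though only the first two commas were present", not a
       confirmed description of the library's behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: assign parts for a name with one or more commas.
# Each argument is an array ref holding the tokens of one comma-separated
# section, e.g. (['van', 'Beethoven'], ['Ludwig']).
sub assign_with_commas {
    my @sections = @_;
    die "need at least one comma\n" if @sections < 2;
    if (@sections > 3) {
        warn "too many commas; ignoring all but the first two\n";
        # Assumed interpretation: later sections join the third one.
        my @rest = map { @$_ } splice(@sections, 2);
        push @sections, \@rest;
    }
    my @before = @{ shift @sections };               # "von last"
    my @first  = @{ pop @sections };                  # after the final comma
    my @jr     = @sections ? @{ $sections[0] } : ();  # two-comma form only

    # A leading run of lower-case tokens forms the von part, but the
    # token just before the comma always stays in `last'.
    my @von;
    while (@before > 1 && $before[0] =~ /^[a-z]/) {
        push @von, shift @before;
    }
    warn "no capitalized token before the comma\n"
        if @before && $before[0] =~ /^[a-z]/;

    return { first => \@first, von => \@von, last => \@before, jr => \@jr };
}

my $doe = assign_with_commas(['Doe'], ['Jr.'], ['John']);
print "$_ => (@{ $doe->{$_} })\n" for qw(first von last jr);
```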

       The C code that does the actual work of splitting up names takes a
       shortcut and makes a few assumptions about whitespace.  In particular,
       there must be no leading whitespace, no trailing whitespace, no
       consecutive whitespace characters in the string, and no whitespace
       characters other than space.  In other words, all whitespace must
       consist of lone internal spaces.
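
       If a name comes from untrusted input, it can be cleaned up before
       it is handed to the splitting code.  The following is a minimal
       sketch; normalize_name_whitespace is a hypothetical helper, not
       something the module provides.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce all whitespace to lone internal spaces.
sub normalize_name_whitespace {
    my ($name) = @_;
    $name =~ s/\s+/ /g;   # collapse runs (tabs, newlines, ...) to one space
    $name =~ s/^ //;      # drop a leading space
    $name =~ s/ $//;      # drop a trailing space
    return $name;
}

print normalize_name_whitespace("  Smith,\tJohn \n"), "\n";
```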

       The strings "John Smith" and "Smith, John" are different
       representations of the same name, so they split into parts and tokens
       the same way, namely as:

	  first	=> ('John')
	  von	=> ()
	  last	=> ('Smith')
	  jr	=> ()

       Note that every part is a list of tokens, even if there is only one
       token in	that part; empty parts get empty token lists.  Every token is
       just a string.  Writing this example in actual code is simple:

          $name = Text::BibTeX::Name->new("John Smith");  # or "Smith, John"
          $name->part ('first');       # returns list ("John")
          $name->part ('last');        # returns list ("Smith")
          $name->part ('von');         # returns list ()
          $name->part ('jr');          # returns list ()

       (We'll omit the empty parts in the rest of the examples:	just assume
       that any	unmentioned part is an empty list.)  If	more than two tokens
       are included and	there's	no comma, they'll go to	the first name:	thus
       "John Q.	Smith" splits into

          first => ("John", "Q.")
          last  => ("Smith")

       and "J. R. R. Tolkien" into

          first => ("J.", "R.", "R.")
          last  => ("Tolkien")

       The ambiguous name "Kevin Philips Bong" splits into

	  first	=> ("Kevin", "Philips")
	  last	=> ("Bong")

       which may or may	not be the right thing,	depending on the particular
       person.	There's	no way to know though, so if this fellow's last	name
       is "Philips Bong" and not "Bong", the string representation of his name
       must disambiguate.  One possibility is "Philips Bong, Kevin" which
       splits into

	  first	=> ("Kevin")
	  last	=> ("Philips", "Bong")

       Alternatively, "Kevin {Philips Bong}" takes advantage of the fact that
       tokens are only split on whitespace at brace-level zero, and becomes

	  first	=> ("Kevin")
	  last	=> ("{Philips Bong}")

       which is	fine if	your names are destined	to be processed	by TeX,	but
       might be	problematic in other contexts.	Similarly, "St John-Mollusc,
       Oliver" becomes

	  first	=> ("Oliver")
	  last	=> ("St", "John-Mollusc")

       which can also be written as "Oliver {St	John-Mollusc}":

	  first	=> ("Oliver")
	  last	=> ("{St John-Mollusc}")

       Since tokens are	separated purely by whitespace,	hyphenated names will
       work either way:	both "Nigel Incubator-Jones" and "Incubator-Jones,
       Nigel" come out as

	  first	=> ("Nigel")
	  last	=> ("Incubator-Jones")

       Multi-token last	names with lowercase components	-- the "von part" --
       work fine: both "Ludwig van Beethoven" and "van Beethoven, Ludwig"
       parse (correctly) into

	  first	=> ("Ludwig")
	  von	=> ("van")
	  last	=> ("Beethoven")

       This allows these European aristocratic names to	sort properly, i.e.
       van Beethoven under B rather than v.  Speaking of aristocratic European
       names, "Charles Louis Xavier Joseph de la Vall{\'e}e Poussin" is
       handled just fine, and splits into

	  first	=> ("Charles", "Louis",	"Xavier", "Joseph")
	  von	=> ("de", "la")
	  last	=> ("Vall{\'e}e", "Poussin")

       so could be sorted under V rather than d.  (Note that the sorting
       algorithm in Text::BibTeX::BibSort is a slavish imitation of BibTeX
       0.99, and therefore does the wrong thing with these names: the sort key
       starts with the "von" part.)
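
       To sort under the `last' part instead, a caller can build its own
       key from the split parts.  The sketch below is an assumption-laden
       illustration: the parts are written out by hand rather than
       obtained from the module, last_name_key is a hypothetical helper,
       and the TeX markup is stripped very crudely.

```perl
use strict;
use warnings;

# Parts as they would come back from splitting, written out by hand here.
my @names = (
    { first => ['Ludwig'], von => ['van'], last => ['Beethoven'] },
    { first => [qw(Charles Louis Xavier Joseph)], von => [qw(de la)],
      last  => ["Vall{\\'e}e", 'Poussin'] },
    { first => ['John'], von => [], last => ['Smith'] },
);

# Hypothetical sort key: `last' tokens, then `first' tokens, ignoring `von'.
sub last_name_key {
    my ($parts) = @_;
    my $key = lc join ' ', @{ $parts->{last} }, @{ $parts->{first} };
    $key =~ s/[{}\\']//g;   # crude removal of TeX markup for comparison
    return $key;
}

my @sorted = sort { last_name_key($a) cmp last_name_key($b) } @names;
print join(' ', @{ $_->{von} }, @{ $_->{last} }), "\n" for @sorted;
```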

       However,	capitalized "von parts"	don't work so well: "R.	J. Van de
       Graaff" splits into

	  first	=> ("R.", "J.",	"Van")
	  von	=> ("de")
	  last	=> ("Graaff")

       which is	clearly	wrong.	This name should be represented	as "Van	de
       Graaff, R. J."

	  first	=> ("R.", "J.")
	  last	=> ("Van", "de", "Graaff")

       which is	probably right.	 (This particular Van de Graaff	was an
       American, so he probably	belongs	under V	-- which is where my (British)
       dictionary puts him.  Other Van de Graaff's mileages may	vary.)

       Finally,	many names include a suffix: "Jr.", "III", "fils", and so
       forth.  These are handled, but with some	limitations.  If there's a
       comma before the	suffix (the usual U.S. convention for "Jr."), then the
       name should be in last, jr, first form, e.g. "Doe, Jr., John" comes out
       (correctly) as

	  first	=> ("John")
	  last	=> ("Doe")
	  jr	=> ("Jr.")

       but "John Doe, Jr." is ambiguous	and is parsed as

	  first	=> ("Jr.")
	  last	=> ("John", "Doe")

       (so don't do it that way).  If there's no comma before the suffix --
       the usual convention for Roman numerals, and occasionally seen with
       "Jr." -- then you're stuck and have to make the suffix part of the last
       name.  Thus, "Gates III, William H." comes out

	  first	=> ("William", "H.")
	  last	=> ("Gates", "III")

       but "William H. Gates III" is ambiguous,	and becomes

	  first	=> ("William", "H.", "Gates")
	  last	=> ("III")

       -- not what you want.  Again, the curly-brace trick comes in handy, so
       "William	H. {Gates III}"	splits into

	  first	=> ("William", "H.")
	  last	=> ("{Gates III}")

       There is	no way to make a comma-less suffix the "jr" part.  (This is an
       unfortunate consequence of slavishly imitating BibTeX 0.99.)

       Finally,	names that aren't really names of people but rather are
       organization or company names should be forced into a single token by
       wrapping	them in	curly braces.  For example, "Foo, Bar and Sons"	should
       be written "{Foo, Bar and Sons}", which will split as

	  last	=> ("{Foo, Bar and Sons}")

       Of course, if this is one name in a BibTeX "authors" or "editors" list,
       this name has to be wrapped in braces anyway (because of the " and "),
       but that's another story.

       Putting a split-up name back together again in a	flexible, customizable
       way is the job of another module: see Text::BibTeX::NameFormat.

       new([ [OPTS,] NAME [, FILENAME, LINE, NAME_NUM]])
           Creates a new "Text::BibTeX::Name" object.  If NAME is supplied, it
           must be a string containing a single name, and it will be passed
           to the "split" method for further processing.  FILENAME, LINE, and
           NAME_NUM, if present, are all also passed to "split" to allow
           better error messages.

           If the first argument is a hash reference, it is used to define
           configuration values.  At the moment the available values are:

           BINMODE
               Set the way Text::BibTeX deals with strings.  By default it
               manages strings as bytes.  You can set BINMODE to 'utf-8' to get
               NFC normalized UTF-8 strings, and you can customise the
               normalization with the NORMALIZATION option.

                     Text::BibTeX::Name->new(
                         { binmode => 'utf-8', normalization => 'NFD' },
                         "Alberto Simões");

       split (NAME [, FILENAME,	LINE, NAME_NUM])
	   Splits NAME (a string containing a single name) into	tokens and
	   subsequently	into the four parts of a BibTeX-style name (first,
	   von,	last, and jr).	(Each part is a	list of	tokens,	and tokens are
	   separated by	whitespace or commas at	brace-depth zero.  See above
	   for full details on how a name is split into	its component parts.)

	   The token-lists that	make up	each part of the name are then stored
	   in the "Text::BibTeX::Name" object for later	retrieval or
	   formatting with the "part" and "format" methods.

       part (PARTNAME)
	   Returns the list of tokens in part PARTNAME of a name previously
	   split with "split".	For example, suppose a "Text::BibTeX::Name"
	   object is created and initialized like this:

	      $name = Text::BibTeX::Name->new();
              # q{} keeps the backslash; inside '...' the \' would collapse to '
              $name->split (q{Charles Louis Xavier Joseph de la Vall{\'e}e Poussin});

	   Then	this code:

	      $name->part ('von');

	   would return	the list "('de','la')".
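
           A short sketch that collects all four parts of a name using
           only the "new" and "part" methods described here.  The helper
           name_parts is hypothetical; it returns nothing when
           Text::BibTeX is not installed, and the "grep { defined }" is a
           defensive assumption in case an installed version does not
           return an empty list for an empty part.

```perl
use strict;
use warnings;

# Hypothetical helper: return a hash ref of part name => array ref of
# tokens, or nothing if Text::BibTeX is not installed.
sub name_parts {
    my ($string) = @_;
    return unless eval { require Text::BibTeX::Name; 1 };
    my $name = Text::BibTeX::Name->new($string);
    my %parts;
    for my $partname (qw(first von last jr)) {
        # Drop undefined entries in case an empty part is not returned
        # as an empty list by the installed version.
        $parts{$partname} = [ grep { defined } $name->part($partname) ];
    }
    return \%parts;
}

my $parts = name_parts('Ludwig van Beethoven');
if ($parts) {
    printf "%-5s => (%s)\n", $_, join(', ', @{ $parts->{$_} })
        for qw(first von last jr);
}
```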

       format (FORMAT)
           Formats a name according to the specifications encoded in FORMAT,
           which should be a "Text::BibTeX::NameFormat" (or descendant)
           object.  (In short, it must supply a method "apply" which takes a
           "Text::BibTeX::Name" object as its only argument.)  Returns the
           formatted name as a string.

	   See Text::BibTeX::NameFormat	for full details on formatting names.

       Text::BibTeX::Entry, Text::BibTeX::NameFormat, bt_split_names.

       Greg Ward

       Copyright (c) 1997-2000 by Gregory P. Ward.  All	rights reserved.  This
       file is part of the Text::BibTeX	library.  This library is free
       software; you may redistribute it and/or	modify it under	the same terms
       as Perl itself.

perl v5.32.0			  2020-08-08		 Text::BibTeX::Name(3)

