Lingua::EN::NameParse(3)  User Contributed Perl Documentation  Lingua::EN::NameParse(3)

NAME
       Lingua::EN::NameParse - extract the components of a person's or
       couple's full name, presented as a text string

SYNOPSIS
	   use Lingua::EN::NameParse qw(clean case_surname);

	   # optional configuration arguments
	   my %args =
	   (
	       auto_clean      => 1,
	       lc_prefix       => 1,
	       initials	       => 3,
	       allow_reversed  => 1,
	       joint_names     => 0,
	       extended_titles => 0
	   );

	   my $name = Lingua::EN::NameParse->new(%args);

	   $error = $name->parse("Estate Of Lt Col AB Van Der Heiden (Hold Mail)");
	   unless ( $error )
	   {
	       print($name->report);

	       # The report printed for this input:
	       #
	       #   Case all             : Estate Of Lt Col AB Van Der Heiden (Hold Mail)
	       #   Case all reversed    : Van Der Heiden, Lt Col AB
	       #   Salutation           : Dear Friend
	       #   Type                 : Mr_A_Smith
	       #   Parsing Error        : 0
	       #   Error description :  :
	       #   Parsing Warning      : 1
	       #   Warning description  : ;non_matching text found : (Hold Mail)
	       #
	       #   COMPONENTS
	       #   initials_1           : AB
	       #   non_matching         : (Hold Mail)
	       #   precursor            : Estate Of
	       #   surname_1            : Van Der Heiden
	       #   title_1              : Lt Col

	       %name_comps = $name->components;
	       $surname	= $name_comps{surname_1};

	       $correct_casing = $name->case_all;

	       $correct_casing = $name->case_all_reversed ;

	       $salutation = $name->salutation(salutation => 'Dear', sal_default => 'Friend');

	       $good_name = clean("Bad Na9me   "); # "Bad Name"

	       %my_properties =	$name->properties;
	       $number_surnames	= $my_properties{number}; # 1
	   }

	   $lc_prefix =	0;
	   $correct_case = case_surname("DE SILVA-MACNAY",$lc_prefix); # A stand alone function, returns: De Silva-MacNay

	   $error = $name->parse("MR AS	& D.E. DE LA MARE");
	   %my_properties = $name->properties;
	   $number_surnames = $my_properties{number}; #	2

DESCRIPTION
       This module takes as input one person's name or a couple's names in
       free format text, such as:

	   Mr AB & M/s CD MacNay-Smith
	   MR J.L. D'ANGELO
	   Estate Of The Late Lieutenant Colonel AB Van	Der Heiden

       and attempts to parse it. If successful,	the name is broken down	into
       components and useful functions can be performed	such as	:

	  converting upper or lower case values	to name	case (Mr AB MacNay   )
	  creating a personalised greeting or salutation     (Dear Mr MacNay )
	  extracting the name's individual components	     (Mr,AB,MacNay   )
	  determining the type of format the name is in	     (Mr_A_Smith     )

       If the name(s) cannot be	parsed you have	the option of cleaning the
       name(s) of bad characters, or extracting	any portion that was parsed
       and the portion that failed.

       This module can be used for analysing and improving the quality of
       lists of	names.

DEFINITIONS
       The following terms are used by NameParse to define the components that
       can make	up a name.

	  Precursor   -	Estate of (The Late), Right Honourable ...
	  Title	      -	Mr, Mrs, Ms., Sir, Dr, Major, Reverend ...
	  Conjunction -	word to	separate two names, such as "And" or &
	  Initials    -	1-3 letters, each with an optional space and/or	dot
	  Surname     -	De Silva, Van Der Heiden, MacNay-Smith,	O'Reilly ...
	  Suffix      -	Snr., Jnr, III,	V ...

       Refer to	the component grammar defined within the code for a complete
       list of combinations.
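
       As an informal illustration of the Initials definition above, the
       following sketch tests whether a string consists of 1 to 3 letters,
       each optionally followed by a dot and/or a space. The function name
       looks_like_initials and its $max argument are hypothetical and are not
       part of NameParse; the grammar inside the module remains the authority
       and may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::EN::NameParse.
# Returns 1 if $text is 1..$max letters, each optionally followed
# by a dot and/or a single space; otherwise returns 0.
sub looks_like_initials {
    my ( $text, $max ) = @_;
    $max = 2 unless defined $max;    # mirrors the documented default for "initials"
    return $text =~ /\A(?:[A-Za-z]\.? ?){1,$max}\z/ ? 1 : 0;
}

for my $candidate ( 'AB', 'A. B.', 'ABC' ) {
    printf "%-6s => %d\n", $candidate, looks_like_initials($candidate);
}
```

       Passing 3 as the second argument corresponds to the widest setting the
       "initials" option accepts.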

       'Name casing' refers to the correct use of upper and lower case letters
       in people's names, such as Mr AB McNay.

       To describe the formats supported by NameParse, a shorthand
       representation of the name is used. The following formats are currently
       supported:

	   Mr_John_Smith_&_Ms_Mary_Jones
	   Mr_A_Smith_&_Ms_B_Jones
	   Mr_&Ms_A_&_B_Smith
	   Mr_A_&_Ms_B_Smith
	   Mr_&_Ms_A_Smith
	   Mr_A_&_B_Smith
	   John_Smith_&_Mary_Jones
	   John_&_Mary_Smith
	   A_Smith_&_B_Jones

	   Mr_John_Adam_Smith
	   Mr_John_A_Smith
	   Mr_John_Smith
	   Mr_A_Smith
	   John_Adam_Smith
	   John_A_Smith
	   J_Adam_Smith
	   John_Smith
	   A_Smith
	   John

       Precursors and suffixes may be applied to single names that have a
       surname.

METHODS
   new
       The "new" method	creates	an instance of a name object and sets up the
       grammar used to parse names. This must be called	before any of the
       following methods are invoked. Note that	the object only	needs to be
       created ONCE, and should	be reused with new input data. Calling "new"
       repeatedly will significantly slow your program down.
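
       A minimal sketch of the reuse pattern described above: the object is
       built once, outside the loop, and "parse" is called for each input.
       The sample names and the %result_for hash are illustrative only.

```perl
use strict;
use warnings;
use Lingua::EN::NameParse;

# Build the parser once; per the text above this is the expensive step.
my $name = Lingua::EN::NameParse->new( auto_clean => 1 );

my @inputs = ( 'MR AC DE SILVA', 'Ms Mary Jones', 'Dr J Smith' );    # sample data

my %result_for;
for my $input (@inputs) {
    # parse() returns 0 on success and 1 on failure
    my $error = $name->parse($input);
    $result_for{$input} = $error ? undef : $name->case_all;
}

for my $input (@inputs) {
    my $cased = $result_for{$input};
    print defined $cased ? "$input => $cased\n" : "$input => (not parsed)\n";
}
```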

       Various setup options may be defined in a hash that is passed as	an
       optional	argument to the	"new" method. Note that	all the	arguments are
       optional. You need to define the	combination of arguments that are
       appropriate for your usage.

	  my %args =
	  (
	     auto_clean	    => 1,
	     lc_prefix	    => 1,
	     initials	    => 3,
	     allow_reversed => 1
	  );

	  my $name = Lingua::EN::NameParse->new(%args);

       auto_clean
	   When this option is set to a positive value, any call to the
	   "parse" method that fails will attempt to 'clean' the name and then
	   reparse it. See the "clean" function for details. This is useful for
	   dirty data with embedded unprintable or non-alphabetic characters.

       lc_prefix
	   When	this option is set to a	positive value,	it will	force the
	   "case_all" and "components" methods to lower	case the first letter
	   of each word	that occurs in the prefix portion of a surname.	For
	   example, Mr AB de Silva, or Ms AS von der Heiden.

       initials
	   Allows the user to control the number of letters that can occur in
	   the initials.  Valid settings are 1, 2 or 3. If no value is supplied,
	   a default of 2 is used.

       allow_reversed
	   When	this option is set to a	positive value,	names in reverse order
	   will	be processed. The only valid format is the surname followed by
	   a comma and the rest	of the name, which can be in any of the
	   combinations	allowed	by non reversed	names. Some examples are:

	       Smith, Mr AB
	       Jones, Jim
	       De Silva, Professor A.B.

	   The program changes the order of the name back to the non reversed
	   format, and then performs the normal parsing. Note that if the name
	   can be parsed, the fact that its order was originally reversed is
	   not recorded as a property of the name object.

       joint_names
	   When	this option is set to a	positive value,	joint names are
	   accounted for:

	       Mr_A_Smith_&Ms_B_Jones
	       Mr_&Ms_A_&B_Smith
	       Mr_A_&Ms_B_Smith
	       Mr_&Ms_A_Smith
	       Mr_A_&B_Smith

	   Note that if this option is not specified, then by default joint
	   names are ignored. Disabling joint names speeds up the processing
	   considerably.

       extended_titles
	   When	this option is set to a	positive value,	all combinations of
	   titles, such	as Colonel, Mother Superior are	used. If this value is
	   not set, only the following titles are accounted for:

	       Mr
	       Ms
	       M/s
	       Mrs
	       Miss
	       Dr
	       Sir
	       Dame

	   Note that if this option is not specified, then by default extended
	   titles are ignored. Disabling extended titles speeds up the
	   parsing.
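
       The reordering step described under "allow_reversed" can be pictured
       with the following sketch. It is an approximation written for this
       page, not the module's own code: the function name unreverse_sketch is
       hypothetical, and it assumes the surname is everything before the
       first comma.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::EN::NameParse.
# Turns "Surname, rest of name" into "rest of name Surname".
# Strings without a comma are returned unchanged.
sub unreverse_sketch {
    my ($text) = @_;
    if ( $text =~ /^\s*([^,]+?)\s*,\s*(.+?)\s*$/ ) {
        my ( $surname, $rest ) = ( $1, $2 );
        return "$rest $surname";
    }
    return $text;
}

print unreverse_sketch('Smith, Mr AB'), "\n";                # Mr AB Smith
print unreverse_sketch('De Silva, Professor A.B.'), "\n";    # Professor A.B. De Silva
```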

   parse
	   $error = $name->parse("MR AC	DE SILVA");

       The "parse" method takes a single parameter of a text string containing
       a name. It attempts to parse the name and break it down into its
       components.

       Returns an error flag. If the name was parsed successfully, its value
       is 0, otherwise 1. This step is a prerequisite for the following
       methods.
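
       One possible way to act on the error flag is sketched below. It uses
       only methods documented on this page ("parse" and "properties"); the
       input string is taken from the "salutation" section and the variable
       names are illustrative.

```perl
use strict;
use warnings;
use Lingua::EN::NameParse;

my $name  = Lingua::EN::NameParse->new();
my $input = 'AB Smith & Associates';

# 0 means the name was parsed, 1 means it was not
my $error = $name->parse($input);

if ($error) {
    # The properties hash carries any portion that did not match
    my %props = $name->properties;
    my $left_over = defined $props{non_matching} ? $props{non_matching} : '';
    print "Could not parse '$input'; unmatched text: '$left_over'\n";
}
else {
    print "Parsed: ", $name->case_all, "\n";
}
```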

   case_all
	   $correct_casing = $name->case_all;

       The "case_all" method converts the first letter of each component to
       capitals and the remainder to lower case, with the following
       exceptions:

	  initials remain capitalised
	  surname spellings such as MacNay-Smith, O'Brien and Van Der Heiden are preserved
	  - see surname_prefs.txt for user defined exceptions

       A complete definition of	the capitalising rules can be found by
       studying	the case_surname function.

       The method returns the entire cased name	as text.

   case_all_reversed
	   $correct_casing = $name->case_all_reversed;

       The "case_all_reversed" method applies the same type of casing as
       "case_all". However, the	name is	returned as surname followed by	a
       comma and the rest of the name, which can be any	of the combinations
       allowed for a name, except the title. Some examples are:	"Smith,	John",
       "De Silva, A.B."	 This is useful	for sorting names alphabetically by
       surname.

       The method returns the entire reverse order cased name as text.

   components
	  %my_name = $name->components;
	  $cased_surname = $my_name{surname_1};

       The "components"	method does the	same thing as the "case_all" method,
       but returns the name cased components in	a hash.	The following keys are
       used for	each component:

	  precursor
	  title_1
	  title_2
	  given_name_1
	  given_name_2
	  initials_1
	  initials_2
	  middle_name
	  conjunction_1
	  conjunction_2
	  surname_1
	  surname_2
	  suffix

       If a component has no matching data for a given name, it will not
       appear in the hash.

       If the name could not be parsed, this method returns null. If you
       assign the return value to a hash, you should check the error status
       returned by the "parse" method first.  Otherwise, you will get an odd
       number of values assigned to the hash.
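
       A short sketch of the check recommended above: the hash is only
       assigned when "parse" reports success. The variable names are
       illustrative.

```perl
use strict;
use warnings;
use Lingua::EN::NameParse;

my $name = Lingua::EN::NameParse->new();

my %comps;
if ( $name->parse('MR AC DE SILVA') == 0 ) {
    # Safe to assign: the parse succeeded, so a key/value list is returned
    %comps = $name->components;
}

# Only components that matched appear as keys
for my $key ( sort keys %comps ) {
    printf "%-14s: %s\n", $key, $comps{$key};
}
```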

   case_surname
	  $correct_casing = case_surname("DE SILVA-MACNAY" [,$lc_prefix]);

       "case_surname" is a stand alone function	that does not require a	name
       object. The input is a text string. An optional input argument controls
       the casing rules	for prefix portions of a surname, as described above
       in the "lc_prefix" section.

       The output is a string converted to the correct casing for surnames.
       See "surname_prefs.txt" for user defined exceptions.

       This function is	useful when you	know you are only dealing with names
       that do not have	initials like "Mr John Jones". It is much faster than
       the case_all method, but	does not understand context, and cannot	detect
       errors on strings that are not personal names.
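
       For comparison, the sketch below calls "case_surname" on the same
       input with the optional second argument set to 0 and to 1. The SYNOPSIS
       gives the result for 0 as De Silva-MacNay; the result for 1 is printed
       rather than assumed here.

```perl
use strict;
use warnings;
use Lingua::EN::NameParse qw(case_surname);

my $surname = 'DE SILVA-MACNAY';

# Second argument follows the "lc_prefix" rules described earlier
my $upper_prefix = case_surname( $surname, 0 );
my $lower_prefix = case_surname( $surname, 1 );

print "lc_prefix off: $upper_prefix\n";
print "lc_prefix on : $lower_prefix\n";
```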

   surname_prefs.txt
       Some surnames can have more than	one form of valid capitalisation, such
       as MacQuarie or Macquarie. Where	the user wants to specify one form as
       the default, a text file	called surname_prefs.txt should	be created and
       placed in the same location as the NameParse module. The	text file
       should contain one surname per line, in the capitalised form you	want,
       such as

	  Macquarie
	  MacHado

       NameParse will still operate if the file does not exist.
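
       To find out where "the same location as the NameParse module" is on a
       given system, one option is to look the module up in Perl's %INC hash
       after loading it. This is a sketch using core modules only; it reports
       the expected path and whether a file is present, and does not create
       or modify anything.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Spec;
use Lingua::EN::NameParse;

# %INC maps each loaded module to the file it was loaded from
my $module_dir = dirname( $INC{'Lingua/EN/NameParse.pm'} );
my $prefs_path = File::Spec->catfile( $module_dir, 'surname_prefs.txt' );

if ( -e $prefs_path ) {
    print "Preferences file found at $prefs_path\n";
}
else {
    print "No preferences file at $prefs_path\n";
}
```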

   salutation
	   $salutation = $name->salutation(salutation => 'Dear', sal_default => 'Friend', sal_type => 'given_name');

       The "salutation" method converts a name into a personal greeting, such
       as "Dear Mr & Mrs O'Brien" or "Dear Sue and John".

       Optional parameters may be specified in a hash as follows:

	   salutation:

	   The greeting word such as 'Dear' or 'Greetings'. If not specified, then 'Dear' is used.

	   sal_default:

	   The default word used when a personalised salutation cannot be generated. If not
	   specified, then 'Friend' is used.

	   sal_type:

	   Can be either 'given_name' such as 'Dear Sue' or 'title_plus_name' such as 'Dear Ms Smith'.
	   If not specified, then 'given_name' is used.

       If an error is detected during parsing, such as with the	name "AB Smith
       & Associates", then the value of	sal_default is used instead of a given
       name, or	a title	and surname.  If the input string contains a
       conjunction, an 's' is added to the value of sal_default.

       If the name contains a precursor, a default salutation is produced.
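
       The following sketch requests both salutation types for the same
       parsed name. The input name and greeting words are illustrative, and
       the exact text returned is printed rather than assumed.

```perl
use strict;
use warnings;
use Lingua::EN::NameParse;

my $name = Lingua::EN::NameParse->new();

my @greetings;
if ( $name->parse('Ms Sue Smith') == 0 ) {
    for my $type ( 'given_name', 'title_plus_name' ) {
        my $greeting = $name->salutation(
            salutation  => 'Hello',
            sal_default => 'Customer',
            sal_type    => $type,
        );
        push @greetings, $greeting;
        print "$type: $greeting\n";
    }
}
```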

   clean
	  $good_name = clean("Bad Na9me");

       "clean" is a stand alone	function that does not require a name object.
       The input is a text string and the output is the	string with:

	  all repeating	spaces removed
	  all characters not in	the set	(A-Z a-z - ' , . &) removed

   properties
       The "properties"	method returns all the properties of the name,
       non_matching, number and	type, as a hash.

       type
	   The type of format a	name is	in, as one of the following strings:

	       Mr_A_Smith_&Ms_B_Jones
	       Mr_&Ms_A_&B_Smith
	       Mr_A_&Ms_B_Smith
	       Mr_&Ms_A_Smith
	       Mr_A_&B_Smith
	       Mr_John_Adam_Smith
	       Mr_John_A_Smith
	       Mr_John_Smith
	       Mr_A_Smith
	       John_Adam_Smith
	       John_A_Smith
	       J_Adam_Smith
	       John_Smith
	       A_Smith
	       John
	       unknown

       non_matching
	   Returns any unmatched section that was found.

   report
       Create a formatted text report to standard output, listing:

	  - the input string
	  - the name and value of each defined component
	  - any non matching component

LIMITATIONS
       The huge number of character combinations that can form a valid name
       makes it impossible to correctly identify them all. Firstly, there
       are many ambiguities, which have no right answer.

	  Macbeth or MacBeth are both valid spellings
	  Is ED WOOD E.D. Wood or Edward Wood?
	  Is 'Mr Rapid Print' a name or a company?
	  Does John Bradfield Smith have a middle name of Bradfield, or a surname of Bradfield-Smith?

       One approach is to have large lookup files of names and words,
       statistical rules and fuzzy logic to attempt to derive context. This
       approach gives high levels of accuracy but uses a lot of your
       computer's time and resources.

       NameParse takes the approach of using a limited set of rules, based on
       the formats that are commonly used by business to represent people's
       names. This gives us fairly high accuracy, with acceptable speed and
       program size.

       NameParse will accept names from	many countries,	like Van Der Heiden,
       De La Mare and Le Fontain. Having said that, it is still	biased toward
       English,	because	the precursors,	titles and conjunctions	are based on
       English usage.

       Names with two or more words, but no separating hyphen, are not
       recognized.  This is a real quandary as Indian, Chinese and other names
       can have several components. If these are allowed for, any component
       after the surname will also be picked up. For example, "Mr AB Jones
       Trading As Jones Pty Ltd" will return a surname of "Jones Trading".

       Because of the large combination	of possible names defined in the
       grammar,	the program is not very	fast, except for the more limited
       "case_surname" subroutine.  See the "Future Directions" section for
       possible	speed ups.

       As the parser has a very limited understanding of context, the
       "John_Adam_Smith" name type is most likely to cause problems, as it
       contains no known tokens like a title. A string such as "National
       Australia Bank" would be accepted as a valid name, with a first name
       of National, and so on. Supplying a list of common pronouns as
       exceptions could solve this problem.

REFERENCES
       "The Wordsworth Dictionary of Abbreviations & Acronyms" (1997)

       Australian Standard AS4212-1994 "Geographic Information Systems - Data
       Dictionary for transfer of street addressing information"

FUTURE DIRECTIONS
       Define grammar for other	languages. Hopefully, all that would be	needed
       is to specify a new module with its own grammar,	and inherit all	the
       existing	methods. I don't have the knowledge of the naming conventions
       for non-English languages.

REPOSITORY
       <https://github.com/kimryan/Lingua-EN-NameParse>

SEE ALSO
       Lingua::EN::AddressParse, Lingua::EN::MatchNames,
       Lingua::EN::NickNames, Lingua::EN::NameCase, Parse::RecDescent

BUGS
       Names with accented characters (acute, circumflex etc) will not be
       parsed correctly. A workaround is to replace the character class [a-z]
       with \w in the appropriate rules in the grammar tree, but this could
       lower the accuracy of names based purely on ASCII text.

CREDITS
       Thanks to all the people	who provided ideas and suggestions, including
       -

	  Damian Conway,  author of Parse::RecDescent
	  Mark Summerfield author of Lingua::EN::NameCase,
	  Ron Savage, Alastair Adam Huffman, Douglas Wilson
	  Peter	Schendzielorz

AUTHOR
       NameParse was written by	Kim Ryan <kimryan at cpan dot org>

COPYRIGHT AND LICENSE
       Copyright (c) 2016 Kim Ryan. All	rights reserved.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

case_all_reversed
       Apply correct capitalisation to a person's entire name and reverse the
       order so that the surname is first, followed by the other components,
       such as "Smith, Mr John A". This is useful for creating a list of
       names that can be sorted by surname.

       If the name type is unknown, returns null.

       If the name type has a joint name, such as 'Mr_A_Smith_Ms_B_Jones',
       returns null, as it is ambiguous which surname to place at the start of
       the string.

       Otherwise, returns a string of all cased components in correct reversed
       order.
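
       A sketch of the sorting use case mentioned above. Because a null
       result is possible for unknown or joint name types, the return value
       is checked before it is kept. The sample names are illustrative.

```perl
use strict;
use warnings;
use Lingua::EN::NameParse;

my $name   = Lingua::EN::NameParse->new();
my @inputs = ( 'MR JOHN SMITH', 'ms mary jones', 'Dr A Brown' );    # sample data

my @sortable;
for my $input (@inputs) {
    next if $name->parse($input);    # skip names that fail to parse

    my $reversed = $name->case_all_reversed;

    # Skip null results (unknown or joint name types)
    next unless defined $reversed && length $reversed;
    push @sortable, $reversed;
}

# Surname-first strings sort alphabetically by surname
print "$_\n" for sort @sortable;
```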

perl v5.24.1			  2016-07-15	      Lingua::EN::NameParse(3)
