Lingua::EN::AddressParse(3)  User Contributed Perl Documentation  Lingua::EN::AddressParse(3)

NAME
       Lingua::EN::AddressParse	- extract components of	a street address from
       free format text

SYNOPSIS
	   use Lingua::EN::AddressParse;

	   my %args =
	   (
	     country	 => 'US',
	     auto_clean	 => 1,
	     force_case	 => 1,
	     abbreviate_subcountry => 0,
	     abbreviated_subcountry_only => 0,
	     force_post_code =>	0
	   );

	   my $address = Lingua::EN::AddressParse->new(%args);
	   my $error = $address->parse("40 1/2 N OLD MASSACHUSETTS AVE APT 3B Washington Valley Washington 98100: HOLD MAIL");

	   print $address->report;

	       Country address format  'US'
	       Address type	       'suburban'
	       Non matching part       'HOLD MAIL '
	       Error		       '1'
	       Error descriptions      'non matching section : HOLD MAIL '
	       Warning		       '1'
	       Warning description     ''
	       Case all		       '40 1/2 N Old Massachusetts Ave Apt 3B Washington Valley	WA 98100'
	       COMPONENTS	       ''
	       base_street_name	       'Old Massachusetts'
	       post_code	       '98100'
	       property_identifier     '40 1/2'
	       street_direction_prefix 'N'
	       street_name	       'N Old Massachusetts'
	       street_type	       'Ave'
	       sub_property_identifier '3B'
	       sub_property_type       'Apt'
	       subcountry	       'WASHINGTON'
	       suburb		       'Washington Valley'

	   my %address_components = $address->components;
	   print $address_components{sub_property_type};       # APT
	   print $address_components{sub_property_identifier}; # 3B
	   print $address_components{property_identifier};     # 40 1/2

	   my %address_properties = $address->properties;
	   print $address_properties{type};	       # suburban
	   print $address_properties{non_matching};    # : HOLD MAIL

	   my $correct_casing = $address->case_all;

DESCRIPTION
       This module takes as input a suburban, rural or postal address in free
       format text such	as,

	   3080	28TH AVE N ST PETERSBURG, FL 33713-3810
	   12 1st Avenue N Suite # 2 Somewhere CA 12345	USA
	   C/O JOHN, KENNETH JR	POA 744	WIND RIVER DR SYLVANIA,	OH 43560-4317

	   9 Church Street, Abertillery, Mid Glamorgan NP13 1DA
	   27 Bury Street, Abingdon, Oxfordshire OX14 3QT

	   2A O'CONNELL	ST KEW NSW 2123
	   12/3-5 AUBREY ST MOUNT VICTORIA VICTORIA 3133
	   "OLD	REGRET"	WENTWORTH FALLS	NSW 2782 AUSTRALIA
	   GPO Box K318, HAYMARKET, NSW	2000

       and attempts to parse it. If successful, the address is broken down
       into its components and useful functions can be performed, such as:

	   converting upper or lower case values to title case (2A O'Connell St Kew NSW 2123)
	   extracting the address's individual components      (2A,O'Connell,St,KEW,NSW,2123)
	   determining the type of format the address is in    ('suburban')

       If the address cannot be	parsed you have	the option of cleaning the
       address of bad characters, or extracting	any portion that was parsed
       and the portion that failed.

       This module can be used for analysing and improving the quality of
       lists of	residential and	postal addresses.

       By using a large combination of regular expressions with look ahead
       analysis, patterns can be parsed that confuse many other parsers.
       Examples are:

	   Street names with several street types: Lane Cove Road
	   Suburbs which include street types: Smith Road St Marys
	   Suburbs that include state names: Fort Washington Washington

DEFINITIONS
       The following terms are used by AddressParse to define the components
       that can	make up	an address.

	   Pre cursor :	C/O MR A Smith...
	   Sub property	identifier : Level 1A Unit 2, Apartment	B, Lot 12, Suite # 12 ...
	   Property Identifier : 12/66A, 24-34,	2A, 23B/12C, 12/42-44, 2.5
	   Property name   : "Old Regret"
	   Post	Box	   : GPO Box K123, LPO 2345, RMS 23 ...
	   Road	Box	   : RMB 24A, RMS 234 ...
	   Street Direction: North, SE,	Sth. etc
	   Street name	   : O'Hare, New South Head, The Causeway, Broadway
	   Street type	   : Road, Rd.,	St, Lane, Highway, Crescent, Circuit ...
	   Suburb	   : Dee Why, St. John's Wood ...
	   Sub country	   : NSW, New South Wales, ACT,	NY, New	Jersey AZ ...
	   Post	(zip) code : 2062, 34532-1234, SG12A 9ET
	   Country	   : Australia,	UK, US or Canada

       The main address formats currently supported are as follows (a ?
       means the component is optional):

	   'suburban' :	sub_property(?)	property_identifier(?) street street_type suburb subcountry post_code(?) country(?)

	   OR for the USA
	   'suburban' :	property_identifier(?) street street_type sub_property(?) suburb subcountry post_code(?) country(?)

	   'rural'    :	property_name suburb subcountry	post_code(?) country(?)
	   'post_box' :	post_box suburb	subcountry post_code(?)	country(?)
	   'road_box' :	road_box street	street_type suburb subcountry post_code(?) country(?)
	   'road_box' :	road_box suburb	subcountry post_code(?)	country(?)

       Note that suburb and subcountry are not optional. The accuracy of the
       parser is improved by providing as much context as possible. Providing
       a suburb can help to identify street names that would otherwise be
       ambiguous.

       For the case where you only have	a street address, dummy	(but still
       valid) values can be used for suburb (such as 'Somewhere') and sub
       country (such as 'NY'). These dummy values will be parsed but can be
       ignored.
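
       The dummy value approach above can be sketched as follows. The helper
       name with_dummy_locality is hypothetical and not part of the module;
       it only builds the input string that would then be passed to the
       "parse" method.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::EN::AddressParse): append a dummy
# suburb and sub country so a bare street address fits the 'suburban' format.
sub with_dummy_locality {
    my ($street, $suburb, $subcountry) = @_;
    $suburb     //= 'Somewhere';   # dummy suburb suggested in the text above
    $subcountry //= 'NY';          # dummy, but still valid, sub country
    $street =~ s/\s+$//;           # trim trailing white space
    return "$street $suburb $subcountry";
}

print with_dummy_locality('12 Main St'), "\n";   # 12 Main St Somewhere NY
```

       The resulting string has no post code, so an object used to parse it
       would presumably need force_post_code set to 0. The suburb and
       subcountry components returned afterwards are the dummy values and
       can be discarded.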

       All formats may contain a precursor.

       Refer to	the component grammar defined in the
       Lingua::EN::AddressParse::Grammar module	for a complete list of
       combinations.

METHODS
   new
       The "new" method	creates	an instance of an address object and sets up
       the grammar used	to parse addresses. This must be called	before any of
       the following methods are invoked. Note that the	object only needs to
       be created once,	and can	be reused with new input data.

       Various setup options may be defined in a hash that is passed as	an
       optional	argument to the	"new" method.

	   my %args =
	   (
	     country	 => 'US',
	     auto_clean	 => 1,
	     force_case	 => 1,
	     abbreviate_subcountry => 1,
	     abbreviated_subcountry_only => 1,
	     force_post_code =>	1
	   );

	   my $address = Lingua::EN::AddressParse->new(%args);

       country
	   The country argument	must be	specified. It determines the possible
	   list	of valid sub countries (states,	counties etc, defined in the
	   Locale::SubCountry module) and post code formats. Either the	full
	   name	or abbreviation	may be specified. The currently	supported
	   country names and codes are:

	       AU or Australia
	       CA or Canada
	       GB or United Kingdom
	       US or United States

	   All forms of	upper/lower case are acceptable	in the country's
	   spelling. If	a country name is supplied that	the module doesn't
	   recognise, it will die.
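
	   Because "new" dies on an unrecognised country, a caller that takes
	   the country from user input may want to trap the exception. The
	   following is a minimal sketch; try_new is a hypothetical helper,
	   and the country name 'Atlantis' is only an example of a value
	   assumed to be unsupported.

```perl
use strict;
use warnings;

# Hypothetical helper: call a class's constructor and trap any exception.
# Returns (object, '') on success or (undef, reason) on failure.
sub try_new {
    my ($class, %args) = @_;
    my $object = eval { $class->new(%args) };
    return ($object, '') if $object;
    my $reason = $@ || "constructor returned nothing\n";
    return (undef, $reason);
}

# Only exercised when the module is installed
if (eval { require Lingua::EN::AddressParse; 1 }) {
    my ($address, $reason) =
        try_new('Lingua::EN::AddressParse', country => 'Atlantis');
    print "not created: $reason" unless $address;
}
```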

       force_case (optional)
	   This	option only applies to the "case_all" method, see below.

       auto_clean (optional)
	   When	this option is set to a	positive value,	the input string is
	   'cleaned' to	try and	normalise bad patterns.	The type of cleaning
	   includes

	       remove non alphanumeric characters
	       remove full stops
	       remove redundant	white space
	       add missing space separators
	       expand abbreviations to more common forms
	       remove bracketed	annotations
	       fix badly formed	sub property identifiers

       abbreviate_subcountry (optional)
	   When this option is set to a positive value, the sub country is
	   forced to its abbreviated form, so "New South Wales" becomes
	   "NSW". If the sub country is already abbreviated then its value is
	   not altered.

       abbreviated_subcountry_only (optional)
	   When	this option is set to a	positive value,	only the abbreviated
	   form	of sub country is allowed, such	as "NSW" and not "New South
	   Wales". This	will make parsing quicker and ensure that addresses
	   comply with postal standards	that normally permit only abbreviated
	   sub countries.

	   It also avoids matching a sub country name too early, as in the
	   case of 'Port Washington New Jersey'. Normally, 'Washington' would be
	   consumed as the sub country, but by first converting the address to
	   'Port Washington NJ' we avoid this problem.
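
	   One way a caller might do that conversion before parsing is
	   sketched below. The lookup table and the helper name
	   abbreviate_trailing_state are assumptions made for illustration;
	   a real table would cover every sub country of the chosen country.

```perl
use strict;
use warnings;

# Illustrative subset only; not taken from the module
my %ABBREVIATION = (
    'New Jersey' => 'NJ',
    'New York'   => 'NY',
    'Washington' => 'WA',
);

# Hypothetical helper: abbreviate a full state name, but only when it ends
# the address (optionally followed by a US ZIP code).
sub abbreviate_trailing_state {
    my ($text) = @_;
    for my $name (keys %ABBREVIATION) {
        return $text
            if $text =~ s/\b\Q$name\E(?=(?:\s+\d{5}(?:-\d{4})?)?\s*$)/$ABBREVIATION{$name}/i;
    }
    return $text;
}

print abbreviate_trailing_state('Port Washington New Jersey'), "\n";
# Port Washington NJ
```

	   Anchoring the match at the end of the string is what keeps the
	   'Washington' in 'Port Washington' from being treated as a state.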

       force_post_code (optional)
	   When this option is set to a positive value, the address must
	   contain a post code. If it does not then an error flag is raised.
	   If this option is set to 0 then a post code is optional.

	   By default, this option is true.

   parse
	   $error = $address->parse("12/3-5 AUBREY ST VERMONT VIC 3133");

       The "parse" method takes a single parameter of a text string containing
       an address. It attempts to parse the address and break it down into the
       components described below. If the address is parsed successfully, a 0
       is returned, otherwise a 1.

       Note that you can successfully parse all the components of an address
       and still have an error returned. This occurs when you have non
       matching data following a valid address. To check whether the data is
       unusable, you also need to use the "properties" method to check whether
       the address type is 'unknown'.

       This method is a	prerequisite for all the following methods.
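
       The distinction between an error with usable components and an
       unusable address can be sketched as below. classify_result and its
       three labels are hypothetical names; only the error flag and the
       'type' property come from the module. An error flag can have other
       causes too, so the middle label is deliberately vague.

```perl
use strict;
use warnings;

# Hypothetical helper: combine the value returned by parse() with the hash
# returned by properties() into a coarse status label.
sub classify_result {
    my ($error, %properties) = @_;
    return 'parsed' unless $error;
    return 'unusable' if ($properties{type} // 'unknown') eq 'unknown';
    return 'partially_usable';   # error flagged, but an address type matched
}

# Values in the shape shown in the SYNOPSIS report
print classify_result(1, type => 'suburban', non_matching => 'HOLD MAIL '), "\n";
# partially_usable

# With a real object this would be called as:
#   my $error  = $address->parse($text);
#   my $status = classify_result($error, $address->properties);
```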

   components
	   %address = $address->components($upper_case_all);
	   $suburb = $address{suburb};

       If the optional argument $upper_case_all is set to a positive value,
       all components are converted to upper case.

       The "components"	method returns all the address components in a hash.
       The following keys are used for each component:

	   pre_cursor -	such as	'C/O Mr	A Smith'
	   po_box_type - such as 'Private Boxes'
	   post_box
	   road_box
	   sub_property_type
	   sub_property_identifier
	   property_identifier
	   property_name
	   level - such	as 12th	Floor
	   building - such as Tower A
	   street_direction_prefix (such as East, NW, North etc)
	   base_street_name (the name with direction removed, such as "Main" in	"East Main St")
	   street_name (the full street	name such as "East Main")
	   street_type
	   street_direction_suffix (US only, abbreviated only such as N, SE etc)
	   suburb
	   subcountry
	   post_code
	   country

       If a component has no matching data for a given address, its value
       will be set to the empty string.

       Each component is converted to title case, meaning the first letter of
       each word is set to capitals and the remainder to lower case.

       Proper name capitalisations such as MacNay and O'Brien are observed.

       The following components	are not	converted to title case:

	   post_box
	   road_box
	   subcountry
	   post_code
	   country
	   street_direction_suffix

       If your input data is all upper case and you want to retain that format
       for parsed data, you will need to apply the 'uc' function to each
       component.
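
       Applying 'uc' to each component might look like the sketch below.
       upper_case_components is a hypothetical helper, and the sample values
       are copied from the SYNOPSIS report; a real hash would come from
       $address->components.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of a components hash with every value
# upper cased and the keys left untouched.
sub upper_case_components {
    my (%components) = @_;
    return map { ($_ => uc $components{$_}) } keys %components;
}

my %upper = upper_case_components(
    street_name => 'N Old Massachusetts',
    street_type => 'Ave',
    suburb      => 'Washington Valley',
);

print "$upper{street_name} $upper{street_type}\n";   # N OLD MASSACHUSETTS AVE
```

       The optional $upper_case_all argument to "components", described
       above, is documented as having the same effect.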

   case_all
	   $correct_casing = $address->case_all;

       The "case_all" method does the same thing as the	"components" method
       except the entire address is returned as	a title	cased text string.

       If the force_case option was set in the "new" method above, the entire
       input string is title cased, including any unmatched sections after a
       recognisable address that failed parsing. This option is useful when
       you know you have invalid data, but you still want to title case what
       you have.

   properties
       The "properties" method returns several properties of the address as a
       hash. The following keys are used for each property:

	   type - one of suburban, rural, post_box, road_box, unknown
	   non_matching - any trailing string that is not part of the address

       Additional properties can be accessed with the following:

	 $address->{original_input}
	 $address->{input_string} - string after auto_clean option has been applied
	 $address->{country_code} - abbreviated country address format (as defined in the "new" method)
	 $address->{error} - error flag, 0 = good, 1 = error
	 $address->{error_desc} - text to describe the type of parsing error
	 $address->{warning} - warning flag, 0 = good, 1 = warning
	 $address->{warning_desc} - text to describe the type of parsing warning(s)

       Warnings mean that the address has parsed but there may still be errors
       within its components.
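
       A short sketch of reading the error and warning fields follows.
       status_line is a hypothetical helper, and a plain hash reference
       stands in for a parsed object so that the example runs without the
       module installed.

```perl
use strict;
use warnings;

# Hypothetical helper: summarise the documented status fields of an object.
# Errors take priority over warnings.
sub status_line {
    my ($address) = @_;
    return "error: $address->{error_desc}"     if $address->{error};
    return "warning: $address->{warning_desc}" if $address->{warning};
    return 'ok';
}

# Stand-in for a parsed object, using the field names listed above
my $stub = {
    error        => 0,
    error_desc   => '',
    warning      => 1,
    warning_desc => 'example warning text',   # made-up text for illustration
};

print status_line($stub), "\n";   # warning: example warning text
```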

   report
       Creates a formatted text report containing:

	   the input string
	   the cleaned input string
	   the country type
	   the address type
	   any non matching part of input string
	   if any parsing errors occurred
	   error description
	   if any parsing warning occurred
	   warning description

	   the name and	value of each defined component

       Returns a string	containing a multi line	formatted text report

DEPENDENCIES
       Lingua::EN::NameParse, Locale::SubCountry, Parse::RecDescent

BUGS
LIMITATIONS
       Streets such as 'The Esplanade' will return a street of 'The Esplanade'
       and a street type of null string.

       The abbreviation 'St' can be interpreted as either street or Saint.
       This leads to ambiguities such as '12 East St Thomas Lane'. This could
       be 'East Street', suburb of 'Thomas Lane' or 'East St Thomas Lane'. As
       the first pattern is the more common, that is what will match.

       For US addresses, an ambiguity arises between a street directional
       suffix and a suburb directional prefix, such as '12 Main St S
       Springfield CA 92345'. Is it Main St South, or South Springfield? The
       parser assumes that 'S' belongs to the street description.

       The huge number of character combinations that can form a valid address
       makes it impossible to correctly identify them all.

       Valid addresses must contain:

	   property address, suburb, subcountry	(aka state) in that order.

       This format is widely accepted in Australia and the US.

       UK addresses will often include suburb, town, city and county, formats
       that are	very difficult to parse.

       Property names must be enclosed in single or double quotes like "Old
       Regret".

       Because of the large combination	of possible addresses defined in the
       grammar,	the program is not very	fast.

REFERENCES
       "The Wordsworth Dictionary of Abbreviations & Acronyms" (1997)

       Australian Standard AS4212-1994 "Geographic Information Systems - Data
       Dictionary for transfer of street addressing information"

       ISO 3166-2:1998,	Codes for the representation of	names of countries and
       their subdivisions. Also	released as AS/NZS 2632.2:1999

SEE ALSO
       AddressParse is designed to identify properties, which have a unique
       physical location. Geo::StreetAddress::US will also parse addresses for
       the USA, and can handle locations defined by street intersections, such
       as:

	   "Hollywood & Vine, Los Angeles, CA"
	   "Mission Street at Valencia Street, San Francisco, CA"

	   Lingua::EN::NameParse
	   Geo::StreetAddress::US
	   Parse::RecDescent
	   Locale::SubCountry

       See
       <http://www.upu.int/post_code/en/postal_addressing_systems_member_countries.shtml>
       for a list of different addressing formats from around the world, and
       also <http://www.bitboost.com/ref/international-address-formats.html>.

REPOSITORY
       <https://github.com/kimryan/Lingua-EN-AddressParse>

TO DO
       Define grammar for other languages. Hopefully, all that would be needed
       is to specify a new module with its own grammar, and inherit all the
       existing methods. I don't have the knowledge of the naming conventions
       for non-English languages.

AUTHOR
       AddressParse was	written	by Kim Ryan <kimryan at	cpan d o t org>

COPYRIGHT AND LICENSE
       Copyright (c) 2015 Kim Ryan. All	rights reserved.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.24.1			  2016-09-15	   Lingua::EN::AddressParse(3)
