Ident(3)	      User Contributed Perl Documentation	      Ident(3)

NAME
       Lingua::Ident - Statistical language identification

SYNOPSIS
        use Lingua::Ident;
        $classifier = Lingua::Ident->new("filename 1", ..., "filename n");
        $lang = $classifier->identify("text to classify");
        $probabilities = $classifier->calculate("text to classify");

DESCRIPTION
       This module implements a	statistical language identifier	based on the
       approach	Ted Dunning described in his 1994 report Statistical
       Identification of Language.

METHODS
   Lingua::Ident->new($filename, ...)
       Construct a new classifier.  The filename arguments to the constructor
       must refer to files containing tables of n-gram probabilities for
       languages (language models).  These tables can be generated using the
       trainlid(1) utility program.

   $classifier->identify($string)
       Identify	the language of	a text given in	$string.  The identify()
       method returns the value	specified in the _LANG field of	the
       probabilities table of the language in which the	text is	most likely
       written (see "WARNINGS" below).

       Internally, the identify() method calls the calculate() method.

   $classifier->calculate($string)
       Calculate the probabilities for a text to be in the languages known to
       the classifier.  This method returns a reference to an array.  The
       array represents a table of languages and the probability for each
       language.  Each array element is a reference to an array containing two
       elements: the language name and the associated probability.  For
       example, you may get something like this:

	  [['de.iso-8859-1', -317.980835274509],
	   ['en.iso-8859-1', -450.804230119916], ...]

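       The elements are sorted in descending order by probability, so the most
       likely language comes first.  As the example above shows, the values
       are logarithms of probabilities and therefore negative.  You can use
       this data to assess the reliability of the categorization and make your
       own decision using application-specific metrics.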
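
       As one way to act on this table, the following sketch compares the two
       best entries and accepts the top language only if it leads by a minimum
       margin.  The helper name confident_language() and the margin of 50 are
       illustrative assumptions, not part of Lingua::Ident; the sample table
       reuses the values from the example above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::Ident: return the top language
# only if its score exceeds the runner-up's by at least $min_margin.
sub confident_language {
    my ($table, $min_margin) = @_;
    return unless @$table;                       # no models loaded
    return $table->[0][0] if @$table == 1;       # nothing to compare with
    my $margin = $table->[0][1] - $table->[1][1];
    return $margin >= $min_margin ? $table->[0][0] : undef;
}

# Sample table in the format returned by calculate(), sorted descending.
my $probabilities = [
    ['de.iso-8859-1', -317.980835274509],
    ['en.iso-8859-1', -450.804230119916],
];

# The margin of 50 is an assumption and needs tuning for real models.
my $lang = confident_language($probabilities, 50);
print defined $lang ? "accepted: $lang\n" : "too close to call\n";
```

       A suitable margin depends on the text length and on the language models
       in use, so it is best chosen by experiment.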

       When neither a trigram nor a bigram is found, the calculation deviates
       slightly	from the formula given by Dunning (1994).  According to
       Dunning's formula, one would estimate the probability as:

	 p = log(1/#alph)

       where #alph is the size of the alphabet of a particular language.  This
       penalizes different language models with	different values because the
       alphabet	sizes of the languages differ.

       However, the size of the alphabet is much larger for Asian languages
       than for European languages.  For example, for the sample data in the
       Lingua::Ident distribution, trainlid(1) reports #alph = 127 for zh.big5
       vs. #alph = 31 for de.iso-8859-1.  This means that Asian languages are
       penalized much more heavily than European languages when an estimate
       must be made.

       To use the same penalty for all languages, calculate() now uses the
       average of all alphabet sizes instead.
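
       To make the effect concrete, the following sketch evaluates the
       fallback estimate for the two alphabet sizes quoted above and for their
       average.  It assumes an arithmetic mean over just these two models and
       Perl's natural logarithm; the averaging in the actual implementation
       covers all loaded models and may differ in detail.  The helper
       fallback_logprob() is hypothetical.

```perl
use strict;
use warnings;

# Alphabet sizes reported by trainlid(1) for the sample data (see text).
my %alph = ('zh.big5' => 127, 'de.iso-8859-1' => 31);

# Hypothetical helper: fallback estimate p = log(1/#alph) (natural log).
sub fallback_logprob {
    my ($size) = @_;
    return log(1 / $size);
}

# Per-language penalty, as in Dunning's formula.
for my $lang (sort keys %alph) {
    printf "%-14s %.3f\n", $lang, fallback_logprob($alph{$lang});
}

# Shared penalty based on the mean alphabet size (assumed arithmetic mean).
my $sum = 0;
$sum += $_ for values %alph;
my $mean = $sum / scalar(keys %alph);
printf "%-14s %.3f\n", 'average', fallback_logprob($mean);
```

       With these sizes the per-language penalties are about -3.434 and
       -4.844, while the shared penalty is about -4.369.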

       NOTE: This has only been lightly tested so far; feedback is welcome.

WARNINGS
       Since Lingua::Ident is based on statistics, it cannot be 100% accurate.
       More precisely, Dunning (see "SEE ALSO" below) reports that his
       implementation achieves 92% accuracy with 50 KB of training text for
       20-character strings when discriminating between English and Spanish.
       This implementation should be as accurate as Dunning's.  However, not
       only the size but also the quality of the training text plays a role.

       The current implementation doesn't use a threshold to determine whether
       the most probable language has a high enough probability.  If you try
       to classify a text in a language for which there is no probability
       table, identify() still returns one of the known languages, which is
       necessarily incorrect.
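
       Because no threshold is built in, an application can apply its own.
       The sketch below rejects a result when the best score, divided by the
       length of the text, falls below a floor.  Both the per-character
       normalization and the floor of -3.0 are assumptions chosen for
       illustration and would need tuning against real models;
       accept_or_reject() is a hypothetical helper, and the sample table with
       made-up scores stands in for the output of calculate().

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::Ident: accept the best entry of a
# calculate()-style table only if its score per character is at least $floor.
sub accept_or_reject {
    my ($table, $text, $floor) = @_;
    return unless @$table && length($text);
    my ($lang, $score) = @{ $table->[0] };       # best entry comes first
    return $score / length($text) >= $floor ? $lang : undef;
}

# Stand-in for $classifier->calculate($text); the scores are made up.
my $text  = 'x' x 100;                           # placeholder 100-character text
my $table = [ ['de.iso-8859-1', -250], ['en.iso-8859-1', -400] ];

# The floor of -3.0 is an assumption, not a recommended value.
my $lang = accept_or_reject($table, $text, -3.0);
print defined $lang ? "accepted: $lang\n" : "rejected: no known language fits\n";
```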

AUTHOR
       Lingua::Ident was developed by Michael Piotrowski <mxp@dynalabs.de>.

LICENSE
       This program is free software; you may redistribute it and/or modify it
       under the same terms as Perl itself.

SEE ALSO
       Dunning,	Ted (1994).  Statistical Identification	of Language.
       Technical report	CRL MCCS-94-273.  Computing Research Lab, New Mexico
       State University.

perl v5.24.1			  2010-05-14			      Ident(3)
