Locale::Maketext::TPJ13(3)  Perl Programmers Reference Guide  Locale::Maketext::TPJ13(3)

       Locale::Maketext::TPJ13 -- article about	software localization

	 # This is an article, not a module.

       The following article by	Sean M.	Burke and Jordan Lachler first
       appeared	in The Perl Journal #13	and is copyright 1999 The Perl
       Journal.	It appears courtesy of Jon Orwant and The Perl Journal.	 This
       document	may be distributed under the same terms	as Perl	itself.

Localization and Perl: gettext breaks, Maketext	fixes
       by Sean M. Burke	and Jordan Lachler

       This article points out cases where gettext (a common system for
       localizing software interfaces -- i.e., making them work	in the user's
       language	of choice) fails because of basic differences between human
       languages.  This	article	then describes Maketext, a new system capable
       of correctly treating these differences.

   A Localization Horror Story:	It Could Happen	To You
	   "There are a number of languages spoken by human beings in this
	   world."

	   -- Harald Tveit Alvestrand, in RFC 1766, "Tags for the
	   Identification of Languages"

       Imagine that your task for the day is to	localize a piece of software
       -- and luckily for you, the only	output the program emits is two
       messages, like this:

	 I scanned 12 directories.

	 Your query matched 10 files in	4 directories.

       So how hard could that be?  You look at the code	that produces the
       first item, and it reads:

	 printf("I scanned %g directories.",
		$directory_count);

       You think about that, and realize that it doesn't even work right for
       English,	as it can produce this output:

	 I scanned 1 directories.

       So you rewrite it to read:

	 printf("I scanned %g %s.",
		$directory_count,
		$directory_count == 1 ?
		  "directory" : "directories",
	 );

       ...which does the Right Thing.  (In case you don't recall, "%g" is for
       locale-specific number interpolation, and "%s" is for string
       interpolation.)

       But you still have to localize it for all the languages you're
       producing this software for, so you pull	Locale::gettext	off of CPAN so
       you can access the "gettext" C functions	you've heard are standard for
       localization tasks.

       And you write:

	 printf(gettext("I scanned %g %s."),
		$dir_scan_count,
		$dir_scan_count == 1 ?
		  gettext("directory") : gettext("directories"),
	 );

       But you then read in the	gettext	manual (Drepper, Miller, and Pinard
       1995) that this is not a	good idea, since how a single word like
       "directory" or "directories" is translated may depend on	context	-- and
       this is true, since in a case language like German or Russian, you
       may need these words with a different case ending in the first instance
       (where the word is the object of	a verb)	than in	the second instance,
       which you haven't even gotten to	yet (where the word is the object of a
       preposition, "in	%g directories") -- assuming these keep	the same
       syntax when translated into those languages.

       So, on the advice of the	gettext	manual,	you rewrite:

	 printf( $dir_scan_count == 1 ?
		  gettext("I scanned %g	directory.") :
		  gettext("I scanned %g	directories."),
		$dir_scan_count	);

       So, you email your various translators (the boss	decides	that the
       languages du jour are Chinese, Arabic, Russian, and Italian, so you
       have one	translator for each), asking for translations for "I scanned
       %g directory." and "I scanned %g	directories.".	When they reply,
       you'll put that in the lexicons for gettext to use when it localizes
       your software, so that when the user is running under the "zh"
       (Chinese) locale, gettext("I scanned %g directory.") will return	the
       appropriate Chinese text, with a	"%g" in	there where printf can then
       interpolate $dir_scan_count.

       Your Chinese translator emails right back -- he says both of these
       phrases translate to the	same thing in Chinese, because,	in linguistic
       jargon, Chinese "doesn't	have number as a grammatical category" --
       whereas English does.  That is, English has grammatical rules that
       refer to	"number", i.e.,	whether	something is grammatically singular or
       plural; and one of these	rules is the one that forces nouns to take a
       plural suffix (generally	"s") when in a plural context, as they are
       when they follow	a number other than "one" (including, oddly enough,
       "zero").	 Chinese has no	such rules, and	so has just the	one phrase
       where English has two.  But, no problem,	you can	have this one Chinese
       phrase appear as	the translation	for the	two English phrases in the
       "zh" gettext lexicon for	your program.

       Emboldened by this, you dive into the second phrase that	your software
       needs to	output:	"Your query matched 10 files in	4 directories.".  You
       notice that if you want to treat	phrases	as indivisible,	as the gettext
       manual wisely advises, you need four cases now, instead of two, to
       cover the permutations of singular and plural on	the two	items,
       $directory_count and $file_count.  So you try this:

	 printf( $file_count ==	1 ?
	   ( $directory_count == 1 ?
	    gettext("Your query	matched	%g file	in %g directory.") :
	    gettext("Your query	matched	%g file	in %g directories.") ) :
	   ( $directory_count == 1 ?
	    gettext("Your query	matched	%g files in %g directory.") :
	    gettext("Your query	matched	%g files in %g directories.") ),
	  $file_count, $directory_count );

       (The case of "1 file in 2 [or more] directories"	could, I suppose,
       occur in	the case of symlinking or something of the sort.)

       It occurs to you	that this is not the prettiest code you've ever
       written,	but this seems the way to go.  You mail	off to the translators
       asking for translations for these four cases.  The Chinese guy replies
       with the	one phrase that	these all translate to in Chinese, and that
       phrase has two "%g"s in it, as it should	-- but there's a problem.  He
       translates it word-for-word back: "In %g	directories contains %g	files
       match your query."  The %g slots	are in an order	reverse	to what	they
       are in English.	You wonder how you'll get gettext to handle that.

       But you put it aside for	the moment, and	optimistically hope that the
       other translators won't have this problem, and that their languages
       will be better behaved -- i.e., that they will be just like English.

       But the Arabic translator is the	next to	write back.  First off,	your
       code for	"I scanned %g directory." or "I	scanned	%g directories."
       assumes there's only singular or	plural.	 But, to use linguistic	jargon
       again, Arabic has grammatical number, like English (but unlike
       Chinese), but it's a three-term category: singular, dual, and plural.
       In other	words, the way you say "directory" depends on whether there's
       one directory, or two of	them, or more than two of them.	 Your test of
       "($directory_count == 1)" no longer does the job.  And it means that where
       English's grammatical category of number	necessitates only the two
       permutations of the first sentence based	on "directory [singular]" and
       "directories [plural]", Arabic has three	-- and,	worse, in the second
       sentence	("Your query matched %g	file in	%g directory."), where English
       has four, Arabic	has nine.  You sense an	unwelcome, exponential trend
       taking shape.
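
       To put numbers on that trend: if every numeric blank needs its own
       whole-sentence variant for each grammatical number class, the count of
       variants is the number of classes raised to the number of blanks.  A
       rough sketch of the arithmetic (the helper name "phrase_variants" is
       invented here purely for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: how many whole-sentence variants are needed when
# each of $slots numeric blanks can fall into $classes number classes.
sub phrase_variants {
    my ($classes, $slots) = @_;
    return $classes ** $slots;
}

# Number classes as described in the story: Chinese has none to
# distinguish (1), English has two, Arabic has three.
for my $lang ( [ Chinese => 1 ], [ English => 2 ], [ Arabic => 3 ] ) {
    my ($name, $classes) = @$lang;
    printf "%-8s one blank: %d   two blanks: %d\n",
        $name, phrase_variants($classes, 1), phrase_variants($classes, 2);
}
```

       A third blank in the same sentence would take Arabic to 27 variants.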

       Your Italian translator emails you back and says that "I scanned 0
       directories" (a possible	English	output of your program)	is stilted,
       and if you think	that's fine English, that's your problem, but that
       just will not do	in the language	of Dante.  He insists that where
       $directory_count	is 0, your program should produce the Italian text for
       "I didn't scan any directories.".  And ditto for	"I didn't match	any
       files in	any directories", although he says the last part about "in any
       directories" should probably just be left off.

       You wonder how you'll get gettext to handle this; to accommodate	the
       ways Arabic, Chinese, and Italian deal with numbers in just these few
       very simple phrases, you	need to	write code that	will ask gettext for
       different queries depending on whether the numerical values in question
       are 1, 2, more than 2, or in some cases 0, and you still	haven't
       figured out the problem with the	different word order in	Chinese.

       Then your Russian translator calls on the phone,	to personally tell you
       the bad news about how really unpleasant	your life is about to become:

       Russian,	like German or Latin, is an inflectional language; that	is,
       nouns and adjectives have to take endings that depend on	their case
       (i.e., nominative, accusative, genitive,	etc...)	-- which is roughly a
       matter of what role they have in the syntax of the sentence -- as well as
       on the grammatical gender (i.e.,	masculine, feminine, neuter) and
       number (i.e., singular or plural) of the	noun, as well as on the
       declension class	of the noun.  But unlike with most other inflected
       languages, putting a number-phrase (like	"ten" or "forty-three",	or
       their Arabic numeral equivalents) in front of a noun in Russian can
       change the case and number that noun is,	and therefore the endings you
       have to put on it.

       He elaborates:  In "I scanned %g	directories", you'd expect
       "directories" to	be in the accusative case (since it is the direct
       object in the sentence) and the plural number, except where
       $directory_count	is 1, then you'd expect	the singular, of course.  Just
       like Latin or German.  But!  Where $directory_count % 10	is 1 ("%" for
       modulo, remember), assuming $directory_count is an integer, and except
       where $directory_count %	100 is 11, "directories" is forced to become
       grammatically singular, which means it gets the ending for the
       accusative singular...  You begin to visualize the code it'd take to
       test for	the problem so far, and	still work for Chinese and Arabic and
       Italian,	and how	many gettext items that'd take,	but he keeps going...
       But where $directory_count % 10 is 2, 3,	or 4 (except where
       $directory_count	% 100 is 12, 13, or 14), the word for "directories" is
       forced to be genitive singular -- which means another ending... The
       room begins to spin around you, slowly at first...  But with all	other
       integer values, since "directory" is an inanimate noun, when preceded
       by a number and in the nominative or accusative cases (as it is here,
       just your luck!), it does stay plural, but it is	forced into the
       genitive	case --	yet another ending...  And you never hear him get to
       the part	about how you're going to run into similar (but	maybe subtly
       different) problems with	other Slavic languages like Polish, because
       the floor comes up to meet you, and you fade into unconsciousness.

       The above cautionary tale relates how an	attempt	at localization	can
       lead from programmer consternation, to program obfuscation, to a	need
       for sedation.  But careful evaluation shows that	your choice of tools
       merely needed further consideration.
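
       For reference, the number rules the Russian translator recites can be
       written down in a few lines.  This is only a sketch of the rules as
       told in the story above, for an inanimate noun in a nominative or
       accusative slot and a non-negative integer count; the function name
       and the labels it returns are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: which form of the noun the count $n forces,
# following the rules recited in the story (non-negative integers only).
sub russian_number_class {
    my ($n) = @_;
    my ($mod10, $mod100) = ($n % 10, $n % 100);

    # 1, 21, 31, ... but not 11, 111, ...
    return 'accusative singular' if $mod10 == 1 && $mod100 != 11;

    # 2-4, 22-24, ... but not 12-14, 112-114, ...
    return 'genitive singular'
        if $mod10 >= 2 && $mod10 <= 4 && ($mod100 < 12 || $mod100 > 14);

    # everything else: 0, 5-20, 25-30, ...
    return 'genitive plural';
}

printf "%3d -> %s\n", $_, russian_number_class($_) for 0, 1, 3, 5, 11, 12, 21, 22;
```

       Note that this only picks the category; producing the right ending for
       a particular noun is a separate problem.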

   The Linguistic View
	   "It is more complicated than	you think."

	   -- The Eighth Networking Truth, from	RFC 1925

       The field of Linguistics	has expended a great deal of effort over the
       past century trying to find grammatical patterns	which hold across
       languages; it's been a constant process of people making
       generalizations that should apply to all	languages, only	to find	out
       that, all too often, these generalizations fail -- sometimes failing
       for just	a few languages, sometimes whole classes of languages, and
       sometimes nearly	every language in the world except English.  Broad
       statistical trends are evident in what the "average language" is	like
       as far as what its rules	can look like, must look like, and cannot look
       like.  But the "average language" is just as unreal a concept as	the
       "average	person"	-- it runs up against the fact no language (or person)
       is, in fact, average.  The wisdom of past experience leads us to
       believe that any	given language can do whatever it wants, in any	order,
       with appeal to any kind of grammatical categories it wants -- case,
       number, tense, real or metaphoric characteristics of the	things that
       words refer to, arbitrary or predictable	classifications	of words based
       on what endings or prefixes they	can take, degree or means of certainty
       about the truth of statements expressed,	and so on, ad infinitum.

       Mercifully, most	localization tasks are a matter	of finding ways	to
       translate whole phrases,	generally sentences, where the context is
       relatively set, and where the only variation in content is usually in a
       number being expressed -- as in the example sentences above.
       Translating specific, fully-formed sentences is,	in practice, fairly
       foolproof -- which is good, because that's what's in the	phrasebooks
       that so many tourists rely on.  Now, a given phrase (whether in a
       phrasebook or in	a gettext lexicon) in one language might have a
       greater or lesser applicability than that phrase's translation into
       another language	-- for example,	strictly speaking, in Arabic, the
       "your" in "Your query matched..." would take a different	form depending
       on whether the user is male or female; so the Arabic translation
       "your[feminine] query" is applicable in fewer cases than	the
       corresponding English phrase, which doesn't distinguish the user's
       gender.	(In practice, it's not feasible	to have	a program know the
       user's gender, so the masculine "you" in Arabic is usually used, by
       default.)

       But in general, such surprises are rare when entire sentences are being
       translated, especially when the functional context is restricted	to
       that of a computer interacting with a user either to convey a fact or
       to prompt for a piece of	information.  So, for purposes of
       localization, translation by phrase (generally by sentence) is both the
       simplest	and the	least problematic.

   Breaking gettext
	   "It Has To Work."

	   -- First Networking Truth, RFC 1925

       Consider	that sentences in a tourist phrasebook are of two types: ones
       like "How do I get to the marketplace?" that don't have any blanks to
       fill in,	and ones like "How much	do these ___ cost?", where there's one
       or more blanks to fill in (and these are	usually	linked to a list of
       words that you can put in that blank: "fish", "potatoes", "tomatoes",
       etc.).  The ones	with no	blanks are no problem, but the fill-in-the-
       blank ones may not be really straightforward. If	it's a Swahili
       phrasebook, for example,	the authors probably didn't bother to tell you
       the complicated ways that the verb "cost" changes its inflectional
       prefix depending	on the noun you're putting in the blank.  The trader
       in the marketplace will still understand	what you're saying if you say
       "how much do these potatoes cost?" with the wrong inflectional prefix
       on "cost".  After all, you can't	speak proper Swahili, you're just a
       tourist.	 But while tourists can	be stupid, computers are supposed to
       be smart; the computer should be	able to	fill in	the blank, and still
       have the	results	be grammatical.

       In other	words, a phrasebook entry takes	some values as parameters (the
       things that you fill in the blank or blanks), and provides a value
       based on	these parameters, where	the way	you get	that final value from
       the given values	can, properly speaking,	involve	an arbitrarily complex
       series of operations.  (In the case of Chinese, it'd be not at all
       complex,	at least in cases like the examples at the beginning of	this
       article;	whereas	in the case of Russian it'd be a rather	complex	series
       of operations.  And in some languages, the complexity could be spread
       around differently: while the act of putting a number-expression	in
       front of	a noun phrase might not	be complex by itself, it may change
       how you have to,	for example, inflect a verb elsewhere in the sentence.
       This is what in syntax is called	"long-distance dependencies".)

       This talk of parameters and arbitrary complexity	is just	another	way to
       say that	an entry in a phrasebook is what in a programming language
       would be	called a "function".  Just so you don't	miss it, this is the
       crux of this article: A phrase is a function; a phrasebook is a bunch
       of functions.

       The reason that using gettext runs into walls (as in the	above second-
       person horror story) is that you're trying to use a string (or worse, a
       choice among a bunch of strings)	to do what you really need a function
       for -- which is futile.  Performing (s)printf interpolation on the
       strings which you get back from gettext does allow you to do some
       common things passably well... sometimes... sort	of; but, to paraphrase
       what some people	say about "csh"	script programming, "it	fools you into
       thinking	you can	use it for real	things,	but you	can't, and you don't
       discover	this until you've already spent	too much time trying, and by
       then it's too late."

   Replacing gettext
       So, what	needs to replace gettext is a system that supports lexicons of
       functions instead of lexicons of	strings.  An entry in a	lexicon	from
       such a system should not	look like this:

	 "J'ai trouv\xE9 %g fichiers dans %g r\xE9pertoires"

       [\xE9 is	e-acute	in Latin-1.  Some pod renderers	would scream if	I used
       the actual character here. -- SB]

       but instead like	this, bearing in mind that this	is just	a first	stab:

	 sub I_found_X1_files_in_X2_directories {
	   my( $files, $dirs ) = @_[0,1];
	   $files = sprintf("%g %s", $files,
	     $files == 1 ? 'fichier' : 'fichiers');
	   $dirs = sprintf("%g %s", $dirs,
	     $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
	   return "J'ai trouv\xE9 $files dans $dirs.";
	 }

       Now, there's no particularly obvious way	to store anything but strings
       in a gettext lexicon; so	it looks like we just have to start over and
       make something better, from scratch.  I call my shot at a gettext-
       replacement system "Maketext", or, in CPAN terms, Locale::Maketext.

       When designing Maketext,	I chose	to plan	its main features in terms of
       "buzzword compliance".  And here	are the	buzzwords:

   Buzzwords: Abstraction and Encapsulation
       The complexity of the language you're trying to output a	phrase in is
       entirely	abstracted inside (and encapsulated within) the	Maketext
       module for that interface.  When	you call:

	 print $lang->maketext("You have [quant,_1,piece] of new mail.",
			       scalar(@messages));

       you don't know (and in fact can't easily	find out) whether this will
       involve lots of figuring, as in Russian (if $lang is a handle to	the
       Russian module),	or relatively little, as in Chinese.  That kind	of
       abstraction and encapsulation may encourage other pleasant buzzwords
       like modularization and stratification, depending on what design
       decisions you make.
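
       As a concrete illustration, here is roughly what such a call looks
       like with the Locale::Maketext module that ships with Perl.  Treat it
       as a sketch under assumptions: the "MyApp::L10N" class names are made
       up, and the "_AUTO" lexicon entry and "get_handle" constructor are
       features of the released module rather than anything described so far
       in this article.

```perl
use strict;
use warnings;

{
    # Project base class (hypothetical name); inherits from the real module.
    package MyApp::L10N;
    use parent 'Locale::Maketext';
}
{
    # English lexicon.  _AUTO => 1 asks Locale::Maketext to treat any
    # unknown key as its own bracket-notation entry.
    package MyApp::L10N::en;
    our @ISA     = ('MyApp::L10N');
    our %Lexicon = ( _AUTO => 1 );
}

my $lang = MyApp::L10N->get_handle('en')
    or die "Can't get a language handle";

my @messages = ('first message', 'second message');

# The caller passes a count; how that count is turned into a noun phrase
# is entirely the business of the language class behind $lang.
print $lang->maketext("You have [quant,_1,piece] of new mail.",
                      scalar(@messages)), "\n";
```

       Swapping in a handle for another language would change nothing in
       the calling code.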

   Buzzword: Isomorphism
       "Isomorphism" means "having the same structure or form";	in discussions
       of program design, the word takes on the	special, specific meaning that
       your implementation of a	solution to a problem has the same structure
       as, say,	an informal verbal description of the solution,	or maybe of
       the problem itself.  Isomorphism	is, all	things considered, a good
       thing --	it's what problem-solving (and solution-implementing) should
       look like.

       What's wrong with gettext-using code like this...

	 printf( $file_count ==	1 ?
	   ( $directory_count == 1 ?
	    "Your query	matched	%g file	in %g directory." :
	    "Your query	matched	%g file	in %g directories." ) :
	   ( $directory_count == 1 ?
	    "Your query	matched	%g files in %g directory." :
	    "Your query	matched	%g files in %g directories." ),
	  $file_count, $directory_count );

       is first	off that it's not well abstracted -- these ways	of testing for
       grammatical number (as in the expressions like "foo == 1	?
       singular_form : plural_form") should be abstracted to each language
       module, since how you get grammatical number is language-specific.

       But second off, it's not	isomorphic -- the "solution" (i.e., the
       phrasebook entries) for Chinese maps from these four English phrases to
       the one Chinese phrase that fits	for all	of them.  In other words, the
       informal	solution would be "The way to say what you want	in Chinese is
       with the	one phrase 'For	your question, in Y directories	you would find
       X files'" -- and	so the implemented solution should be, isomorphically,
       just a straightforward way to spit out that one phrase, with numerals
       properly	interpolated.  It shouldn't have to map	from the complexity of
       other languages to the simplicity of this one.

   Buzzword: Inheritance
       There's a great deal of reuse possible for sharing of phrases between
       modules for related dialects, or	for sharing of auxiliary functions
       between related languages.  (By "auxiliary functions", I	mean functions
       that don't produce phrase-text, but which, say, return an answer	to
       "does this number require a plural noun after it?".  Such auxiliary
       functions would be used in the internal logic of	functions that
       actually	do produce phrase-text.)

       In the case of sharing phrases, consider	that you have an interface
       already localized for American English (probably	by having been written
       with that as the	native locale, but that's incidental).	Localizing it
       for UK English should, in practical terms, be just a matter of running
       it past a British person	with the instructions to indicate what few
       phrases would benefit from a change in spelling or possibly minor
       rewording.  In that case, you should be able to put in the UK English
       localization module only	those phrases that are UK-specific, and	for
       all the rest, inherit from the American English module.	(And I expect
       this same situation would apply with Brazilian and Continental
       Portuguese, possibly with some very closely related languages like Czech
       and Slovak, and possibly	with the slightly different "versions" of
       written Mandarin Chinese, as I hear exist in Taiwan and mainland
       China.)
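
       With the Locale::Maketext module as released, sharing phrases this way
       can be done through ordinary class inheritance.  The sketch below
       assumes made-up class names and two made-up phrases; only the entries
       that differ are listed in the UK lexicon.

```perl
use strict;
use warnings;

{
    package MyApp::L10N;            # hypothetical project base class
    use parent 'Locale::Maketext';
}
{
    package MyApp::L10N::en_us;     # the "native" American English lexicon
    our @ISA     = ('MyApp::L10N');
    our %Lexicon = (
        'Pick a color.'   => 'Pick a color.',
        'File not found.' => 'File not found.',
    );
}
{
    package MyApp::L10N::en_gb;     # only the UK-specific differences
    our @ISA     = ('MyApp::L10N::en_us');
    our %Lexicon = (
        'Pick a color.' => 'Pick a colour.',
    );
}

my $uk = MyApp::L10N->get_handle('en-gb')
    or die "Can't get a handle for en-gb";

print $uk->maketext('Pick a color.'),   "\n";   # overridden in en_gb
print $uk->maketext('File not found.'), "\n";   # inherited from en_us
```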

       As to sharing of	auxiliary functions, consider the problem of Russian
       numbers from the	beginning of this article; obviously, you'd want to
       write only once the hairy code that, given a numeric value, would
       return some specification of which case and number a given quantified
       noun should use.	 But suppose that you discover,	while localizing an
       interface for, say, Ukrainian (a	Slavic language	related	to Russian,
       spoken by several million people, many of whom would be relieved	to
       find that your Web site's or software's interface is available in their
       language), that the rules in Ukrainian are the same as in Russian for
       quantification, and probably for	many other grammatical functions.
       While there may well be no phrases in common between Russian and
       Ukrainian, you could still choose to have the Ukrainian module inherit
       from the	Russian	module,	just for the sake of inheriting	all the
       various grammatical methods.  Or, probably better organizationally, you
       could move those	functions to a module called "_E_Slavic" or something,
       which Russian and Ukrainian could inherit useful	functions from,	but
       which would (presumably)	provide	no lexicon.
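
       In Perl terms, that arrangement might look like the following sketch.
       All the package names are hypothetical, the lexicons are left empty,
       and the premise that Ukrainian shares the Russian rule is the same
       supposition made above, not a verified linguistic claim.

```perl
use strict;
use warnings;

{
    package Toy::_E_Slavic;         # grammar helpers only; no lexicon here
    sub new { my $class = shift; return bless {}, $class }

    # The quantification rule from the horror story, written exactly once.
    sub number_class {
        my ($self, $n) = @_;
        my ($mod10, $mod100) = ($n % 10, $n % 100);
        return 'singular' if $mod10 == 1 && $mod100 != 11;
        return 'genitive singular'
            if $mod10 >= 2 && $mod10 <= 4 && ($mod100 < 12 || $mod100 > 14);
        return 'genitive plural';
    }
}
{
    package Toy::ru;                # Russian: own phrases, inherited grammar
    our @ISA     = ('Toy::_E_Slavic');
    our %Lexicon = ();              # Russian phrases would go here
}
{
    package Toy::uk;                # Ukrainian: likewise
    our @ISA     = ('Toy::_E_Slavic');
    our %Lexicon = ();              # Ukrainian phrases would go here
}

for my $class ('Toy::ru', 'Toy::uk') {
    my $handle = $class->new;
    printf "%s: 3 -> %s, 25 -> %s\n", $class,
        $handle->number_class(3), $handle->number_class(25);
}
```

       Neither language class defines "number_class" itself; both find it in
       the shared parent through normal method resolution.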

   Buzzword: Concision
       Okay, concision isn't a buzzword.  But it should	be, so I decree	that
       as a new	buzzword, "concision" means that simple	common things should
       be expressible in very few lines	(or maybe even just a few characters)
       of code -- call it a special case of "making simple things easy and
       hard things possible", and see also the role it played in the
       MIDI::Simple language, discussed	elsewhere in this issue	[TPJ#13].

       Consider	our first stab at an entry in our "phrasebook of functions":

	 sub I_found_X1_files_in_X2_directories {
	   my( $files, $dirs ) = @_[0,1];
	   $files = sprintf("%g %s", $files,
	     $files == 1 ? 'fichier' : 'fichiers');
	   $dirs = sprintf("%g %s", $dirs,
	     $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
	   return "J'ai trouv\xE9 $files dans $dirs.";
	 }

       You may sense that a lexicon (to	use a non-committal catch-all term for
       a collection of things you know how to say, regardless of whether
       they're phrases or words) consisting of functions expressed as above
       would make for rather long-winded and repetitive	code --	even if	you
       wisely rewrote this to have quantification (as we call adding a number
       expression to a noun phrase) be a function called like:

	 sub I_found_X1_files_in_X2_directories	{
	   my( $files, $dirs ) = @_[0,1];
	   $files = quant($files, "fichier");
	   $dirs =  quant($dirs,  "r\xE9pertoire");
	   return "J'ai trouv\xE9 $files dans $dirs.";
	 }

       And you may also	sense that you do not want to bother your translators
       with having to write Perl code -- you'd much rather that	they spend
       their very costly time on just translation.  And	this is	to say nothing
       of the near impossibility of finding a commercial translator who	would
       know even simple	Perl.

       In a first-hack implementation of Maketext, each	language-module's
       lexicon looked like this:

	%Lexicon = (
	  "I found %g files in %g directories"
	  => sub {
	     my( $files, $dirs ) = @_[0,1];
	     $files = quant($files, "fichier");
	     $dirs =  quant($dirs,  "r\xE9pertoire");
	     return "J'ai trouv\xE9 $files dans $dirs.";
	   },
	  # ... and so on with other phrase => sub mappings ...
	);

       but I immediately went looking for some more concise way	to basically
       denote the same phrase-function -- a way	that would also	serve to
       concisely denote	most phrase-functions in the lexicon for most
       languages.  After much time and even some actual	thought, I decided on
       this system:

       * Where a value in a %Lexicon hash is a contentful string instead of an
       anonymous sub (or, conceivably, a coderef), it would be interpreted as
       a sort of shorthand expression of what the sub does.  When accessed for
       the first time in a session, it is parsed, turned into Perl code, and
       then eval'd into	an anonymous sub; then that sub	replaces the original
       string in that lexicon.	(That way, the work of parsing and evaling the
       shorthand form for a given phrase is done no more than once per
       session.)

       * Calls to "maketext" (as Maketext's main function is called) happen
       thru a "language	session	handle", notionally very much like an IO
       handle, in that you open	one at the start of the	session, and use it
       for "sending signals" to	an object in order to have it return the text
       you want.

       So, this:

	 $lang->maketext("You have [quant,_1,piece] of new mail.",
			 scalar(@messages) );

       basically means this: look in the lexicon for $lang (which may inherit
       from any	number of other	lexicons), and find the	function that we
       happen to associate with	the string "You	have [quant,_1,piece] of new
       mail" (which is,	and should be, a functioning "shorthand" for this
       function	in the native locale --	English	in this	case).	If you find
       such a function,	call it	with $lang as its first	parameter (as if it
       were a method), and then	a copy of scalar(@messages) as its second, and
       then return that	value.	If that	function was found, but	was in string
       shorthand instead of being a fully specified function, parse it and
       make it into a function before calling it the first time.

       * The shorthand uses code in brackets to	indicate method	calls that
       should be performed.  A full explanation	is not in order	here, but a
       few examples will suffice:

	 "You have [quant,_1,piece] of new mail."

       The above code is shorthand for,	and will be interpreted	as, this:

	 sub {
	   my $handle =	$_[0];
	   my(@params) = @_;
	   return join '',
	     "You have ",
	     $handle->quant($params[1],	'piece'),
	     " of new mail.";
	 }

       where "quant" is	the name of a method you're using to quantify the noun
       "piece" with the number $params[1].

       A string	with no	brackety calls,	like this:

	 "Your search expression was malformed."

       is somewhat of a	degenerate case, and just gets turned into:

	 sub { return "Your search expression was malformed." }
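
       The compile-on-first-use scheme described above can be sketched with
       a toy implementation.  Everything here is hypothetical: the
       "Toy::Handle" class, its English-only "quant", and the "_compile"
       helper, which builds closures directly rather than generating Perl
       source and eval'ing it, and which handles none of the escaping a real
       shorthand would need.

```perl
use strict;
use warnings;

{
    package Toy::Handle;    # hypothetical stand-in for a language handle

    our %Lexicon = (
        "You have [quant,_1,piece] of new mail." =>
            "You have [quant,_1,piece] of new mail.",
        "Your search expression was malformed." =>
            "Your search expression was malformed.",
    );

    sub new { my $class = shift; return bless {}, $class }

    # Naive English-only quantifier: appends "s" unless the number is 1.
    sub quant {
        my ($self, $num, $noun) = @_;
        return "$num $noun" . ($num == 1 ? '' : 's');
    }

    # Turn a shorthand string into a sub taking ($handle, @params).
    sub _compile {
        my ($text) = @_;
        my @parts;
        for my $chunk (split /(\[[^\]]*\])/, $text) {
            next if $chunk eq '';
            if ($chunk =~ /^\[(.*)\]\z/s) {
                # "[method,arg,arg,...]" becomes a method call on the handle
                my ($method, @args) = split /,/, $1;
                push @parts, sub {
                    my ($handle, @params) = @_;
                    # "_1", "_2", ... stand for the caller's parameters
                    my @real = map { /^_(\d+)$/ ? $params[$1 - 1] : $_ } @args;
                    return $handle->$method(@real);
                };
            }
            else {
                my $literal = $chunk;    # plain text is returned as-is
                push @parts, sub { return $literal };
            }
        }
        return sub { my @in = @_; return join '', map { $_->(@in) } @parts };
    }

    sub maketext {
        my ($self, $key, @params) = @_;
        my $entry = $Lexicon{$key};
        die "No lexicon entry for '$key'" unless defined $entry;
        # First use: compile the string and store the sub in its place.
        $entry = $Lexicon{$key} = _compile($entry) unless ref $entry;
        return $entry->($self, @params);
    }
}

my $handle = Toy::Handle->new;
print $handle->maketext("You have [quant,_1,piece] of new mail.", 3), "\n";
print $handle->maketext("Your search expression was malformed."), "\n";
```

       After the first call, the lexicon slot holds a code reference, so
       later calls for the same phrase skip the parsing step.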

       However,	not everything you can write in	Perl code can be written in
       the above shorthand system -- not by a long shot.  For example,
       consider	the Italian translator from the	beginning of this article, who
       wanted the Italian for "I didn't	find any files"	as a special case,
       instead of "I found 0 files".  That couldn't be specified (at least not
       easily or simply) in our	shorthand system, and it would have to be
       written out in full, like this:

	 sub {	# pretend the English strings are in Italian
	   my($handle, $files, $dirs) =	@_[0,1,2];
	   return "I didn't find any files" unless $files;
	   return join '',
	     "I	found ",
	     $handle->quant($files, 'file'),
	     " in ",
	     $handle->quant($dirs,  'directory'),
	     ".";
	 }

       Next to a lexicon full of shorthand code, that sort of sticks out like
       a sore thumb -- but this	is a special case, after all; and at least
       it's possible, if not as	concise	as usual.

       As to how you'd implement the Russian example from the beginning	of the
       article,	well, There's More Than	One Way	To Do It, but it could be
       something like this (using English words	for Russian, just so you know
       what's going on):

	 "I [quant,_1,directory,accusative] scanned."

       This shifts the burden of complexity off	to the quant method.  That
       method's	parameters are:	the numeric value it's going to	use to
       quantify	something; the Russian word it's going to quantify; and	the
       parameter "accusative", which you're using to mean that this sentence's
       syntax wants a noun in the accusative case there, although that
       quantification method may have to overrule, for grammatical reasons you
       may recall from the beginning of	this article.

       Now, the	Russian	quant method here is responsible not only for
       implementing the	strange	logic necessary	for figuring out how Russian
       number-phrases impose case and number on	their noun-phrases, but	also
       for inflecting the Russian word for "directory".	 How that inflection
       is to be	carried	out is no small	issue, and among the solutions I've
       seen, some (like	variations on a	simple lookup in a hash	where all
       possible	forms are provided for all necessary words) are
       straightforward but can become cumbersome when you need to inflect more
       than a few dozen	words; and other solutions (like using algorithms to
       model the inflections, storing only root	forms and irregularities) can
       involve more overhead than is justifiable for all but the largest
       lexicons.
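
       The lookup-table approach could be sketched roughly as follows.  This
       is an illustration under stated assumptions, not the article's
       implementation: the class names (Demo::Scan, Demo::Scan::ru), the
       helper form_for, the table %FORMS, and the labelled stand-in forms
       such as "directory(gen.pl)" are all hypothetical, English again stands
       in for Russian, and the sketch handles only whole numbers in a slot
       that wants a nominative or accusative noun.

```perl
use strict;
use warnings;

package Demo::Scan;                 # hypothetical base class
use parent 'Locale::Maketext';

package Demo::Scan::ru;             # hypothetical "Russian" class
use parent -norequire, 'Demo::Scan';

# Lookup table: every form needed for every word is listed by hand.
my %FORMS = (
  directory => {
    'accusative singular' => 'directory(acc.sg)',
    'genitive singular'   => 'directory(gen.sg)',
    'genitive plural'     => 'directory(gen.pl)',
  },
);

# Decide which case and number the numeral imposes on the noun.
sub form_for {
  my ($num, $wanted_case) = @_;
  my $n = abs($num) % 100;
  return 'genitive plural' if $n >= 11 && $n <= 14;   # 11-14 are special
  my $last = $n % 10;
  return "$wanted_case singular" if $last == 1;       # 1, 21, 31, ...
  return 'genitive singular'     if $last >= 2 && $last <= 4;
  return 'genitive plural';                           # 0, 5-9, and the rest
}

# Overrides the default quant: (number, headword, case the sentence wants).
sub quant {
  my ($handle, $num, $word, $case) = @_;
  my $form = $FORMS{$word}{ form_for($num, $case) }
    or die "No form of '$word' on record";
  return $handle->numf($num) . ' ' . $form;
}

our %Lexicon = (
  'scanned' => 'I [quant,_1,directory,accusative] scanned.',
);

package main;

my $lh = Demo::Scan->get_handle('ru')
  or die "No language handle available";

my @lines = map { $lh->maketext('scanned', $_) } (1, 3, 5, 12, 21);
print "$_\n" for @lines;
```

       The requested case survives only when the number ends in 1 (but not
       11); otherwise the numeral overrules it, which is the behavior the
       "accusative" parameter was described as allowing for.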

       Mercifully, this	design decision	becomes	crucial	only in	the hairiest
       of inflected languages, of which	Russian	is by no means the worst case
       scenario, but is	worse than most.  Most languages have simpler
       inflection systems; for example,	in English or Swahili, there are
       generally no more than two possible inflected forms for a given noun
       ("error/errors";	"kosa/makosa"),	and the	rules for producing these
       forms are fairly	simple -- or at	least, simple rules can	be formulated
       that work for most words, and you can then treat	the exceptions as just
       "irregular", at least relative to your ad hoc rules.  A simpler
       inflection system (simpler rules, fewer forms) means that design
       decisions are less crucial to maintaining sanity, whereas the same
       decisions could incur overhead-versus-scalability problems in languages
       like Russian.  It may also be likely that code (possibly	in Perl, as
       with Lingua::EN::Inflect, for English nouns) has	already	been written
       for the language	in question, whether simple or complex.
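
       For a language with a simple system, "a few ad hoc rules plus a table
       of irregulars" might look something like the sketch below.  The class
       names (Demo::Plural, Demo::Plural::en), the helper pluralize, and the
       table %IRREGULAR are hypothetical, and the rules are deliberately
       incomplete; the sketch relies only on the fact that Locale::Maketext's
       default quant asks the handle's numerate method to pick the noun form.

```perl
use strict;
use warnings;

package Demo::Plural;               # hypothetical base class
use parent 'Locale::Maketext';

package Demo::Plural::en;           # hypothetical English class
use parent -norequire, 'Demo::Plural';

# Exceptions to the ad hoc rules below; extend as needed.
my %IRREGULAR = (child => 'children', foot => 'feet', index => 'indices');

# A few simple rules that cover most English nouns.
sub pluralize {
  my ($word) = @_;
  return $IRREGULAR{$word} if exists $IRREGULAR{$word};
  return $word . 'es' if $word =~ /(?:s|x|z|ch|sh)\z/;
  return substr($word, 0, -1) . 'ies' if $word =~ /[^aeiou]y\z/;
  return $word . 's';
}

# Overrides numerate so that only the headword has to be supplied.
sub numerate {
  my ($handle, $num, @forms) = @_;
  return '' unless @forms;
  return $forms[0] if $num == 1;
  return @forms > 1 ? $forms[1] : pluralize($forms[0]);
}

our %Lexicon = (
  'scanned' => 'I scanned [quant,_1,directory].',
  'checked' => 'I checked [quant,_1,index].',
);

package main;

my $lh = Demo::Plural->get_handle('en')
  or die "No language handle available";

my $one     = $lh->maketext('scanned', 1);
my $many    = $lh->maketext('scanned', 12);
my $indices = $lh->maketext('checked', 3);
print "$one\n$many\n$indices\n";
```

       An explicit plural in the lexicon entry still wins over the rules, so
       a translator can always spell out a form the rules get wrong.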

       Moreover, a third possibility may even be simpler than anything
       discussed above: just require that all possible (or at least
       applicable) forms be provided in the call to the given language's quant
       method, as in:

	 "I found [quant,_1,file,files]."

       That way, quant just has to choose which form it needs, without having
       to look up or generate anything.	 While possibly	not optimal for
       Russian,	this should work well for most other languages,	where
       quantification is not as	complicated an operation.
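
       A minimal sketch of that approach with Locale::Maketext's default
       quant follows.  The class names (Demo::Forms, Demo::Forms::en) and the
       key are hypothetical; the optional third form, used when the number is
       zero, comes from the Locale::Maketext documentation rather than from
       this article.

```perl
use strict;
use warnings;

package Demo::Forms;                # hypothetical base class
use parent 'Locale::Maketext';

package Demo::Forms::en;            # hypothetical English class
use parent -norequire, 'Demo::Forms';

our %Lexicon = (
  # singular, plural, and an optional phrase that replaces "0 files"
  'found' => 'I found [quant,_1,file,files,no files].',
);

package main;

my $lh = Demo::Forms->get_handle('en')
  or die "No language handle available";

my @found = map { $lh->maketext('found', $_) } (0, 1, 12);
print "$_\n" for @found;
```

       Nothing is looked up or generated here: quant only selects among the
       forms the lexicon entry hands it.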

   The Devil in	the Details
       There's plenty more to Maketext than described above -- for example,
       there's the details of how language tags	("en-US", "i-pwn", "fi", etc.)
       or locale IDs ("en_US") interact	with actual module naming
       ("BogoQuery/Locale/"), and what magic can ensue;	there's	the
       details of how to record	(and possibly negotiate) what character
       encoding	Maketext will return text in (UTF8? Latin-1? KOI8?).  There's
       the interesting fact that Maketext is for localization, but nowhere
       actually has a "use locale;" anywhere in it.  For the curious,
       there's the somewhat frightening	details	of how I actually implement
       something like data inheritance so that searches	across modules'
       %Lexicon	hashes can parallel how	Perl implements	method inheritance.
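
       The visible effect of that lexicon inheritance can be sketched as
       follows, assuming hypothetical class names (Demo::Inherit and its
       language classes en and en_us) and hypothetical keys.  The sketch shows
       only the behavior from the caller's side, not how Maketext implements
       it internally.

```perl
use strict;
use warnings;

package Demo::Inherit;              # hypothetical base class
use parent 'Locale::Maketext';

package Demo::Inherit::en;          # general English lexicon
use parent -norequire, 'Demo::Inherit';
our %Lexicon = (
  'greeting' => 'Hello!',
  'colour'   => 'Pick a colour.',
);

package Demo::Inherit::en_us;       # US English: overrides one entry only
use parent -norequire, 'Demo::Inherit::en';
our %Lexicon = (
  'colour' => 'Pick a color.',
);

package main;

# The language tag "en-US" is mapped to the class name suffix "en_us".
my $lh = Demo::Inherit->get_handle('en-US')
  or die "No language handle available";

my $class    = ref $lh;
my $greeting = $lh->maketext('greeting');   # not in en_us, found in en
my $colour   = $lh->maketext('colour');     # found in en_us first
print "$class\n$greeting\n$colour\n";
```

       A regional class therefore needs to list only the entries that differ
       from its parent language, much as a subclass defines only the methods
       it wants to override.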

       And, most importantly, there's all the practical	details	of how to
       actually	go about deriving from Maketext	so you can use it for your
       interfaces, and the various tools and conventions for starting out and
       maintaining individual language modules.

       That is all covered in the documentation	for Locale::Maketext and the
       modules that come with it, available in CPAN.  After having read	this
       article,	which covers the why's of Maketext, the	documentation, which
       covers the how's	of it, should be quite straightforward.

   The Proof in	the Pudding: Localizing	Web Sites
       Maketext	and gettext have a notable difference: gettext is in C,
       accessible thru C library calls,	whereas	Maketext is in Perl, and
       really can't work without a Perl	interpreter (although I	suppose
       something like it could be written for C).  Accidents of	history	(and
       not necessarily lucky ones) have	made C++ the most common language for
       the implementation of applications like word processors,	Web browsers,
       and even	many in-house applications like	custom query systems.  Current
       conditions make it somewhat unlikely that the next one of any of	these
       kinds of	applications will be written in	Perl, albeit clearly more for
       reasons of custom and inertia than out of consideration of what is the
       right tool for the job.

       However,	other accidents	of history have	made Perl a well-accepted
       language	for design of server-side programs (generally in CGI form) for
       Web site	interfaces.  Localization of static pages in Web sites is
       trivial,	feasible either	with simple language-negotiation features in
       servers like Apache, or with some kind of server-side inclusions	of
       language-appropriate text into layout templates.	 However, I think that
       the localization	of Perl-based search systems (or other kinds of
       dynamic content)	in Web sites, be they public or	access-restricted, is
       where Maketext will see the greatest use.

       I presume that it would be only the exceptional Web site	that gets
       localized for English and Chinese and Italian and Arabic	and Russian,
       to recall the languages from the	beginning of this article -- to	say
       nothing of German, Spanish, French, Japanese, Finnish, and Hindi, to
       name a few languages that benefit from large numbers of programmers or
       Web viewers or both.

       However,	the ever-increasing internationalization of the	Web (whether
       measured	in terms of amount of content, of numbers of content writers
       or programmers, or of size of content audiences)	makes it increasingly
       likely that the interface to the	average	Web-based dynamic content
       service will be localized for two or maybe three	languages.  It is my
       hope that Maketext will make that task as simple	as possible, and will
       remove previous barriers to localization for languages dissimilar to
       English.


       Sean M. Burke has a Master's in linguistics from Northwestern
       University; he specializes in language technology.  Jordan Lachler is
       a PhD student in the Department of Linguistics at the University of
       New Mexico; he specializes in morphology and pedagogy of North
       American native languages.

   References
       Alvestrand, Harald Tveit.  1995.  RFC 1766: Tags for the Identification
       of Languages.  [Now see RFC

       Callon, Ross, editor.  1996.  RFC 1925: The Twelve Networking Truths.

       Drepper, Ulrich, Peter Miller, and Francois Pinard.  1995-2001.  GNU
       "gettext".  Extensive docs are included in the distribution tarball.
       [Since I wrote this article in 1998, the gettext docs have begun
       trying harder to come to terms with plurality.  Whether useful
       conclusions have come from it is another question altogether. -- SMB,
       May 2001]

       Forbes, Nevill.	1964.  Russian Grammar.	 Third Edition,	revised	by J.
       C. Dumbreck.  Oxford University Press.

perl v5.35.5			  2021-09-26	    Locale::Maketext::TPJ13(3)
