Encoding::FixLatin(3)  User Contributed Perl Documentation  Encoding::FixLatin(3)

       Encoding::FixLatin - takes mixed	encoding input and produces UTF-8

SYNOPSIS

	   use Encoding::FixLatin qw(fix_latin);

	   my $utf8_string = fix_latin($mixed_encoding_string);

       Most encoding conversion	tools take input in one	encoding and produce
       output in another encoding.  This module	takes input which may contain
       characters in more than one encoding and	makes a	best effort to convert
       them all	to UTF-8 output.

       Nothing is exported by default.	The only public	function is
       "fix_latin" which will be exported on request (as per SYNOPSIS).

   fix_latin( string, options ... )
       Decodes the supplied 'string' and returns a UTF-8 version of the
       string.	The following rules are	used:

       o   ASCII characters (single bytes in the range 0x00 - 0x7F) are	passed
	   through unchanged.

       o   Well-formed UTF-8 multi-byte characters are also passed through
           unchanged.

       o   UTF-8 multi-byte characters that are over-long but otherwise
           well-formed are converted to the shortest UTF-8 normal form.

       o   Bytes in the	range 0xA0 - 0xFF are assumed to be Latin-1 characters
	   (ISO8859-1 encoded) and are converted to UTF-8.

       o   Bytes in the range 0x80 - 0x9F are assumed to be Win-Latin-1
           characters (CP1252 encoded) and are converted to UTF-8, except
           for the five bytes in this range which are not defined in CP1252
           (see the "ascii_hex" option below).
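
       The two legacy-encoding rules above can be illustrated with the core
       Encode module.  This is a minimal sketch of the byte ranges only, not
       this module's implementation; the helper name legacy_byte_to_utf8 is
       hypothetical, and the five bytes that CP1252 leaves undefined are not
       given any special treatment here.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert ONE legacy byte to UTF-8 bytes, choosing the
# source charset from the byte ranges listed in the rules above.
sub legacy_byte_to_utf8 {
    my ($byte) = @_;
    my $ord = ord($byte);
    return $byte if $ord <= 0x7F;                 # ASCII passes through
    my $charset = $ord >= 0xA0 ? 'iso-8859-1'     # 0xA0 - 0xFF: Latin-1
                               : 'cp1252';        # 0x80 - 0x9F: CP1252
    return encode('UTF-8', decode($charset, $byte));
}

# Print each input byte and the UTF-8 bytes it becomes, in hex.
for my $byte ("A", "\xE9", "\x80") {
    printf "%02X -> %s\n", ord($byte),
        uc unpack('H*', legacy_byte_to_utf8($byte));
}
```

       Here 0xE9 (Latin-1 e-acute) becomes the two bytes C3 A9, and 0x80
       (the CP1252 euro sign) becomes the three bytes E2 82 AC.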

       The Achilles heel of these rules is that it's possible for certain
       combinations of two consecutive Latin-1 characters to be misinterpreted
       as a single UTF-8 character, i.e. there is some risk of data
       corruption.  See the 'LIMITATIONS' section below to quantify this risk
       for the type of data you're working with.

       If you pass in a	string that is already a UTF-8 character string	(the
       utf8 flag is set	on the Perl scalar) then the string will simply	be
       returned	unchanged.  However if the 'bytes_only'	option is specified
       (see below), the	returned string	will be	a byte string rather than a
       character string.  The rules described above will not be	applied	in
       either case.

       The "fix_latin" function	accepts	options	as name	=> value pairs.
       Recognised options are:

       bytes_only => 1/0
	   The value returned by fix_latin is normally a Perl character	string
	   and will have the utf8 flag set if it contains non-ASCII
	   characters.	If you set the "bytes_only" option to a	true value,
	   the returned	string will be a binary	string of UTF-8	bytes.	The
	   utf8	flag will not be set.  This is useful if you're	going to
	   immediately use the string in an IO operation and wish to avoid the
	   overhead of converting to and from Perl's internal representation.
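
           The difference between the two kinds of return value can be seen
           with core Perl alone.  The sketch below does not call fix_latin;
           it uses Encode to build a character string (comparable to the
           default return value) and a UTF-8 byte string (comparable to what
           "bytes_only" returns) from the same text.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $utf8_bytes = "caf\xC3\xA9";                  # five bytes of UTF-8
my $chars      = decode('UTF-8', $utf8_bytes);   # character string
my $bytes      = encode('UTF-8', $chars);        # byte string again

# A character string counts characters; a byte string counts bytes.
printf "chars: length %d, utf8 flag %s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
printf "bytes: length %d, utf8 flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
```

           The character string has length 4 with the utf8 flag on; the
           byte string has length 5 with the flag off.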

       ascii_hex => 1/0
           Bytes in the range 0x80-0x9F are assumed to be CP1252; however,
           CP1252 does not define a mapping for 5 of these bytes (0x81, 0x8D,
           0x8F, 0x90 and 0x9D).  Use this option to specify how they should
           be handled:

	   o   If the ascii_hex	option is set to true (the default), these
	       bytes will be converted to 3 character ASCII hex	strings	of the
	       form %XX.  For example the byte 0x81 will become	%81.

	   o   If the ascii_hex	option is set to false,	these bytes will be
	       treated as Latin-1 control characters and converted to the
	       equivalent UTF-8	multi-byte sequences.

	   When	processing text	strings	you will almost	certainly never
	   encounter these bytes at all.  The most likely reason you would see
	   them	is if a	malicious attacker was feeding random bytes to your
	   application.	 It is difficult to conceive of	a scenario in which it
	   makes sense to change this option from its default setting.
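
           The %XX form described above is easy to reproduce in plain Perl.
           The following is a sketch of the output format only, assuming a
           byte string as input; escape_undefined_cp1252 is a hypothetical
           helper and not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each of the five bytes that CP1252 leaves
# undefined with a three-character ASCII string of the form %XX.
sub escape_undefined_cp1252 {
    my ($bytes) = @_;
    $bytes =~ s/([\x81\x8D\x8F\x90\x9D])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

print escape_undefined_cp1252("a\x81b\x9Dc"), "\n";
```

           For the input above this prints a%81b%9Dc.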

       overlong_fatal => 1/0
	   An over-long	UTF-8 byte sequence is one which uses more than	the
	   minimum number of bytes required to represent the character.	 Use
	   this	option to specify how overlong sequences should	be handled.

	   o   If the overlong_fatal option is set to false (the default)
	       over-long sequences will	be converted to	the shortest normal
	       UTF-8 sequence.	For example the	input byte string
	       "\xC0\xBCscript>" would be converted to "<script>".

           o   If the overlong_fatal option is set to true, this module will
               die with an error when an overlong sequence is encountered.
               You would probably want to use eval to trap and handle this
               error.

           There is a strong argument that overlong sequences are only ever
           encountered in malicious input and therefore they should always be
           rejected.
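
           To see why "\xC0\xBC" stands for "<", it helps to decode the two
           bytes by hand.  The sketch below is an illustration rather than
           this module's code: decode_two_byte and strict_decode_two_byte are
           hypothetical helpers that handle two-byte sequences only, and the
           strict variant dies in the way the overlong_fatal option is
           described as doing, so that eval can trap the error.

```perl
use strict;
use warnings;

# Hypothetical helper: decode a two-byte UTF-8 sequence to its code point.
# Returns the code point and a flag that is true for over-long forms.
sub decode_two_byte {
    my ($seq) = @_;
    die "expected exactly two bytes\n" unless length($seq) == 2;
    my ($lead, $cont) = unpack('C2', $seq);
    die "not a two-byte UTF-8 sequence\n"
        unless $lead >= 0xC0 && $lead <= 0xDF
            && $cont >= 0x80 && $cont <= 0xBF;
    # Five payload bits from the lead byte, six from the continuation byte.
    my $cp = (($lead & 0x1F) << 6) | ($cont & 0x3F);
    # Code points below 0x80 fit in one byte, so two bytes is over-long.
    return ($cp, $cp < 0x80);
}

# Hypothetical strict variant: die instead of accepting over-long forms.
sub strict_decode_two_byte {
    my ($cp, $overlong) = decode_two_byte(@_);
    die sprintf("over-long UTF-8 sequence for U+%04X\n", $cp) if $overlong;
    return $cp;
}

my ($cp, $overlong) = decode_two_byte("\xC0\xBC");
printf "U+%04X (%s)%s\n", $cp, chr($cp), $overlong ? ' over-long' : '';

# Trap the fatal error with eval, as suggested for overlong_fatal.
my $ok = eval { strict_decode_two_byte("\xC0\xBC"); 1 };
print "rejected: $@" unless $ok;
```

           The payload bits of 0xC0 are all zero and those of 0xBC are 0x3C,
           so the sequence decodes to U+003C, which needs only one byte.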

       use_xs => 'auto'	| 'always' | 'never'
	   This	option controls	whether	or not the XS (compiled	C)
	   implementation of "fix_latin" is used.  Note, the
	   Encoding::FixLatin::XS module must be installed separately.	The
	   three possible values for this option are:

	   o   'auto' is the default behaviour - if Encoding::FixLatin::XS is
	       installed, it will be loaded and	used, otherwise	the pure Perl
	       implementation will be used.

	   o   'always'	means the XS module will be used and a fatal exception
	       will be thrown if it is not available.

	   o   'never' means no	attempt	will be	made to	use the	XS module.
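
       The options can be combined in a single call.  The sketch below
       assumes a version of Encoding::FixLatin that recognises all of the
       options documented here; it loads the module at run time so that it
       still runs, and simply reports the fact, when the module is not
       installed.  The wrapper name to_utf8_bytes is hypothetical.

```perl
use strict;
use warnings;

# Load at run time so the sketch degrades gracefully without the module.
my $have_module = eval { require Encoding::FixLatin; 1 };

# Hypothetical wrapper: always return UTF-8 bytes, refuse over-long
# sequences, and stay on the pure Perl implementation.
sub to_utf8_bytes {
    my ($input) = @_;
    return Encoding::FixLatin::fix_latin(
        $input,
        bytes_only     => 1,
        overlong_fatal => 1,
        use_xs         => 'never',
    );
}

if ($have_module) {
    # Mixed input: a Latin-1 e-acute, then an e-acute already in UTF-8.
    my $out = eval { to_utf8_bytes("caf\xE9 caf\xC3\xA9") };
    print defined $out ? "converted\n" : "failed: $@";
}
else {
    print "Encoding::FixLatin is not installed\n";
}
```

       Wrapping the call in eval matters here because overlong_fatal is set,
       so malformed input raises an exception instead of being converted.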

LIMITATIONS
       This module is perfectly safe when handling data containing only ASCII
       and UTF-8 characters.  Introducing ISO8859-1 or CP1252 characters does
       add a risk of data corruption (i.e. some characters in the input being
       converted to incorrect characters in the output).  To quantify the risk
       it is necessary to understand its cause.  First, let's break the input
       bytes into two categories.

       o   ASCII bytes fall into the range 0x00-0x7F - the most	significant
	   bit is always set to	zero.  I'll use	the symbol 'a' to represent
	   these bytes.

       o   Non-ASCII bytes fall	into the range 0x80-0xFF - the most
	   significant bit is always set to one.  I'll use the symbol 'B' to
	   represent these bytes.

       A sequence of ASCII bytes ('aaa') is always unambiguous and will not be
       misinterpreted.

       Lone non-ASCII bytes within sequences of	ASCII bytes ('aaBaBa') are
       also unambiguous	and will not be	misinterpreted.

       The potential for error occurs with two (or more) consecutive non-ASCII
       bytes.  For example the sequence	'BB' might be intended to represent
       two characters in one of	the legacy encodings or	a single character in
       UTF-8.  Because this module gives precedence to the UTF-8 characters it
       is possible that	a random pair of legacy	characters may be
       misinterpreted as a single UTF-8	character.

       The risk	is reduced by the fact that not	all pairs of non-ASCII bytes
       form valid UTF-8	sequences.  Every non-ASCII UTF-8 character is made up
       of two or more 'B' bytes	and no 'a' bytes.  For a two-byte character,
       the first byte must be in the range 0xC0-0xDF and the second must be in
       the range 0x80-0xBF.

       Any pair	of 'BB'	bytes that do not fall into the	required ranges	are
       unambiguous and will not	be misinterpreted.
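
       The two-byte ranges given above translate directly into a check you
       can run over your own data.  This is a sketch that covers only the
       two-byte case described in the text (longer UTF-8 sequences are not
       considered); could_be_two_byte_utf8 is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a pair of bytes falls in the ranges required
# for a two-byte UTF-8 character (first 0xC0-0xDF, second 0x80-0xBF).
sub could_be_two_byte_utf8 {
    my ($pair) = @_;
    return $pair =~ /\A[\xC0-\xDF][\x80-\xBF]\z/ ? 1 : 0;
}

# Three 'BB' pairs, written as Latin-1 bytes.
for my $pair ("\xC3\xA9", "\xE9\xE8", "\xC9\xE9") {
    printf "%s: %s\n", uc unpack('H*', $pair),
        could_be_two_byte_utf8($pair) ? 'ambiguous' : 'unambiguous';
}
```

       Only the first pair is ambiguous: C3 A9 is either the two Latin-1
       characters A-tilde and copyright sign or the single UTF-8 character
       e-acute.  In the other two pairs one of the bytes falls outside the
       required range.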

       Pairs of	'BB' bytes that	are actually individual	Latin-1	characters but
       happen to fall into the required	ranges to be misinterpreted as a UTF-8
       character are rather unlikely to	appear in normal text.	If you look
       those ranges up on a Latin-1 code chart you'll see that the first
       character would need to be an uppercase accented	letter and the second
       would need to be	a non-printable	control	character or a special
       punctuation symbol.

       One way to summarise the role of this module is that it guarantees to
       produce UTF-8 output, possibly at the cost of introducing the odd
       incorrect character.

       Please report any bugs to "bug-encoding-fixlatin	at", or
       through the web interface at
       <>.  I
       will be notified, and then you'll automatically be notified of progress
       on your bug as I	make changes.

       You can also look for information at:

       o   Issue tracker


       o   AnnoCPAN: Annotated CPAN documentation


       o   CPAN	Ratings


       o   Search CPAN


       o   Source code repository


       Copyright 2009-2014 Grant McLean	"<>"

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.32.1			  2014-05-22		 Encoding::FixLatin(3)

