Unicode::Japanese(3)  User Contributed Perl Documentation Unicode::Japanese(3)

NAME
       Unicode::Japanese - Convert encoding of Japanese text

SYNOPSIS
	use Unicode::Japanese;
	use Unicode::Japanese qw(unijp);

	# convert utf8 -> sjis

	print Unicode::Japanese->new($str)->sjis;
	print unijp($str)->sjis; # same	as above.

	# convert sjis -> utf8

	print Unicode::Japanese->new($str,'sjis')->get;

	# convert sjis (imode_EMOJI) ->	utf8

	print Unicode::Japanese->new($str,'sjis-imode')->get;

	# convert zenkaku (utf8) -> hankaku (utf8)

	print Unicode::Japanese->new($str)->z2h->get;

DESCRIPTION
       The Unicode::Japanese module converts Japanese text from one character
       encoding to another.

   FEATURES
       o An instance of	Unicode::Japanese internally holds a string in UTF-8.

       o This module is implemented in two ways: XS and pure perl. If
         efficiency is important to you, you should build and install the XS
         module. If you don't want to, or if you can't build the XS module,
         you may use the pure perl module instead. In that case, all you have
         to do is copy Japanese.pm somewhere into @INC.

       o This module can convert characters from zenkaku (full-width) form to
         hankaku (half-width) form, and vice versa. Conversion between
         hiragana and katakana (the two Japanese phonetic syllabaries) is
         also supported.

       o This module has mapping tables for emoji (graphic characters) defined
         by various Japanese mobile phone services: DoCoMo i-mode, ASTEL dot-i
         and J-PHONE J-Sky. Those characters are mapped onto the Unicode
         Private Use Area, so the Unicode strings this module outputs are
         still valid even if they contain emoji, and you can safely pass them
         to other software that can handle Unicode.

       o This module can map some emoji from one set to another. Different
         mobile phones define different sets of emoji, so mapping between
         sets is not always possible. But since some emoji exist in two or
         more sets with similar appearance, this module considers those emoji
         to be the same.

       o This module uses the mapping table for	MS-CP932 instead of the
	 standard Shift_JIS. The Shift_JIS encoding used by MS-Windows
	 (MS-SJIS/MS-CP932) slightly differs from the standard.

       o When the module converts strings from Unicode to Shift_JIS, EUC-JP or
         ISO-2022-JP, Unicode characters that can't be represented in those
         encodings are encoded in "&#dddd;" form (decimal character
         reference). Note, however, that characters in the Unicode Private
         Use Area are replaced with '?' ('QUESTION MARK'; U+003F) instead of
         being encoded. In addition, when encoding to character sets for
         mobile phones, every unrepresentable character is replaced with '?'.

       o On perl-5.8.0 or later, this module handles the UTF-8 flag: the
	 method	utf8() returns UTF-8 byte string, and the method getu()
	 returns UTF-8 character string.

	 Currently the method get() returns UTF-8 byte string but this
	 behavior may be changed in the	future.

         Methods such as sjis(), jis() and utf8() return byte strings. The
         new(), set() and getcode() methods simply ignore the UTF-8 flag of
         the strings they take.
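
         The fallback rule for unrepresentable characters described in the
         feature list above can be approximated with the core Encode module.
         This is only a sketch of the idea, not the module's own code: the
         function name to_sjis_with_fallback is hypothetical, and it relies on
         Encode's 'cp932' table rather than the module's internal mapping
         table.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of Unicode::Japanese).
# Encode a character string to CP932 bytes. The callback is invoked only
# for characters that Encode's cp932 table cannot represent: private-use
# code points become "?", everything else becomes "&#dddd;".
sub to_sjis_with_fallback {
    my ($text) = @_;
    return encode('cp932', $text, sub {
        my ($ord) = @_;
        my $private = ($ord >= 0xE000  && $ord <= 0xF8FF)
                   || ($ord >= 0xF0000 && $ord <= 0x10FFFD);
        return $private ? '?' : sprintf('&#%d;', $ord);
    });
}
```

         With this sketch, a Hangul syllable such as U+AC00 comes out as
         "&#44032;", while a code point in the emoji range U+0FF800 comes out
         as "?".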

REQUIREMENT
       o   perl	5.10.x,	5.8.x, etc. (5.004 and later)

       o   (optional) C compiler.  This module supports both XS and pure perl.
           If you have no C compiler, Unicode::Japanese will be installed as a
           pure perl module.

       o   (optional) Test.pm and Test::More for testing.

       No other	modules	are required at	run time.

METHODS
       $s = Unicode::Japanese->new($str	[, $icode [, $encode]])
	   Create a new	instance of Unicode::Japanese.

           Any given parameters will be internally passed to the method
           set().

       $s = unijp($str [, $icode [, $encode]])
           Same as Unicode::Japanese->new(...).

       $s->set($str [, $icode [, $encode]])
	   $str: string
	   $icode: optional character encoding (default: 'utf8')
	   $encode: optional binary encoding (default: no binary encodings are
	   assumed)

	   Store a string into the instance.

	   Possible character encodings	are:

	    auto
	    utf8 ucs2 ucs4
	    utf16-be utf16-le utf16
	    utf32-be utf32-le utf32
	    sjis cp932 euc euc-jp jis
	    sjis-imode sjis-imode1 sjis-imode2
	    utf8-imode utf8-imode1 utf8-imode2
	    sjis-doti sjis-doti1
	    sjis-jsky sjis-jsky1 sjis-jsky2
	    jis-jsky  jis-jsky1	 jis-jsky2
	    utf8-jsky utf8-jsky1 utf8-jsky2
	    sjis-au sjis-au1 sjis-au2
	    jis-au  jis-au1  jis-au2
	    sjis-icon-au sjis-icon-au1 sjis-icon-au2
	    euc-icon-au	 euc-icon-au1  euc-icon-au2
	    jis-icon-au	 jis-icon-au1  jis-icon-au2
	    utf8-icon-au utf8-icon-au1 utf8-icon-au2
	    ascii binary

	   (see	also "SUPPORTED	ENCODINGS".)

           If you want Unicode::Japanese to detect the character encoding of
           the string, you must explicitly specify 'auto' as the second
           argument. In that case, the given string will be passed to the
           method getcode() to guess the encoding.

	   For binary encodings, only 'base64' is currently supported. If you
	   specify 'base64' as the third argument, the given string will be
	   decoded using Base64	decoder.

	   Specify 'binary' as the second argument if you want your string to
	   be stored without modification.

           When you specify 'sjis-imode' or 'sjis-doti' as the character
           encoding, any occurrences of '&#dddd;' (decimal character
           reference) in the string are interpreted and decoded as emoji code
           points, just like emoji embedded in the string in binary form.

           Since the encoded forms of strings in various encodings are not
           clearly distinguishable from each other, it is not always possible
           to detect which encoding is used for a given string.

	   When	a given	string is possibly interpreted as both Shift_JIS and
	   UTF-8 string, this module considers such a string to	be encoded in
	   Shift_JIS. And if the encoding is not distinguishable between
	   'sjis-au' and 'sjis-doti', this module considers it 'sjis-au'.
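
           The order implied by the 'base64' binary encoding above (undo the
           binary encoding first, then the character encoding) can be
           illustrated with core modules only. This is a sketch under stated
           assumptions, not the module's implementation: base64_sjis_to_utf8
           is a hypothetical name, and Encode's 'cp932' table stands in for
           the module's own Shift_JIS mapping.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);
use Encode qw(decode encode);

# Hypothetical helper (not part of Unicode::Japanese).
# Takes Base64 text wrapping Shift_JIS bytes and returns UTF-8 bytes,
# mirroring what set($str, 'sjis', 'base64') followed by utf8() describes.
sub base64_sjis_to_utf8 {
    my ($b64) = @_;
    my $sjis_bytes = decode_base64($b64);           # binary encoding first
    my $chars      = decode('cp932', $sjis_bytes);  # then character encoding
    return encode('UTF-8', $chars);                 # UTF-8 byte string
}
```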

       $str = $s->get
	   $str: string	(UTF-8)

	   Get the internal string in UTF-8.

	   This	method currently returns a byte	string (whose UTF-8 flag is
	   turned off),	but this behavior may be changed in the	future.

	   If you absolutely want a byte string, you should use	the method
	   utf8() instead. And if you want a character string (whose UTF-8
	   flag	is turned on), you have	to use the method getu().

       $str = $s->getu
	   $str: string	(UTF-8)

	   Get the internal string in UTF-8.

	   On perl-5.8.0 or later, this	method returns a character string with
	   its UTF-8 flag turned on.

       $code = $s->getcode($str)
	   $str: string
	   $code: name of character encoding

	   Detect the character	encoding of given string.

           Note that this method, exceptionally, doesn't deal with the
           internal string of an instance.

	   To guess the	encoding, the following	algorithm is used:

	   (For	pure perl implementation)

           1.  If the string has a UTF-32 BOM, its encoding is 'utf32'.

           2.  If it has a UTF-16 BOM, its encoding is 'utf16'.

	   3.  If it is	valid for UTF-32BE, its	encoding is 'utf32-be'.

	   4.  If it is	valid for UTF-32LE, its	encoding is 'utf32-le'.

           5.  If it contains no ESC characters or bytes whose eighth bit is
               on, its encoding is 'ascii'. Every ASCII control character
               (0x00-0x1F and 0x7F) except ESC (0x1B) is considered to be in
               the range of 'ascii'.

	   6.  If it contains escape sequences of ISO-2022-JP, its encoding is
	       'jis'.

	   7.  If it contains any emoji	defined	for J-PHONE, its encoding is
	       'sjis-jsky'.

	   8.  If it is	valid for EUC-JP, its encoding is 'euc'.

	   9.  If it is	valid for Shift_JIS, its encoding is 'sjis'.

	   10. If it contains any emoji	defined	for au,	and everything else is
	       valid for Shift_JIS, its	encoding is 'sjis-au'.

	   11. If it contains any emoji	defined	for i-mode, and	everything
	       else is valid for Shift_JIS, its	encoding is 'sjis-imode'.

	   12. If it contains any emoji	defined	for dot-i, and everything else
	       is valid	for Shift_JIS, its encoding is 'sjis-doti'.

	   13. If it is	valid for UTF-8, its encoding is 'utf8'.

	   14. If no conditions	above are fulfilled, its encoding is
	       'unknown'.

	   (For	XS implementation)

           1.  If the string has a UTF-32 BOM, its encoding is 'utf32'.

           2.  If it has a UTF-16 BOM, its encoding is 'utf16'.

	   3.  Find all	possible encodings that	might have been	applied	to the
	       string from the following:

               ascii / euc / sjis / jis / utf8 / utf32-be / utf32-le /
               sjis-jsky / sjis-imode / sjis-au / sjis-doti

	   4.  If any encodings	have been found	possible, this module picks
	       out one encoding	having the highest priority among them.	The
	       priority	order is as follows:

	       utf32-be	/ utf32-le / ascii / jis / euc / sjis /	sjis-jsky /
	       sjis-imode / sjis-au / sjis-doti	/ utf8

	   5.  If no conditions	above are fulfilled, its encoding is
	       'unknown'.

	   Pay attention to the	following pitfalls in the above	algorithm:

	   o UTF-8 strings might be accidentally considered to be encoded in
	     Shift_JIS.

           o UCS-2 strings (sequences of raw UCS-2 characters in big-endian;
             each character is always 2 bytes long) can't be detected because
             they look like nothing but sequences of random bytes whose
             length is an even number.

           o UTF-16 strings must have a BOM to be detected.

           o Emoji are only recognized if they are embedded in the string in
             binary form. If they are written in '&#dddd;' form, they aren't
             considered to be emoji.

           Since the XS and pure perl implementations use different algorithms
           to guess the encoding, they may guess differently for the same
           string. In particular, the pure perl implementation detects
           Shift_JIS strings containing an ESC character (0x1B) as Shift_JIS,
           but the XS implementation doesn't, because such strings can hardly
           be distinguished from 'sjis-jsky'. The XS implementation also
           rejects EUC-JP strings containing an ESC character for the same
           reason.
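
           The order of the first checks matters: a UTF-32LE BOM begins with
           the same two bytes as a UTF-16LE BOM, so UTF-32 has to be tested
           first. The following is a deliberately incomplete sketch of the
           pure perl ordering, covering only steps 1, 2, 5, 6, 13 and 14. It
           is not the module's code; guess_encoding_sketch is a hypothetical
           name, and because the EUC-JP, Shift_JIS and emoji steps are
           omitted, it will answer differently from getcode() for such input.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Unicode::Japanese).
# Takes a byte string and returns one of:
# 'utf32', 'utf16', 'ascii', 'jis', 'utf8', 'unknown'.
sub guess_encoding_sketch {
    my ($bytes) = @_;

    # Steps 1-2: BOM checks. UTF-32 must come first, because the UTF-32LE
    # BOM (FF FE 00 00) starts with the UTF-16LE BOM (FF FE).
    return 'utf32' if $bytes =~ /\A(?:\x00\x00\xFE\xFF|\xFF\xFE\x00\x00)/;
    return 'utf16' if $bytes =~ /\A(?:\xFE\xFF|\xFF\xFE)/;

    # Step 5: no ESC and no byte with the eighth bit set.
    return 'ascii' if $bytes !~ /[\x1B\x80-\xFF]/;

    # Step 6: common ISO-2022-JP escape sequences
    # (ESC $ @, ESC $ B, ESC ( B, ESC ( J).
    return 'jis' if $bytes =~ /\e(?:\$[\@B]|\([BJ])/;

    # Step 13: well-formed UTF-8. utf8::decode() returns false on
    # malformed input; it works on a copy so the argument is untouched.
    my $copy = $bytes;
    return 'utf8' if utf8::decode($copy);

    # Step 14: nothing matched.
    return 'unknown';
}
```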

       $code = $s->getcodelist($str)
	   $str: string
           $code: names of possible character encodings

	   Detect the character	encoding of given string.

	   Unlike the method getcode(),	getcodelist() returns a	list of
	   possible encodings.

       $str = $s->conv($ocode, $encode)
	   $ocode: character encoding (possible	encodings are:)
	      utf8 ucs2	ucs4 utf16
	      sjis cp932 euc euc-jp jis
	      sjis-imode sjis-imode1 sjis-imode2
	      utf8-imode utf8-imode1 utf8-imode2
	      sjis-doti	sjis-doti1
	      sjis-jsky	sjis-jsky1 sjis-jsky2
	      jis-jsky	jis-jsky1  jis-jsky2
	      utf8-jsky	utf8-jsky1 utf8-jsky2
	      sjis-au sjis-au1 sjis-au2
	      jis-au  jis-au1  jis-au2
	      sjis-icon-au sjis-icon-au1 sjis-icon-au2
	      euc-icon-au  euc-icon-au1	 euc-icon-au2
	      jis-icon-au  jis-icon-au1	 jis-icon-au2
	      utf8-icon-au utf8-icon-au1 utf8-icon-au2
	      binary

	     (see also "SUPPORTED ENCODINGS".)

	     Some encodings for	mobile phones have a trailing digit like
	     'sjis-au2'. Those digits represent	the version number of
	     encodings.	Such encodings have a variant with no trailing digits,
	     like 'sjis-au', which is the same as the latest version among its
	     variants.

	   $encode: optional binary encoding
	   $str: string

           Return the instance's internal string, encoded in the given
           character encoding.

           If you want the resulting string to be encoded in Base64, specify
           'base64' as the second argument.

           On perl-5.8.0 or later, the UTF-8 flag of the resulting string is
           turned off even if you specify 'utf8' as the first argument.

       $s->tag2bin
           Interpret decimal character references (&#dddd;) in the instance,
           and replace them with the single characters they represent.
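
           The substitution that tag2bin describes amounts to one regular
           expression. The sketch below works on a plain character string
           rather than on a Unicode::Japanese instance; the function name
           decode_decimal_refs is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Unicode::Japanese).
# Replace every "&#dddd;" with the character whose code point is dddd.
sub decode_decimal_refs {
    my ($text) = @_;
    $text =~ s/&#(\d+);/chr($1)/ge;
    return $text;
}
```

           For example, "A&#12354;B" becomes "A", U+3042 (HIRAGANA LETTER A),
           "B".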

       $s->z2h
	   Replace zenkaku (full-width)	letters	in the instance	with hankaku
	   (half-width)	letters.

       $s->h2z
	   Replace hankaku (half-width)	letters	in the instance	with zenkaku
	   (full-width)	letters.
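
           For the ASCII repertoire, the zenkaku/hankaku conversion is a fixed
           offset between U+FF01-U+FF5E and U+0021-U+007E, plus the
           ideographic space U+3000. The sketch below covers only that subset
           and does not attempt katakana or voiced sound marks; the function
           names are hypothetical, and the module's own tables may cover more
           characters.

```perl
use strict;
use warnings;

# Hypothetical helpers (not part of Unicode::Japanese).
# Both operate on character strings, not on UTF-8 bytes.

# Full-width ASCII variants and ideographic space -> plain ASCII.
sub fullwidth_ascii_to_halfwidth {
    my ($text) = @_;
    $text =~ tr/\x{FF01}-\x{FF5E}\x{3000}/\x{21}-\x{7E}\x{20}/;
    return $text;
}

# Plain ASCII -> full-width variants and ideographic space.
sub halfwidth_ascii_to_fullwidth {
    my ($text) = @_;
    $text =~ tr/\x{21}-\x{7E}\x{20}/\x{FF01}-\x{FF5E}\x{3000}/;
    return $text;
}
```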

       $s->hira2kata
	   Replace any hiragana	in the instance	with katakana.

       $s->kata2hira
	   Replace any katakana	in the instance	with hiragana.
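
           In Unicode, hiragana U+3041-U+3096 and katakana U+30A1-U+30F6 are
           laid out in the same order, 0x60 apart, so the basic conversion is
           a range translation. A minimal sketch on character strings, with
           hypothetical function names; characters outside those ranges are
           left untouched, which may differ from what the module does.

```perl
use strict;
use warnings;

# Hypothetical helpers (not part of Unicode::Japanese).

sub hiragana_to_katakana {
    my ($text) = @_;
    $text =~ tr/\x{3041}-\x{3096}/\x{30A1}-\x{30F6}/;
    return $text;
}

sub katakana_to_hiragana {
    my ($text) = @_;
    $text =~ tr/\x{30A1}-\x{30F6}/\x{3041}-\x{3096}/;
    return $text;
}
```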

       $str = $s->jis
           $str: byte string in ISO-2022-JP

           Return the instance's internal string encoded in ISO-2022-JP.

       $str = $s->euc
           $str: byte string in EUC-JP

           Return the instance's internal string encoded in EUC-JP.

       $str = $s->utf8
	   $str: byte string in	UTF-8

	   Get the internal UTF-8 string of instance.

	   On perl-5.8.0 or later, the UTF-8 flag of resulting string is
	   turned off.

       $str = $s->ucs2
	   $str: byte string in	UCS-2

	   Get the internal string of instance as a sequence of	raw UCS-2
	   letters in big-endian. Note that this is different from UTF-16BE as
	   raw UCS-2 sequence has no concept of	surrogate pair.

       $str = $s->ucs4
	   $str: byte string in	UCS-4

	   Get the internal string of instance as a sequence of	raw UCS-4
	   letters in big-endian. This is practically the same as UTF-32BE.
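
           "Raw, big-endian" here means one fixed-size integer per character.
           With a character string at hand, that is what pack() produces with
           the 'n' (16-bit) and 'N' (32-bit) templates. The function names
           below are hypothetical, and dying on characters above U+FFFF in
           the UCS-2 case is an assumption of this sketch; this page does not
           say how the module treats them.

```perl
use strict;
use warnings;

# Hypothetical helpers (not part of Unicode::Japanese).

# One big-endian 16-bit unit per character; no surrogate pairs.
sub to_ucs2_be {
    my ($text) = @_;
    my @ords = map { ord } split //, $text;
    # Assumption of this sketch: refuse what UCS-2 cannot hold.
    die "character above U+FFFF\n" if grep { $_ > 0xFFFF } @ords;
    return pack('n*', @ords);
}

# One big-endian 32-bit unit per character.
sub to_ucs4_be {
    my ($text) = @_;
    return pack('N*', map { ord } split //, $text);
}
```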

       $str = $s->utf16
	   $str: byte string in	UTF-16

           Return the instance's internal string encoded in big-endian UTF-16
           with no BOM prepended.

       $str = $s->sjis
           $str: byte string in Shift_JIS

           Return the instance's internal string encoded in Shift_JIS
           (MS-SJIS / MS-CP932).

       $str = $s->sjis_imode
           $str: byte string in 'sjis-imode'

           Return the instance's internal string encoded in 'sjis-imode'.

       $str = $s->sjis_imode1
           $str: byte string in 'sjis-imode1'

           Return the instance's internal string encoded in 'sjis-imode1'.

       $str = $s->sjis_imode2
           $str: byte string in 'sjis-imode2'

           Return the instance's internal string encoded in 'sjis-imode2'.

       $str = $s->sjis_doti
           $str: byte string in 'sjis-doti'

           Return the instance's internal string encoded in 'sjis-doti'.

       $str = $s->sjis_jsky
           $str: byte string in 'sjis-jsky'

           Return the instance's internal string encoded in 'sjis-jsky'.

       $str = $s->sjis_jsky1
           $str: byte string in 'sjis-jsky1'

           Return the instance's internal string encoded in 'sjis-jsky1'.

       $str = $s->sjis_jsky2
           $str: byte string in 'sjis-jsky2'

           Return the instance's internal string encoded in 'sjis-jsky2'.

       $str = $s->sjis_icon_au
           $str: byte string in 'sjis-icon-au'

           Return the instance's internal string encoded in 'sjis-icon-au'.

       $str_arrayref = $s->strcut($len)
           $len: maximum length of each chunk (in number of full-width
           characters)
           $str_arrayref: reference to array of strings

           Split the internal string of the instance into chunks of a given
           length.

           On perl-5.8.0 or later, the UTF-8 flag of each chunk is turned on.

       $len = $s->strlen
	   $len: character width of the	internal string

	   Calculate the character width of the	internal string. Half-width
	   characters have width of one	unit, and full-width characters	have
	   width of two	units.
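
           The one-unit/two-unit rule can be sketched as follows. Which
           characters count as half-width is an assumption of this sketch
           (ASCII plus the half-width katakana block U+FF61-U+FF9F); the
           module's own classification may differ. The function name
           display_width is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Unicode::Japanese).
# Sum the width of a character string: 1 unit for characters this sketch
# treats as half-width, 2 units for everything else.
sub display_width {
    my ($text) = @_;
    my $width = 0;
    for my $ch (split //, $text) {
        my $ord = ord $ch;
        my $half = $ord <= 0x7F || ($ord >= 0xFF61 && $ord <= 0xFF9F);
        $width += $half ? 1 : 2;
    }
    return $width;
}
```

           Under these assumptions, "abc" followed by one hiragana letter and
           one half-width katakana letter has a width of 3 + 2 + 1 = 6.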

       $s->join_csv(@values);
	   @values: array of strings

	   Build a line	of CSV from the	arguments, and store it	into the
	   instance. The resulting line	has a trailing line break ("\n").

       @values = $s->split_csv;
	   @values: array of strings

           Parse a line of CSV in the instance and return its columns. The
           line will be chomp()ed before being parsed.

	   If the internal string was decoded from 'binary' encoding (see
	   methods new() and set()), the UTF-8 flags of	the resulting array of
	   strings are turned off. Otherwise the flags are turned on.

SUPPORTED ENCODINGS
	+---------------+----+-----+-------+
	|encoding	| in | out | guess |
	+---------------+----+-----+-------+
	|auto		: OK : --  | ----- |
	+---------------+----+-----+-------+
	|utf8		: OK : OK  | OK	   |
	|ucs2		: OK : OK  | ----- |
	|ucs4		: OK : OK  | ----- |
	|utf16-be	: OK : --  | ----- |
	|utf16-le	: OK : --  | ----- |
	|utf16		: OK : OK  | OK(#) |
	|utf32-be	: OK : --  | OK	   |
	|utf32-le	: OK : --  | OK	   |
	|utf32		: OK : --  | OK(#) |
	+---------------+----+-----+-------+
	|sjis		: OK : OK  | OK	   |
	|cp932		: OK : OK  | ----- |
	|euc		: OK : OK  | OK	   |
	|euc-jp		: OK : OK  | ----- |
	|jis		: OK : OK  | OK	   |
	+---------------+----+-----+-------+
	|sjis-imode	: OK : OK  | OK	   |
	|sjis-imode1	: OK : OK  | ----- |
	|sjis-imode2	: OK : OK  | ----- |
	|utf8-imode	: OK : OK  | ----- |
	|utf8-imode1	: OK : OK  | ----- |
	|utf8-imode2	: OK : OK  | ----- |
	+---------------+----+-----+-------+
	|sjis-doti	: OK : OK  | OK	   |
	|sjis-doti1	: OK : OK  | ----- |
	+---------------+----+-----+-------+
	|sjis-jsky	: OK : OK  | OK	   |
	|sjis-jsky1	: OK : OK  | ----- |
	|sjis-jsky2	: OK : OK  | ----- |
	|jis-jsky	: OK : OK  | ----- |
	|jis-jsky1	: OK : OK  | ----- |
	|jis-jsky2	: OK : OK  | ----- |
	|utf8-jsky	: OK : OK  | ----- |
	|utf8-jsky1	: OK : OK  | ----- |
	|utf8-jsky2	: OK : OK  | ----- |
	+---------------+----+-----+-------+
	|sjis-au	: OK : OK  | OK	   |
	|sjis-au1	: OK : OK  | ----- |
	|sjis-au2	: OK : OK  | ----- |
	|jis-au		: OK : OK  | ----- |
	|jis-au1	: OK : OK  | ----- |
	|jis-au2	: OK : OK  | ----- |
	|sjis-icon-au	: OK : OK  | ----- |
	|sjis-icon-au1	: OK : OK  | ----- |
	|sjis-icon-au2	: OK : OK  | ----- |
	|euc-icon-au	: OK : OK  | ----- |
	|euc-icon-au1	: OK : OK  | ----- |
	|euc-icon-au2	: OK : OK  | ----- |
	|jis-icon-au	: OK : OK  | ----- |
	|jis-icon-au1	: OK : OK  | ----- |
	|jis-icon-au2	: OK : OK  | ----- |
	|utf8-icon-au	: OK : OK  | ----- |
	|utf8-icon-au1	: OK : OK  | ----- |
	|utf8-icon-au2	: OK : OK  | ----- |
	+---------------+----+-----+-------+
	|ascii		: OK : --  | OK	   |
	|binary		: OK : OK  | ----- |
	+---------------+----+-----+-------+
        (#): guessed only when the string has a BOM.

   GUESSING ORDER
	1.  utf32 (#)
	2.  utf16 (#)
	3.  utf32-be
	4.  utf32-le
	5.  ascii
	6.  jis
	7.  sjis-jsky (pp)
	8.  euc
	9.  sjis
	10. sjis-jsky (xs)
	11. sjis-au
	12. sjis-imode
	13. sjis-doti
	14. utf8
	15. unknown

DESCRIPTION OF UNICODE MAPPING
       Transcoding between Unicode encodings and other ones is performed as
       below:

       Shift_JIS
	 This module uses the mapping table of MS-CP932.

	 <ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT>

         When the module converts a Unicode string to Shift_JIS, it
         represents most characters that aren't available in Shift_JIS as
         decimal character references ('&#dddd;'). There is one exception to
         this: every graphic character for mobile phones is replaced with
         '?'.

         For the variants of Shift_JIS defined for mobile phones, every
         unrepresentable character is replaced with '?', unlike in plain
         Shift_JIS.

       EUC-JP/ISO-2022-JP
         This module doesn't convert Unicode strings directly from/to EUC-JP
         or ISO-2022-JP: it first converts from/to Shift_JIS and then does
         the rest of the translation. So characters that aren't available in
         Shift_JIS can't be translated properly.

       DoCoMo i-mode
	 This module maps emoji	in the range of	F800 - F9FF to U+0FF800	-
	 U+0FF9FF.

       ASTEL dot-i
	 This module maps emoji	in the range of	F000 - F4FF to U+0FF000	-
	 U+0FF4FF.

       J-PHONE J-SKY
         The encoding method defined by J-SKY is as follows: first comes the
         escape sequence "\e\$" to mark the beginning of emoji, then the
         first byte of an emoji, then the second bytes of one or more emoji,
         and finally "\x0f" to mark the end of emoji. If a string contains a
         series of emoji whose first bytes are identical, the sequence can be
         compressed by appending their second bytes to a single first byte.

         This module considers a pair of those first and second bytes to be
         one character, and maps them from 4500 - 47FF to U+0FFB00 -
         U+0FFDFF.

	 When the module encodes J-SKY emoji, it performs the compression
	 automatically.

       AU
	 This module maps AU emoji to U+0FF500 - U+0FF6FF.
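
       The i-mode, dot-i and J-SKY ranges above are fixed-offset mappings, and
       the J-SKY compression is a run-length grouping on the first byte. The
       sketch below illustrates both, using only the ranges and byte sequences
       stated in this section. It is not the module's code: all function names
       are hypothetical, and emoji are passed in as 16-bit integers (first
       byte, then second byte) purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers (not part of Unicode::Japanese).

# i-mode F800-F9FF -> U+0FF800-U+0FF9FF and
# dot-i  F000-F4FF -> U+0FF000-U+0FF4FF share the same offset.
sub imode_or_doti_to_unicode {
    my ($code) = @_;
    return 0xF0000 + $code;
}

# J-SKY 4500-47FF -> U+0FFB00-U+0FFDFF.
sub jsky_to_unicode {
    my ($code) = @_;
    return $code - 0x4500 + 0xFFB00;
}

# Encode a list of J-SKY emoji codes, merging consecutive emoji that
# share a first byte into one "\e\$" ... "\x0f" sequence.
sub jsky_encode {
    my @codes = @_;
    my $out = '';
    my $current_first;
    for my $code (@codes) {
        my ($first, $second) = ($code >> 8, $code & 0xFF);
        if (!defined $current_first || $first != $current_first) {
            $out .= "\x0f" if defined $current_first;   # close previous run
            $out .= "\e\$" . chr($first);               # open a new run
            $current_first = $first;
        }
        $out .= chr($second);
    }
    $out .= "\x0f" if defined $current_first;           # close the last run
    return $out;
}
```

       For example, the codes 0x4721, 0x4722 and 0x4521 come out as two
       sequences: one carrying first byte 0x47 with two second bytes, then
       one carrying first byte 0x45 with a single second byte.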

PurePerl mode
	  use Unicode::Japanese	qw(PurePerl);

       If you want to explicitly take the pure perl implementation, pass
       'PurePerl' to the argument of the "use" statement.

BUGS
       Please report bugs and requests to "bug-unicode-japanese at
       rt.cpan.org" or
       <http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Unicode-Japanese>. If
       you report them through the web interface, any progress on your report
       will be sent back to you automatically.

       o This module doesn't convert Unicode strings directly from/to EUC-JP
         or ISO-2022-JP: it first converts from/to Shift_JIS and then does
         the rest of the translation. So characters that aren't available in
         Shift_JIS can't be translated properly.

       o The XS	implementation of getcode() fails to detect the	encoding when
	 the given string contains \e while its	encoding is EUC-JP or
	 Shift_JIS.

       o Japanese.pm is composed of a textual perl script and a binary
         character conversion table. If you transfer it over FTP in ASCII
         mode, the file will be corrupted.

SUPPORT
       You can find documentation for this module with the perldoc command.

	   perldoc Unicode::Japanese

       You can find more information at:

       o   AnnoCPAN: Annotated CPAN documentation

	   <http://annocpan.org/dist/Unicode-Japanese>

       o   CPAN	Ratings

	   <http://cpanratings.perl.org/d/Unicode-Japanese>

       o   RT: CPAN's request tracker

	   <http://rt.cpan.org/NoAuth/Bugs.html?Dist=Unicode-Japanese>

       o   Search CPAN

	   <http://search.cpan.org/dist/Unicode-Japanese>

CREDITS
       Thanks very much	to:

       NAKAYAMA	Nao

       SUGIURA Tatsuki & Debian	JP Project

COPYRIGHT & LICENSE
       Copyright 2001-2008 SANO	Taku (SAWATARI Mikage) and YAMASHINA Hio, all
       rights reserved.

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.32.1			  2012-02-26		  Unicode::Japanese(3)
