EBook::Tools::Mobipocket(3)  User Contributed Perl Documentation  EBook::Tools::Mobipocket(3)

NAME
       EBook::Tools::Mobipocket	- Palm::PDB handler for	manipulating the
       Mobipocket format.

SYNOPSIS
	use EBook::Tools::Mobipocket qw(:all);
	my $mobi = EBook::Tools::Mobipocket->new();
	$mobi->Load('filename.prc');
	print "Title: ",$mobi->{title},"\n";
	print "Author: ",$mobi->{header}{exth}{author},"\n";
	print "Language: ",$mobi->{header}{mobi}{language},"\n";

	my $mobigen = find_mobigen();
	system_mobigen(infile => 'myfile.opf');

DEPENDENCIES
       o   "Bit::Vector"

       o   "Compress::Zlib"

       o   "HTML::Tree"

       o   "Image::Size"

       o   "List::MoreUtils"

       o   "P5-Palm"

       o   "String::CRC32"

CONSTRUCTOR
   "new()"
       Instantiates a new EBook::Tools::Mobipocket object.

ACCESSOR METHODS
   "drm()"
       Returns 1 if the	"drmoffset" header value is neither 0 nor 0xffffffff.
       Returns undef if	"drmoffset" is undefined. Returns 0 otherwise.
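
       The three-way return value described above can be sketched as a
       standalone function.  This is a minimal illustration of the stated
       rule only; the name drm_state() is hypothetical and is not part of
       this module.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the documented rule for drm():
# undef if the offset is undefined, 0 if it is 0 or 0xffffffff, 1 otherwise.
sub drm_state {
    my ($drmoffset) = @_;
    return unless defined $drmoffset;
    return 0 if $drmoffset == 0 or $drmoffset == 0xffffffff;
    return 1;
}
```

       Calling the function in scalar context keeps the undefined case
       distinguishable from a plain 0.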

   "text()"
       Returns the text of the file.

   "write_images()"
       Writes each image record	to the disk.

       Returns the number of images written.

   "write_text($filename)"
       Writes the book text to disk with the given filename.  This filename
       must match the filename given to	"fix_html()" for the internal links to
       be consistent.

       Croaks if $filename is not specified.

       Returns 1 on success, or	undef if there was no text to write.

   "write_unknown_records()"
       Writes each unidentified	record to disk with a filename in the format
       of 'raw-record-####', where ####	is the record number (not the record
       ID).

       Returns the number of records written.

MODIFIER METHODS
       These methods have two naming/capitalization schemes -- methods
       directly	related	to the subclassing of Palm::PDB	use its	MethodName
       capitalization style.  Any other	methods	are lowercase_with_underscores
       for consistency with the	rest of	EBook::Tools.

   "Load($filename)"
       Sets "$self->{filename}"	and then loads and parses the file specified
       by $filename, calling "ParseRecord(%record)" on every record found.

       If DictionaryHuffman compression	is detected, text records will be left
       untouched during	the ParseRecord	pass, and
       "uncompress_dictionaryhuffman_records()"	will be	called after the
       initial parsing pass is complete.

   "ParseRecord(%record)"
       Parses PDB records, updating the	object attributes.  This method	is
       called automatically on every database record during "Load()".

   "ParseRecord0($data)"
       Parses the header record	and places the parsed values into the hashref
       "$self->{header}{palm}",	the hashref "$self->{header}{mobi}", and
       "$self->{header}{exth}" by calling "parse_palmdoc_header()",
       "parse_mobi_header()", and "parse_mobi_exth()" respectively.

   "ParseRecordCDIC(\$data)"
       Parses a	CDIC record.  Takes as a sole argument a reference to the data
       of the record.

       Record format

       o   Offset 0: Record identifier

	   4 bytes, always 'CDIC'

       o   Offset 4: Header length

	   4 bytes, big-endian long int, always	= 16

       o   Offset 8: Index count

	   4 bytes, big-endian long int, marks the number of big-endian	short
	   ints	immediately following the header used as index points into the
	   dictionary data

       o   Offset 12: Codelength

	   4 bytes, big-endian long int, number	of code	bits

       o   Offset 16: Indexes

	   A number of big-endian short	ints used as index points into the
	   dictionary data

       o   Offset ??: Dictionary data

	   Dictionary result strings immediately following the indexes
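
       The fixed 16-byte header described above could be read with a single
       unpack() call.  The following is a sketch under that layout only; the
       function name peek_cdic_header() and the returned key names are
       hypothetical and do not reflect the internals of ParseRecordCDIC().

```perl
use strict;
use warnings;
use Carp;

# Hypothetical helper: read the four fixed CDIC header fields.
sub peek_cdic_header {
    my ($dataref) = @_;
    croak 'expected a scalar reference' unless ref $dataref eq 'SCALAR';
    croak 'record too short' if length($$dataref) < 16;

    # a4 = 4 raw bytes, N = 32-bit big-endian unsigned integer
    my ($id, $headerlength, $indexcount, $codelength)
        = unpack('a4 N N N', $$dataref);
    croak 'not a CDIC record' unless $id eq 'CDIC';

    return {
        headerlength => $headerlength,
        indexcount   => $indexcount,
        codelength   => $codelength,
    };
}
```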

   "ParseRecordHUFF(\$data)"
       Parses a	HUFF record.  Takes as a sole argument a reference to the data
       of the record.

       Record format

       o   Offset 0: Record identifier

	   4 bytes, always 'HUFF'

       o   Offset 4: Header length

	   4 bytes, big-endian long int, always	= 24

       o   Offset 8: Cache table (big-endian) offset

	   4 bytes, big-endian long int, always	= 24

       o   Offset 12: Base table (big-endian) offset

	   4 bytes, big-endian long int, always	= 1048

       o   Offset 16: Cache table (little-endian) offset

	   4 bytes, big-endian long int, always	= 1304

       o   Offset 20: Base table (little-endian) offset

	   4 bytes, big-endian long int, always	= 2328

       o   Offset 24: Cache table (big-endian)

	   1024	bytes, 256 big-endian long ints

	   This	is a look up table for the length and decoding of short
	   codewords.  If the codeword represented by the 8 bits is unique,
	   then	bit 7 (0x80) will be set, and the low 5	bits are the length in
	   bits	of the code.  The high three bytes partially represent the
	   final symbol.

	   If bit 7 is clear, then the code is looked up in the base table.

       o   Offset 1048:	Base table (big-endian)

	   256 bytes, 64 big-endian long ints

	   This	is where the codeword is looked	up if it isn't found in	the
	   cache table.

       o   Offset 1304:	Cache table (little-endian)

	   1024	bytes, 256 little-endian long ints.

	   This	contains exactly the same data as in the cache table at	offset
	   24, except that all of the values are stored	in little-endian
	   format instead of big-endian.

	   Presumably this is for a speed advantage on slow little-endian
	   processors.	This module uses only the big-endian tables.

       o   Offset 2328:	Base table (little-endian)

	   256 bytes, 64 little-endian long ints

	   This	contains exactly the same data as in the base table at offset
	   1048, except	that all of the	values are stored in little-endian
	   format instead of big-endian.

	   Presumably this is for a speed advantage on slow little-endian
	   processors.	This module uses only the big-endian tables.
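
       As an illustration of the layout above, the header offsets and the
       big-endian cache table could be read as follows.  This is a sketch,
       not the module's own parser: peek_huff_record(),
       decode_cache_entry(), and the returned key names are hypothetical,
       and the entry decoding follows only the bit description given for
       the cache table.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical helper: read the HUFF header and the big-endian cache table.
sub peek_huff_record {
    my ($dataref) = @_;
    croak 'expected a scalar reference' unless ref $dataref eq 'SCALAR';
    croak 'record too short' if length($$dataref) < 24;

    my ($id, $headerlength, $cache_be, $base_be, $cache_le, $base_le)
        = unpack('a4 N5', $$dataref);
    croak 'not a HUFF record' unless $id eq 'HUFF';
    croak 'cache table truncated' if length($$dataref) < $cache_be + 1024;

    # 256 big-endian 32-bit entries, 1024 bytes in total
    my @cache = unpack('N256', substr($$dataref, $cache_be, 1024));

    return {
        headerlength => $headerlength,
        cache_be     => $cache_be,
        base_be      => $base_be,
        cache_le     => $cache_le,
        base_le      => $base_le,
        cache        => \@cache,
    };
}

# Hypothetical helper: split one cache entry into the documented parts.
sub decode_cache_entry {
    my ($entry) = @_;
    return {
        unique     => ($entry & 0x80) ? 1 : 0,   # bit 7: codeword is unique
        codelength => $entry & 0x1f,             # low 5 bits: length in bits
        symbolbits => $entry >> 8,               # high three bytes
    };
}
```

       Only the big-endian tables are read here, matching the note that
       this module does not use the little-endian copies.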

   "ParseRecordImage(\$dataref)"
       Parses image records, updating object attributes, most notably adding
       the image data to the hash "$self->{imagedata}",	adding the image
       filename	to "$self->{recindexlinks}", and incrementing
       "$self->{recindex}".

       Takes as	an argument a reference	to the record data.  Croaks if it
       isn't provided, or isn't	a reference.

       This is called automatically by "ParseRecord()" and "ParseResource()"
       as needed.

   "ParseRecordText(\$dataref)"
       Parses text records, updating object attributes,	most notably appending
       text to "$self->{text}".	 Takes as an argument a	reference to the
       record data.

       This is called automatically by "ParseRecord()" and "ParseResource()"
       as needed.

   "fix_html(%args)"
       Takes raw Mobipocket text and replaces the custom tags and file
       position anchors with HTML equivalents.

       Arguments

       o   "filename"

	   The name of the output HTML file (used in generating	hrefs).	 The
	   procedure croaks if this is not supplied.

       o   "nonewlines"	(optional)

	   If this is set to true, the procedure will not attempt to insert
	   newlines for	readability.  This will	leave the output in a single
	   unreadable line, but	has the	advantage of reducing the processing
	   time, especially useful if tidy is going to be run on the output
	   anyway.

   "fix_html_filepos()"
       Takes the raw HTML text of the object and replaces the filepos anchors.
       This has	to be called before any	other action that modifies the text,
       or the filepos positions	will not be valid.

       Returns 1 if successful,	undef if there was no text to fix.

       This is called automatically by "fix_html()".
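
       The ordering constraint above follows from filepos values being byte
       offsets into the unmodified text.  The sketch below shows one way
       such a conversion could work and why offsets must be consumed before
       anything else edits the text.  It is not this module's
       implementation: filepos_to_anchors(), the "fileposNNN" anchor
       naming, and the use of id attributes are all assumptions made for
       illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn filepos links into ordinary href/anchor pairs.
# $text is treated as a byte string, since filepos values are byte offsets.
sub filepos_to_anchors {
    my ($text, $filename) = @_;

    # Collect every distinct offset referenced by a filepos attribute.
    my %seen;
    $seen{ $1 + 0 } = 1 while $text =~ /filepos="?(\d+)"?/g;

    # Insert anchors from the highest offset down, so that each insertion
    # leaves all lower offsets pointing at the same place.
    for my $pos (sort { $b <=> $a } keys %seen) {
        next if $pos > length $text;
        substr($text, $pos, 0, qq{<a id="filepos$pos"></a>});
    }

    # Only now rewrite the links themselves, which changes their length.
    $text =~ s/filepos="?0*(\d+)"?/href="$filename#filepos$1"/g;
    return $text;
}
```

       Rewriting the links before inserting the anchors would shift every
       later offset, which is the failure mode the warning above describes.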

   "uncompress_dictionaryhuffman_records()"
       Uncompresses all	text records using "uncompress_dictionaryhuffman()".
       This destroys the existing contents of $self->{text} if any.

       This method is called automatically at the end of "Load()" if
       DictionaryHuffman encoding is detected.

PROCEDURES
       All procedures are exportable, but none are exported by default.	 All
       procedures can be exported by using the ":all" tag.

   "find_mobidedrm()"
       Attempts to locate a copy of the MobiDeDrm script by searching PATH and
       looking in the EBook::Tools user configuration directory (see
       "userconfigdir()" in EBook::Tools).

       Returns the complete path to the	script,	or undef if nothing was	found.

       This will use package variable $mobidedrm_cmd as	its first guess, and
       set that	variable to the	return value as	well.

   "find_mobigen()"
       Attempts to locate the mobigen executable by making a test execution on
       predicted locations (including just checking PATH) and looking in the
       EBook::Tools user configuration directory (see "userconfigdir()" in
       EBook::Tools).

       Returns the system command used for a successful	invocation, or undef
       if nothing worked.

       This will use package variable $mobigen_cmd as its first	guess, and set
       that variable to	the return value as well.

   "parse_mobi_exth($headerdata)"
       Takes as	an argument a scalar containing	the variable-length Mobipocket
       EXTH data from the first	record.	 Returns an array of hashes, each hash
       containing the data from	one EXTH record	with values from that data
       keyed to	recognizable names.

       If $headerdata doesn't appear to	be an EXTH header, carps a warning and
       returns an empty	list.

       See:

       http://wiki.mobileread.com/wiki/MOBI

       Hash keys

       o   "type"

	   A numeric value indicating the type of EXTH data in the record.
	   See package variable	%exthtypes.

       o   "length"

	   The length of the "data" value in bytes

       o   "data"

	   The data of the record.

   "parse_mobi_header($headerdata)"
       Takes as	an argument a scalar containing	the variable-length
       Mobipocket-specific header data from the	first record.  Returns a hash
       containing values from that data	keyed to recognizable names.

       See:

       http://wiki.mobileread.com/wiki/MOBI

       Hash keys

       The returned hash will have the following keys (documented in the order
       in which	they are encountered in	the header):

       "identifier"
	   This	should always be the string 'MOBI'.  If	it isn't, the
	   procedure croaks.

       "headerlength"
	   This	is the size of the complete header.  If	this value is
	   different from the length of	the argument, the procedure croaks.

       "type"
	   A numeric code indicating what category of Mobipocket file this is.

       "encoding"
	   A numeric code representing the encoding.  Expected values are
	   '1252' (for Windows-1252) and '65001' (for UTF-8).

	   The procedure carps a warning if an unexpected value	is
	   encountered.

       "uniqueid"
	   This	is thought to be a unique ID for the book, but its actual use
	   is unknown.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "version"
	   This	is thought to be the Mobipocket	format version.	 A second
	   version code	shows up again later as	"version2" which is usually
	   the same on unprotected books but different on DRMd books.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "reserved"
	   40 bytes of reserved	data.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "indxrecord"
	   This	is thought to be the record offset to the first	'INDX' record,
	   so named for	its first four letters.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "titleoffset"
	   Offset in record 0 (not from	start of file) of the full title of
	   the book.

       "titlelength"
	   Length in bytes of the full title of	the book

       "languageunknown"
	   16 bits of unknown data thought to be related to the	book language.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "language"
	   A pseudo-IANA language code string representing the main book
	   language (i.e. the value of <dc:language>).	See %mobilangcodes for
	   an exact map	of raw values to this string and notes on non-
	   compliant results.

       "dilanguageunknown"
	   16 bits of unknown data thought to be related to the	dictionary
	   input language.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "dilanguage"
	   A pseudo-IANA language code string for the DictionaryInLanguage
	   element.  See %mobilangcodes	for an exact map of raw	values to this
	   string and notes on non-compliant results.

       "dolanguageunknown"
	   16 bits of unknown data thought to be related to the	dictionary
	   output language.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "dolanguage"
	   A pseudo-IANA language code string for the DictionaryOutLanguage
	   element.  See %mobilangcodes	for an exact map of raw	values to this
	   string and notes on non-compliant results.

       "version2"
	   This	is another Mobipocket format version related to	DRM.  If no
	   DRM is present, it should be	the same as "version".

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "firstimagerecord"
	   This	is thought to be an index to the first record containing image
	   data.  If there are no images in the	book, this value will be
	   4294967295 (0xffffffff).

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "huffrecord"
	   This	is thought to be the record offset to the 'HUFF' record, used
	   in HUFF/CDIC	decompression.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "huffreccnt"
	   This	is thought to be the number of HUFF and	CDIC records, starting
	   at "huffrecord".

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "datprecord"
	   This	is thought to be the record offset to the first	'DATP' record,
	   so named for	its first four letters.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "datpreccnt"
	   This	is thought to be the number of 'DATP' records present.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "exthflags"
	   A 32-bit bitfield related to	the Mobipocket EXTH data.  If bit 6
	   (0x40) is set, then there is	at least one EXTH record.

       "unknown116"
	   36 bytes of unknown data at offset 116.  This value will be
	   undefined if	the header data	was not	long enough to contain it.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "drmoffset"
	   A number thought to be the byte offset inside of the	record 0 data
	   in which DRM	data can be found.  If present and no DRM is set,
	   contains either the value 0xFFFFFFFF	(normal	books) or 0x00000000
	   (samples).  This value will be undefined if the header data was not
	   long	enough to contain it.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "drmcount"
	   A number thought to be related to DRM.

	   This	value will be undefined	if the header data was not long	enough
	   to contain it.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "drmsize"
	   A number thought to be the size of the data in bytes	after
	   "drmoffset" containing DRM keys.

	   This	value will be undefined	if the header data was not long	enough
	   to contain it.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "drmflags"
	   A number thought to be related to DRM.

	   This	value will be undefined	if the header data was not long	enough
	   to contain it.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "unknown168"
	   32 bits of unknown data at offset 168, usually zeroes.  This	value
	   will	be undefined if	the header data	was not	long enough to contain
	   it.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "unknown172"
	   32 bits of unknown data at offset 172, usually zeroes.  This	value
	   will	be undefined if	the header data	was not	long enough to contain
	   it.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "unknown176"
	   16 bits of unknown data at offset 176.  This	value will be
	   undefined if	the header data	was not	long enough to contain it.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "lastimagerecord"
	   This	is thought to be an index to the last record containing	image
	   data.  If there are no images in the	book, this value will be 65535
	   (0xffff).

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "unknown180"
	   32 bits of unknown data at offset 180.  This	value will be
	   undefined if	the header data	was not	long enough to contain it.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "fcisrecord"
	   This	is thought to be an index to a 'FCIS' record, so named because
	   those are always the	first four characters when the record data is
	   decompressed	using uncompress_palmdoc().

	   This	value will be undefined	if the header data was not long	enough
	   to contain it.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "unknown188"
	   32 bits of unknown data at offset 188.  This	value will be
	   undefined if	the header data	was not	long enough to contain it.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "flisrecord"
	   This	is thought to be an index to a 'FLIS' record, so named because
	   those are always the	first four characters when the record data is
	   decompressed	using uncompress_palmdoc().

	   This	value will be undefined	if the header data was not long	enough
	   to contain it.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "unknown196"
	   32 bits of unknown data at offset 196.  This value will be
	   undefined if	the header data	was not	long enough to contain it.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "unknown200"
	   Unknown data	of unknown length running to the end of	the header.
	   This	value will be undefined	if the header data was not long	enough
	   to contain it.

	   Use with caution.  This key may be renamed in the future if more
	   information is found.

       "extradataflags"
	   Two bytes sometimes found inside of "unknown200", used to determine
	   if extra data has been appended to each text	record that should not
	   be used in decompression.
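
       To make the field order above concrete, the sketch below reads the
       first six fields and the EXTH flag from a header.  It assumes that
       every field not given an explicit size above is a 32-bit big-endian
       integer, which places "exthflags" at offset 112, immediately before
       "unknown116".  The function name peek_mobi_header() and the
       "has_exth" key are hypothetical; this is not the module's parser.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical helper: read a few leading fields of the MOBI header.
sub peek_mobi_header {
    my ($headerdata) = @_;
    croak 'header too short' if length($headerdata) < 116;

    # a4 = 4 raw bytes, N = 32-bit big-endian unsigned integer
    my ($identifier, $headerlength, $type, $encoding, $uniqueid, $version)
        = unpack('a4 N5', $headerdata);
    croak 'not a MOBI header' unless $identifier eq 'MOBI';

    # Assumed offset 112: the 32-bit field preceding "unknown116".
    my $exthflags = unpack('N', substr($headerdata, 112, 4));

    return (
        headerlength => $headerlength,
        type         => $type,
        encoding     => $encoding,
        uniqueid     => $uniqueid,
        version      => $version,
        exthflags    => $exthflags,
        has_exth     => ($exthflags & 0x40) ? 1 : 0,   # bit 6
    );
}
```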

   "parse_mobi_language($languagecode, $regioncode)"
       Takes the integer values	$languagecode and $regioncode unpacked from
       the Mobipocket header and returns a language string mostly (but not
       entirely) conformant to the IANA	language subtag	registry codes.

       Croaks if $languagecode is not provided.  If $languagecode is not
       recognized, a warning is carped and the sub returns undef.  Note that
       0,0 is a recognized code returning an empty string.

       If $regioncode is not provided or not recognized, it is disregarded and
       the base language string (with no region or script) is returned.

       See %mobilangcodes for an exact map of values.  Note that the
       bottom two bits of the region code appear to be unused (i.e. the	values
       are all multiples of 4).

   "pid_append_checksum($pid)"
       Computes	the Mobipocket PID checksum used as the	final two bytes	of the
       PID and appends them to $pid, returning the merged string.

       Used by "pid_is_valid($pid)".

   "pid_is_valid($pid)"
       Returns 1 if the	PID is a valid Mobipocket/Kindle PID and 0 otherwise.

       This is determined by first ensuring that $pid is exactly ten bytes
       long, and then stripping	the final two bytes normally used as a
       checksum	and recomputing	them, returning	1 only if they are recomputed
       correctly.
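
       The strip-and-recompute check described above can be sketched with
       the checksum passed in as a code reference.  The toy_checksum()
       routine below is a placeholder chosen only so that the example runs;
       it is NOT the Mobipocket PID checksum, and pid_checks_out() is a
       hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: validate a 10-byte PID by recomputing its last two
# bytes with the supplied checksum routine.
sub pid_checks_out {
    my ($pid, $checksum) = @_;
    return 0 unless defined $pid and length($pid) == 10;
    my $stem = substr($pid, 0, 8);
    return ($stem . $checksum->($stem) eq $pid) ? 1 : 0;
}

# Placeholder two-character checksum, NOT the real Mobipocket algorithm:
# the sum of the byte values modulo 100, zero-padded to two digits.
sub toy_checksum {
    my ($stem) = @_;
    my $sum = 0;
    $sum += ord for split //, $stem;
    return sprintf '%02d', $sum % 100;
}
```

       In the module itself the recomputation step corresponds to
       "pid_append_checksum($pid)".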

   "pukall_cipher_1(%args)"
       This is a COMPLETELY UNTESTED implementation of the Pukall Cipher 1
       algorithm used for encryption and decryption in Mobipocket files.  It
       is a 128-bit stream cipher.  For	more information and alternate
       implementations,	see <http://membres.lycos.fr/pc1/>.

       Use at your own risk.  Bug reports appreciated.

       Arguments

       o   "key"

	   16-byte encryption key.  This must be provided, and must be exactly
	   16 bytes, or	the procedure will croak.

       o   "input"

	   Input data to be either encrypted or	decrypted.  If this is not
	   provided, the procedure croaks.

       o   "encrypt" (optional)

	   If set to true, the cipher will be used to encrypt the input	data.
	   If not set, or set to false,	the cipher will	be used	to decrypt the
	   input data.

   "record_extradata_size(%args)"
       This checks the end of a	text record for	extra data that	should not be
       made part of decompression and returns the total	size of	all data
       fields.

       Arguments

       o   "dataref"

	   A reference to the record data

       o   "extradataflags"

	   16 bits worth of flags indicating which extra data fields are
	   present.

   "system_mobidedrm(%args)"
       Runs python on a	copy of	"MobiDeDrm.py" if it is	available (not
       included	with this distribution)	to downconvert a Mobipocket file.

       Returns the output filename on success, or undef	otherwise.

       Arguments

       o   "infile"

	   The input filename.	If not specified or invalid, the procedure
	   returns undef.

       o   "outfile"

	   The output filename.	 If not	specified, the program will use	a name
	   based on the	input file, appending '-nodrm' to the basename and
	   keeping the extension.  In the special case of Mobipocket files
	   ending in '-sm', the	'-sm' portion of the basename is simply
	   removed, and	nothing	else is	appended.

       o   "pid"

	   The PID to use to decrypt the file.	If not specified or invalid,
	   the procedure returns undef.
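
       The default output naming rule described for "outfile" can be
       sketched as follows.  This follows only the wording above; the name
       default_nodrm_name() is hypothetical, and the module's own handling
       of paths and extensions may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: derive the default output name from the input name.
sub default_nodrm_name {
    my ($infile) = @_;

    # Split off a trailing extension, if any, without crossing a '/'.
    my ($stem, $ext) = $infile =~ m{\A(.*?)(\.[^./]*)?\z}s;
    $ext = '' unless defined $ext;

    # A basename ending in '-sm' just loses that suffix;
    # anything else gains '-nodrm'.
    $stem .= '-nodrm' unless $stem =~ s/-sm\z//;

    return $stem . $ext;
}
```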

   "system_mobigen(%args)"
       Runs "mobigen" to convert OPF, HTML, or ePub input into a Mobipocket
       .prc/.mobi book.	 The procedure find_mobigen() is called	to locate the
       executable.

       Returns the return value	from mobigen, or undef if no filename was
       specified or the	file did not exist.  Also returns undef	if mobigen
       could not be found.

       Arguments

       o   "infile"

	   The input filename.	If not specified or invalid, the procedure
	   returns undef.

       o   "outfile"

	   The output filename.	 The mobigen executable	will choose its	own
	   filename for	direct output, but if this argument is specified, the
	   output file will be renamed to the specified	filename instead.

	   If not specified, the default output	will be	left in	place.

       o   "dir"

	   The directory in which to place the output file.  The mobigen
	   executable itself will always place its output into the current
	   working directory, but if this argument is specified, the output
	   file	will be	moved into the specified directory, creating that
	   directory if	necessary.

       o   "compression"

	   Compression level from 0-2, where 0 is no compression, 1 is PalmDoc
	   compression,	and 2 is HUFF/CDIC compression.	 If not	specified,
	   defaults to 1 (PalmDoc compression).

   "uncompress_dictionaryhuffman(%args)"
       Uncompresses text compressed with the DictionaryHuffman compression
       scheme.

       Arguments

       o   "data"

	   A scalar containing the compressed data to uncompress.

       o   "huff"

	   A hashref pointing to the HUFF record data

       o   "cdics"

	   An arrayref pointing	to the CDIC record data

       o   "depth"

	   The current depth of	the huffman tree, currently only used in
	   debugging.

   "unpack_mobi_language($data)"
       Takes as an argument 4 bytes of data.  If fewer bytes are provided, the
       sub croaks.  If more are provided, a debug warning is issued, but the
       sub continues.

       In scalar context returns a language string mostly (but not entirely)
       conformant to the IANA language subtag registry codes.

       In list context,	returns	the language string, an	unknown	code integer,
       a region	code integer, and a language code integer, with	the last three
       being directly unpacked values.

       See %mobilangcodes for an exact map of values.  Note that the bottom
       two bits	of the region code appear to be	unused (i.e. the values	are
       all multiples of	4).  The unknown code integer appears to be unused,
       and is generally	zero.

       The original implementation by Mobipocket may have been via Microsoft's
       .NET CultureInfo	class.	See:
       <http://msdn.microsoft.com/en-us/library/system.globalization.cultureinfo(VS.71).aspx>
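
       The raw unpacking step could look like the sketch below.  It assumes
       the 4 bytes are laid out as a 16-bit unknown value followed by one
       region byte and one language byte, which matches the order of the
       list-context return values and the 16-bit "languageunknown" header
       key, but is not confirmed by this page.  The name
       split_mobi_language() is hypothetical, and the mapping to a language
       string is left out.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical helper: split 4 bytes of language data into raw integers.
# Assumed layout: 16-bit unknown, 8-bit region, 8-bit language.
sub split_mobi_language {
    my ($data) = @_;
    croak 'need at least 4 bytes'
        unless defined $data and length($data) >= 4;

    # n = 16-bit big-endian unsigned, C = unsigned byte
    my ($unknown, $region, $language) = unpack('n C C', $data);
    return ($unknown, $region, $language);
}
```

       Under this assumed layout, a region code that is not a multiple of 4
       would contradict the observation above and may be worth logging.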

BUGS AND LIMITATIONS
       o   Unpacking DRM-protected text	isn't supported.  Although
	   infrastructure may be added later to	make use of external helpers
	   and plugins,	direct DRM support will	never be added to the main
	   code	for legal reasons.

       o   Repacking a .prc without fully extracting to	OPF and	completely
	   converting back isn't supported.  This will have to be implemented
	   before an interface to perform minor	metadata alterations can be
	   implemented.

       o   Mobipocket HUFF/CDIC	decoding (used mostly on dictionaries) isn't
	   well	documented.

       o   Not all Mobipocket data is understood, so a conversion from OPF to
	   Mobipocket .prc back	to OPF will not	result in all data being
	   retained.  Patches welcome.

       o   Mobipocket INDX, DATP, FCIS,	and FLIS records are not understood
	   and are completely ignored.

       o   Mobipocket EXTH subjectcode records may not end up attached to the
	   correct subject element if the number of subject records differs
	   from	the number of subjectcode records.  This is because the
	   Mobipocket format leaves the	EXTH subjectcode records completely
	   unlinked from the subject records, and there	is no way to detect if
	   a subject with no associated	subjectcode comes before a subject
	   with	an associated subjectcode.

	   Fortunately,	this should rarely be a	problem	with real data,	as
	   Mobipocket Creator only allows a single subject to be set, and the
	   only	other way to have a subjectcode	attached to a subject is to
	   manually edit the OPF file and insert an additional dc:Subject
	   element with	a BASICCode attribute.

	   Mobipocket has indicated that they may move data currently in their
	   custom elements and attributes to the standard <meta> elements in a
	   future release, so this problem may become moot then.

AUTHOR
       Zed Pobre <zed@debian.org>

LICENSE	AND COPYRIGHT
       Copyright 2008 Zed Pobre

       Licensed	to the public under the	terms of the GNU GPL, version 2

perl v5.24.1			  2017-07-08	   EBook::Tools::Mobipocket(3)
