Skip site navigation (1)Skip section navigation (2)

FreeBSD Manual Pages


home | help
ReadSXC(3)	      User Contributed Perl Documentation	    ReadSXC(3)

       Spreadsheet::ReadSXC - Extract OpenOffice 1.x spreadsheet data

	 use Spreadsheet::ReadSXC qw(read_sxc);
	 my $workbook_ref = read_sxc("/path/to/file.sxc");

	 # Alternatively, unpack the .sxc file yourself	and pass content.xml

	 use Spreadsheet::ReadSXC qw(read_xml_file);
	 my $workbook_ref = read_xml_file("/path/to/content.xml");

	 # Alternatively, pass the XML string directly

	 use Spreadsheet::ReadSXC qw(read_xml_string);
	 use Archive::Zip;
	 my $zip = Archive::Zip->new("/path/to/file.sxc");
	 my $content = $zip->contents('content.xml');
	 my $workbook_ref = read_xml_string($content);

	 # Control the output through a	hash of	options	(below are the defaults):

	 my %options = (
	       ReplaceNewlineWith      => "",
	       IncludeCoveredCells     => 0,
	       DropHiddenRows	       => 0,
	       DropHiddenColumns       => 0,
	       NoTruncate	       => 0,
	       StandardDate	       => 0,
	       StandardTime	       => 0,
	       OrderBySheet	       => 0,
	 my $workbook_ref = read_sxc("/path/to/file.sxc", \%options );

	 # Iterate over	every worksheet, row, and cell:

	 use Unicode::String qw(utf8);

	 foreach ( sort	keys %$workbook_ref ) {
	    print "Worksheet ",	$_, " contains ", $#{$$workbook_ref{$_}} + 1, "	row(s):\n";
	    foreach ( @{$$workbook_ref{$_}} ) {
	       foreach ( map { defined $_ ? $_ : '' } @{$_} ) {
		  print	utf8(" '$_'")->as_string;
	       print "\n";

	 # Cell	D2 of worksheet	"Sheet1"

	 $cell = $$workbook_ref{"Sheet1"}[1][3];

	 # Row 1 of worksheet "Sheet1":

	 @row =	@{$$workbook_ref{"Sheet1"}[0]};

	 # Worksheet "Sheet1":

	 @sheet	= @{$$workbook_ref{"Sheet1"}};

       Spreadsheet::ReadSXC extracts data from OpenOffice 1.x spreadsheet
       files (.sxc). It	exports	the function read_sxc()	which takes a filename
       and an optional reference to a hash of options as arguments and returns
       a reference to a	hash of	references to two-dimensional arrays. The hash
       keys correspond to the names of worksheets in the OpenOffice workbook.
       The two-dimensional arrays correspond to	rows and cells in the
       respective spreadsheets.	If you don't like this because the order of
       sheets is not preserved in a hash, read on. The 'OrderBySheet' option
       provides	an array of hashes instead.

       If you prefer to	unpack the .sxc	file yourself, you can use the
       function	read_xml_file()	instead	and pass the path to content.xml as an
       argument. Or you	can extract the	XML string from	content.xml and	pass
       the string to the function read_xml_string(). Both functions also take
       a reference to a	hash of	options	as an optional second argument.

       Spreadsheet::ReadSXC requires XML::Parser to parse the XML contained in
       .sxc files. Only	the contents of	text:p elements	are returned, not the
       actual values of	table:value attributes.	For example, a cell might have
       a table:value-type attribute of "currency", a table:value attribute of
       "-1500.99" and a	table:currency attribute of "USD". The text:p element
       would contain "-$1,500.99". This	is the string which is returned	by the
       read_sxc() function, not	the value of -1500.99.

       Spreadsheet::ReadSXC was	written	with data import into an SQL database
       in mind.	Therefore empty	spreadsheet cells correspond to	undef values
       in array	rows. The example code above shows how to replace undef	values
       with empty strings.

       If the .sxc file	contains an empty spreadsheet its hash element will
       point to	an empty array (unless you use the 'NoTruncate'	option in
       which case it will point	to an array of an array	containing one
       undefined element).

       OpenOffice uses UTF-8 encoding. It depends on your environment how the
       data returned by	the XML	Parser is best handled:

	 use Unicode::String qw(latin1 utf8);
	 $unicode_string = utf8($$workbook_ref{"Sheet1"}[0][0])->as_string;

	 # this	will not work for characters outside ISO-8859-1:

	 $latin1_string	= utf8($$workbook_ref{"Sheet1"}[0][0])->latin1;

       Of course there are other modules than Unicode::String on CPAN that
       handle conversion between encodings. It's your choice.

       Table rows in .sxc files	may have a "table:number-rows-repeated"
       attribute, which	is often used for consecutive empty rows. When you
       format whole rows and/or	columns	in OpenOffice, it sets the numbers of
       rows in a worksheet to 32,000 and the number of columns to 256, even if
       only a few lower-numbered rows and cells	actually contain data.
       Spreadsheet::ReadSXC truncates such sheets so that there	are no empty
       rows after the last row containing data and no empty columns after the
       last column containing data (unless you use the 'NoTruncate' option).

       Still it	is perfectly legal for an .sxc file to apply the
       "table:number-rows-repeated" attribute to rows that actually contain
       data (although I	have only been able to produce such files manually,
       not through OpenOffice itself). To save on memory usage in these	cases,
       Spreadsheet::ReadSXC does not copy rows by value, but by	reference
       (remember that multi-dimensional	arrays in Perl are really arrays of
       references to arrays). Therefore, if you	change a value in one row, it
       is possible that	you find the corresponding value in the	next row
       changed,	too:

	 $$workbook_ref{"Sheet1"}[0][0]	= 'new string';
	 print $$workbook_ref{"Sheet1"}[1][0];

       As of version 0.20 the references returned by read_sxc()	et al. remain
       valid after subsequent calls to the same	function. In earlier versions,
       calling read_sxc() with a different file	as the argument	would change
       the data	referenced by the original return value, so you	had to
       derefence it before making another call.	Thanks to H. Merijn Brand for
       fixing this.

	   By default, newlines	within cells are ignored and all lines in a
	   cell	are concatenated to a single string which does not contain a
	   newline. To keep the	newline	characters, use	the following
	   key/value pair in your hash of options:

	     ReplaceNewlineWith	=> "\n"

	   However, you	may replace newlines with any string you like.

	   By default, the content of cells that are covered by	other cells is
	   ignored because you wouldn't	see it in OpenOffice unless you
	   unmerge the merged cells. To	include	covered	cells in the data
	   structure which is returned by parse_sxc(), use the following
	   key/value pair in your hash of options:

	     IncludeCoveredCells => 1

	   By default, hidden rows are included	in the data structure returned
	   by parse_sxc(). To drop those rows, use the following key/value
	   pair	in your	hash of	options:

	     DropHiddenRows => 1

	   By default, hidden columns are included in the data structure
	   returned by parse_sxc(). To drop those rows,	use the	following
	   key/value pair in your hash of options:

	     DropHiddenColumns => 1

	   By default, the two-dimensional arrays that contain the data	within
	   each	worksheet are truncated	to get rid of empty rows below the
	   last	row containing data and	empty columns beyond the last column
	   containing data. If you prefer to keep those	rows and columns, use
	   the following key/value pair	in your	hash of	options:

	     NoTruncate	=> 1

	   By default, date cells are returned as formatted. If	you prefer to
	   obtain the date value as contained in the table:date-value
	   attribute, use the following	key/value pair in your hash of

	     StandardDate => 1

	   By default, time cells are returned as formatted. If	you prefer to
	   obtain the time value as contained in the table:time-value
	   attribute, use the following	key/value pair in your hash of

	     StandardTime => 1

	   These options are a first step on the way to	a different approach
	   at reading data from	.sxc files. There should be more options to
	   read	in values instead of the strings OpenOffice displays. It
	   should give more flexibility	in working with	the data obtained from
	   OpenOffice spreadsheets. 'float' and	'percentage' values could be
	   next.  'currency' is	less obvious, though, as we need to consider
	   both	its value and the 'table:currency' attribute. Formulas and
	   array formulas are yet another issue. I probably won't deal with
	   this	until I've given this module an	object-oriented	interface.

	   The disadvantage of storing worksheets by name in a hash is that
	   the order of	sheets is lost.	If you prefer not to obtain such a
	   hash, but an	array of worksheets insted, use	the following
	   key/value pair in your hash of options:

	     OrderBySheet => 1

	   Thus	the read_sxc function will return an array of hashes, each of
	   which will have two keys, "label" and "data". The value of "label"
	   is the name of the sheet. The value of data is a reference to a
	   two-dimensional array containing rows and columns of	the worksheet:

	     my	$worksheets_ref	= read_sxc("/path/to/file.sxc");
	     my	$name_of_first_sheet = $$worksheets_ref[0]{label};
	     my	$first_cell_of_first_sheet = $$worksheets_ref[0]{data}[0][0];

SEE ALSO has extensive	documentation
       of the OpenOffice 1.x XML file format (soon to be replaced by the OASIS
       file format, see

       Christoph Terhechte, <>

       Copyright 2005 by Christoph Terhechte

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.24.1			  2005-06-17			    ReadSXC(3)


Want to link to this manual page? Use this URL:

home | help