pstotext(1)                 General Commands Manual                pstotext(1)

       pstotext - extract ASCII text from a PostScript or PDF file

       pstotext [option|pathname]...

       where option includes:

       -cork
       -landscape
       -landscapeOther
       -portrait
       -debug
       -bboxes
       -output file
       -gs command
       -

       pstotext reads one or more PostScript or PDF files and writes to
       standard output a representation of the plain text that would be
       displayed if the files were printed.  As described in the DETAILS
       section below, this representation is only an approximation.
       Nevertheless, it is often useful for information retrieval (e.g.,
       running grep(1) or building a full-text index) or for recovering the
       text from a PostScript file whose source you have lost.

       pstotext calls Ghostscript, and requires Aladdin Ghostscript version
       3.51 or newer.  Ghostscript must be invokable on the current search
       path as gs.  Alternatively, you can use the -gs option to specify the
       command (pathname and options) to run Ghostscript.  For example, on
       Windows you might use -gs "c:\gs\gswin32c.exe -Ic:\gs;c:\gs\fonts".

       pstotext processes its command line from left to right, ignoring the
       case of options.  When it encounters a pathname, it opens the file
       and expects to find a PostScript job or PDF document to process.
       The option - means to read and process a PostScript job from
       standard input.  If no - or pathname arguments are encountered,
       pstotext reads a PostScript job from standard input.  (PDF documents
       require random access, and hence cannot be read from standard
       input.)  You can use the -output option to specify an output file;
       because arguments are processed in order, -output must appear before
       the input pathname.  Otherwise pstotext writes to standard output.
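
       The ordering rule above can be sketched as a small helper that
       assembles an argument list.  This is only an illustration: the
       function name build_pstotext_argv and its parameters are hypothetical,
       and it uses only options described on this page.

```python
def build_pstotext_argv(inputs, output=None, gs_command=None, cork=False):
    """Assemble a pstotext command line (hypothetical helper).

    Options are emitted before the input pathnames because pstotext
    processes its arguments from left to right.
    """
    argv = ["pstotext"]
    if gs_command is not None:
        # -gs takes the whole Ghostscript command as a single argument.
        argv += ["-gs", gs_command]
    if cork:
        argv.append("-cork")
    if output is not None:
        # -output must precede the input pathname.
        argv += ["-output", output]
    if inputs:
        argv += list(inputs)
    else:
        # "-" reads a PostScript job from standard input (not valid for PDF).
        argv.append("-")
    return argv
```

       The resulting list could then be handed to subprocess.run(); passing
       a list rather than a shell string avoids quoting problems with the
       -gs argument.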

       The option -cork is only relevant for PostScript files produced by
       dvips from TeX or LaTeX documents; it tells pstotext to use the Cork
       encoding (known as T1 in LaTeX) rather than the old TeX text encoding
       (known as OT1 in LaTeX).  Unfortunately, files produced by dvips do
       not indicate which font encoding was used.

       The options -landscape and -landscapeOther should be used for
       documents that must be rotated 90 degrees clockwise or
       counterclockwise, respectively, in order to be readable.

       The options -debug and -bboxes are mostly of use to the maintainers
       of pstotext.  -debug shows Ghostscript output and error messages.
       -bboxes outputs one word per line with bounding box information.

DETAILS
       pstotext does its work by telling Ghostscript to load a PostScript
       library that causes it to write to its standard output information
       about each string rendered by a PostScript job or PDF document.
       This information includes the characters of the string and enough
       additional information to approximate the string's bounding
       rectangle.  pstotext post-processes this information and outputs a
       sequence of words delimited by space, newline, and formfeed.

       pstotext outputs words in the same sequence as they are rendered by
       the document.  This usually, but not always, follows the order in
       which a human would read the words on a page.  Within this sequence,
       words are separated by either space or newline depending on whether
       or not they fall on the same line.  Each page is terminated with a
       formfeed.  If you use the wrong option from the set {-portrait,
       -landscape, -landscapeOther}, pstotext is likely to substitute
       newline for space.
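
       Given the delimiters described above, a consumer of pstotext output
       might split it into pages, lines, and words roughly as follows.  This
       is a minimal sketch: the function name parse_pstotext_output is
       hypothetical, and it assumes the output bytes are ISO 8859-1 and that
       every page ends with a formfeed.

```python
def parse_pstotext_output(raw):
    """Split pstotext output bytes into pages -> lines -> words.

    Assumes ISO 8859-1 bytes, formfeed-terminated pages, newline-separated
    lines, and space-separated words.
    """
    text = raw.decode("latin-1")
    chunks = text.split("\f")
    # Pages are terminated (not separated) by formfeed, so the piece after
    # the final formfeed carries no page content.
    if chunks and not chunks[-1].strip():
        chunks.pop()
    pages = []
    for chunk in chunks:
        lines = []
        for line in chunk.split("\n"):
            # Split on plain spaces only, so Latin-1 characters such as the
            # no-break space are not treated as word separators.
            words = [w for w in line.split(" ") if w]
            if words:
                lines.append(words)
        pages.append(lines)
    return pages
```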

       A PostScript job or PDF document often renders one word as several
       strings in order to get correct spacing between particular pairs of
       characters.  pstotext does its best to assemble these strings back
       into words, using a simple heuristic: strings separated by a distance
       of less than 0.3 times the minimum of the average character widths in
       the two strings are considered to be part of the same word.  Note
       that this typically causes leading and trailing punctuation
       characters to be included with a word.
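
       The heuristic above can be illustrated with a short sketch.  It is
       not the actual pstotext implementation: the Rendered record, the
       helper names, and the simplification to horizontal strings on a
       single line (ordered left to right) are all assumptions made for the
       example.

```python
from dataclasses import dataclass


@dataclass
class Rendered:
    """One rendered string with its horizontal extent (hypothetical record)."""
    text: str
    x0: float  # left edge
    x1: float  # right edge


def avg_char_width(s):
    # Average width per character; guard against empty strings.
    return (s.x1 - s.x0) / max(len(s.text), 1)


def join_words(strings, factor=0.3):
    """Merge adjacent strings into words using the 0.3 x min-width rule."""
    words = []
    prev = None
    for s in strings:
        if prev is not None:
            gap = s.x0 - prev.x1
            threshold = factor * min(avg_char_width(prev), avg_char_width(s))
            if gap < threshold:
                # Close enough: treat as a continuation of the current word.
                words[-1] += s.text
                prev = s
                continue
        words.append(s.text)
        prev = s
    return words
```

       For example, the strings "Hel" (0 to 15), "lo" (15.5 to 25.5),
       "world" (30 to 55), and "!" (55.2 to 57.2) are joined into "Hello"
       and "world!", which also shows how trailing punctuation ends up
       attached to a word.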

       The PostScript language provides a flexible encoding scheme by which
       character codes in strings select specific characters (symbols), so a
       PostScript job is free to use any character code.  pstotext, on the
       other hand, always translates to the ISO 8859-1 (Latin-1) character
       set, which is an extension of ASCII covering most of the Western
       European languages.  When a character isn't present in ISO 8859-1,
       pstotext uses a sequence of characters, e.g., "---" for em dash or
       "A\226" for Abreve.  pstotext can be fooled by a font whose Encoding
       vector doesn't follow Adobe's conventions, but it contains heuristics
       allowing it to handle a wide variety of misbehaving fonts.
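
       The substitution idea can be sketched as follows.  This is an analogy
       only: pstotext works from PostScript glyph names and Encoding
       vectors, whereas the sketch starts from Unicode text.  The table
       holds just the two examples given above, and the function name
       to_latin1 and the "?" placeholder for unmapped characters are
       assumptions.

```python
# Fallback sequences for characters outside ISO 8859-1.  Only the two
# examples from the text are listed; the real table is larger.
FALLBACKS = {
    "\u2014": "---",              # em dash
    "\u0102": "A" + chr(0o226),   # Abreve: "A" followed by code octal 226
}


def to_latin1(text, unknown="?"):
    """Encode text as ISO 8859-1, substituting fallback sequences."""
    pieces = []
    for ch in text:
        if ord(ch) < 256:
            # Already representable in ISO 8859-1.
            pieces.append(ch)
        else:
            # The "?" placeholder is an assumption, not pstotext behaviour.
            pieces.append(FALLBACKS.get(ch, unknown))
    return "".join(pieces).encode("latin-1")
```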

       (pstotext no longer translates hyphen (\255) to minus (\055).)

       Andrew Birrell (PostScript libraries), Paul McJones (application),
       Russell Lang (Windows and OS/2 adaptation), and Hunter Goatley (VMS
       adaptation).

       pstotext incorporates technology originally developed for the Virtual
       Paper project at SRC; see

       As mentioned above, pstotext invokes Ghostscript.  See gs(1) or

       Copyright 1995-8 Digital Equipment Corporation.
       Distributed only by permission.
       See file /usr/local/share/licenses/pstotext-1.9_6/EULA for details.

       Last modified on Sat Feb  5 21:00:00 AEST 2000 by rjl
            modified on Fri Jun  5 14:02:37 PDT 1998 by mcjones
            modified on Wed Jun  7 17:47:56 PDT 1995 by birrell

       This file was generated automatically by mtex software; see the mtex
       home page at


