FreeBSD Manual Pages

PDFSANDWICH(April 03, 2022)			   PDFSANDWICH(April 03, 2022)

NAME
       pdfsandwich - A generator for sandwich OCR pdfs from scanned pdf	files

SYNOPSIS
       pdfsandwich [options] inputfile.pdf

DESCRIPTION
       pdfsandwich generates "sandwich" OCR pdf files: pdf files that contain
       only images (no text) are processed by optical character recognition
       (OCR), and the recognized text is added to each page invisibly "behind"
       the images.  pdfsandwich needs the following programs: unpaper,
       convert, gs, hocr2pdf (for tesseract < 3.03), and tesseract.  As
       tesseract >= 3.03 can write pdf files itself, hocr2pdf is only needed
       for older versions of tesseract.  For further information, visit
       http://www.tobias-elze.de/pdfsandwich.
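
       Since pdfsandwich relies on several external programs, it can help to
       check that they are on PATH before a long run.  The following is a
       minimal sketch; the function name missing_tools is hypothetical and not
       part of pdfsandwich.

```shell
#!/bin/sh
# Print the name of every program from the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Programs named in DESCRIPTION; hocr2pdf is left out because it is only
# needed for tesseract < 3.03.
missing=$(missing_tools unpaper convert gs tesseract)
if [ -n "$missing" ]; then
    echo "not found on PATH:" $missing
else
    echo "all required programs found"
fi
```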

OPTIONS
       -convert
	      -convert filename	: name of convert binary (default: convert)

       -coo   -coo options : additional convert options; make sure to quote,
              e.g. -coo "-normalize -black-threshold 75%".  See
              convert --help or man convert for all convert options.

       -debug keep all temporary files in /tmp (for debugging)

       -enforcehocr2pdf
	      use hocr2pdf even	if tesseract >=	3.03

       -first_page
	      -first_page number : number of page to start OCR from  (default:
	      1)

       -gray  use grayscale for	images (default: black and white)

       -grayfilter
	      enable  unpaper's	 gray  filter;	further	 options can be	set by
	      -unpo

       -gs    -gs filename : name of gs	binary (default: gs);  optional,  only
	      required for resizing

       -hocr2pdf
              -hocr2pdf filename : name of hocr2pdf binary (default:
              hocr2pdf); ignored for tesseract >= 3.03 unless option
              -enforcehocr2pdf is set

       -hoo   -hoo options : additional	hocr2pdf options; make sure to quote

       -identify
	      -identify	filename : name	of identify binary (default: identify)

       -last_page
	      -last_page  number  :  number of page up to which	to process OCR
	      (default:	number of pages	in inputfile)

       -lang  -lang language : language of the text; option to tesseract
              (default: eng), e.g. eng, deu, deu-frak, fra, rus, swe, spa,
              ita, ...; see option -list_langs.  Multiple languages may be
              specified, separated by plus characters.

       -layout
              -layout { single | double | none } : layout of the scanned
              pages; requires unpaper
              single: one page per sheet
              double: two pages per sheet
              none: no auto-layout (default)

       -list_langs
	      list  currently  available languages and exit; in	case of	custom
	      binaries of tesseract, place this	after the -tesseract option

       -maxpixels
              -maxpixels NUM : maximal number of pixels allowed for input
              file; if (resolution/72)^2 * width * height > maxpixels, the
              page of the input file is scaled down prior to OCR so that its
              size in pixels corresponds to maxpixels; default: 17415167 (A3
              @ 300 dpi)
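
              The arithmetic behind -maxpixels can be sketched as follows.
              This assumes that width and height are given in points (1/72
              inch), as the division by 72 suggests, and that both dimensions
              shrink by the same factor; the function names page_pixels and
              scale_factor are hypothetical, and the actual scaling done by
              pdfsandwich may differ in detail.

```shell
#!/bin/sh
# Pixel count of a page according to the formula given for -maxpixels.
# Arguments: resolution in dpi, page width and height in points (1/72 inch).
page_pixels() {
    awk -v r="$1" -v w="$2" -v h="$3" \
        'BEGIN { printf "%.0f\n", (r / 72) ^ 2 * w * h }'
}

# Linear scale factor that would bring the page down to maxpixels.
# Assumption: width and height shrink by the same factor, so the factor is
# sqrt(maxpixels / pixels); 1 means the page is already small enough.
scale_factor() {
    awk -v p="$1" -v m="$2" \
        'BEGIN { if (p > m) printf "%.3f\n", sqrt(m / p); else print 1 }'
}

maxpixels=17415167                   # default from this page
pixels=$(page_pixels 600 842 1191)   # A3-sized page (842x1191 pt) at 600 dpi
echo "pixels: $pixels, scale: $(scale_factor "$pixels" "$maxpixels")"
```

              An A3-sized page at the default 300 dpi stays below the default
              limit, so scaling would only apply to larger pages or higher
              resolutions.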

       -noimage
	      do not place the image over the text (requires hocr2pdf; ignored
	      without -enforcehocr2pdf option)

       -nopreproc
	      do not preprocess	with unpaper

       -nthreads
	      -nthreads	number : number	of parallel threads (default:  guessed
	      number of	CPUs; if guessing fails: 1)

       -o     -o filename : output file; default: inputfile_ocr.pdf (if
              extension is different from .pdf, original extension is kept)
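
              The default output name can be predicted in a script with a
              small helper.  This is a sketch of one reading of the rule
              above: the function name default_output is hypothetical, and
              two points are assumptions not covered by this page, namely
              that the directory part of the input path is kept and that a
              name without extension gets "_ocr.pdf" appended.

```shell
#!/bin/sh
# Derive the default output name described for -o: insert "_ocr" before the
# extension and keep the extension as written (e.g. ".PDF" stays ".PDF").
# Assumptions: the directory part is kept, and a name without any extension
# gets "_ocr.pdf" appended.
default_output() {
    base=${1##*/}        # file name without directory part
    dir=${1%"$base"}     # directory part with trailing slash, may be empty
    case $base in
        *.*) printf '%s%s_ocr.%s\n' "$dir" "${base%.*}" "${base##*.}" ;;
        *)   printf '%s%s_ocr.pdf\n' "$dir" "$base" ;;
    esac
}

default_output scan.pdf
default_output archive/letter.PDF
```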

       -omp_thread_limit
	      -omp_thread_limit	number : number	of threads tesseract  may  use
	      for each page (default: 1)
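
              Because -nthreads sets how many threads run in parallel and
              -omp_thread_limit sets how many threads tesseract may use per
              page, their product is a rough upper bound on concurrent
              tesseract threads.  The sketch below picks -nthreads so that
              this product does not exceed the CPU count; the helper name
              pages_in_parallel and the choice of 2 threads per page are
              illustrative assumptions.

```shell
#!/bin/sh
# Value for -nthreads such that nthreads * omp_thread_limit does not exceed
# the CPU count.  Arguments: CPU count, threads per page; result is >= 1.
pages_in_parallel() {
    n=$(( $1 / $2 ))
    [ "$n" -ge 1 ] || n=1
    echo "$n"
}

# getconf _NPROCESSORS_ONLN is widely available but not universal;
# fall back to 1 if it fails or prints something that is not a number.
cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || cpus=1
case $cpus in
    ''|*[!0-9]*) cpus=1 ;;
esac

omp=2   # illustrative value for -omp_thread_limit
echo "suggested: -nthreads $(pages_in_parallel "$cpus" "$omp") -omp_thread_limit $omp"
```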

       -pagesize
              -pagesize { original | NUMxNUM } : set page size of output pdf
              (requires ghostscript)
              original: same as input file (default)
              NUMxNUM: width x height in pixels (e.g. for A4: -pagesize
              595x842)
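
              The A4 example value 595x842 matches 210 mm x 297 mm converted
              at 72 units per inch.  Assuming that this convention holds for
              other paper sizes as well, a NUMxNUM value can be derived from
              millimetres as sketched below; the function name mm_to_pagesize
              is hypothetical.

```shell
#!/bin/sh
# Convert a paper size in millimetres to a NUMxNUM value at 72 units per inch
# (25.4 mm per inch), rounded to whole numbers.
mm_to_pagesize() {
    awk -v w="$1" -v h="$2" \
        'BEGIN { printf "%.0fx%.0f\n", w / 25.4 * 72, h / 25.4 * 72 }'
}

mm_to_pagesize 210 297       # A4
mm_to_pagesize 215.9 279.4   # US Letter
```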

       -pdfinfo
	      -pdfinfo filename	: name of pdfinfo binary (default: pdfinfo)

       -pdfunite
	      -pdfunite	filename : name	of pdfunite binary (default: pdfunite)

       -resolution
	      -resolution NUM :	resolution (dpi) used for OCR (default:	300)

       -rgb   use RGB color space for images (default: black and  white);  use
	      with care: causes	problems with some color spaces

       -sloppy_text
              sloppily place text, group words, do not draw single glyphs;
              ignored for tesseract >= 3.03 unless option -enforcehocr2pdf is
              set

       -tesseract
              -tesseract filename : name of tesseract binary (default:
              tesseract)

       -tesso -tesso options : additional  tesseract  options;	make  sure  to
	      quote

       -unpaper
	      -unpaper filename	: name of unpaper binary (default: unpaper)

       -unpo  -unpo options : additional unpaper options; make sure to quote

       -quiet suppress output

       -verbose
	      produce more output

       -version
	      print version and	quit

       -help  Display this list	of options

       --help Display this list	of options

LANGUAGES
       Via Tesseract, numerous language packages are available; see
       http://code.google.com/p/tesseract-ocr/downloads/list for a complete
       list.  Here is an incomplete selection of supported languages and their
       abbreviations:

       ara (Arabic), aze (Azerbaijani), bul (Bulgarian), cat (Catalan), ces
       (Czech), chi_sim (Simplified Chinese), chi_tra (Traditional Chinese),
       chr (Cherokee), dan (Danish), dan-frak (Danish (Fraktur)), deu
       (German), ell (Greek), eng (English), enm (Middle English), epo
       (Esperanto), est (Estonian), fin (Finnish), fra (French), frm (Middle
       French), glg (Galician), heb (Hebrew), hin (Hindi), hrv (Croatian),
       hun (Hungarian), ind (Indonesian), ita (Italian), jpn (Japanese), kor
       (Korean), lav (Latvian), lit (Lithuanian), nld (Dutch), nor
       (Norwegian), pol (Polish), por (Portuguese), ron (Romanian), rus
       (Russian), slk (Slovakian), slv (Slovenian), sqi (Albanian), spa
       (Spanish), srp (Serbian), swe (Swedish), tam (Tamil), tel (Telugu),
       tgl (Tagalog), tha (Thai), tur (Turkish), ukr (Ukrainian), vie
       (Vietnamese)

       Multiple	languages may be specified, separated by plus characters. Note
       that the	respective tesseract language package needs to be installed on
       your  system  to	be usable by pdfsandwich. Option -list_langs lists the
       languages which are available on	your system.
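
       When the languages come from a list in a script, the plus-separated
       value for -lang can be assembled as in the following sketch.  The
       function name join_langs and the file name scan.pdf are placeholders;
       the command is only printed here, not run.

```shell
#!/bin/sh
# Join language codes with "+" to form the value for -lang.
# "$*" joins the arguments with the first character of IFS; the subshell
# keeps the IFS change local to this function.
join_langs() {
    ( IFS=+; printf '%s\n' "$*" )
}

langs=$(join_langs deu eng fra)
echo "pdfsandwich -lang $langs scan.pdf"   # scan.pdf is a placeholder
```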

AVAILABILITY
       Sources and packages as well as comprehensive  help  can	 be  found  at
       http://www.tobias-elze.de/pdfsandwich.

AUTHOR
       Tobias Elze <sourceforge@tobias-elze.de>

						   PDFSANDWICH(April 03, 2022)
