Creating a searchable scanned and OCR'd book

General PDF instructions (involves some resampling)

Work in an empty directory:

mkdir foo
cd foo

Separate into individual pages:

pdfseparate ../foo.pdf separated-%05d.pdf

Render to PNG:

for f in separated-*.pdf; do t=${f#separated-}; echo convert -density 300 $f rendered-${t%.pdf}.png; done | parallel
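
The loop above derives each output name with two shell parameter expansions: ${f#prefix} strips a leading prefix and ${t%suffix} strips a trailing suffix. A minimal sketch of the renaming, using a hypothetical page file name:

```shell
# Hypothetical input name, as produced by pdfseparate with separated-%05d.pdf
f=separated-00042.pdf

t=${f#separated-}            # strip the prefix  -> 00042.pdf
out=rendered-${t%.pdf}.png   # strip the suffix  -> rendered-00042.png

echo "$f -> $out"
```

The later OCR and combine steps use the same trick, so the five-digit page number is the only part of the name that carries through the whole pipeline.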

OCR:

export OMP_THREAD_LIMIT=1
for f in rendered-*.png; do t=${f#rendered-}; echo tesseract -c textonly_pdf=1 --oem 1 --dpi 300 -l eng $f ocr-${t%.png} pdf; done | parallel

Combine originals + OCR text layer:

for f in ocr-*.pdf; do t=${f#ocr-}; pdftk separated-$t background ocr-$t output combined-$t; done
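
If an OCR job failed for some page, that page will be silently missing from the final document. A small sanity check, sketched here under the assumption that the file names follow the separated-/combined- pattern used above (the helper name count_existing is made up for this example):

```shell
# Count how many of the given paths exist. An unmatched glob is passed
# through literally by the shell, so it is simply counted as zero.
count_existing() {
  n=0
  for f in "$@"; do
    if [ -e "$f" ]; then n=$((n + 1)); fi
  done
  echo "$n"
}

separated=$(count_existing separated-*.pdf)
combined=$(count_existing combined-*.pdf)
if [ "$separated" -ne "$combined" ]; then
  echo "mismatch: $separated separated pages, $combined combined pages" >&2
fi
```

It prints nothing when the two counts agree.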

Produce final output:

pdfunite combined*.pdf output.pdf

Old instructions - one image per page

Prerequisites:

a directory with image files, for example in ppm format, one for each page (use “pdfimages file.pdf directory/” if you have one multipage PDF), with names that sort in the right page order
tesseract-ocr version 4, available in Debian Sid and in stretch-backports. Version 3 is much less accurate
language packages for tesseract, for example tesseract-ocr-eng (English) and tesseract-ocr-ces (Czech)
imagemagick, pdftk, poppler-utils, parallel (beware: the moreutils package also ships a program named “parallel”, which is a different tool; the commands here need GNU parallel)
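
A quick way to confirm the tools are installed is to look each command up on the PATH. This is only a sketch: the helper name missing_tools is made up, the command names are assumed from the packages listed above (newer ImageMagick releases may provide “magick” instead of “convert”), and it does not tell GNU parallel apart from the moreutils one.

```shell
# Print the name of each listed command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

# Command names assumed from the packages listed above.
missing=$(missing_tools convert tesseract pdftk pdfunite pdfimages parallel)
if [ -n "$missing" ]; then
  echo "missing tools:" $missing >&2
else
  echo "all tools found"
fi
```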

First, create individual PDFs out of these images.

for f in *.ppm; do echo "convert -level 25,95% -quality 70 -density 300 -compress jpeg $f ${f%ppm}pdf"; done | parallel

Adjust “level” to match your scanner. The goal is for the text to come out fully black and the paper fully white.
Adjust “density” to your scanner's DPI.
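
If the scanner DPI is not known, it can be estimated from the pixel width of a page image and the physical width of the paper. A rough sketch in integer shell arithmetic; both input values are hypothetical and should be replaced with your own measurements:

```shell
# Hypothetical measurements: replace with your own.
width_px=2480   # pixel width of one scanned page image
width_mm=210    # physical page width in millimetres (A4 here)

# dpi = pixels / inches, with 1 inch = 25.4 mm; rounded to the nearest integer.
dpi=$(( (width_px * 254 + width_mm * 5) / (width_mm * 10) ))
echo "estimated DPI: $dpi"
```

The pixel width can be read with ImageMagick, for example “identify -format %w page.ppm”.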

Next, run the OCR engine on the original files and create PDFs containing only the text layer:

export OMP_THREAD_LIMIT=1
for f in *.ppm; do echo "tesseract -c textonly_pdf=1 --oem 1 --dpi 300 -l eng $f $f pdf"; done | parallel

Adjust “dpi” to your scanner's DPI.
“l” is the language, e.g. eng, ces or deu.
Run “pkill -f -USR1 parallel” to see the progress.

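Now merge the image layer (name.pdf, created by convert) and the text layer (name.ppm.pdf, because tesseract appends “.pdf” to the output name it is given):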

for f in *.ppm; do echo "pdftk $f.pdf background ${f%.ppm}.pdf output combined-${f%.ppm}.pdf"; done | parallel
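
Each page ends up with three PDF files whose names differ only slightly, which is easy to mix up. A minimal sketch of the naming, using a hypothetical page file:

```shell
# Hypothetical input image name.
f=page-001.ppm

image_pdf=${f%ppm}pdf                 # page-001.pdf      (image, from convert)
text_pdf=$f.pdf                       # page-001.ppm.pdf  (text only, from tesseract)
combined_pdf=combined-${f%.ppm}.pdf   # combined-page-001.pdf (from pdftk)

printf '%s\n' "$image_pdf" "$text_pdf" "$combined_pdf"
```

In the pdftk command the text-only PDF is the input and the image PDF is placed behind it as the background.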

Finally, merge all the generated pages into one big PDF:

pdfunite combined*.pdf output.pdf
