Creating a searchable scanned and OCR'd book

General PDF instructions (involves some resampling)

Work in an empty directory:

mkdir foo
cd foo

Separate into individual pages:

pdfseparate ../foo.pdf separated-%05d.pdf

Render to PNG:

for f in separated-*.pdf; do t=${f#separated-}; echo convert -density 300 $f rendered-${t%.pdf}.png; done | parallel
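
The loop relies on shell parameter expansion: ${f#separated-} strips the prefix and ${t%.pdf} strips the suffix. A minimal sketch of that renaming in isolation (the function name rendered_name is made up for illustration and is not part of any tool):

```shell
# Hypothetical helper: map a separated page name to its rendered PNG name.
rendered_name() {
    f=$1
    t=${f#separated-}                        # drop the leading "separated-"
    printf 'rendered-%s.png\n' "${t%.pdf}"   # drop the trailing ".pdf"
}

rendered_name separated-00001.pdf   # prints rendered-00001.png
```

Dropping the trailing "| parallel" from the loop gives the same kind of dry run: the commands are printed instead of executed.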

OCR:

export OMP_THREAD_LIMIT=1
for f in *.png; do t=${f#rendered-}; echo tesseract -c textonly_pdf=1 --oem 1 --dpi 300 -l eng $f ocr-${t%.png} pdf; done  | parallel
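
Because the OCR jobs run in parallel, a page that failed is easy to miss. A minimal sketch of a check that every rendered page has an OCR result, assuming the rendered-*.png and ocr-*.pdf names used here (missing_ocr is a hypothetical helper name):

```shell
# Hypothetical helper: list rendered pages that have no OCR output yet.
# Assumes the rendered-NNNNN.png / ocr-NNNNN.pdf naming from the steps above.
missing_ocr() {
    for f in rendered-*.png; do
        [ -e "$f" ] || continue                  # the glob matched nothing
        t=${f#rendered-}
        [ -e "ocr-${t%.png}.pdf" ] || echo "$f"  # report pages without OCR
    done
}

missing_ocr
```

Empty output means every page has a text layer and the combine step can proceed.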

Combine originals + OCR text layer:

for f in ocr-*.pdf; do t=${f#ocr-}; pdftk separated-$t background ocr-$t output combined-$t; done

Produce final output:

pdfunite combined*.pdf output.pdf
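
pdfunite receives the pages in the order in which the shell expands combined*.pdf, and that order is a string sort. The zero-padded %05d in the pdfseparate step is what keeps it equal to the page order. A small illustration with made-up file names:

```shell
# Illustration only: string sorting with and without zero padding.
# LC_ALL=C makes the comparison byte-wise and predictable.
printf '%s\n' combined-2.pdf combined-10.pdf | LC_ALL=C sort
# combined-10.pdf comes first: wrong page order

printf '%s\n' combined-00002.pdf combined-00010.pdf | LC_ALL=C sort
# combined-00002.pdf comes first: correct page order
```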

Old instructions - one image per page

Prerequisites: one PPM image per page in the current directory.

First, create individual PDFs out of these images.

for f in *.ppm; do echo "convert -level 25,95% -quality 70 -density 300 -compress jpeg $f ${f%ppm}pdf"; done | parallel

Next, run the OCR engine on the original images to create PDFs that contain only the text layer:

export OMP_THREAD_LIMIT=1
for f in *.ppm; do echo "tesseract -c textonly_pdf=1 --oem 1 --dpi 300 -l eng $f $f pdf"; done | parallel

Now merge the image layer and the text layer:

for f in *.ppm; do echo "pdftk $f.pdf background ${f%.ppm}.pdf output combined-${f%.ppm}.pdf"; done | parallel
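
Each scan ends up with three PDFs whose names differ only slightly, because tesseract appends .pdf to the full input name while convert replaces the ppm extension. A minimal sketch that prints the names for one page (page_names and the sample file name are hypothetical, for illustration only):

```shell
# Hypothetical helper: print the three PDF names derived from one scan.
page_names() {
    f=$1
    echo "image: ${f%ppm}pdf"                # written by convert
    echo "text: $f.pdf"                      # written by tesseract (output base + .pdf)
    echo "combined: combined-${f%.ppm}.pdf"  # written by pdftk
}

page_names page-001.ppm
# image: page-001.pdf
# text: page-001.ppm.pdf
# combined: combined-page-001.pdf
```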

Finally, merge all the generated pages into one big PDF:

pdfunite combined*.pdf output.pdf