====== Creating a searchable scanned and OCR'd book ======

===== General PDF instructions (involves some resampling) =====

Work in an empty directory:
<code>
mkdir foo
cd foo
</code>

Split the PDF into individual pages:
<code>
pdfseparate ../foo.pdf separated-%05d.pdf
</code>

Render to PNG:
<code>
for f in separated-*.pdf; do t=${f#separated-}; echo convert -density 300 $f rendered-${t%.pdf}.png; done | parallel
</code>
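
The loops on this page lean on two shell parameter expansions: "${f#prefix}" removes a leading prefix and "${f%suffix}" removes a trailing suffix. A minimal sketch of how one page number is derived from a file name (page_id is a hypothetical helper used only for illustration, the pipeline itself does not need it):
<code>
# page_id is a hypothetical helper, for illustration only.
page_id() {
    t=${1#separated-}          # strip the leading "separated-"
    printf '%s\n' "${t%.pdf}"  # strip the trailing ".pdf"
}

page_id separated-00042.pdf    # prints 00042
</code>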

OCR:
<code>
export OMP_THREAD_LIMIT=1
for f in rendered-*.png; do t=${f#rendered-}; echo tesseract -c textonly_pdf=1 --oem 1 --dpi 300 -l eng $f ocr-${t%.png} pdf; done | parallel
</code>

Combine the original pages with the OCR text layer:
<code>
for f in ocr-*.pdf; do t=${f#ocr-}; pdftk separated-$t background ocr-$t output combined-$t; done
</code>
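
The combine loop runs one pdftk process at a time. Since each page is independent, it could probably be fed through parallel like the earlier steps. The sketch below only prints the commands; combine_cmds is a hypothetical helper name, not something the tools provide:
<code>
# combine_cmds is a hypothetical helper: for every ocr-NNNNN.pdf given as
# an argument it prints the matching pdftk command, without running it.
combine_cmds() {
    for f in "$@"; do
        t=${f#ocr-}   # e.g. ocr-00001.pdf -> 00001.pdf
        echo "pdftk separated-$t background ocr-$t output combined-$t"
    done
}

# Demonstration with two example names; prints two pdftk command lines.
combine_cmds ocr-00001.pdf ocr-00002.pdf

# Real use, assuming GNU parallel is installed:
#   combine_cmds ocr-*.pdf | parallel
</code>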

Produce final output:
<code>
pdfunite combined*.pdf output.pdf
</code>
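
A quick sanity check, which is an addition and not part of the recipe itself: pdfinfo (from poppler-utils) prints a "Pages:" line, so the page counts of foo.pdf and output.pdf can be compared. The helper name page_count and the sample input below are made up for illustration:
<code>
# page_count is a hypothetical helper: it reads pdfinfo output on stdin
# and prints the number found on the "Pages:" line.
page_count() {
    awk '/^Pages:/ { print $2 }'
}

# Demonstration on made-up pdfinfo-style input.
printf 'Producer: example\nPages:          12\n' | page_count   # prints 12

# Real use, from inside the work directory:
#   pdfinfo ../foo.pdf | page_count
#   pdfinfo output.pdf | page_count
</code>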


===== Old instructions - one image per page =====

Prerequisites:
  * a directory with image files, one per page, for example in PPM format, with names that sort in page order (if you start from one multipage PDF, extract the images with "pdfimages file.pdf directory/")
  * tesseract-ocr version **4**, available in Debian Sid and in stretch-backports. Version 3 is much less accurate.
  * language packages for tesseract, for example tesseract-ocr-eng (English) and tesseract-ocr-ces (Czech)
  * imagemagick, pdftk, poppler-utils and parallel (beware: the moreutils package also ships a program called "parallel", but it is a different tool; the commands below need GNU parallel)
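
Whether everything is installed can be checked up front. The following is only a sketch: missing_tools and is_gnu_parallel are hypothetical helper names, and the GNU check assumes that "parallel --version" prints a first line starting with "GNU parallel", which the moreutils program is not expected to do:
<code>
# missing_tools (hypothetical helper) prints every command name among
# its arguments that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# is_gnu_parallel (hypothetical helper) succeeds only if the installed
# "parallel" identifies itself as GNU parallel.
is_gnu_parallel() {
    case $(parallel --version 2>/dev/null | head -n 1) in
        "GNU parallel"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Lists whichever of the commands used on this page are missing.
missing_tools convert tesseract pdftk pdfimages pdfunite parallel
</code>
If "parallel" is present, "is_gnu_parallel || echo 'parallel is not GNU parallel'" tells the two programs apart.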

First, create individual PDFs out of these images.
<code>
for f in *.ppm; do echo "convert -level 25,95% -quality 70 -density 300 -compress jpeg $f ${f%ppm}pdf"; done | parallel
</code>
  * adjust "level" to match your scanner. The goal is for the ink to come out pure black and the paper pure white.
  * adjust "density" to your scanner DPI

Next, run the OCR engine on the original files and create PDFs containing only the text layer:
<code>
export OMP_THREAD_LIMIT=1
for f in *.ppm; do echo "tesseract -c textonly_pdf=1 --oem 1 --dpi 300 -l eng $f $f pdf"; done | parallel
</code>
  * adjust "dpi" to your scanner DPI
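  * "l" selects the language, e.g. eng, ces or deu
  * run "pkill -f -USR1 parallel" in another terminal to check progress (GNU parallel responds to SIGUSR1 by listing its running jobs on stderr)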
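
tesseract also accepts several languages joined with "+", for example "-l eng+ces" for a book that mixes English and Czech. A sketch that takes the language and DPI as parameters instead of hard-coding them; ocr_cmds is a hypothetical helper name, and like the loop above it only prints the commands:
<code>
# ocr_cmds (hypothetical helper): first argument is the tesseract
# language string, second is the scan DPI, the rest are image files.
ocr_cmds() {
    lang=$1
    dpi=$2
    shift 2
    for f in "$@"; do
        echo "tesseract -c textonly_pdf=1 --oem 1 --dpi $dpi -l $lang $f $f pdf"
    done
}

# Demonstration with an example file name; prints one command line.
ocr_cmds eng+ces 600 page-001.ppm

# Real use, assuming GNU parallel is installed:
#   export OMP_THREAD_LIMIT=1
#   ocr_cmds eng+ces 600 *.ppm | parallel
</code>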

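Now merge the image layer and the text layer. Here "$f.pdf" (e.g. page.ppm.pdf) is the text-only PDF written by tesseract, and "${f%.ppm}.pdf" (e.g. page.pdf) is the image PDF from the first step: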
<code>
for f in *.ppm; do echo "pdftk $f.pdf background ${f%.ppm}.pdf output combined-${f%.ppm}.pdf"; done | parallel
</code>

Finally, merge all the generated pages into one PDF:
<code>
pdfunite combined*.pdf output.pdf
</code>