pdfwarez

pdfwarez [2018-11-23 19:02:47] – created - external edit 127.0.0.1
pdfwarez [2025-10-20 18:01:09] (current) jenda
Line 1:
 ====== Creating a searchable scanned and OCR'd book ======
 +
 +===== General PDF instructions (involves some resampling) =====
 +
 +Work in an empty directory:
 +<code>
 +mkdir foo
 +cd foo
 +</code>
 +
 +Separate into individual pages:
 +<code>
 +pdfseparate ../foo.pdf separated-%05d.pdf
 +</code>
 +
 +Render to PNG:
 +<code>
 +for f in separated-*.pdf; do t=${f#separated-}; echo convert -density 300 $f rendered-${t%.pdf}.png; done | parallel
 +</code>
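The render loop derives each output name with two shell parameter expansions: `${f#separated-}` strips the prefix and `${t%.pdf}` strips the extension. A minimal sketch of that renaming in isolation, using a hypothetical helper `rendered_name` that is not part of the original instructions:

```shell
# Hypothetical helper: map separated-NNNNN.pdf to rendered-NNNNN.png
# using the same expansions as the render loop.
rendered_name() {
    f=$1
    t=${f#separated-}            # drop the leading "separated-"
    echo "rendered-${t%.pdf}.png" # drop the trailing ".pdf", add ".png"
}

rendered_name separated-00042.pdf   # prints rendered-00042.png
```

The zero-padded page number from `%05d` is carried through unchanged, so the later globs keep the pages in order.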
 +
 +OCR:
 +<code>
 +export OMP_THREAD_LIMIT=1
 +for f in rendered-*.png; do t=${f#rendered-}; echo tesseract -c textonly_pdf=1 --oem 1 --dpi 300 -l eng $f ocr-${t%.png} pdf; done | parallel
 +</code>
 +
 +Combine originals + OCR text layer:
 +<code>
 +for f in ocr-*.pdf; do t=${f#ocr-}; pdftk separated-$t background ocr-$t output combined-$t; done
 +</code>
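Because the combine loop iterates over `ocr-*.pdf`, a page whose rendering or OCR step failed would silently be absent from the combined set. A small sketch of a check that could be run in the working directory before the final step, assuming the file naming used above (`missing_pages` is a hypothetical helper, not part of the original instructions):

```shell
# Hypothetical helper: print the suffix of every separated-*.pdf page
# that has no matching combined-*.pdf in the current directory.
missing_pages() {
    for f in separated-*.pdf; do
        [ -e "$f" ] || continue            # glob matched nothing
        t=${f#separated-}
        [ -e "combined-$t" ] || echo "$t"  # report the missing page
    done
}

missing_pages
```

If the function prints nothing, every separated page has a combined counterpart.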
 +
 +Produce final output:
 +<code>
 +pdfunite combined-*.pdf output.pdf
 +</code>
 +
 +
 +===== Old instructions - one image per page =====
  
 Prerequisites:

Except where otherwise noted, content on this wiki is licensed under the following license: Public Domain