pdfwarez
— | pdfwarez [2018-11-23 19:02:47] (current) – created - external edit 127.0.0.1 | ||
---|---|---|---|
+ | ====== Creating a searchable scanned and OCR'd book ====== | ||
+ | Prerequisites: | ||
+ | * a directory with image files, for example in ppm format, one for each page (use " | ||
+ | * tesseract-ocr version **4**, available in Debian Sid and in stretch-backports. Version 3 is much less accurate. | ||
+ | * language packages for tesseract, for example tesseract-ocr-eng (English) and tesseract-ocr-ces (Czech) | ||
+ | * imagemagick, | ||
+ | |||
+ | First, create individual PDFs out of these images. | ||
+ | < | ||
+ | for f in *.ppm; do echo " | ||
+ | </ | ||
+ | * adjust " | ||
+ | * adjust " | ||
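As a rough sketch of this step, the helper below prints one ImageMagick command per page, in the same print-a-command style as the loop above. The function name `image_pdf_cmd`, the output name `${f%.ppm}.pdf`, and the density and quality values are assumptions made for illustration, not necessarily the settings used on this page.

```shell
#!/bin/sh
# Hypothetical helper: print an ImageMagick command that converts one page
# image into a one-page PDF.
# Assumptions: 300 dpi, JPEG compression, quality 75 unless given as $2,
# and file names without whitespace.
image_pdf_cmd() {
    f=$1
    printf 'convert %s -density 300 -compress jpeg -quality %s %s\n' \
        "$f" "${2:-75}" "${f%.ppm}.pdf"
}

# Print the commands for every page in the current directory.
for f in *.ppm; do
    [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
    image_pdf_cmd "$f"
done
```

The printed lines can be reviewed first and then piped into a shell or a job runner. Lower quality values give smaller files at the cost of visible compression artifacts.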
+ | |||
+ | Next, run the OCR engine on the original files and create PDFs with a text layer only: | ||
+ | < | ||
+ | export OMP_THREAD_LIMIT=1 | ||
+ | for f in *.ppm; do echo " | ||
+ | </ | ||
+ | * adjust " | ||
+ | * " | ||
+ | * "pkill -f -USR1 parallel" | ||
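A minimal sketch of how the OCR commands could be generated, assuming tesseract 4 with its `textonly_pdf` configuration variable and the `pdf` output config. The helper name `ocr_cmd` and the choice of `eng` and `ces` are illustrative assumptions. With the image name used as the output base, tesseract writes the text layer to `page.ppm.pdf`.

```shell
#!/bin/sh
# Keep each tesseract process single-threaded so that parallel jobs
# do not compete for the same cores.
export OMP_THREAD_LIMIT=1

# Hypothetical helper: print a tesseract command that produces a text-only PDF.
# $1 = image file, remaining arguments = language codes (assumed installed).
ocr_cmd() {
    f=$1
    shift
    langs=$(IFS=+; printf '%s' "$*")   # join the codes with "+", e.g. eng+ces
    printf 'tesseract %s %s -l %s -c textonly_pdf=1 pdf\n' "$f" "$f" "$langs"
}

# Possible usage (GNU parallel runs each input line as a command):
# for f in *.ppm; do ocr_cmd "$f" eng ces; done | parallel
```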
+ | |||
+ | Now merge the image layer and the text layer: | ||
+ | < | ||
+ | for f in *.ppm; do echo "pdftk $f.pdf background ${f%.ppm}.pdf output combined-${f%.ppm}.pdf"; | ||
+ | </ | ||
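If OCR failed for some pages, the merge command will abort on the missing input. The sketch below prints the same `pdftk` command as above, but only when both layers exist and are non-empty. The helper name `merge_cmd` is hypothetical, and the file naming (`$f.pdf` and `${f%.ppm}.pdf`) is taken from the loop above.

```shell
#!/bin/sh
# Hypothetical helper: print the pdftk merge command for one page,
# or report the first missing layer on stderr and return non-zero.
merge_cmd() {
    f=$1
    base=${f%.ppm}
    for layer in "$f.pdf" "$base.pdf"; do
        if [ ! -s "$layer" ]; then
            echo "skipping $f: missing or empty $layer" >&2
            return 1
        fi
    done
    echo "pdftk $f.pdf background $base.pdf output combined-$base.pdf"
}

# Possible usage: pages with a missing layer are reported and left out.
# for f in *.ppm; do merge_cmd "$f"; done | parallel
```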
+ | |||
+ | Finally, merge all the generated pages into one big PDF: | ||
+ | < | ||
+ | pdfunite combined*.pdf output.pdf | ||
+ | </ |
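The shell expands `combined*.pdf` in lexicographic order, so `combined-page-10.pdf` comes before `combined-page-2.pdf` unless the page numbers are zero-padded. The sketch below previews a natural ordering instead. It assumes GNU `sort` (for `-V`) and file names without whitespace, and the helper name `natural_order` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: read file names on stdin, one per line, and print them
# in natural (version) order so that page 2 sorts before page 10.
natural_order() {
    sort -V
}

# Preview the page order before merging:
# printf '%s\n' combined*.pdf | natural_order
#
# Merge in that order (relies on names containing no whitespace):
# pdfunite $(printf '%s\n' combined*.pdf | natural_order) output.pdf
```

Zero-padding the page numbers when the images are created avoids the problem altogether, and the plain glob is then sufficient.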