pdfwarez
| Next revision | Previous revision | ||
| pdfwarez [2018-11-23 19:02:47] – created - external edit 127.0.0.1 | pdfwarez [2025-10-20 18:01:09] (current) – jenda | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| ====== Creating a searchable scanned and OCR'd book ====== | ====== Creating a searchable scanned and OCR'd book ====== | ||
| + | |||
| + | ===== General PDF instructions (involves some resampling) ===== | ||
| + | |||
| + | Work in an empty directory: | ||
| + | < | ||
| + | mkdir foo | ||
| + | cd foo | ||
| + | </ | ||
| + | |||
| + | Separate into individual pages: | ||
| + | < | ||
| + | pdfseparate ../foo.pdf separated-%05d.pdf | ||
| + | </ | ||
| + | |||
| + | Render to PNG: | ||
| + | < | ||
| + | for f in separated-*pdf; | ||
| + | </ | ||
| + | |||
| + | OCR: | ||
| + | < | ||
| + | export OMP_THREAD_LIMIT=1 | ||
| + | for f in *.png; do t=${f# | ||
| + | </ | ||
| + | |||
| + | Combine originals + OCR text layer: | ||
| + | < | ||
| + | for f in ocr-*.pdf; do t=${f# | ||
| + | </ | ||
| + | |||
| + | Produce final output: | ||
| + | < | ||
| + | pdfunite combined*.pdf output.pdf | ||
| + | </ | ||
| + | |||
| + | |||
| + | ===== Old instructions - one image per page ===== | ||
| Prerequisites: | Prerequisites: | ||
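The per-page loops in the general instructions are cut off in this view, but each one appears to derive an output name from the input name with shell parameter expansion (`t=${f#...}`). The sketch below only illustrates that naming idiom; it is not the author's loop body. The helper `derive_name` and the `page-` prefix are hypothetical, while `separated-` and `ocr-` are the prefixes that appear in the commands above.

```shell
#!/bin/sh
# Hypothetical helper (not from the original page): build an output file name
# by swapping the prefix and the extension of an input file name.
# Usage: derive_name FILE OLD_PREFIX NEW_PREFIX NEW_EXT
derive_name() {
    f=$1
    t=${f#"$2"}   # strip the leading prefix, e.g. "separated-"
    t=${t%.*}     # strip the trailing extension, e.g. ".pdf"
    printf '%s%s.%s\n' "$3" "$t" "$4"
}

# Example: the page number survives, only prefix and extension change.
derive_name separated-00001.pdf separated- page- png
```

Keeping the zero-padded page number (`%05d` in the `pdfseparate` step) in every intermediate name means the glob in the final `pdfunite combined*.pdf` step expands in page order.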
