Preserve image data (/filesize) from original PDF #15

fsiegert · 2014-09-19T21:11:34Z

Thank you for this nice tool! I think it has one issue that makes it less than ideal for an important use case: PDFs from document scanners usually consist of exactly one image per page. It would be great if pdfocr could cater for those by preserving the original image data and simply adding the OCR text layer.

Currently, pdfocr converts the original pages to images using pdftoppm, which creates very large image files and still degrades the quality of the output PDF. For the use case described above, it would be nicer to use "pdfimages -all" to extract the original page image data and send that through tesseract (more or less) directly.
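
As a rough illustration of that pipeline, here is a minimal sketch. It is not the prototype linked below and not how pdfocr.rb works. It assumes one image per page, that poppler's `pdfimages` and `pdfunite` and a tesseract build with PDF output are installed, and that the extracted images are in a format tesseract can read. The helper names `is_ocr_input` and `ocr_keep_images` are made up for this example. Whether tesseract embeds the input image unchanged or re-encodes it may depend on the tesseract version and the image format, so check the output file size.

```shell
#!/usr/bin/env bash
# Sketch only: OCR a scanned PDF by working from its embedded page images
# instead of re-rendering the pages with pdftoppm.
set -euo pipefail

# Hypothetical helper: accept only image types tesseract commonly reads.
# "pdfimages -all" can also emit formats such as .jb2e or .ccitt,
# which this sketch rejects rather than silently dropping pages.
is_ocr_input() {
  case "${1##*.}" in
    jpg|jpeg|png|tif|tiff) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: OCR input.pdf into output.pdf (assumes one image per page).
ocr_keep_images() {
  local in_pdf=$1 out_pdf=$2 tmp img
  tmp=$(mktemp -d)

  # Extract the embedded images in their native format (page-000.jpg, ...).
  pdfimages -all "$in_pdf" "$tmp/page"

  for img in "$tmp"/page-*; do
    if ! is_ocr_input "$img"; then
      echo "unsupported image type: $img" >&2
      rm -rf "$tmp"
      return 1
    fi
    # The "pdf" config makes tesseract write ${img%.*}.pdf with a text layer.
    tesseract "$img" "${img%.*}" pdf
  done

  # Join the per-page PDFs; glob order follows the zero-padded numbering.
  pdfunite "$tmp"/page-*.pdf "$out_pdf"
  rm -rf "$tmp"
}

# Run only when called with two arguments: input.pdf output.pdf
if [ "$#" -eq 2 ]; then
  ocr_keep_images "$1" "$2"
fi
```

Saved as, say, `ocr-keep-images.sh` (a hypothetical name), it would be called as `bash ocr-keep-images.sh scan.pdf scan-ocr.pdf`. Failing on an unsupported image type is a deliberate choice here: skipping the image would quietly produce a PDF with missing pages.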

I have implemented a prototype of this as a bash script here: http://cern.ch/fsiegert/tmp/pdfocr.sh
It's definitely not complete and probably doesn't handle all types of documents that can come from different scanners yet (I have only tested it using a document from one scanner I had available). But I thought I'd contact you and ask whether you could imagine adding something similar as an option to pdfocr.rb (I'm not fluent in Ruby, but I could try to provide a patch/pull request if there is interest).

Cheers,
Frank

wodin · 2017-01-17T16:08:07Z

#23 and #28 are related to this.

imanzuk mentioned this issue Dec 27, 2015

Do not depend on pdftk #25

Open
