Preserve image data (/filesize) from original PDF #15

Open
fsiegert opened this issue Sep 19, 2014 · 1 comment

Comments

@fsiegert

Thank you for this nice tool! I think it has one issue that makes it less than ideal for an important use case: PDFs from document scanners usually consist of exactly one image per page. It would be great if pdfocr could cater for those by preserving the original image data and simply adding the OCR text layer.

Currently, pdfocr converts the original pages to images using pdftoppm, which creates very large image files and still degrades the quality of the output PDF. For the use case described above, it would be nicer to use "pdfimages -all" to extract the original page image data and send that through tesseract (more or less) directly.
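To make the proposed pipeline concrete, here is a minimal Python sketch. It is not the linked bash prototype and not pdfocr's code. It assumes poppler's `pdfimages` and `pdfunite` and a tesseract build with PDF output are on the PATH, and that the scan holds exactly one image per page in a format tesseract can read (CCITT or JBIG2 streams extracted by `-all` would need extra handling). The helper names (`extract_cmd`, `ocr_cmd`, `unite_cmd`, `ocr_scanned_pdf`) are hypothetical, and whether tesseract embeds the image bytes unchanged depends on its version and the image format, which I have not verified.

```python
import shutil
import subprocess
from pathlib import Path


def extract_cmd(pdf, prefix):
    """Command that dumps embedded images in their native format (no re-rendering)."""
    return ["pdfimages", "-all", str(pdf), str(prefix)]


def ocr_cmd(image, out_base, lang="eng"):
    """Command that asks tesseract for a PDF with a text layer: <out_base>.pdf."""
    return ["tesseract", str(image), str(out_base), "-l", lang, "pdf"]


def unite_cmd(page_pdfs, output):
    """Command that concatenates the per-page PDFs into one file."""
    return ["pdfunite", *map(str, page_pdfs), str(output)]


def image_index(path):
    """Numeric part of names like page-000.jpg, so pages sort in order."""
    return int(path.stem.rsplit("-", 1)[1])


def ocr_scanned_pdf(pdf, output, workdir, lang="eng"):
    """Extract original page images, OCR each one, and merge the results.

    Assumption: every extracted image is one full page (no masks or thumbnails).
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(extract_cmd(pdf, workdir / "page"), check=True)

    images = sorted(
        (p for p in workdir.glob("page-*") if p.suffix != ".pdf"),
        key=image_index,
    )
    if not images:
        raise RuntimeError("no embedded images found; fall back to rasterizing")

    page_pdfs = []
    for image in images:
        out_base = image.with_suffix("")  # page-000.jpg -> page-000
        subprocess.run(ocr_cmd(image, out_base, lang), check=True)
        page_pdfs.append(out_base.with_suffix(".pdf"))

    if len(page_pdfs) == 1:
        # Nothing to merge for a single-page scan.
        shutil.copyfile(page_pdfs[0], output)
    else:
        subprocess.run(unite_cmd(page_pdfs, output), check=True)
```

A call such as `ocr_scanned_pdf("scan.pdf", "scan-ocr.pdf", "tmp-ocr")` would run the whole chain. Documents with no embedded images raise an error here, so a caller could fall back to the existing pdftoppm path.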

I have implemented a prototype of this as a bash script here: http://cern.ch/fsiegert/tmp/pdfocr.sh
It's definitely not complete and probably doesn't handle all types of documents that can come from different scanners yet (I have only tested it using a document from one scanner I had available). But I thought I'd contact you and ask whether you could imagine adding something similar as an option to pdfocr.rb (I'm not fluent in Ruby, but I could try to provide a patch/pull request if there is interest).

Cheers,
Frank

@wodin

wodin commented Jan 17, 2017

#23 and #28 are related to this.
