Thank you for this nice tool! I think it has one issue that makes it less than ideal for an important use case: PDFs from document scanners usually consist of exactly one image per page. It would be great if pdfocr could cater for those by preserving the original image data and simply adding the OCR text layer.
Currently, pdfocr renders the original pages to images using pdftoppm, which creates very large image files and still degrades the quality of the output PDF. For the use case described above it would be nicer to use "pdfimages -all" to extract the original page image data and send that through tesseract (more or less) directly.
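To make the idea concrete, here is a minimal sketch of that pipeline. It is not the prototype script itself, only an illustration under some assumptions: poppler's `pdfimages` and `pdfunite` and a tesseract build with PDF output are installed, every page holds exactly one image, and the extracted format is one tesseract can read (JPEG, PNG, or TIFF; JBIG2 or CCITT streams would likely need a conversion step). The function name `ocr_scanned_pdf` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: add an OCR text layer to a scanned PDF without re-rendering pages.
# Assumes pdfimages and pdfunite (poppler-utils) and tesseract are on PATH.
set -eu

# ocr_scanned_pdf is a hypothetical helper name, not part of pdfocr.
ocr_scanned_pdf() {
    input="$1"
    output="$2"
    workdir="$(mktemp -d)"

    # Extract the embedded page images in their native format
    # (page-000.jpg, page-001.png, ...) instead of rasterizing the pages.
    pdfimages -all "$input" "$workdir/page"

    # OCR each image; the "pdf" config makes tesseract write <base>.pdf
    # containing the image plus an invisible text layer.
    # The glob is expanded once, so the new .pdf files are not revisited.
    for img in "$workdir"/page-*; do
        tesseract "$img" "${img%.*}" pdf
    done

    # Join the per-page PDFs; the glob sorts them by their zero-padded index.
    pdfunite "$workdir"/page-*.pdf "$output"
    rm -rf "$workdir"
}
```

It would be called as `ocr_scanned_pdf scan.pdf scan-ocr.pdf`. Whether tesseract embeds the image bytes unchanged probably depends on the image format and the tesseract version, so the output size is worth checking against the input.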
I have implemented a prototype of this as a bash script here: http://cern.ch/fsiegert/tmp/pdfocr.sh
It's definitely not complete and probably doesn't handle all types of documents that can come from different scanners yet (I have only tested it using a document from one scanner I had available). But I thought I'd contact you and ask whether you could imagine adding something similar as an option to pdfocr.rb (I'm not fluent in Ruby, but I could try to provide a patch/pull request if there is interest).
Cheers,
Frank