-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PDF Files created with wkhtmltopdf cannot have their outline parsed #312
Comments
Preliminary exploration confirmed that pdftk is able to do the kind of segmenting operations we require on wkhtmltopdf generated PDFs, so using that seems like the best way forward. |
I've done some digging here and want to update the issue.
This isn't the issue - the issue is that Also that PyPDF2 issue is actually describing an issue that happens during instantiation - it just so happens that instantiation manages to call the "private" method So - the actual issue is: PDFs generated without proper metadata can cause PyPDF2 to throw an error from which we do not gracefully recover - nor do we provide a meaningful and informative error message about why we get the error. Potential Solutions 1. Just catch the error that is thrown by PyPDF2 and raise a new one to inform the user of why it won't work. In this case we need to let the user know:
Also - likely this should be better documented so that the error can point to a part of the documentation. For example, this is the data structure output from 2. Wrap pdftk - breaks the current PDFParser API PDFParser does much of its heavy lifting using the PyPDF2 tools. Changing this won't be too hard. I've already done so in #324 to some degree and it won't be too much work to make However, there are some things to consider:
3. Fork PyPDF2 and comment the error out
|
Description
Because of an extant bug in PyPDF2 py-pdf/pypdf#193 trying to read the outline for a file generated in wkhtmltopdf results in an error. This means that large PDF files generated in this way cannot be split up into smaller PDFs.
Potential resolutions
The issue contains a way to bypass this issue by patching the library - but it's not clear whether it actually fixes the problem, or just prevents a blocking error. I suspect that it just lets the PDF to continue to be parsed, but that we would not get the outline data - which we need in order to do the by chapter and subchapter splitting that we are using PyPDF2 for.
Doing a more systematic fix on this would potentially involve forking and diving deeper into PyPDF2, as there appears to be no current maintenance of the library.
There is another Python library https://github.com/pdfminer/pdfminer.six that is actively maintained - but it appears that it only extracts layout metadata. In order to use this, we would have to migrate the outline reading code to use pdfminer.six, but still continue to use PyPDF2 for the actual page splitting.
Another alternative is to use the pdftk binary https://www.pdflabs.com/tools/pdftk-server/ - while there is a Python wrapper library https://github.com/revolunet/pypdftk for this, it has a very limited API and does not expose the tools we need. However, pdftk is able to read the entire outline and dump all the relevant information to JSON, which we could then read back into Python. We could then also use pdftk to do the splitting as well.
In terms of ongoing maintainability, I think option 4 is probably the most promising bet, as we are leveraging a widely used open source binary to handle the complex business of PDF parsing. One possible upside is that whereas PyPDF2 failed to preserve accessibility focused aspects of PDFs during the splitting, pdftk might do a better job of this.
We would have to add good graceful failure and the ability to configure the path to the pdftk binary in order to leverage this.
Next steps
First we would need to sanity check that pdftk run from the command line can handle wkhtmltopdf generated PDFs and spit out their outlines and split them.
If this does indeed work, to move forward with this strategy, we would have to remove PyPDF2 and replace operations with subprocess calls to operations in pdftk: https://www.pdflabs.com/docs/pdftk-man-page/
dump_data
command. Writing the JSON out to a temporary file in a subprocess, reading that JSON back into Python and then cleaning up the temporary file.cat
command and pass the page ranges parsed fromget_toc
to generate the new file.Because of deferring to pdftk for the PDF processing, most of the other logic in the
PDFParser
class could be removed.The text was updated successfully, but these errors were encountered: