PDF Files created with wkhtmltopdf cannot have their outline parsed #312

Open
rtibbles opened this issue Mar 4, 2021 · 2 comments

rtibbles commented Mar 4, 2021

Description

Because of an extant bug in PyPDF2 (py-pdf/pypdf#193), trying to read the outline of a file generated by wkhtmltopdf results in an error. This means that large PDF files generated this way cannot be split up into smaller PDFs.

Potential resolutions

  1. The upstream issue contains a way to bypass the error by patching the library - but it's not clear whether it actually fixes the problem, or just prevents a blocking error. I suspect that it just lets parsing of the PDF continue, but that we would not get the outline data - which we need in order to do the by-chapter and by-subchapter splitting that we are using PyPDF2 for.

  2. Doing a more systematic fix on this would potentially involve forking and diving deeper into PyPDF2, as there appears to be no current maintenance of the library.

  3. There is another Python library https://github.com/pdfminer/pdfminer.six that is actively maintained - but it appears that it only extracts layout metadata. In order to use this, we would have to migrate the outline reading code to use pdfminer.six, but still continue to use PyPDF2 for the actual page splitting.

  4. Another alternative is to use the pdftk binary https://www.pdflabs.com/tools/pdftk-server/ - while there is a Python wrapper library https://github.com/revolunet/pypdftk for this, it has a very limited API and does not expose the tools we need. However, pdftk is able to read the entire outline and dump all the relevant information as a plain-text report (its dump_data output is line-oriented key/value text rather than JSON), which we could then parse in Python. We could then use pdftk to do the splitting as well.

In terms of ongoing maintainability, I think option 4 is probably the most promising bet, as we would be leveraging a widely used open source binary to handle the complex business of PDF parsing. One possible upside is that, whereas PyPDF2 failed to preserve accessibility-focused aspects of PDFs during the splitting, pdftk might do a better job of this.

To leverage this, we would have to fail gracefully when pdftk is missing or errors out, and make the path to the pdftk binary configurable.
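A minimal sketch of what that lookup could look like, assuming a configurable path with a fallback to the system PATH. The names `find_pdftk`, `PdftkNotFoundError` and the `PDFTK_PATH` environment variable are hypothetical, not part of the existing code:

```python
import os
import shutil


class PdftkNotFoundError(RuntimeError):
    """Hypothetical error raised when no usable pdftk executable is found."""


def find_pdftk(explicit_path=None, env_var="PDFTK_PATH"):
    """Return the path to a pdftk executable or raise PdftkNotFoundError.

    Lookup order: explicit argument, then the (assumed) environment
    variable, then whatever `pdftk` resolves to on the PATH.
    """
    candidate = explicit_path or os.environ.get(env_var)
    if candidate:
        # shutil.which also accepts a full path and checks it is executable.
        resolved = shutil.which(candidate)
        if resolved is None:
            raise PdftkNotFoundError(
                "pdftk was configured as {!r} but that is not an executable "
                "file".format(candidate)
            )
        return resolved

    resolved = shutil.which("pdftk")
    if resolved is None:
        raise PdftkNotFoundError(
            "pdftk was not found on PATH; install it or set {}".format(env_var)
        )
    return resolved
```

Failing at lookup time with a message that names the configuration knob should be easier to act on than a bare `FileNotFoundError` from a later subprocess call.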

Next steps

First, we would need to sanity check that pdftk, run from the command line, can handle wkhtmltopdf-generated PDFs, dump their outlines, and split them.

If this does indeed work, moving forward with this strategy would mean removing PyPDF2 and replacing its operations with subprocess calls to the equivalent pdftk operations: https://www.pdflabs.com/docs/pdftk-man-page/

  • The get_toc method would now leverage the pdftk dump_data command: writing the report out to a temporary file in a subprocess, reading that report back into Python, and then cleaning up the temporary file.
  • The split_chapters and split_subchapters methods would now use the pdftk cat command, passing the page ranges parsed from get_toc to generate each new file.

Because we would be deferring to pdftk for the PDF processing, most of the other logic in the PDFParser class could be removed.

@rtibbles
Member Author

Preliminary exploration confirmed that pdftk is able to do the kind of segmenting operations we require on wkhtmltopdf-generated PDFs, so using that seems like the best way forward.


nucleogenesis commented Mar 30, 2021

I've done some digging here and want to update the issue.

> Because of an extant bug in PyPDF2 py-pdf/pypdf#193 trying to read the outline for a file generated in wkhtmltopdf results in an error.

This isn't the issue - the issue is that wkhtmltopdf does not generate meaningful metadata about bookmarks, even when it successfully gleans that they exist during parsing. wkhtmltopdf seems to be able to generate the metadata manually, but only from a file generated and tweaked separately. Ultimately, the only metadata we get from a wkhtmltopdf file is the hierarchical structure of bookmarks/outline items (i.e., which of the chapters are subchapters of others).

Also, that PyPDF2 issue is actually describing an error that happens during instantiation - it just so happens that instantiation calls the "private" method _buildOutline.


So - the actual issue is:

PDFs generated without proper metadata can cause PyPDF2 to throw an error from which we do not gracefully recover - nor do we provide a meaningful and informative error message about why we get the error.

Potential Solutions

1. Just catch the error that is thrown by PyPDF2 and raise a new one to inform the user of why it won't work.

In this case we need to let the user know:

  • You cannot use get_toc at all with this PDF. Using split_chapters or split_subchapters with this PDF requires that you pass your own jsondata value as a param, as the file does not contain sufficient metadata for extraction.
  • If you do know the page ranges to which you want to convert the PDF, then you may use write_pagerange to do so manually.
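A rough sketch of this option, assuming the failure surfaces while the reader is being constructed. `OutlineUnavailableError` and `open_with_outline_check` are hypothetical names, and `reader_factory` stands in for whatever callable builds the PyPDF2 reader; the broad `except` is a placeholder until the exact exception type raised by PyPDF2 is confirmed.

```python
class OutlineUnavailableError(Exception):
    """Hypothetical error raised when a PDF's outline cannot be read."""


OUTLINE_HELP = (
    "Could not read the outline of {path!r}: {reason}. "
    "get_toc cannot be used with this PDF. To use split_chapters or "
    "split_subchapters, pass your own jsondata value; if you already know "
    "the page ranges, use write_pagerange instead."
)


def open_with_outline_check(path, reader_factory):
    """Build a reader via `reader_factory(path)`, translating failures.

    Any exception raised by the factory is re-raised as
    OutlineUnavailableError with guidance for the user; the original
    exception is kept as the cause for debugging.
    """
    try:
        return reader_factory(path)
    except Exception as error:  # assumption: narrow this to PyPDF2's error
        raise OutlineUnavailableError(
            OUTLINE_HELP.format(path=path, reason=error)
        ) from error
```

Chaining with `from error` keeps the original traceback available, so the friendlier message does not hide the underlying PyPDF2 failure.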

Also, this should probably be better documented so that the error can point to the relevant part of the documentation. For example, this is the data structure output from get_toc(subchapters=False); we should have a version with subchapters=True and document this for the user.

2. Wrap pdftk - breaks the current PDFParser API

PDFParser does much of its heavy lifting using the PyPDF2 tools. Changing this won't be too hard. I've already done so in #324 to some degree and it won't be too much work to make write_pagerange use pdftk as well.

However, there are some things to consider:

  • pdftk isn't the easiest thing to install on all OSs. It isn't available in the published Debian/Ubuntu repositories through apt, and the pdftk-java wrapper for it is several patch versions behind; those later versions resolve an issue with the version used in pdftk-java. So Linux users have to download the binary for 2.3.3 or later from GitLab. It also isn't available on Homebrew for Mac.
  • This will break the PDFParser API because it currently exposes a property pdf which is an instance of PyPDF2.PdfFileReader, so anybody who has written code using that property will see their chefs break. This may not be much of an issue, but it's worth considering how we can deprecate it gracefully if we go this route.
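One way the `pdf` property could be deprecated gracefully is to keep it for a release cycle while emitting a warning. This is only a sketch: `PDFParserSketch` is a hypothetical stand-in for the real class, and it assumes the legacy reader could still be constructed during the transition.

```python
import warnings


class PDFParserSketch:
    """Hypothetical stand-in for PDFParser, showing only the `pdf` property."""

    def __init__(self, legacy_reader=None):
        # Assumption: the PyPDF2 reader is still available for a while.
        self._legacy_reader = legacy_reader

    @property
    def pdf(self):
        warnings.warn(
            "PDFParser.pdf is deprecated and will be removed once pdftk "
            "replaces PyPDF2; use get_toc, split_chapters, "
            "split_subchapters or write_pagerange instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
        return self._legacy_reader
```

Existing chefs would keep working but see a warning that points at their own line of code, which gives them time to move off the property before it is removed.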

3. Fork PyPDF2 and comment the error out

  • This gives us another repo to manage (boo) for something really simple and another package to release and care about.

@rtibbles rtibbles added this to the 0.7 milestone Jul 26, 2021
@rtibbles rtibbles removed this from the 0.7 milestone Sep 16, 2022