I have had good results by converting a pdf to a series of svg (scalable vector graphics; an xml format) files with the open source tool mupdf. I then use an xml parser (e.g. Beautiful Soup) to combine all of the text, text formatting, text position, page metadata, and document metadata into a pandas DataFrame.
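A minimal sketch of that pipeline, assuming `mutool` is on the PATH and that text is emitted as `<text>` elements (the `-O text=text` SVG option; mupdf can otherwise render text as paths). The attribute names are what a typical mupdf SVG exposes; adjust them to whatever markup your mupdf version actually produces:

```python
# Requires: pip install beautifulsoup4 lxml pandas
import subprocess
from pathlib import Path

import pandas as pd
from bs4 import BeautifulSoup

# Convert each pdf page to its own svg file (page1.svg, page2.svg, ...).
subprocess.run(
    ["mutool", "convert", "-o", "page%d.svg", "-O", "text=text", "input.pdf"],
    check=True,
)

rows = []
pages = sorted(Path(".").glob("page*.svg"),
               key=lambda p: int(p.stem.removeprefix("page")))
for page_num, svg_path in enumerate(pages, start=1):
    soup = BeautifulSoup(svg_path.read_text(), "xml")  # lxml-backed parser
    for text_el in soup.find_all("text"):
        rows.append({
            "page": page_num,
            "text": text_el.get_text(),
            "x": float(text_el.get("x", 0)),
            "y": float(text_el.get("y", 0)),  # baseline y in svg coordinates
            "font_family": text_el.get("font-family"),
            "font_size": text_el.get("font-size"),
            "font_weight": text_el.get("font-weight"),
        })

text_df = pd.DataFrame(rows)
```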
I can create an additional pandas DataFrame with the page coordinates of each <path> from the svg file and join it with the text DataFrame so that I can identify and tag the specific text that was either struck out or underlined, a critical feature in my use case.
Using numpy to vectorize most of these operations, I can generate the final DataFrame for a 150-page all-text pdf with abundant text underlines and strike-outs in about one second on a consumer laptop.
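A sketch of that matching step, assuming a `path_df` with one horizontal segment per row (hypothetical columns `x0`, `x1`, `y`, `page`) already parsed from each `<path>`'s `d` attribute; the width estimate, tolerances, and line offsets below are illustrative guesses, not values from the issue:

```python
import numpy as np

def tag_lines(text_df, path_df, tol=2.0):
    """Flag text runs crossed by a horizontal segment (single page)."""
    size = text_df["font_size"].astype(float).to_numpy()
    tx = text_df["x"].to_numpy()
    ty = text_df["y"].to_numpy()  # baseline; svg y grows downward
    # Rough horizontal extent of each run (~0.5 * font size per character).
    tw = size * 0.5 * text_df["text"].str.len().to_numpy()

    px0 = path_df["x0"].to_numpy()
    px1 = path_df["x1"].to_numpy()
    py = path_df["y"].to_numpy()

    # Broadcast text rows against path rows: (n_text, n_paths) boolean grids.
    overlap = (px0[None, :] < (tx + tw)[:, None]) & (px1[None, :] > tx[:, None])
    near_baseline = np.abs(py[None, :] - ty[:, None]) < tol
    near_midline = np.abs(py[None, :] - (ty - 0.35 * size)[:, None]) < tol

    text_df["underlined"] = (overlap & near_baseline).any(axis=1)
    text_df["struck_out"] = (overlap & near_midline).any(axis=1)
    return text_df

# Match page by page so segments only tag text on their own page:
# tagged = pd.concat(
#     tag_lines(t.copy(), path_df[path_df["page"] == p])
#     for p, t in text_df.groupby("page")
# )
```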
It is then relatively straightforward to construct text-formatting features and base a document hierarchy on them.
Combinations of the following text-formatting features can be used to deduce document hierarchy (a ranking sketch follows the list):
Case: UPPER > Title Case > Sentence > lower
Font Size: Large > Small
Font Weight: Bold > Italic > Normal
Underline: Underline > No Underline
Line Spacing: Large > Small
Alignment: Centered > Left
Indentation: No Indent > Indent
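A minimal sketch of combining those features into hierarchy levels, assuming `text_df` carries one row per line with the hypothetical columns named below; the ordinal score mappings are illustrative, not prescribed here:

```python
import pandas as pd

CASE_RANK = {"UPPER": 3, "Title": 2, "Sentence": 1, "lower": 0}
WEIGHT_RANK = {"bold": 2, "italic": 1, "normal": 0}

def hierarchy_levels(df: pd.DataFrame) -> pd.Series:
    # Map each feature to an ordinal score, higher = more heading-like.
    features = pd.DataFrame({
        "case": df["case"].map(CASE_RANK),
        "size": df["font_size"].astype(float),
        "weight": df["font_weight"].map(WEIGHT_RANK).fillna(0),
        "underline": df["underlined"].astype(int),
        "spacing": df["line_spacing"].astype(float),
        "centered": (df["alignment"] == "center").astype(int),
        "no_indent": (df["indent"] == 0).astype(int),
    })
    # Each distinct feature combination becomes a candidate hierarchy level,
    # ranked lexicographically from most to least heading-like.
    cols = list(features.columns)
    levels = (features.drop_duplicates()
                      .sort_values(cols, ascending=False)
                      .reset_index(drop=True))
    levels["level"] = levels.index
    return features.merge(levels, on=cols, how="left")["level"]

# text_df["level"] = hierarchy_levels(text_df)
```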
clayms changed the title from "Enhancement using svg format to get underlined and struck-out text formatting" to "Enhancement using pdf-to-svg to get underlined and struck-out text formatting" on Aug 21, 2018.
See HazyResearch/fonduer#111 (comment) for an example pdf.
The mutool draw command described there converts the pdf to html, but it misses the abundant strikeouts and underlines: all of the text that is clearly struck out appears as regularly formatted text in the html output.