I have had good results by converting a pdf to a series of svg (scalable vector graphics; an xml format) files with the open source tool mupdf. I then use an xml parser (e.g. Beautiful Soup) to combine all of the text, text formatting, text position, page metadata, and document metadata into a pandas DataFrame.
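A minimal sketch of that pipeline, assuming `mutool` is on the PATH and that text is emitted as `<text>` elements (the `-O text=text` SVG option; mupdf can otherwise render text as paths). The attribute names are what a typical mupdf SVG exposes; adjust them to whatever markup your mupdf version actually produces:

```python
# Requires: pip install beautifulsoup4 lxml pandas
import subprocess
from pathlib import Path

import pandas as pd
from bs4 import BeautifulSoup

# Convert each pdf page to its own svg file (page1.svg, page2.svg, ...).
subprocess.run(
    ["mutool", "convert", "-o", "page%d.svg", "-O", "text=text", "input.pdf"],
    check=True,
)

rows = []
pages = sorted(Path(".").glob("page*.svg"),
               key=lambda p: int(p.stem.removeprefix("page")))
for page_num, svg_path in enumerate(pages, start=1):
    soup = BeautifulSoup(svg_path.read_text(), "xml")  # lxml-backed parser
    for text_el in soup.find_all("text"):
        rows.append({
            "page": page_num,
            "text": text_el.get_text(),
            "x": float(text_el.get("x", 0)),
            "y": float(text_el.get("y", 0)),  # baseline y in svg coordinates
            "font_family": text_el.get("font-family"),
            "font_size": text_el.get("font-size"),
            "font_weight": text_el.get("font-weight"),
        })

text_df = pd.DataFrame(rows)
```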
I can create an additional pandas DataFrame with the page coordinates of each <path> from the svg file and join it with the text DataFrame so that I can identify and tag the specific text that was either struck out or underlined, a critical feature in my use case.
Using numpy to vectorize most of these operations, I can generate the final DataFrame for a 150-page all-text pdf with abundant text underlines and strike-outs in about one second on a consumer laptop.
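A sketch of that matching step, assuming a `path_df` with one horizontal segment per row (hypothetical columns `x0`, `x1`, `y`, `page`) already parsed from each `<path>`'s `d` attribute; the width estimate, tolerances, and line offsets below are illustrative guesses, not values from the issue:

```python
import numpy as np

def tag_lines(text_df, path_df, tol=2.0):
    """Flag text runs crossed by a horizontal segment (single page)."""
    size = text_df["font_size"].astype(float).to_numpy()
    tx = text_df["x"].to_numpy()
    ty = text_df["y"].to_numpy()  # baseline; svg y grows downward
    # Rough horizontal extent of each run (~0.5 * font size per character).
    tw = size * 0.5 * text_df["text"].str.len().to_numpy()

    px0 = path_df["x0"].to_numpy()
    px1 = path_df["x1"].to_numpy()
    py = path_df["y"].to_numpy()

    # Broadcast text rows against path rows: (n_text, n_paths) boolean grids.
    overlap = (px0[None, :] < (tx + tw)[:, None]) & (px1[None, :] > tx[:, None])
    near_baseline = np.abs(py[None, :] - ty[:, None]) < tol
    near_midline = np.abs(py[None, :] - (ty - 0.35 * size)[:, None]) < tol

    text_df["underlined"] = (overlap & near_baseline).any(axis=1)
    text_df["struck_out"] = (overlap & near_midline).any(axis=1)
    return text_df

# Match page by page so segments only tag text on their own page:
# tagged = pd.concat(
#     tag_lines(t.copy(), path_df[path_df["page"] == p])
#     for p, t in text_df.groupby("page")
# )
```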
It is then relatively straightforward to construct text-formatting features and base a document hierarchy on them.
Combinations of the following text-formatting features can be used to deduce document hierarchy (a ranking sketch follows the list):
Case: UPPER > Title Case > Sentence > lower
Font Size: Large > Small
Font Weight: Bold > Italic > Normal
Underline: Underline > No Underline
Line Spacing: Large > Small
Alignment: Centered > Left
Indentation: No Indent > Indent
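A minimal sketch of combining those features into hierarchy levels, assuming `text_df` carries one row per line with the hypothetical columns named below; the ordinal score mappings are illustrative, not prescribed here:

```python
import pandas as pd

CASE_RANK = {"UPPER": 3, "Title": 2, "Sentence": 1, "lower": 0}
WEIGHT_RANK = {"bold": 2, "italic": 1, "normal": 0}

def hierarchy_levels(df: pd.DataFrame) -> pd.Series:
    # Map each feature to an ordinal score, higher = more heading-like.
    features = pd.DataFrame({
        "case": df["case"].map(CASE_RANK),
        "size": df["font_size"].astype(float),
        "weight": df["font_weight"].map(WEIGHT_RANK).fillna(0),
        "underline": df["underlined"].astype(int),
        "spacing": df["line_spacing"].astype(float),
        "centered": (df["alignment"] == "center").astype(int),
        "no_indent": (df["indent"] == 0).astype(int),
    })
    # Each distinct feature combination becomes a candidate hierarchy level,
    # ranked lexicographically from most to least heading-like.
    cols = list(features.columns)
    levels = (features.drop_duplicates()
                      .sort_values(cols, ascending=False)
                      .reset_index(drop=True))
    levels["level"] = levels.index
    return features.merge(levels, on=cols, how="left")["level"]

# text_df["level"] = hierarchy_levels(text_df)
```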
clayms changed the title from "Enhancement using svg format to get underlined and struck-out text formatting" to "Enhancement using pdf-to-svg to get underlined and struck-out text formatting" on Aug 21, 2018.
See HazyResearch/fonduer#111 (comment) for an example pdf.
The mutool draw command described there converts the pdf to html, but it misses the abundant strikeouts and underlines: all of the text that is clearly struck out appears as regularly formatted text in the html output.