You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I have several PDFs with some very weird ToUnicode mappings. Some characters get extracted as lowercase instead of uppercase, even though the CID corresponds to the ASCII uppercase version. Unfortunately this breaks later processing steps for these documents.
o) 1. mellékletében foglalt táblázat VII. címében az „1034/2011 és 1035/2011 EU rendeletek” szövegrész helyébe
Extracts as
o) 1. mellékletében foglalt táblázat VII. címében az „1034/2011 és 1035/2011 eU rendeletek” szövegrész helyébe
^
|
Lowercase
Note that with several PDF viewers (e.g. the firefox built-in one) will also copy the wrong text. Chrome, Okular, and poppler in general will capitalize the E in EU. pdftotext from the poppler suite also works OK.
Now why is this? For some reason, the CID for both E and e are mapped to the ASCII code point 101 (lowercase e) in the font.
Why is it handled OK by some extractors? Because this is what the actual operations look like around that part:
As far as I see, handling this case would need some refactoring around show_text, and I'm really not sure how to do it. Probably a fully separate code path for the "simple" and the replacement text use-cases, both of which would call output_character in the end.
I have several PDFs with some very weird ToUnicode mappings. Some characters get extracted as lowercase instead of uppercase, even though the CID corresponds to the ASCII uppercase version. Unfortunately this breaks later processing steps for these documents.
For example I have the following: https://stickman.hu/junk/actualtext_example.pdf
Here, the line
Extracts as
Note that with several PDF viewers (e.g. the firefox built-in one) will also copy the wrong text. Chrome, Okular, and poppler in general will capitalize the E in EU. pdftotext from the poppler suite also works OK.
Now why is this? For some reason, the CID for both
E
ande
are mapped to the ASCII code point 101 (lowercase e) in the font.Why is it handled OK by some extractors? Because this is what the actual operations look like around that part:
The ActualText thing here is described in the PDF standard "14.9.4 Replacement Text", and has a special code path in poppler: https://github.com/freedesktop/poppler/blob/315ab3006fb24bf47b595343e6a3e90995f2a588/poppler/Gfx.cc#L5052-L5059
As far as I see, handling this case would need some refactoring around
show_text
, and I'm really not sure how to do it. Probably a fully separate code path for the "simple" and the replacement text use-cases, both of which would call output_character in the end.P.S. 1: It seems like this guy had a related issue back in the day: https://stackoverflow.com/questions/17737776/pdf-text-extraction-issue-font-capitalization-inconsistencies
P.S. 2: In the end, I might just expose the CID on the
output_character
interface and do the same workaround I did in python: https://github.com/badicsalex/hun_law_py/blob/master/hun_law/extractors/pdf.py#L88-L93P.S. 3: Thanks for taking the time to fix some of the bugs I reported, I really appreciate it.
The text was updated successfully, but these errors were encountered: