Word mismatch between HTML and PDF for visual linker #12
Could https://spacy.io/api/goldparse#align be useful for aligning two lists of words?
Good catch! We will take a look at this function. Thanks!
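To make the alignment idea concrete, here is a minimal sketch that pairs up two word lists using only the Python standard library (`difflib`), not the spaCy helper linked above. The function name `align_words` and both word lists are made up for illustration; they are not taken from Fonduer or from the test document.

```python
from difflib import SequenceMatcher


def align_words(html_words, pdf_words):
    """Return (html_index, pdf_index) pairs for words that match in order.

    Words that occur in only one of the lists are left unpaired.
    """
    matcher = SequenceMatcher(a=html_words, b=pdf_words, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            pairs.append((block.a + offset, block.b + offset))
    return pairs


# Hypothetical word lists: the PDF side carries list numbers the HTML lacks.
html_words = ["Sample", "Markdown", "First", "item", "Second", "item"]
pdf_words = ["Sample", "Markdown", "1.", "First", "item", "2.", "Second", "item"]

pairs = align_words(html_words, pdf_words)
matched_pdf = {pdf_index for _, pdf_index in pairs}
unmatched_pdf = [i for i in range(len(pdf_words)) if i not in matched_pdf]
```

The unpaired PDF indices are the words that a linker would have to skip or handle separately.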
@lukehsiao What exactly was the issue? md.xml.txt is the output of
On the other hand, according to #433, the
I don't see any problem from the visual linker perspective. Has the original issue been resolved over the years, or am I mistaken?
Hmmm, interesting. This issue is old enough that I can't recall the specifics, nor do I have an example that demonstrates it. I should've added more specifics when this issue was created. It is definitely possible this issue was resolved over time and just was not closed... @senwu, do you have any recollection? If not, I'd suggest we close this as stale; we can reopen it if we find an issue and have specific examples in the future.
Hi @HiromuHota, I think this issue is about word mismatch between HTML and PDF documents. In this case, some words appear in the PDF but not in the HTML, such as Sen
Thank you guys for your recollections. The visual linker is supposed to link words between HTML and PDF. If I could take what @senwu suggested one step further, I'd suggest embedding the coordinates into the HTML (maybe we should call it XML by then). As a result,
With this change, users would be expected to embed coordinates in the HTML on their own (they can use the visual linker, but it's totally up to them). FYI, one of my use cases is to extract information from scanned documents. So the source is PDF (more specifically, a scanned image in PDF, or a non-searchable PDF).
@HiromuHota, you are exactly right! My comments mainly focus on PDF documents as input data. If the input data is HTML, it can be very hard to find the exact words in some cases. I agree with your suggested solution of embedding the coordinates into the HTML; it's actually the solution we want to have. It has multiple benefits: it won't rely on Adobe, it supports scanned PDFs nicely, and it allows more control by the Fonduer parser.
I've tested pdftotree for the first time and learned that
<html><div id=1>
<header
char='S a m p l e M a r k d o w n ',
top='37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 37.94140616000004 ',
left='35.0 48.34765616 60.34765616 80.3398436 93.68749976000001 100.35546848 111.00781232 117.00781232 139.66015615999999 151.66015615999999 162.3125 175.66015615999999 189.00781231999997 201.00781231999997 218.3398436 ',
bottom='61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 61.94140616000004 ',
right='48.34765616 60.34765616 80.3398436 93.68749976 100.35546848000001 111.00781232 117.00781232 139.66015615999999 151.66015615999999 162.3125 175.66015615999999 189.00781231999997 201.00781231999997 218.33984359999997 231.68749975999998 '
>Sample Markdown </header>
I think we can just take this approach: embed the coordinates (top, left, bottom, right) in the corresponding HTML element and let Fonduer parse them. Let's make this approach a first-class citizen in Fonduer, since we have been doing this as a custom step.
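As a rough illustration of what parsing such embedded coordinates could look like, the sketch below reads the space-separated `top`, `left`, `bottom`, and `right` attributes with the standard-library `html.parser` and builds one box per character. This is not Fonduer's parser: the class `CoordinateParser`, the helper `bounding_box`, and the shortened sample element are all assumptions made for the example.

```python
from html.parser import HTMLParser

COORD_ATTRS = ("top", "left", "bottom", "right")


class CoordinateParser(HTMLParser):
    """Collect per-character boxes from elements carrying coordinate attributes."""

    def __init__(self):
        super().__init__()
        self.elements = []  # (tag, [(top, left, bottom, right), ...])

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Skip elements that do not carry all four coordinate attributes.
        if not all(attr_map.get(name) for name in COORD_ATTRS):
            return
        columns = [
            [float(value) for value in attr_map[name].split()]
            for name in COORD_ATTRS
        ]
        if len({len(column) for column in columns}) != 1:
            raise ValueError(f"<{tag}> has coordinate lists of different lengths")
        self.elements.append((tag, list(zip(*columns))))


def bounding_box(boxes):
    """Return the (top, left, bottom, right) box enclosing all character boxes."""
    tops, lefts, bottoms, rights = zip(*boxes)
    return (min(tops), min(lefts), max(bottoms), max(rights))


# Shortened, made-up element in the same shape as the pdftotree output above.
sample = (
    "<header char='H i' top='10.0 10.0' left='5.0 12.0' "
    "bottom='20.0 20.0' right='12.0 15.0'>Hi</header>"
)
parser = CoordinateParser()
parser.feed(sample)
tag, boxes = parser.elements[0]
element_box = bounding_box(boxes)
```

The sketch works from the coordinate lists rather than the `char` attribute, because a literal space character in `char` is hard to tell apart from the separator.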
According to https://documents.icar-us.eu/documents/2016/12/report-on-file-formats-for-hand-written-text-recognition-htr-material.pdf, there are 4 major file formats for OCR:
I'd propose to support hOCR because:
I'll open a separate feature request that should eventually resolve this issue.
Awesome! I think your suggestion (hOCR) is super great!
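For readers unfamiliar with hOCR: it is plain HTML in which each recognized word is typically a `span` with class `ocrx_word` whose `title` attribute holds a `bbox x0 y0 x1 y1` entry. A minimal sketch of reading those word boxes with the standard library follows. The class name `HocrWordParser` and the sample markup are made up for illustration, and the sketch assumes word spans contain no nested `span` elements.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        match = BBOX_RE.search(attr_map.get("title") or "")
        if "ocrx_word" in classes and match:
            self._bbox = tuple(int(group) for group in match.groups())
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes no nested spans inside a word span.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


# Made-up hOCR fragment: one line containing two words.
sample = (
    "<span class='ocr_line' title='bbox 36 92 235 116'>"
    "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Sample</span> "
    "<span class='ocrx_word' title='bbox 109 92 235 116; x_wconf 93'>Markdown</span>"
    "</span>"
)
parser = HocrWordParser()
parser.feed(sample)
words = parser.words
```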
In the test md document we use an ordered HTML list, which renders as numbers in the PDF. This is causing a mismatch of words when doing the visual parse. The list of words in that document are
Notice that in the PDF words, the 1, 2, and 3 appear, whereas they do not in the HTML.
Ideally this will be resolved when we switch to pdftotree and the new parser.
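The mismatch described here can be reproduced in miniature. In the sketch below, the text of an ordered HTML list is extracted with the standard-library `html.parser` and compared against a hypothetical PDF word list. The markup, the `TextExtractor` class, and the PDF words are all invented for illustration; the real PDF tokenization may differ.

```python
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect whitespace-separated words from all text nodes."""

    def __init__(self):
        super().__init__()
        self.words = []

    def handle_data(self, data):
        self.words.extend(data.split())


# The list numbers are generated at render time; they are not in the markup.
html_source = "<ol><li>First item</li><li>Second item</li><li>Third item</li></ol>"
extractor = TextExtractor()
extractor.feed(html_source)
html_words = extractor.words

# Hypothetical words read back from the rendered PDF.
pdf_words = ["1", "First", "item", "2", "Second", "item", "3", "Third", "item"]

# Multiset difference: words the PDF has more of than the HTML.
only_in_pdf = sorted((Counter(pdf_words) - Counter(html_words)).elements())
```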