Fonts with custom encoding #85
Comments
Which readers does copy and paste work in?
Okular (https://okular.kde.org/, https://github.com/KDE/okular), which is the default PDF reader in KDE. With the file I provided, Google Chrome copies only garbage, but Okular copies the actual content.
I think to fix this we need to parse the CFF fonts.
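As a rough idea of what the first step of parsing a CFF font program involves, here is a minimal Python sketch that reads the 4-byte CFF header and the Name INDEX that follows it. The function names and the sample bytes are made up for illustration, and this is not how pdf-extract is implemented; a real fix would also need the charset and encoding tables, which this sketch does not touch.

```python
import struct

def parse_cff_header(data):
    """Read the 4-byte CFF header: major, minor, hdrSize, offSize."""
    major, minor, hdr_size, off_size = struct.unpack_from(">BBBB", data, 0)
    return {"major": major, "minor": minor,
            "hdr_size": hdr_size, "off_size": off_size}

def parse_index(data, pos):
    """Read one CFF INDEX starting at pos; return (items, position after it)."""
    (count,) = struct.unpack_from(">H", data, pos)
    pos += 2
    if count == 0:
        # An empty INDEX consists of the 2-byte count only.
        return [], pos
    off_size = data[pos]
    pos += 1
    offsets = [
        int.from_bytes(data[pos + i * off_size: pos + (i + 1) * off_size], "big")
        for i in range(count + 1)
    ]
    pos += (count + 1) * off_size
    # Offsets are relative to the byte preceding the object data.
    base = pos - 1
    items = [data[base + offsets[i]: base + offsets[i + 1]] for i in range(count)]
    return items, base + offsets[-1]

# Hypothetical font data: a header followed by a Name INDEX with one entry.
sample = (
    b"\x01\x00\x04\x01"      # header: version 1.0, hdrSize 4, offSize 1
    b"\x00\x01"              # Name INDEX: count = 1
    b"\x01"                  # offSize = 1
    b"\x01\x08"              # offsets: 1 and 8 (7 bytes of data)
    b"MyFont1"               # the font name
)

header = parse_cff_header(sample)
names, next_pos = parse_index(sample, header["hdr_size"])
```

The remaining structures (Top DICT INDEX, String INDEX, charset, encoding) follow the Name INDEX, and the INDEX reader above could be reused for the first two.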
My knowledge about fonts is very limited, but if I can help with anything, please tell me. Thanks
Hello, nice library. It is very useful, and I had no issues until I found a weird PDF. I don't know if it's an edge case or something common, because I'm not a PDF expert.
Using pdffonts this info is shown:
If you open the PDF with a reader, you can see the text properly rendered. With some readers even copy and paste works, but others just copy strange characters.
As far as I know (from reading and searching around), a custom encoding means the font's character codes do not follow a standard encoding, and this is why some readers copy garbage: they are copying the raw bytes, which are not readable outside the context of the PDF. But this is just my guess.
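To illustrate the guess above, here is a small Python sketch of how a custom encoding can make raw bytes unreadable. It assumes the remapping is expressed as a PDF-style Differences list (a start code followed by glyph names); in the attached file the mapping may instead live inside the embedded font program. The glyph table, the function names, and the sample bytes are all made up for the example.

```python
# Illustrative only: a tiny subset of glyph-name -> Unicode mappings.
# The real Adobe Glyph List is much larger.
GLYPH_TO_UNICODE = {"H": "H", "e": "e", "l": "l", "o": "o", "space": " "}

def build_code_map(differences):
    """Expand a Differences-style list into {byte code: glyph name}.

    An integer sets the current code; each glyph name that follows is
    assigned to the current code, which then advances by one.
    """
    code_map = {}
    code = 0
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            code_map[code] = item
            code += 1
    return code_map

def decode(raw, code_map):
    """Map each byte to a glyph name, then to Unicode (U+FFFD if unknown)."""
    return "".join(
        GLYPH_TO_UNICODE.get(code_map.get(b, ""), "\ufffd") for b in raw
    )

# Hypothetical custom encoding: codes 1..5 are assigned arbitrary glyphs.
differences = [1, "H", "e", "l", "o", "space"]
raw = bytes([1, 2, 3, 3, 4, 5, 1, 2, 3, 3, 4])

naive = raw.decode("latin-1")                      # raw bytes copied as-is
proper = decode(raw, build_code_map(differences))  # bytes mapped via the encoding
```

Here `naive` is a run of control characters, while `proper` is readable text. A reader that skips the font's own mapping ends up with something like `naive`.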
I think that this is the same case https://community.adobe.com/t5/acrobat-discussions/strange-font-encoding-in-pdf-files/td-p/12472215
It seems to affect mainly old files (mine is 20+ years old).
Other people are getting the same issue: kermitt2/grobid#518
When using pdf-extract, this is the output I get:
And a bunch of lines like this. The text returned is just bytes in some encoding that are not readable.
This is a sample:
I cannot provide you the whole document but I'm attaching the first page so you can reproduce the error.
page.pdf
I will investigate more and if I find anything useful I will put it here.
Thank you