Duplicate text and table in the extraction result #121

Open
yetnikoff opened this issue Jan 17, 2022 · 0 comments

Describe the bug
The first and second pages of the output contain the same text. Pages 4 and 5 are also identical to each other.

To Reproduce
Steps to reproduce the behavior:

  1. Download the PDF: AIMCO.pdf
  2. Run: pdftotree.parse(r"\PATH\TO\AIMCO-2019.pdf", html_path=r"\PATH\TO\output.html", visualize=False)
  3. Check the hOCR output.
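
To make step 3 easier to repeat, here is a minimal sketch that reports which pages of an hOCR file have identical text. It is not part of pdftotree; it assumes each page in the output is wrapped in a `<div class="ocr_page">` element, as the hOCR format describes. The names `PageTextCollector` and `find_duplicate_pages` are made up for this example.

```python
import hashlib
from collections import defaultdict
from html.parser import HTMLParser


class PageTextCollector(HTMLParser):
    """Collect the visible text of each <div class="ocr_page"> element."""

    def __init__(self):
        super().__init__()
        self.pages = []   # one list of text fragments per page
        self._depth = 0   # div nesting depth inside the current page; 0 = outside

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth == 0:
            # Assumption: a page starts at a div carrying the "ocr_page" class.
            if "ocr_page" in classes:
                self.pages.append([])
                self._depth = 1
        else:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text, with runs of whitespace collapsed.
        if self._depth > 0 and data.strip():
            self.pages[-1].append(" ".join(data.split()))


def find_duplicate_pages(hocr_html):
    """Return lists of 1-based page numbers whose normalized text is identical."""
    collector = PageTextCollector()
    collector.feed(hocr_html)
    groups = defaultdict(list)
    for number, fragments in enumerate(collector.pages, start=1):
        digest = hashlib.sha256(" ".join(fragments).encode("utf-8")).hexdigest()
        groups[digest].append(number)
    return [pages for pages in groups.values() if len(pages) > 1]
```

Reading the generated file with `open(path, encoding="utf-8").read()` and passing the string to `find_duplicate_pages` should list groups of page numbers that share the same text, if the page markup matches the assumption above.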

Expected behavior
Each page of the output file should contain its own text and tables.

Error Logs/Screenshots

Environment (please complete the following information):

  • OS: Windows 10, version 20H2
  • pdftotree Version: v0.5.0
  • pdfminer.six Version: 20211012

Additional context
If this behavior is expected, would it be possible to have a variable that keeps track of the text and tables already extracted? (I am not very experienced in programming.)
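
A rough sketch of the kind of tracking meant here, written as a post-processing step on text that has already been extracted. This is not how pdftotree works internally, and `drop_repeated_blocks` is a hypothetical helper. It would also drop content that legitimately repeats, such as page headers, so it is a workaround at best.

```python
def drop_repeated_blocks(blocks):
    """Return blocks in their original order, skipping text that was already seen."""
    seen = set()     # normalized text of every block kept so far
    unique = []
    for block in blocks:
        # Collapse whitespace so that spacing differences do not hide a repeat.
        key = " ".join(block.split())
        if key in seen:
            continue
        seen.add(key)
        unique.append(block)
    return unique
```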
