Duplicate text and table in the extraction result #121

yetnikoff · 2022-01-17T06:00:05Z

Describe the bug
The first and second pages of the output contain the same text. Pages 4 and 5 are identical to each other as well.

To Reproduce
Steps to reproduce the behavior:

Download the PDF: AIMCO.pdf
Execute the code: pdftotree.parse(r"\PATH\TO\AIMCO-2019.pdf", html_path=r"\PATH\TO\output.html", visualize=False)
Check the hOCR output
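
To make the last step easier to verify, here is a minimal sketch that reports which pages of the hOCR output carry identical text. It assumes each page is wrapped in a div with class "ocr_page", as the hOCR format specifies; the names PageTextCollector and find_duplicate_pages are made up for this example and are not part of pdftotree.

```python
from html.parser import HTMLParser


class PageTextCollector(HTMLParser):
    """Collect the visible text of each hOCR page (a div with class "ocr_page")."""

    def __init__(self):
        super().__init__()
        self.pages = []   # one list of text fragments per page
        self._depth = 0   # div nesting depth inside the current page; 0 = outside any page

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth == 0 and "ocr_page" in classes:
            self.pages.append([])
            self._depth = 1
        elif self._depth > 0:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0 and data.strip():
            self.pages[-1].append(data.strip())


def find_duplicate_pages(hocr_html):
    """Return (first_page, repeated_page) pairs, 1-indexed, for pages with identical text."""
    collector = PageTextCollector()
    collector.feed(hocr_html)
    first_seen = {}
    duplicates = []
    for number, fragments in enumerate(collector.pages, start=1):
        key = " ".join(fragments)
        if not key:
            continue  # ignore empty pages so blank pages are not reported as duplicates
        if key in first_seen:
            duplicates.append((first_seen[key], number))
        else:
            first_seen[key] = number
    return duplicates
```

To use it, read the generated HTML file into a string (for example with open(path, encoding="utf-8").read()) and pass that string to find_duplicate_pages. The comparison is on text only, so two pages with the same words but different layout would also be reported.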

Expected behavior
Each page of the output file should contain its own text and tables.

Error Logs/Screenshots

Environment (please complete the following information):

OS: Windows 10, version 20H2
pdftotree Version: v0.5.0
pdfminer.six Version: 20211012

Additional context
If this behavior is expected, would it be possible to have a variable that keeps track of the text and tables already extracted? (I am not very experienced in programming.)
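
To illustrate the idea of such a variable, here is a rough sketch that remembers what has already been emitted and skips repeats. The (page, kind, content) tuple layout and the function name drop_repeated_blocks are hypothetical, chosen only for this example; this is not how pdftotree stores its results internally.

```python
def drop_repeated_blocks(blocks):
    """Keep the first occurrence of each (kind, content) block, preserving order.

    `blocks` is an iterable of (page, kind, content) tuples, a hypothetical
    layout used only for this sketch.
    """
    seen = set()   # the "variable" that tracks what was already extracted
    unique = []
    for page, kind, content in blocks:
        key = (kind, " ".join(content.split()))  # normalize whitespace before comparing
        if key in seen:
            continue
        seen.add(key)
        unique.append((page, kind, content))
    return unique
```

A filter like this would only hide the symptom, and it would also drop content that legitimately repeats across pages (running headers, for instance), so finding out why the pages are duplicated in the first place would be the better fix.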
