error in loading other document #2

Open
simjak opened this issue Nov 2, 2024 · 2 comments

Comments

simjak commented Nov 2, 2024

Hey, thanks for the awesome doc toolkit.

I tried to run the demo with pdf_path = "tests/test_files/direct_extract/single_column.pdf"

and got the following error:

2024-11-02 17:47:58,569 - rapid_layout - INFO: pp_layout_cdla contains ['text', 'title', 'figure', 'figure_caption', 'table', 'table_caption', 'header', 'footer', 'reference', 'equation']
  0%|                                                                                                                                                     | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/jakit/simonas/open-source/RapidDoc/demo.py", line 13, in <module>
    result = pdf_parser(pdf_path)
             ^^^^^^^^^^^^^^^^^^^^
  File "/Users/jakit/simonas/open-source/RapidDoc/rapid_doc/main.py", line 74, in __call__
    txt_boxes, txts = self.run_direct_extract(i, img_width)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jakit/simonas/open-source/RapidDoc/rapid_doc/main.py", line 105, in run_direct_extract
    txt_boxes, txts = self.pdf_extracter.extract_page_text(page_num, img_width)
    ^^^^^^^^^^^^^^^
ValueError: too many values to unpack (expected 2)
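
For readers unfamiliar with this error, here is a minimal sketch of how it arises. It assumes only what the traceback shows: the call site unpacks the result into two names while the callee returned more than two values. The function `extract_page_text_stub` and its three return values are hypothetical stand-ins, not the real RapidDoc method.

```python
def extract_page_text_stub(page_num, img_width):
    # Hypothetical stand-in: returns three values instead of two.
    # The real method's return values are not shown in this thread.
    boxes = [[0, 0, img_width, 10]]
    texts = ["hello"]
    scores = [0.99]
    return boxes, texts, scores


def unpack_two(result):
    # Mirrors the failing line: two targets on the left-hand side.
    txt_boxes, txts = result
    return txt_boxes, txts


try:
    unpack_two(extract_page_text_stub(0, 100))
    error_message = None
except ValueError as exc:
    # Python raises ValueError when the right-hand side has more items
    # than there are unpacking targets.
    error_message = str(exc)

print(error_message)
```

If this is what happens in the library, the caller and the callee disagree on how many values are returned, and one of the two sides would need to be adjusted.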
@liang-xian

Same question here.

CY202227 commented Jan 10, 2025

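I checked the document used in the demo and found that the check in rapid_doc\main.py for whether the PDF is a scanned version gives inconsistent results, so I commented out the following snippet:

    if self.is_extract(page):
        img_width = img.shape[1]
        txt_boxes, txts = self.run_direct_extract(i, img_width)
    else:
        tt_boxes, txts = self.run_ocr_extract(img)

Regardless of the document type, I now always run tt_boxes, txts = self.run_ocr_extract(img).
Tables are still not recognized, but the paragraphs all come out correctly now.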
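
The workaround in this comment (always taking the OCR path) can be sketched as below. This is a toy stand-in, not the code in rapid_doc/main.py: the class `ParserSketch`, the `force_ocr` flag, the `extract` method, and all method bodies are assumptions made for illustration. Only the method names `is_extract`, `run_direct_extract`, and `run_ocr_extract` come from the thread.

```python
class ParserSketch:
    """Toy stand-in for the parser; every method body is a placeholder."""

    def __init__(self, force_ocr=True):
        # force_ocr is a hypothetical flag, not an option of the real library.
        self.force_ocr = force_ocr

    def is_extract(self, page):
        # Placeholder check for an embedded text layer.
        return page.get("has_text_layer", False)

    def run_direct_extract(self, page_num, img_width):
        # Placeholder result for the direct-extraction path.
        return [[0, 0, img_width, 10]], ["direct text"]

    def run_ocr_extract(self, img):
        # Placeholder result for the OCR path.
        return [[0, 0, 5, 5]], ["ocr text"]

    def extract(self, page_num, page, img, img_width):
        # With force_ocr set, the direct path is skipped entirely,
        # which is the effect of commenting out the if/else branch.
        if not self.force_ocr and self.is_extract(page):
            return self.run_direct_extract(page_num, img_width)
        return self.run_ocr_extract(img)


page = {"has_text_layer": True}
forced = ParserSketch(force_ocr=True).extract(0, page, img=None, img_width=100)
default = ParserSketch(force_ocr=False).extract(0, page, img=None, img_width=100)
```

Keeping the choice behind a flag rather than deleting the branch would make it easy to switch back once the direct-extraction path is fixed.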
