Error in using other document and test document #3

Open
CY202227 opened this issue Jan 10, 2025 · 0 comments

CY202227 commented Jan 10, 2025

2025-01-10 10:33:21,663 - DownloadModel - INFO: D:\Dev_env\Anaconda3_202406\envs\audit\Lib\site-packages\rapid_layout\models\layout_cdla.onnx already exists
2025-01-10 10:33:21,910 - rapid_layout - INFO: pp_layout_cdla contains ['text', 'title', 'figure', 'figure_caption', 'table', 'table_caption', 'header', 'footer', 'reference', 'equation']
0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\Dev\Audit\RapidDoc\demo.py", line 13, in <module>
    result = pdf_parser(pdf_path)
  File "D:\Dev\Audit\RapidDoc\rapid_doc\main.py", line 74, in __call__
    txt_boxes, txts = self.run_direct_extract(i, img_width)
  File "D:\Dev\Audit\RapidDoc\rapid_doc\main.py", line 105, in run_direct_extract
    txt_boxes, txts = self.pdf_extracter.extract_page_text(page_num, img_width)
ValueError: too many values to unpack (expected 2)

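After reading the source, I found that self.pdf_extracter.extract_page_text(page_num, img_width) returns only the boxes: return np.array(boxes). After I changed it to return np.array(boxes), self.texts, a different error appeared at score = list(map(lambda x: float(x[1]), select_text_score)):
ValueError: could not convert string to float: '民'. Even if I set every score to 0, the resulting data is still unusable.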
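
Both errors can be reproduced outside RapidDoc, which may help narrow down the cause. The following is only a minimal sketch: it uses plain Python lists in place of np.array, the names extract_boxes_only, scores_from, unpack_error and score_error are hypothetical, and the idea that select_text_score ends up holding bare strings instead of (text, score) pairs is an assumption, not something confirmed from the RapidDoc source.

```python
def extract_boxes_only():
    """Hypothetical stand-in for a function that returns only the boxes."""
    boxes = [[0, 0, 10, 10], [0, 20, 10, 30], [0, 40, 10, 50]]
    return boxes


def scores_from(select_text_score):
    """Same expression as the one shown in the reported error."""
    return list(map(lambda x: float(x[1]), select_text_score))


def unpack_error():
    """Unpack a 3-item sequence into 2 names and return the error message."""
    try:
        txt_boxes, txts = extract_boxes_only()
    except ValueError as exc:
        return str(exc)
    return None


def score_error():
    """Pass bare strings (assumed cause): x[1] is then the second character."""
    try:
        scores_from(["人民", "abc"])
    except ValueError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(unpack_error())
    print(score_error())
    # With (text, score) pairs the same expression works as intended.
    print(scores_from([("人民", 0.98)]))
```

If this assumption holds, the second error would mean the texts returned alongside the boxes are plain strings, while the downstream code expects each entry to carry a score at index 1.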

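I then checked the documents used in the demo and found that the scanned-PDF check in rapid_doc\main.py gives inconsistent results, so I commented out the following snippet:
# Assume no document is scanned in one part and directly extractable in another
# if self.is_extract(page):
#     img_width = img.shape[1]
#     txt_boxes, txts = self.run_direct_extract(i, img_width)
# else:
#     txt_boxes, txts = self.run_ocr_extract(img)
That fixed the error, but tables are still not extracted. As far as I can tell, the criterion for deciding whether a page can be extracted directly is whether 100 characters can be extracted from it. Isn't that check too crude? And how should extract_page_text(page_num, img_width) be fixed?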
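
One possible alternative to a per-page character threshold is to decide once for the whole document, so that every page goes through the same extraction path. The sketch below is only an illustration of that idea and is not RapidDoc code: choose_extraction_mode, count_visible_chars, min_chars and min_ratio are hypothetical names, and the default values are placeholders rather than tuned settings.

```python
def count_visible_chars(text):
    """Count characters that are not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


def choose_extraction_mode(page_texts, min_chars=100, min_ratio=0.5):
    """Return "direct" or "ocr" for the whole document.

    page_texts holds the text that direct extraction produced for each page.
    A page counts as text-rich when it has at least min_chars visible
    characters; the document is extracted directly when at least min_ratio
    of its pages are text-rich.
    """
    if not page_texts:
        return "ocr"
    rich_pages = sum(
        1 for text in page_texts if count_visible_chars(text) >= min_chars
    )
    return "direct" if rich_pages / len(page_texts) >= min_ratio else "ocr"


if __name__ == "__main__":
    print(choose_extraction_mode(["a" * 150, "b" * 20]))
    print(choose_extraction_mode(["", " \n"]))
```

A document-level decision keeps the original assumption that a file is never partly scanned and partly extractable, while avoiding a switch between the two paths from one page to the next.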
