Error in using other document and test document #3

CY202227 · 2025-01-10T02:49:54Z

2025-01-10 10:33:21,663 - DownloadModel - INFO: D:\Dev_env\Anaconda3_202406\envs\audit\Lib\site-packages\rapid_layout\models\layout_cdla.onnx already exists
2025-01-10 10:33:21,910 - rapid_layout - INFO: pp_layout_cdla contains ['text', 'title', 'figure', 'figure_caption', 'table', 'table_caption', 'header', 'footer', 'reference', 'equation']
0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\Dev\Audit\RapidDoc\demo.py", line 13, in <module>
    result = pdf_parser(pdf_path)
  File "D:\Dev\Audit\RapidDoc\rapid_doc\main.py", line 74, in __call__
    txt_boxes, txts = self.run_direct_extract(i, img_width)
  File "D:\Dev\Audit\RapidDoc\rapid_doc\main.py", line 105, in run_direct_extract
    txt_boxes, txts = self.pdf_extracter.extract_page_text(page_num, img_width)
ValueError: too many values to unpack (expected 2)

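After reading the source, I found that self.pdf_extracter.extract_page_text(page_num, img_width) ends with return np.array(boxes), so it returns a single value. After changing that to return np.array(boxes), self.texts, a different error is raised at score = list(map(lambda x: float(x[1]), select_text_score)):
ValueError: could not convert string to float: '民'. If I set every score to 0, the resulting data is still unusable.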
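
Both errors can be reproduced in isolation. The sketch below is not RapidDoc code: the function names are made up for illustration and plain lists stand in for np.array. The second half rests on an unconfirmed assumption, namely that the scoring step expects (text, score) pairs, so that x[1] on a bare string picks out its second character.

```python
def extract_boxes_only():
    # Stand-in for `return np.array(boxes)`: one object holding three boxes.
    return [[0, 0, 10, 10], [0, 20, 10, 30], [0, 40, 10, 50]]


def extract_boxes_and_texts():
    # Stand-in for `return np.array(boxes), self.texts`: a 2-tuple.
    return extract_boxes_only(), ["人民", "法院", "判决"]


def unpack_error_message(func):
    """Return the ValueError text raised by two-name unpacking, or None."""
    try:
        txt_boxes, txts = func()
    except ValueError as exc:
        return str(exc)
    return None


def score_error_message(items):
    """Run the score expression from the traceback; return the error text or None."""
    try:
        list(map(lambda x: float(x[1]), items))
    except ValueError as exc:
        return str(exc)
    return None


# Three boxes cannot be unpacked into two names.
print(unpack_error_message(extract_boxes_only))
# A (boxes, texts) tuple unpacks cleanly.
print(unpack_error_message(extract_boxes_and_texts))
# On a bare string, x[1] is its second character, which is not a number.
print(score_error_message(["人民"]))
# ASSUMPTION: with (text, score) pairs, x[1] is the numeric score.
print(score_error_message([("人民", 1.0)]))
```

If that assumption holds, the texts returned by direct extraction would need to be paired with a score before they reach the scoring step. Whether that alone makes the output usable is not something this sketch can show.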

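I then checked the document used in the demo and found that the check in rapid_doc\main.py for whether a page is scanned gives inconsistent results, so I commented out the following fragment:
# Assume a document never mixes scanned sections with directly extractable ones
# if self.is_extract(page):
#     img_width = img.shape[1]
#     txt_boxes, txts = self.run_direct_extract(i, img_width)
# else:
#     txt_boxes, txts = self.run_ocr_extract(img)
That solves the problem, but tables are still not extracted. As far as I can tell, the criterion for direct extraction is whether 100 characters can be extracted from the page. Isn't that check too crude? And how should extract_page_text(page_num, img_width) be fixed?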
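
One possible direction for a less crude check, as a sketch only. Every name and threshold here is hypothetical and none of it comes from RapidDoc: the idea is to require both a minimum character count and a minimum share of usable characters, and to fall back to OCR when direct extraction raises.

```python
def looks_extractable(page_text, min_chars=100, min_valid_ratio=0.9):
    """Hypothetical check: enough characters, and most of them are real text."""
    stripped = "".join(page_text.split())  # drop all whitespace
    if len(stripped) < min_chars:
        return False
    # Count printable characters, excluding the Unicode replacement character
    # that often shows up when a PDF text layer is broken.
    valid = sum(1 for ch in stripped if ch.isprintable() and ch != "\ufffd")
    return valid / len(stripped) >= min_valid_ratio


def extract_page(page_text, direct_extract, ocr_extract):
    """Try direct extraction when the page looks usable; otherwise use OCR.

    `direct_extract` and `ocr_extract` are caller-supplied callables
    (assumed here) that each return the extraction result for this page.
    """
    if looks_extractable(page_text):
        try:
            return direct_extract()
        except ValueError:
            # Direct extraction failed on this page; fall through to OCR.
            pass
    return ocr_extract()
```

Because the decision is made per page, a document that mixes scanned and directly extractable pages would be routed page by page rather than by a single document-wide assumption.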
