2025-01-10 10:33:21,663 - DownloadModel - INFO: D:\Dev_env\Anaconda3_202406\envs\audit\Lib\site-packages\rapid_layout\models\layout_cdla.onnx already exists
2025-01-10 10:33:21,910 - rapid_layout - INFO: pp_layout_cdla contains ['text', 'title', 'figure', 'figure_caption', 'table', 'table_caption', 'header', 'footer', 'reference', 'equation']
0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\Dev\Audit\RapidDoc\demo.py", line 13, in <module>
    result = pdf_parser(pdf_path)
  File "D:\Dev\Audit\RapidDoc\rapid_doc\main.py", line 74, in __call__
    txt_boxes, txts = self.run_direct_extract(i, img_width)
  File "D:\Dev\Audit\RapidDoc\rapid_doc\main.py", line 105, in run_direct_extract
    txt_boxes, txts = self.pdf_extracter.extract_page_text(page_num, img_width)
ValueError: too many values to unpack (expected 2)
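To make the failure mode concrete, here is a minimal sketch of how this error arises when a function returns a single NumPy array but the caller unpacks two names. The two functions below are hypothetical stand-ins, not the actual rapid_doc code, and the box coordinates and texts are made up for illustration.

```python
import numpy as np


def extract_page_text_single():
    # Hypothetical stand-in: returns only the array of boxes (three rows here).
    boxes = [[0, 0, 10, 10], [0, 20, 10, 30], [0, 40, 10, 50]]
    return np.array(boxes)


def extract_page_text_pair():
    # Hypothetical stand-in: returns the boxes together with their texts.
    boxes = [[0, 0, 10, 10], [0, 20, 10, 30], [0, 40, 10, 50]]
    texts = ["first line", "second line", "third line"]
    return np.array(boxes), texts


error_message = None
try:
    # Unpacking iterates over the rows of the array; three rows cannot be
    # assigned to two names, so Python raises ValueError.
    txt_boxes, txts = extract_page_text_single()
except ValueError as exc:
    error_message = str(exc)

# With a two-element return value the same unpacking works.
txt_boxes, txts = extract_page_text_pair()
```

Note that with exactly two boxes on a page the first variant would unpack without any error and silently assign one box row to each name, so the bug may only show up on some pages.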
After reading the source, I found that self.pdf_extracter.extract_page_text(page_num, img_width) ends with return np.array(boxes). After changing this to return np.array(boxes), self.texts, a different error is raised:
    score = list(map(lambda x: float(x[1]), select_text_score))
ValueError: could not convert string to float: '民'
Even with every score set to 0, the resulting data is still unusable.
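A small sketch of what this second error suggests, under the assumption (inferred only from the lambda in the traceback, not confirmed in the source) that the downstream code expects each item to be a (text, score) pair, and that self.texts holds plain strings. In that case x[1] is the second character of the string rather than a score. The sample data is invented.

```python
def scores_from(select_text_score):
    # Same expression as in the traceback: takes element 1 of every item.
    return list(map(lambda x: float(x[1]), select_text_score))


# Assumed shape the downstream code expects: (text, score) pairs.
pairs = [("人民", 0.98), ("共和国", 0.95)]
pair_scores = scores_from(pairs)

# Plain strings instead of pairs: x[1] is now the second character.
plain_texts = ["人民", "共和国"]
error_message = None
try:
    scores_from(plain_texts)
except ValueError as exc:
    error_message = str(exc)

# One possible workaround (an assumption, not the project's fix): attach a
# placeholder score so every item has the expected two-element shape.
patched = [(text, 1.0) for text in plain_texts]
patched_scores = scores_from(patched)
```

This only addresses the shape mismatch; it does not explain why the extracted data remains unusable afterwards.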
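I then checked the document used in the demo and found that the check in rapid_doc\main.py for whether a page is scanned gives inconsistent results, so I commented out the following snippet:
# Assume no document mixes scanned sections with directly extractable ones
# if self.is_extract(page):
#     img_width = img.shape[1]
#     txt_boxes, txts = self.run_direct_extract(i, img_width)
# else:
#     txt_boxes, txts = self.run_ocr_extract(img)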
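That fixed the error, but tables are still not extracted. As far as I can tell, the criterion for deciding whether a page can be extracted directly is whether 100 characters can be extracted from it. Isn't that check too crude? And how should the extract_page_text(page_num, img_width) problem be fixed?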
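As a starting point for discussion, here is a rough sketch of a slightly stricter check that could stand in for a bare character count. It is not the project's is_extract implementation; the function name, the thresholds, and the "valid character" rule are all assumptions. The idea is to require both a minimum amount of text and a high share of characters that are not control, private-use, unassigned, or replacement characters, since a broken text layer often yields that kind of garbage.

```python
import unicodedata

# Unicode general categories treated as garbage in this sketch (an assumption).
_BAD_CATEGORIES = {"Cc", "Cf", "Co", "Cn", "Cs"}


def looks_directly_extractable(text, min_chars=100, min_valid_ratio=0.9):
    """Hypothetical heuristic: enough text, and most of it looks like real text."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < min_chars:
        return False
    valid = sum(
        1
        for c in chars
        if c != "\ufffd" and unicodedata.category(c) not in _BAD_CATEGORIES
    )
    return valid / len(chars) >= min_valid_ratio


clean_page = "民" * 120          # plenty of ordinary characters
short_page = "民" * 50           # too little text
garbled_page = "\ufffd" * 120    # long, but only replacement characters

results = (
    looks_directly_extractable(clean_page),
    looks_directly_extractable(short_page),
    looks_directly_extractable(garbled_page),
)
```

A check like this would still need to run per page, and it does nothing for table extraction, which is a separate problem.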