An operator causes a hang during processing #560

Open
3 tasks done
SkyAndFly opened this issue Jan 22, 2025 · 2 comments
Assignees
Labels
question Further information is requested

Comments

@SkyAndFly

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of main branch to run again and the problem still existed.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

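Environment: Alibaba Cloud PAI-DSW, 8 cores, 32 GB memory, 24 GB GPU memory.
The YAML file I used is attached below. When I run
python ./data-juicer-main/tools/process_data.py --config zhihu-bot.yaml
the first operator, numeric_field_filter_process, completes and the second operator, text_length_filter, starts. After the line Adding new column for stats (num_proc=8): 0%| appears, the process hangs, the remote notebook loses its connection, and the only option is to restart.
Just before the hang, CPU and memory usage rise. After I lowered np to 4, the problem no longer occurred, but operators that are not resource-intensive then take longer to run. Is there a configuration that solves this and makes the best use of the machine's full capacity? I could not find anything relevant in the documentation.
The YAML file:
zhihu-bot.yaml.txt
The log file:
export_zhihu_refine.jsonl_time_20250122153602.txt

Some of the messages in the log come from code I modified to try to locate where the hang occurs.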

Additional

No response

@SkyAndFly SkyAndFly added the question Further information is requested label Jan 22, 2025
@Cathy0908
Collaborator

@SkyAndFly Hi, it seems the machine may have run out of memory. I recommend switching to a machine with more memory. None of the operators in your configuration file use the GPU, so a CPU machine with more memory is enough. An operator can use the GPU if it has the setting "_accelerator = 'cuda'".
In addition, the number of processes can be set separately for each operator: add the num_proc parameter to that operator's configuration.
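
As a rough illustration of the per-operator setting described above, a config fragment might look like the sketch below. It is not taken from the attached zhihu-bot.yaml, whose contents are not shown in this thread. The min_len and max_len thresholds and both process counts are placeholder assumptions, chosen only because np 8 hung and np 4 worked in the report above.

```yaml
# Illustrative fragment only; all values are placeholders, not tuned recommendations.
np: 8                      # global default number of worker processes

process:
  - text_length_filter:
      min_len: 10          # assumed threshold, for illustration
      max_len: 10000       # assumed threshold, for illustration
      num_proc: 4          # per-operator override for the operator that hung at 8
```

The idea is that lightweight operators keep the global np, while only the operator that exhausts memory runs with fewer processes.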

@SkyAndFly
Author

@Cathy0908 Thank you! Very helpful, you solved my problem.
