Memory problem during exporting prediction results #1585
stevenkboyd
started this conversation in
General
Hi Fabian et al,
I've found nnU-Net to be fantastic for my research program. However, I've trained a segmentation model for CT data (typically 512x512x600 voxels) with five folds, and I'm now running into out-of-memory issues when running inference with nnUNetv2_predict on a number of CT images.
My system has four GPUs, so I'm using the following to run prediction on 8 CT scans (2 per GPU):
CUDA_VISIBLE_DEVICES=0 nohup nnUNetv2_predict -i ... -o ... -part_id 0 -num_parts 4 -npp 2 -nps 2 -step_size 0.5 [...]
CUDA_VISIBLE_DEVICES=1 nohup nnUNetv2_predict -i ... -o ... -part_id 1 -num_parts 4 -npp 2 -nps 2 -step_size 0.5 [...]
CUDA_VISIBLE_DEVICES=2 nohup nnUNetv2_predict -i ... -o ... -part_id 2 -num_parts 4 -npp 2 -nps 2 -step_size 0.5 [...]
CUDA_VISIBLE_DEVICES=3 nohup nnUNetv2_predict -i ... -o ... -part_id 3 -num_parts 4 -npp 2 -nps 2 -step_size 0.5 [...]
I've set the following environment variable: export nnUNet_n_proc_DA=32
The problem seems to be that the GPU predictions finish much faster than the CPU can export the results, so a backlog builds up in memory until it exceeds my RAM and the process crashes. I can observe the GPU predictions completing all five folds (in about 40 s per case), but exporting the images is much slower. That would be fine on its own, but eventually too many images accumulate waiting for export, and at about 40 GB of RAM per CT it adds up quickly. I tried increasing -nps from 2 to a larger number, but it didn't seem to help.
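To illustrate what I think is happening, here is a minimal sketch of a fast producer feeding a slow consumer. This is not nnU-Net code: `run_pipeline` and its parameters are hypothetical names I made up, and the sleeps merely stand in for prediction and export. It only shows that an unbounded hand-off lets the backlog grow, while a bounded queue forces the producer to wait.

```python
import queue
import threading
import time


def run_pipeline(n_cases, predict_s, export_s, max_pending):
    """Simulate fast prediction feeding slow export; return the peak backlog.

    max_pending=0 means an unbounded queue (no backpressure).
    The case currently being exported is not counted in the backlog.
    """
    pending = queue.Queue(maxsize=max_pending)

    def exporter():
        while True:
            item = pending.get()
            if item is None:  # sentinel: no more cases
                break
            time.sleep(export_s)  # stand-in for the slow export step

    worker = threading.Thread(target=exporter)
    worker.start()

    peak = 0
    for case in range(n_cases):
        time.sleep(predict_s)  # stand-in for the fast GPU prediction
        pending.put(case)      # blocks here when a bounded queue is full
        peak = max(peak, pending.qsize())

    pending.put(None)
    worker.join()
    return peak


if __name__ == "__main__":
    print("unbounded peak backlog:", run_pipeline(20, 0.001, 0.05, 0))
    print("bounded peak backlog:  ", run_pipeline(20, 0.001, 0.05, 2))
```

With the bounded queue the backlog never exceeds `max_pending`, at the cost of the producer idling while it waits. Whether nnUNetv2_predict exposes an equivalent limit is exactly what I'm unsure about.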
Can somebody please give me advice on how to balance my CPU and GPU usage better to avoid out-of-memory issues? Is it normal for about 40 GB of RAM to be used by each export process?
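For reference, here is my back-of-the-envelope arithmetic as a small sketch. It assumes (and this is only my assumption, not something I've confirmed in the nnU-Net code) that the export stage holds a dense array with one channel per class; `array_gib` and `max_concurrent` are hypothetical helper names, and the channel count in the example is illustrative rather than my actual number of classes.

```python
def array_gib(shape, n_channels, bytes_per_value):
    """Size in GiB of a dense array with the given spatial shape and channels."""
    voxels = 1
    for dim in shape:
        voxels *= dim
    return voxels * n_channels * bytes_per_value / 2**30


def max_concurrent(total_ram_gb, per_case_gb):
    """How many cases of a given footprint fit in RAM at the same time."""
    return total_ram_gb // per_case_gb


if __name__ == "__main__":
    ct_shape = (512, 512, 600)  # typical CT size from this post
    # One float32 channel (4 bytes per voxel) for a single CT volume.
    print(array_gib(ct_shape, n_channels=1, bytes_per_value=4))
    # Illustrative channel count only; substitute the real number of classes.
    print(array_gib(ct_shape, n_channels=10, bytes_per_value=4))
    # 450 GB of RAM at roughly 40 GB per CT being exported.
    print(max_concurrent(450, 40))
```

A single float32 channel at this volume size is a bit under 0.6 GiB, so the footprint scales quickly with the number of channels held per case. At roughly 40 GB per CT, only 11 cases fit in 450 GB at once across all four prediction processes.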
Here are some details on my system:
System:
60 CPU cores
4 GPUs (NVIDIA A100-PCIE-40GB)
450 GB RAM
Running:
nnUNetv2 from git (July 28, 2023)
CUDA-toolkit 11.7
pytorch 2.0.1
OS:
Distributor ID: Ubuntu
Description: Ubuntu 20.04.5 LTS
Release: 20.04
Codename: focal