
GPU memory issue when running the demo #3

Open
xingruiyang opened this issue Oct 27, 2021 · 4 comments

Comments

@xingruiyang

Hi,

I am running into a memory issue where CUDA complains about insufficient VRAM. The problem seems to stem from line 245 of src/conv_onet/training.py, where it tries to extract 3D features from the input point cloud.

My GPU has 8 GiB of VRAM, but it seems the demo needs more than that. Could you tell me the minimum required VRAM, and whether there is a way to reduce the memory requirement? Thanks

Regards

@tangjiapeng
Owner

Hi,

The GPUs I used to run these experiments had at least 11 GB of RAM.

If you are using an 8 GB GPU, you can reduce the batch size from 16 to 12 or 8 to solve the memory issue.

You can also reduce the number of input point cloud points used for point feature learning.
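For example, a config change along these lines (a sketch only; the section and key names are assumed from ConvONet-style configs and may differ from the actual files, so check them before editing):

```yaml
# Sketch of a training .yaml config -- section and key names assumed from
# ConvONet-style configs; verify against the actual file before editing.
data:
  pointcloud_n: 4096   # fewer input points per sample -> less encoder memory
training:
  batch_size: 8        # reduced from 16 to fit an 8 GB GPU
```

Both changes trade some throughput and input detail for a smaller peak memory footprint.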

Hope this helps! Please let me know if it solves the problem!

Best,
Jiapeng

@xingruiyang
Author

Thanks @tangjiapeng,

I can see the batch size in generate_optim_largescene.py is set to 1, so I don't think a large batch size is the issue. I did try to downsample the point cloud as you suggested: I set both pointcloud_n and pointcloud_subsample in demo_matterport.yaml to 4096, but I am still getting OOM errors. Is this the correct way of downsampling the input points? To help you diagnose the issue, I have included the error message below:

Warning: generator does not support pointcloud generation.
  0%|                                                                                                                                                                                | 0/2 [00:00<?, ?it/s]Process scenes in a sliding-window manner
ft only encoder True
only optimize encoder████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 693/693 [01:44<00:00,  6.02it/s]
Traceback (most recent call last):
  File "generate_optim_largescene.py", line 235, in <module>
    loss = trainer.sign_agnostic_optim_cropscene_step(crop_data, state_dict)
  File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/conv_onet/training.py", line 216, in sign_agnostic_optim_cropscene_step
    loss = self.compute_sign_agnostic_cropscene_loss(data)
  File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/conv_onet/training.py", line 244, in compute_sign_agnostic_cropscene_loss
    c = self.model.encode_inputs(inputs)
  File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/conv_onet/models/__init__.py", line 60, in encode_inputs
    c = self.encoder(inputs)
  File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/encoder/pointnet.py", line 307, in forward
    fea['grid'] = self.generate_grid_features(index['grid'], c)
  File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/encoder/pointnet.py", line 262, in generate_grid_features
    fea_grid = self.unet3d(fea_grid)
  File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/encoder/unet3d.py", line 465, in forward
    x = decoder(encoder_features, x)
  File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/encoder/unet3d.py", line 284, in forward
    x = self.joining(encoder_features, x)
  File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/encoder/unet3d.py", line 291, in _joining
    return torch.cat((encoder_features, x), dim=1)
RuntimeError: CUDA out of memory. Tried to allocate 750.00 MiB (GPU 0; 7.79 GiB total capacity; 5.98 GiB already allocated; 446.81 MiB free; 6.03 GiB reserved in total by PyTorch)
Exception ignored in: <bound method tqdm.__del__ of   0%|                                                                                                                                                                                | 0/2 [01:50<?, ?it/s]>
Traceback (most recent call last):
  File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/site-packages/tqdm/_tqdm.py", line 931, in __del__
    self.close()
  File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/site-packages/tqdm/_tqdm.py", line 1133, in close
    self._decr_instances(self)
  File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/site-packages/tqdm/_tqdm.py", line 496, in _decr_instances
    cls.monitor.exit()
  File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/site-packages/tqdm/_monitor.py", line 52, in exit
    self.join()
  File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/threading.py", line 1053, in join
    raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread

@xingruiyang xingruiyang reopened this Oct 31, 2021
@tangjiapeng
Owner

The batch_size in demo_matterport.yaml was 2; you can set it to 1.

A better option is to use a GPU with more RAM.
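The edit would look roughly like this (a sketch of just the relevant entry in demo_matterport.yaml; the enclosing section name is assumed and the other keys are omitted):

```yaml
# demo_matterport.yaml -- relevant entry only; enclosing section name assumed.
training:
  batch_size: 1   # was 2; roughly halves activation memory per optimization step
```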

@tangjiapeng
Owner

Hi xingrui, have you addressed the memory issue?
