
Trying to execute run.py in the train folder raises an error #45

Open
bhardwaj-garvit opened this issue on Feb 1, 2023 · 4 comments
Labels: bug (Something isn't working)

Comments

@bhardwaj-garvit

Hi,
Really great and helpful code!
I was trying to run train.py on the nimrod-uk-1km test data and encountered the following error: "RuntimeError: Serialization of parametrized modules is only supported through state_dict()." I searched PyTorch's issue tracker and found an earlier related report, so I downgraded torch to v1.12.0, but the error did not go away.
Torch link: pytorch/pytorch#69413

Can you help debug this issue? I am planning to use this code on another dataset.

[Screenshot of the traceback, 2023-02-01 at 5:01 PM]

**To Reproduce** Steps to reproduce the behavior:
1. Install the dependencies.
2. Execute train/run.py; the above error appears in the terminal.
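(For reference, this error is not specific to this repository: pickling any module that uses a PyTorch parametrization, such as spectral norm, raises it, while saving the state_dict() works. A minimal standalone sketch, with an arbitrary file name:)

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

layer = spectral_norm(nn.Conv2d(1, 8, 3))  # a module wrapped with a spectral-norm parametrization

# Pickling the whole module triggers the error:
# torch.save(layer, "layer.pt")  # RuntimeError: Serialization of parametrized
#                                # modules is only supported through state_dict()

# Saving and restoring through state_dict() works:
torch.save(layer.state_dict(), "layer.pt")
restored = spectral_norm(nn.Conv2d(1, 8, 3))
restored.load_state_dict(torch.load("layer.pt"))
```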
@bhardwaj-garvit added the bug (Something isn't working) label on Feb 1, 2023
@jacobbieker
Member

Hi, are you using multiple GPUs? By default run.py tries to use 6 GPUs, although it should be changed to 1. The spectrally normalized layers in PyTorch don't seem to work in a multi-GPU setting, as far as I have been able to tell. If you change it to 1 GPU, training should start.
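In case it helps, a minimal sketch of that change, assuming run.py constructs a PyTorch Lightning Trainer (the exact argument name depends on the installed Lightning version):

```python
from pytorch_lightning import Trainer

# Older Lightning releases select devices via the `gpus` argument; setting it
# to 1 avoids pickling the spectral-norm layers across multiple GPU processes.
trainer = Trainer(gpus=1)

# Newer releases (>= 1.7) express the same thing with accelerator/devices:
# trainer = Trainer(accelerator="gpu", devices=1)
```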

@bhardwaj-garvit
Author

I was earlier using CPUs; to work around the issue I switched to 1 GPU, but training fills virtual memory up to 200 GB (my system's limit) and the dataloader worker is killed. Can you suggest a way to bypass this?

@Chevolier

I hit the same issue: memory keeps increasing up to 256 GB during data loading until the process is killed by the system. Is there a solution for this?

@Chevolier

Update: my problem is solved by setting streaming=True in TFDataset, as follows, for my own dataset; this way the data are not loaded into memory up front.

```python
import torch
from datasets import load_dataset

class TFDataset(torch.utils.data.dataset.Dataset):
    def __init__(self, data_path, split):
        super().__init__()
        # self.reader = load_dataset(
        #     "openclimatefix/nimrod-uk-1km", "sample", split=split, streaming=True
        # )
        # streaming=True loads samples lazily instead of reading everything into memory first.
        self.reader = load_dataset(data_path, split=split, streaming=True)
```
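For completeness, a quick way to confirm the streaming behaviour outside the Dataset class (a sketch using the Hugging Face datasets API; the dataset name and config come from the commented-out line above, and the "train" split name is an assumption):

```python
from datasets import load_dataset

# streaming=True returns an IterableDataset: examples are fetched lazily as you
# iterate, so memory usage stays bounded instead of growing with the dataset.
reader = load_dataset(
    "openclimatefix/nimrod-uk-1km", "sample", split="train", streaming=True
)

first = next(iter(reader))  # pulls a single example over the network
print(first.keys())
```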
