Load ICON data from HF #66
Comments
@peterdudfield I'd like to have a look at this, could you give me permission to create a branch?
Hi @gabrielelibardi, we actually paused development on this in ocf_datapipes. But this could be done in ocf-data-sampler. Would you like to try it there?
I managed to run the pvnet_datapipe on some of the icon-eu Hugging Face data that I downloaded locally (one day's worth of data). This needs some changes in the code, but I think the problem you refer to in this issue is different. It seems to be independent of the postprocessing done in ocf_datapipes and is just about this line: ds = xr.open_mfdataset(zarr_paths, engine="zarr", combine="nested", concat_dim="time"). If zarr_paths contains the paths to every .zarr.zip file on Hugging Face, then ds is never initialized because it just takes too long. I presume this is because of all the metadata that needs to be downloaded from each .zarr.zip file. Once ds is initialized, the data will be loaded lazily as you create the batches. Maybe caching the metadata locally could speed things up. Do I understand this correctly? @peterdudfield
Thanks, great that you managed it with one day. Hmm, interesting. The metadata is normally quite small. Does it scale linearly with the number of files you provide in I have seen before if they are different shapes, then
It definitely gets slower with more files. Here is a flamegraph SVG for the profiling. I am getting a year's worth of Hugging Face .zarr.zip files and trying to make an xarray dataset from the first 20. Most of the time is spent in requests to the HF server. If I try to make an xarray dataset with too many paths, the HF server eventually starts rejecting requests: huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/datasets/openclimatefix/dwd-icon-eu.
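A common way to cope with 429 responses is to retry with exponential backoff. The following is a minimal sketch of that idea only, not code from ocf_datapipes or huggingface_hub. RateLimitedError and call_with_backoff are hypothetical names; in practice you would catch whatever exception your client raises for a 429.

```python
import random
import time


class RateLimitedError(Exception):
    """Hypothetical stand-in for an HTTP 429 error raised by a client library."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff on RateLimitedError."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitedError:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            # Wait base_delay * 2^attempt seconds, plus up to 10% jitter so
            # that parallel workers do not all retry at the same moment.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

Backoff only spreads the requests out over time. It does not reduce how many metadata requests are needed, so opening many files would still be slow.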
Thanks for doing this, and nice to see the profile. What are the current timings for making an xarray with the first 20? Do you know roughly how many paths it takes until you get a 429?
It takes 58 secs to create an xarray with 20. Pretty sure it is not downloading the data (that would be more than 60 GB in 60 secs, a bit too crazy for my laptop).
For training with ECMWF data, do you also use multiple zarr files or just one? Do you stream it from S3 buckets or keep it on the training server?
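As a rough back-of-the-envelope check, 58 seconds for 20 files is about 2.9 seconds per file. The sketch below extrapolates from that single measurement, assuming the open time scales linearly with the number of files, which the thread has not confirmed. The function name and the file counts in the loop are illustrative only.

```python
def estimate_open_seconds(n_files, measured_files=20, measured_seconds=58.0):
    """Linearly extrapolate the time to open n_files from one measurement."""
    per_file = measured_seconds / measured_files  # about 2.9 s per file
    return n_files * per_file


# Illustrative file counts, not the real number of files in the dataset.
for n in (20, 100, 365):
    print(f"{n} files: ~{estimate_open_seconds(n) / 60:.1f} min")
```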
For training ECMWF, we tend to join our daily ECMWF files into yearly (or monthly) files and then open multiple zarrs. We tend to try to keep the data close to where the model is running, so either locally if training locally, or in the cloud if we are training in the cloud. It would be nice to stream from HF though, so we don't have to reorganise the files, and the live data is available. Thanks for these benchmark figures. Yeah, I agree it's just downloading metadata, not the whole thing. Feels like there should be some caching we can do, for example loading the metadata for each month once, and then quickly loading a new file in the future. I really don't know the solution for this, sorry. @devsjc @Sukh-P @AUdaltsova might have some ideas?
You should be able to do something like kerchunk or virtualizarr to save the metadata into one file and open that. That way you wouldn't need all the requests for the metadata. Getting the data is then still limited by HF request limits, but that is at least less of an issue.
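The idea of saving the metadata once and reusing it can be illustrated without either library. The sketch below is not kerchunk or virtualizarr and does not use their APIs. It is a plain JSON cache, assuming a hypothetical fetch_metadata callable that performs the slow remote request for one file and returns JSON-serialisable metadata.

```python
import json
from pathlib import Path


def load_metadata(paths, fetch_metadata, cache_file):
    """Return {path: metadata}, fetching only paths missing from cache_file.

    fetch_metadata is a hypothetical callable that does the slow remote
    request for a single file.
    """
    cache_path = Path(cache_file)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    missing = [p for p in paths if p not in cache]
    for p in missing:
        cache[p] = fetch_metadata(p)
    if missing:
        # Persist the combined metadata so later runs skip these requests.
        cache_path.write_text(json.dumps(cache))
    return {p: cache[p] for p in paths}
```

On a second run only newly added files trigger a request, which matches the case of a new daily file being appended to an existing archive.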
Thanks a lot @jacobbieker @peterdudfield! I haven't tried it yet but it looks promising. For now I have put 1 month of data on cloud storage (something like S3). This solves the problem with the 429 error. I tried to run the script/save_batches.py though, and it is impractically slow (1 sample every 3 secs or so). It is not a problem for me to self-host the ICON dataset (or parts of it), but I would like to keep large amounts of data off the training instance, as we would spin this one up and down. I would be curious to know, in your case for the training of PVNet, how long you needed to create the batches for the training data, and which solutions worked best for you when you were training in the cloud.
Thanks @gabrielelibardi. Is this slow running speed when you load from S3? Where are you running the code from, an EC2 instance? We tend to get a speed of more like 1 batch every 1 sec or so (I would have to look up the batch size). So it can still take a day or so to make batches. We get the fastest results by having the data very near, i.e. locally or on a disk attached to a VM. This is more setup, but faster. Using multiprocessing helps too; I think that's in the batch-making script already.
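To illustrate why running batch creation concurrently helps, here is a minimal sketch. It uses a thread pool rather than multiprocessing, which is a deliberate swap: threads are simpler to show and suit work that mostly waits on remote reads. It is not the code in the batch-making script, and make_batch is a hypothetical callable that builds one batch from an id.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(batch_ids, make_batch, max_workers=4):
    """Build batches concurrently; results come back in batch_ids order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even if batches finish out of order.
        return list(pool.map(make_batch, batch_ids))
```

If building a batch is dominated by CPU work rather than waiting on storage, a process pool would be the closer match to the multiprocessing mentioned above.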
Thank you @peterdudfield, it is a colocated S3 bucket and I am running from a VPS, but these are not AWS services so there might be some differences in terms of bandwidth. In general though it is faster than streaming it from HF, which makes this quite useless in my opinion. Maybe you have tried this before with a more powerful instance and it was faster? Still, I think implementing this is beneficial, as it should be the same code whether you stream from local or remote .zarr.zip files. I can implement this in the ocf-data-sampler repo. For me this one worked pretty well though, what is the major difference?
Thanks. Yeah, we first use HF to save the ICON data, as there is no rolling archive at the moment. You're totally right though, we should consider a further step that makes it useful for other people to use. We are migrating from ocf-datapipes to ocf-data-sampler. Essentially to simplify the code and remove That would be greatly appreciated, if you can put the code in for ICON in this repo
@peterdudfield can you give me permission to push a feature branch to ocf-data-sampler, please?
Yeah, I can definitely open access. What we generally do is let OS contributors
Sure!
Detailed Description
It would be great to be able to load the ICON data from HF in our NWP open dataset
Context
Possible Implementation