Add MMFlood dataset #2450

Open: wants to merge 15 commits into base: main

@lccol lccol commented Dec 5, 2024

This PR adds the MMFlood dataset from the paper "MMFlood: A Multimodal Dataset for Flood Delineation From Satellite Imagery". This is a Sentinel-1 + DEM dataset for Image Segmentation.

The original tif files have variable dimensions. The maximum height is 2147 pixels and the maximum width is 2313 pixels (these are the values reported in the docs). The dataset also includes hydrography information, but it is not available for all acquisitions, so the implemented class currently does not read those tif files.

Example with False Color representation
(image)

@github-actions github-actions bot added documentation Improvements or additions to documentation datasets Geospatial or benchmark datasets testing Continuous integration testing datamodules PyTorch Lightning datamodules labels Dec 5, 2024

lccol commented Dec 5, 2024

@microsoft-github-policy-service agree

@adamjstewart adamjstewart added this to the 0.7.0 milestone Dec 5, 2024

@nilsleh nilsleh left a comment


Thanks so much for the contribution, this is great! I have made a first pass of comments below. If you have questions about anything feel free to comment.

@lccol lccol requested a review from nilsleh December 10, 2024 12:51

@adamjstewart adamjstewart left a comment


Mostly looks good now, thanks for the hard work!

Only other comment I would make is that the recommended approach for these "curated" geospatial datasets (RasterDatasets containing both images and masks) is to create a dummy dataset for the images, a dummy dataset for the masks, and an IntersectionDataset that combines them. This usually lets you completely skip the __init__ and __getitem__ since it will inherit from RasterDataset. See L7Irish and L8Biome for examples of these. Up to you whether or not you want to do this since you're almost done, but it could make the code a bit cleaner.


lccol commented Dec 19, 2024

Thank you @adamjstewart for your comments. I have addressed all of them, including converting MMFlood to a RasterDataset-based class, similar to L7Irish and L8Biome. I have just two questions:

  • I noticed that a few entries have missing values, leading to NaNs. I was thinking of either setting all pixel values and mask values to 0 (option 1) or adding a new entry, missing, (option 2) to the dict returned by __getitem__. It would contain a Tensor of the same shape as the mask, with True where the image has missing values and False otherwise. I checked but have not found any other dataset that has NaN values in it. This last option is probably unusual compared to the other datasets in the library... What do you think?
  • From some tests, I found that some of the tiles partially overlap, so different tif files are merged when iterating over the entire dataset. This should be fine, I guess, since images and masks are merged consistently (same order for both the Sentinel-1 and mask data in the reverse painter's algorithm). However, the dataset uses tags in the tif files to store the timestamp of each tile. Is there a way to build the RTree with the temporal information stored in the tags? From my understanding, RasterDataset parses dates directly from the filename...
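
The consistency argument in the second point can be illustrated with a small NumPy sketch. It assumes a "first valid pixel wins" merge (the reverse painter's behaviour mentioned above), uses NaN as a hypothetical nodata marker, and assumes image and mask tiles share the same footprints. The helper name `merge_first` is made up for illustration; this is not torchgeo or rasterio code.

```python
import numpy as np


def merge_first(tiles: list[np.ndarray]) -> np.ndarray:
    """Merge same-shaped tiles; the earliest tile with a valid (non-NaN) pixel wins."""
    merged = np.full_like(tiles[0], np.nan, dtype=float)
    for tile in tiles:
        # Only fill pixels that no earlier tile has written yet.
        empty = np.isnan(merged)
        merged[empty] = tile[empty]
    return merged


nan = np.nan
# Two overlapping image tiles already placed on a common 1x4 grid.
image_a = np.array([[1.0, 2.0, nan, nan]])
image_b = np.array([[nan, 9.0, 3.0, 4.0]])
# Matching mask tiles with the same footprints as the images.
mask_a = np.array([[0.0, 1.0, nan, nan]])
mask_b = np.array([[nan, 0.0, 1.0, 1.0]])

# Same tile order for images and masks keeps the overlap consistent.
merged_image = merge_first([image_a, image_b])
merged_mask = merge_first([mask_a, mask_b])
```

In the overlapping pixel, both the image value and the mask value come from tile A, so the pair stays aligned as long as the two merges use the same tile order.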

@adamjstewart

  • I definitely prefer option 1. Our trainers support ignore_index which can be used to ignore these values during performance computation.
  • You mean the timestamp isn't stored in the filename, it's stored in some kind of metadata? How do you access this metadata? It's possible to override __init__ and extract the appropriate metadata yourself, it's just really ugly.


lccol commented Dec 21, 2024

> I definitely prefer option 1. Our trainers support ignore_index which can be used to ignore these values during performance computation.

OK. I believe I have now implemented it using ignore_index.
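
A minimal sketch of how option 1 could be combined with ignore_index, assuming NumPy arrays with an image of shape (channels, height, width) and a mask of shape (height, width). The value 255 and the name `fill_missing` are assumptions made for illustration, not the code in this PR.

```python
import numpy as np

IGNORE_INDEX = 255  # assumed value; it must match the trainer's ignore_index


def fill_missing(image: np.ndarray, mask: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Zero out pixels with missing data and flag them in the mask."""
    # A pixel counts as missing if any of its channels is NaN.
    missing = np.isnan(image).any(axis=0)
    # Broadcasting sets every channel of a missing pixel to 0.
    clean_image = np.where(missing, 0.0, image)
    clean_mask = mask.copy()
    clean_mask[missing] = IGNORE_INDEX
    return clean_image, clean_mask


# Shape (2, 2, 2): channels, height, width.
image = np.array([[[0.5, np.nan], [0.1, 0.2]],
                  [[0.3, 0.4], [np.nan, 0.6]]])
mask = np.array([[1, 1], [0, 0]], dtype=np.uint8)

clean_image, clean_mask = fill_missing(image, mask)
```

Using a dedicated ignore value rather than 0 in the mask keeps missing pixels distinguishable from the background class.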

> You mean the timestamp isn't stored in the filename, it's stored in some kind of metadata? How do you access this metadata? It's possible to override __init__ and extract the appropriate metadata yourself, it's just really ugly.

Yes, that is correct. To access the timestamp, I would do the following:

import rasterio

with rasterio.open(path, "r") as src:
    timestamp = src.tags()["event_date"]

I haven't checked properly, but I think the current implementation (i.e., not retrieving the timestamp for each tile) should not be a big issue, since I believe most of the overlapping tiles refer to the same date…
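
If the timestamps were needed after all, the tag could be converted into time bounds for an index entry. The sketch below assumes the event_date tag holds an ISO date string such as "2020-10-03", which is not confirmed in this thread. The function name `event_time_bounds` is hypothetical, and it takes a plain dict like the one returned by src.tags().

```python
from datetime import datetime, timedelta, timezone


def event_time_bounds(tags: dict[str, str]) -> tuple[float, float]:
    """Return (mint, maxt) POSIX timestamps covering the tagged event day in UTC."""
    # Assumption: the tag holds an ISO date such as "2020-10-03".
    start = datetime.strptime(tags["event_date"], "%Y-%m-%d").replace(tzinfo=timezone.utc)
    end = start + timedelta(days=1) - timedelta(microseconds=1)
    return start.timestamp(), end.timestamp()


# In practice the dict would come from src.tags() inside rasterio.open(path).
mint, maxt = event_time_bounds({"event_date": "2020-10-03"})
```

Pinning the timezone to UTC keeps the bounds independent of the machine's local time settings.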


@adamjstewart adamjstewart left a comment


More comments to try to make the code simpler

@adamjstewart adamjstewart requested a review from nilsleh January 22, 2025 14:38
adamjstewart previously approved these changes Jan 22, 2025

@adamjstewart adamjstewart left a comment


Looks good now, will give @nilsleh a chance to review

Co-authored-by: Adam J. Stewart <[email protected]>
cache: if True, cache file handle to speed up repeated sampling

Raises:
DatasetNotFoundError: If dataset is not found and *download* is False.

Also raises AssertionError if the split is invalid
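
A minimal sketch of what documenting and raising that error could look like. The class `SplitDataset` and the split names are hypothetical stand-ins, not the MMFlood implementation.

```python
class SplitDataset:
    """Hypothetical stand-in for a dataset that validates its split argument."""

    valid_splits = ("train", "val", "test")  # assumed split names

    def __init__(self, split: str = "train") -> None:
        """Initialize the dataset.

        Args:
            split: one of the names in ``valid_splits``

        Raises:
            AssertionError: If *split* is invalid.
        """
        assert split in self.valid_splits, f"Invalid split '{split}'"
        self.split = split


def is_rejected(split: str) -> bool:
    """Return True if constructing the dataset with this split raises AssertionError."""
    try:
        SplitDataset(split)
    except AssertionError:
        return True
    return False
```

A test can then use a helper like `is_rejected` (or pytest.raises) to check that an unknown split name is refused.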
