Improving Performance with Cloud-Optimized Geotiffs (COGs) - xarray,rasterio,dask #21
As we mentioned in our landsat8 demo blog post (https://medium.com/pangeo/cloud-native-geoprocessing-of-earth-observation-satellite-data-with-pangeo-997692d91ca2), there is still much room for improvement.

Here is a nice benchmarking analysis of reading cloud-optimized GeoTIFFs (COGs) on AWS: https://github.com/opendatacube/benchmark-rio-s3/blob/master/report.md#rasterio-configuration

And discussion of the report here: http://osgeo-org.1560.x6.nabble.com/Re-Fwd-Cloud-optimized-GeoTIFF-configuration-specifics-SEC-UNOFFICIAL-tt5367948.html

It would be great to do similar benchmarking with our example and see if there are simple ways to improve how COGs are read with the combination of xarray, dask, and rasterio.

Pinging some notebook authors on this one: @mrocklin, @jhamman, @rsignell-usgs, @darothen!

Comments
Thanks for posting this @scottyhq. That was a very interesting read. I'm now curious about how rasterio/GDAL handles reading tiles on remote stores like S3. From the report it sounds like they issue a separate synchronous request for each tile. That would indeed be unfortunate, assuming that our tiles are somewhat small. Doing some informal exploration ourselves would probably be a good use of time. I'm also curious what other performance-oriented COG groups do in practice.

Assuming that things are as bad as they say, some possibilities:
- Write our own lighter-weight COG reader
- …
It's worth noting that currently xarray uses a lock around rasterio calls, so this is possibly more extreme on our end.
Not to beat a drum unnecessarily, but this sounds a lot like something that could live in an Intake driver.
Thanks for the input @mrocklin and @martindurant. Personally, I would not go the route of a new reader, because GDAL and rasterio have put significant effort into this. Once we know the best settings for a given rasterio/GDAL version, it seems that Intake would definitely simplify the process of loading data in the most efficient manner!

New configuration options for cloud storage are becoming available with each GDAL release (many post version 2.3). There are some issues in rasterio worth linking to here as well.

@mrocklin can you elaborate on your last comment, please?
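To make that concrete, here is a minimal sketch of the kind of GDAL configuration one can experiment with through rasterio.Env (the option values and URL are illustrative assumptions, not recommendations):

```python
import rasterio

# Illustrative GDAL settings commonly benchmarked for reads from object stores.
env = rasterio.Env(
    GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR",  # skip listing the remote "directory" on open
    CPL_VSIL_CURL_ALLOWED_EXTENSIONS=".tif",   # don't probe for sidecar files
    VSI_CACHE=True,                            # cache downloaded blocks in memory
)

url = "https://landsat-pds.s3.amazonaws.com/c1/L8/042/034/scene/B4.TIF"  # hypothetical URL

with env:
    with rasterio.open(url) as src:
        print(src.profile)  # driver, tiling, block sizes, compression, ...
```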
Perhaps, but I think that you wouldn't want all of the complexity of such a thing within Intake itself. I think that this is lower level. It could be a plugin or whatever, sure, but that's a conversation that would happen much, much later if folks choose to take that route.
I suspect that folks found that the rasterio library was not thread-safe, and so they protect against calling multiple rasterio functions at the same time. This would affect multi-threaded dask workers: if one thread is pulling data using rasterio, any other thread asked to pull data at the same time has to wait until the first one finishes.
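A minimal sketch of that pattern (illustrative names; not xarray's actual implementation): a single shared lock serializes every rasterio read, so dask's threaded scheduler gets no read parallelism, while a process-based scheduler gives each worker process its own copy of the lock.

```python
import threading
import rasterio

_RASTERIO_LOCK = threading.Lock()  # one lock shared by all reads in this process

def read_window(path, window):
    """Read one window; only one thread can be inside rasterio at a time."""
    with _RASTERIO_LOCK:
        with rasterio.open(path) as src:
            return src.read(1, window=window)

# With dask's default threaded scheduler, concurrent calls to read_window queue
# up on the lock; with the multiprocessing or distributed scheduler each process
# has its own lock, so reads from different processes can overlap.
```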
I saw this issue linked from the rasterio repo, so I thought I'd chime in if you don't mind. I'm an author of the benchmarking report linked by @scottyhq.
Thanks for joining this conversation @Kirill888, and thank you for the excellent benchmark. Short term, it sounds like there are two easy wins we might want to pursue.
@Kirill888, regarding the point about layout: let's say I have a 1000x1000 array tiled into 100x100 chunks and I ask for the 500x500 piece in the upper left. My understanding of what you say above is that this issues five separate web requests. Is that correct?
@mrocklin your understanding is correct: assuming a typical TIFF layout and empty caches, this read will generate 5 requests, one for each row of tiles touched. You can see it for yourself by injecting this into …; it will print detailed info (the requested byte range) for every request:

```sh
#!/bin/sh
# Print START and END from each HTTP "Range: bytes=START-END" header.
exec awk -F'[=-]' '/^Range:/{print $2 "\t" $3}' "$@"
```
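To make the arithmetic concrete, here is a rough sketch under the stated assumptions (row-major tile order, with the tiles touched within one tile-row fetched as a single contiguous range request):

```python
import math

def tile_read_stats(read_h, read_w, tile_h, tile_w):
    """Rough cost of reading one window from a tiled COG.

    Assumes row-major tile order and that the tiles touched within a
    single tile-row are contiguous in the file, so each tile-row can be
    fetched with one HTTP range request.
    """
    tile_rows = math.ceil(read_h / tile_h)
    tile_cols = math.ceil(read_w / tile_w)
    return {"tiles_touched": tile_rows * tile_cols, "range_requests": tile_rows}

# 500x500 window over 100x100 tiles: 25 tiles touched, ~5 range requests.
print(tile_read_stats(500, 500, 100, 100))
```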
OK, so we should prefer auto-chunking that assumes row-major order among tiles. We might even consider warning if the user asks for something else. Those are pretty easy to achieve. Thank you @Kirill888 for chiming in. We knew that our access times were much worse than they should have been; this gives us a couple of good things to take on to make progress here.
Thanks so much for helping with this @mrocklin and @Kirill888. I put together an example notebook to help guide a solution, using the master branch of xarray since I thought the recent refactor might affect things (pydata/xarray#2261): https://github.com/scottyhq/xarray/blob/rasterio-env/examples/cog-test-rasterioEnv.ipynb

The notebook captures the GDAL HTTP requests happening in the background and shows pretty clearly that xr.open_rasterio() is issuing more GET requests than required… Is this possible currently? Is there a workaround, or does …? In putting together the notebook I noticed a few other things. Not all of the options from …
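For reference, one way to surface those background HTTP requests (an assumption about the technique, not necessarily what the notebook does) is to turn on GDAL's curl debugging, which prints each request, including its Range header, to stderr:

```python
import rasterio

# GDAL config options that make /vsicurl/ HTTP activity visible on stderr.
with rasterio.Env(CPL_DEBUG="ON", CPL_CURL_VERBOSE="YES"):
    with rasterio.open("https://example.com/some-cog.tif") as src:  # hypothetical URL
        src.read(1, window=((0, 512), (0, 512)))
```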
In this line: … you should experiment with the aspect ratio of the chunk sizes. I suspect that you'll want chunks that are not square, but rather chunks that are as long as possible in one direction.
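For example, something along these lines (a sketch only; the URL and chunk sizes are hypothetical, and good values depend on the file's internal tile layout):

```python
import xarray as xr

url = "https://landsat-pds.s3.amazonaws.com/c1/L8/042/034/scene/B4.TIF"  # hypothetical URL

# Square-ish chunks cut across many tile-rows, multiplying range requests:
# da = xr.open_rasterio(url, chunks={"band": 1, "x": 1024, "y": 1024})

# Chunks stretched along x follow the row-major tile layout, so each chunk
# maps to a small number of contiguous byte ranges:
da = xr.open_rasterio(url, chunks={"band": 1, "x": 8192, "y": 512})
```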
@scottyhq - thanks for putting this together. TL;DR: xarray will probably need some modifications to make some of what you want happen. I'm not particularly surprised xarray's rasterio backend isn't well optimized for this use case. I suspect you are actually the most qualified to take a few swings at modifying the rasterio backend in xarray, and I would be happy to get you spun up on this. Another thing to think about is @mrocklin's PR: pydata/xarray#2255. I'm not sure exactly what stopped progress on this, but you may want to take a look there too.
Our landsat example isn't working with the latest versions of xarray (0.11.0), rasterio (1.0.10), and dask (0.20.2, with distributed 1.24 and dask-kubernetes 0.6). This seems to be an issue with dask distributed having access to global environment variables, but I'm not exactly sure. Note that things work fine without the Kubernetes cluster, but the same command on the cluster results in a CURL error. To me, it seems the workers aren't seeing the os.environ configuration, or don't have access to the certificate path? I'm wondering if the recent xarray refactor might be affecting things (pydata/xarray#2261)? Notebook with the full error traceback here: …
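If the problem really is that environment variables set in the notebook process never reach the Kubernetes workers, one quick way to test (the variable names and CA path here are assumptions for illustration) is to push them onto every worker explicitly:

```python
import os
from dask.distributed import Client

client = Client()  # in the real setup this would be Client(cluster) for the dask-kubernetes cluster

GDAL_ENV = {
    "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",
    "CURL_CA_BUNDLE": "/etc/ssl/certs/ca-certificates.crt",  # hypothetical CA bundle path
}

def set_gdal_env(env=GDAL_ENV):
    """Runs on each worker process; makes the settings visible to GDAL there."""
    os.environ.update(env)
    return {k: os.environ.get(k) for k in env}

# client.run executes the function on every worker and returns the results.
print(client.run(set_gdal_env))
```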
For what it's worth, the error I was encountering:

```
/srv/conda/lib/python3.6/site-packages/rasterio/__init__.py in open()
    215         # None.
    216         if mode == 'r':
--> 217             s = DatasetReader(path, driver=driver, **kwargs)
    218         elif mode == 'r+':
    219             s = get_writer_for_path(path)(path, mode, driver=driver, **kwargs)

rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()

RasterioIOError: CURL error: error setting certificate verify locations:
  CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
```

is due to rasterio 1.0.10 installed via pip. Installing rasterio 1.0.10 via conda-forge resolves the problem. This is a known issue (rasterio/rasterio@b621d92), but for some reason, …
Nice, @scottyhq! Anything left to do here?
Let's keep this open since improvements are showing up regularly. Planning to do some additional benchmarking with xarray/dask/rasterio. In the meantime, here are some recent relevant benchmarks for GDAL, which is ultimately doing all the IO:
- optimizing GDAL configuration
- COG creation and benchmarking for a few datasets on AWS
- various compression settings for COGs