Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving Performance with Cloud-Optimized Geotiffs (COGs) - xarray,rasterio,dask #21

Open
scottyhq opened this issue Oct 19, 2018 · 16 comments

Comments

@scottyhq
Copy link
Member

scottyhq commented Oct 19, 2018

As we mentioned in our landsat8 demo blog post (https://medium.com/pangeo/cloud-native-geoprocessing-of-earth-observation-satellite-data-with-pangeo-997692d91ca2), there is still much room for improvement.

Here is a nice benchmarking analysis of reading cloud-optimized-geotiffs (COGs) on AWS: https://github.com/opendatacube/benchmark-rio-s3/blob/master/report.md#rasterio-configuration

And discussion of the report here:
http://osgeo-org.1560.x6.nabble.com/Re-Fwd-Cloud-optimized-GeoTIFF-configuration-specifics-SEC-UNOFFICIAL-tt5367948.html

It would be great to do similar benchmarking with our example, and see if there are simple ways to improve how COGs are read with the combination of xarray, dask, and rasterio.

Pinging some notebook authors on this one, @mrocklin, @jhamman, @rsignell-usgs, @darothen !

@mrocklin
Copy link
Member

Thanks for posting this @scottyhq . That was a very interesting read. I'm now curious about how rasterio/GDAL handles reading tiles on remote stores like S3. From the report it sounds like they issue a separate synchronous request for each tile. That would indeed be unfortunate, assuming that our tiles are somewhat small.

Doing some informal exploration ourselves would probably be a good use of time. I'm also curious what other performance-oriented COG groups do in practice. Assuming that things are as bad as they say, some possibilities:

  1. See if we can improve GDAL/rasterio to at least read longer ranges when we're asking for several contiguous tiles at once (my hope is that they already do this)
  2. If not, ask people to store data with larger tiles
  3. Write our own lighter-weight COG reader that perhaps does some optimizaions as above, and perhaps operates asynchronously (though this may quickly become a rabbit hole?)

It's worth noting that currently XArray uses a lock around rasterio calls, so this is possibly more extreme on our end.

@martindurant
Copy link
Collaborator

Write our own lighter-weight COG reader

Not to beat a drum unnecessarily, but this sounds a lot like something that could live in an Intake driver.

@scottyhq
Copy link
Member Author

Thanks for the input @mrocklin and @martindurant

Personally, I would not go the route of a new reader because GDAL and rasterio put significant effort into this. Once we know the best settings for a given rasterio/gdal version it seems that Intake would definitely simplify the process of loading data in the most efficient manner!

New configuration options for Cloud storage are becoming available with each GDAL release (many post version 2.3):
https://trac.osgeo.org/gdal/wiki/ConfigOptions#GDAL_HTTP_MULTIRANGE
https://www.gdal.org/gdal_virtual_file_systems.html#gdal_virtual_file_systems_vsis3

There are some issues in rasterio worth linking to here as well:
rasterio/rasterio#1513
rasterio/rasterio#1507
rasterio/rasterio#1362

@mrocklin can you elaborated on your last comment please?

It's worth noting that currently XArray uses a lock around rasterio calls, so this is possibly more extreme on our end.

@mrocklin
Copy link
Member

mrocklin commented Oct 22, 2018 via email

@mrocklin
Copy link
Member

https://github.com/pydata/xarray/blob/671d936166414edc368cca8e33475369e2bb4d24/xarray/backends/rasterio_.py#L311-L314

I suspect that folks found that the rasterio library was not threadsafe, and so they protect calling multiple rasterio functions at the same time. This would affect multi-threaded dask workers. If one worker was pulling data using rasterio then any other worker asked to pull data at the same time would wait until the first one finished.

@Kirill888
Copy link

I saw this issue linked from rasterio repo, I thought I'll chime in if you don't mind. I'm an author of that benchmarking report linked by @scottyhq.

  1. GDAL doesn't read one tile a time, adjacent tiles (in byte space) are read together in one http fetch. Cloud Optimized GeoTiffs have tiles stored in row-major order. So reads that span multiple tiles horizontally will be faster than reads that span multiple tiles vertically. GDAL's vsicurl io driver will not attempt to read bytes you haven't requested, even if it would be faster to do so in S3 case, but it does coalesce adjacent requests into one.

  2. About the lock: rasterio default open behaviour is to re-use already opened file handles (this happens inside GDAL library). And GDAL concurrency model assumes one file handle per thread, so file handle re-use breaks concurrency when two threads read from the same resource even if each one opened "their own version". However more recent versions of rasterio have sharing=False parameter, using that, GeoTIFF access should be thread-safe. But if you are accessing netcdfs/hdf5 files, sharing=False won't help since hdf5 library is not thread-safe at all, it assume one single thread accesses the whole library at a time (it's really bad like that). So if you want to support any file type that gdal supports in a thread-safe manner, the safest thing to do is to grab a global process lock ( which is what xarray devs are doing). I guess one could detect file type and have a whitelist of "safe io dirvers" that can be read concurrently.

@mrocklin
Copy link
Member

Thanks for joining this conversation @Kirill888 . And thank you for the excellent benchmark.

Short term it sounds like there are two easy wins that we might want to do:

  1. Change XArray's rasterio-lock to be per file rather than for the full library, or (perhaps easier) use the sharing=False parameter and see the performance effect of that.
  2. Change our default chunking to favor row-major order

@Kirill888 regarding the point about layout. Lets say I have a 1000x1000 array tiled into 100x100 chunks and I ask for a 500x500 piece in the upper left.

x[:500, :500]

My understanding of what you say above is that this issues five separate web requests. Is that correct?

@Kirill888
Copy link

@mrocklin your understanding is correct, assuming typical tiff layout and empty caches this read will generate 5 requests, one for each x[:100, :500] shaped slices. If however you were to read x[:500,:] this would be done in one request, and will probably be faster when reading from S3 bucket in the same region.

You can see it for yourself by injecting this in to rasterio.Env:

  • CPL_CURL_VERBOSE=True

This will print detailed info for everything libcurl is doing on behalf of GDAL to stderr. To just extract byte ranges read I use this snippet:

#!/bin/sh
exec awk -F'[=-]' '/^Range:/{print $2 "\t" $3}' $@

@mrocklin
Copy link
Member

OK, so we should prefer auto-chunking that assumes row-major order among tiles. We might even consider warning is the user asks for something else. Those are pretty easy to achieve.

Thank you @Kirill888 for chiming in. We knew that our access times were much worse than they should have been, this gives us a couple good things to take on to make good progress here.

@scottyhq
Copy link
Member Author

Thanks so much for helping with this @mrocklin and @Kirill888.

I put together an example notebook to help guide a solution using the master branch of xarray, since I thought the recent refactor might affect things (pydata/xarray#2261):

https://github.com/scottyhq/xarray/blob/rasterio-env/examples/cog-test-rasterioEnv.ipynb

The notebook captures the GDAL HTTP requests happening in the background and shows pretty clearly that xr.open_rasterio() is issuing more GET requests than required...

we should prefer auto-chunking that assumes row-major order among tiles

Is this possible currently? Is there a work around, or does xr.open_rasterio need modifying?

In putting together the notebook I noticed a few other things. Not all of the options from rasterio.open are available from xr.open_rasterio. Relevant to this discussion in particular is sharing=False. The machinery of xr.open_rasterio is a little over my head, so I'm hoping someone else can help out here. I don't think it's as simple as allowing for **kwargs...
https://github.com/pydata/xarray/blob/671d936166414edc368cca8e33475369e2bb4d24/xarray/backends/rasterio_.py

@mrocklin
Copy link
Member

we should prefer auto-chunking that assumes row-major order among tiles

Is this possible currently? Is there a work around, or does xr.open_rasterio need modifying?

In this line:

with xr.open_rasterio(httpURL, chunks={'band': 1, 'x': xchunk, 'y': ychunk}) as da:

You should experiment with the aspect ratio of chunk sizes. I suspect that you'll want chunks that are not square, but rather chunks that are as long as possible in one direction.

@jhamman
Copy link
Member

jhamman commented Oct 27, 2018

@scottyhq - thanks for putting this together. TLDR; xarray probably will need some modifications to make some of what you want happen.

I'm not particularly surprised xarray's rasterio backend isn't well optimized for this use case. I suspect you are actually the most qualified to take a few swings at the modifying the rasterio backend in xarray. I would be happy to get you spun up on this.

Another thing to think about is @mrocklin's PR: pydata/xarray#2255. I'm not sure exactly what stopped progress on this but you may want to take a look there too.

@scottyhq
Copy link
Member Author

Our landsat example isn't working with the latest versions of xarray (0.11.0), rasterio (1.0.10) and dask - 0.20.2, distributed 1.24, dask-kubernetes 0.6). this seems to be an issue w/ dask distributed having access to global environment variables, but I'm not exactly sure...

Note things work fine w/o the kubernetes cluster, but the same command on the cluster results in a CURL error. To me, it seems workers aren't seeing the os.environ configuration or don't have access to the certificate path? I'm wondering if the recent xarray refactor might be affecting things (pydata/xarray#2261)?

Notebook here with full error traceback:
https://github.com/scottyhq/cog-benchmarking/blob/master/notebooks/landsat8-cog-ndvi-mod.ipynb

Live example here:
Binder

@scottyhq
Copy link
Member Author

For what it's worth, the error I was encountering:

/srv/conda/lib/python3.6/site-packages/rasterio/__init__.py in open()
    215         # None.
    216         if mode == 'r':
--> 217             s = DatasetReader(path, driver=driver, **kwargs)
    218         elif mode == 'r+':
    219             s = get_writer_for_path(path)(path, mode, driver=driver, **kwargs)

rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()

RasterioIOError: CURL error: error setting certificate verify locations:   CAfile: /etc/pki/tls/certs/ca-bundle.crt   CApath: none

Is due to rasterio 1.0.10 installed via pip. Installing rasterio 1.0.10 via conda-forge resolves the problem.

This is a known issue (rasterio/rasterio@b621d92), but for some reason, export CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt didn't resolve the error when running on a dask distributed cluster.

@jhamman
Copy link
Member

jhamman commented Nov 27, 2018

nice @scottyhq! Anything left to do here?

@scottyhq
Copy link
Member Author

scottyhq commented Jan 12, 2019

Let's keep this open since improvements are showing up regularly:

Planning to do some additional benchmarking with xarray/dask/rasterio. In the meantime, here are some recent relevant benchmarks for GDAL, which is ultimately doing all the IO:

optimizing GDAL configuration:
https://lists.osgeo.org/pipermail/gdal-dev/2018-December/049508.html

COG creation and benchmarking for a few datasets on AWS:
https://github.com/vincentsarago/awspds-benchmark
https://medium.com/@_VincentS_/do-you-really-want-people-using-your-data-ec94cd94dc3f

On various compression settings for COGs:
https://kokoalberti.com/articles/geotiff-compression-optimization-guide/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants