Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to map GeoZarr to datasets with man, many multiple CRSs #53

Open
rbavery opened this issue Oct 23, 2024 · 109 comments
Open

How to map GeoZarr to datasets with man, many multiple CRSs #53

rbavery opened this issue Oct 23, 2024 · 109 comments

Comments

@rbavery
Copy link

rbavery commented Oct 23, 2024

How should a STAC Collection containing many geospatial rasters that span many CRSs map to GeoZarr?

For example if I have Sentinel-2 or Landsat data spanning UTM zones, I don't want to have to reproject every scene to a common CRS in order to take advantage of chunking and sharding.

utmzones

As I'm loading scenes from GeoZarr , making an ML prediction, and georeferencing the result, I also want to keep the original CRS so that I can keep the area preserving projection.

It seems like this has been discussed a bit here

It may be helpful to specify in which case a dataset has children dataset and how to index the hierarchy (Combined with STAC object, the GeoZarr Dataset might be restricted to a single variable and only contains multiscales as children alternative variables)

From #34

and @benbovy had a helpful proposal of how this might be achieved in Xarray. If this is to big of a change to Xarray maybe it would be possible to achieve this index of indexes in ran extension to Xarray? rioxarray?

pydata/xarray#9543 (comment)

curious if others have thoughts on this. I don't have a well formulated design yet, still in the stage of reading and digesting the discussions up to today.

@mdsumner
Copy link

mdsumner commented Oct 24, 2024

My opinion is it shouldn't, if you want a standardized index set one up with the footprint of every source to a common crs, use that to index and filter to the sources that will contribute to a final "render" in that common crs, or to any other grid spec required.

This is how standard tools already work (the GDAL API for example), and how specific implementations that load from stac work: gdalcubes, opendatacube, stackstac, the GDAL STACIT driver and (more generally) the GDAL GTI driver.

Look at how odc-geo maps pixel index coordinates from various crs to the tile pixel index of a target source, I suppose you could store references that do that, VirtualiZarr-like, but it's so easily calculated in the fly for an on-demand grid spec I really don't see the value. (Ok maybe I've talked myself into illustrating something ... this was brought up by Deepak in terms of getting the index and weights from GDAL, so maybe that's what's really being asked here, to virtualize and undercut current libs). 😆

here's the mailing list pointers to how one might get "index and weights" from the warper: https://lists.osgeo.org/pipermail/gdal-dev/2024-April/058906.html

In opendatacube I think it's a slightly different motivation, because they are trying to cull-filter the mapping between scenes and which tile/s they contribute to , so you can line up sources and target tiles (aka 'chunks' ...) . So yeah, maybe I've framed this topic in a way I can negotiate now (!)

@benbovy
Copy link

benbovy commented Oct 24, 2024

From GDAL's GTI driver docs:

The tiles are listed as features of any GDAL supported vector format.

This would translate much more naturally to a GeoPandas' GeoDataFrame of Xarray DataArray objects than a plain data cube (see here for how this could work). The GeoDataFrame geometry column would store the footprint (bbox) of each scene in a common CRS so we can easily retrieve the scenes by spatial filtering. Basically the same approach than the one @mdsumner describes and that is indeed used by other tools (opendatacube, STAC tools, etc.), although those tools rely on other mechanisms than a GeoDataFrame for indexing the scenes.

This representation of individual scenes + geospatial index has some limitations, e.g., we cannot easily slice along other dimensions (time, band, etc.) as we could do with a plain datacube, but I guess that in most applications it is better anyway to do some spatial filtering first. It is also easier to reason about especially when the scenes are heterogeneous (i.e., different shapes and/or CRS), sparsely distributed and/or partially overlapping. In any case, this is much easier to implement than what I suggest in pydata/xarray#9543 (comment).

In terms of storage, with this approach we would still want to store the scenes each as a separate GeoZarr dataset. We could then store their footprint as a Geoparquet file or even into a basic relational database.

IIUC odc-geo's GridSpec is an alternative way of indexing the scenes in the specific (but pretty common?) case where those are arranged like a regular spatial grid. I'm wondering if in this case we couldn't have a Zarr "meta" dataset storing the information of each tile, i.e., their CRSs (e.g., using EPSG codes), their geo-transform (N+1 dimension array where the last dimension represents the 6 affine parameters) and even their footprint (assuming that GeoZarr would support vector geometries #28). That might be a little far-fetched, though, since we don't really need the full power of Zarr (cloud-optimized / n-arbitrary dimensions) to store this kind of information (pretty cheap, 2-dimensions only).

@benbovy
Copy link

benbovy commented Oct 24, 2024

Also xref geoarrow/geoarrow#24, which is relevant for this use case I think.

@RichardScottOZ
Copy link

What about xvec?

@benbovy
Copy link

benbovy commented Oct 24, 2024

Xvec and vector data cubes in general work best with geometries having attributes that also depend on other, non-geospatial dimensions like time, etc. Here this is a bit different: each geometry materializes the extent of a raster.

Similarly to a GeoDataFrame of DataArrays, maybe we could imagine an xvec DataArray of DataArrays? I'm not sure at all if/how this would work, though...
Would there be any advantage of using xarray DataArrays over simpler structures like a GeoDataFrame or odc-geo's GridSpec here?

Perhaps Xvec may be useful for loading and handling the Zarr "meta" dataset (cf. my suggestion above)? Off the top of my head something similar to how we use stackstac to query and turn a STAC collection into a DatArray, but here exclusively based on (Geo)Zarr and Xarray (Xvec).

@mdsumner
Copy link

From GDAL's GTI driver docs:

The tiles are listed as features of any GDAL supported vector format.

This would translate much more naturally to a GeoPandas' GeoDataFrame of Xarray DataArray objects than a plain data cube (see here for how this could work).

It *is a * vector layer when you treat it as one (it's an abstraction), when you treat it as a "GTI:" it is a raster (a 2D datacube with 1 or more vars), the entire point of the driver - and exactly like loading from stac into xarray via the various rasterio wrappers that exist now. I should have made that clearer, it doesn't present as a data frame when you access it via GTI, and the grid-resolving tasks are not invoked at all until you trigger a read-tasks. I'm not suggesting to crush tabular data into a Zarr, just that the idea of having multiple grid-specs somehow lined up in an array doesn't make much sense ... unless you are trying to index the details of how each footprint and underlying tiling maps into the final array (?). I think there is some utility in that.

@benbovy
Copy link

benbovy commented Oct 24, 2024

Thanks for the clarification @mdsumner ! So if I understand well GDAl GTI (collection of rasters -> virtual raster) is quite similar to stackstac (STAC items -> lazy xarray DataArray), right?

To be clear, I'm not suggesting either to crush tabular data into Zarr. I guess I was trying to say that a table of rasters (data frame) vs. a mosaic of rasters (single array) have both pros and cons, but that might be slightly off-topic.

I also agree that a multi-crs array doesn't make much sense and that something like a tile index would be useful. Two questions:

  1. Is there any interest in adding the tile index itself to the GeoZarr specs? Or should we leave that to other specs (e.g., STAC + GeoZarr)?

  2. If the answer to 1 is yes, how the spec would look like for the tile index? What do we need? Can / should we put both the tile index and all the individual tiles in the same Zarr dataset (leveraging Zarr's group hierarchy)?

@mdsumner
Copy link

So if I understand well GDAl GTI (collection of rasters -> virtual raster) is quite similar to stackstac (STAC items -> lazy xarray DataArray), right?

yes - I was a too reactive on the "crush" thing, I went and made an example and while it's amazingly simple to craft, it's quite abstract and needs some explaining. I think it's just not worth mapping disparate sources to an actual array, the array has a spec (rank, coordinates) and that's completely abstract, you can pour any sources into that so I guess it comes down to how to think about the catalogue of possible sources and when the mapping starts to occur.

(and I think that mapping is interesting, like for a given target grid specification, which sources are actually relevant, but maybe filtering spatially is best done after temporal and data type filtering etc)

your Qs I don't think so - I think the catalogue topic is very dynamic and will change and consolidate a lot in fairly short time. I'm concerned that STAC is treated specially, because why not have any old grid sources in a lighter collection - GTI (and the GeoParquet example, already compatible with GTI shows you can do that, STAC becomes a bespoke case folded in.

here's a very very simple GTI example I crafted, two disparate polar crs grids to a shared global one: https://gist.github.com/mdsumner/724f1db0bd0d211de6e3775486cd3df0

@maxrjones
Copy link
Member

From previous conversations, I believe @ljstrnadiii may have some experience with the DataFrame of DataArray's approach to this problem. Len, am I remembering that right and do you have any insights to share for this conversation?

@ljstrnadiii
Copy link

Yeah, @maxrjones I used a similar schema as GTI for the purpose of indexing tiles for different collections (features), date times, and crs. Columns in my geopandas df are geometry (bbox in the geopandas df's crs coordinates), path to cog, collection name, datetime, crs.

My goal is to reproject and resample heterogeneous tiles (w.r.t. crs) into a single Xarray data array with the final dims: band, time, y, x. However, gdalwarp wasn't happy when I provided a list of tiles in different crs. This could be a user error thing lol. So, I first query relevant tiles from parquet/geopandas, then groupby crs, mosaic with vrt, warp to target crs, and finally mosaic warped vrts. I then do that per collection and per datetime.

I am actually really excited to try out GTI soon for larger mosaics because the current approach seems to take time and I think requires http request per tif to discover metadata to build the vrt. I can build GTI data while I'm ingesting the tifs and avoid the lookup and provide it per tif as recommended in the GTI docs.

My guess is that GTI will help a ton with that and seems like it supports heterogeneous tiles crs and resolution!

In regards to the post, I don't see how chunking is relevant across crs areas without reprojecting. I'm failing to follow here, but I don't have a good developer-level understanding of geozarr, GDAL, or Xarray.

I suppose I'd try GTI to reproject resample to one mosaic Xarray and then apply the model for predictions, etc. and reproject back to UTM per zone. That depends on information loss and is honestly a question I've had: does going from a particular UTM zone to 3857 (for example) and back cause data loss/error? It will certainly generate redundant data. Or maybe make one Xarray data array per UTM zone and chunk within utm zones mosaiked by GTI?

I have tried to use a dimension in the past for tile id for example, but that was awkward since each tile has different heights/widths and therefore coordinates indexed by tile id so concatenating over tile id was not ideal not to mention edge effects, etc.

@mdsumner
Copy link

However, gdalwarp wasn't happy when I provided a list of tiles in different

That definitely works, keen to follow up on this. (You can see in my gist above that it works). Certainly it has improved a lot in recent years, I used to encounter crashes a lot ~8 years ago but it's super robust now. I usually use it via the API directly, the high level Warp() function in C++ , or via python or R.

Super keen to explore GTI with you, GDAL is also very responsive and developing fast, and some parts don't get tested enough by the community (or tested too late via downstream packages).

🙏

@RichardScottOZ
Copy link

I have a problem like Len's currently - not tiles but global heterogeneous surveys at different resolutions and shapes and CRS - so 'domain' is a possible index - gravity, magnetics, etc.

Might not just be errors @ljstrnadiii but artefacts too?

@rbavery
Copy link
Author

rbavery commented Oct 26, 2024

Wow thanks all for the quick and insightful comments! @mdsumner great recommendation on GTI that short example you provided is compelling. Looking forward to trying it out.

@ljstrnadiii you laid out a lot of options I've thought of trying and more I hadn't thought of, thanks!

In regards to the post, I don't see how chunking is relevant across crs areas without reprojecting.

My line of thinking was roughly what @benbovy discusses above with more clarity wrt to the Zarr "meta" dataset that tracks CRSs, geo-transform, and footprint. Each Zarr in a group would be for a single CRS. Reprojections would happen only on the fly if needed by the operation, like with GTI. Writing this out, it sounds like GTI with STAC geoparquet referencing COGs could be a great solution.

The only thing I'm wondering though is that Zarr v3 seems to have better support than other array formats for sharding, making it easy to rechunk, and compressing chunks with different methods. Or maybe (likely) this is all available in GDAL and I just need to chat with @mdsumner :)

I suppose I'd try GTI to reproject resample to one mosaic Xarray and then apply the model for predictions, etc. and reproject back to UTM per zone. That depends on information loss and is honestly a question I've had: does going from a particular UTM zone to 3857 (for example) and back cause data loss/error?

There's definitely some data corruption at higher latitudes but I'm very unsure on the extent of it and I think it is dependent on the size of the reprojected regions and latitude since mercator distorts more toward the poles. This makes me think it would be nice to tell GTI to reproject adjacent UTM scenes not to Web Mercator (or another virtual mosaic CRS) but something more locally accurate for the two scenes involved. Like one of the UTM projections of one of the scenes, determined by which UTM zone has the majority of pixel area.

Anyway I think the two questions posed by Benoit are good to discuss more! I'm curious to hear more from @mdsumner on his take that GeoZarr should not have a tile index.

the catalogue topic is very dynamic and will change and consolidate a lot in fairly short time.

I had the sense that GeoZarr is where this sort of OGC standardization on cataloguing should take place.

@RichardScottOZ
Copy link

However, gdalwarp wasn't happy when I provided a list of tiles in different

That definitely works, keen to follow up on this. (You can see in my gist above that it works). Certainly it has improved a lot in recent years, I used to encounter crashes a lot ~8 years ago but it's super robust now. I usually use it via the API directly, the high level Warp() function in C++ , or via python or R.

Super keen to explore GTI with you, GDAL is also very responsive and developing fast, and some parts don't get tested enough by the community (or tested too late via downstream packages).

🙏

How far have you seen this taken Michael? In terms of lots of CRS, size etc?

@mdsumner
Copy link

More broadly than to any grand scale, I've been flushing through many GDAL issues that prevent it being used with problematic sources, the vrt:// protocol in particular avoids having to instantiate actual VRT or apply workarounds after open in R or python (which is often too late to still exploit core GDAL-laziness and other features).

There's usually a way to stream any (2D slice or 2D multiband interpretation of) a source through the warper, even curvilinear model output, and includes sources in netcdf/hdf/legacy file or linking multiple component files together. The downstream packages tend to have a too-high view of these things. I like to be able to open any source, perhaps with some light assigned vrt:// fixes and then simply read any window in any crs for a quick look. That's also scaleable to massive outputs that are written target-tile at a time, and is exactly how odc-geo/opendatacube works.

I think the entire question behind this issue comes down to 1) familiarity with GDAL warper and what it can do, and 2) wanting to have some kind of pre-mapped index of what source/s will impact what parts of a target output, i.e. essentially what odc-geo and other cubes do to farm out actual calls to the warper api between huge Sentinel collections and a target tiling.

At the high level, you can take all your input sources and ask GDAL to scan them to write "just this one tile" (imagine the COP30 VRT with 10000s of longlat geotiffs), but that's inefficient if you don't have the VRT structure - GTI allows GDAL to pre-scan more efficiently from even disparate sources by spatial index which sources are actually relevant, and odc-geo takes that further by mapping actual source footprint fragments to output tiles, and culling any cases of calling the warper that won't have any impact (it also applies masking logic to source tiles upfront as well).

I suppose a Zarr target grid that stored information about the list of relevant sources that will impact each chunk would be useful, and for extra points where those different footprints are between source and target chunks. The scheduler could farm out the sources to chunks very efficiently, and save that index for new data that's coming in from the same grid/s for a given target grid.

@RichardScottOZ
Copy link

e.g., my first look is

gdaltindex -f GPKG grid_tile_index.gti.gpkg cog/*.tif
bash: /home/ubuntu/miniconda3/envs/grids/bin/gdaltindex: Argument list too long

@RichardScottOZ
Copy link

RichardScottOZ commented Oct 29, 2024

so presumably then you have to build in code

This is 25K 'tiles' - a LOT of different crs flavours - a bunch of different domains - so what would your approach be there, from low-level on up?

@mdsumner
Copy link

Can we get an example? I realised one that I have, I'll come back once I put a couple of things together.

@RichardScottOZ
Copy link

RichardScottOZ commented Oct 29, 2024

An example of what might be in there, small version? I could certainly find plenty of public analogues. For example, a small project raster in SA say 200x300, 32353, a chunk of brazil in utm 19s or something...map of sweden in 3006 in the thousands and a 20kx10k chunk of Africa...and tons of other small things...probably resolutions varying from 25m to a few km.

@mdsumner
Copy link

mdsumner commented Oct 29, 2024

ok sure I see, but what would you do with that? you want to create a image server or tile pyramid of whole world?

(my example consists of two 2m DEMS, one Antarctica polar stereographic in a VRT of thousands of COGs with additional overviews added, a 2m DEM of Tasmania in UTM, and two background global topographies COP30m for land-only, and GEBCO for the entire world topo and bathy) - from that I can get a map of topography anywhere at reasonable scale, and by collecting the world's local hi-res DEMS we could have one grand master topography index, but I think about it for random queries in particular places, or large areas at "reasonable scale" with no special-case problems like "no bathymetry". It covers my needs realy well, topo for any creature flying or swimming, and maps of cool places I'm always making, but that's pretty niche requirement I think.

I'm struggling to understand motivations, but excited to put mine above together to show what I'm on about a bit!

@RichardScottOZ
Copy link

No, a dataset for analysis mostly, not for people to look at in the eyeball sense [although of course that is nice too if can tile it for a WMS or whatever] - could slice by domain, use to infill or combine with other data - and also it is machine learning useable.

Geophysics in this case, not 'look at a mountain location' type map.

Being able to query - give me this country/polygon/box is useful too at times.

This is just one group, there will be others from other places to add in the future, and more new data as it gets produced.

@RichardScottOZ
Copy link

As I understand it though @mdsumner

  • Make a vector dataset with the bounds of each raster in geometry field and the location field is the file path

  • GTI creation references this

  • You tell GTI a CRS and resolution [or other relevant GDAL info] that you want this to end up as.

  • Now if you have a LOT - e.g. Len's above - should each raster be a VRT?

@rbavery
Copy link
Author

rbavery commented Oct 30, 2024

you want to create a image server or tile pyramid of whole world?

Not uncommon to need to serve a large fraction of it to a machine learning model. Like all tiles intersecting most continental coastlines, all tiles intersecting major shipping traffic, all US states, etc. Imagery of interest is 10 meter or possibly higher. And for a long time range, multiple years for sensors with weekly or monthly revisits. Example: https://skytruth.org/cerulean/

@mdsumner
Copy link

mdsumner commented Oct 30, 2024

  • Now if you have a LOT - e.g. Len's above - should each raster be a VRT?

It doesn't matter, you can use VRT to collate things into a single source, including external overviews (which I think is amazing and I do here soon to be updated with object storage ...)

GTI can collate them from disparate sources.

I hear your examples, I don't get what Zarr brings to it, other than a virtual grid frame to standardize to (?). Maybe that's it, effectively what sentinel cubes do but you'd use a sparse Zarr to cache materialized pixel values in a n-D frame.

@ljstrnadiii
Copy link

Yes, caching/persisting the final target grid feels like the most common use case.

Maybe there is or could be some overlap with purpose of virtualizar here in the sense that if a user has many collection-dates (one GTI per date and per collection e.g. product of 10 dates with sentinel2, landsat, modis would be 30 GTI files/indexes) we could build a single aggregate GTI manifest if the combined data if the user doesn't want to duplicate data by writing to zarr, but I would imagine persisting the final combined collection-dates Xarray/zarr dataset would be ideal given resources/time used to reproject/resample. I very commonly take this approach of one vrt per year and collection in order to build one "ML ready" dataset--excited to try GTI very soon! I'll often have on the order of 10s of dates and 10s of collections I'm hoping to put on a standardized grid.

On the other hand we might not want to cache/persist to zarr if we need to parameterize inputs to a connection string e.g. resolution or overview, etc. (assuming GTI somewhat supports that, which I do recall from the docs just not sure to what extent). Then a user would have to index multiple GTI files, open dataset engine rasterio, concat along band, concat along time each time we open, which is fine, but I'm sure someone will have an example that pushes the limits there.

Thinking through some of these examples above and writing to zarr, it seems the sparse nature of some of this high res data would motivate us to only write non-empty chunks. Given we write out from GTI to zarr, having some index on which chunks intersect the geometry induced by GTI might be helpful for future dask optimizations around resolving empty chunks as no-ops... This might not be required though for the case we do persist/cache to zarr given missing chunks (not written if providing write_empty_chunks=False) could already be a way to infer no-ops. This is more of a dask/Xarray topic only though I suppose.

Doesn't seem geozarr/zarr would help with any of this?

@RichardScottOZ
Copy link

imagine persisting the final combined collection-dates Xarray/zarr dataset would be ideal given resources/time used to reproject/resample

Yes - GTI = new, so no-one knows how long this is likely to take at at scale? If making many models and it is a significant time each time what is the computation time vs excess storage costs [assuming that it works].

@ljstrnadiii
Copy link

ljstrnadiii commented Oct 31, 2024

Correct me if I'm wrong, but I would imagine the reprojection and resampling computation (assume a GTI or VRT already exists) would be identical between xr.open_dataset with GTI or VRT? And similarly if chunks is not None and we have a dask cluster?

I think some speed gains are (assuming you already have geometry and location and potentially other fields as recommended) won due to avoiding the need to inspect each tif needed to make each VRT. It sounds like as long as you already have the metadata for each tif, the GTI index gives us immediate lazy access to the data on the target grid e.g. xr.open_dataset("GTI:...", engine="rasterio", chunks="auto"). That could shake things up in the resample/reproject on the fly dilemma. I suppose the more sparse it is, the better it is for computing on the fly. I also bet the difference in resolution and some projection transforms influence computational needs.

@RichardScottOZ
Copy link

Ok, I create a geopackage [with I create with EPSG:6933 as a test] with the required fields, then

ogr2ogr -f GPKG -mo SRS=EPSG:4326 -mo RESX=10000 -mo RESY=10000 output.gti.gpkg testgti.gpkg

then

import xarray

ds = xarray.open_dataset("GTI:output.gti.gpkg", engine = "rasterio")

gives

this is just using first 10 rasters from a list as a test

<xarray.Dataset> Size: 508MB
Dimensions:      (band: 1, x: 2321, y: 27326)
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 19kB 4.201e+05 4.203e+05 ... 6.718e+05 6.719e+05
  * y            (y) float64 219kB 8.495e+06 8.495e+06 ... 5.531e+06 5.531e+06
    spatial_ref  int64 8B ...
Data variables:
    band_data    (band, y, x) float64 507MB ...

@mdsumner example had metdata creation options I think - the above being wrong, so I need to fix something I imagine.

@RichardScottOZ
Copy link

gdalinfo output.gti.gpkg
Driver: GTI/GDAL Raster Tile Index
Files: output.gti.gpkg
       output.gti.gpkg.aux.xml
Size is 2321, 27326
Coordinate System is:
PROJCRS["WGS 84 / NSIDC EASE-Grid 2.0 Global",
    BASEGEOGCRS["WGS 84",
        ENSEMBLE["World Geodetic System 1984 ensemble",
            MEMBER["World Geodetic System 1984 (Transit)"],
            MEMBER["World Geodetic System 1984 (G730)"],
            MEMBER["World Geodetic System 1984 (G873)"],
            MEMBER["World Geodetic System 1984 (G1150)"],
            MEMBER["World Geodetic System 1984 (G1674)"],
            MEMBER["World Geodetic System 1984 (G1762)"],
            MEMBER["World Geodetic System 1984 (G2139)"],
            MEMBER["World Geodetic System 1984 (G2296)"],
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ENSEMBLEACCURACY[2.0]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4326]],
    CONVERSION["US NSIDC EASE-Grid 2.0 Global",
        METHOD["Lambert Cylindrical Equal Area",
            ID["EPSG",9835]],
        PARAMETER["Latitude of 1st standard parallel",30,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8823]],
        PARAMETER["Longitude of natural origin",0,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["False easting",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["easting (X)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["northing (Y)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Environmental science - used as basis for EASE grid."],
        AREA["World between 86°S and 86°N."],
        BBOX[-86,-180,86,180]],
    ID["EPSG",6933]]
Data axis to CRS axis mapping: 1,2
Origin = (420087.500000000000000,8495380.000000000000000)
Pixel Size = (108.502270812988854,-108.502270812988854)
Metadata:
  RESX=20000
  RESY=20000
  SRS=EPSG:4326
Corner Coordinates:
Upper Left  (  420087.500, 8495380.000)
Lower Left  (  420087.500, 5530446.948) (  4d21'13.89"E, 48d59'55.44"N)
Upper Right (  671921.271, 8495380.000)
Lower Right (  671921.271, 5530446.948) (  6d57'50.06"E, 48d59'55.44"N)
Center      (  546004.385, 7012913.474) (  5d39'31.97"E, 72d50'49.58"N)
Band 1 Block=256x256 Type=Float64, ColorInterp=Gray
  NoData Value=nan

That is in 6933 from geopackage still - and maybe whatever resolution the first raster was.

@mdsumner
Copy link

mdsumner commented Nov 1, 2024

yes, and I think we could add some pointers to examples in R and python, or maybe implement something in GDAL to do this for us, that would be cool, I'll explore.

@mdsumner
Copy link

mdsumner commented Nov 1, 2024

what is approxfun() in python?

bb <- c(-180, -90, 180, 90)

#' Densify from vector, for use in `dens_bb`
#' @x input vector
#' @n length of output, default 48
dens <- function(x, n = 48)  approxfun(seq(0, 1, length.out = length(x) ), x)(seq(0, 1, length.out = n))

#' Dense polygon outline from bbox input
#' @x bbox xmin,ymin,xmax,ymax
#' @n number of points for output, default 48 (12 per side)
dens_bb <- function(x, n = 48) {
  if (length(x) != 4L) stop("input x must be length 4")
 if ((diff(x[c(1, 3)]) <= 0) || (diff(x[c(2, 4)])) <= 0) stop("invalid bbox xmax <= xmin or ymax <= ymin")
 exoutline <- matrix(x[c(1, 3, 3, 1, 1, 2, 2, 4, 4, 2)], ncol = 2)
 apply(exoutline, 2, dens) 
}

dens_bb(bb)

@RichardScottOZ
Copy link

interp from scipy probably?

@RichardScottOZ
Copy link

So from suggestions:

import xarray
import geopandas as gpd
import rioxarray

gdf = gpd.read_file('testgti.gti.fgb')
print(gdf.crs)

open_kwargs = {"resx":1,"resy":1}
ds = xarray.open_dataset("GTI:testgti.gti.fgb", engine = "rasterio", open_kwargs =open_kwargs)
print("\n",ds)

ds = rioxarray.open_rasterio("GTI:testgti.gti.fgb", engine = "rasterio", resx=0.1, resy=0.1)
print("\n",ds)

ds = rioxarray.open_rasterio("GTI:testgti.gti.fgb", engine = "rasterio", xsize=3600, ysize=1800, minx=-180,miny=-90,maxx=180,maxy=90,\
geotransform='-180, 0.1, 0, 90, 0, -0.1')
print("\n",ds)

ds = rioxarray.open_rasterio("GTI:testgti.gti.fgb", engine = "rasterio", minx=-180,miny=-90,maxx=180,maxy=90, resx=0.1, resy=0.1)
print("\n",ds)

EPSG:4326

 <xarray.Dataset> Size: 185kB
Dimensions:      (band: 1, x: 265, y: 86)
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 2kB -115.8 -114.8 -113.8 ... 146.2 147.2 148.2
  * y            (y) float64 688B 49.7 48.7 47.7 46.7 ... -33.3 -34.3 -35.3
    spatial_ref  int64 8B ...
Data variables:
    band_data    (band, y, x) float64 182kB ...

 <xarray.DataArray (band: 1, y: 851, x: 2643)> Size: 18MB
[2249193 values with dtype=float64]
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 21kB -116.3 -116.2 -116.1 ... 147.7 147.8 147.9
  * y            (y) float64 7kB 50.15 50.05 49.95 ... -34.65 -34.75 -34.85
    spatial_ref  int64 8B 0
Attributes:
    _FillValue:    nan
    scale_factor:  1.0
    add_offset:    0.0

 <xarray.DataArray (band: 1, y: 1800, x: 3600)> Size: 52MB
[6480000 values with dtype=float64]
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 29kB -179.9 -179.8 -179.8 ... 179.8 179.9 180.0
  * y            (y) float64 14kB 89.95 89.85 89.75 ... -89.75 -89.85 -89.95
    spatial_ref  int64 8B 0
Attributes:
    _FillValue:    nan
    scale_factor:  1.0
    add_offset:    0.0

 <xarray.DataArray (band: 1, y: 1800, x: 3600)> Size: 52MB
[6480000 values with dtype=float64]
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 29kB -179.9 -179.8 -179.8 ... 179.8 179.9 180.0
  * y            (y) float64 14kB 89.95 89.85 89.75 ... -89.75 -89.85 -89.95
    spatial_ref  int64 8B 0
Attributes:
    _FillValue:    nan
    scale_factor:  1.0
    add_offset:    0.0

@mdsumner
Copy link

mdsumner commented Nov 1, 2024

fwiw to get the landsat example above to work I had to set GS_NO_SIGN_REQUEST for GDAL to read those sources.

import os
os.environ["GS_NO_SIGN_REQUEST"] = "YES"

@RichardScottOZ
Copy link

that will be another trick - replicate the above from s3

@mdsumner
Copy link

mdsumner commented Nov 1, 2024

here's my minimal verify of the 'a_nodata' setting for the google landsat (just for my own purposes catching up still ...)

dsn <- "/vsigs/gcp-public-data-landsat/LE07/01/176/031/LE07_L1TP_176031_20201204_20201230_01_T1/LE07_L1TP_176031_20201204_20201230_01_T1_B3.TIF"

Sys.setenv("GS_NO_SIGN_REQUEST" = "YES")
library(terra)
#> terra 1.7.83
r <- rast(dsn)
r1 <- rast(sprintf("vrt://%s?a_nodata=0.0", dsn))
ex <- c(681578.9,  709557.7, 4551326.3, 4580576.8)
op <- par(mfrow = c(2, 1))
plot(crop(r, ex), main = "original")
plot(crop(r1, ex), main  = "a_nodata=0.0")

par(op)

Created on 2024-11-01 with reprex v2.0.2

@mdsumner
Copy link

mdsumner commented Nov 1, 2024

that will be another trick - replicate the above from s3

it should be similar, sub in "/vsis3" prefix and set "AWS_NO_SIGN_REQUEST=YES" (or the keys also)

@RichardScottOZ
Copy link

Yes, that I have used a lot [@ljstrnadiii - the 150TB things I wrote out in parts on a cluster, too :) )

@RichardScottOZ
Copy link

On that topic - what about best GDAL environment etc. options with GTI - same as anything else?

@mdsumner
Copy link

mdsumner commented Nov 1, 2024

No way, details matter: https://fosstodon.org/@hughagraham/113333781954590335

@ljstrnadiii
Copy link

Those seem like typical configs I set when using a large vrt with rioxarray. Are there any configs in particular we should focus on with GTI that might differ that you might know of? Or is having a large warped vrt and GTI effectively the same when materializing the dataset?

@RichardScottOZ
Copy link

RichardScottOZ commented Nov 1, 2024

image
From fosstodon above

@mdsumner
Copy link

mdsumner commented Nov 1, 2024

It will be specific to the sources is all, definitely use EMPTY_DIR if sidecar files won't be needed, the biggest thing to matter for GTI is having explicit virtual grid setting (ie all of xsize,ysize,minx,miny,maxx,maxy, or all of xsize,ysize, geotransform), and having a spatial index on the vector layer which I haven't explored.

I think all the other settings shown there are pretty good, just can depend with specific sources

@mdsumner
Copy link

mdsumner commented Nov 1, 2024

adding this for completeness, a post on gdaltindex and FGB that pre-empted the GTI approach, and was clearly connected to STAC Geoparquet:

https://www.postholer.com/fgbcog/

https://cloudnativegeo.org/blog/2023/09/how-postholer-went-serverless-using-cloud-native-geospatial-data/

@RichardScottOZ
Copy link

The approximate 25K seems fine when I retest it this morning - FGB version.

@mdsumner - so what about writing a particular mosaic to disk -if this is similar to warped VRT behaviour here - @ljstrnadiii ?

e.g. 25K assorted files reprojected to a test 0.1 degree world grid to see how it goes [and I am sure there will be plenty of data issues to fix, but a test[.

@RichardScottOZ
Copy link

Throwing a default dask option at it on one c5.12x EC2 something I wasn't paying lots of attention, but around 10 minutes give or take a few - disk to disk.
So, pretty impressed.

@ljstrnadiii
Copy link

Throwing a default dask option at it on one c5.12x EC2 something I wasn't paying lots of attention, but around 10 minutes give or take a few - disk to disk.

That's a resample and reproject operation then? What is the size of the tifs and of the final store?

@RichardScottOZ
Copy link

RichardScottOZ commented Nov 2, 2024

100gb of geotiff - as on my first pass conversion I would have left some of them as float64 - end store size is 30mb
245 different CRS it appears.

@RichardScottOZ
Copy link

The mosaic map made was fine, too, this was an experiment in capability, 25K surveys squashed together is not useful in that not-organised sense.

However, with a bounding box for the target grid set - the driver appeared to handle the geometry problems - e.g. had a few datasets with transform issues - e.g. identity as it didn't work.

I was using code to automatically georeference the grids: Geosoft grids have some PROJCS type strings in their xml metadata [and sometimes EPSG etc] - sometimes people miss them or just have a GEOCS part in there - giving null type info there. Seems both of those were ignored in this case - so another nice thing.

@ljstrnadiii
Copy link

I noticed when actually inspecting the data that only particular params as open objects specified in the VRT connection string or as labels added with ogr2ogr -mo <...> were honored and same with inputs to open_rasterio...

I found that I had to add nodata, srs, dtype, to the GTI file with -mo, and parameterize resx/y, maxx/y, filter, band_count inside rioxarray which works well for my use case.

I also found the last filter was sticking when iteration over different filter args and concatenating (like over time and collection for example) into one Xarray. E.g. concatenating over band axis lead to the same (last filter query) data copied for each dataset.

If what I am noticing is true, we'd have to write out each collection as a dataset independently and then merge e.g. each GTI -> Xarray dataset should have same bands for all indexed assets. This would prevent me from directly "fusing" multiple collections unless I force chunk=1 on band axis and write in separate processes as far as I understand.

@RichardScottOZ
Copy link

Did using srs as metadata -mo allow you to change from what is in the vector dataset CRS?

@mdsumner
Copy link

mdsumner commented Nov 4, 2024

However, with a bounding box for the target grid set - the driver appeared to handle the geometry problems - e.g. had a few datasets with transform issues - e.g. identity as it didn't work.

What are the transform issues? I think that's worth exploring since the vector footprint and the raster itself could be "out of sync".

I was wondering about this, if the footprint is not related to the actual footprint of the file (because the transform or crs was wrong etc) then you would have a file selected for use because of its geometry footprint, but that might not contribute to the output because of the determination of its actual footprint as a raster source. I thought that's probably worth expanding on as it could be very confusing (and perhaps there's an expectation that the geometry footprint overrides that available in the source).

@RichardScottOZ
Copy link

issues being : didn't have one, as data has a problem, so just assigns the identify rather than actual values.
- in some cases I guess that could be included??
or the geosoft metadata was just wrong and it tries to georeference and fails - e.g. things like a GEOSC only or something that has a wildcard * in False Easting, issues like that.

@mdsumner
Copy link

mdsumner commented Nov 4, 2024

I see, essentially every one should be checked to see that it's sensible (the warper can handle many cases including curvilinear grids or control points etc, but sometimes you need a bit of VRT to make sure everything is right, missing crs is common in netcdf because of longlat assumptions etc).

@ljstrnadiii
Copy link

@mdsumner do you happen to have any insight into more complex methods to build mask band logic support something like (mask_band==8) or (mask_band==8 or mask_band<=4) to determine valid pixels and set all else to nodata?

I am thinking through using GTI with something like GLAD's ARD's raster data, which is temporally and spatially corrected Landsat 30m Surface Reflectance with 8th band as a Uint16 quality flag band. Or think any other large raster dataset where bitmask, QF band, etc should determine nodata before GDAL picks the first (or last) valid pixel during mosaicing areas with overlapping scenes. Feels somewhat on topic for the case where different CRSs overlap.

I don't see a connection string entry for something like this in GDAL's GTI or VRT driver. I do see examples inside the VRT docs like NoDataFromMaskSource, but not exactly the expression I am looking for.

@mdsumner
Copy link

Not otoh, I've tried figuring out masks with VRT so I think it's possible but I didn't get my head around it

Note that there's also possibility to run pixel functions with VRT and that might be a good way. I use stac tools like odc-geo or sits as my benchmark that do this, but I didn't get anywhere with it. Keen to explore though

@ljstrnadiii
Copy link

@mdsumner reporting back after putting GTI + rioxarray through the ringer. We have used it to build global datasets, datasets with high temporal frequency, etc. It has been really great.

One thing I see often though is stuff like:

  • TIFFFillTile:Read error at row 4294967295, col 4294967295, tile 498; got 0 bytes, expected 1866
  • RasterioIOError: Read failed. See previous exception for details.
  • RuntimeError: Request for 8516817326-8518185158 failed with response_code=0

I have pinned it down to likely some contention with gcs. This seems to happen even more frequently with larger COGs. My guess is that if there is no robust cache set for GDAL (CPL_VSIL_CURL_CACHE_SIZE) then it makes many many requests to these objects at once and overloads cloud storage. I have found these gdal configs to be very helpful for more robust tasks:

{
        "GDAL_HTTP_MERGE_CONSECUTIVE_RANGES": "YES",
        # using all_cpus here but single threaded dask scheduler when calling .to_zarr
        "GDAL_NUM_THREADS": "ALL_CPUS",
        "GDAL_CACHEMAX": "8048",
        "GDAL_HTTP_VERSION": "2",
        "GDAL_HTTP_MULTIPLEX": "YES",
        "VSI_CACHE": "TRUE",
        "VSI_CACHE_SIZE": "50000000",
        "CPL_VSIL_CURL_CHUNK_SIZE": "10485760",
        "CPL_VSIL_CURL_CACHE_SIZE": "1342177280",
        "GDAL_DISABLE_READDIR_ON_OPEN": "TRUE",
        "GDAL_HTTP_RETRY_DELAY": "30",
        "GDAL_HTTP_MAX_RETRY": "5",
}

I really would like to debug why exactly I am seeing a steady client error rate of 50%. My guess is that the requests are being throttled by the gcs server for our bucket (they are in the same region fwiw). I feel like GDAL_HTTP_EXPONENTIAL_BACKOFF options could be beneficial here instead of a default of fixed 30s. Do you know if GDAL offers exponential backoff-like support for HTTP? I am also reading that it is good practice to slowly ramp up requests in gcs.

Image

Anyways, figured it would be worth dropping a few notes here!

@RichardScottOZ
Copy link

I think I have seen errors along those lines on S3 when reading a lot

@mdsumner
Copy link

oh nice, I'll explore - but this is definitely worthy of asking on gdal-dev

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants