Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion : writing xarray-compatible zarr stores #66

Open
emileten opened this issue Apr 16, 2024 · 16 comments
Open

Discussion : writing xarray-compatible zarr stores #66

emileten opened this issue Apr 16, 2024 · 16 comments
Assignees

Comments

@emileten
Copy link
Collaborator

emileten commented Apr 16, 2024

Context

Hi @joemoorhouse. This topic was briefly mentioned in this issue os-climate/physrisk#263 when we discussed how coordinate labels are stored in, or retrieved from, the output zarr stores in OS-C, in particular the hazard indicators. It was stressed that we do not save coordinate labels directly in the zarr metadata and instead, save a matrix that can be used in combination with the indices to get the coordinates. This makes the output incompatible with native xarray operations. This is why we have a module dedicated to offering an api to read the output as xarray objects

Suggestion

Can we write the coordinates to the zarr metadata so as to have the output directly xarray-compatible ? This would for example allow us to experiment with titiler-xarray for serving tiles..

You mentioned performance concerns in the linked issue, but I am not sure to understand why, in principle, this issue should arise, as the ideal way, I believe (and there is some reference to that), to save such data would be one single zarr store with latitude, longitude, temporal, model and scenario dimensions (and with the coordinate labels for each of those in the zarr store metadata), with a chunking structure optimized for the usage of this data. This also suggests potentially broader changes in how the data is saved, since I think, for example for the days above tas workflow, we're saving separate zarr arrays for each year ?

@emileten emileten self-assigned this Apr 16, 2024
@sharkinsspatial
Copy link

@emileten There is a currently an additional utility write_data_array method on the OSCZarr class that supports writing with xarray compliant _ARRAY_DIMENSIONS convention. This works for me when used in the pytest environment but is throwing an xarray exception on set_index when run with actual nex_gddp_cmip6 data, I'll need to investigate a bit more here.

@joemoorhouse if possible, can you elaborate a bit on the use of an index variable in the normalize_array method. I'm not clear on the purpose for indexing on an additional variable here?

Additionally, looking through some of the discussion on the interpolation methods in the physrisk repo, are there selection/interpolation methods being used in physrisk that can't be supported using standard xarray coordinates and interpolation routines?

I may be misunderstanding the intent, but it seems like some of the code could be potentially simplified by working directly in xarray with the zarr data in transform space rather than numpy with data in raw pixel space. I know this was discussed a bit here but can you provide a bit more detail on the advantages of the "Geotiff" based approach being used.

@joemoorhouse
Copy link
Collaborator

Hi @emileten, hi @sharkinsspatial,
Some answers in reverse order (and sounds like a call to discuss might also be useful?):
First on the question of why in physrisk we are not using xarray (as compared to hazard where we use widely), but rather we are using the Zarr library directly to access the Zarr array. Indeed I started by using xarray in physrisk but was fighting to get the performance I wanted, so opted to go lower level: there is not muchxarray functionality we need and performance optimisation was harder. In hazard, in contrast, our use cases already seem a good fit to using xarray + Dask.
More precisely, we have a (very) important main use case which is looking up single points from latitudes and longitudes. There we want to look up ~ millions of locations as quickly as possible. Here the affine transform gives a computational advantage because we can calculate the pixel locations directly: no look-up necessary at all. When doing the same thing with xarray labels I guess (?) we are at best doing a log2(N) complexity search for each look up along each co-ordinate axis, where N is the dimension length (from memory: I'd have to debug through this again to check what it is doing!).
Other considerations too:

  • We want to Zarr async behaviour in downloading chunks, but we need the flexibility to control any thread/process pool behaviour on top of that; I'm a bit wary about what xarray is doing there.
  • Nice to avoid loading up and caching labels, e.g. in the API.

@joemoorhouse
Copy link
Collaborator

The index dimension is just a convention whereby the hazard indicators are 3 dimensional Zarr arrays by construction. Two dimensions are spatial and the third is the index dimension that depends on the nature of the data. Acute hazards are 3D because we always have a number of return periods e.g. flood depth for 30Y, 100Y, 200Y, ... returns. For some chronic hazards it is also useful to make 3D because we have data that depends on a number of thresholds, e.g. days with wet bulb globe temperature above 20, 25, 30, ... C.

It is important performance-wise to have something 3D, because for the physical climate risk calculations we typically need all of the index values for a particular lat and lon (so also want to have in the same chunk).

@joemoorhouse
Copy link
Collaborator

On storing the coordinate labels, I think when the xarray [to_zarr method] (https://docs.xarray.dev/en/latest/generated/xarray.Dataset.to_zarr.html) is called the xarray is actually stored as a Zarr group with arrays for the data itself and each coordinate. That is, the convention is not to store the labels in the meta-data but have them in separate arrays. @emileten, was that what you had in mind, or was it actually a variation on that?

I suspect that the to_zarr method format is trying to keep the metadata (attributes) as small as possible so that it is fast to load an array - although you cannot then create an xarray without downloading the coordinates. I was also wondering whether to adopt that convention...

I am in two minds. Perhaps personal taste but I find the GeoTiff-like approach of storing the affine transform much more elegant than storing all of the labels: that way we can store the labels in the metadata without this getting too big because we are storing the labels so efficiently (6 numbers).

I also struggle a bit with the structure we end up with when using the to_zarr method because an array becomes a group. I tried this out for representing map tile sets where we end up with a group of say 9 zoom levels and each zoom level is itself a group containing the array and coordinates. I thought it seemed an overly complex structure.

On the other hand, adopting that convention does not prevent us from doing any performance optimisations, i.e. we could still ignore the co-ordinates arrays and regenerate from affine transform if needed.

Whatever option we choose, it is still pretty easy to create an xarray from the Zarr array and it might not prevent us from using say titiler-xarray (e.g. if we can add a hook for adding the logic to create the xarray from a single Zarr array, as opposed to a group).

@emileten emileten changed the title Writing xarray-compatible zarr stores Discussion : writing xarray-compatible zarr stores Apr 18, 2024
@emileten
Copy link
Collaborator Author

Thanks both for all the input. For now we'll just ensure this module offers the option to write output with coordinates in the metadata : #67. This will be optional. The default remains the legacy behavior.

@emileten
Copy link
Collaborator Author

Marked this as a 'discussion', it's not really an issue anymore. Not sure how to convert an issue to an actual github discussion.

@j08lue
Copy link
Contributor

j08lue commented Aug 13, 2024

Related discussion on Pangeo considering Xarray to support affine transforms in gridded data stores (e.g. Zarr):

@joemoorhouse
Copy link
Collaborator

joemoorhouse commented Jan 15, 2025

Hi @j08lue and FYI @xbarra,

Adding some of our exchanges into this discussion (can still discuss face to face as needed!). Some renewed focus because we have the question of 'what is a nice data store for ASDI'.

@j08lue, as you note the functionality to support affine transforms is, most likely, making its way into the XArray core library pydata/xarray#9543 and presumably at some point after that there will be an XArray-native way to save co-ordinates into Zarr as well as a GeoZarr standard. Separate discussion on that: zarr-developers/geozarr-spec#17. And thanks for the nice demo of how we can store data as a data cube, using the NetCDF-style coordinates of XArray:

https://gist.github.com/j08lue/193291492504d61ad9f446e8459380ef#file-os-climate-hazard-reformat-zarr-multidimensional-with-stac-ipynb

I think it is fair to say that at the point there is a GeoZarr standard we should align to that. If anything the discussion you mention above reinforces that a transform is preferable to NetCDF-style coordinates in our case: that's surely the best option for us long-term.
https://discourse.pangeo.io/t/example-which-highlights-the-limitations-of-netcdf-style-coordinates-for-large-geospatial-rasters/4140/6

I think in the meantime we have two questions.

  1. Until the new standard is relatively clear and supported, should we also have NetCDF-style coordinates?
  2. Should we have more of a data cube structure as the gist example shows and if so, how?

For 1 I am in two minds. Compared to today, having the NetCDF coordinates means that we have to add another level into the storage scheme I believe (which may or may not be redundant in the eventual target). That is in Zarr nomenclature we have a group 'days_tas_above' of arrays 'days_tas_above' (actual indicator data), 'latitude', 'longitude', 'temperature' (coordinates) etc. The arrays would then be at paths:
.../days_tas_above/days_tas_above, .../days_tas_above/latitude, ../days_tas_above/longitude, ../days_tas_above/temperature etc
We can still have the transform as an attribute in the .../days_tas_above/days_tas_above array, so in principle we can have optional NetCDF coordinates.
I guess my main concern is if adding this extra level takes us further away from the long-term structure, which might have the transform as attributes in the array itself, i.e. the real target might actually be closer to what we have now (I'm pretty sure the attributes will be completely different of course :)). But on the flip side, right now we cannot open an XArray without some extra logic. Also, I don't think the extra layer will actually limit out options in the future.

In an effort to conclude, I suggest that for now we support the cases both with and without NetCDF coordinates in physrisk, but we do not migrate one format to another (at least until clear what the long-term looks like)

2 is perhaps a bit easier. I agree 100% with the gist example that the temperatures should be in the array (and in fact we started doing just that in other cases). I also agree with that example that emissions scenario should be collection-level information. I guess I would probably make time (i.e. the projection central year) collection-level information too rather than a dimension. My thinking for that is probably driven by the use cases I have seen. There are cases where a user is interested just in one or two scenarios and one or two years and a data cube perhaps binds scenarios and times too closely together for my liking (i.e. in the same Zarr array structure even if these are in different chunks). For example, if one wanted to copy just the 2050 data that would become a Zarr/XArray operation rather than just copying using a different S3 prefix. Another case where I can see arguments both ways.

@j08lue
Copy link
Contributor

j08lue commented Jan 21, 2025

In an effort to conclude, I suggest that for now we support the cases both with and without NetCDF coordinates in physrisk, but we do not migrate one format to another (at least until clear what the long-term looks like)

That interim solution sounds great. I'll respond to your prototype PR

Why avoid explicit coordinates in the first place?

Just wanted to record the rationale you shared with me:

From a purely physrisk point of view we [prefer affine transforms to coordinate arrays] because:

  1. We don’t want to load the coordinate arrays into memory. It leads to a latency, takes up memory and otherwise does nothing for us because
  2. We don’t want to use the coordinates to look up points because it can be quite a lot slower than using the affine transform directly. With the affine transform the pixels can be located immediately with a few multiply-adds whereas doing so from the coordinates requires a binary search in each dimension. At best inelegant and in some cases can be costly.

Just curious - what size of arrays are you optimizing for here, @joemoorhouse, where coordinate array loading, memory footprint, and lookup speed are significant compared to the actual data arrays and other operations?

@j08lue
Copy link
Contributor

j08lue commented Jan 21, 2025

I agree 100% with the gist example that the temperatures should be in the array (and in fact we started doing just that in other cases)

🎉 Can you list the indicators where this pattern should be applied - more model parameters stored as array dimensions - and would you like us to implement them?

Perhaps we could branch off into a new ticket for these.

@j08lue
Copy link
Contributor

j08lue commented Jan 21, 2025

I guess I would probably make time (i.e. the projection central year) collection-level information too rather than a dimension. My thinking for that is probably driven by the use cases I have seen. There are cases where a user is interested just in one or two scenarios and one or two years and a data cube perhaps binds scenarios and times too closely together for my liking (i.e. in the same Zarr array structure even if these are in different chunks).

Good call to base the data cube design on the known use cases and discovery and access patterns.

Re access - based on your experience, is slicing along the projected time not a common use case, though? Like answering questions about how a hazard is expected to evolve, as climate change progresses?

Also, do you experience that users always download files from S3 before opening them, or could we imagine that they rely more on direct access from a central repository like ASDI or CEDA / UK EODH in the future? In the first case, I get why slicing on the file level is better. I think we really want to promote direct access, though, no?

For discoverability and cataloging, data cube dimensions bring the advantages discussed earlier.

@joemoorhouse
Copy link
Collaborator

Hi @j08lue,

Re "Just curious - what size of arrays are you optimizing for here, @joemoorhouse, where coordinate array loading, memory footprint, and lookup speed are significant compared to the actual data arrays and other operations?"

An example would be someone selecting a hazard map (e.g. flood for 2050 under SSP585) in a UI and clicking on a single location to get the flood depths. In that case the (current) work is to read the Zarr attributes (assuming not cached), then use this to identify the (usually single) chunk needed, download that, decompress it and get the pixel value. This is actually quite ok latency-wise (I guess around 100 ms level) and it stays at that level no matter how large the array (e.g. 1,000,000 × 500,000). At the point where the attributes are cached (possible since that is small), it is mainly the time to get that single chunk.

With NetCDF arrays we then have to load the coordinates and perhaps additional attributes (unless using consolidated metadata). Now I don't think this would be disastrously slow - even if the coordinates are a million numbers say we can get pretty quickly - but it is an undesirable and unnecessary overhead.

[By the way, the concerns others have over accuracy are probably a better case for transforms, but we did not experience that yet!]

@joemoorhouse
Copy link
Collaborator

I think most indicators fit that pattern well where there is an identifiable single extra dimension and definitely desirable to implement like that. In retrospect it was a misstep to make those separate. On the acute side, that dimension is return period. For chronic hazards we have families of threshold values, e.g. mean heating/cooling degrees days for different thresholds, mean days with temperature (average or max) above different thresholds, days with SPEI index below certain threshold values. In fact these are more common than the exception (things like fire probability)!

So I think the extra dimension gets us quite far. I do wonder whether we might need two extra dimensions in future. Perhaps modelling of heat waves or dealing with uncertainty better. And again, those would I think go in the cube.

@joemoorhouse
Copy link
Collaborator

I agree it is possible to make a good case for projected time in the cube. As you say, for our flood case why not click on a location and see flood depths vs return period for multiple time horizons? (which we can still do of course, but in a less elegant way). Also, interpolating between projected times is important. To be fair this one seems a bit of a coin toss to me.

I think ease of downloading files is important. From various exchanges, I think Physrisk is often used a piecemeal way, i.e. people using whatever bits of data and bits of code might be useful for their own systems (and potentially hosting themselves for the usual reasons), and I think that's a good thing (or if not good, then certainly fine!). Whereas of course a data cube is better when accessing from a central repository.

@mdsumner
Copy link

(following along from a distance ...) you also don't need to load any arrays at all if you can use affine, you need the extent and shape (or transform and shape), which is a maximum 8 values (for 2D). There's a tremendous amount of work that can be done with that tiny set of values about the grid and a single string for its crs, including entirely generating the coordinate arrays from scratch in whole or part arbitrarily. 🙏

@j08lue
Copy link
Contributor

j08lue commented Jan 22, 2025

Coordinates & transforms ✅

There's a tremendous amount of work that can be done with that tiny set of values

I 💯 agree on the benefits of affine transforms for representing regular grids / images you highlight, @mdsumner. ❤ Great that the GeoZarr standard and client libraries are adding support for it, so we can store Zarr with transforms + crs in a standard-compliant way - because the cost of not following standards is big, too, IMO.

Index parameter dimensions ✅

So I think the extra dimension gets us quite far. I do wonder whether we might need two extra dimensions in future.

I see you already implemented one extra dimension that is not called "index" in #135. Your call when / whether more are needed. Dimensions galore, if you ask me. 😄

Time dimension ❓

I think ease of downloading files is important. I think Physrisk is often used a piecemeal way, i.e. people using whatever bits of data might be useful for their own systems

I would imagine that more folks who want local copies would just download the whole time series and if the superfluous data they pull is not big (?), that would not hurt.

Subsetting from remote files would be a Python one-liner:

import xarray

xarray.open_dataset("https://remote.zarr", engine="zarr").isel(time="2020-01-01").to_zarr("local_2020.zarr")

But you are the maintainer and know these use cases best, so pls make the call @joemoorhouse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants