split out manifests for smaller "coordinate" arrays #539

Closed · dcherian opened this issue Jan 3, 2025 · 7 comments

dcherian (Contributor) commented Jan 3, 2025

One of our ideas for reducing the time it takes to open a dataset with Xarray is to create a separate manifest file for "coordinate" arrays, which are commonly loaded into memory eagerly.

Here is a survey of some datasets with coordinate arrays on the large end of the spectrum (the values for MPAS are estimates):

| Name | Uncomp. Size | Numel | Dtype | N Chunks | Shape | Chunk Shape |
| --- | --- | --- | --- | --- | --- | --- |
| ARCO ERA5 time | 10 MB | 1.3M | int64 | 20 | (1336320,) | (67706,) |
| HRRR 2D lat, lon | 10 MB | 1.9M | float64 | 1 | (1059, 1799) | (1059, 1799) |
| landsat datacube x | 10 MB | 1.3M | float64 | 1 | (1336320,) | (1336320,) |
| landsat datacube y | 5 MB | 0.6M | float64 | 1 | (668160,) | |
| Sentinel datacube | 1.3 MB | 0.2M | float64 | 1 | | |
| LLC4320 XC | 970 MB | 18.6M | float64 | 13 | (13, 4320, 4320) | (1, 4320, 4320) |
| LLC4320 all grid vars | 16 GB | | | 254 | | |
| 3.75 km MPAS edge_face_connectivity | 1.9 GB | 0.2B | int64 | 150 [12 MiB] | (125829120, 2) | |

If we want a manifest file of size 1 MiB, and our default inline_chunk_threshold_bytes is 512 bytes, then we can store at most 2048 chunks (assuming they are all inlined).
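
For reference, that 2048 figure is just the manifest budget divided by the inline threshold:

```python
# Back-of-the-envelope: how many inlined chunk references fit in a 1 MiB manifest,
# using the default inline_chunk_threshold_bytes of 512 bytes mentioned above.
manifest_budget_bytes = 1 * 1024 * 1024
inline_chunk_threshold_bytes = 512
print(manifest_budget_bytes // inline_chunk_threshold_bytes)  # 2048
```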

So one no-config approach would be to:

  1. take all the array metadata (combining the snapshot & changeset),
  2. sort arrays by number of chunks, and
  3. split the sorted list so that the first ~2048 chunks (at most) get written to one manifest, and put the rest in another manifest (or split that further using other heuristics). We can be smart about splitting so that no array's chunks end up divided between the first and second manifest.

This ignores a bunch of complexity:

  1. Sparsity of chunks (minor IMO)
  2. It underestimates the number of chunks we could fit in there (IMO probably OK for a first pass)
  3. If one builds a dataset using incremental appends and commits after every append, we may end up rewriting the manifest for the first few appends (this is OK IMO)

Thoughts?

dcherian (Contributor, Author) commented Jan 3, 2025

And here is some Python code to test this idea on example datasets:

```python
import math
import operator
from itertools import dropwhile, takewhile

import xarray as xr
from toolz import accumulate


def partition_manifest(ds: xr.Dataset) -> None:
    THRESHOLD_CHUNKS = 2048

    # number of chunks per array; we know this from the Zarr metadata alone
    allvars: dict[str, int] = {}
    for name, var in ds.variables.items():
        nchunks = math.prod(
            math.ceil(s / c)
            for s, c in zip(var.shape, var.encoding["chunks"], strict=True)
        )
        allvars[name] = nchunks

    # sort arrays by chunk count and accumulate; take arrays until the cumulative
    # chunk count crosses the threshold, and send the rest to the "big" manifest
    allvars = dict(sorted(allvars.items(), key=lambda x: x[1]))
    accumulated = tuple(accumulate(operator.add, allvars.values()))

    threshold_filter = lambda tup: tup[1] < THRESHOLD_CHUNKS
    first = lambda tup: next(iter(tup))
    pairs = tuple(zip(allvars.items(), accumulated, strict=True))
    small = dict(map(first, takewhile(threshold_filter, pairs)))
    big = dict(map(first, dropwhile(threshold_filter, pairs)))
    assert not set(small) & set(big)
    print(
        f"\nsmall: {sum(small.values())} chunks", tuple(small),
        f"\nbig: {sum(big.values())} chunks over {len(big)} arrays",
    )


grid = xr.open_zarr("gs://pangeo-ecco-llc4320/grid", chunks={})
partition_manifest(grid)
# small: 253 chunks ('PHrefC', 'PHrefF', 'Z', 'Zl', 'Zp1', 'Zu', 'drC', 'drF', 'face', 'i', 'i_g', 'iter', 'j', 'j_g', 'k', 'k_l', 'k_p1', 'k_u',
# 'time', 'CS', 'Depth', 'SN', 'XC', 'XG', 'YC', 'YG', 'dxC', 'dxG', 'dyC', 'dyG', 'hFacC', 'hFacS', 'hFacW', 'rA', 'rAs', 'rAw', 'rAz')

ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    chunks=None,
    decode_cf=False,
)
partition_manifest(ds)
# small: 23 chunks ('latitude', 'level', 'longitude', 'time')
# big: 361355904 chunks over 273 arrays

import arraylake as al

client = al.Client()
hrrr = client.get_repo("earthmover-demos/hrrr").to_xarray("solar")
gfs = client.get_repo("earthmover-demos/gfs").to_xarray("solar")

partition_manifest(hrrr)
partition_manifest(gfs)
# small: 54 chunks ('latitude', 'longitude', 'spatial_ref', 'step', 'x', 'y', 'time')
# big: 4589460 chunks over 6 arrays
# small: 11 chunks ('latitude', 'longitude', 'step', 'time')
# big: 1189440 chunks over 5 arrays
```

paraseba (Collaborator) commented Jan 7, 2025

I really like this, @dcherian. It's simple and it will probably make a big difference for interactive use cases. I wonder what things we can generalize. Examples:

  • Maybe we can generate multiple small-ish manifests if there are many small arrays.
  • Maybe for the large coordinate arrays we still want to have a separate manifest? That would help with interactive use cases without having to download massive manifests. Can we identify which arrays are coordinate arrays?

rabernat (Contributor) commented Jan 7, 2025

> If we want a manifest file of size 1 MiB

Where does this come from? It seems like a very important number. I guess it is the size of a file we expect to be able to fetch from S3 quickly using a single thread?

According to our model, the fetching time should be T(n) = T0 + n / B0 ≈ 10 ms + 1 MB / (100 MB/s) ≈ 20 ms. By this math, we could get away with 4 MB and still stay at around 50 ms.
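
Plugging a few sizes into that model (using the assumed constants above) makes the trade-off explicit:

```python
# Quick evaluation of the latency model T(n) = T0 + n / B0 quoted above,
# with the assumed constants T0 = 10 ms and B0 = 100 MB/s.
def fetch_time_ms(size_mb: float, t0_ms: float = 10, bandwidth_mb_s: float = 100) -> float:
    return t0_ms + size_mb / bandwidth_mb_s * 1000

for size_mb in (1, 4, 8):
    print(f"{size_mb} MB -> ~{fetch_time_ms(size_mb):.0f} ms")
# 1 MB -> ~20 ms, 4 MB -> ~50 ms, 8 MB -> ~90 ms
```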

> 3. If one builds a dataset using incremental appends and commits after every append, we may end up rewriting the manifest for the first few appends (this is OK IMO)

Agree it's okay to rewrite this file with every commit.


Could you try running the sample code on some big EO data cubes? Like something from Sylvera? Or https://app.earthmover.io/earthmover-demos/sentinel-datacube-South-America-3

rabernat (Contributor) commented Jan 7, 2025

> then we can store at most 2048 chunks (assuming they are all inlined).

It seems like we should also check this assumption for common datasets. It's unclear to me how much the logic you're proposing here depends on the assumption of inlined chunks.

rabernat (Contributor) commented Jan 7, 2025

Last comment about inlining:

Are there cases where it would make sense to actually rechunk the coordinate data before storing it? Like for this dataset:

> # small: 23 chunks ('latitude', 'level', 'longitude', 'time')

Why should we store 23 chunks? Why not just 4? As long as we are inlining, there is not really any benefit to chunking. In the scenario where we rechunk, we might exceed the 512-byte inline threshold, but we would still be storing less data in total, since we would only have one "big" chunk.

Put differently, is there ever a good reason to chunk an array if it is being inlined into a single manifest?
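
For a rough sense of scale, here is an illustrative check using the ERA5 time coordinate from the table above; this uses uncompressed sizes, so it is only a sketch (compression would shrink these numbers, though probably not below the inline threshold):

```python
import numpy as np

n_elements = 1_336_320                 # ERA5 time coordinate length (from the table above)
itemsize = np.dtype("int64").itemsize  # 8 bytes
inline_threshold_bytes = 512

for n_chunks in (20, 1):               # current chunking vs. a single big chunk
    bytes_per_chunk = n_elements * itemsize / n_chunks
    inlined = bytes_per_chunk <= inline_threshold_bytes
    print(f"{n_chunks:>2} chunk(s): ~{bytes_per_chunk / 1e6:.1f} MB each, inlined: {inlined}")
# Uncompressed, neither layout falls under the inline threshold; rechunking
# mainly trades 20 manifest entries (and potentially 20 reads) for 1.
```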

dcherian (Contributor, Author) commented Jan 7, 2025

> Maybe we can generate multiple small-ish manifests if there are many small arrays.
> Maybe for the large coordinate arrays we still want to have a separate manifest?

I think this really is a question of how to split the remaining arrays, that is, we should make sure manifests for big arrays (big in terms of number of chunks) are stored separately. I think that can be a follow-up.

> Can we identify which arrays are coordinate arrays?

We should ask: what are the characteristics of "coordinate arrays"? From my survey they seem to be arrays that are "generally" small in terms of number of chunks, and empirically we can make good decisions without needing the user to annotate their data.

> I guess it is the size of a file we expect to be able to fetch from S3 quickly using a single thread? By this math, we could get away with 4 MB and still stay at around 50 ms.

Yes, we want to stay in the "latency range", so < 8 MB.

> Could you try running the sample code on some big EO data cubes? Like something from Sylvera? Or app.earthmover.io/earthmover-demos/sentinel-datacube-South-America-3

These are the landsat examples in the table; they all have single chunks and will be easily detected.

> It's unclear to me how much the logic you're proposing here depends on the assumption of inlined chunks.

It does not depend on inlining at all. Using the inline threshold gives us a worst-case maximum estimate of the number of chunks in a single manifest, given the chosen size threshold.

> Why should we store 23 chunks?

This is how that dataset was created. My only goal here was to see if there is a no-config way to detect these "coordinate" variables in real-world scenarios. I think there is.

> In the scenario where we rechunk, we might exceed the 512-byte inline threshold, but we would still be storing less data in total, since we would only have one "big" chunk.

I think this is all orthogonal to the idea here, which is to simply split out a "small" manifest and a "big" manifest. Other optimizations can be explored later.

paraseba moved this to Ready in Icechunk 1.0 Jan 19, 2025
paraseba moved this from Ready to In progress in Icechunk 1.0 Jan 19, 2025
dcherian (Contributor, Author) commented Jan 20, 2025

Prototyped the config for a couple of use-cases following the design document in #593

ERA5 bulk update

Goals:

  1. Put coordinate arrays (time, latitude, longitude, level) in one manifest
  2. One data var per manifest file, so that I can incrementally build up the archive (commit after ingesting one var). The `arrays-per-manifest: 1` config is great for this use case.
  3. Preload the coordinates set at open-time.

```yaml
chunk-manifests:
 sets:
   - coordinates:                # a name for the set, used later with rules
       max-manifest-size: 10000  # this set will contain manifests with <= 10k references
                                 # large enough to hold all necessary time coordinate references
                                 # 60 years * (365*24) hours = 525,600 elements

       arrays-per-manifest: null # pack this many arrays in each manifest
                                 # max-manifest-size and arrays-per-manifest are
                                 # mutually exclusive
   - big-arrays:
       max-manifest-size: null
       arrays-per-manifest: 1
       cardinality: null

   - default:                  # default set must always exist, from defaults if needed

 rules:                        # what arrays go to each manifest set, an ordered list

   # each rule has a target (the manifest set) and a set of conditions that are and-ed
   # rules are applied in the order declared, and they break on the first match

   - path: .*/(latitude|longitude|time)  # an optional regex matching on path
     metadata-chunks: [0, 5000]          # arrays having number of chunks in this range
     target: coordinates                 # arrays that match will go to this manifest set

   - metadata-chunks: [200000, null]     # huge arrays have their special manifest set
     target: big-arrays

 preload:                      # what manifests to asynchronously preload on session start
   max-manifest-size: 50000    # don't preload manifests larger than this
   max-manifests: 10           # don't preload more than this many manifests
   arrays:
     - path: .*/2m_temperature # will preload manifests for all arrays matching this regex
```
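
To make the rule semantics concrete, here is a rough sketch (not Icechunk API; the function, rule encoding, and the sample paths and chunk counts are hypothetical) of how the ordered rules above could be evaluated: conditions within a rule are and-ed, the first matching rule wins, and unmatched arrays fall back to the default set.

```python
import re

# Hypothetical sketch mirroring the "path" regex and "metadata-chunks" range
# fields from the YAML above; not actual Icechunk configuration code.
def assign_set(path: str, nchunks: int, rules: list[dict]) -> str:
    for rule in rules:
        if "path" in rule and not re.fullmatch(rule["path"], path):
            continue  # path condition failed
        lo, hi = rule.get("metadata-chunks", (None, None))
        if lo is not None and nchunks < lo:
            continue
        if hi is not None and nchunks > hi:
            continue
        return rule["target"]  # first matching rule wins
    return "default"

rules = [
    {"path": r".*/(latitude|longitude|time)", "metadata-chunks": (0, 5000), "target": "coordinates"},
    {"metadata-chunks": (200000, None), "target": "big-arrays"},
]
# illustrative paths and chunk counts
print(assign_set("era5/time", 20, rules))                  # coordinates
print(assign_set("era5/2m_temperature", 1_300_000, rules)) # big-arrays
print(assign_set("era5/small_var", 100, rules))            # default
```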

GFS: Forecast-type appending dataset

Goals:

  • Start a new manifest every n time-steps.
  • Preload the coordinates set.
  • Preload the most recent manifest for data variables. There is no way to specify this at the moment.
  • Definitely want multiple arrays in a single manifest file.

Note: I found myself micro-optimizing the numbers and came up with a quite small `max-manifest-size: 4000`.

```yaml
chunk-manifests:
 sets:
   - coordinates:                # a name for the set, used later with rules
       max-manifest-size: 10000  # this set will contain manifests with <= 10k references
                                 # large enough to hold all necessary time coordinate references
                                 # 60 years * (365*24) hours = 525,600 elements

       arrays-per-manifest: null # pack this many arrays in each manifest
                                 # max-manifest-size and arrays-per-manifest are
                                 # mutually exclusive
   - appended-arrays:
       max-manifest-size: 4400   # 20 chunks per array for a single update, 4 updates per day
                                 # this gets us 5 days of updates in a single manifest file
       arrays-per-manifest: 10
       cardinality: null

   - default:                  # default set must always exist, from defaults if needed

 rules:                        # what arrays go to each manifest set, an ordered list

   # each rule has a target (the manifest set) and a set of conditions that are and-ed
   # rules are applied in the order declared, and they break on the first match

   - path: .*/(latitude|longitude|time|level)  # an optional regex matching on path
     metadata-chunks: [0, 5000]                # arrays having number of chunks in this range
     target: coordinates                       # arrays that match will go to this manifest set

   - metadata-chunks: [200000, null]           # huge arrays have their special manifest set
     target: appended-arrays

 preload:                      # what manifests to asynchronously preload on session start
   max-manifest-size: 50000    # don't preload manifests larger than this
   max-manifests: 10           # don't preload more than this many manifests
```

paraseba moved this from In progress to Done in Icechunk 1.0 Jan 23, 2025
paraseba closed this as completed by moving to Done in Icechunk 1.0 Jan 23, 2025