Feature Request: More Flexible Indexing From Underlying Zarr Dataset #172

Rilwan-Adewoyin · 2025-01-10T12:28:12Z

Title: Enhance Anemoi Indexing Capabilities to Leverage Zarr's List Indexing

Description:

This issue proposes removing current limitations and inefficiencies in Anemoi Datasets' indexing by fully utilizing Zarr's list indexing capabilities. Currently, Anemoi imposes restrictions that prevent optimal use of list indices, leading to performance bottlenecks and code complexity.

Background:

Existing comments in the codebase suggest that these restrictions rest on the assumption that Zarr does not support list indexing. However, Zarr's zarr.Array.oindex[...] accessor (orthogonal indexing) does support list indices, as well as slices and None.

Current Limitations and Inefficiencies:

1. Limited Support for Multi-Dimensional List Indexing (Inflexibility 1):

Anemoi Datasets currently do not support indexing with two or more list indices on different dimensions. For example:

    x = get_anemoi_dataset(...)  # shape: (dates, variables, ensemble, grid)
    date_index = [1, 3, 5]
    ensemble_index = [0, 4, 2]
    x[date_index, :, ensemble_index, :]  # Currently not supported
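
To make the intended semantics concrete, here is a minimal sketch that uses a NumPy array as a stand-in for the Zarr-backed dataset (the sizes are made up for illustration). It contrasts NumPy's default fancy indexing, which pairs the two lists element by element, with orthogonal (outer) indexing, which is the behaviour Zarr's oindex provides and which this request assumes is the desired one.

```python
import numpy as np

# Stand-in for an Anemoi dataset: (dates, variables, ensemble, grid).
# Sizes are arbitrary and chosen only for illustration.
x = np.arange(6 * 2 * 5 * 4).reshape(6, 2, 5, 4)

date_index = [1, 3, 5]
ensemble_index = [0, 4, 2]

# NumPy fancy indexing pairs the lists: (date 1, member 0), (3, 4), (5, 2).
# Because a slice separates the two lists, the paired axis moves to the front.
paired = x[date_index, :, ensemble_index, :]

# Orthogonal indexing selects every date/member combination instead.
# np.ix_ builds the open mesh; full axes are spelled out with np.arange.
orthogonal = x[
    np.ix_(date_index, np.arange(x.shape[1]), ensemble_index, np.arange(x.shape[3]))
]

print(paired.shape)      # (3, 2, 4)
print(orthogonal.shape)  # (3, 2, 3, 4)
```

With a Zarr array z of the same shape, z.oindex[date_index, :, ensemble_index, :] should correspond to the orthogonal result rather than the paired one.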

2. Issues with Mixed List and Slice Indexing (Inflexibility 2):

As highlighted in #162, there are existing issues with indexing using a single list index on one dimension and a slice on another.

3. Inefficient Handling of Non-Sequential List Indices (Inefficiency 1):

When a non-sequential list index of size n is used on a single dimension, the current implementation converts it into n separate slices of length 1, retrieves the data for each slice individually, and concatenates the results. This is evident in the code at [select.py#L54C1-L60C7](https://github.com/ecmwf/anemoi-datasets/blob/develop/src/anemoi/datasets/data/select.py#L54C1-L60C7). This approach is at best as efficient as oindex[...], and is likely less efficient for large index lists.
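
The difference in access pattern can be sketched as follows. This is not the code from select.py; CountingArray is a hypothetical wrapper around a NumPy array that only counts how many read calls reach the backing store, and the sizes are made up.

```python
import numpy as np


class CountingArray:
    """Hypothetical wrapper that counts read calls on a backing array."""

    def __init__(self, data):
        self.data = data
        self.reads = 0

    def __getitem__(self, key):
        self.reads += 1
        return self.data[key]


backing = np.arange(10 * 3).reshape(10, 3)  # (dates, variables), made-up sizes
index = [7, 2, 5]  # non-sequential list index on the first dimension

# Pattern described above: one length-1 slice per index, then concatenate.
per_slice = CountingArray(backing)
pieces = [per_slice[slice(i, i + 1)] for i in index]
result_slices = np.concatenate(pieces, axis=0)

# Alternative: hand the whole list to the store in a single call.
single = CountingArray(backing)
result_list = single[index]

print(per_slice.reads, single.reads)  # 3 1
print(np.array_equal(result_slices, result_list))  # True
```

Both paths return the same data. How much a single call saves in practice depends on the chunk layout and on how the store serves the request, which this sketch does not model.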

Proposed Solution:

Leverage Zarr's oindex[...] accessor to support general list indexing in Anemoi Datasets. This would remove the current restrictions on using multiple list indices and allow data for non-sequential list indices to be retrieved at least as efficiently as it is today.
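
One possible shape for this, as a rough sketch only: orthogonal_getitem below is a hypothetical helper, not part of Anemoi or Zarr. It delegates to the oindex accessor when the array has one (as Zarr arrays do) and otherwise emulates the same outer-indexing semantics on a plain NumPy array, which is what the example exercises.

```python
import numpy as np


def orthogonal_getitem(array, index):
    """Apply `index` with orthogonal (outer) semantics.

    Hypothetical helper, not part of Anemoi or Zarr.
    """
    # Zarr arrays expose an `oindex` accessor; delegate to it when present.
    if hasattr(array, "oindex"):
        return array.oindex[index]

    # Fallback for plain NumPy arrays: index one axis at a time so that
    # lists on different axes never get paired with each other.
    result = array
    axis = 0
    for key in index:
        selector = (slice(None),) * axis + (key,)
        result = result[selector]
        # An integer key drops its axis; slices and lists keep it.
        if not isinstance(key, (int, np.integer)):
            axis += 1
    return result


x = np.arange(6 * 2 * 5 * 4).reshape(6, 2, 5, 4)
out = orthogonal_getitem(x, ([1, 3, 5], slice(None), [0, 4, 2], slice(0, 2)))
print(out.shape)  # (3, 2, 3, 2)
```

Routing every index tuple through one such entry point is one way the per-case handling of lists and slices could be reduced, though the actual design is left to the implementation.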

Benefits and Use Cases:

- Improved Performance for Ensemble Dataloaders: Enables efficient list indexing on the date and ensemble member dimensions, which is crucial for ensemble modeling.
- Enhanced Dataloader Efficiency: Potentially amortizes the cost of reads for list indices that fall within the same chunk, which could significantly improve performance for models that process large time chunks non-iteratively.
- Resolution of Existing Issues: Addresses scenarios like those described in feat: support list indices for sampling #162.
- Code Simplification: Reduces the complexity of the indexing logic in the anemoi/datasets/data/ directory.
- Increased Flexibility: The indexing behaviour of Anemoi Datasets will be more consistent with that of Zarr and NumPy arrays.
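
The amortization argument can be illustrated with a little arithmetic. The sketch below assumes a regular chunk grid along the indexed dimension and a made-up chunk length; chunks_touched is a hypothetical helper. Whether a given Zarr version reads each touched chunk only once for such a request is not established in this issue.

```python
def chunks_touched(indices, chunk_size):
    """Return the sorted chunk ids that a list of indices falls into.

    Assumes a 1-D regular chunk grid along the indexed dimension.
    """
    return sorted({i // chunk_size for i in indices})


date_index = [0, 1, 2, 3, 9]
chunk_size = 4  # hypothetical chunk length along the dates dimension

touched = chunks_touched(date_index, chunk_size)
print(len(date_index), "indices fall into", len(touched), "chunks:", touched)
```

Here five indices fall into two chunks, so a request that is served chunk by chunk has fewer units to fetch than five separate length-1 reads.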


Rilwan-Adewoyin added the enhancement New feature or request label Jan 10, 2025

Rilwan-Adewoyin self-assigned this Jan 10, 2025

Rilwan-Adewoyin added a commit that referenced this issue Jan 10, 2025

#172 add list indexing

f3dfe71

Rilwan-Adewoyin mentioned this issue Jan 10, 2025

#172 Support flexible indexing #173

Draft
