Feature Request: More Flexible Indexing From Underlying Zarr Dataset #172

Open
Rilwan-Adewoyin opened this issue Jan 10, 2025 · 0 comments

Title: Enhance Anemoi Indexing Capabilities to Leverage Zarr's List Indexing

Description:

This issue proposes removing current limitations and inefficiencies in Anemoi Datasets' indexing by fully utilizing Zarr's list indexing capabilities. Currently, Anemoi imposes restrictions that prevent optimal use of list indices, leading to performance bottlenecks and code complexity.

Background:

Existing comments in the codebase suggest that these restrictions rest on the assumption that Zarr does not support list indexing. However, Zarr arrays expose an orthogonal-indexing accessor, zarr.Array.oindex[...], which does support list indices, as well as slices and None.

Current Limitations and Inefficiencies:

1. Limited Support for Multi-Dimensional List Indexing (Inflexibility 1):

Anemoi Datasets currently do not support indexing with two or more list indices on different dimensions. For example:

x = get_anemoi_dataset(...)  # dimensions: (dates, variables, ensemble, grid)
date_index = [1, 3, 5]
ensemble_index = [0, 4, 2]
x[date_index, :, ensemble_index, :]  # This is currently not possible
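
To make the intended semantics concrete, here is a minimal NumPy sketch (not Anemoi or Zarr code) of what indexing with two lists can mean. The array sizes are arbitrary assumptions. Plain NumPy pairs the two lists element-wise, whereas orthogonal (outer) indexing, which is the behaviour Zarr's oindex[...] provides, selects every combination of date and ensemble member.

```python
import numpy as np

# Stand-in array with dims (dates, variables, ensemble, grid); sizes are arbitrary.
x = np.arange(6 * 2 * 5 * 4).reshape(6, 2, 5, 4)
date_index = [1, 3, 5]
ensemble_index = [0, 4, 2]

# NumPy "vectorized" indexing pairs the lists element-wise: (1, 0), (3, 4), (5, 2).
paired = x[date_index, :, ensemble_index, :]

# Orthogonal (outer) indexing takes every date/member combination.
# np.ix_ needs an explicit index sequence for every dimension.
outer = x[np.ix_(date_index, range(x.shape[1]), ensemble_index, range(x.shape[3]))]

print(paired.shape)  # (3, 2, 4)
print(outer.shape)   # (3, 2, 3, 4)
```

On a Zarr array, the outer selection would be written as z.oindex[date_index, :, ensemble_index, :].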

2. Issues with Mixed List and Slice Indexing (Inflexibility 2):

As highlighted in #162, there are existing issues with indexing using a single list index on one dimension and a slice on another.

3. Inefficient Handling of Non-Sequential List Indices (Inefficiency 1):

When a non-sequential list index of size n is used for a single dimension, the current implementation converts it into n separate slices of length 1, retrieves the data for each slice individually, and concatenates the results. This is evident in the code at [select.py#L54C1-L60C7](https://github.com/ecmwf/anemoi-datasets/blob/develop/src/anemoi/datasets/data/select.py#L54C1-L60C7). This approach is at best as efficient as a single oindex[...] call and becomes less efficient as the index list grows, because every element is fetched separately.
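
The difference between the two access patterns can be illustrated with a toy example. The CountingArray class and both helper functions below are hypothetical stand-ins, not Anemoi or Zarr code; the read counter is only a proxy for store accesses, and real costs depend on the chunk layout.

```python
import numpy as np


class CountingArray:
    """Hypothetical stand-in for a chunked store that counts __getitem__ calls."""

    def __init__(self, data):
        self.data = data
        self.reads = 0

    def __getitem__(self, key):
        self.reads += 1
        return self.data[key]


def read_by_unit_slices(arr, indices):
    # Pattern described above: one length-1 slice per index, then concatenate.
    return np.concatenate([arr[i : i + 1] for i in indices], axis=0)


def read_by_list(arr, indices):
    # Single request with the whole list (for Zarr this would be arr.oindex[indices]).
    return arr[list(indices)]


data = np.arange(10 * 3).reshape(10, 3)
indices = [7, 2, 5]

a = CountingArray(data)
sliced = read_by_unit_slices(a, indices)
b = CountingArray(data)
listed = read_by_list(b, indices)

print(np.array_equal(sliced, listed))  # True
print(a.reads, b.reads)                # 3 1
```

Both patterns return the same rows in the requested order; only the number of requests differs.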

Proposed Solution:

Use Zarr's oindex[...] accessor to support general list indexing in Anemoi datasets. This removes the current restriction on using multiple list indices and retrieves data for non-sequential list indices at least as efficiently as the current per-slice approach.
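
As a rough sketch of how such delegation might look, the hypothetical helper below (orthogonal_getitem is an illustrative name, not part of Anemoi or Zarr) forwards the key to the array's oindex accessor when one exists and otherwise emulates the same outer-indexing behaviour on a plain NumPy array. It assumes the key contains only integer lists and slices; scalar integers and None are not handled.

```python
import numpy as np


def orthogonal_getitem(array, key):
    """Hypothetical helper: outer-index `array` with integer lists and slices."""
    if not isinstance(key, tuple):
        key = (key,)

    # Zarr arrays expose an `oindex` accessor; delegate to it when present.
    oindex = getattr(array, "oindex", None)
    if oindex is not None:
        return oindex[key]

    # Fallback for plain NumPy arrays: pad missing dimensions with full slices,
    # then expand every slice into an explicit index array for np.ix_.
    key = key + (slice(None),) * (array.ndim - len(key))
    axes = [
        np.arange(*k.indices(n)) if isinstance(k, slice) else np.asarray(k)
        for k, n in zip(key, array.shape)
    ]
    return array[np.ix_(*axes)]


# Usage on a NumPy stand-in with dims (dates, variables, ensemble, grid).
x = np.arange(6 * 2 * 5 * 4).reshape(6, 2, 5, 4)
out = orthogonal_getitem(x, ([1, 3, 5], slice(None), [0, 4, 2]))
print(out.shape)  # (3, 2, 3, 4)
```

Because the helper only checks for an oindex attribute, the same call should work unchanged on a Zarr array, which is the design choice the proposal relies on.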

Benefits and Use Cases:

  • Improved Performance for Ensemble Dataloaders: Enables efficient list indexing on date and ensemble member dimensions, which is crucial for ensemble modeling.
  • Enhanced Dataloader Efficiency: Potentially amortizes the cost of reads for list indices that fall within the same chunk, which can significantly improve performance for models that process large time chunks non-iteratively.
  • Resolution of Existing Issues: Addresses scenarios like those described in feat: support list indices for sampling #162.
  • Code Simplification: Reduces the complexity of the indexing logic in the anemoi/datasets/data/ directory.
  • Increased Flexibility: The indexing behaviour of Anemoi datasets becomes more consistent with the indexing behaviour of Zarr and NumPy arrays.

Rilwan-Adewoyin added the enhancement (New feature or request) label Jan 10, 2025
Rilwan-Adewoyin self-assigned this Jan 10, 2025
Rilwan-Adewoyin added a commit that referenced this issue Jan 10, 2025