Create staggered hierarchy for CellArr dataset for flexibility for power users #81

Draft
wants to merge 1 commit into
base: master

Conversation

hanslovsky
Contributor

Power users may want to manage the arrays behind a CellArrDataset more directly, either by passing in tiledb.Array objects or by using uris that do not share a common prefix. This PR provides an example implementation that extracts a common base class, which is then extended to reproduce the existing class with its safeguards:

  1. class _CellArrDatasetBase: base class that is constructed from tiledb.Arrays. It does not manage (open/close) any arrays. The constructor adds a check that the arrays are read-only.
  2. class _CellArrDatasetUri(_CellArrDatasetBase): constructed from uris pointing to existing tiledb.Arrays. It opens the arrays from these uris (each uri is taken as is, no prefix is prepended), passes them into super().__init__, and closes them via __del__.
  3. class CellArrDataset(_CellArrDatasetUri): same interface as before. Its constructor only concatenates strings to produce the uris, which are then passed into super().__init__.

Most users should keep using the existing class (3). Power users can fall back to (1) or (2) when they need the extra flexibility. The _ prefix on (1) and (2) signals that these classes are unsupported: anyone who uses them incorrectly will need to debug on their own.

Note: This is a draft PR to start a discussion around such power-user classes. My use case is a processing pipeline in which each step maps one CellArrDataset to a new one. The input is considered immutable, so I currently have to copy the metadata at every step, even when only the matrix data changes. Using (1) or (2) allows much greater flexibility in such scenarios and avoids unnecessary data replication.
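
As a rough illustration of the pipeline scenario, the uris for each step could be derived from the previous step by swapping only the matrix uri and reusing the metadata uris unchanged. The helper name `derive_step_uris`, the key names and the example uris below are all hypothetical; the resulting mapping is the kind of input a uri-based class such as (2) would accept.

```python
from typing import Dict


def derive_step_uris(previous: Dict[str, str], new_matrix_uri: str,
                     matrix_key: str = "counts") -> Dict[str, str]:
    """Return the uris for the next step: a new matrix, metadata reused as is."""
    if matrix_key not in previous:
        raise KeyError(f"'{matrix_key}' not found in previous step")
    step = dict(previous)  # copy so the input mapping stays untouched
    step[matrix_key] = new_matrix_uri
    return step


# Hypothetical uris: the metadata lives in one place, each step writes its own matrix.
step0 = {
    "counts": "s3://bucket/raw/counts",
    "gene_annotation": "s3://bucket/raw/gene_annotation",
    "cell_metadata": "s3://bucket/raw/cell_metadata",
    "sample_metadata": "s3://bucket/raw/sample_metadata",
}
step1 = derive_step_uris(step0, "s3://bucket/normalized/counts")
```

Because the uris no longer need a common prefix, only the matrix is written per step and the metadata arrays are never copied.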

…wer users

This will allow operating on tiledb arrays directly
@hanslovsky hanslovsky force-pushed the propose-dataset-relaxation branch from b379dcd to d4ae610 Compare January 23, 2025 15:59