Support for multiple matrices and improving construction of TileDB objects (#53)

* Support multiple matrices during construction and query of slices
* Parallelize matrix ingestion into TileDB
* Added method to access an entire sample
* Fix bugs with filtering frames with tiledb query expressions
* Better remapping of sliced coo matrices from tiledb arrays
* Update documentation, tests and README
jkanche authored Nov 28, 2024
1 parent 74bb680 commit 48de52c
Showing 19 changed files with 697 additions and 211 deletions.
61 changes: 61 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,66 @@
# Changelog

## Version 0.3.0

This version introduces major improvements to matrix handling, storage, and performance, including support for multiple matrices in H5AD/AnnData workflows and optimizations for ingestion and querying.

**Support for multiple matrices**:
- Both `build_cellarrdataset` and `CellArrDataset` now support multiple matrices. During ingestion, a TileDB group called `"assays"` is created to store all matrices, along with group-level metadata.

This may introduce breaking changes, depending on how these classes are used with their default parameters. Previously, the TileDB files were built as follows:

```python
dataset = build_cellarrdataset(
    output_path=tempdir,
    files=[adata1, adata2],
    matrix_options=MatrixOptions(matrix_name="counts", dtype=np.int16),
    num_threads=2,
)
```

Now you may provide a list of matrix options, one for each layer in the files:

```python
dataset = build_cellarrdataset(
    output_path=tempdir,
    files=[adata1, adata2],
    matrix_options=[
        MatrixOptions(matrix_name="counts", dtype=np.int16),
        MatrixOptions(matrix_name="log-norm", dtype=np.float32),
    ],
    num_threads=2,
)
```

Querying follows a similar structure:
```python
cd = CellArrDataset(
    dataset_path=tempdir,
    assay_tiledb_group="assays",
    assay_uri=["counts", "log-norm"]
)
```
`assay_uri` is relative to `assay_tiledb_group`. For backwards compatibility, `assay_tiledb_group` can be an empty string.
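
To illustrate how `assay_uri` resolves relative to `assay_tiledb_group`, here is a minimal sketch. The helper `resolve_assay_paths` and the `/data/ds` path are hypothetical and not part of the `cellarr` API; the sketch only assumes that paths are joined with `/`, so an empty group name falls back to the older flat layout.

```python
import posixpath


def resolve_assay_paths(dataset_path, assay_tiledb_group, assay_uri):
    """Hypothetical helper: join each assay URI onto the dataset path and group."""
    # Accept a single name or a list of names.
    uris = [assay_uri] if isinstance(assay_uri, str) else list(assay_uri)
    # posixpath.join does not produce a doubled separator for an empty group.
    return [posixpath.join(dataset_path, assay_tiledb_group, uri) for uri in uris]


grouped = resolve_assay_paths("/data/ds", "assays", ["counts", "log-norm"])
flat = resolve_assay_paths("/data/ds", "", "counts")
print(grouped)
print(flat)
```

The same relative form appears in the dataloader example in the README, where the matrix is addressed as `assays/counts`.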

**Parallelized ingestion**:
The build process now uses `num_threads` to ingest matrices concurrently. Two new columns in the sample metadata, `cellarr_sample_start_index` and `cellarr_sample_end_index`, track sample offsets, improving matrix processing.
- Note: The process pool uses the `spawn` method on UNIX systems, which may affect usage on Windows machines.
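
As a rough illustration of what the two offset columns could hold, the sketch below derives start and end indices from per-sample cell counts. The function name and the inclusive end index are assumptions; the changelog does not state which convention `cellarr` uses.

```python
from itertools import accumulate


def sample_offsets(cells_per_sample):
    """Return (start, end) row offsets per sample; end is inclusive (an assumption)."""
    ends_exclusive = list(accumulate(cells_per_sample))
    starts = [0] + ends_exclusive[:-1]
    return [(s, e - 1) for s, e in zip(starts, ends_exclusive)]


# Three samples with 100, 250 and 50 cells respectively.
offsets = sample_offsets([100, 250, 50])
print(offsets)
```

With offsets known up front, each worker can in principle write its own block of rows without coordinating with the others, which is one reason precomputed offsets pair well with concurrent ingestion.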

**TileDB query condition fixes**:
Fixed several issues with fill values represented as bytes (this appears to be common when `ascii` is used as the column type) and, more generally, with filtering operations on TileDB dataframes.
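
The bytes-versus-string mismatch can be illustrated without TileDB. This sketch assumes query results arrive as plain Python records in which an `ascii` column holds `bytes`; `normalize_value` and `filter_rows` are hypothetical helpers, not `cellarr` functions.

```python
def normalize_value(value, encoding="ascii"):
    """Decode bytes to str so comparisons against string literals behave as expected."""
    return value.decode(encoding) if isinstance(value, bytes) else value


def filter_rows(rows, column, expected):
    """Keep rows whose normalized column value equals `expected`."""
    return [row for row in rows if normalize_value(row[column]) == expected]


rows = [
    {"cell": "c1", "tissue": b"liver"},
    {"cell": "c2", "tissue": b"NA"},  # fill value stored as bytes
    {"cell": "c3", "tissue": "liver"},
]
matches = filter_rows(rows, "tissue", "liver")
print([row["cell"] for row in matches])
```

Without the decoding step, a comparison such as `b"NA" == "NA"` is `False` in Python, so a filter on the fill value would silently match nothing.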

**Index remapping**:
Improved remapping of indices from sliced TileDB arrays for both dense and sparse matrices. This is an internal slicing operation, not a user-facing function.
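
A minimal sketch of this kind of remapping, assuming the sliced array returns COO coordinates in the array's global index space and the caller wants positions relative to the requested indices. `remap_coo` is a hypothetical name, and this is not `cellarr`'s internal implementation.

```python
def remap_coo(rows, cols, requested_rows, requested_cols):
    """Translate global COO coordinates into positions within the requested slice."""
    row_pos = {idx: pos for pos, idx in enumerate(requested_rows)}
    col_pos = {idx: pos for pos, idx in enumerate(requested_cols)}
    return [row_pos[r] for r in rows], [col_pos[c] for c in cols]


# Global coordinates of three non-zero entries returned for the slice.
rows, cols = [10, 3, 10], [7, 2, 2]
local_rows, local_cols = remap_coo(
    rows, cols, requested_rows=[3, 10], requested_cols=[2, 5, 7]
)
print(local_rows, local_cols)
```

The remapped coordinates fit a matrix of shape `(len(requested_rows), len(requested_cols))`, here 2 x 3, regardless of how large the stored array is.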

**Get a sample**:
Added a method to access all cells for a particular sample. You can provide either an index or a sample ID.

```python
sample_1_slice = cd.get_cells_for_sample(0)
```
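
One way such a lookup could work is sketched below, using the offset columns from the sample metadata. The record layout, the `sample_id` key, the inclusive end index, and `cell_range_for_sample` itself are all assumptions for illustration; the actual `get_cells_for_sample` implementation may differ.

```python
def cell_range_for_sample(sample_metadata, sample):
    """Resolve a sample given by position (int) or ID (str) to its range of cell rows."""
    if isinstance(sample, int):
        record = sample_metadata[sample]
    else:
        record = next(r for r in sample_metadata if r["sample_id"] == sample)
    start = record["cellarr_sample_start_index"]
    end = record["cellarr_sample_end_index"]  # assumed inclusive
    return range(start, end + 1)


sample_metadata = [
    {"sample_id": "sample_1", "cellarr_sample_start_index": 0, "cellarr_sample_end_index": 99},
    {"sample_id": "sample_2", "cellarr_sample_start_index": 100, "cellarr_sample_end_index": 349},
]
by_index = cell_range_for_sample(sample_metadata, 0)
by_id = cell_range_for_sample(sample_metadata, "sample_2")
print(len(by_index), len(by_id))
```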

This release also includes updates to the documentation, tutorials, and README, along with additional tests.

## Version 0.2.4 - 0.2.5

- Provide options to extract an expected set of cell metadata columns across datasets.
43 changes: 34 additions & 9 deletions README.md
@@ -29,11 +29,9 @@ Building a `CellArrDataset` generates 4 TileDB files in the specified output dir
- `sample_metadata`: A TileDB file containing sample metadata.
- `cell_metadata`: A TileDB file containing cell metadata including mapping to the samples
they are tagged with in ``sample_metadata``.
- A matrix TileDB file named by the `layer_matrix_name` parameter. This allows the package
to store multiple different matrices, e.g. 'counts', 'normalized', 'scaled' for the same cell,
gene, sample metadata attributes.
- An `assays` TileDB group containing the matrices. This allows the package to store multiple matrices, e.g. 'counts', 'normalized', 'scaled', for the same sample/cell and gene attributes.

The organization is inspired by the [MultiAssayExperiment](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html) data structure.
The organization is inspired by Bioconductor's `SummarizedExperiment` data structure.

The TileDB matrix file is stored in a **cell X gene** orientation. This orientation
is chosen because the fastest-changing dimension as new files are added to the
@@ -64,7 +62,19 @@ adata2 = "path/to/object2.h5ad"
dataset = build_cellarrdataset(
    output_path=tempdir,
    files=[adata1, adata2],
    matrix_options=MatrixOptions(dtype=np.float32),
    matrix_options=MatrixOptions(matrix_name="counts", dtype=np.int16),
    num_threads=2,
)

# Or if the objects contain multiple assays
dataset = build_cellarrdataset(
    output_path=tempdir,
    files=[adata1, adata2],
    matrix_options=[
        MatrixOptions(matrix_name="counts", dtype=np.int16),
        MatrixOptions(matrix_name="log-norm", dtype=np.float32)
    ],
    num_threads=2,
)
```

@@ -91,7 +101,7 @@ if these are `AnnData` or `H5AD`objects, all objects must contain an index (in t

#### Optionally provide cell metadata columns

If the cell metadata is inconsistent across datasets, you can provide a list of
If the cell metadata is inconsistent across datasets, you may provide a list of
columns to standardize during extraction. Any missing columns will be filled with
the default value `'NA'`, and their data type should be specified as `'ascii'` in
`CellMetadataOptions`. For example, this build process will create a TileDB store
@@ -115,7 +125,7 @@ Check out the [documentation](https://biocpy.github.io/cellarr/tutorial.html) fo

### Query a `CellArrDataset`

Users have the option to reuse the `dataset` object retuned when building the dataset or by creating a `CellArrDataset` object by initializing it to the path where the files were created.
Users have the option to reuse the `dataset` object returned when building the dataset or by creating a `CellArrDataset` object by initializing it to the path where the files were created.

```python
# Create a CellArrDataset object from the existing dataset
@@ -140,11 +150,23 @@ print(expression_data.gene_annotation)
446 gene_50
945 gene_95

This returns a `CellArrDatasetSlice` object that contains the matrix and the metadata `DataFrame`s along the cell and gene axes.

Users can easily convert these to analysis-ready representations:

```python
print("as anndata:")
print(expression_data.to_anndata())

print("\n\n as summarizedexperiment:")
print(expression_data.to_summarizedexperiment())
```

### A built-in dataloader for the `pytorch-lightning` framework

The package includes a dataloader in the `pytorch-lightning` framework for single-cell expression profiles, training labels, and study labels. The dataloader uniformly samples across training labels and study labels to create a diverse batch of cells.

This dataloader can be used as a template to create custom dataloaders specific to your needs.
This dataloader can be used as a **template** to create custom dataloaders specific to your needs.

```python
from cellarr.dataloader import DataModule
@@ -153,7 +175,7 @@ datamodule = DataModule(
    dataset_path="/path/to/cellar/dir",
    cell_metadata_uri="cell_metadata",
    gene_annotation_uri="gene_annotation",
    matrix_uri="counts",
    matrix_uri="assays/counts",
    label_column_name="label",
    study_column_name="study",
    batch_size=1000,
@@ -188,6 +210,9 @@ trainer = pl.Trainer(**params)
trainer.fit(autoencoder, datamodule=datamodule)
autoencoder.save_all(model_path=model_path)
```
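
The idea of sampling uniformly across labels can be sketched independently of `pytorch-lightning`. The example below covers only a single label column and uses a hypothetical `sample_balanced_batch` function; the package's `DataModule` also balances across study labels and may be implemented differently.

```python
import random
from collections import defaultdict


def sample_balanced_batch(labels, batch_size, rng):
    """Draw cell indices by picking a label uniformly, then a cell within that label."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    names = sorted(by_label)
    return [rng.choice(by_label[rng.choice(names)]) for _ in range(batch_size)]


# 95 cells of one type and 5 of another: a heavily imbalanced dataset.
labels = ["T cell"] * 95 + ["B cell"] * 5
batch = sample_balanced_batch(labels, batch_size=1000, rng=random.Random(0))
rare_fraction = sum(labels[i] == "B cell" for i in batch) / len(batch)
print(round(rare_fraction, 2))
```

Because the label is chosen before the cell, the rare label makes up roughly half of the batch rather than the 5% it represents in the data.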

Check out the [documentation](https://biocpy.github.io/cellarr/api/modules.html) for more details.

<!-- pyscaffold-notes -->

## Note
Binary file modified assets/cellarr.jpg
Binary file modified assets/cellarr.png
2 changes: 1 addition & 1 deletion assets/cellarr.svg
4 changes: 4 additions & 0 deletions pyproject.toml
@@ -17,6 +17,10 @@ extend-ignore = ["F821"]
[tool.ruff.pydocstyle]
convention = "google"

[tool.ruff.format]
docstring-code-format = true
docstring-code-line-length = 20

[tool.ruff.per-file-ignores]
"__init__.py" = ["E402", "F401"]
