Support for multiple matrices and improving construction of TileDB ob…

…jects (#53) * Support multiple matrices during construction and query of slices * Parallelize matrix ingestion into TileDB * Added method to access an entire sample * Fix bugs with filtering frames with tiledb query expressions * Better remapping of sliced coo matrices from tiledb arrays * Update documentation, tests and README
CellArr · Nov 28, 2024 · 48de52c · 48de52c
1 parent 74bb680
commit 48de52c
Show file tree

Hide file tree

Showing 19 changed files with 697 additions and 211 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,66 @@
 # Changelog
 
+## Version 0.3.0
+
+This version introduces major improvements to matrix handling, storage, and performance, including support for multiple matrices in H5AD/AnnData workflows and optimizations for ingestion and querying.
+
+**Support for multiple matrices**:
+  - Both `build_cellarrdataset` and `CellArrDataset` now support multiple matrices. During ingestion, a TileDB group called `"assays"` is created to store all matrices, along with group-level metadata.
+
+This may introduce breaking changes with the default parameters based on how these classes are used. Previously to build the TileDB files:
+
+```python
+dataset = build_cellarrdataset(
+    output_path=tempdir,
+    files=[adata1, adata2],
+    matrix_options=MatrixOptions(matrix_name="counts", dtype=np.int16),
+    num_threads=2,
+)
+```
+
+Now you may provide a list of matrix options for each layers in the files.
+
+```python
+dataset = build_cellarrdataset(
+    output_path=tempdir,
+    files=[adata1, adata2],
+    matrix_options=[
+        MatrixOptions(matrix_name="counts", dtype=np.int16),
+        MatrixOptions(matrix_name="log-norm", dtype=np.float32),
+    ],
+    num_threads=2,
+)
+```
+
+Querying follows a similar structure:
+```python
+cd = CellArrDataset(
+    dataset_path=tempdir,
+    assay_tiledb_group="assays",
+    assay_uri=["counts", "log-norm"]
+)
+```
+`assay_uri` is relative to `assay_tiledb_group`. For backwards compatibility, `assay_tiledb_group` can be an empty string.
+
+**Parallelized ingestion**:
+The build process now uses `num_threads` to ingest matrices concurrently. Two new columns in the sample metadata, `cellarr_sample_start_index` and `cellarr_sample_end_index`, track sample offsets, improving matrix processing.
+  - Note: The process pool uses the `spawn` method on UNIX systems, which may affect usage on windows machines.
+
+**TileDB query condition fixes**:
+Fixed a few issues with fill values represented as bytes (seems to be common when ascii is used as the column type) and in general filtering operations on TileDB Dataframes.
+
+**Index remapping**:
+Improved remapping of indices from sliced TileDB arrays for both dense and sparse matrices. This is not a user facing function but an internal slicing operation.
+
+**Get a sample**:
+Added a method to access all cells for a particular sample. you can either provide an index or a sample id.
+
+```python
+sample_1_slice = cd.get_cells_for_sample(0)
+```
+
+Other updates to documentation, tutorials, the README, and additional tests.
+
 ## Version 0.2.4 - 0.2.5
 
 - Provide options to extract an expected set of cell metadata columns across datasets.

diff --git a/README.md b/README.md
@@ -29,11 +29,9 @@ Building a `CellArrDataset` generates 4 TileDB files in the specified output dir
 - `sample_metadata`: A TileDB file containing sample metadata.
 - `cell_metadata`: A TileDB file containing cell metadata including mapping to the samples
 they are tagged with in ``sample_metadata``.
-- A matrix TileDB file named by the `layer_matrix_name` parameter. This allows the package
-to store multiple different matrices, e.g. 'counts', 'normalized', 'scaled' for the same cell,
-gene, sample metadata attributes.
+- An `assay` TileDB group containing various matrices. This allows the package to store multiple different matrices, e.g. 'counts', 'normalized', 'scaled' for the same sample/cell and gene attributes.
 
-The organization is inspired by the [MultiAssayExperiment](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html) data structure.
+The organization is inspired by Bioconductor's `SummarizedExperiment` data structure.
 
 The TileDB matrix file is stored in a **cell X gene** orientation. This orientation
 is chosen because the fastest-changing dimension as new files are added to the
@@ -64,7 +62,19 @@ adata2 = "path/to/object2.h5ad"
 dataset = build_cellarrdataset(
     output_path=tempdir,
     files=[adata1, adata2],
-    matrix_options=MatrixOptions(dtype=np.float32),
+    matrix_options=MatrixOptions(matrix_name="counts", dtype=np.int16),
+    num_threads=2,
+)
+
+# Or if the objects contain multiple assays
+dataset = build_cellarrdataset(
+    output_path=tempdir,
+    files=[adata1, adata2],
+    matrix_options=[
+        MatrixOptions(matrix_name="counts", dtype=np.int16),
+        MatrixOptions(matrix_name="log-norm", dtype=np.float32)
+    ],
+    num_threads=2,
 )
 ```
 
@@ -91,7 +101,7 @@ if these are `AnnData` or `H5AD`objects, all objects must contain an index (in t
 
 #### Optionally provide cell metadata columns
 
-If the cell metadata is inconsistent across datasets, you can provide a list of
+If the cell metadata is inconsistent across datasets, you may provide a list of
 columns to standardize during extraction. Any missing columns will be filled with
 the default value `'NA'`, and their data type should be specified as `'ascii'` in
 `CellMetadataOptions`. For example, this build process will create a TileDB store
@@ -115,7 +125,7 @@ Check out the [documentation](https://biocpy.github.io/cellarr/tutorial.html) fo
 
 ### Query a `CellArrDataset`
 
-Users have the option to reuse the `dataset` object retuned when building the dataset or by creating a `CellArrDataset` object by initializing it to the path where the files were created.
+Users have the option to reuse the `dataset` object returned when building the dataset or by creating a `CellArrDataset` object by initializing it to the path where the files were created.
 
 ```python
 # Create a CellArrDataset object from the existing dataset
@@ -140,11 +150,23 @@ print(expression_data.gene_annotation)
      446	gene_50
      945	gene_95
 
+This returns a `CellArrDatasetSlice` object that contains the matrix and metadata `DataFrame`'s along the cell and gene axes.
+
+Users can easily convert these to analysis-ready representations
+
+```python
+print("as anndata:")
+print(expression_data.to_anndata())
+
+print("\n\n as summarizedexperiment:")
+print(expression_data.to_summarizedexperiment())
+```
+
 ### A built-in dataloader for the `pytorch-lightning` framework
 
 The package includes a dataloader in the `pytorch-lightning` framework for single cells expression profiles, training labels, and study labels. The dataloader uniformly samples across training labels and study labels to create a diverse batch of cells.
 
-This dataloader can be used as a template to create custom dataloaders specific to your needs.
+This dataloader can be used as a **template** to create custom dataloaders specific to your needs.
 
 ```python
 from cellarr.dataloader import DataModule
@@ -153,7 +175,7 @@ datamodule = DataModule(
     dataset_path="/path/to/cellar/dir",
     cell_metadata_uri="cell_metadata",
     gene_annotation_uri="gene_annotation",
-    matrix_uri="counts",
+    matrix_uri="assays/counts",
     label_column_name="label",
     study_column_name="study",
     batch_size=1000,
@@ -188,6 +210,9 @@ trainer = pl.Trainer(**params)
 trainer.fit(autoencoder, datamodule=datamodule)
 autoencoder.save_all(model_path=model_path)
 ```
+
+Check out the [documentation](https://biocpy.github.io/cellarr/api/modules.html) for more details.
+
 <!-- pyscaffold-notes -->
 
 ## Note

diff --git a/assets/cellarr.jpg b/assets/cellarr.jpg
diff --git a/assets/cellarr.png b/assets/cellarr.png
diff --git a/assets/cellarr.svg b/assets/cellarr.svg
diff --git a/pyproject.toml b/pyproject.toml
@@ -17,6 +17,10 @@ extend-ignore = ["F821"]
 [tool.ruff.pydocstyle]
 convention = "google"
 
+[tool.ruff.format]
+docstring-code-format = true
+docstring-code-line-length = 20
+
 [tool.ruff.per-file-ignores]
 "__init__.py" = ["E402", "F401"]