Reintegration of subset causing very large file size #42

Open
maxc2020 opened this issue Jan 28, 2023 · 5 comments

Comments


maxc2020 commented Jan 28, 2023

Hello,

Thank you for this great tool.

I'm analyzing multiple samples from different conditions and comparing them to published data; in total there are 240k cells across ~40 samples. Four broad cell types (epithelial, stromal, vascular, and immune) are present in many but not all samples (i.e., samples are not balanced by cell type). I normalized in Seurat with SCTv2 and then integrated with FastMNN via the SeuratWrappers function, and FastMNN does a great job; other methods have led to significant misclassification of cell types.

My issue and request for help is related to subsetting. I generated cell-type-specific variable features using subsets from a few samples with 3-5k cells of that type, and then used them when rerunning FastMNN on a subsetted object as follows. Note that the DietSeurat call strips the Seurat object of all other assays/reductions, such as the original MNN. The issue is that my output object is now enormous: 140 GB for 55k cells. I tried with another cell type with only 5k cells and that object was 30 GB. The starting integrated object, with all 240k cells, is about 14 GB. My guess is that this may be related to some samples having fewer than 50 cells after subsetting, but I am not sure; a few samples only had 100-300 cells in the initial object creation, so perhaps low cells/sample isn't the issue after all.

library(Seurat)
library(SeuratWrappers)

mnn_integrated_object <- readRDS(file = "xxx")
# Keep only the RNA and SCT assays; this also drops the original MNN reduction.
mnn_integrated_object <- DietSeurat(mnn_integrated_object, counts = TRUE, data = TRUE, assays = c("RNA", "SCT"))
gc()
subset_mnn <- subset(mnn_integrated_object, cells = Cell_Type_1_List)
DefaultAssay(subset_mnn) <- "SCT"
# VariableFeatures is a pre-computed character vector of cell-type-specific genes.
subset_mnn <- RunFastMNN(object.list = SplitObject(subset_mnn, split.by = "orig.ident"), features = VariableFeatures, assay = "SCT", verbose = TRUE)
subset_mnn <- RunUMAP(subset_mnn, reduction = "mnn", dims = 1:50, min.dist = 0.3)
subset_mnn <- FindNeighbors(subset_mnn, reduction = "mnn", dims = 1:50)
subset_mnn <- FindClusters(subset_mnn, resolution = 1.0)

Any idea what is going on? Of course, I could just skip rerunning FastMNN on the subset, but I suspect there is interesting biology present and would prefer to redo the PCA/integration steps.

Many thanks!

LTLA commented Jan 28, 2023

Hm. My best guess is that it's something to do with the reconstructed assay, which is a Very Special matrix. If you look at it directly after it's returned from fastMNN() (e.g., by debugging whatever RunFastMNN is doing), you'll see that it has a type of LowRankMatrix. This is a delayed cross-product of the (MNN-corrected) PCs and the rotation vectors, e.g.

set.seed(123456)

library(Matrix)
library(BiocSingular)

# Simulate a large sparse matrix and keep its first 10 PCs.
a <- rsparsematrix(50000, 20000, density=0.01)
out <- runPCA(a, rank=10, BSPARAM=IrlbaParam())

# Creating a low-rank reconstruction with the first 10 PCs.     
lr <- LowRankMatrix(out$rotation, out$x)
lr
## <20000 x 50000> matrix of class LowRankMatrix and type "double":
##                   [,1]          [,2]          [,3] ...      [,49999]
##     [1,]  0.0057135993 -0.0045652364  0.0039531449   . -1.254121e-03
##     [2,]  0.0074395154  0.0046073647  0.0066427142   . -9.973139e-04
##     [3,]  0.0043550788 -0.0018743069  0.0006405070   . -3.131222e-03
##     [4,] -0.0003137060  0.0017982014  0.0009785337   . -5.224690e-03
##     [5,]  0.0009093743 -0.0022579247 -0.0019486521   . -2.947885e-03
##      ...             .             .             .   .             .
## [19996,] -0.0046791718 -0.0073614454  0.0007148935   . -0.0024410180
## [19997,] -0.0048366051  0.0008190038 -0.0072412677   . -0.0004270998
## [19998,]  0.0034414458 -0.0026491549  0.0039459904   . -0.0038380154
## [19999,]  0.0051936588 -0.0004937389  0.0048754591   . -0.0020620816
## [20000,]  0.0007477738  0.0018794305 -0.0011703138   . -0.0019812603
##               [,50000]
##     [1,] -2.319264e-03
##     [2,]  3.829845e-04
##     [3,] -5.374638e-04
##     [4,]  2.443269e-05
##     [5,] -1.632262e-03
##      ...             .
## [19996,] -0.0019683190
## [19997,] -0.0013774037
## [19998,]  0.0005604048
## [19999,]  0.0019135394
## [20000,]  0.0009124538

DelayedArray::is_sparse(lr)
## [1] FALSE

object.size(lr)
## 5603648 bytes

lr is basically pretending to be a dense matrix, but it's only 5.6 MB in memory because the cross-product operation is not actually performed. Instead, thanks to DelayedArray magic, the cross-product is computed dynamically when particular rows or columns are requested. This allows us to hold the reconstructed assay cheaply in memory.

However, if you're working with an analysis environment that doesn't understand DelayedArray structures (like Seurat, apparently), then it's likely that the LowRankMatrix is converted into one of the supported matrix types - probably either an ordinary dense matrix or a dgCMatrix. In this case, the cross-product is realized and you are now obliged to hold the result in memory. In the case of a 50k-by-20k double-precision matrix, that's 8 GB.
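To make the realization cost concrete, here's a minimal sketch (deliberately using a much smaller matrix than the one above, so the dense result fits comfortably in memory) of how converting the delayed LowRankMatrix into an ordinary dense matrix inflates its footprint:

library(Matrix)
library(BiocSingular)

small <- rsparsematrix(5000, 2000, density=0.01)
pcs <- runPCA(small, rank=10, BSPARAM=IrlbaParam())
lr.small <- LowRankMatrix(pcs$rotation, pcs$x)

# Delayed form: only the rotation vectors and PCs are stored (~0.6 MB).
object.size(lr.small)

# Realized form: the full 2000 x 5000 dense cross-product (~80 MB).
object.size(as.matrix(lr.small))

The same kind of inflation (5.6 MB versus 8 GB in the example above) is what you'd expect when a wrapper silently coerces the reconstructed assay into a dense matrix.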

So, the simplest approach is to just drop the reconstructed assay before it reenters the Seurat world. For various reasons (check out the relevant chapter of the OSCA book), I don't think that batch-corrected expression values are particularly reliable for per-gene inferences; I only use them to obtain a clustering in a common coordinate space, and then I go on to block on the batch in downstream differential analyses.
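For example, in the workflow above, a minimal sketch of that drop (assuming the wrapper stored the reconstruction in an assay called "mnn.reconstructed"; check Assays(subset_mnn) for the actual name in your SeuratWrappers version) would be:

# Drop the realized reconstructed assay before saving; the corrected PCs
# stay in the "mnn" reduction, which is all the downstream steps use.
DefaultAssay(subset_mnn) <- "SCT"
if ("mnn.reconstructed" %in% Assays(subset_mnn)) {
    subset_mnn[["mnn.reconstructed"]] <- NULL
}
saveRDS(subset_mnn, file = "subset_mnn.rds")  # hypothetical output path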

If it's not that, then I don't know. The only "real" output of fastMNN is the matrix of corrected PCs, which is just of size equal to 50 (or however many PCs you picked) times the number of cells. Even with 240k cells, that's just a bit under 100 MB, so I don't know what's going on with their converters internally.
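As a quick back-of-envelope check of that size:

# 240,000 cells x 50 corrected PCs, stored as doubles.
240000 * 50 * 8 / 1024^2
## [1] 91.55273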

@maxc2020

Thank you. This sounds likely to be an issue with the SeuratWrappers implementation, so I'll work on figuring out how to run batchelor directly on an SCE object.

@maxc2020

When I subset my Seurat object to the cell type of interest (based on the original all-cell FastMNN clusters), I am able to run correctExperiments successfully on a list of SingleCellExperiment objects for two of the three cell types of interest.

library(Seurat)
library(batchelor)

Seurat_object_Cell_Type1 <- subset(Seurat_object_AllCells, cells = Cell_Type_1_List)
DefaultAssay(Seurat_object_Cell_Type1) <- "SCT"
Seurat_object_Cell_Type1 <- DietSeurat(Seurat_object_Cell_Type1, counts = TRUE, data = TRUE, assays = c("SCT"))
# Split by sample and convert each piece to a SingleCellExperiment.
ObjectList_Cell_Type1 <- SplitObject(Seurat_object_Cell_Type1, split.by = "orig.ident")
ObjectList_Cell_Type1 <- lapply(ObjectList_Cell_Type1, FUN = function(x) {
  DefaultAssay(x) <- "SCT"
  as.SingleCellExperiment(x)
})
SCE_object_CellType1 <- correctExperiments(ObjectList_Cell_Type1, subset.row = VariableFeatures)

But with my third cell type, this error is thrown:

Error in intI(i, n = x@Dim[1], dn[[1]], give.dn = FALSE) : 
  invalid character indexing

Based on a manual inspection, I can see that all of the subsetted SingleCellExperiment objects in the "ObjectList_Cell_Type1" list have the same number of genes. Changing the workflow so that SplitObject is never run, and correctExperiments is instead run on a merged object in SCE format, throws the same error (e.g., correctExperiments(SCE_object_Cell_Type1, subset.row = VariableFeatures, batch = SCE_object_Cell_Type1$orig.ident)). Including "checkDimnames = FALSE" in the conversion to SCE objects did not help either. I've tried reviewing the batchelor and SummarizedExperiment help pages and vignettes but can't quite tell what is going on.

Any ideas or advice?

Thanks again for your great package. For the cell type I have gotten through the workflow with, the dimensional reduction does a great job of handling biological variation and heterogeneity in sample composition.


LTLA commented Jan 31, 2023

My best guess is that you are asking for features that are not present in the matrix for the third cell type.

library(Matrix)
y <- rsparsematrix(10, 10, 0.1)
rownames(y) <- LETTERS[1:10]

# Indexing with names that are not in rownames(y) triggers the same error.
y[letters[1:10],]
## Error in intI(i, n = x@Dim[1], dn[[1]], give.dn = FALSE) : 
##  invalid character indexing

Seems like something to do with Seurat's preprocessing of the matrix.
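If that is the cause, a quick check (reusing the object names from your snippet above, which are assumptions on my part) is to see which requested features are missing from each per-sample object before calling correctExperiments:

# Which of the requested variable features are absent from each per-sample SCE?
missing <- lapply(ObjectList_Cell_Type1, function(sce) setdiff(VariableFeatures, rownames(sce)))
lengths(missing)  # any non-zero entry will trigger the invalid character indexing error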

@maxc2020

Thank you. Because I had normalized with SCTransform, there were some limits on how I could generate my variable feature list, so I had simply selected ~8 samples with the largest number of cells in each cell-type subset, identified variable genes there, and taken the overlap. This worked great with the first two cell types, but the error above occurred with my third. This time I redid the normalization and variable feature finding in a different way, using all cells/samples of that cell type, and used the result as the variable feature list for correctExperiments; it worked great. Appreciate all the help.
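An alternative safeguard (a sketch under the same assumed object names as above) is to restrict the requested features to genes present in every per-sample object before correcting:

# Keep only variable features that exist in every per-sample SCE,
# so subset.row never asks for a gene that is absent.
shared.genes <- Reduce(intersect, lapply(ObjectList_Cell_Type1, rownames))
usable.features <- intersect(VariableFeatures, shared.genes)
SCE_object_CellType1 <- correctExperiments(ObjectList_Cell_Type1, subset.row = usable.features)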
