subsample sce object based on factor in colData #56

Open
baj12 opened this issue Feb 19, 2023 · 1 comment

Comments

baj12 commented Feb 19, 2023

I would like to subsample a SingleCellExperiment object based on a factor in colData.

I have a SingleCellExperiment object:

> sce
# A SingleCellExperiment-tibble abstraction: 13,268,769 × 6
# Features=42 | Assays=exprs

with some colData:

> colData(sce)
DataFrame with 13268769 rows and 5 columns
         sample_id condition patient_id    label1 cluster_id
          <factor>  <factor>   <factor> <numeric>   <factor>
1            D929I       Ref      D929I        36        302
2            D929I       Ref      D929I        29        285
3            D929I       Ref      D929I        50        103
4            D929I       Ref      D929I        36        302
5            D929I       Ref      D929I        51        181
...            ...       ...        ...       ...        ...
13268765     D232I       Ref      D232I        51        201
13268766     D232I       Ref      D232I        28        304
13268767     D232I       Ref      D232I        50        5  
13268768     D232I       Ref      D232I        51        184
13268769     D232I       Ref      D232I        18        364

I would like to subsample based on the cluster_id column so that I keep at most X (here 500) events per cluster.

I can get the selection of cells using the following code:

> sce %>% group_by(cluster_id) %>% slice_sample(n=500) %>% ungroup()
tidySingleCellExperiment says: A data frame is returned for independent data analysis.
# A tibble: 200,000 × 6
   .cell    sample_id condition patient_id label1 cluster_id
   <chr>    <fct>     <fct>     <fct>       <dbl> <fct>     
 1 4002318  D0749I    Ref       D0749I         60 1         
 2 10259368 D590I     Ref       D590I          60 1         
 3 12615676 D232I     Ref       D232I          25 1         
 4 6765422  D694I     Ref       D694I          25 1         
 5 9415336  D0553I    Ref       D0553I         60 1         
 6 7245671  D694I     Ref       D694I          25 1         
 7 7177144  D694I     Ref       D694I          42 1         
 8 7002069  D694I     Ref       D694I          49 1         
 9 8732040  D615I     Ref       D615I          60 1         
10 3989255  D0749I    Ref       D0749I         60 1         
# … with 199,990 more rows
# ℹ Use `print(n = ...)` to see more rows

But this returns a tibble, and I don't know how to use it to filter the original SingleCellExperiment object.

Could you please give me a pointer?

Thanks

@stemangiola
Owner

Sorry, this slipped through the cracks.

At the moment you can use

nest() |>
mutate(map(...)) |>
unnest()

In the future we might add group_by support that preserves the SingleCellExperiment object, but we don't have plans for it yet. (Pull requests are always welcome, though!)
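
As an alternative to the nest/map/unnest route, the subsampling can also be sketched in base R by computing column indices per cluster and subsetting the object with them. This is only a minimal sketch, not a tidySingleCellExperiment feature: subsample_indices is a hypothetical helper name, and the toy cluster_id factor stands in for colData(sce)$cluster_id so the block runs without Bioconductor installed.

```r
# Hypothetical helper (not part of tidySingleCellExperiment): returns the
# positions to keep so that each level of `groups` has at most `max_n` entries.
subsample_indices <- function(groups, max_n) {
  # Positions of the cells belonging to each cluster (empty levels dropped)
  idx_by_group <- split(seq_along(groups), groups, drop = TRUE)
  keep <- lapply(idx_by_group, function(i) {
    # Sample only when the cluster is larger than the cap; otherwise keep all
    if (length(i) > max_n) i[sample.int(length(i), max_n)] else i
  })
  # Return positions in their original order
  sort(as.integer(unlist(keep, use.names = FALSE)))
}

# Toy stand-in for colData(sce)$cluster_id (assumption, for illustration only)
set.seed(42)
cluster_id <- factor(sample(c("1", "2", "3"), 2000, replace = TRUE,
                            prob = c(0.7, 0.25, 0.05)))

keep <- subsample_indices(cluster_id, max_n = 500)
print(table(cluster_id[keep]))

# With a real object, the same indices subset the columns (cells):
# keep    <- subsample_indices(sce$cluster_id, max_n = 500)
# sce_sub <- sce[, keep]
```

Because the result is a plain integer vector, sce[, keep] returns a SingleCellExperiment rather than a tibble. Call set.seed() beforehand if the subsample needs to be reproducible.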
