Scripts for annotating 10x Genomics scRNA-seq analysis data
This repository requires that the pandoc and libhdf5-dev libraries are installed:
sudo apt-get install pandoc libhdf5-dev
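A quick pre-flight check can confirm that the command-line tools are on the PATH before starting a run. This is a minimal sketch, not part of the repository: the function name missing_commands is hypothetical, and it can only test for executables such as pandoc and Rscript, not for a development library like libhdf5-dev.

```shell
# Print the name of each command that is not available on PATH.
# Returns 0 when every command was found, 1 otherwise.
# (Hypothetical helper, not provided by this repository.)
missing_commands() {
  status=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      status=1
    fi
  done
  return "$status"
}
```

Calling missing_commands pandoc Rscript prints nothing when both tools are installed.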
It also depends on the H5weaver, jsonlite, rmarkdown, and optparse libraries.
jsonlite, rmarkdown, and optparse are available from CRAN, and can be installed in R using:
install.packages("jsonlite")
install.packages("rmarkdown")
install.packages("optparse")
H5weaver is found in the aifimmunology GitHub repositories. Install with:
devtools::install_github("aifimmunology/H5weaver")
This repository adds QC characteristics and cell metadata to 10x Genomics scRNA-seq results. It requires the filtered_feature_bc_matrix.h5, molecule_info.h5, and metrics_summary.csv files generated by cellranger count as inputs, as well as a SampleSheet.csv file (as described below), and generates a decorated output .h5 file and a JSON metrics file based on these parameters and a WellID.
There are 7 parameters for this script:
- -i or --in_h5: The path to the filtered_feature_bc_matrix.h5 file from cellranger outs/
- -l or --in_mol: The path to the molecule_info.h5 file from cellranger outs/
- -s or --in_sum: The path to the metrics_summary.csv file from cellranger outs/
- -k or --in_key: The path to SampleSheet.csv
- -w or --in_well: A well name to use for metadata, in the format [XB][0-9]{3}-P[0-9]C[0-9]W[0-9]
- -d or --out_dir: A directory to use to output the modified .h5 and JSON metrics
- -o or --out_html: A filename to use to output the HTML summary report file
An example run for a cellranger count result is:
Rscript --vanilla \
tenx-rnaseq-pipeline/run_add_tenx_rna_metadata.R \
-i /shared/lucasg/pipeline_cellhashing_tests/data/pool16/filtered_feature_bc_matrix.h5 \
-l /shared/lucasg/pipeline_cellhashing_tests/data/pool16/molecule_info.h5 \
-s /shared/lucasg/pipeline_cellhashing_tests/data/pool16/metrics_summary.csv \
-k /shared/lucasg/pipeline_cellhashing_tests/data/pool16/SampleSheet.csv \
-w X000-P1C1W3 \
-d /shared/lucasg/pipeline_cellhashing_tests/output/pool16/ \
-o /shared/lucasg/pipeline_cellhashing_tests/output/pool16/X000-P1C1W3_metadata_report.html
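Before launching a run like the one above, the well name can be checked against the documented format. The sketch below assumes the pattern is meant to match the whole name; the function name is_valid_well_id is hypothetical and not part of the pipeline.

```shell
# Succeed (exit 0) only when the argument matches
# [XB][0-9]{3}-P[0-9]C[0-9]W[0-9] from start to end.
# (Hypothetical helper; assumes the documented pattern covers the full name.)
is_valid_well_id() {
  printf '%s\n' "$1" | grep -Eq '^[XB][0-9]{3}-P[0-9]C[0-9]W[0-9]$'
}
```

Placing is_valid_well_id X000-P1C1W3 || exit 1 ahead of the Rscript call in a wrapper script would stop early on a malformed name.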
This script is designed to work with both Non-Hashed and Cell Hashed input data.
Each should have a slightly different SampleSheet.csv, as described below. The primary difference is that Non-Hashed SampleSheets have a WellID column, whereas Cell Hashed SampleSheets have a HashTag column. The script will use this difference to detect the type of run.
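The column-based detection described above can be reproduced outside of R, which is handy for checking a sheet before submitting a job. This is a sketch of the idea only, not the script's actual implementation; the function name sample_sheet_type is hypothetical.

```shell
# Print "nonhashed", "hashed", or "unknown" based on the header row
# of a SampleSheet.csv. (Hypothetical helper illustrating the rule above.)
sample_sheet_type() {
  # Read the header and drop a trailing carriage return from Windows-style files
  header=$(head -n 1 "$1" | tr -d '\r')
  case ",$header," in
    *,WellID,*)  echo "nonhashed" ;;
    *,HashTag,*) echo "hashed" ;;
    *)           echo "unknown" ;;
  esac
}
```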
For Non-Hashed runs, SampleSheet.csv conveys the relationship between SampleID and WellID.
It should have 4 columns: SampleID, BatchID, WellID, and PoolID:
SampleID,BatchID,WellID,PoolID
PB00042,X051,X051-P1C1W1,X051-P1
PB00043,X051,X051-P1C1W2,X051-P1
PB00044,X051,X051-P1C1W3,X051-P1
For Non-Hashed outputs, two files will be generated. The .h5 will be named based on PoolID and SampleID, while the JSON metrics for this well will be named based on WellID:
- .h5 file: [PoolID]_[SampleID].h5, e.g. X051-P1_PB00042.h5
- JSON file: [WellID]_well_metrics.json, e.g. X051-P1C1W1_well_metrics.json
This ensures that the .h5 filename matches the .h5 naming convention that Cell Hashed files obtain after the merge step.
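To see which filenames to expect for a given well, the naming rule can be applied directly to the sample sheet. The following is a minimal sketch under the naming convention stated above; nonhashed_output_names is a hypothetical helper, and the R script itself remains the authority on what is written.

```shell
# Print the expected output filenames for one well of a Non-Hashed run.
# $1: path to SampleSheet.csv (columns SampleID, BatchID, WellID, PoolID)
# $2: WellID to look up
# (Hypothetical helper; mirrors the documented naming convention only.)
nonhashed_output_names() {
  awk -F',' -v well="$2" '
    { sub(/\r$/, "") }
    NR == 1 {
      # Locate columns by name so that column order does not matter
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    $(col["WellID"]) == well {
      print $(col["PoolID"]) "_" $(col["SampleID"]) ".h5"
      print $(col["WellID"]) "_well_metrics.json"
    }
  ' "$1"
}
```

Nothing is printed when the WellID is absent from the sheet, which makes a missing entry easy to spot.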
For Cell Hashed runs, SampleSheet.csv conveys the relationship between SampleID and HashTag.
It should have 4 columns: SampleID, BatchID, HashTag, and PoolID:
SampleID,BatchID,HashTag,PoolID
PB00042,X051,HT1,X051-P1
PB00043,X051,HT2,X051-P1
PB00044,X051,HT3,X051-P1
For Cell Hashed outputs, two files will be generated and will be named based on WellID:
- .h5 file: [WellID].h5, e.g. X000-P1C1W3.h5
- JSON file: [WellID]_well_metrics.json, e.g. X000-P1C1W3_well_metrics.json
Unlike Non-Hashed datasets, these .h5 files are still a mixture of all SampleIDs, so filenames will reflect only the WellID at this stage.
Test runs can be performed using datasets provided with the H5weaver package by excluding parameters other than -o.
Rscript --vanilla \
tenx-rnaseq-pipeline/run_add_tenx_rna_metadata.R \
-o test_metadata_report.html
Some QC statistics and parameters differ when using cellranger-arc for 10x Multiome or TEA-seq experiments. To account for these differences, there are modified versions of a few key steps.
NOTE: The scripts for this step are stored in the tenx-atacseq-pipeline repository. This step only needs to be performed once for a given cellranger-arc output set.
Before processing ATAC or RNA data from cellranger-arc output, run 00_run_arc_formatting.R to add a common UUID and restructure the metadata files from arc for downstream processing. This ensures that the RNA and ATAC arms of the pipeline are not assigned different UUIDs.
There are two parameters for this script:
- -t: The path to cellranger-arc outs/
- -o: The path for the HTML output generated by the script
An example run is:
Rscript --vanilla tenx-atacseq-pipeline/00_run_arc_formatting.R \
-t outs/ \
-o arc_formatting_report.html
This script outputs .csv files to the outs/ directory for downstream processing:
- arc_singlecell.csv: Arc version of the standard 10x ATAC singlecell.csv output
- atac_summary.csv: Arc version of the standard 10x ATAC summary.csv output
- rna_summary.csv: Arc version of the standard 10x RNA metrics_summary.csv
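Since later steps rely on these three files, it can be worth confirming that they exist before moving on. The sketch below assumes the files are written directly into the outs/ directory under exactly the names listed above; missing_arc_outputs is a hypothetical helper.

```shell
# Print each expected formatting output that is missing from the given
# outs/ directory. Returns 0 when all three files are present, 1 otherwise.
# (Hypothetical helper; file names are taken from the list above.)
missing_arc_outputs() {
  status=0
  for f in arc_singlecell.csv atac_summary.csv rna_summary.csv; do
    if [ ! -f "$1/$f" ]; then
      echo "$f"
      status=1
    fi
  done
  return "$status"
}
```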
As for standard scRNA-seq cellranger results, this script will add additional cell metadata to the RNA .h5 files.
In addition, it will separate the Gene Expression and Peaks matrices into separate matrix objects in the .h5 file to enable downstream processing of hashed runs.
The main difference in parameters is the use of the cellranger outs/ directory rather than specifying individual outputs. This version of the script will detect whether cellranger or cellranger-arc was used based on the presence or absence of the output of the formatting script, above.
There are 5 parameters for this script:
- -t or --in_tenx: The path to cellranger-arc outs/
- -k or --in_key: The path to SampleSheet.csv
- -w or --in_well: A well name to use for metadata, in the format [XB][0-9]{3}-P[0-9]C[0-9]W[0-9]
- -d or --out_dir: A directory to use to output the modified .h5 and JSON metrics
- -o or --out_html: A filename to use to output the HTML summary report file
An example run for a cellranger-arc count result is:
Rscript --vanilla \
tenx-rnaseq-pipeline/run_crossplatform_rna_metadata.R \
-t outs/ \
-k SampleSheet.csv \
-w X000-P1C1W3 \
-d rna_preprocessed/ \
-o rna_preprocessed/X000-P1C1W3_metadata_report.html
This script is designed to work with both Non-Hashed and Cell Hashed input data.
Sample sheet formats follow the same conventions as for scRNA-seq, above.
Output formats follow the same conventions as for scRNA-seq, above.
We can perform QC analysis per well for ADT data in preparation for review and integration with scRNA-seq datasets. This step requires the Tag_counts.csv file generated by BarCounter and a cell barcode whitelist as inputs, and generates tables of ADT counts and QC metrics, as well as a report for review.
BarCounter is used to generate the Tag_counts.csv upstream of QC analysis. The barcode whitelist supplied as input to BarCounter should be the unfiltered whitelist generated by cellranger. This retains counts for both called cells and non-cells, so that background count levels can be determined.
This file should be located within the cellranger outs/ directory at:
outs/raw_feature_bc_matrix/barcodes.tsv
Note that the whitelist supplied to the QC Analysis in the next step is a different file.
There are 5 parameters for this script:
- -i or --in_counts: The path to a Tag_counts.csv result from BarCounter
- -b or --in_bcs: The path to the barcodes.tsv file for filtered cells from cellranger outs/
- -w or --well_id: A well name to use for metadata, in the format [XB][0-9]{3}-P[0-9]C[0-9]W[0-9]
- -d or --out_dir: A directory to use to output the ADT tables
- -o or --out_html: A filename to use to output the HTML summary report file
An example run for a cellranger count result is:
Rscript --vanilla \
tenx-rnaseq-pipeline/run_adt_well_qc.R \
-i /data/tarpits/BarCounter/X070-EP1C1W1_Tag_counts.csv \
-b /data/tarpits/outs/filtered_feature_bc_matrix/barcodes.tsv \
-w X070-EP1C1W1 \
-d /data/tarpits/adt_qc/ \
-o /data/tarpits/adt_qc/X070-EP1C1W1_adt_qc_report.html
Tag_counts.csv inputs should follow the format resulting from running BarCounter, with a header row followed by results for one cell barcode per subsequent row. cell_barcode and total should be the first two columns, followed by counts for each ADT:
cell_barcode,total,CD3E,CD4,CD8,CD56
ACACTGAGTCATCCCT,1,1,0,0,0
CACGGGTAGATGATTG,17,1,2,1,13
GCTTGGGAGGTCGTGA,9,1,2,1,5
TTGGGATAGTTACGAA,9,2,3,2,2
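The layout above implies that the header begins with cell_barcode,total and that every row has the same number of fields as the header. A quick structural check along those lines could look like the sketch below; check_tag_counts is a hypothetical helper, and it does not validate the counts themselves.

```shell
# Report structural problems in a Tag_counts.csv file.
# Returns 0 when the header starts with cell_barcode,total and every
# row has as many fields as the header; returns 1 otherwise.
# (Hypothetical helper; checks layout only, not count values.)
check_tag_counts() {
  awk -F',' '
    { sub(/\r$/, "") }
    NR == 1 {
      n = NF
      if ($1 != "cell_barcode" || $2 != "total") {
        print "header must start with cell_barcode,total"
        bad = 1
      }
      next
    }
    NF != n {
      print "line " NR ": expected " n " fields, found " NF
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

Running this on each Tag_counts.csv before the QC step catches rows with a stray or missing field.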
To filter cell barcodes and compare signal to background barcodes, the script also needs the filtered barcodes.tsv file generated by cellranger.
It should be located within the cellranger outs/ directory at:
outs/filtered_feature_bc_matrix/barcodes.tsv
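Because the raw and filtered barcode lists are easy to mix up, one sanity check is to confirm that every filtered barcode also appears in the raw list. This is a sketch that assumes both files are uncompressed, one barcode per line, as in the paths given in this document; count_unmatched_barcodes is a hypothetical helper.

```shell
# Count filtered barcodes that do not appear in the raw (unfiltered) list.
# $1: raw barcodes.tsv, $2: filtered barcodes.tsv
# Prints 0 when the filtered list is a subset of the raw list.
# (Hypothetical helper; assumes uncompressed files, one barcode per line.)
count_unmatched_barcodes() {
  awk '
    NR == FNR { raw[$0] = 1; next }   # first file: remember raw barcodes
    !($0 in raw) { n++ }              # second file: count barcodes not seen
    END { print n + 0 }
  ' "$1" "$2"
}
```

A non-zero result suggests that the two files come from different runs or were passed in the wrong order.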
Two files will be generated in the output directory: an ADT count matrix and an ADT metadata file for use with downstream processing.
- counts: [WellID]_adt_positive_tag_counts.csv
- metadata: [WellID]_adt_metadata.csv
After both scRNA-seq and ADT data have been through well QC, these datasets can be merged into a single .h5 file for downstream processing.
There are 6 parameters for this script:
- -i or --in_h5: The path to the .h5 generated by run_crossplatform_rna_metadata.R
- -c or --in_adt_counts: The path to the ADT positive tag counts .csv file generated by run_adt_well_qc.R
- -m or --in_adt_meta: The path to the ADT metadata .csv file generated by run_adt_well_qc.R
- -w or --well_id: A well name to use for metadata, in the format [XB][0-9]{3}-P[0-9]C[0-9]W[0-9]
- -d or --out_dir: A directory to use to output the modified .h5 and JSON metrics
- -o or --out_html: A filename to use to output the HTML summary report file
An example run for a cellranger count result is:
Rscript --vanilla \
tenx-rnaseq-pipeline/run_adt_injection.R \
-i /data/tarpits/rna_preprocessed/X070-P1C1W1.h5 \
-c /data/tarpits/adt_qc/X070-EP1C1W1_adt_positive_tag_counts.csv \
-m /data/tarpits/adt_qc/X070-EP1C1W1_adt_metadata.csv \
-w X070-P1C1W1 \
-d /data/tarpits/rna_adt_injection/ \
-o /data/tarpits/adt_qc/X070-P1C1W1_adt_injection_report.html
An updated .h5 file will be generated based on the filename of the input .h5, with the addition of _adt before .h5:
- .h5 file: [PoolID]_[SampleID]_adt.h5, e.g. X051-P1_PB00042_adt.h5
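For scripting downstream steps, the output name can be derived from the input path using the rule just described. This is a minimal sketch of that rule; adt_output_name is a hypothetical helper, and it assumes the output keeps the input's base filename.

```shell
# Print the expected output filename for an input .h5 path:
# the base filename with _adt inserted before the .h5 extension.
# (Hypothetical helper; mirrors the documented naming rule only.)
adt_output_name() {
  name=$(basename "$1" .h5)
  echo "${name}_adt.h5"
}
```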
The license for this package is available on GitHub in the file LICENSE.txt in this repository.
We are not currently supporting this code; it is released to the community AS IS, without any guarantee of support. The community is welcome to submit issues, but should not expect an active response.
If you contribute code to this repository through pull requests or other mechanisms, you are subject to the Allen Institute Contribution Agreement, which is available in the file CONTRIBUTING.md in this repository.