tenx-rnaseq-pipeline

Scripts for annotating 10x Genomics scRNA-seq analysis data

Dependencies

This repository requires the pandoc and libhdf5-dev system packages, which can be installed on Debian or Ubuntu with:

sudo apt-get install pandoc libhdf5-dev

It also depends on the H5weaver, jsonlite, rmarkdown, and optparse R packages.

jsonlite, rmarkdown and optparse are available from CRAN, and can be installed in R using:

install.packages("jsonlite")
install.packages("rmarkdown")
install.packages("optparse")

H5weaver is available from the aifimmunology GitHub repositories. Install it with devtools (itself available from CRAN):

devtools::install_github("aifimmunology/H5weaver")
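
Before running any of the scripts, it can help to confirm that the required command-line tools are on your PATH. The following is a minimal shell sketch: missing_commands is a hypothetical helper that is not part of this repository, and it only checks for executables, not for libhdf5-dev or the R packages.

```shell
# Hypothetical helper (not part of this repository): prints the name of
# every command given as an argument that cannot be found on PATH.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}
```

For example, `missing_commands pandoc Rscript` prints nothing when both tools are available.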

Metadata annotation for cellranger results

This repository adds QC characteristics and cell metadata to 10x Genomics scRNA-seq results. It requires the filtered_feature_bc_matrix.h5, molecule_info.h5, and metrics_summary.csv files generated by cellranger count as inputs, as well as a SampleSheet.csv file (as described below) and a WellID. From these it generates a decorated output .h5 file and a JSON metrics file.

There are 7 parameters for this script:

-i or --in_h5: The path to the filtered_feature_bc_matrix.h5 file from cellranger outs/
-l or --in_mol: The path to the molecule_info.h5 file from cellranger outs/
-s or --in_sum: The path to the metrics_summary.csv file from cellranger outs/
-k or --in_key: The path to SampleSheet.csv
-w or --in_well: A well name to use for metadata in the format [XB][0-9]{3}-P[0-9]C[0-9]W[0-9]
-d or --out_dir: A directory to use to output the modified .h5 and JSON metrics
-o or --out_html: A filename to use to output the HTML summary report file

An example run for a cellranger count result is:

Rscript --vanilla \
  tenx-rnaseq-pipeline/run_add_tenx_rna_metadata.R \
  -i /shared/lucasg/pipeline_cellhashing_tests/data/pool16/filtered_feature_bc_matrix.h5 \
  -l /shared/lucasg/pipeline_cellhashing_tests/data/pool16/molecule_info.h5 \
  -s /shared/lucasg/pipeline_cellhashing_tests/data/pool16/metrics_summary.csv \
  -k /shared/lucasg/pipeline_cellhashing_tests/data/pool16/SampleSheet.csv \
  -w X000-P1C1W3 \
  -d /shared/lucasg/pipeline_cellhashing_tests/output/pool16/ \
  -o /shared/lucasg/pipeline_cellhashing_tests/output/pool16/X000-P1C1W3_metadata_report.html
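
Because the -w value is expected to follow the WellID format listed in the parameters above, a quick check before launching the script can catch typos. This is a minimal sketch; is_valid_well_id is a hypothetical helper and is not part of this repository, and whether the R script itself enforces the format is not stated here.

```shell
# Hypothetical helper (not part of this repository): succeeds only when the
# argument matches the documented format [XB][0-9]{3}-P[0-9]C[0-9]W[0-9].
is_valid_well_id() {
  printf '%s\n' "$1" | grep -Eq '^[XB][0-9]{3}-P[0-9]C[0-9]W[0-9]$'
}
```

For example, `is_valid_well_id X000-P1C1W3 || echo "bad WellID"` prints nothing, because X000-P1C1W3 matches the format.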

`SampleSheet.csv` formats and output filenames

This script is designed to work with both Non-Hashed and Cell Hashed input data.

Each should have a slightly different SampleSheet.csv, as described below. The primary difference is that Non-Hashed SampleSheets have a WellID column, whereas Cell Hashed SampleSheets have a HashTag column. The script will use this difference to detect the type of run.
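
The same column-based rule can be applied from the shell to check which kind of run a given sheet describes. This is only a sketch of the rule as described above, not the detection code used by the R script; sample_sheet_type is a hypothetical helper.

```shell
# Hypothetical helper (not part of this repository): prints "hashed",
# "non-hashed", or "unknown" depending on which column name appears in the
# header row of the SampleSheet.csv given as the first argument.
sample_sheet_type() {
  header=$(head -n 1 "$1" | tr -d '\r')
  case ",$header," in
    *,HashTag,*) echo "hashed" ;;
    *,WellID,*)  echo "non-hashed" ;;
    *)           echo "unknown" ;;
  esac
}
```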

Non-Hashed `SampleSheet.csv`

For Non-Hashed runs, SampleSheet.csv conveys the relationship between SampleID and WellID.

It should have 4 columns: SampleID, BatchID, WellID, and PoolID

SampleID,BatchID,WellID,PoolID
PB00042,X051,X051-P1C1W1,X051-P1
PB00043,X051,X051-P1C1W2,X051-P1
PB00044,X051,X051-P1C1W3,X051-P1

Non-Hashed output files

For Non-Hashed outputs, two files will be generated. The .h5 will be named based on PoolID and SampleID, while the JSON metrics for this well will be named based on WellID:

.h5 file: [PoolID]_[SampleID].h5, e.g. X051-P1_PB00042.h5
JSON file: [WellID]_well_metrics.json, e.g. X051-P1C1W1_well_metrics.json

This ensures that the .h5 filename matches the .h5 naming convention that Cell Hashed files obtain after the merge step.
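
To anticipate which files a Non-Hashed run should produce, the naming convention above can be applied to the sheet directly. The sketch below assumes a comma-separated sheet with the four column names shown earlier; nonhashed_output_names is a hypothetical helper and not part of this repository.

```shell
# Hypothetical helper (not part of this repository): given a Non-Hashed
# SampleSheet.csv and a WellID, print the expected .h5 and JSON filenames.
nonhashed_output_names() {
  awk -F',' -v well="$2" '
    { sub(/\r$/, "") }              # tolerate Windows line endings
    NR == 1 {                       # map column names to positions
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    $col["WellID"] == well {
      printf "%s_%s.h5\n", $col["PoolID"], $col["SampleID"]
      printf "%s_well_metrics.json\n", well
    }
  ' "$1"
}
```

With the example sheet above, `nonhashed_output_names SampleSheet.csv X051-P1C1W1` prints X051-P1_PB00042.h5 followed by X051-P1C1W1_well_metrics.json.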

Cell Hashed `SampleSheet.csv`

For Cell Hashed runs, SampleSheet.csv conveys the relationship between SampleID and HashTag.

It should have 4 columns: SampleID, BatchID, HashTag, and PoolID

SampleID,BatchID,HashTag,PoolID
PB00042,X051,HT1,X051-P1
PB00043,X051,HT2,X051-P1
PB00044,X051,HT3,X051-P1

Cell Hashed output files

For Cell Hashed outputs, two files will be generated and will be named based on WellID:

.h5 file: [WellID].h5, e.g. X000-P1C1W3.h5
JSON file: [WellID]_well_metrics.json, e.g. X000-P1C1W3_well_metrics.json

Unlike Non-Hashed datasets, these .h5 files are still a mixture of all SampleIDs, so filenames will reflect only the WellID at this stage.

Tests

Test runs can be performed using datasets provided with the H5weaver package by omitting all parameters other than -o:

Rscript --vanilla \
  tenx-rnaseq-pipeline/run_add_tenx_rna_metadata.R \
  -o test_metadata_report.html

Modifications for cellranger-arc

Some QC statistics and parameters differ when using cellranger-arc for 10x Multiome or TEA-seq experiments. To account for these differences, there are modified versions of a few key steps.

NOTE: The formatting script described in the next section is stored in the tenx-atacseq-pipeline repository. Formatting only needs to be performed once for a given cellranger-arc output set.

Format Arc Outputs

Before processing ATAC or RNA data from cellranger-arc output, run 00_run_arc_formatting.R to add a common UUID and restructure the arc metadata files for downstream processing. This ensures that the RNA and ATAC arms of the pipeline do not assign different UUIDs.

Parameters

There are two parameters for this script:

-t: path to cellranger-arc outs/
-o: path for the HTML output generated by the script

An example run is:

Rscript --vanilla tenx-atacseq-pipeline/00_run_arc_formatting.R \
  -t outs/ \
  -o arc_formatting_report.html

Outputs

This script outputs .csv files to the outs/ directory for downstream processing:

arc_singlecell.csv: Arc version of the standard 10x ATAC singlecell.csv output
atac_summary.csv: Arc version of the standard 10x ATAC summary.csv output
rna_summary.csv: Arc version of the standard 10x RNA metrics_summary.csv
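
After the formatting step, it is easy to confirm that all three files are present before moving on. The following is a minimal sketch; check_arc_formatting_outputs is a hypothetical helper, not part of either repository, and it only tests for the filenames listed above.

```shell
# Hypothetical helper (not part of this repository): reports any of the
# three formatting outputs missing from the given outs/ directory and
# returns non-zero if at least one is absent.
check_arc_formatting_outputs() {
  outs_dir="${1%/}"
  missing=0
  for f in arc_singlecell.csv atac_summary.csv rna_summary.csv; do
    if [ ! -f "$outs_dir/$f" ]; then
      echo "missing: $outs_dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_arc_formatting_outputs outs/ && echo "formatting outputs present"`.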

Metadata annotation for cellranger-arc results

As for standard scRNA-seq cellranger results, this script will add additional cell metadata to the RNA .h5 files.

In addition, it will separate the Gene Expression and Peaks matrices into separate matrix objects in the .h5 file to enable downstream processing of hashed runs.

Parameters

The main difference in parameters from the standard script is the use of the cellranger outs/ directory rather than individually specified output files. This version of the script detects whether cellranger or cellranger-arc was used based on the presence or absence of the outputs of the formatting script described above.

There are 5 parameters for this script:

-t or --in_tenx: The path to cellranger-arc outs/
-k or --in_key: The path to SampleSheet.csv
-w or --in_well: A well name to use for metadata in the format [XB][0-9]{3}-P[0-9]C[0-9]W[0-9]
-d or --out_dir: A directory to use to output the modified .h5 and JSON metrics
-o or --out_html: A filename to use to output the HTML summary report file

An example run for a cellranger-arc count result is:

Rscript --vanilla \
  tenx-rnaseq-pipeline/run_crossplatform_rna_metadata.R \
  -t outs/ \
  -k SampleSheet.csv \
  -w X000-P1C1W3 \
  -d rna_preprocessed/ \
  -o rna_preprocessed/X000-P1C1W3_metadata_report.html

`SampleSheet.csv` formats and output filenames

This script is designed to work with both Non-Hashed and Cell Hashed input data.

Sample sheet formats follow the same conventions as for scRNA-seq, above.

Output files

Output formats follow the same conventions as for scRNA-seq, above.

CITE-seq/TEA-seq ADT Well QC

We can perform QC analysis per well for ADT data in preparation for review and integration with scRNA-seq datasets. This step requires the Tag_counts.csv file generated by BarCounter, as well as a cell barcode whitelist as inputs, and generates tables of ADT counts and QC metrics, as well as a report for review.

BarCounter Requirements

BarCounter is used to generate the Tag_counts.csv upstream of QC analysis. The barcode whitelist supplied as input to BarCounter should be the unfiltered whitelist generated by cellranger. This retains counts for both called cells and non-cell barcodes, so that background count levels can be determined.

This file should be located within the cellranger outs/ directory at:
outs/raw_feature_bc_matrix/barcodes.tsv

Note that the whitelist supplied to the QC Analysis in the next step is a different file.
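
Since the filtered list is a subset of the unfiltered one, a quick sanity check can confirm that the two files come from the same cellranger run. This is a minimal sketch assuming one barcode per line in each file; count_missing_barcodes is a hypothetical helper and not part of this repository.

```shell
# Hypothetical helper (not part of this repository): prints the number of
# barcodes in the filtered list (second argument) that do not appear in the
# unfiltered list (first argument). A result of 0 is expected.
count_missing_barcodes() {
  awk 'NR == FNR { raw[$1] = 1; next } !($1 in raw) { n++ } END { print n + 0 }' "$1" "$2"
}
```

For example, `count_missing_barcodes outs/raw_feature_bc_matrix/barcodes.tsv outs/filtered_feature_bc_matrix/barcodes.tsv`. Some cellranger versions write these files gzip-compressed (barcodes.tsv.gz); in that case, decompress them first.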

QC Analysis Parameters

There are 5 parameters for this script:

-i or --in_counts: The path to a Tag_counts.csv result from BarCounter
-b or --in_bcs: The path to the barcodes.tsv file for filtered cells from cellranger outs/
-w or --well_id: A well name to use for metadata in the format [XB][0-9]{3}-P[0-9]C[0-9]W[0-9]
-d or --out_dir: A directory to use to output the ADT tables
-o or --out_html: A filename to use to output the HTML summary report file

An example run for a cellranger count result is:

Rscript --vanilla \
  tenx-rnaseq-pipeline/run_adt_well_qc.R \
  -i /data/tarpits/BarCounter/X070-EP1C1W1_Tag_counts.csv \
  -b /data/tarpits/outs/filtered_feature_bc_matrix/barcodes.tsv \
  -w X070-EP1C1W1 \
  -d /data/tarpits/adt_qc/ \
  -o /data/tarpits/adt_qc/X070-EP1C1W1_adt_qc_report.html

BarCounter `Tag_counts.csv` format

Tag_counts.csv inputs should follow the format resulting from running BarCounter, with a header row followed by results for one cell barcode per subsequent row. cell_barcode and total should be the first two columns, followed by counts for each ADT:

cell_barcode,total,CD3E,CD4,CD8,CD56
ACACTGAGTCATCCCT,1,1,0,0,0
CACGGGTAGATGATTG,17,1,2,1,13
GCTTGGGAGGTCGTGA,9,1,2,1,5
TTGGGATAGTTACGAA,9,2,3,2,2
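
A malformed row, such as one with an extra field, is easy to miss by eye, so a structural check before QC can help. The sketch below verifies that every row has as many fields as the header. It also checks that total equals the sum of the ADT columns, which is an assumption drawn from example rows like these rather than a documented BarCounter guarantee. validate_tag_counts is a hypothetical helper and not part of this repository.

```shell
# Hypothetical helper (not part of this repository): prints one message per
# problem row in a Tag_counts.csv file and returns non-zero if any are found.
# The total-equals-sum check is an assumption based on the example rows.
validate_tag_counts() {
  awk -F',' '
    BEGIN { bad = 0 }
    { sub(/\r$/, "") }
    NR == 1 { n = NF; next }
    NF != n {
      printf "line %d: expected %d fields, found %d\n", NR, n, NF
      bad = 1
      next
    }
    {
      sum = 0
      for (i = 3; i <= NF; i++) sum += $i
      if (sum != $2) {
        printf "line %d: total %s != sum %d\n", NR, $2, sum
        bad = 1
      }
    }
    END { exit bad }
  ' "$1"
}
```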

10x Cellranger `barcodes.tsv` format and sources

To filter cell barcodes and compare signal to background barcodes, we need to provide the filtered barcodes.tsv file generated by cellranger.

It should be located within the cellranger outs/ directory at:
outs/filtered_feature_bc_matrix/barcodes.tsv

ADT QC output files

Two files will be generated in the output directory for use in downstream processing: an ADT count matrix and an ADT metadata file.

counts: [WellID]_adt_positive_tag_counts.csv
metadata: [WellID]_adt_metadata.csv

ADT matrix injection

After both scRNA-seq and ADT data have been through well QC, these datasets can be merged into a single .h5 file for downstream processing.

There are 6 parameters for this script:

-i or --in_h5: The path to the .h5 generated by run_crossplatform_rna_metadata.R
-c or --in_adt_counts: The path to the [WellID]_adt_positive_tag_counts.csv file generated by run_adt_well_qc.R
-m or --in_adt_meta: The path to the [WellID]_adt_metadata.csv file generated by run_adt_well_qc.R
-w or --well_id: A well name to use for metadata in the format [XB][0-9]{3}-P[0-9]C[0-9]W[0-9]
-d or --out_dir: A directory to use to output the modified .h5 and JSON metrics
-o or --out_html: A filename to use to output the HTML summary report file

An example run for a cellranger count result is:

Rscript --vanilla \
  tenx-rnaseq-pipeline/run_adt_injection.R \
  -i /data/tarpits/rna_preprocessed/X070-P1C1W1.h5 \
  -c /data/tarpits/adt_qc/X070-EP1C1W1_adt_positive_tag_counts.csv \
  -m /data/tarpits/adt_qc/X070-EP1C1W1_adt_metadata.csv \
  -w X070-P1C1W1 \
  -d /data/tarpits/rna_adt_injection/ \
  -o /data/tarpits/adt_qc/X070-P1C1W1_adt_injection_report.html

ADT matrix injection output files

An updated .h5 file will be generated based on the filename of the input .h5, with the addition of _adt before .h5:

.h5 file: [PoolID]_[SampleID]_adt.h5, e.g. X051-P1_PB00042_adt.h5
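
The renaming rule can be reproduced in the shell to predict the output filename for a given input. This is a minimal sketch of the rule as stated above; adt_output_name is a hypothetical helper and not part of this repository.

```shell
# Hypothetical helper (not part of this repository): derives the output
# filename by inserting _adt before the .h5 extension of the input basename.
adt_output_name() {
  base=$(basename "$1" .h5)
  printf '%s_adt.h5\n' "$base"
}
```

For example, `adt_output_name /data/tarpits/rna_preprocessed/X070-P1C1W1.h5` prints X070-P1C1W1_adt.h5.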

Legal Information

License

The license for this package is available in the file LICENSE.txt in this repository on GitHub.

Level of Support

We are not currently supporting this code. We are releasing it to the community AS IS and cannot provide any guarantees of support. The community is welcome to submit issues, but you should not expect an active response.

Contribution Agreement

If you contribute code to this repository through pull requests or other mechanisms, you are subject to the Allen Institute Contribution Agreement, which is available in the file CONTRIBUTING.md in this repository.
