- Here is an overview of the major GDC harmonization workflows.
- The GDC workflow repositories have been tested on GDC data and in the particular environment in which the GDC runs.
- Current GDC production workflows are running in the GDC Pipeline Automation System (GPAS).
- Most GDC workflows are developed using Common Workflow Language (CWL) with Dockerized tools.
- Please check GDC Documentation for more details.
- We have created an external CWL entrypoint in some workflows. For the others, you are expected to modify the workflow for use in your system.
- Some of the reference data required to run the workflows are hosted in GDC reference files. We cannot share some other reference files due to licensing issues.
- In particular, we cannot share target or bait BED files or interval lists for any specific Target Capture Kit. You are encouraged to contact the vendors or project owners to get these files.
- We are not able to share Docker images with external users due to licensing issues. You are welcome to build your own images using the Dockerfiles provided.
- For any questions related to GDC data, please contact the GDC Help Desk at support@nci-gdc.datacommons.io.
- BWA-based alignment workflow for Whole Exome Sequencing (WXS), Whole Genome Sequencing (WGS), Targeted Sequencing, and some other DNA-Seq experimental strategies. The workflow takes either BAM or FASTQ files as input, performs read mapping and the optional steps of Base Quality Score Recalibration (BQSR), Indel Realignment, and MarkDuplicates, and outputs a sorted BAM file, a BAM index file, and various QC metrics.
- Main CWL: https://github.com/NCI-GDC/gdc-dnaseq-cwl
- STAR-based RNA-Seq alignment workflow that takes either BAM or FASTQ files as input and generates 3 BAMs (Genome Aligned BAM, Transcriptome Aligned BAM, Chimeric BAM), STAR Counts, and Splice Junction Quantifications. Of the 3 BAMs generated, the Transcriptome Aligned BAM is sorted by read name instead of by coordinate, so it is not accompanied by a BAM index file. In addition, the STAR Counts file contains quantification in 3 different ways: a strandless mode and two different stranded modes.
- Main CWL: https://github.com/NCI-GDC/gdc-rnaseq-cwl
- Utility scripts: https://github.com/NCI-GDC/gdc-rnaseq-tool
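The three quantification modes in the STAR Counts file can be read side by side. The sketch below is illustrative only: it assumes the plain layout that STAR itself writes for gene counts (gene ID, unstranded count, and two stranded counts, preceded by four `N_*` summary rows, with no header line). The file produced by the GDC workflow may carry extra columns or a header, so check an actual file first. The function names and the example data are hypothetical.

```python
import csv
import io

# Summary rows that STAR writes before the per-gene rows.
SUMMARY_ROWS = {"N_unmapped", "N_multimapping", "N_noFeature", "N_ambiguous"}


def read_star_counts(handle):
    """Return {gene_id: (unstranded, stranded_first, stranded_second)}.

    Assumes tab-separated rows with the gene ID in column 1 and the three
    integer counts in columns 2-4 (an assumption, not the GDC file spec).
    """
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip short rows, comment lines, and the N_* summary rows.
        if len(row) < 4 or row[0].startswith("#") or row[0] in SUMMARY_ROWS:
            continue
        counts[row[0]] = (int(row[1]), int(row[2]), int(row[3]))
    return counts


def column_totals(counts):
    """Sum each of the three count columns across all genes."""
    totals = [0, 0, 0]
    for triple in counts.values():
        for i, value in enumerate(triple):
            totals[i] += value
    return tuple(totals)


# Made-up example data, not real GDC output.
EXAMPLE = (
    "N_unmapped\t10\t10\t10\n"
    "N_multimapping\t5\t5\t5\n"
    "N_noFeature\t3\t40\t4\n"
    "N_ambiguous\t1\t0\t1\n"
    "GENE_A\t100\t2\t98\n"
    "GENE_B\t50\t1\t49\n"
)

star_counts = read_star_counts(io.StringIO(EXAMPLE))
star_totals = column_totals(star_counts)
print(star_totals)
```

Comparing the column totals is one common way to guess the library strandedness: when one stranded column holds nearly all of the reads that the unstranded column holds, the library is probably stranded in that orientation.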
- miRNA alignment and quantification workflow adapted from https://github.com/bcgsc/mirna. The workflow takes an adapter-trimmed BAM as input, performs read mapping, and outputs a sorted BAM file, a BAM index file, miRNA expression, and miRNA isoform expression.
- Main CWL: https://github.com/NCI-GDC/gdc-mirnaseq-cwl
- Docker file: https://github.com/NCI-GDC/mirna-profiler-docker
- The RNA-Seq HTSeq quantification workflow takes a GDC-aligned RNA-Seq BAM file as input and outputs HTSeq gene-level quantifications with various normalization methods. The workflow assumes strandless data to maximize compatibility across all GDC projects.
- Main CWL: https://github.com/NCI-GDC/htseq-cwl
- Utility scripts: https://github.com/NCI-GDC/htseq-tool
- Four tumor-normal paired somatic mutation calling workflows for WXS (or Targeted Sequencing) data. Each workflow takes a tumor-tissue BAM and a normal-tissue BAM from the same case and generates raw variant-call VCFs and VCF indexes.
- The callers can be run as a single combined workflow or individually: GATK3 MuTect2, MuSE, VarScan2, SomaticSniper
- Combined workflow: https://github.com/NCI-GDC/gdc-somatic-variant-calling-workflow
- MuSE module: https://github.com/NCI-GDC/muse-cwl
- SomaticSniper module: https://github.com/NCI-GDC/somaticsniper-cwl
- VarScan2 module: https://github.com/NCI-GDC/varscan-cwl
- MuTect2 module: https://github.com/NCI-GDC/mutect2-cwl
- Samtools module: https://github.com/NCI-GDC/samtools-mpileup-cwl
- MuSE docker file: https://github.com/NCI-GDC/muse-tool
- SomaticSniper docker file: https://github.com/NCI-GDC/somaticsniper-tool
- VarScan2 docker file: https://github.com/NCI-GDC/varscan-tool
- MuTect2 docker file: https://github.com/NCI-GDC/mutect2-tool
- Samtools docker file: https://github.com/NCI-GDC/samtools-mpileup-tool
- VCF filtering workflow. This is the GDC internal VCF filtering and formatting workflow that runs between the variant calling and variant annotation steps. Direct outputs of this workflow are not available in the GDC portal; instead, they are used as inputs for the VEP Variant Annotation workflow.
- Main CWL: https://github.com/NCI-GDC/variant-filtration-cwl
- Utility scripts: https://github.com/NCI-GDC/variant-filtration-tool
- Tumor-normal paired somatic mutation calling workflow for WGS data, adapted from https://github.com/cancerit/dockstore-cgpwgs. The workflow takes a tumor-tissue BAM and a normal-tissue BAM from the same case, and generates simple somatic mutations with CaVEMan and Pindel in VCF format, structural variation with BRASS in VCF and BedPE formats, and copy number variation with ascatNGS in segmentation and gene-level copy number TSV formats.
- Main CWL: https://github.com/NCI-GDC/gdc-sanger-somatic-cwl
- Utility scripts: https://github.com/NCI-GDC/gdc-sanger-somatic-tool
- Tumor-only somatic mutation calling workflow for DNA-Seq. The workflow takes a tumor-tissue BAM and generates simple somatic mutations with GATK4 MuTect2 in VCF format.
- Main CWL: https://github.com/NCI-GDC/gatk4_mutect2_cwl
- Utility scripts: https://github.com/NCI-GDC/gatk4-mutect2-tool
- Tumor-only somatic mutation filtering workflow for DNA-Seq. The workflow takes a tumor-only VCF and tags variants with their predicted somatic status using PureCN (https://github.com/lima1/PureCN). The resulting VCF from the tumor-only variant calling workflow can be further annotated using the VEP Variant Annotation workflow and converted into MAF files using the MAF Generation workflows.
- Main CWL: https://github.com/NCI-GDC/gdc_tosvc_workflow
- Utility scripts: https://github.com/NCI-GDC/gdc-tosvc-tools
- VEP-based variant annotation workflow that provides functional annotations for each variant in the VCFs.
- Main CWL: https://github.com/NCI-GDC/vep-cwl
- Utility scripts: https://github.com/NCI-GDC/vep-tool
- MAF generation workflows
- MAF schema: https://github.com/NCI-GDC/maf-lib/tree/master/src/maflib/resources
- MAF library: https://github.com/NCI-GDC/maf-lib
- Main CWL: https://github.com/NCI-GDC/aliquot-maf-cwl
- Utility scripts: https://github.com/NCI-GDC/aliquot-maf-tools
- The workflow applies Circular Binary Segmentation to existing BirdSeed probe-level copy numbers, and generates copy number segmentation files and gene-level copy number TSVs.
- Utility scripts: https://github.com/NCI-GDC/dnacopy-tool
- target_capture_kit is an enumerated property on the ReadGroup node, and a value other than Unknown or Not Applicable is required for read groups that are associated with the WXS and Targeted Sequencing strategies.
- Unfortunately, we cannot share the Target Capture Kit BED files publicly because of policy restrictions from some kit vendors. You can find the size (in bp) of each kit in this file (https://github.com/NCI-GDC/gdc-workflow-overview/blob/master/gdc_target_capture_kit_size.tsv) for Tumor Mutation Burden (TMB) analysis. Please note that some of these files arrived at the GDC in the hg19 reference build and some in the GRCh38 (hg38) reference build, so the sizes are listed in separate columns in the file.
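As a rough illustration of how a kit size feeds into a TMB estimate, the sketch below divides a mutation count by the capture size expressed in megabases. It is a minimal example under stated assumptions, not the GDC's method: the function names, the kit name, and the size are hypothetical, the column layout of the TSV is not assumed (sizes are passed in as a plain dictionary), and which mutations to count is left to the analyst.

```python
# Values that carry no kit information for WXS / Targeted Sequencing read groups.
UNINFORMATIVE_KITS = {"Unknown", "Not Applicable"}


def kit_size_for(kit_name, sizes_by_kit):
    """Look up a kit size in bp from a {kit_name: size_bp} mapping.

    The mapping is supplied by the caller, e.g. built from the kit size TSV
    using the column that matches the reference build of the data.
    """
    if kit_name in UNINFORMATIVE_KITS:
        raise ValueError(f"no usable capture kit recorded: {kit_name!r}")
    if kit_name not in sizes_by_kit:
        raise KeyError(f"kit size not found for {kit_name!r}")
    return sizes_by_kit[kit_name]


def tumor_mutation_burden(mutation_count, kit_size_bp):
    """Return mutations per megabase of captured sequence."""
    if kit_size_bp <= 0:
        raise ValueError("kit size must be a positive number of base pairs")
    return mutation_count / (kit_size_bp / 1_000_000)


# Hypothetical kit name and size, for illustration only.
example_sizes = {"Example Exome Kit v1": 40_000_000}
size_bp = kit_size_for("Example Exome Kit v1", example_sizes)
tmb = tumor_mutation_burden(120, size_bp)
print(f"{tmb:.2f} mutations/Mb")
```

Use the kit size from the column that matches the reference build of the variant calls, since the hg19 and GRCh38 sizes are listed separately.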