ckennedy-nmdp edited this page Sep 11, 2014 · 22 revisions

Purpose

A simple workflow to generate research-grade consensus sequences from high-throughput (next-generation) DNA sequencing of individual or multiplexed, clonally-amplified genes representing the major histocompatibility complex (MHC), including HLA-A, -B, -C, -DQB1, -DPB1, and -DRB1 as well as KIR.

NON-CLINICAL RESEARCH USE ONLY

Scope

The pipeline is designed to process targeted HLA NGS data, with the intent of further open source development to support additional genomic regions (such as ABO/Rh and other targeted gene systems, whole exomes, or whole genomes), paired or unpaired data from multiple sequencing platforms (Sanger, Illumina, Pacific Biosciences, Oxford Nanopore, Ion Torrent, or combinations thereof), and diverse applications (research, donor recruitment typing, clinical diagnostics, etc.). Where specific parameters, filtering criteria, or software or hardware configurations appear to overfit the pipeline to targeted HLA NGS, the NMDP will document and disclose these in release notes.

The pipeline represents what we consider to be a minimally viable workflow for generating consensus DNA sequence and interpretable alleles. Further details are provided below or you can get started with a tutorial.

Abstract

Due to the relatively low cost of NGS, use of this technology shifts efforts from data acquisition toward processing, analysis, and interpretation. Innumerable methods are available for these purposes, but several design patterns are common. The pipeline is an initial attempt to integrate necessary components into an automatable NGS processing workflow, which can be applied to the MHC, a region that represents significant human genetic variation with broad research and clinical application.

The workflow includes all necessary components for its execution: IT infrastructure and hardware (cloud); development and test environments; application programming interfaces, tools, and software; and reference databases and documentation for producing HML from raw sequencing reads.

Inputs

Raw reads

Short sequencing reads in FASTQ format (Illumina platform). Note: Other platforms, notably Pacific Biosciences-generated long reads, should be acceptable input; however, since those data are pre-assembled on-instrument, the Assemble step (below) is likely unnecessary.

(Optional) Sample-specific molecular barcodes

Sequences required for sample-level demultiplexing.

(Optional) Locus-level molecular barcodes

Genomic coordinates and lengths or sequences of targeted sequencing probes required for locus-level demultiplexing.

GRCh38

Full human genomic reference sequence, properly indexed.

IMGT-HLA

Reference sequences typed at various resolutions for HLA and KIR genes.

Outputs

Pipeline outputs are organized into two directories, which are required in addition to the raw/ input directory for the pipeline to execute.

Intermediate

Includes processed reads (e.g., debarcoded reads, if applicable) and consensus assembled sequences in FASTA format.

Final

Includes compressed files containing aligned consensus sequences and sub-allele-level genotype calls in BAM and VCF formats, respectively.

Environment

Methods

Preprocess

Raw DNA sequencing reads are split by sample into paired files (if applicable). Depending on the target capture method, reads from disjoint sources may be separately processed in parallel.
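The sample-level split can be sketched as a simple barcode match. This is an illustrative sketch only, assuming inline 5' barcodes with exact matching; real demultiplexers read barcodes from separate index reads and tolerate mismatches.

```python
# Minimal sketch of sample-level demultiplexing by exact barcode match.
# Assumes the barcode is an inline prefix of each read (hypothetical layout).
def demultiplex(reads, barcodes):
    """Group reads by the sample whose barcode prefixes the read."""
    bins = {sample: [] for sample in barcodes}
    bins["undetermined"] = []
    for read in reads:
        for sample, bc in barcodes.items():
            if read.startswith(bc):
                bins[sample].append(read[len(bc):])  # trim the barcode
                break
        else:
            bins["undetermined"].append(read)
    return bins
```

Reads matching no known barcode are binned as undetermined rather than discarded, so off-target or contaminating material remains inspectable downstream.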

Assemble

With extra sequence context provided by many overlapping reads, alignment artifacts are reduced and sensitivity is improved for detecting certain genetic variation, especially short insertions and deletions (and complex combinations thereof). This is an especially important feature for NGS pipelines used to detect genetic variation in highly polymorphic regions of the genome such as the MHC. Several assembly methods are available that broadly fit into two categories: prefix trees and de Bruijn graphs. For the current implementation of the pipeline, NMDP is using prefix-tree-based short sequence read assembly by progressive k-mer search and 3' read extension (SSAKE).
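The core idea of 3' read extension can be illustrated with a toy greedy extender. This is a simplification in the spirit of SSAKE, not its implementation: the real assembler indexes reads in a prefix tree, searches progressively shorter k-mers, and handles reverse complements, coverage, and mismatches.

```python
# Toy sketch of greedy 3' extension assembly: repeatedly find an unused read
# whose 5' end exactly matches the last k bases of the growing contig, and
# append its non-overlapping tail.
def extend_3prime(seed, reads, k):
    contig = seed
    unused = set(reads)
    extended = True
    while extended:
        extended = False
        suffix = contig[-k:]
        for read in list(unused):
            if read.startswith(suffix) and len(read) > k:
                contig += read[k:]  # append only the new 3' sequence
                unused.remove(read)
                extended = True
                break
    return contig
```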

Align

Alignment provides aligned contigs' global locations and allele-level genetic variation with respect to a common human genome reference (3.1.4). Several alignment methods are available that vary with respect to speed and accuracy. For the current implementation of the pipeline, NMDP uses Burrows-Wheeler alignment (BWA) for long reads (assembled contigs). Alignment-only processing is required to retain sequence information for non-overlapping (singleton) reads, which cannot be assembled. For this purpose the NMDP uses BWA optimized for detecting variation in low-complexity regions of the genome. Compressed files containing aligned reads or contigs are sorted and indexed.

Filter

Filtering is an essential part of any data processing workflow; however, acceptance criteria and specific quality thresholds must be determined on an application-dependent, case-by-case basis. There is always a tradeoff between some measures of accuracy (sensitivity) and others (specificity and ambiguity) as well as processing speed (turnaround time). Under normal operating conditions, filtering criteria must be adjusted to optimize the pipeline and achieve satisfactory validation results for its intended purpose at an acceptable cost.

Data processing errors may result from algorithm-specific artifacts or other sample- or sub-sample-specific factors, such as aneuploidy, off-target capture, or contamination. With high sequence coverage and digital output (individual reads), errors such as these may be very infrequent, but their effects, while possibly very small or negligible, can usually be measured. Certain strict filtering criteria are standard operating procedure because they increase accuracy without measurable impact to other performance metrics. These include, but are not limited to, the following:

Merge overlapping contigs

Figure 1 illustrates two overlapping contigs with identical sequences that should be merged into a single contig during the normal assembly step. NMDP has developed filtering software that identifies strictly overlapping contigs and merges them.
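A minimal sketch of the strict-overlap test, assuming the filter works on contig sequences directly; the NMDP software operates on alignment coordinates, and the `min_overlap` threshold here is a placeholder, not a validated default.

```python
# Hedged sketch of the merge-overlapping-contigs filter: if the 3' end of
# contig `a` is identical to the 5' end of contig `b` over at least
# `min_overlap` bases, emit the merged contig; otherwise report no merge.
def merge_if_overlapping(a, b, min_overlap):
    best = 0
    limit = min(len(a), len(b))
    for n in range(min_overlap, limit + 1):
        if a[-n:] == b[:n]:
            best = n  # keep the longest identical overlap
    return a + b[best:] if best else None
```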

Apply consensus quality thresholds

Sometimes contigs only partially align to the human reference, with significant (usually flanking) sequence that is soft-clipped. Additionally, contigs may be assembled from reads outside of the genomic regions of interest (ROI) due to off-target capture effects or contamination. Filtering may include, but is not limited to, the following:

Locus

Consensus sequence length

Consensus locus length

Breadth-of-coverage

Depth-of-coverage
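The breadth- and depth-of-coverage criteria above can be sketched as a simple threshold check over per-base depths. The threshold values are placeholders for illustration, not the pipeline's validated defaults, which must be set per application as discussed under Filter.

```python
# Illustrative coverage filter over a locus of interest, given a list of
# per-base read depths. Thresholds are hypothetical examples.
def passes_coverage(depths, min_breadth=0.95, min_mean_depth=30):
    covered = sum(1 for d in depths if d > 0)
    breadth = covered / len(depths)         # fraction of bases with any coverage
    mean_depth = sum(depths) / len(depths)  # average depth across the locus
    return breadth >= min_breadth and mean_depth >= min_mean_depth
```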

(Genotype)

In its current implementation, the NMDP pipeline simply reformats aligned consensus sequences as VCF, using mpileup to aggregate all consensus sequences into a single file. At present, zygosity and phase-level resolution are lost to facilitate gross, high-resolution benchmarking against independently verified alleles (3.7). With slight modification, individual aligned contigs can be reformatted (genotyped) separately, thus preserving phase-level resolution.

Interpret

Several methods for interpreting consensus DNA sequence are available.

Cross-reference to IMGT-HLA

VCF representation

Allele equivalence

Calculation of equivalent insertion and deletion regions – Short reads are often aligned with gaps in either the read or the reference sequence, indicating the possible occurrence of short deletions or insertions, respectively. Gaps may not be positioned uniquely with respect to a given reference or consensus assembly sequence. Rather, a range of positions, known as the equivalent insertion and deletion region (EIR), may be required. Improper representation is common in public databases, leading to misinterpretation and errors of considerable consequence. Proper implementation is critical, especially where data are patient-derived and genetic results are applied clinically.
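For a deletion, the EIR can be found by shifting the gap left and right for as long as the derived sequence is unchanged. This is an illustrative sketch using 0-based coordinates on a plain string; production code works on VCF/BAM records and handles insertions analogously.

```python
# Sketch of computing the equivalent region for a deletion: the range of
# start positions at which removing `length` bases from `ref` yields the
# same derived sequence.
def deletion_eir(ref, pos, length):
    derived = ref[:pos] + ref[pos + length:]
    left = pos
    while left > 0 and ref[:left - 1] + ref[left - 1 + length:] == derived:
        left -= 1   # gap can shift one base to the left
    right = pos
    while (right + length < len(ref)
           and ref[:right + 1] + ref[right + 1 + length:] == derived):
        right += 1  # gap can shift one base to the right
    return left, right
```

In a repeat such as `CACACA`, a 2-base deletion reported at position 2 is equivalent to the same deletion at positions 0 through 4, which is exactly the ambiguity the EIR captures.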

Assign nomenclature

Filtered, FASTA-formatted consensus sequences can be uploaded to an interpretation service provided by the NMDP:

```
curl -T <file.fasta> http://emmes-dev:8080/hla/api/rs/interp ; echo
```

This will produce combinations of possible allele assignments as a genotype list string.

(Transcribe) Translate and annotate de novo

cDNA sequences may be translated in silico into corresponding protein sequences. Further annotation (currently out of scope) may identify site-specific sequence conservation, nucleic- or amino-acid-level lesions likely to affect protein structure or (patho)physiological function. Many open source tools are available for this purpose.
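A minimal in silico translation sketch follows; the codon table below is deliberately partial (a full standard genetic code table would be used in practice), and unrecognized codons are marked `X`.

```python
# Toy in silico translation of a cDNA sequence using a partial standard
# genetic code table (illustration only; real tools use the full table).
CODON_TABLE = {
    "ATG": "M", "TGG": "W", "TTT": "F", "TTC": "F",
    "GGA": "G", "GGC": "G", "TAA": "*", "TAG": "*", "TGA": "*",
}

def translate(cdna):
    protein = []
    for i in range(0, len(cdna) - 2, 3):
        aa = CODON_TABLE.get(cdna[i:i + 3], "X")  # X = codon not in table
        if aa == "*":
            break  # stop codon ends translation
        protein.append(aa)
    return "".join(protein)
```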

Further work

We're very interested in maintaining, contributing to, and managing an open source solution for NGS of the MHC. Current and future efforts are aimed at further optimization and validation of the workflow.

Validate

- With other gene families (including KIR and ABO/Rh)
- With data from other sequencing platforms
- With multiple target capture methods

Optimize

- Assembly methods (the most critical step)
- Interpretation and nomenclature assignment, to allow genetic matching that can scale to higher-resolution typing of the MHC as well as other gene families important for immunohistocompatibility (see below)
- Predominant data model: move away from format-specific I/O between algorithmic components and batch processing, toward unified, distributed storage-compute platforms such as ADAM

DaSH
