ckennedy-nmdp edited this page Sep 15, 2014 · 22 revisions

Due to its comparably low cost, high-throughput next-generation DNA sequencing (NGS) is increasingly used for routine genetic testing (typing). Massively parallel sequencing requires fewer resources for data acquisition but more for processing, analysis, and interpretation. To meet the demand of more and better data, there is need for integrated software-hardware platforms that scale efficiently. Innumerable pipelines and platforms are available, but several design patterns are common. The pipeline described here represents what we consider a minimally viable yet inclusive workflow for generating consensus DNA sequence and interpreting alleles for the human major histocompatibility complex (MHC), a region that represents significant human genetic variation with broad research and clinical application.

Our objective is to provide all necessary components to execute a workflow and generate HML messages from raw sequencing reads, including IT infrastructure and hardware (cloud); development and test environment; application programming interfaces, algorithms, tools and software; reference databases and documentation. Further details are provided in the methods section below or you can get started with a tutorial.

Purpose

A simple workflow to generate research-grade consensus sequences from high-throughput (next-generation) DNA sequencing of individual or multiplexed, clonally-amplified genes representing the MHC, including human leukocyte antigen (HLA)-A, -B, -C, -DQB1, -DPB1, and -DRB1 as well as killer-cell immunoglobulin-like receptor (KIR).

All documentation and code referenced herein are intended for NON-CLINICAL RESEARCH USE ONLY.

Scope

The pipeline is designed to enable processing of targeted HLA NGS, with the intent of further open-source development to allow processing of additional genomic regions such as ABO/Rh, other targeted gene systems, and whole exomes or whole genomes, with paired or unpaired data from multiple sequencing platforms (Sanger, Illumina, Pacific Biosciences, Oxford Nanopore, Ion Torrent, or combinations thereof) and diverse applications (research, donor recruitment typing, clinical diagnostics, etc.). Where specific parameters, filtering criteria, or software or hardware configurations appear to overfit the pipeline to targeted HLA NGS, the NMDP will document and disclose them as release notes.

Methods

Inputs

Raw reads

Short sequencing reads in FASTQ format (Illumina platform). Note: other platforms, notably Pacific Biosciences-generated long reads, should be acceptable input; however, because those data are pre-assembled on-instrument, the Assemble step (below) is likely unnecessary for them.

(Optional) Sample-specific molecular barcodes

Sequences required for sample-level demultiplexing.
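As a sketch, sample-level demultiplexing amounts to binning reads by a leading barcode match. The function below is an illustration, not the pipeline's implementation: it assumes fixed-length barcodes at the 5' end of each read and exact matching (real protocols often tolerate mismatches).

```python
def demultiplex(reads, sample_barcodes):
    """Assign each read to a sample by its leading molecular barcode.

    reads: iterable of (read_id, sequence) tuples
    sample_barcodes: dict mapping barcode sequence -> sample name
    Returns a dict mapping sample name -> list of (read_id, trimmed_sequence);
    reads with no matching barcode are collected under the key None.
    """
    bins = {name: [] for name in sample_barcodes.values()}
    bins[None] = []
    bc_len = len(next(iter(sample_barcodes)))  # assume uniform barcode length
    for read_id, seq in reads:
        sample = sample_barcodes.get(seq[:bc_len])
        if sample is None:
            bins[None].append((read_id, seq))
        else:
            bins[sample].append((read_id, seq[bc_len:]))  # trim the barcode
    return bins
```

In practice demultiplexing is usually done by platform software or dedicated tools; this sketch only shows the logical operation.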

(Optional) Locus-level molecular barcodes

Genomic coordinates and lengths or sequences of targeted sequencing probes required for locus-level demultiplexing.

GRCh38

Full human genomic reference sequence, properly indexed.

IMGT-HLA

Reference sequences typed at various resolutions for HLA and KIR genes.

Outputs

Pipeline outputs are organized into two directories; both are required, in addition to the raw/ input directory, for the pipeline to execute.

Intermediate

Includes processed reads, such as debarcoded reads (if applicable), and consensus-assembled sequences in FASTA format.

Final

Includes compressed files containing aligned consensus sequences and sub-allele-level genotype calls in BAM and VCF formats, respectively.

Environment

Pipeline software is only useful if it can be executed. The environment required to do so, including hardware, operating system, and software dependencies, is described here.

Preprocess

Raw DNA sequencing reads are split by sample into paired files (if applicable). Depending on the target capture method, reads from disjoint sources may be processed separately in parallel.

Assemble

With the extra sequence context provided by many overlapping reads, alignment artifacts are reduced and sensitivity is improved for detecting certain genetic variation, especially short insertions and deletions (and complex combinations thereof). This is an especially important feature for NGS pipelines used to detect genetic variation in highly polymorphic regions of the genome such as the MHC. Several assembly methods are available that broadly fit into two categories: prefix trees and De Bruijn graphs. For the current implementation of the pipeline, NMDP is using prefix-tree-based short sequence read assembly by progressive k-mer search and 3’ read extension (SSAKE).
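The 3’ extension idea can be illustrated with a toy greedy assembler. This is not SSAKE itself, only a minimal sketch of suffix k-mer search and read extension, with no coverage weighting, error tolerance, or reverse-complement handling.

```python
def extend_contig(seed, reads, k=4):
    """Greedily extend a seed contig 3'-ward: at each step, find an unused
    read whose first k bases match the contig's last k bases and append the
    read's remaining tail. Toy illustration of SSAKE-style extension."""
    contig = seed
    used = set()
    while True:
        suffix = contig[-k:]
        for i, read in enumerate(reads):
            if i not in used and read.startswith(suffix) and len(read) > k:
                contig += read[k:]
                used.add(i)
                break  # restart the search with the new, longer suffix
        else:
            break  # no read extends the contig; assembly of this seed is done
    return contig
```

Real assemblers index reads by k-mer (the prefix tree) rather than scanning linearly, and they weigh competing extensions by coverage.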

Align

Alignment gives assembled contigs a global location and allele-level genetic variation with respect to a common human genome reference. Several alignment methods are available that vary with respect to speed and accuracy. For the current implementation of the pipeline, NMDP uses Burrows-Wheeler alignment (BWA) for long reads (assembled contigs). Alignment-only processing is required to retain sequence information for non-overlapping (singleton) reads, which cannot be assembled. For this purpose the NMDP uses BWA optimized for detecting variation in low-complexity regions of the genome. Compressed files containing aligned reads or contigs are sorted and indexed.

Filter

Filtering is an essential part of any data processing workflow; however, acceptance criteria and specific quality thresholds must be determined on an application-dependent, case-by-case basis. There is always a tradeoff between some measures of accuracy (sensitivity) and others (specificity and ambiguity) as well as processing speed (turnaround time). Under normal operating conditions, filtering criteria must be adjusted to optimize the pipeline and achieve satisfactory validation results for its intended purpose at an acceptable cost.

Data processing errors may result from algorithm-specific artifacts or from other sample- or sub-sample-specific factors, such as aneuploidy, off-target capture, or contamination. With high sequence coverage and digital output (individual reads), errors such as these may be very infrequent, but their effects, while possibly very small or negligible, can usually be measured. Certain strict filtering criteria are standard operating procedure because they increase accuracy without measurable impact on other performance metrics. These include, but are not limited to, the following:

Merge overlapping contigs

Figure 1 illustrates two overlapping contigs with identical sequences that should be merged into a single contig during the normal assembly step. NMDP has developed filtering software that identifies strictly overlapping contigs and merges them.
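A minimal version of such a filter might detect an exact suffix/prefix overlap between two contigs and join them. The sketch below is illustrative only; the minimum-overlap threshold is an assumption, not the pipeline's actual setting.

```python
def merge_if_overlapping(a, b, min_overlap=10):
    """Merge contig b onto the 3' end of contig a when a suffix of a exactly
    matches a prefix of b of at least min_overlap bases (or b is wholly
    contained in a). Returns the merged contig, or None if no qualifying
    overlap exists."""
    if b in a:  # b is fully contained in a; nothing new to add
        return a
    max_o = min(len(a), len(b))
    for o in range(max_o, min_overlap - 1, -1):  # prefer the longest overlap
        if a[-o:] == b[:o]:
            return a + b[o:]
    return None
```

A production filter would also consider reverse complements and tolerate sequencing errors in the overlap; this sketch requires identity, matching the "strictly overlapping" case described above.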

Apply consensus quality thresholds

Sometimes contigs only partially align to the human reference with significant sequence (usually flanking) that is soft-clipped. Additionally, contigs may be assembled from reads outside of the genomic regions of interest (ROI) due to off-target capture effects or contamination. Filtering may include, but is not limited to, the following:

Locus

For some applications it is useful to limit analysis to specific genomic regions of scientific or clinical interest: for example, the full gene, all exons, or only the exons representing the antigen recognition site. Filters applicable to HLA are here:

/opt/nmdp/regions/

For an example go here.
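As an illustration of locus filtering, the sketch below keeps only aligned contigs that overlap a region of interest. The tuple layout and the 0-based, half-open interval convention are assumptions for the example.

```python
def filter_by_region(contigs, roi):
    """Keep contigs whose aligned interval overlaps the region of interest.

    contigs: list of (name, start, end) aligned intervals on one chromosome
    roi: (start, end) region of interest; coordinates 0-based, half-open
    """
    roi_start, roi_end = roi
    # Two half-open intervals overlap iff each starts before the other ends.
    return [c for c in contigs if c[1] < roi_end and c[2] > roi_start]
```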

Consensus sequence length

The string length of the assembled contiguous sequence.

Consensus locus length

The length of the genomic region represented by an aligned contig. Note, the sequence length and locus length may differ where insertions and deletions are present.
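Both lengths can be recovered from an aligned contig's CIGAR string: per the SAM specification, M, I, S, =, and X operations consume the query (sequence length), while M, D, N, =, and X consume the reference (locus length). A small sketch:

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def cigar_lengths(cigar):
    """Return (sequence_length, locus_length) implied by a CIGAR string."""
    seq_len = locus_len = 0
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "MIS=X":   # operations that consume the query sequence
            seq_len += n
        if op in "MDN=X":   # operations that consume the reference
            locus_len += n
    return seq_len, locus_len
```

An insertion makes the sequence longer than the locus it spans; a deletion does the opposite.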

Breadth-of-coverage

The fraction of an expected region of interest that is covered by one or more reads (contigs).

Depth-of-coverage

The total number of reads (contigs) that are mapped to a specific region of interest.
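Both coverage metrics can be computed from aligned contig intervals. The sketch below uses 0-based, half-open coordinates and a naive per-position count, which is adequate for small regions of interest.

```python
def breadth_and_depth(roi, contigs):
    """Compute breadth- and depth-of-coverage for a region of interest.

    roi: (start, end) half-open interval on the reference
    contigs: list of (start, end) aligned contig intervals
    Returns (breadth, mean_depth): the fraction of roi positions covered by
    at least one contig, and the mean contig count per roi position.
    """
    start, end = roi
    depth = [0] * (end - start)
    for c_start, c_end in contigs:
        # Count only the part of each contig that falls inside the ROI.
        for pos in range(max(c_start, start), min(c_end, end)):
            depth[pos - start] += 1
    breadth = sum(1 for d in depth if d > 0) / len(depth)
    mean_depth = sum(depth) / len(depth)
    return breadth, mean_depth
```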

(Genotype)

In its current implementation the NMDP pipeline simply reformats aligned consensus sequences as VCF, using mpileup to aggregate all consensus sequences into a single file. At present, zygosity and phase information are lost, to facilitate gross, high-resolution benchmarking against independently verified alleles. With slight modification, individual aligned contigs can be reformatted (genotyped) separately, thus preserving phase-level resolution.
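For intuition, a column-wise comparison of an ungapped consensus against its reference yields minimal VCF-style SNP records. This is a simplification of what mpileup does (which also handles indels, base qualities, and pileup depth).

```python
def consensus_to_vcf_records(chrom, ref_start, ref, consensus):
    """Emit minimal VCF-style SNP records (CHROM, POS, REF, ALT) by
    column-wise comparison of a consensus sequence to the reference segment
    it aligned to. Assumes an ungapped alignment; POS is 1-based, as in VCF.
    """
    records = []
    for i, (r, c) in enumerate(zip(ref, consensus)):
        if r != c and c != "N":  # skip ambiguous consensus positions
            records.append((chrom, ref_start + i + 1, r, c))
    return records
```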

Interpret

Several methods for interpreting consensus DNA sequence are available.

Cross-reference to IMGT-HLA

VCF representation

Allele equivalence

Calculation of equivalent insertion and deletion regions – Short reads are often aligned with gaps in either the read or the reference sequence, indicating the possible occurrence of short deletions or insertions, respectively. Gaps may not be positioned uniquely with respect to a given reference or consensus assembly sequence. Rather, a range of positions, known as the equivalent insertion and deletion region (EIR), may be required. Improper representation is common in public databases, leading to misinterpretation and errors of considerable consequence. Proper implementation is critical, especially where data are patient-derived and genetic results are applied clinically.
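An EIR can be computed directly by shifting a candidate indel left and right and checking whether the edited sequence is unchanged. A sketch for deletions (insertions are symmetric):

```python
def equivalent_deletion_region(ref, start, length):
    """Return (leftmost, rightmost) 0-based start positions at which deleting
    `length` bases from `ref` yields the same resulting sequence, i.e. the
    bounds of the equivalent insertion and deletion region (EIR)."""
    target = ref[:start] + ref[start + length:]
    left = start
    while left > 0 and ref[:left - 1] + ref[left - 1 + length:] == target:
        left -= 1   # the deletion can be shifted one base to the left
    right = start
    while (right + length < len(ref)
           and ref[:right + 1] + ref[right + 1 + length:] == target):
        right += 1  # the deletion can be shifted one base to the right
    return left, right
```

In a homopolymer or tandem repeat the EIR spans the whole repeat, which is why VCF convention left-aligns indels rather than reporting an arbitrary position within the region.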

Assign nomenclature

Filtered, FASTA-formatted consensus sequences can be assigned to possible combinations of alleles formatted as a genotype list string using the NMDP's interpretation service. Go here to see an example.

(Transcribe) Translate and annotate de novo

De novo annotation is especially important for novel alleles not present in the IMGT-HLA database, as well as for very rare or private mutations that may cause disease. cDNA sequences may be translated in silico into corresponding protein sequences. Further annotation (currently out of scope for the pipeline) may identify site-specific sequence conservation and nucleic- or amino-acid-level lesions likely to affect protein structure or (patho)physiological function. Many open source tools are available for this purpose.
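In silico translation itself is straightforward. The sketch below builds the standard genetic code (NCBI translation table 1) and reads codons from the first base until a stop codon; real tools must also identify the correct reading frame and splice out introns first.

```python
BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons in TCAG order.
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {b1 + b2 + b3: AA[16 * i + 4 * j + k]
               for i, b1 in enumerate(BASES)
               for j, b2 in enumerate(BASES)
               for k, b3 in enumerate(BASES)}

def translate(cdna):
    """Translate a cDNA coding sequence in silico, stopping at the first
    stop codon or at the end of the last complete codon."""
    protein = []
    for pos in range(0, len(cdna) - 2, 3):
        aa = CODON_TABLE.get(cdna[pos:pos + 3].upper(), "X")  # X = unknown
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```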
