Commit

a bit more on short read mapping
The tech note still needs improvement. Will do that after the release of v2.3.
lh3 committed Oct 22, 2017
1 parent c6b6392 commit 1dd221a
Showing 1 changed file with 19 additions and 9 deletions.
28 changes: 19 additions & 9 deletions tex/minimap2.tex
@@ -68,7 +68,7 @@ \section{Introduction}
generating base-level alignment, which in turn inspired us to develop minimap2
towards higher accuracy and more practical functionality.

-Both SMRT and ONT have been applied to sequence spliced mRNAs (RNA-seq). While
+Both SMRT and ONT have been applied to the sequencing of spliced mRNAs (RNA-seq). While
traditional mRNA aligners work~\citep{Wu:2005vn,Iwata:2012aa}, they are not
optimized for long noisy sequence reads and are tens of times slower than
dedicated long-read aligners. When developing minimap2 initially for aligning
@@ -111,8 +111,11 @@ \subsubsection{Chaining}
\begin{equation}\label{eq:chain-gap}
\beta(j,i)=\gamma_c\big((y_i-y_j)-(x_i-x_j)\big)
\end{equation}
-In implementation, a gap of length $l$ costs $\gamma_c(l)=0.01\cdot \bar{w}\cdot
-|l|+0.5\log_2|l|$, where $\bar{w}$ is the average seed length. For $m$ anchors, directly computing all $f(\cdot)$ with
+In implementation, a gap of length $l$ costs
+\[
+\gamma_c(l)=0.01\cdot \bar{w}\cdot|l|+0.5\log_2|l|
+\]
+where $\bar{w}$ is the average seed length. For $m$ anchors, directly computing all $f(\cdot)$ with
Eq.~(\ref{eq:chain}) takes $O(m^2)$ time. Although theoretically faster
chaining algorithms exist~\citep{Abouelhoda:2005aa}, they
are inapplicable to generic gap cost, complex to implement and usually
@@ -363,12 +366,19 @@ \subsection{Aligning spliced sequences}
\subsection{Aligning short paired-end reads}

During chaining, minimap2 takes a pair of reads as one read with a gap of
-unknown length in the middle. It does not break a chain if there is a long
-reference gap between seeds on different reads. After identifying primary
-chains (Section~\ref{sec:primary}), we split each fragment chain into two read
-chains and perform alignment for each read as in Section~\ref{sec:genomic}.
-Finally, we pair hits of each read end to find consistent paired-end
-alignments.
+unknown length in the middle. It applies a normal gap cost between seeds on the
+same read but a more permissive gap cost between seeds on different reads.
+More precisely, the gap cost during chaining is:
+\[
+\gamma_c(l)=\left\{\begin{array}{ll}
+0.01\cdot\bar{w}\cdot|l|+0.5\log_2|l| & \mbox{if the two seeds are on the same read} \\
+\min\{0.01\cdot\bar{w}\cdot|l|,\log_2|l|\} & \mbox{otherwise}
+\end{array}\right.
+\]
+After identifying primary chains (Section~\ref{sec:primary}), we split each
+fragment chain into two read chains and perform alignment for each read as in
+Section~\ref{sec:genomic}. Finally, we pair hits of each read end to find
+consistent paired-end alignments.

\end{methods}
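
The piecewise gap cost added in this commit can be checked numerically with a small Python sketch. This is only an illustration of the two formulas in the text, not minimap2's actual code: the function name `chain_gap_cost`, its parameter names, and the choice to return zero for a zero-length gap (where $\log_2|l|$ is undefined) are all assumptions made here.

```python
import math


def chain_gap_cost(gap_len: int, avg_seed_len: float, same_read: bool = True) -> float:
    """Evaluate gamma_c(l) as written in the tech note (hypothetical helper).

    gap_len      -- signed gap length l between two anchors
    avg_seed_len -- average seed length, written \\bar{w} in the text
    same_read    -- True if both seeds lie on the same read of a pair
    """
    l = abs(gap_len)
    if l == 0:
        # Assumption: the note does not define the cost at l = 0,
        # so this sketch treats a zero-length gap as free.
        return 0.0
    linear = 0.01 * avg_seed_len * l
    if same_read:
        # Normal cost: linear term plus half of log2|l|.
        return linear + 0.5 * math.log2(l)
    # More permissive cost across the two reads of a pair.
    return min(linear, math.log2(l))


if __name__ == "__main__":
    for l in (16, 1024):
        print(l, chain_gap_cost(l, 15.0), chain_gap_cost(l, 15.0, same_read=False))
```

With an average seed length of 15, a gap of 1024 costs about 158.6 between seeds on the same read but only 10 between seeds on different reads, which shows how the second branch keeps a long insert between read ends from breaking the fragment chain.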
