README

README for alu-detect
http://compbio.cs.toronto.edu/alu-detect

This document describes alu-detect, a software package for detecting
alu insertions directly from Illumina HTS reads.

alu-detect was developed by:
	   Matei David
	   Harun Mustafa
	   Michael Brudno

The authors may be contacted at: alu-detect cs toronto edu.


--------------------------------------------------------------------------------
Table of Contents
--------------------------------------------------------------------------------

  1. Overview

  2. Prerequisties

  3. Installing alu-detect

     3.1. Compile, if necessary
     3.2. Setup the global settings
     3.3. Setup a reference
     3.4. Setup a fake reference (optional)
     3.5. Setup list of known novel alus (optional)
     3.6. Setting up PATH

  4. Input Specification

     4.1 Reads in fastq format
     4.2 Mappings in sam/bam format

  5. Running alu-detect

  6. Output Specification

     6.1. Bed file fields
     6.2. Confidence intervals

  7. Output Filtering

     7.1. Default filtering
     7.2. Using a fake reference

  8. Notes


--------------------------------------------------------------------------------
1. Overview
--------------------------------------------------------------------------------

alu-detect is a software package for detecting alu insertions directly from
Illumina HTS reads.

Briefly, alu-detect selects "interesting" reads that might contain evidence of
novel insertions in the reference, e.g.: reads that map with insertions or
mismatches towards either end, read pairs where only one read maps, read pairs
mapped with with a discordant insert size. alu-detect then remaps these
interesting reads to the reference and to the set of alu consensus
sequences. Then, it builds clusters of reads with Alu evidence along the
reference, using read pair information when available. Each cluster is
investigated using a "split mapping" algorithm. Every Alu insertion call is
detected with 0, 1, or 2 breakpoints.

alu-detect takes as input either reads in fastq format, or mappings in sam/bam
format. If the input consists of fastq files, these are first mapped to the
reference, increasing the total running time and overall space requirements.


--------------------------------------------------------------------------------
2. Prerequisites
--------------------------------------------------------------------------------

A large part of alu-detect consists of bash scripts using tools commonly
available in a UNIX/Linux development environment.

If the package you downloaded does not contain binaries (e.g. bin/get-regions),
it is necessary to compile the some parts of alu-detect. For this, you need:

- gcc

- Boost C++ libraries
  http://www.boost.org/


Additional prerequisites include:

- python 2.6+, including modules: argparse, atexit, bisect, functools, itertools,
    math, operator, subprocess

- SAMtools
  http://samtools.sourceforge.net/

- BEDTools
  http://code.google.com/p/bedtools/

- RepeatMasker
  http://www.repeatmasker.org/

- RepBase for RepeatMasker
  http://www.girinst.org/

- Bowtie2
  http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

- SHRiMP
  http://compbio.cs.toronto.edu/shrimp/

- Reference Genome(s)
  http://genome.ucsc.edu/

- Bowtie2 indexes of the reference genome(s)
  http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
  Note: These can be regenerated locally using bowtie2.

- Repeat annotations of the reference genome(s) (i.e., RepeatMasker .out files)
  E.g., for hg19 they are here:
  http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromOut.tar.gz
  Note: These can be regenerated locally using RepeatMasker.


--------------------------------------------------------------------------------
3. Installing alu-detect
--------------------------------------------------------------------------------

Here are the steps needed to setup alu-detect after you download and unpack the
alu-detect package.


3.1. Compile, if necessary
--------------------------

If the package does not contain binaries (e.g. bin/get-regions), it is necessary
to compile them. For this you need gcc and boost:

$ cd <alu-detect-dir>
$ BOOST=<path-to-boost-headers> make

If boost is already installed in a standard location on your system, you may
omit setting the BOOST variable.


3.2. Setup the global settings
------------------------------

For this step you need the paths to all prerequisites listed in Section 2,
including the RepBase libraries for RepeatMasker. Run:

$ ./alu-detect setup


3.3. Setup a reference
----------------------

For this step you need: the reference fasta file, bowtie2 indexes for that
reference, and repeat annotations for that reference. Run, e.g.:

$ ./alu-detect add-ref hg18

This will ask for: the reference fasta file, the prefix of the bowtie2 index
files (full path minus the .1.bt2 suffix), and a bash regexp (can use wildcards)
describing the path to the relevant .out files output by RepeatMasker.

The paths can also be specified on the command line. Type only

$ ./alu-detect add-ref

for a list of command line options. E.g.:

$ ./alu-detect add-ref \
    -f /data/hg18.fa \
    -x /data/bowtie2-index/hg18 \
    -o "/tmp/RepeatMasker-hg18/chr*out" \
    hg18

Note that the regex describing the path to the RepeatMasker .out files
must be quoted to prevent shell expansion.


3.4. Setup a fake reference (optional)
--------------------------------------

This step is optional! Please read the Output Filtering Section to understand
what this could be used for.

To add a fake reference, run, e.g.:

$ ./alu-detect make-fake-ref hg18 fake_hg18

For this to work, the reference hg18 must have been previously setup as in step
3.3 above. make-fake-ref consists of a number of stages which are run one after
the other. To avoid the "continue?" prompts, set CONF=a prior to running it. The
most time-consuming task is running RepeatMasker on the fake reference fasta
file. To use multiple threads, set the variable NCPU to the desired number of
threads (it defaults to 4). E.g.:

$ NCPU=30 CONF=a ./alu-detect make-fake-ref hg18 fake_hg18

This program will eventually invoke step 3 automatically, adding fake_hg18 as a
reference to the alu-detect settings/ folder. Additionally, it also creates
several files in the data/ folder which are used as "ground truth" when
computing precision and recall during filtering.


3.5. Setup list of known novel alus (optional)
----------------------------------------------

This step is optional! This list is useful only if you decide to filter calls
using a fake reference, as described in the Output Filtering Section.

If you decide to do this, once you remove Alus from the reference and run
detection using a donor HTS dataset, truly novel Alus in the donor with respect
to the reference will appear as false positives of the simulation. To alleviate
this, we can filter out every Alu insertion call that matches a known novel Alu
in any other individual.

For a reference named, e.g., hg15, the list of known novel Alus should be put in
the file data/known-novel-alus.hg15.bed

alu-detect is distributed with such lists for hg18 and hg19, using novel Alu
insertion calls from: dbRIP [Wang et al, 2006], [Hormozdiari et al., 2011],
[Stewart et al., 2011] and [Lee et al., 2012]. For other references, you can
attempt to lift those annotations over.

If you do not build such a list, and still use filtering using a fake reference,
the simulation might still work, but the observed precision will be reduced. By
default, alu-detect looks for filters that achieve precision at least .97 (and
maximizes recall). Without a list of known novel Alus, you should edit the file
bin/get-top-filter and reduce the target minimum precision, or else the
simulation will output no viable filters.

To get an idea of what precision might be suitable, after running alu-detect on
the fake reference, try filtering with some reasonable parameters and see what
you get, e.g.:

$ filter-ref-alu-bp $ALUS_BED <raw-calls-on-fake_hg15> \
 | apply-filter 200 10 200 1100 \
 | TRUTH=<alu-detect-path>/targets.fake_hg15.bed get-prec


3.6. Setting up PATH
--------------------

The main parts of alu-detect (adding/removing a reference, detecting Alus,
filtering) are accessed through the wrapper program called alu-detect. This
program detects its own installation directory and sets up an environment
variable which allows other scripts to locate the data folders associated with
the current installation. Thus, one can have 2 independent installations of
alu-detect, one in folder1, one in folder2, with their own data folders. To run
using a specific copy, specify the path to the alu-detect script (e.g., "$
folder1/alu-detect detect [...]"), then the alu-detect script in turn sets up
the PATH for other scripts.

Other scripts part of alu-detect are in the bin/ subfolder of
<alu-detect-dir>. To ensure those work properly, set PATH, AWKPATH, and
PYTHONPATH to include <alu-detect-dir>/bin. Make sure AWKPATH and PYTHONPATH are
exported by your shell. E.g.:

$ PATH=$PATH:<slu-detect-dir>/bin
$ export AWKPATH=$AWKPATH:<alu-detect-dir>/bin
$ export PYTHONPATH=$PYTHONPATH:<alu-detect-dir>/bin


--------------------------------------------------------------------------------
4. Input Specification
--------------------------------------------------------------------------------

alu-detect accepts these types of input:

4.1. Reads in fastq format

     4.1.a. Unpaired reads
     4.1.b. Paired reads in single files
     4.1.c. Paired reads in separate files

4.2. Mappings in sam/bam format

All input files may optionally be gzipped.

In every run, there can be only 1 type of input: 1.a or 1.b or 1.c or 2. Several
files of the same input type can be given. To combine several types, you must
first bring them to a common type "by hand" (usually as sam/bam).

Also see NOTE on bash in Section 8.


4.1. Reads in fastq format
--------------------------

Whenever input consists of several sets of reads in fastq format, alu-detect
will first map all reads to the reference using bowtie2. It will then proceed
with the workflow as if the input was of type 2.

The read group of each set of reads must be specifed. (Read groups may be
identical, but there must be as many read groups given as there are fastq
files.)

If INPUT_PHRED is not empty, its value is used for all fastq files.
If INPUT_PHRED is empty or not given, a value is detected for each set of fastq
files.

There are several types of fastq inputs:


4.1.a. Unpaired reads:

$ ORIG_UNPAIRED_READS="file1.fq file2.fq" ORIG_READ_GROUPS="rg1 rg2" \
INPUT_PHRED=33 [...] ./alu-detect detect [...]


4.1.b. Paired reads in single files:

$ ORIG_PAIRED_READS="file1.fq file2.fq" ORIG_READ_GROUPS="rg1 rg2" \
INPUT_PHRED=33 [...] ./alu-detect detect [...]

Here, file1.fq contains consecutive reads from each pair:

@read1/1
AAAA
+
####
@read1/2
AAAA
+
####
@read2/1
AAAA
+
####
@read2/2
AAAA
+
####


4.1.c. Paired reads in separate files:

$ ORIG_READS_1="file1.1.fq file2.1.fq" ORIG_READS_2="file1.2.fq file2.2.fq" \
ORIG_READ_GROUPS="rg1 rg2" INPUT_PHRED=33 [...] ./alu-detect detect [...]

Here, file1.1.fq contains:

@read1/1
AAAA
+
####
@read2/1
AAAA
+
####

and file1.2.fq contains:

@read1/2
AAAA
+
####
@read2/2
AAAA
+
####


4.2. Mappings in sam/bam format
-------------------------------

$ ORIG_MAPPINGS="file1.bam file2.sam.gz" [...] ./alu-detect detect [...]


Read groups: Downstream processing steps need information about what read group
each read belongs to. For every read found in a SAM record, its read group is
determined from the RG:Z:<read_group> tag. If that is missing, a default value
of "00" is used. This is ok if and only if the pairing information is very
similar for all reads missing their RG tag.

To specify a default read group for some of the sam/bam files, you can
also set the variable ORIG_READ_GROUPS. Unlike the case of fastq input, there
may be fewer read groups specifed than there are input sam/bam files. In
that case, read group 1 is used as the default for sam/bam file 1, read
group 2 for file 2, and so on. Files with no corresponding read group will use
"00" as the default read group. E.g.:

$ ORIG_MAPPINGS="file1.bam file2.sam.gz" ORIG_READ_GROUPS="rg1" \
[...] ./alu-detect detect [...]

All reads missing the RG tag from file1.bam will be assigned to the read group
"rg1", and all reads the RG tag from file2.sam.gz will be assigned the read
group "00".


Ordering: When dealing with paired reads, downstream processing steps require
that mappings for both reads in each read pair be found in consecutive SAM
records. A common situation when this is not the case is when a sam/bam
file is sorted by coordinate. This is usually (but not always) specified by the
"SO:coordiante" tag on the first line of the file (see SAM spec). If that is the
case, alu-detect will re-sort the file appropriately.

Unfortunately, some programs do not honour this tag, notably samtools. So,
samtools may produce bam files sorted by coordinate, but without the
"SO:coordinate" tag. This breaks alu-detect. To fix it, set the additional RSORT
variable to a non-empty value. This will cause alu-detect to sort all input
files by read name, regardless of the contents of the SO tag:

$ ORIG_MAPPINGS="file1.bam file2.sam.gz" RSORT=1 [...] ./alu-detect detect [...]


If DROP_PAIRS_DIFF_CHR is non-empty, paired reads which have the two mates
mapped to different chromosomes will be dropped as soon as they are encountered.
In the absence of this flag, alu-detect will keep those reads in memory until it
finds the end of the file, at which time they are dropped. If there are many
such reads, they might fill up the RAM.


--------------------------------------------------------------------------------
5. Running alu-detect
--------------------------------------------------------------------------------

alu-detect detect is a large bash script manipulating the input reads
in a series of consecutive stages.

Stage 0: Obtain a sam file of the reads mapped to the reference with read pairs
appearing in consecutive records.

Stage 1: Detect pairing information

Stage 2: Extract "interesting" reads or read pairs

Stage 3: Prepare the reads using an internal naming format

Stage 4: Map the reads to the reference using bowtie2

Stage 5: Clean up the reference mappings

Stage 6: Extract reads for Alu mapping

Stage 7: Map the reads to the set of Alu consensus sequences

Stage 8: Aggregate mappings to reference and mappings to Alus, perform split
mapping, produce raw Alu insertion calls.

Stage 9: Apply basic filters to the raw calls

Stages 4 and 8 are the most time-consuming, and they are fully parallelizable
(see NCPU below). Stage 4 requires most hard disk space, and Stage 8 requires
most RAM.


The syntax for running alu-detect is (also see NOTE on bash in Section 8):

$ [variable assingments] ./alu-detect detect <ngs_name> <reference_name>


<reference_name> should be previously set up as in Section 3.3. The files
created by this run will be prefixed by <ngs_name>.<reference_name>. Most
importantly:

<ngs_name>.<reference_name>.calls.raw contains the raw calls

<ngs_name>.<reference_name>.calls.basic-filters contains raw with basic filters


Variables which affect alu-detect are:

- NCPU : the number of threads alu-detect can use

- CONF : when set to "a", always answer yes to "continue?" prompts

- START_STAGE : skip stages less than this number

- END_STAGE : stop at the end of this stage

- ORIG_UNPAIRED_READS, ORIG_PAIRED_READS, ORIG_READS_1 ORIG_READS_2,
  ORIG_READ_GROUPS, ORIG_MAPPINGS : Described in section 4. When the start stage
  is 3 or more, these are unnecessary

- XTRACE : save bash xtrace to the given file (use "XTRACE=/dev/fd/2" to save it
  to stderr)

- SINGLE_READS_TO_REMAP : do not split reads.to_remap by rg


Sample command lines include:

$ ORIG_MAPPINGS=map.sam.gz NCPU=30 CONF=a ./alu-detect detect NA18507 hg18

$ ORIG_READS_1=fastq/*.1.fq.gz ORIG_READS_2=fastq/*.2.fq.gz \
ORIG_READ_GROUPS=$(for f in fastq/*.1.fq.gz; do basename $f .1.fq.gz; done) \
INPUT_PHRED=33 NCPU=30 CONF=a ./alu-detect detect NA18507 hg18

$ export NCPU=30; CONF=a START_STAGE=3 END_STAGE=8 ./alu-detect detect \
NA18507 hg18


--------------------------------------------------------------------------------
6. Output Specification
--------------------------------------------------------------------------------

6.1. Bed file fields
--------------------

Alu insertion calls are reported in a bed file. Each call is reported as a
confidence interval. The 1-based closed interval [a, b] is reported in bed
format as the 0-based half-open interval [a-1, b). The various columns contain:

$1: Chromosome name

$2: Start of interval in bed format (a-1, if 1-based closed interval is [a, b])

$3: End of interval in bed format (b, if 1-based closed interval is [a, b])

$4: Alu subfamily (comma separated, if ambiguous)

$5: Total alignment score, out of 1000.

$6: Strand

$7: Alu consensus sequence index of the left endpoint of the insertion.

$8: Alu consensus sequence index of the right endpoint of the insertion.

Note: $7<$8 iff positive strand insertion.

$9: Number of reads+readpairs supporting insertion.

$10: Number of reads+readpairs capturing the left end of the insertion.

$11: Number of reads+readpairs capturing the right end of the insertion.

$12: If both $10 and $11 are non-zero, the length of the TSD.

Note: If $12>0, the insertion follows the canonical format, where the right
breakpoint ($3) captures the left end of the insertion, and the left breakpoint
captures its right end. If $12<0, this is a non-canonical insertion with a
target site loss, rather than duplication.

$13: Score achieved by aligning all reads to the reference only (we sometimes
refer to this as "the null hypothesis").


6.2. Confidence intervals
-------------------------

If $10 == 0 and $11 == 0, then neither breakpoint is detected, and the
confidence interval is estimated using the pairing information and the location
of the mapped portions of the reads on the reference.

If $10 > 0 and $11 == 0, only the breakpoint capturing the left end of the
insertion is detected. If this is between reference positions x and x+1, then
the 1-based confidence interval is [x-50,x+1], which in bed format becomes the
0-based [x-51,x+1) (**)

If $10 == 0 and $11 > 0, only the breakpoint capturing the right end of the
insertion is detected. If this is between reference positions x and x+1, then
the 1-based confidence interval is [x,x+51], which in bed format becomes the
0-based [x-1,x+51) (**)

(**): Unintuitively at first sight, the breakpoint capturing the left end of the
insertion is near the right end of the confidence interval, and the breakpoint
capturing the right end of the insertion is near the left end of the confidence
interval. The reason for this convention is that this is the normal case for Alu
insertions exhibiting a TSD.

If $10 > 0 and $11 > 0 and $12 == 0, then the call was made with evidence for
both breakpoints, but at least one of them was ambiguous. E.g., this would
happen if an equal number of reads supported 2 different breakpoints for the
left end of the Alu insertion.

If $10 > 0 and $11 > 0 and $12 > 0, then both breakpoints were reliably detected
and the Alu insertion has the standard form, described in (**).

If $10 > 0 and $11 > 0 and $12 < 0, then both breakpoints were reliably
detected, but the Alu insertion exhibits a target site loss instead of a
duplication. In other words, contrary to (**), in this case the left end of the
insertions attaches to the left end of the confidence interval, the right end of
the insertion attaches to the right end of the confidence interval, and the
contents of the confidence interval are presumably lost. This shouldn't be taken
a priori as evidence of a deletion, only of a possible alignment.


--------------------------------------------------------------------------------
7. Output Filtering
--------------------------------------------------------------------------------

By default, alu-detect produces a list of raw Alu insertion calls. These should
be filtered to reduce false positives.


7.1. Available Filters
----------------------

filter-length x: inner length >= x. For positive strand insertions, inner length
= $8-$7+1. For negative stranf insertions, inner length = $7 - $8 + 1. (See
Section 6.1.)

filter-support x: the number of reads/read pairs supporting the call is at least
x.

filter-weak-null x: the difference between the score of the insertion ($5) and
the score of the null hypthesis ($12) is at least x. Note: maximum score is
1000.

filter-ci-length x: confidence interval length ($3-$2+1) <= x.

filter-ref-alu-bp: remove Alu insertion calls with only one breakpoint detected,
when this breakpoint is near an analogous breakpoint of a reference Alu.


7.2. Default filtering
----------------------

By default, alu-detect produces (in addition to the list of raw calls), a list
of calls filtered using some reasonable defaults:

- filter-ref-alu-bp
- filter-length 150
- filter-support 10
- filter-weak-null 200 (i.e, 20% of max)
- filter-ci-length 1100

The filtered calls will be generated by detect, and placed in the file:
<prefix>.<ref_name>.calls.basic-filters.bed


7.3. Using a fake reference
---------------------------

While the filtering in 7.2 can be adequate, alu-detect incorporates a better
filtering technique. This consists of several conceptual steps:

- Identify novel-looking Alus in the reference

- Remove them, contructing a "fake" reference

- Run detection against both the real and the fake reference

- Estimate precision and recall of various filters using the calls on the fake
  reference

- Apply those filters to the real calls

REQUIREMENT (*): To use this technique, you need to have the original reads in
fastq format. Mappings to the reference are *not* enough. The reason is that
many more reads will be mapped discrodantly to the fake reference than to the
real reference. You *must* remap the read set to the fake reference for this
filtering technique to be any useful.


Otherwise, these steps are mostly automated. To use them:

- Create a fake reference using the instructions in Section 3.4 and 3.5. E.g.:

- Run detection against both the real and the fake reference. E.g.:

$ NCPU=30 CONF=a ORIG_MAPPINGS=map.hg18.bam ./alu-detect detect NA18507 hg18

$ NCPU=30 CONF=a ORIG_READS_1=reads_1.fq.gz ORIG_READS_2=reads_2.fq.gz \
  INPUT_PHRED=33 ./alu-detect detect NA18507 fake_hg18

If you mapped the reads by hand to the fake reference, you can replace the
second step by:

$ NCPU=30 CONF=a ORIG_MAPPINGS=map.fake_hg18.bam ./alu-detect detect \
  NA18507 fake_hg18


NOTE: As explained above (*), the following is wrong:

$ NCPU=30 CONF=a ORIG_MAPPINGS=map.hg18.bam ./alu-detect detect \
  NA18507 fake_hg18

alu-detect cannot detect the conceptual error of mixing up references, and will
not complain. But the results of doing this are undefined. (Most likely no
suitable filters will be found.)


- Filter the calls on the 2 references together using the program
  filter-calls. Its arguments are: <ngs_name> <real_ref_name> <fake_ref_name>
  <real_ref_calls_dir> <fake_ref_calls_dir>. E.g.,

$ NCPU=30 CONF=a ./alu-detect filter-calls NA18507 hg18 fake_hg18 ./ ./

  The program will create several files and among them, 2 filtered call files:
  NA18507.hg18.calls.filtered.bed and NA18507.fake_hg18.calls.filtered.bed.

  If you cannot get a list of known novel insertions (as explained in 3.5), you
  must determine a good minimum precision threshold "by hand". For this you need
  to familiarize yourself with the various file formats used by
  alu-detect. Notably, the file table.<ngs_name>.csv produced by filter-calls
  contains precision in column $8. Once you determine a value using that file,
  edit bin/get-top-filter and replace the minimum precision value there.


--------------------------------------------------------------------------------
8. Notes
--------------------------------------------------------------------------------

NOTE on bash: When using bash as a shell, assignments of the form X=Y which
precede a command are treated as assigning Y to shell variable X. Assignments
following a command are instead passed verbatim as parameters. Thus:

$ NCPU=30 ./alu-detect detect

is correct, whereas

$ ./alu-detect detect NCPU=30

is not.