-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME
711 lines (472 loc) · 23.9 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
README for alu-detect
http://compbio.cs.toronto.edu/alu-detect
This document describes alu-detect, a software package for detecting
alu insertions directly from Illumina HTS reads.
alu-detect was developed by:
Matei David
Harun Mustafa
Michael Brudno
The authors may be contacted at: alu-detect cs toronto edu.
--------------------------------------------------------------------------------
Table of Contents
--------------------------------------------------------------------------------
1. Overview
2. Prerequisties
3. Installing alu-detect
3.1. Compile, if necessary
3.2. Setup the global settings
3.3. Setup a reference
3.4. Setup a fake reference (optional)
3.5. Setup list of known novel alus (optional)
3.6. Setting up PATH
4. Input Specification
4.1 Reads in fastq format
4.2 Mappings in sam/bam format
5. Running alu-detect
6. Output Specification
6.1. Bed file fields
6.2. Confidence intervals
7. Output Filtering
7.1. Default filtering
7.2. Using a fake reference
8. Notes
--------------------------------------------------------------------------------
1. Overview
--------------------------------------------------------------------------------
alu-detect is a software package for detecting alu insertions directly from
Illumina HTS reads.
Briefly, alu-detect selects "interesting" reads that might contain evidence of
novel insertions in the reference, e.g.: reads that map with insertions or
mismatches towards either end, read pairs where only one read maps, read pairs
mapped with with a discordant insert size. alu-detect then remaps these
interesting reads to the reference and to the set of alu consensus
sequences. Then, it builds clusters of reads with Alu evidence along the
reference, using read pair information when available. Each cluster is
investigated using a "split mapping" algorithm. Every Alu insertion call is
detected with 0, 1, or 2 breakpoints.
alu-detect takes as input either reads in fastq format, or mappings in sam/bam
format. If the input consists of fastq files, these are first mapped to the
reference, increasing the total running time and overall space requirements.
--------------------------------------------------------------------------------
2. Prerequisites
--------------------------------------------------------------------------------
A large part of alu-detect consists of bash scripts using tools commonly
available in a UNIX/Linux development environment.
If the package you downloaded does not contain binaries (e.g. bin/get-regions),
it is necessary to compile the some parts of alu-detect. For this, you need:
- gcc
- Boost C++ libraries
http://www.boost.org/
Additional prerequisites include:
- python 2.6+, including modules: argparse, atexit, bisect, functools, itertools,
math, operator, subprocess
- SAMtools
http://samtools.sourceforge.net/
- BEDTools
http://code.google.com/p/bedtools/
- RepeatMasker
http://www.repeatmasker.org/
- RepBase for RepeatMasker
http://www.girinst.org/
- Bowtie2
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
- SHRiMP
http://compbio.cs.toronto.edu/shrimp/
- Reference Genome(s)
http://genome.ucsc.edu/
- Bowtie2 indexes of the reference genome(s)
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
Note: These can be regenerated locally using bowtie2.
- Repeat annotations of the reference genome(s) (i.e., RepeatMasker .out files)
E.g., for hg19 they are here:
http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromOut.tar.gz
Note: These can be regenerated locally using RepeatMasker.
--------------------------------------------------------------------------------
3. Installing alu-detect
--------------------------------------------------------------------------------
Here are the steps needed to setup alu-detect after you download and unpack the
alu-detect package.
3.1. Compile, if necessary
--------------------------
If the package does not contain binaries (e.g. bin/get-regions), it is necessary
to compile them. For this you need gcc and boost:
$ cd <alu-detect-dir>
$ BOOST=<path-to-boost-headers> make
If boost is already installed in a standard location on your system, you may
omit setting the BOOST variable.
3.2. Setup the global settings
------------------------------
For this step you need the paths to all prerequisites listed in Section 2,
including the RepBase libraries for RepeatMasker. Run:
$ ./alu-detect setup
3.3. Setup a reference
----------------------
For this step you need: the reference fasta file, bowtie2 indexes for that
reference, and repeat annotations for that reference. Run, e.g.:
$ ./alu-detect add-ref hg18
This will ask for: the reference fasta file, the prefix of the bowtie2 index
files (full path minus the .1.bt2 suffix), and a bash regexp (can use wildcards)
describing the path to the relevant .out files output by RepeatMasker.
The paths can also be specified on the command line. Type only
$ ./alu-detect add-ref
for a list of command line options. E.g.:
$ ./alu-detect add-ref \
-f /data/hg18.fa \
-x /data/bowtie2-index/hg18 \
-o "/tmp/RepeatMasker-hg18/chr*out" \
hg18
Note that the regex describing the path to the RepeatMasker .out files
must be quoted to prevent shell expansion.
3.4. Setup a fake reference (optional)
--------------------------------------
This step is optional! Please read the Output Filtering Section to understand
what this could be used for.
To add a fake reference, run, e.g.:
$ ./alu-detect make-fake-ref hg18 fake_hg18
For this to work, the reference hg18 must have been previously setup as in step
3.3 above. make-fake-ref consists of a number of stages which are run one after
the other. To avoid the "continue?" prompts, set CONF=a prior to running it. The
most time-consuming task is running RepeatMasker on the fake reference fasta
file. To use multiple threads, set the variable NCPU to the desired number of
threads (it defaults to 4). E.g.:
$ NCPU=30 CONF=a ./alu-detect make-fake-ref hg18 fake_hg18
This program will eventually invoke step 3 automatically, adding fake_hg18 as a
reference to the alu-detect settings/ folder. Additionally, it also creates
several files in the data/ folder which are used as "ground truth" when
computing precision and recall during filtering.
3.5. Setup list of known novel alus (optional)
----------------------------------------------
This step is optional! This list is useful only if you decide to filter calls
using a fake reference, as described in the Output Filtering Section.
If you decide to do this, once you remove Alus from the reference and run
detection using a donor HTS dataset, truly novel Alus in the donor with respect
to the reference will appear as false positives of the simulation. To alleviate
this, we can filter out every Alu insertion call that matches a known novel Alu
in any other individual.
For a reference named, e.g., hg15, the list of known novel Alus should be put in
the file data/known-novel-alus.hg15.bed
alu-detect is distributed with such lists for hg18 and hg19, using novel Alu
insertion calls from: dbRIP [Wang et al, 2006], [Hormozdiari et al., 2011],
[Stewart et al., 2011] and [Lee et al., 2012]. For other references, you can
attempt to lift those annotations over.
If you do not build such a list, and still use filtering using a fake reference,
the simulation might still work, but the observed precision will be reduced. By
default, alu-detect looks for filters that achieve precision at least .97 (and
maximizes recall). Without a list of known novel Alus, you should edit the file
bin/get-top-filter and reduce the target minimum precision, or else the
simulation will output no viable filters.
To get an idea of what precision might be suitable, after running alu-detect on
the fake reference, try filtering with some reasonable parameters and see what
you get, e.g.:
$ filter-ref-alu-bp $ALUS_BED <raw-calls-on-fake_hg15> \
| apply-filter 200 10 200 1100 \
| TRUTH=<alu-detect-path>/targets.fake_hg15.bed get-prec
3.6. Setting up PATH
--------------------
The main parts of alu-detect (adding/removing a reference, detecting Alus,
filtering) are accessed through the wrapper program called alu-detect. This
program detects its own installation directory and sets up an environment
variable which allows other scripts to locate the data folders associated with
the current installation. Thus, one can have 2 independent installations of
alu-detect, one in folder1, one in folder2, with their own data folders. To run
using a specific copy, specify the path to the alu-detect script (e.g., "$
folder1/alu-detect detect [...]"), then the alu-detect script in turn sets up
the PATH for other scripts.
Other scripts part of alu-detect are in the bin/ subfolder of
<alu-detect-dir>. To ensure those work properly, set PATH, AWKPATH, and
PYTHONPATH to include <alu-detect-dir>/bin. Make sure AWKPATH and PYTHONPATH are
exported by your shell. E.g.:
$ PATH=$PATH:<slu-detect-dir>/bin
$ export AWKPATH=$AWKPATH:<alu-detect-dir>/bin
$ export PYTHONPATH=$PYTHONPATH:<alu-detect-dir>/bin
--------------------------------------------------------------------------------
4. Input Specification
--------------------------------------------------------------------------------
alu-detect accepts these types of input:
4.1. Reads in fastq format
4.1.a. Unpaired reads
4.1.b. Paired reads in single files
4.1.c. Paired reads in separate files
4.2. Mappings in sam/bam format
All input files may optionally be gzipped.
In every run, there can be only 1 type of input: 1.a or 1.b or 1.c or 2. Several
files of the same input type can be given. To combine several types, you must
first bring them to a common type "by hand" (usually as sam/bam).
Also see NOTE on bash in Section 8.
4.1. Reads in fastq format
--------------------------
Whenever input consists of several sets of reads in fastq format, alu-detect
will first map all reads to the reference using bowtie2. It will then proceed
with the workflow as if the input was of type 2.
The read group of each set of reads must be specifed. (Read groups may be
identical, but there must be as many read groups given as there are fastq
files.)
If INPUT_PHRED is not empty, its value is used for all fastq files.
If INPUT_PHRED is empty or not given, a value is detected for each set of fastq
files.
There are several types of fastq inputs:
4.1.a. Unpaired reads:
$ ORIG_UNPAIRED_READS="file1.fq file2.fq" ORIG_READ_GROUPS="rg1 rg2" \
INPUT_PHRED=33 [...] ./alu-detect detect [...]
4.1.b. Paired reads in single files:
$ ORIG_PAIRED_READS="file1.fq file2.fq" ORIG_READ_GROUPS="rg1 rg2" \
INPUT_PHRED=33 [...] ./alu-detect detect [...]
Here, file1.fq contains consecutive reads from each pair:
@read1/1
AAAA
+
####
@read1/2
AAAA
+
####
@read2/1
AAAA
+
####
@read2/2
AAAA
+
####
4.1.c. Paired reads in separate files:
$ ORIG_READS_1="file1.1.fq file2.1.fq" ORIG_READS_2="file1.2.fq file2.2.fq" \
ORIG_READ_GROUPS="rg1 rg2" INPUT_PHRED=33 [...] ./alu-detect detect [...]
Here, file1.1.fq contains:
@read1/1
AAAA
+
####
@read2/1
AAAA
+
####
and file1.2.fq contains:
@read1/2
AAAA
+
####
@read2/2
AAAA
+
####
4.2. Mappings in sam/bam format
-------------------------------
$ ORIG_MAPPINGS="file1.bam file2.sam.gz" [...] ./alu-detect detect [...]
Read groups: Downstream processing steps need information about what read group
each read belongs to. For every read found in a SAM record, its read group is
determined from the RG:Z:<read_group> tag. If that is missing, a default value
of "00" is used. This is ok if and only if the pairing information is very
similar for all reads missing their RG tag.
To specify a default read group for some of the sam/bam files, you can
also set the variable ORIG_READ_GROUPS. Unlike the case of fastq input, there
may be fewer read groups specifed than there are input sam/bam files. In
that case, read group 1 is used as the default for sam/bam file 1, read
group 2 for file 2, and so on. Files with no corresponding read group will use
"00" as the default read group. E.g.:
$ ORIG_MAPPINGS="file1.bam file2.sam.gz" ORIG_READ_GROUPS="rg1" \
[...] ./alu-detect detect [...]
All reads missing the RG tag from file1.bam will be assigned to the read group
"rg1", and all reads the RG tag from file2.sam.gz will be assigned the read
group "00".
Ordering: When dealing with paired reads, downstream processing steps require
that mappings for both reads in each read pair be found in consecutive SAM
records. A common situation when this is not the case is when a sam/bam
file is sorted by coordinate. This is usually (but not always) specified by the
"SO:coordiante" tag on the first line of the file (see SAM spec). If that is the
case, alu-detect will re-sort the file appropriately.
Unfortunately, some programs do not honour this tag, notably samtools. So,
samtools may produce bam files sorted by coordinate, but without the
"SO:coordinate" tag. This breaks alu-detect. To fix it, set the additional RSORT
variable to a non-empty value. This will cause alu-detect to sort all input
files by read name, regardless of the contents of the SO tag:
$ ORIG_MAPPINGS="file1.bam file2.sam.gz" RSORT=1 [...] ./alu-detect detect [...]
If DROP_PAIRS_DIFF_CHR is non-empty, paired reads which have the two mates
mapped to different chromosomes will be dropped as soon as they are encountered.
In the absence of this flag, alu-detect will keep those reads in memory until it
finds the end of the file, at which time they are dropped. If there are many
such reads, they might fill up the RAM.
--------------------------------------------------------------------------------
5. Running alu-detect
--------------------------------------------------------------------------------
alu-detect detect is a large bash script manipulating the input reads
in a series of consecutive stages.
Stage 0: Obtain a sam file of the reads mapped to the reference with read pairs
appearing in consecutive records.
Stage 1: Detect pairing information
Stage 2: Extract "interesting" reads or read pairs
Stage 3: Prepare the reads using an internal naming format
Stage 4: Map the reads to the reference using bowtie2
Stage 5: Clean up the reference mappings
Stage 6: Extract reads for Alu mapping
Stage 7: Map the reads to the set of Alu consensus sequences
Stage 8: Aggregate mappings to reference and mappings to Alus, perform split
mapping, produce raw Alu insertion calls.
Stage 9: Apply basic filters to the raw calls
Stages 4 and 8 are the most time-consuming, and they are fully parallelizable
(see NCPU below). Stage 4 requires most hard disk space, and Stage 8 requires
most RAM.
The syntax for running alu-detect is (also see NOTE on bash in Section 8):
$ [variable assingments] ./alu-detect detect <ngs_name> <reference_name>
<reference_name> should be previously set up as in Section 3.3. The files
created by this run will be prefixed by <ngs_name>.<reference_name>. Most
importantly:
<ngs_name>.<reference_name>.calls.raw contains the raw calls
<ngs_name>.<reference_name>.calls.basic-filters contains raw with basic filters
Variables which affect alu-detect are:
- NCPU : the number of threads alu-detect can use
- CONF : when set to "a", always answer yes to "continue?" prompts
- START_STAGE : skip stages less than this number
- END_STAGE : stop at the end of this stage
- ORIG_UNPAIRED_READS, ORIG_PAIRED_READS, ORIG_READS_1 ORIG_READS_2,
ORIG_READ_GROUPS, ORIG_MAPPINGS : Described in section 4. When the start stage
is 3 or more, these are unnecessary
- XTRACE : save bash xtrace to the given file (use "XTRACE=/dev/fd/2" to save it
to stderr)
- SINGLE_READS_TO_REMAP : do not split reads.to_remap by rg
Sample command lines include:
$ ORIG_MAPPINGS=map.sam.gz NCPU=30 CONF=a ./alu-detect detect NA18507 hg18
$ ORIG_READS_1=fastq/*.1.fq.gz ORIG_READS_2=fastq/*.2.fq.gz \
ORIG_READ_GROUPS=$(for f in fastq/*.1.fq.gz; do basename $f .1.fq.gz; done) \
INPUT_PHRED=33 NCPU=30 CONF=a ./alu-detect detect NA18507 hg18
$ export NCPU=30; CONF=a START_STAGE=3 END_STAGE=8 ./alu-detect detect \
NA18507 hg18
--------------------------------------------------------------------------------
6. Output Specification
--------------------------------------------------------------------------------
6.1. Bed file fields
--------------------
Alu insertion calls are reported in a bed file. Each call is reported as a
confidence interval. The 1-based closed interval [a, b] is reported in bed
format as the 0-based half-open interval [a-1, b). The various columns contain:
$1: Chromosome name
$2: Start of interval in bed format (a-1, if 1-based closed interval is [a, b])
$3: End of interval in bed format (b, if 1-based closed interval is [a, b])
$4: Alu subfamily (comma separated, if ambiguous)
$5: Total alignment score, out of 1000.
$6: Strand
$7: Alu consensus sequence index of the left endpoint of the insertion.
$8: Alu consensus sequence index of the right endpoint of the insertion.
Note: $7<$8 iff positive strand insertion.
$9: Number of reads+readpairs supporting insertion.
$10: Number of reads+readpairs capturing the left end of the insertion.
$11: Number of reads+readpairs capturing the right end of the insertion.
$12: If both $10 and $11 are non-zero, the length of the TSD.
Note: If $12>0, the insertion follows the canonical format, where the right
breakpoint ($3) captures the left end of the insertion, and the left breakpoint
captures its right end. If $12<0, this is a non-canonical insertion with a
target site loss, rather than duplication.
$13: Score achieved by aligning all reads to the reference only (we sometimes
refer to this as "the null hypothesis").
6.2. Confidence intervals
-------------------------
If $10 == 0 and $11 == 0, then neither breakpoint is detected, and the
confidence interval is estimated using the pairing information and the location
of the mapped portions of the reads on the reference.
If $10 > 0 and $11 == 0, only the breakpoint capturing the left end of the
insertion is detected. If this is between reference positions x and x+1, then
the 1-based confidence interval is [x-50,x+1], which in bed format becomes the
0-based [x-51,x+1) (**)
If $10 == 0 and $11 > 0, only the breakpoint capturing the right end of the
insertion is detected. If this is between reference positions x and x+1, then
the 1-based confidence interval is [x,x+51], which in bed format becomes the
0-based [x-1,x+51) (**)
(**): Unintuitively at first sight, the breakpoint capturing the left end of the
insertion is near the right end of the confidence interval, and the breakpoint
capturing the right end of the insertion is near the left end of the confidence
interval. The reason for this convention is that this is the normal case for Alu
insertions exhibiting a TSD.
If $10 > 0 and $11 > 0 and $12 == 0, then the call was made with evidence for
both breakpoints, but at least one of them was ambiguous. E.g., this would
happen if an equal number of reads supported 2 different breakpoints for the
left end of the Alu insertion.
If $10 > 0 and $11 > 0 and $12 > 0, then both breakpoints were reliably detected
and the Alu insertion has the standard form, described in (**).
If $10 > 0 and $11 > 0 and $12 < 0, then both breakpoints were reliably
detected, but the Alu insertion exhibits a target site loss instead of a
duplication. In other words, contrary to (**), in this case the left end of the
insertions attaches to the left end of the confidence interval, the right end of
the insertion attaches to the right end of the confidence interval, and the
contents of the confidence interval are presumably lost. This shouldn't be taken
a priori as evidence of a deletion, only of a possible alignment.
--------------------------------------------------------------------------------
7. Output Filtering
--------------------------------------------------------------------------------
By default, alu-detect produces a list of raw Alu insertion calls. These should
be filtered to reduce false positives.
7.1. Available Filters
----------------------
filter-length x: inner length >= x. For positive strand insertions, inner length
= $8-$7+1. For negative stranf insertions, inner length = $7 - $8 + 1. (See
Section 6.1.)
filter-support x: the number of reads/read pairs supporting the call is at least
x.
filter-weak-null x: the difference between the score of the insertion ($5) and
the score of the null hypthesis ($12) is at least x. Note: maximum score is
1000.
filter-ci-length x: confidence interval length ($3-$2+1) <= x.
filter-ref-alu-bp: remove Alu insertion calls with only one breakpoint detected,
when this breakpoint is near an analogous breakpoint of a reference Alu.
7.2. Default filtering
----------------------
By default, alu-detect produces (in addition to the list of raw calls), a list
of calls filtered using some reasonable defaults:
- filter-ref-alu-bp
- filter-length 150
- filter-support 10
- filter-weak-null 200 (i.e, 20% of max)
- filter-ci-length 1100
The filtered calls will be generated by detect, and placed in the file:
<prefix>.<ref_name>.calls.basic-filters.bed
7.3. Using a fake reference
---------------------------
While the filtering in 7.2 can be adequate, alu-detect incorporates a better
filtering technique. This consists of several conceptual steps:
- Identify novel-looking Alus in the reference
- Remove them, contructing a "fake" reference
- Run detection against both the real and the fake reference
- Estimate precision and recall of various filters using the calls on the fake
reference
- Apply those filters to the real calls
REQUIREMENT (*): To use this technique, you need to have the original reads in
fastq format. Mappings to the reference are *not* enough. The reason is that
many more reads will be mapped discrodantly to the fake reference than to the
real reference. You *must* remap the read set to the fake reference for this
filtering technique to be any useful.
Otherwise, these steps are mostly automated. To use them:
- Create a fake reference using the instructions in Section 3.4 and 3.5. E.g.:
- Run detection against both the real and the fake reference. E.g.:
$ NCPU=30 CONF=a ORIG_MAPPINGS=map.hg18.bam ./alu-detect detect NA18507 hg18
$ NCPU=30 CONF=a ORIG_READS_1=reads_1.fq.gz ORIG_READS_2=reads_2.fq.gz \
INPUT_PHRED=33 ./alu-detect detect NA18507 fake_hg18
If you mapped the reads by hand to the fake reference, you can replace the
second step by:
$ NCPU=30 CONF=a ORIG_MAPPINGS=map.fake_hg18.bam ./alu-detect detect \
NA18507 fake_hg18
NOTE: As explained above (*), the following is wrong:
$ NCPU=30 CONF=a ORIG_MAPPINGS=map.hg18.bam ./alu-detect detect \
NA18507 fake_hg18
alu-detect cannot detect the conceptual error of mixing up references, and will
not complain. But the results of doing this are undefined. (Most likely no
suitable filters will be found.)
- Filter the calls on the 2 references together using the program
filter-calls. Its arguments are: <ngs_name> <real_ref_name> <fake_ref_name>
<real_ref_calls_dir> <fake_ref_calls_dir>. E.g.,
$ NCPU=30 CONF=a ./alu-detect filter-calls NA18507 hg18 fake_hg18 ./ ./
The program will create several files and among them, 2 filtered call files:
NA18507.hg18.calls.filtered.bed and NA18507.fake_hg18.calls.filtered.bed.
If you cannot get a list of known novel insertions (as explained in 3.5), you
must determine a good minimum precision threshold "by hand". For this you need
to familiarize yourself with the various file formats used by
alu-detect. Notably, the file table.<ngs_name>.csv produced by filter-calls
contains precision in column $8. Once you determine a value using that file,
edit bin/get-top-filter and replace the minimum precision value there.
--------------------------------------------------------------------------------
8. Notes
--------------------------------------------------------------------------------
NOTE on bash: When using bash as a shell, assignments of the form X=Y which
precede a command are treated as assigning Y to shell variable X. Assignments
following a command are instead passed verbatim as parameters. Thus:
$ NCPU=30 ./alu-detect detect
is correct, whereas
$ ./alu-detect detect NCPU=30
is not.