forked from rakarnik/ROSE
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.txt
executable file
·124 lines (99 loc) · 7.4 KB
/
README.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
=====ROSE: RANK ORDERING OF SUPER-ENHANCERS=====
ROSE IS RELEASED UNDER THE MIT X11 LICENSE
EXAMPLE DATA AND ADDTIONAL INFORMATION CAN BE FOUND HERE:
http://younglab.wi.mit.edu/super_enhancer_code.html
N.B. LICENSE.txt
For details of this analysis see:
Master Transcription Factors and Mediator Establish Super-Enhancers at Key Cell Identity Genes
Warren A. Whyte, David A. Orlando, Denes Hnisz, Brian J. Abraham, Charles Y. Lin, Michael H. Kagey, Peter B. Rahl, Tong Ihn Lee and Richard A. Young
Cell 153, 307-319, April 11, 2013
and
Selective Inhibition of Tumor Oncogenes by Disruption of Super-enhancers
Jakob Lovén, Heather A. Hoke, Charles Y. Lin, Ashley Lau, David A. Orlando, Christopher R. Vakoc, James E. Bradner, Tong Ihn Lee, and Richard A. Young
Cell 153, 320-334, April 11, 2013
Please cite these papers when using this code.
SOFTWARE AUTHORS: Charles Y. Lin, David A. Orlando, Brian J. Abraham
CONTACT: young_computation@wi.mit.edu
ACKNOWLEDGEMENTS: Graham Ruby
Developed using Python 2.7.3, R 2.15.3, and SAMtools 0.1.18
PURPOSE: To create stitched enhancers, and to separate super-enhancers from typical enhancers using sequencing data (.bam) given a file of previously identified constituent enhancers (.gff)
1) PREPARATION/REQUIREMENTS
.bam files of sequencing reads for factor of interest and control (WCE/IgG recommended).
.bam files must have chromosome IDs starting with "chr"
.bam files must be sorted and indexed using SAMtools in order for bamToGFF.py to work. (http://samtools.sourceforge.net/samtools.shtml)
Code must be run from directory in which it is stored.
.gff file of constituent enhancers previously identified (gff format ref: https://genome.ucsc.edu/FAQ/FAQformat.html#format3).
.gff must have the following columns:
1: chromosome (chr#)
2: unique ID for each constituent enhancer region
4: start of constituent
5: end of constituent
7: strand (+,-,.)
9: unique ID for each constituent enhancer region
NOTE: if value for column 2 and 9 differ, value in column 2 will be used
2) CONTENTS
ROSE_main.py: main program
ROSE_utils.py: utility methods
ROSE_bamToGFF.py: calculates density of .bam sequencing reads in .gff regions
ROSE_callSuper.R: ranks regions by their densities, creates a cutoff to separate super-enhancers from typical enhancers
ROSE_geneMapper.py: assigns stitched enhancers to genes
annotation/: Refseq gene tables for genomes MM8,MM9,MM10,HG18,HG19
In ROSE_DATA
example.sh: sample call of ROSE_main.py
data/: folder containing an example .gff input file and two example .bam files
example/: folder containing example output generated by example.sh
3) USAGE
Program is run by calling ROSE_main.py
From within root directory:
python ROSE_main.py -g GENOME_BUILD -i INPUT_CONSTITUENT_GFF -r RANKING_BAM -o OUTPUT_DIRECTORY [optional: -s STITCHING_DISTANCE -t TSS_EXCLUSION_ZONE_SIZE -c CONTROL_BAM]
Required parameters:
GENOME_BUILD: one of hg18, hg19, mm8, mm9, or mm10 referring to the UCSC genome build used for read mapping
INPUT_CONSTITUENT_GFF: .gff file (described above) of regions that were previously calculated to be enhancers. I.e. Med1-enriched regions identified using MACS.
RANKING_BAM: .bam file to be used for ranking enhancers by density of this factor. I.e. Med1 ChIP-Seq reads.
OUTPUT_DIRECTORY: directory to be used for storing output.
Optional parameters:
STITCHING_DISTANCE: maximum distance between two regions that will be stitched together (Default: 12.5kb)
TSS_EXCLUSION_ZONE_SIZE: exclude regions contained within +/- this distance from TSS in order to account for promoter biases (Default: 0; recommended if used: 2500). If this value is 0, will not look for a gene file.
CONTROL_BAM: .bam file to be used as a control. Subtracted from the density of the RANKING_BAM. I.e. Whole cell extract reads.
4) CODE PROCEDURE:
ROSE_main.py will:
format output directory hierarchy
Root name of input .gff ([input_enhancer_list].gff) used as naming root for output files.
stitch enhancer constituents in INPUT_CONSTITUENT_GFF based on STITCHING_DISTANCE and make .gff and .bed of stitched collection
TSS exclusion, if not zero, is attempted before stitching
Names of stitched regions start with number of regions stitched followed by leftmost constituent ID
call bamToGFF.py to get density of RANKING_BAM and CONTROL_BAM in stitched regions and constituents
Maximum time to wait for bamToGFF.py is 12h but can be changed -- quits if running too long
call callSuper.R to sort stitched enhancers by their background-subtracted density of RANKING_BAM and separate into two groups
5) OUTPUT:
All file names begin with the root of INPUT_CONSTITUENT_GFF
**OUTPUT_DIRECTORY/gff/
.gff: copied .gff file of INPUT_CONSTITUENT_GFF
(chrom, name, [blank], start, end, [blank], [blank], strand, [blank], [blank], name)
STITCHED.gff: regions created by stitching together INPUT_CONSTITUENT_GFF at STITCHING_DISTANCE
(chrom, name, [blank], start, end, [blank], [blank], strand, [blank], [blank], name) Name is number of constituents stitched together followed by ID of leftmost constituent.
**OUTPUT_DIRECTORY/mappedGFF/
*_MAPPED.gff: output of bamToGFF using each bam file containing densities of factor in each constituent
(constituent ID, region tested, average read density in units of reads-per-million-mapped per bp of constituent)
*_STITCHED*_MAPPED.gff: output of bamToGFF using each bam file containing densities of factor in each stitched enhancer
(stitched enhancer ID, region tested, average read density in units of reads-per-million-mapped per bp of stitched enhancer)
**OUTPUT_DIRECTORY/
STITCHED_ENHANCER_REGION_MAP.txt: all densities from bamToGFF calculated in stitched enhancers
(stitched enhancer ID, chromosome, stitched enhancer start, stitched enhancer end, number of constituents stitched, rank of RANKING_BAM signal, signal of RANKING_BAM)
Signal of RANKING_BAM is density times length.
*_AllEnhancers.table.txt: Rankings and super status for each stitched enhancer
(stitched enhancer ID, chromosome, stitched enhancer start, stitched enhancer end, number of constituents stitched, size of constituents that were stitched together, signal of RANKING_BAM, rank of RANKING_BAM, binary of super-enhancer (1) vs. typical (0))
Signal of RANKING_BAM is density times length.
*_SuperEnhancers.table.txt: Rankings and super status for super-enhancers
(stitched enhancer ID, chromosome, stitched enhancer start, stitched enhancer end, number of constituents stitched, size of constituents that were stitched together, signal of RANKING_BAM, rank of RANKING_BAM, binary of super-enhancer (1) vs. typical (0))
Signal of RANKING_BAM is density times length.
*_Enhancers_withSuper.bed: .bed file to be loaded into the UCSC browser to visualize super-enhancers and typical enhancers.
(chromosome, stitched enhancer start, stitched enhancer end, stitched enhancer ID, rank by RANKING_BAM signal)
*_Plot_points.png: visualization of the ranks of super-enhancers and the two groups. Stitched enhancers are ranked by their RANKING_BAM signal and their ranks are along the X axis. Corresponding RANKING_BAM signal on the Y axis.
====================================
NOTES:
mapEnhancerFromFactor.py has a debug mode that can be enabled in the beginning of the main function.
Enhancers in INPUT_RANKING_GFF may overlap each other in the input.
This code can be easily parallelized by following the instructions in the main function of mapEnhancerFromFactor around line 369.
Other gene lists may be added if downloaded from UCSC.
Code from external sources are also cited in-line.