-
Notifications
You must be signed in to change notification settings - Fork 0
4. EN TEx ATAC seq data: Downstream analysis
1. Move to folder ATAC-seq, and create folders to store bigBed data files and peaks analyses files. Make sure the files are organized in a consistent way as done for ChIP-seq.
sudo docker run -v $PWD:$PWD -w $PWD --rm -it dgarrimar/epigenomics_course
git clone https://github.com/bborsari/epigenomics_uvic
cd epigenomics_uvic
cd ATAC-seq ls
1.3. Go to ENCODE (ENTEX) to look for our donant (ENCDO451RUA) information and retrive the metadata. Specifically, all the experiments regarding DNA accessibility for sigmoid colon and stomach made under the GRCh38 genome version, and once downloaded we need to move the file to our docker.
cp /home/me/Downloads/files.txt /home/me/epigenomics_uvic/ATAC-seq/files.txt
1.4. Move to folder ATAC-seq, and create folders to store bigBed data files and peaks #analyses files. Make sure the files are organized in a consistent way as done for ChIP-seq. Moreover, using the first line (link directs us to our experiment) we download the files from the experiment. Finally, we subset the fields in which we are interested in (from our metadata file)
./bin/download.metadata.sh "https://www.encodeproject.org/metadata/?type=Experiment&replicates.library.biosample.donor.uuid=d370683e-81e7-473f-8475-7716d027849b&status=released&assay_slims=DNA+accessibility&biosample_ontology.term_name=sigmoid+colon&biosample_ontology.term_name=stomach&assembly=GRCh38&assay_title=ATAC-seq"
head -1 metadata.tsv | awk 'BEGIN{FS=OFS="\t"}{for (i=1;i<=NF;i++){print $i, i}}'
mkdir analyses
cd analyses
mkdir data
mkdir data/bigBed.files data/bigWig.files
ls
2. Retrieve from a newly generated metadata file ATAC-seq peaks (bigBed narrow, pseudoreplicated peaks, assembly GRCh38) for stomach and sigmoid_colon for the same donor used in the previous sections. Hint: have a look at what we did here. Make sure your md5sum values coincide with the ones provided by ENCODE.
2.1. Now that things are organized one can download the BigBed and bigWig files. The thing is that one wants the peaks from ATAC-seq experiments, so there is the need to extract the necessary information
Sigmoid colon ID obtained = ENCFF28TUHP
Stomach ID obtained = ENCFF762IFP
grep -F "bigBed_narrow" metadata.tsv | grep -F "pseudoreplicated_peaks" |\
grep -F "GRCh38" |awk 'BEGIN{FS=OFS="\t"}{print $1, $10, $22}' |\
sort -k2,2 -k1,1r | sort -k2,2 -u > bigBedATAC_peaks.txt
mv bigBedATAC_peaks.txt /home/me/epigenomics_uvic/ATAC-seq/analyses/data/bigBed.files
cut -f1 analyses/data/bigBed.files/bigBedATAC_peaks.txt |\
while read filename; do wget -P data/bigBed.files "https://www.encodeproject.org/files/$filename/@@download/$filename.bigBed"; |\
done
2.3 Check the integrity of the downloaded files in order to verify their MD5 hash, whose is a kind of digital fingerprint of the file. When no output is generated after the next command, means that they coincide so everything is fine.
With the code below a md5sum.txt file is obtained so that the names can be compared to the ones stored in bigBedATAC_peaks.txt file.
../bin/selectRows.sh <(cut -f1 analyses/"$file_type"ATAC*.txt) metadata.tsv |\
cut -f1,45 > data/"$file_type".files/md5sum.txt cat data/"$file_type".files/md5sum.txt |\
while read filename original_md5sum; do md5sum data/"$file_type".files/"$filename"."$file_type" |\
awk -v filename="$filename" -v original_md5sum="$original_md5sum" 'BEGIN{FS=" "; OFS="\t"}{print filename, original_md5sum, $1}' |\
done > tmp mv tmp data/"$file_type".files/md5sum.txt awk '$2!=$3' data/"$file_type".files/md5sum.txt done
cut -f1 analyses/bigBedATAC_peaks.txt |\
while read filename; do bigBedToBed data/bigBed.files/"$filename".bigBed data/bed.files/"$filename".bed |\
done
2.5. Create an annotation folder for further steps and download from a link the list of promoters ([-2 kb, +2 Kb] from TSS) of protein-coding genes. Store this file inside the annotation folder (called gencode.v24.protein.coding.non.redundant.TSS.bed)
mkdir annotation
mv /home/me/Downloads/gencode.v24.protein.coding.non.redundant.TSS.bed /home/me/epigenomics_uvic/ATAC-seq/
mv gencode.v24.protein.coding.non.redundant.TSS.bed /home/me/epigenomics_uvic/ATAC-seq/annotation
3. For each tissue, run an intersection analysis using BEDTools report. Notice that the intersect function allows us to intersect two databases for retrieving those peaks from the first one that overlap the second one.
The different coordinates from the promoter regions are in the annotation folder created above, specifically in the filee called "gencode.v24.protein.coding.non.redundant.TSS.bed". From there, an intersection is applied to overlap the bed files obtained from each tissue and the file mentioned before, so that the number of peaks that intersect between both are retrieved.
cat analyses/bigBedATAC_peaks.txt |\
while read filename tissue; do bedtools intersect -a data/bed.files/"$filename".bed -b annotation/gencode.v24.protein.coding.non.redundant.TSS.bed -u |\
cut -f7 | sort -u > analyses/genes.with.peaks."$tissue".ATAC.txt done
cd analyses
ls
Result: 38071
wc -l genes.with.peaks.sigmoid_colon.ATAC.txt
Result: 33169
wc -l genes.with.peaks.stomach.ATAC.txt
3.2. The number of peaks that fall outside gene coordinates (whole gene body, not just the promoter regions). Hint: have a look at what we did here and here.
To do so, the procedure will be the same than the one applied before, but instead of retrieving the peaks that overlap between both data, we are about to retrieve the ones that do NOT overlap. Besides, the argument that one is going to use will retrieve the entries in the first database that do NOT overlap with the second. Hence, the argument to use from the intersect function will be "-v".
3.2.1. Downloaded the GENCODE reference files, but notice that one should point out that there is the need to be in the same version used by ENCODE (v24 in our case), named: named gencode.v24.primary_assembly.annotation.gtf.gz
mv /home/me/Downloads/gencode.v24.primary_assembly.annotation.gtf.gz /home/me/epigenomics_uvic/ATAC-seq/annotation
3.2.2. Prepare a BED file with gene body coordinates of protein-coding genes, and uncompress the gtf.gz file
gunzip annotation/gencode.v24.primary_assembly.annotation.gtf.gz
awk '$3=="gene"' annotation/gencode.v24.primary_assembly.annotation.gtf |\
grep -F "protein_coding" |\
cut -d ";" -f1 |\
awk 'BEGIN{OFS="\t"}{print $1, $4, $5, $10, 0, $7, $10}' |\
sed 's/\"//g' |\
awk 'BEGIN{FS=OFS="\t"}$1!="chrM"{$2=($2-1); print $0}' > annotation/gencode.v24.protein.coding.gene.body.bed
head annotation/gencode.v24.protein.coding.gene.body.bed
cat analyses/bigBedATAC_peaks.txt |\
while read filename tissue; do bedtools intersect -a data/bed.files/"$filename".bed -b annotation/gencode.v24.protein.coding.gene.body.bed -v |\
sort > analyses/genes.with.peaks."$tissue".ATAC_outside.bed
Result: 34537
wc -l genes.with.peaks.sigmoid_colon.ATAC_outside.txt
Result: 37035
wc -l genes.with.peaks.stomach.ATAC_outside.txt