Skip to content

4. EN TEx ATAC seq data: Downstream analysis

georginafp edited this page May 13, 2021 · 5 revisions

1. Move to folder ATAC-seq, and create folders to store bigBed data files and peaks analyses files. Make sure the files are organized in a consistent way as done for ChIP-seq.

1.1. Run the docker

sudo docker run -v $PWD:$PWD -w $PWD --rm -it dgarrimar/epigenomics_course

1.2. Clone the epigenomics repository and move to different directories inside the docker

git clone https://github.com/bborsari/epigenomics_uvic
cd epigenomics_uvic
cd ATAC-seq ls

1.3. Go to ENCODE (ENTEX) to look for our donant (ENCDO451RUA) information and retrive the metadata. Specifically, all the experiments regarding DNA accessibility for sigmoid colon and stomach made under the GRCh38 genome version, and once downloaded we need to move the file to our docker.

cp /home/me/Downloads/files.txt /home/me/epigenomics_uvic/ATAC-seq/files.txt

1.4. Move to folder ATAC-seq, and create folders to store bigBed data files and peaks #analyses files. Make sure the files are organized in a consistent way as done for ChIP-seq. Moreover, using the first line (link directs us to our experiment) we download the files from the experiment. Finally, we subset the fields in which we are interested in (from our metadata file)

./bin/download.metadata.sh "https://www.encodeproject.org/metadata/?type=Experiment&replicates.library.biosample.donor.uuid=d370683e-81e7-473f-8475-7716d027849b&status=released&assay_slims=DNA+accessibility&biosample_ontology.term_name=sigmoid+colon&biosample_ontology.term_name=stomach&assembly=GRCh38&assay_title=ATAC-seq" 

head -1 metadata.tsv | awk 'BEGIN{FS=OFS="\t"}{for (i=1;i<=NF;i++){print $i, i}}'

1.5 Organize a few the things

mkdir analyses
cd analyses
mkdir data
mkdir data/bigBed.files data/bigWig.files
ls

2. Retrieve from a newly generated metadata file ATAC-seq peaks (bigBed narrow, pseudoreplicated peaks, assembly GRCh38) for stomach and sigmoid_colon for the same donor used in the previous sections. Hint: have a look at what we did here. Make sure your md5sum values coincide with the ones provided by ENCODE.

2.1. Now that things are organized one can download the BigBed and bigWig files. The thing is that one wants the peaks from ATAC-seq experiments, so there is the need to extract the necessary information

  • Sigmoid colon ID obtained = ENCFF28TUHP
  • Stomach ID obtained = ENCFF762IFP
grep -F "bigBed_narrow" metadata.tsv | grep -F "pseudoreplicated_peaks" |\ 
grep -F "GRCh38" |awk 'BEGIN{FS=OFS="\t"}{print $1, $10, $22}' |\ 
sort -k2,2 -k1,1r | sort -k2,2 -u > bigBedATAC_peaks.txt

2.2. Move and download them

mv bigBedATAC_peaks.txt /home/me/epigenomics_uvic/ATAC-seq/analyses/data/bigBed.files

cut -f1 analyses/data/bigBed.files/bigBedATAC_peaks.txt |\
 while read filename; do   wget -P data/bigBed.files "https://www.encodeproject.org/files/$filename/@@download/$filename.bigBed"; |\
done

2.3 Check the integrity of the downloaded files in order to verify their MD5 hash, whose is a kind of digital fingerprint of the file. When no output is generated after the next command, means that they coincide so everything is fine.

With the code below a md5sum.txt file is obtained so that the names can be compared to the ones stored in bigBedATAC_peaks.txt file.

../bin/selectRows.sh <(cut -f1 analyses/"$file_type"ATAC*.txt) metadata.tsv |\
cut -f1,45 > data/"$file_type".files/md5sum.txt cat data/"$file_type".files/md5sum.txt |\
while read filename original_md5sum; do md5sum data/"$file_type".files/"$filename"."$file_type" |\
awk -v filename="$filename" -v original_md5sum="$original_md5sum" 'BEGIN{FS=" "; OFS="\t"}{print filename, original_md5sum, $1}' |\
done > tmp mv tmp data/"$file_type".files/md5sum.txt awk '$2!=$3' data/"$file_type".files/md5sum.txt done

2.4. Convert the files to BED files format with the bigBedToBed command:

cut -f1 analyses/bigBedATAC_peaks.txt |\
while read filename; do   bigBedToBed data/bigBed.files/"$filename".bigBed data/bed.files/"$filename".bed |\
done

2.5. Create an annotation folder for further steps and download from a link the list of promoters ([-2 kb, +2 Kb] from TSS) of protein-coding genes. Store this file inside the annotation folder (called gencode.v24.protein.coding.non.redundant.TSS.bed)

mkdir annotation
mv /home/me/Downloads/gencode.v24.protein.coding.non.redundant.TSS.bed /home/me/epigenomics_uvic/ATAC-seq/
mv gencode.v24.protein.coding.non.redundant.TSS.bed /home/me/epigenomics_uvic/ATAC-seq/annotation

3. For each tissue, run an intersection analysis using BEDTools report. Notice that the intersect function allows us to intersect two databases for retrieving those peaks from the first one that overlap the second one.

3.1. Retrieve the number of peaks that intersect promoter regions

The different coordinates from the promoter regions are in the annotation folder created above, specifically in the filee called "gencode.v24.protein.coding.non.redundant.TSS.bed". From there, an intersection is applied to overlap the bed files obtained from each tissue and the file mentioned before, so that the number of peaks that intersect between both are retrieved.

cat analyses/bigBedATAC_peaks.txt |\
while read filename tissue; do bedtools intersect -a data/bed.files/"$filename".bed -b annotation/gencode.v24.protein.coding.non.redundant.TSS.bed -u |\
cut -f7 | sort -u > analyses/genes.with.peaks."$tissue".ATAC.txt done

3.1.1. Check that the files are there

cd analyses 
ls 

3.1.2. Number of genes for SIGMOID COLON:

  • Result: 38071
wc -l genes.with.peaks.sigmoid_colon.ATAC.txt

3.1.3. Number of genes for STOMACH:

  • Result: 33169
wc -l genes.with.peaks.stomach.ATAC.txt

3.2. The number of peaks that fall outside gene coordinates (whole gene body, not just the promoter regions). Hint: have a look at what we did here and here.

To do so, the procedure will be the same than the one applied before, but instead of retrieving the peaks that overlap between both data, we are about to retrieve the ones that do NOT overlap. Besides, the argument that one is going to use will retrieve the entries in the first database that do NOT overlap with the second. Hence, the argument to use from the intersect function will be "-v".

3.2.1. Downloaded the GENCODE reference files, but notice that one should point out that there is the need to be in the same version used by ENCODE (v24 in our case), named: named gencode.v24.primary_assembly.annotation.gtf.gz

mv /home/me/Downloads/gencode.v24.primary_assembly.annotation.gtf.gz /home/me/epigenomics_uvic/ATAC-seq/annotation

3.2.2. Prepare a BED file with gene body coordinates of protein-coding genes, and uncompress the gtf.gz file

gunzip annotation/gencode.v24.primary_assembly.annotation.gtf.gz

3.2.3. Retrieve the protein coding genes for getting the gene coordinates

awk '$3=="gene"' annotation/gencode.v24.primary_assembly.annotation.gtf |\
grep -F "protein_coding" |\
cut -d ";" -f1 |\
awk 'BEGIN{OFS="\t"}{print $1, $4, $5, $10, 0, $7, $10}' |\
sed 's/\"//g' |\
awk 'BEGIN{FS=OFS="\t"}$1!="chrM"{$2=($2-1); print $0}' > annotation/gencode.v24.protein.coding.gene.body.bed

3.2.4. Check how it looks like

head annotation/gencode.v24.protein.coding.gene.body.bed

3.2.5 Retrieve peaks that fall outside gene coordinates

cat analyses/bigBedATAC_peaks.txt |\
while read filename tissue; do bedtools intersect -a data/bed.files/"$filename".bed -b annotation/gencode.v24.protein.coding.gene.body.bed -v |\
sort > analyses/genes.with.peaks."$tissue".ATAC_outside.bed

3.2.6. Number of peaks for SIGMOID COLON:

  • Result: 34537
wc -l genes.with.peaks.sigmoid_colon.ATAC_outside.txt

3.2.7. Number of peaks for for STOMACH:

  • Result: 37035
wc -l genes.with.peaks.stomach.ATAC_outside.txt
Clone this wiki locally