-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathRNA-seq_Plotting.Rmd
242 lines (158 loc) · 6.29 KB
/
RNA-seq_Plotting.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
---
title: "RNA-seq_Plotting"
author: "MTM"
co-author: "Amkeele2"
date: "10/10/2020"
output:
pdf_document: default
html_document: default
---
# Loading Souce Data
Loaded human log2FC data, normalized counts and sample guide from Posfai's Plos Pathogens paper, and lncRNA plus gene neighbor lists from Kaessman's Nature paper.
```{r}
Hs_Log2FC <- read.csv("source_data/Hs_Log2FC.csv", header = TRUE)
lncRNAs_paper <- read.csv("source_data/lncRNAs_filtered_data_Kaessmann2019.csv", header= TRUE)
norm_counts_all <- read.csv("source_data/Norm_counts_all.csv", header = TRUE)
sample_guide <- read.csv("source_data/SampleGuide_data_dds.csv", header = TRUE)
```
Check data
```{r, message=FALSE}
str(Hs_Log2FC)
str(lncRNAs_paper)
str(norm_counts_all)
str(sample_guide)
```
# Clean up and transform dataframes
1. Transform categorical data into factors
```{r}
Hs_Log2FC$Cell.line <- as.factor(Hs_Log2FC$Cell.line)
Hs_Log2FC$time.point <- as.factor(Hs_Log2FC$time.point)
sample_guide$Cell.line <- as.factor(sample_guide$Cell.line)
sample_guide$time.groups2 <- as.factor(sample_guide$time.groups2)
```
2. Rename some columns for simplicity.
```{r}
lncRNAs_paper <- dplyr::rename(lncRNAs_paper, Ensembl.ID=ENSEMBL75.id)
norm_counts_all <- dplyr::rename(norm_counts_all, Ensembl.ID=X )
sample_guide <- dplyr::rename(sample_guide, sample.ID=X)
```
# Plotting DEGs
1. Create variables that contain threshold criteria. The lfc.cutoff is set to 0.58; remember that we are working with log2 fold changes so this translates to an actual fold change of 1.5.
```{r}
### Set thresholds
padj.cutoff <- 0.05
logfc.cutoff <- 0.58
```
2. Create vector that helps us identify the genes that meet our criteria:
```{r}
threshold<-Hs_Log2FC$adjusted.p.value<padj.cutoff & abs(Hs_Log2FC$log2FoldChange)>logfc.cutoff
```
We now have a logical vector of values that has a length which is equal to the total number of genes in the dataset. The elements that have a TRUE value correspond to genes that meet the criteria (and FALSE means it fails). How many genes are differentially expressed in infected compared to Control, given our criteria specified above? Does this reduce our results?
```{r}
length(which(threshold))
```
To add this vector to our results table we can use the $ notation to create the column on the left hand side of the assignment operator, and the assign the vector to it instead of using cbind():
```{r}
Hs_Log2FC$threshold<-threshold
```
Now we can easily subset the results table to only include those that are significant using the subset() function:
```{r}
sig_DE_all<-data.frame(subset(Hs_Log2FC,threshold==TRUE))
```
## Subset lncRNAs from sig_DE_all
1. Need to clean up lncRNA table
```{r}
#check how many lncRNAs are in table from paper
length(lncRNAs_paper$Ensembl.ID)
#remove na in r - remove rows - na.omit function / option
lncRNAs_paper <-na.omit(lncRNAs_paper)
```
NA values have been removed. Here I am not sure if I removed IDs that did had an NAs in some other column, but I think it doesn't matter.
2.Subset lncRNAs
```{r}
sig_DE_lncRNA <- sig_DE_all[sig_DE_all$Ensembl.ID %in% lncRNAs_paper$Ensembl.ID,]
print(sig_DE_lncRNA)
```
## Visualizing DE lncRNAs
Load packages that will be needed
```{r, message=FALSE}
# Load libraries
library(reshape)
library(ggplot2)
library(RColorBrewer)
install.packages("pheatmap")
library(pheatmap)
```
Before it gets more complicated, I am going to split my data per cell line.
```{r}
sig_DE_lncRNA_hepg2<-sig_DE_lncRNA[sig_DE_lncRNA$Cell.line=="HepG2",]
sig_DE_lncRNA_huh7<-sig_DE_lncRNA[sig_DE_lncRNA$Cell.line=="HuH7",]
```
I am going to focus on HepG2 for now
Extract the normalized count values for these genes by subsetting.
```{r}
sig_DE_lncRNA_hepg2_counts <- norm_counts_all[norm_counts_all$Ensembl.ID %in% sig_DE_lncRNA_hepg2$Ensembl.ID,]
# use first column for row names
sig_DE_lncRNA_hepg2_counts <- data.frame(sig_DE_lncRNA_hepg2_counts, row.names = 1)
```
Backtrack and get only HepG2 samples
```{r}
#transform analysis.groups to factors
sample_guide$analysis.groups <- as.factor(sample_guide$analysis.groups)
#extract HepG2 samples
HepG2samples<-sample_guide[sample_guide$Cell.line=="HepG2",c("sample.ID","analysis.groups")]
```
```{r}
remove(siglncHepG2_counts)
remove(counts_sigLncHepG2)
```
```{r, message=FALSE}
library(data.table)
# get data
data(sig_DE_lncRNA_hepg2_counts)
# transpose
t_sigHepG2counts <- transpose(sig_DE_lncRNA_hepg2_counts)
colnames(t_sigHepG2counts) <- rownames(sig_DE_lncRNA_hepg2_counts)
rownames(t_sigHepG2counts) <- colnames(sig_DE_lncRNA_hepg2_counts)
#make rownames a new column
setDT(t_sigHepG2counts, keep.rownames = "sample.ID")
```
Now subset
THERE IS AN ERROR IN THE SIG_HEPG2 COUNTS SAMPLE IDs. An X was added at some point in front of the name.
```{r}
#subset by indexing columns
sigHepG2counts<-sig_DE_lncRNA_hepg2_counts[,c(28:40)]
```
# Plot
```{r}
#load packages
library(pheatmap)
library(dplyr)
### Set annotation
HepG2samples$sample.ID = paste0('X', HepG2samples$sample.ID) #fix names in sample guide
HepG2samples<-select(HepG2samples,analysis.groups)
row.names(HepG2samples) <- colnames(sigHepG2counts)
HepG2samples$analysis.groups <- factor(HepG2samples$analysis.groups, levels = c("HepG2.control","HepG2.early","HepG2.mid","HepG2.late")) #change order of legend display
### Set a color palette
heat.colors <- brewer.pal(11, "PiYG")
## Run pheatmap
pheatmap(sigHepG2counts, color = heat.colors, cluster_rows = T, cluster_cols = T,clustering_distance_cols = "correlation", clustering_distance_rows = "correlation", show_rownames=T, show_colnames = F,
border_color=NA, fontsize = 10, scale="row",
fontsize_row = 10, height=5, width = 5, annotation_col = HepG2samples)
```
```{r}
#load packages
library(pheatmap)
library(tidyverse)
library(RColorBrewer)
#relocate columns (samples) in df to allow pheatmap to plot in order
sigHepG2counts <- sigHepG2counts %>% relocate(X50040_N_1,X50039_N_1,X50038_N_1,X50036_N_1)
### Set a color palette
heat.colors <- brewer.pal(11, "PiYG")
## Run pheatmap
pheatmap(sigHepG2counts,
color = heat.colors, cluster_rows = T, cluster_cols = T, clustering_distance_rows = "correlation", clustering_distance_cols = "correlation", show_rownames = T, show_colnames = F,
border_color = NA, fontsize = 10, scale = "row",
fontsize_row = 10, height = 5, width = 5, annotation_col = HepG2samples
)
```