RNA-seq_Plotting.Rmd

---
title: "RNA-seq_Plotting"
author: "MTM"
co-author: "Amkeele2"
date: "10/10/2020"
output:
  pdf_document: default
  html_document: default
---

# Loading Souce Data 

Loaded human log2FC data, normalized counts and sample guide from Posfai's Plos Pathogens paper, and lncRNA plus gene neighbor lists from Kaessman's Nature paper.

```{r}
Hs_Log2FC <- read.csv("source_data/Hs_Log2FC.csv", header = TRUE)
lncRNAs_paper <- read.csv("source_data/lncRNAs_filtered_data_Kaessmann2019.csv", header= TRUE)
norm_counts_all <- read.csv("source_data/Norm_counts_all.csv", header = TRUE)
sample_guide <- read.csv("source_data/SampleGuide_data_dds.csv", header = TRUE)

```

Check data

```{r, message=FALSE}
str(Hs_Log2FC)
str(lncRNAs_paper)
str(norm_counts_all)
str(sample_guide)
```

# Clean up and transform dataframes

1. Transform categorical data into factors
```{r}
Hs_Log2FC$Cell.line <- as.factor(Hs_Log2FC$Cell.line)
Hs_Log2FC$time.point <- as.factor(Hs_Log2FC$time.point)
sample_guide$Cell.line <- as.factor(sample_guide$Cell.line)
sample_guide$time.groups2 <- as.factor(sample_guide$time.groups2)
```

2. Rename some columns for simplicity.

```{r}
lncRNAs_paper <- dplyr::rename(lncRNAs_paper, Ensembl.ID=ENSEMBL75.id)
norm_counts_all <- dplyr::rename(norm_counts_all, Ensembl.ID=X )
sample_guide <- dplyr::rename(sample_guide, sample.ID=X)
```


# Plotting DEGs

1. Create variables that contain threshold criteria. The lfc.cutoff is set to 0.58; remember that we are working with log2 fold changes so this translates to an actual fold change of 1.5.

```{r}
### Set thresholds

padj.cutoff <- 0.05
logfc.cutoff <- 0.58
```

2. Create vector that helps us identify the genes that meet our criteria:

```{r}
threshold<-Hs_Log2FC$adjusted.p.value<padj.cutoff & abs(Hs_Log2FC$log2FoldChange)>logfc.cutoff
```

We now have a logical vector of values that has a length which is equal to the total number of genes in the dataset. The elements that have a TRUE value correspond to genes that meet the criteria (and FALSE means it fails). How many genes are differentially expressed in infected compared to Control, given our criteria specified above? Does this reduce our results?

```{r}
length(which(threshold))
```

To add this vector to our results table we can use the $ notation to create the column on the left hand side of the assignment operator, and the assign the vector to it instead of using cbind():

```{r}
Hs_Log2FC$threshold<-threshold
```

Now we can easily subset the results table to only include those that are significant using the subset() function:

```{r}
sig_DE_all<-data.frame(subset(Hs_Log2FC,threshold==TRUE))
```

## Subset lncRNAs from sig_DE_all

1. Need to clean up lncRNA table

```{r}
#check how many lncRNAs are in table from paper

length(lncRNAs_paper$Ensembl.ID)

#remove na in r - remove rows - na.omit function / option

lncRNAs_paper <-na.omit(lncRNAs_paper)

```

NA values have been removed. Here I am not sure if I removed IDs that did had an NAs in some other column, but I think it doesn't matter. 

2.Subset lncRNAs

```{r}
sig_DE_lncRNA <- sig_DE_all[sig_DE_all$Ensembl.ID %in% lncRNAs_paper$Ensembl.ID,]
print(sig_DE_lncRNA)
```

## Visualizing DE lncRNAs

Load packages that will be needed
```{r, message=FALSE}

# Load libraries

library(reshape)
library(ggplot2)
library(RColorBrewer)
install.packages("pheatmap")
library(pheatmap)

```

Before it gets more complicated, I am going to split my data per cell line.

```{r}
sig_DE_lncRNA_hepg2<-sig_DE_lncRNA[sig_DE_lncRNA$Cell.line=="HepG2",]
sig_DE_lncRNA_huh7<-sig_DE_lncRNA[sig_DE_lncRNA$Cell.line=="HuH7",]
```

I am going to focus on HepG2 for now

Extract the normalized count values for these genes by subsetting.

```{r}
sig_DE_lncRNA_hepg2_counts <- norm_counts_all[norm_counts_all$Ensembl.ID %in% sig_DE_lncRNA_hepg2$Ensembl.ID,]

# use first column for row names

sig_DE_lncRNA_hepg2_counts <- data.frame(sig_DE_lncRNA_hepg2_counts, row.names = 1)

```

Backtrack and get only HepG2 samples

```{r}
#transform analysis.groups to factors

sample_guide$analysis.groups <- as.factor(sample_guide$analysis.groups)

#extract HepG2 samples

HepG2samples<-sample_guide[sample_guide$Cell.line=="HepG2",c("sample.ID","analysis.groups")]
```

```{r}
remove(siglncHepG2_counts)
remove(counts_sigLncHepG2)
```

```{r, message=FALSE}
library(data.table)

# get data

  data(sig_DE_lncRNA_hepg2_counts)

# transpose
  
  t_sigHepG2counts <- transpose(sig_DE_lncRNA_hepg2_counts)

colnames(t_sigHepG2counts) <- rownames(sig_DE_lncRNA_hepg2_counts)
  rownames(t_sigHepG2counts) <- colnames(sig_DE_lncRNA_hepg2_counts)  
  
#make rownames a new column
  
  setDT(t_sigHepG2counts, keep.rownames = "sample.ID")
```

Now subset

THERE IS AN ERROR IN THE SIG_HEPG2 COUNTS SAMPLE IDs. An X was added at some point in front of the name.

```{r}
#subset by indexing columns

sigHepG2counts<-sig_DE_lncRNA_hepg2_counts[,c(28:40)]
```

# Plot

```{r}
#load packages

library(pheatmap)
library(dplyr)

### Set annotation

HepG2samples$sample.ID = paste0('X', HepG2samples$sample.ID) #fix names in sample guide
HepG2samples<-select(HepG2samples,analysis.groups)
row.names(HepG2samples) <- colnames(sigHepG2counts)
HepG2samples$analysis.groups <- factor(HepG2samples$analysis.groups, levels = c("HepG2.control","HepG2.early","HepG2.mid","HepG2.late")) #change order of legend display

### Set a color palette
heat.colors <- brewer.pal(11, "PiYG")

## Run pheatmap
pheatmap(sigHepG2counts, color = heat.colors, cluster_rows = T, cluster_cols = T,clustering_distance_cols = "correlation", clustering_distance_rows = "correlation", show_rownames=T, show_colnames = F,
 border_color=NA, fontsize = 10, scale="row",
     fontsize_row = 10, height=5, width = 5, annotation_col = HepG2samples)
```

```{r}

#load packages

library(pheatmap)
library(tidyverse)
library(RColorBrewer)

#relocate columns (samples) in df to allow pheatmap to plot in order

sigHepG2counts <- sigHepG2counts %>% relocate(X50040_N_1,X50039_N_1,X50038_N_1,X50036_N_1)

### Set a color palette
heat.colors <- brewer.pal(11, "PiYG")

## Run pheatmap
pheatmap(sigHepG2counts,
  color = heat.colors, cluster_rows = T, cluster_cols = T, clustering_distance_rows = "correlation", clustering_distance_cols = "correlation", show_rownames = T, show_colnames = F,
  border_color = NA, fontsize = 10, scale = "row",
  fontsize_row = 10, height = 5, width = 5, annotation_col = HepG2samples
)
```