Permutations Workflow.Rmd

---
title: "Workflow for ASD Data using MAST::zlm and Permutations"
author: "Isabel Marleen Pötzsch"
date: "17 7 2019"
output: html_document
---

#R setup

##Load Libraries and Data
Firstly, all relevant libraries, the raw data and the additional meta data are loaded. 
```{r}
knitr::opts_chunk$set(echo = TRUE)
library( ggplot2 )
library( Matrix )
library( locfit )
library( purrr )
library( furrr )
library( tidyverse )
library(MAST)
library(lme4)

#Load data
cellinfo <- read.delim( "~/Desktop/ASD/rawMatrix/meta.txt", stringsAsFactors=FALSE )
counts <- Seurat::Read10X("~/Desktop/ASD/rawMatrix")

```

##Subset and Normalise Counts
Then the counts are subset to include only genes that are expressed in at least one cell and cells that express at least 500 genes
```{r}
#Subset counts for expressed genes and to match cellinfo
counts <- counts[ rowSums(counts) > 0,  ]
counts <- counts[ , cellinfo$cell ]

#Normalising counts
f <- log1p(t(t(counts) / Matrix::colSums(counts))*10^6)
counts_norm <- f/log(2, base=exp(1))
```

##Scale Factors
To improve results using the zlm function, numeric factors are scaled and included in a data frame called cellinfo_scale.
```{r}
#Scale factors for zlm
cellinfo_scale <- data.frame(
  sample = cellinfo$sample,
  diagnosis = cellinfo$diagnosis,
  genes = scale(cellinfo$genes),
  age = scale(cellinfo$age),
  sex = cellinfo$sex,
  RIN = scale(cellinfo$RNA.Integrity.Number),
  PMI = scale(cellinfo$post.mortem.interval..hours.),
  region = cellinfo$region,
  Capbatch = cellinfo$Capbatch,
  Seqbatch = cellinfo$Seqbatch,
  ind = cellinfo$individual,
  ribo_perc = scale(cellinfo$RNA.ribosomal.percent))
```

#MAST analysis

##Create SCA object of cells in L2/3 cluster of one gene
For the MAST analysis, we create a SingleCellExperiment object first, including all cells in the L2/3 cluster, as given in cellinfo, remaining after preliminary filtering steps. This is converted into a SingleCellAssay object as required in the MAST package. This SingleCellAssay object is subset to one gene to exemplify the work flow and cut down on computational demand.
```{r}
#SCE object for L2/3 cells
sce_scale <- SingleCellExperiment( assays = 
                list( counts = counts_norm[, cellinfo$cell[ cellinfo$cluster == "L2/3"] ] ), 
                colData= cellinfo_scale[ cellinfo$cluster == "L2/3", ] )

#Transformation to SCA object
sca_scale <- as(sce_scale, 'SingleCellAssay')

#Function that subsets the SCA object for one gene
create_sca_gene <- function(g) {
  sca_scale <- sca_scale[ g, ]
  assay( sca_scale ) <- as.matrix( assay( sca_scale ) )
  return(sca_scale)
}

#gene set to one arbitrarily chosen gene
gene <- "TTF2"

#Subsetting of full SCA to one gene
sca <- create_sca_gene(gene)
```

##Fit ZLM
After creation of an appropriate SingleCellAssay object, a zero-inflated linear model is fitted via the zlm function from the MAST package and evaluated with a likelihood ratio test.
```{r}

#zlm fitting for SCA object created before
zlm <- MAST::zlm( ~diagnosis + (1|ind) + genes + age + sex + RIN + PMI + 
                        region + Capbatch + Seqbatch + ribo_perc, sca, method = "glmer", 
                        ebayes = F, silent=T, fitArgsD = list(nAGQ = 0))
#likelihood ratio test to obtain p-Values
lrTest( zlm, Hypothesis( "diagnosisControl" ))
summary(zlm, doLRT='diagnosisControl')

```

#Permutation Test
To validate the results seen in the zlm a permutation test is applied. This test randomly assigns the diagnoses to the samples, then a zlm is fitted and a likelihood ratio test is performed. A hundred permutations are run and their p-values evaluated in a QQ-plot to see whether MAST truly produces a random spread (which would lead to all points being on the diagonal) or is biased.

#Permutation test on scaled data
```{r}
#pvalue_scale is a list that will contain the one hundred p-values generated in the permutation test.
pvalue_scale <- list()

#The for loop runs the permutation script one hundred times
for (i in seq(1, 100, 1)) {
    scaperm <- sca
    #The diagnosis column is randomly reordered
    colData(scaperm) %>% as_tibble %>% left_join( by = "sample", colData(scaperm) %>% 
        as_tibble %>% dplyr::select( sample, diagnosis_old=diagnosis ) %>% 
        distinct %>% mutate( diagnosisPerm = sample(diagnosis_old) ) ) %>%
        pull( diagnosisPerm ) -> a 
    b <- colData(scaperm)
    b$diagnosisPerm <- a
    #A new SCA object is created with the randomised diagnoses implemented in the colData
    scaperm <- SingleCellExperiment(
      assays = list( counts =  assay(sca) ), colData = b)
    scaperm <- as(scaperm, 'SingleCellAssay')
    
    zlm2 <- MAST::zlm(~ diagnosisPerm + (1|ind) + genes + age + sex + RIN + PMI + 
                        region + Capbatch + Seqbatch + ribo_perc, scaperm, method = "glmer", 
                      ebayes = F, silent=T, fitArgsD = list(nAGQ = 0))
    k <- lrTest( zlm2, Hypothesis("diagnosisPermControl") )
    k <- as.data.frame(k)
    pvalue_scale[[i]] <- k$`hurdle.Pr(>Chisq)`
  }
```
  
#Looking at p-values in QQplot
The p-values generated for each of the permutations are sorted and the -log10 is plotted against the -log10 of a hundred evenly spaced points to generate a QQ plot.
```{r}
plot(-log10(ppoints(100, 0.1)), -log10(sort(unlist(pvalue_scale))), main="QQ Plot of Permutated P-Values against Evenly-Spaced Points", xlab="-log10 of Evenly-Spaced Points", ylab="-log10 of P-Values from Permutation Test" )
abline(0, 1)
```

This QQ plot shows that the p-values generated by a random distribution of diagnoses using MAST are sytematically too small, so that it is possible that the significant values seen in the actual analysis are an artefact of the MAST package that does not properly account for multiple testing as similar values have been generated in the permutation test.