08-simplemodelfitting.Rmd

# (PART\*) PUTTING IT TOGETHER {-}

# Simple Model Fitting

## Overview

Now that you are familiar with [`SummarizedExperiment`](summarizedexperiment.html#summarizedexperiment) and [design matrices](design-matrices.html#design-matrices), you can apply what you learned to the `airway` dataset. 

Recall that the `airway` data is from an RNA-Seq experiment on four human airway smooth muscle cell lines treated with dexamethasone. You can learn more about the experiment in @Himes2014. 

## Load & Prepare Data

Load the `airway` dataset. You might need to install it.

```{r, warning = FALSE, message = FALSE}
# AnVIL::install("airway")
library(airway)
data(airway)
```

You should also save the assay data from `airway`. Remember that `assay()` extracts the gene counts.

```{r}
assay_data <- assay(airway)
```

## Creating the Design Matrix

You should create a design matrix like you did in the [previous chapter](design-matrices.html#inpractice). 

Since the dexamethasone treatment is a categorical variable, you need to use a [means](#means) or [mean-reference](#meanref) model. Going forward, we will use a mean-reference model because we are interested in the difference attributable to the treatment.

```{r}
sample_data <- colData(airway)
design_matrix <- model.matrix( ~ sample_data$dex)
```

## Testing with `lmFit` and `eBayes`

Now you start the exciting process of model fitting! You will need the `limma` package.

```{r, warning = FALSE, message = FALSE}
# AnVIL::install("limma")
library(limma)
```

You want to transform the count data before proceeding with model fitting. 

The `voom()` function transforms count data to $\log_{2}$-counts per million (logCPM), estimates the mean-variance relationship, and uses these values to compute appropriate observation-level weights. This is an important step that takes into account the mean-variance relationship in the data. In other words, it factors in that as gene counts (expression) becomes greater, so does the variance. `voom()` also corrects for differences in sequencing depth among samples (libraries). This can happen when one sample starts with more total RNA or the sequencer produces more data for one sample over another.

You should also supply the design matrix.

```{r}
assay_data <- voom(assay_data, design = design_matrix)
```

Next, use the `limma` function `lmFit()`. This function fits multiple linear models by weighted or generalized least squares. A linear model is fitted to the expression data for each gene. As above, you should supply the design matrix.

```{r}
fit <- lmFit(assay_data, design = design_matrix)
```

The code above estimates coefficients, but it's easier to interpret if you have log-odd or p-values. You can use `eBayes()` for this. 

The `eBayes()` function computes moderated t-statistics, moderated F-statistic, and log-odds of differential expression by empirical Bayes moderation of the standard errors, when this function is given the output of `lmFit()`. This allows us to rank genes in order of evidence for differential expression. Using empirical Bayes methods squeezes the gene-wise residual variances toward a common value (or towards a global trend).

```{r}
fit <- eBayes(fit)
```

Now that you have a fitted model and estimated statistics, you can look at individual genes. 

Because the fitted models are based on the design matrix we supplied, "top" genes will be based the difference between `trt` and `untrt`. Remember that these indicate whether or not cells received dexamethasone treatment. The function `topTable()` shows you the top-ranked genes given the model you supplied.

```{r}
topTable(fit)
```

## Interpreting `topTable()` Output

You now have the top genes affected by the treatment. Here's how you can interpret the output.

- **logFC**: estimate of the log2-fold-change corresponding to the effect of the treatment. Negative values mean the second treatment, in this case `untrt`, has lower expression than the baseline, in this case `trt`. Take a look at the design matrix to confirm.

- **AveExpr	**: average log2-expression for the gene over all samples.

- **t**: moderated t-statistic

- **P.Value**: raw p-value

- **adj.P.Value**: adjusted p-value using the Benjamini & Hochberg method (by default)

- **B**: log-odds that the gene is differentially expressed

::: {.reflection}
QUESTIONS:

1. The top three genes have a negative logFC. Which treatment (`untrt` or `trt`) had greater expression?
2. `eBayes()` performs p-value adjustment automatically. Why is it essential that we perform p-value adjustment for gene expression data?
:::

## Recap

```{r}
sessionInfo()
```