-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path06_smokingMouse_plotting_basics.Rmd
748 lines (558 loc) · 41 KB
/
06_smokingMouse_plotting_basics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
# RSE plotting basics with smokingMouse
Instructor: Daianna
## Download the smokingMouse data
We've used these commands already in an earlier chapter. Here we'll re-download
the data, though technically we are using the cached data we had previously
downloaded so this time the code below will run faster.
```{r download_data_biocfilecache_repeat}
## Load the container package for this type of data
library("SummarizedExperiment")
## Download data
library("BiocFileCache")
bfc <- BiocFileCache::BiocFileCache()
cached_rse_gene <- BiocFileCache::bfcrpath(
x = bfc,
"https://github.com/LieberInstitute/SPEAQeasyWorkshop2023/raw/devel/provisional_data/rse_gene_mouse_RNAseq_nic-smo.Rdata"
)
## Check the local path on our cache
cached_rse_gene
## Load the rse_gene object
load(cached_rse_gene, verbose = TRUE)
## Nicotine data
rse_gene_nic <- rse_gene[, which(rse_gene$Expt == "Nicotine")]
```
In this part of the course you'll have a plotting session in which you'll learn what type of plots you can create to explore your data, but please note that this is not a differential expression analysis itself. The steps that will be performed are shown in **Figure 1**.
The main objective of this first part is to explore the quality of the samples, their differences in gene expression variations and the impact of sample variables on gene expression variance. You'll also learn how to visualize gene expression counts.
<figure>
<img src="Figures/all_analyses.png" width="700px" align=center />
<figcaption style="color: gray; line-height: 0.9; text-align: justify">
<font size="-1.8">
<b>Figure 1</b>: <b> Summary of the analyses for differential expression. </b> Steps in yellow will be properly performed.
<b> 1. Data Preparation</b>: in this initial part counts are normalized and scaled and low-expressed genes are filtered.
<b> 2. Exploratory Data Analysis</b>: the quality of the samples is compared, poor-quality samples are removed and both gene-level and sample-level effects are explored.
<b> 3. Differential Expression Analysis</b>: a linear model is fitted for each gene and log fold changes are obtained for the contrast of interest; other statistics are also computed and compared. Here the differentially expressed genes (DEGs) are determined and quantified.
<b> 4. GO & KEGG Analyses</b>: in this part Gene Ontology and KEGG enrichment analyses are performed to identify biological processes and functions in which DEGs are significantly involved.
<b> 5. DEG visualization</b>: heatmaps are created to visually contrast gene expression levels of DEGs in control and experimental samples. </font>
<font size="0.8">
Abbreviations: CPM: counts per million; QC: quality control; PC: principal component;
DEG: differentially expressed genes; GO: Gene Ontology; KEGG: Kyoto Encyclopedia of Genes and Genomes.
</font>
</figcaption>
</figure>
## Data preparation
Even before exploring the samples, we must normalize the counts and filter non-expressed genes; we won't do the processes themselves since we already have normalized and filtered data but let's check why these steps are important and where to extract the data we need for posterior analyses.
### Data normalization
Data normalization is a relevant preliminary step when working with expression data because raw counts do not necessarily reflect real expression measures of the genes, since there are technical differences in the way the libraries are prepared and sequenced, as well as intrinsic differences in the genes that are translated into more or less mapped reads. Particularly, there are *within-sample effects* that are the differences between genes in the same sample, such as their **length** (the longer the gene, the more reads it will have) and **GC content**, factors that contribute to variations in their counts. On the other hand, *between-sample effects* are differences between samples such as the **sequencing depth**, i.e., the total number of molecules sequenced, and the **library size**, i.e., the total number of reads of each sample [1].
These variables lead to virtually different mRNA amounts but of course are not due to the biological or treatment conditions of interest (such as nicotine administration in this example) so in order to remove, or at least, to minimize this technical bias and obtain measures comparable and consistent across samples, raw counts must be normalized by these factors. The data that we'll use in this case are already normalized in `assays(rse_gene)$logcounts`. Specifically, the assay contains counts per million (CPM), also known as reads per million (RPM), one basic gene expression unit that only normalizes by the sequencing depth and is computed by dividing the read counts of a gene in a sample by a scaling factor given by the total mapping reads of the sample per million [2]:
$$CPM = \frac{read \ \ counts \ \ of \ \ gene \ \ \times \ \ 10^6 }{Total \ \ mapping \ \ reads \ \ of \ \ sample}$$
As outlined in **Data overview and download**, the scaling factors were obtained with `calcNormFactors()` applying the Trimmed Mean of M-Values (TMM) method, the `r BiocStyle::Biocpkg("edgeR")` package's default normalization method that assumes that most genes are not differentially expressed. The effective library sizes of the samples and the CPM of each observation were computed with the `r BiocStyle::Biocpkg("edgeR")` function `cpm()` setting the `log` argument to `TRUE` and `prior.count` to 0.5 to receive values in $log_2(CPM+0.5)$.
After data normalization and scaling, we’d expect the read counts to follow a normal distribution, something we can confirm by comparing the counts' distribution before and after the normalization. Consider both datasets contain the exact same genes.
```{r Data preparation, message=FALSE, warning=FALSE}
library("ggplot2")
## Histogram and density plot of read counts before and after normalization
## Raw counts
counts_data <- data.frame(counts = as.vector(assays(rse_gene_nic)$counts))
plot <- ggplot(counts_data, aes(x = counts)) +
geom_histogram(colour = "black", fill = "lightgray") +
labs(x = "read counts", y = "Frecuency") +
theme_classic()
plot + theme(plot.margin = unit(c(2, 4, 2, 4), "cm"))
## Normalized counts
logcounts_data <- data.frame(logcounts = as.vector(assays(rse_gene_nic)$logcounts))
plot <- ggplot(logcounts_data, aes(x = logcounts)) +
geom_histogram(aes(y = ..density..), colour = "darkgray", fill = "lightgray") +
theme_classic() +
geom_density(fill = "#69b3a2", alpha = 0.3) +
labs(x = "log(CPM+0.5)", y = "Frecuency")
plot + theme(plot.margin = unit(c(2, 4, 2, 4), "cm"))
```
As presented, after data transformation, we can now see a more widespread distribution of the counts, but note that most of them are zeros in the first plot (the one with the raw counts) and those zeros remain after normalization, corresponding to counts below 0 in the second plot. That is because we haven't filtered the low and zero-expressed genes.
### Gene filtering
Lowly-expressed or non-expressed genes in many samples are not of biological interest in a study of differential expression because they don't inform about the gene expression changes and they are, by definition, not differentially expressed, so we have to drop them using `filterByExpr()` from `r BiocStyle::Biocpkg("edgeR")` that only keeps genes with at least *K* CPM in *n* samples and with a minimum total number of counts across all samples.
```{r, message=FALSE, warning=FALSE}
## Retain genes that passed filtering step
rse_gene_filt <- rse_gene_nic[rowData(rse_gene_nic)$retained_after_feature_filtering == TRUE, ]
## Normalized counts and filtered genes
filt_logcounts_data <- data.frame(logcounts = as.vector(assays(rse_gene_filt)$logcounts))
## Plot
plot <- ggplot(filt_logcounts_data, aes(x = logcounts)) +
geom_histogram(aes(y = ..density..), colour = "darkgray", fill = "lightgray") +
theme_classic() +
geom_density(fill = "#69b3a2", alpha = 0.3) +
labs(x = "log(CPM+0.5)", y = "Frecuency")
plot + theme(plot.margin = unit(c(2, 4, 2, 4), "cm"))
```
In this third plot we can observe a curve that is closer (though not completely) to a normal distribution and with less lowly-expressed genes.
With the object `rse_gene_filt` we can proceed with downstream analyses.
## Exploratory Data Analysis
The first formal step that we will be performing is the sample exploration. This crucial initial part of the analysis consists of an examination of differences and relationships between Quality-Control (QC) metrics of the samples from different groups in order to identify poor-quality samples that must be removed before DEA. After that, the sample variables in the metadata also need to be analyzed and filtered based on the percentage of gene expression variance that they explain for each gene.
## Quality Control Analysis
First we have to explore and compare the the quality-control metrics of the samples in the different groups given by covariates such as age, sex, pregnancy state, group, plate and flowcell. See **Sample Information** section in chapter 04 for a description of these variables.
<style>
p.comment {
background-color: #E6E6E6;
padding: 10px;
border: 1px solid black;
margin-left: 25px;
border-radius: 5px;
font-family: sans-serif;
}
</style>
<p class="comment">
❓ ***Why is that relevant?*** As you could imagine, technical and methodological differences in all the steps that were carried out during the experimental stages are potential sources of variation in the quality of the samples. Just imagine all that could have been gone wrong or unexpected while experimenting with mice, during the sampling, in the RNA extraction using different batches, when treating samples in different mediums, when preparing libraries in different plates and sequencing in different flowcells. Moreover, the inherent features of the mice from which the samples come from such as age, tissue, sex and pregnancy state could also affect the samples' metrics if, for example, they were separately analyzed and processed.
</p>
<p class="comment">
❓***But why do we care about mitochodrial and ribosomal counts as QC metrics?*** In the process of mRNA extraction, either by mRNA enrichment (capturing polyadenilated mRNAs) or rRNA-depletion (removing rRNA), we’d expect to have a low number of ribosomal counts, i.e., counts that map to rDNA, and if we don’t, something must have gone wrong with the procedures. In the case of mitochondrial counts something similar occurs: higher mitochondrial rates will be obtained if the cytoplasmic mRNA capture was deficient or if the transcripts were lost by some technical issue, increasing the proportion of mitochondrial mRNAs. As a result, high <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> mitoRate</span> </mark> and <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> rRNA_rate</span> </mark> imply poor quality in the samples.
</p>
<div class="alert alert-info">
<strong>Note:</strong> the QC metrics were computed with the unprocessed datasets (neither filtered nor normalized) to preserve the original estimates of the samples.
</div>
### Evaluate QC metrics for groups of samples
Fortunately, we can identify to some extent possible factors that could have influenced on the quality of the samples, as well as isolated samples that are problematic. To do that, we will create boxplots that present the distribution of the samples' metrics separating them by sample variables.
```{r QC_boxplots, message=FALSE, warning=FALSE}
library("Hmisc")
library("stringr")
library("cowplot")
## Define QC metrics of interest
qc_metrics <- c("mitoRate", "overallMapRate", "totalAssignedGene", "rRNA_rate", "sum", "detected", "ERCCsumLogErr")
## Define sample variables of interest
sample_variables <- c("Group", "Age", "Sex", "Pregnancy", "plate", "flowcell")
## Function to create boxplots of QC metrics for groups of samples
QC_boxplots <- function(qc_metric, sample_var) {
## Define sample colors depending on the sample variable
if (sample_var == "Group") {
colors <- c("Control" = "brown2", "Experimental" = "deepskyblue3")
} else if (sample_var == "Age") {
colors <- c("Adult" = "slateblue3", "Pup" = "yellow3")
} else if (sample_var == "Sex") {
colors <- c("F" = "hotpink1", "M" = "dodgerblue")
} else if (sample_var == "Pregnancy") {
colors <- c("Yes" = "darkorchid3", "No" = "darkolivegreen4")
} else if (sample_var == "plate") {
colors <- c("Plate1" = "darkorange", "Plate2" = "lightskyblue", "Plate3" = "deeppink1")
} else if (sample_var == "flowcell") {
colors <- c(
"HKCG7DSXX" = "chartreuse2", "HKCMHDSXX" = "magenta", "HKCNKDSXX" = "turquoise3",
"HKCTMDSXX" = "tomato"
)
}
## Axis labels
x_label <- capitalize(sample_var)
y_label <- str_replace_all(qc_metric, c("_" = ""))
## x-axis text angle and position
if (sample_var == "flowcell") {
x_axis_angle <- 18
x_axis_hjust <- 0.5
x_axis_vjust <- 0.7
x_axis_size <- 4
} else {
x_axis_angle <- 0
x_axis_hjust <- 0.5
x_axis_vjust <- 0.5
x_axis_size <- 6
}
## Extract sample data in colData(rse_gene_filt)
data <- data.frame(colData(rse_gene_filt))
## Sample variable separating samples in x-axis and QC metric in y-axis
## (Coloring by sample variable)
plot <- ggplot(data = data, mapping = aes(x = !!rlang::sym(sample_var), y = !!rlang::sym(qc_metric), color = !!rlang::sym(sample_var))) +
## Add violin plots
geom_violin(alpha = 0, size = 0.4, color = "black", width = 0.7) +
## Spread dots
geom_jitter(width = 0.08, alpha = 0.7, size = 1.3) +
## Add boxplots
geom_boxplot(alpha = 0, size = 0.4, width = 0.1, color = "black") +
## Set colors
scale_color_manual(values = colors) +
## Define axis labels
labs(y = y_label, x = x_label) +
## Get rid of the background
theme_bw() +
## Hide legend and define plot margins and size of axis title and text
theme(
legend.position = "none",
plot.margin = unit(c(0.5, 0.4, 0.5, 0.4), "cm"),
axis.title = element_text(size = 7),
axis.text = element_text(size = x_axis_size),
axis.text.x = element_text(angle = x_axis_angle, hjust = x_axis_hjust, vjust = x_axis_vjust)
)
return(plot)
}
## Plots of all QC metrics for each sample variable
multiple_QC_boxplots <- function(sample_var) {
i <- 1
plots <- list()
for (qc_metric in qc_metrics) {
## Call function to create each individual plot
plots[[i]] <- QC_boxplots(qc_metric, sample_var)
i <- i + 1
}
## Arrange multiple plots into a grid
print(plot_grid(plots[[1]], plots[[2]], plots[[3]], plots[[4]], plots[[5]], plots[[6]], plots[[7]], nrow = 2))
}
```
```{r message=FALSE, warning=FALSE}
multiple_QC_boxplots("Age")
```
Initially, when we separate samples by Age, we can appreciate a clear segregation of adult and pup samples in <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> mitoRate</span> </mark>, with higher mitochondrial rates for adult samples and thus, being lower quality samples than the pup ones. We can also see that pup samples have higher <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> totalAssignedGene</span> </mark>, again being higher quality. The samples are very similar in the rest of the QC metrics.
The former differences must be taken into account because they guide further sample separation by Age, which is necessary to avoid dropping most of the adult samples (that are lower quality) in the **QC-based sample filtering** (see below) and to prevent misinterpreting sample variation given by quality and not by mouse age itself.
```{r message=FALSE, warning=FALSE}
multiple_QC_boxplots("Sex")
```
With Sex, female and male samples have similar QC metrics but there are some female samples that have high <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> mitoRate</span> </mark> and low <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> totalAssignedGene</span> </mark>, but males don´t. That is consistent with the fact that all males are pups and as we previously observed, pup samples present better QC metrics, but since not all pups are males, some female pups also have smaller <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> mitoRate</span> </mark> and greater <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> totalAssignedGene</span> </mark>. So in this case samples' differences are actually dictated by Age and not by Sex.
```{r message=FALSE, warning=FALSE}
multiple_QC_boxplots("Group")
```
Notably, for Group no evident contrasts are seen in the quality of the samples, which means that both control and exposed samples have similar metrics and therefore the differences between them won't be determined by technical factors but effectively by gene expression changes.
```{r message=FALSE, warning=FALSE}
multiple_QC_boxplots("Pregnancy")
```
Pregnancy variable is interesting because pregnant dams were obviously females and adults, which we already noted are lower quality. Accordingly to that, pregnant samples present overall higher <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> mitoRate</span> </mark> and lower <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> totalAssignedGene</span> </mark>, <mark style= "background-color: #DFF0FE"> <span style="font-family: monospace"> sum</span> </mark> (library size) and <mark style= "background-color: #DFF0FE"> <span style="font-family: monospace"> detected</span> </mark> (number of expressed genes) than some samples coming from non-pregnant mice. However, these samples have smaller |<mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> ERCCsumLogErr</span> </mark>|, which means that they had smaller differences between expected and observed concentrations of control transcripts.
Notwithstanding, we must clarify one more time that as in Sex, the trends observed in Pregnancy are all given by Age: it is Age the variable that fully segregates samples in <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> mitoRate</span> </mark> and almost completely in </mark>, lower <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> totalAssignedGene</span> </mark>. Because all pregnant dams were adults, their metrics will be positioned where the adult samples were, but there were also not-pregnant adults that share similar QC values.
```{r message=FALSE, warning=FALSE}
multiple_QC_boxplots("plate")
```
In plate, no alarming differences are presented but some samples from the 1st plate
have low (though not much lower) <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> overallMapRate</span> </mark>, <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> totalAssignedGene</span> </mark> and <mark style= "background-color: #DFF0FE"> <span style="font-family: monospace"> sum</span> </mark>.
```{r message=FALSE, warning=FALSE}
multiple_QC_boxplots("flowcell")
```
For the flowcell, again no worrying distinctions are seen, with the exception of a few individual samples far from the rest.
In all the previous plots we can appreciate a group of samples placed below in <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> mitoRate</span> </mark> that correspond to pup samples. The relationship is more fuzzy in <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> totalAssignedGene</span> </mark>.
### Examine relationships between QC metrics of samples
Secondly, we want to analyze if there are relationships between QC variables that explain the presence of samples with two or more particular QC metrics. For that, scatterplots are useful because they enable the evaluation of pairs of samples' QC metrics.
```{r "QC scatterplots", message=FALSE, warning=FALSE}
## Scatterplots for a pair of QC metrics
QC_scatterplots <- function(sample_var, qc_metric1, qc_metric2) {
## Define sample colors
if (sample_var == "Group") {
colors <- c("Control" = "brown2", "Experimental" = "deepskyblue3")
} else if (sample_var == "Age") {
colors <- c("Adult" = "slateblue3", "Pup" = "yellow3")
} else if (sample_var == "Sex") {
colors <- c("F" = "hotpink1", "M" = "dodgerblue")
} else if (sample_var == "Pregnancy") {
colors <- c("Yes" = "darkorchid3", "No" = "darkolivegreen4")
} else if (sample_var == "plate") {
colors <- c("Plate1" = "darkorange", "Plate2" = "lightskyblue", "Plate3" = "deeppink1")
} else if (sample_var == "flowcell") {
colors <- c(
"HKCG7DSXX" = "chartreuse2", "HKCMHDSXX" = "magenta", "HKCNKDSXX" = "turquoise3",
"HKCTMDSXX" = "tomato"
)
}
data <- colData(rse_gene_filt)
## Scatterplots for continuous variable vs continuous variable
## First QC metric in x-axis and second QC metric in y-axis
plot <- ggplot(as.data.frame(data), aes(
x = !!rlang::sym(qc_metric1),
y = !!rlang::sym(qc_metric2),
## Color samples by a variable
color = !!rlang::sym(sample_var)
)) +
## Add scatterplot
geom_point(size = 1) +
## Add regression line
stat_smooth(geom = "line", alpha = 0.4, size = 0.4, span = 0.25, method = lm, color = "orangered3") +
## Colors
scale_color_manual(name = sample_var, values = colors) +
theme_bw() +
## Add Pearson correlation coefficient between the metrics as subtitle
labs(
subtitle = paste0("Corr: ", signif(cor(data[, qc_metric1], data[, qc_metric2], method = "pearson"), digits = 3)),
## Add axis labels
y = gsub("_", " ", qc_metric2),
x = gsub("_", " ", qc_metric1)
) +
## Plot margins and text size
theme(
plot.margin = unit(c(0.1, 1.2, 0.1, 1.2), "cm"),
axis.title = element_text(size = (7)),
axis.text = element_text(size = (6)),
plot.subtitle = element_text(size = 7, color = "gray40"),
legend.text = element_text(size = 6),
legend.title = element_text(size = 7)
)
return(plot)
}
## QC scatterplots coloring by all sample variables
multiple_QC_scatterplots <- function(qc_metric1, qc_metric2) {
sample_variables <- c("Age", "Sex", "plate", "Pregnancy", "Group", "flowcell")
i <- 1
plots <- list()
for (sample_var in sample_variables) {
plots[[i]] <- QC_scatterplots(sample_var, qc_metric1, qc_metric2)
i <- i + 1
}
plot_grid(plots[[1]], plots[[2]], plots[[3]], plots[[4]], plots[[5]], plots[[6]], nrow = 3, rel_widths = c(1, 1))
}
```
```{r message=FALSE, warning=FALSE}
multiple_QC_scatterplots("mitoRate", "rRNA_rate")
```
For <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> mitoRate</span> </mark> vs <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> rRNA_rate</span> </mark> plots there's a negligible positive correlation which was anticipated since no obvious relationship exists between them, though if cytoplasmic mRNAs weren't well captured (lowering the total number of mapping reads) we could expect an increase in both mitochondrial and ribosomal rates.
Note the complete separation of samples in <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> mitoRate</span> </mark> (but not in <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> rRNA_rate</span> </mark>) by Age, which in turn causes the grouping of male samples (all pups).
```{r message=FALSE, warning=FALSE}
multiple_QC_scatterplots("mitoRate", "totalAssignedGene")
```
Forcefully, <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> mitoRate</span> </mark> and <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> totalAssignedGene</span> </mark> have a strong negative correlation, which reveals the existence of very bad quality samples (with small <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> totalAssignedGene</span> </mark> and high <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> mitoRate</span> </mark>).
```{r message=FALSE, warning=FALSE}
multiple_QC_scatterplots("sum", "detected")
```
Noticeably, a positive correlation is present for <mark style= "background-color: #DFF0FE"> <span style="font-family: monospace"> sum</span> </mark> and <mark style= "background-color: #DFF0FE"> <span style="font-family: monospace"> detected</span> </mark> and that is also congruent because a higher number of expressed-genes implies bigger library sizes, but that is not necessarily true in the other way around: just a few highly expressed genes (small number of <mark style= "background-color: #DFF0FE"> <span style="font-family: monospace"> detected</span> </mark> genes) could increase the library size. Interestingly, in this case samples are diagonally separated by age.
```{r message=FALSE, warning=FALSE}
multiple_QC_scatterplots("sum", "totalAssignedGene")
```
Contrary to expectations, <mark style= "background-color: #DFF0FE"> <span style="font-family: monospace"> sum</span> </mark> is not correlated to <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> totalAssignedGene</span> </mark> and this is unexpected because higher proportions of genes' reads could initially increase library size. However, since <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> totalAssignedGene</span> </mark> is a fraction of the total reads, it could be that a high fraction of reads mapping to genes reflects a small number of total reads and does not necessarily imply more expressed genes (and therefore bigger libraries). In other words, even when <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> totalAssignedGene</span> </mark> equals 1, if the total number of reads is small, the genes won't be widely covered by them and the libraries won't be sizeable. Observe the separarion of samples by Age.
```{r message=FALSE, warning=FALSE}
multiple_QC_scatterplots("detected", "totalAssignedGene")
```
Lastly, there's a slight positive correlation between <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> totalAssignedGene</span> </mark> and <mark style= "background-color: #DFF0FE"> <span style="font-family: monospace"> detected</span> </mark>, which makes sense because reads assigned to genes rise the number of non-zero expressed genes but as mentioned, those genes are not necessarily highly expressed and thus, library sizes are not perforce so much bigger. Also note that the correlation wouldn't be conserved within pup samples.
<style>
p.exercise {
background-color: #E4EDE2;
padding: 10px;
border: 1px solid black;
border-radius: 10px;
font-family: sans-serif;
}
</style>
<p class="exercise">
📑 **Exercise 1**: What's the correlation between <mark style= "background-color: #DFF0FE"> <span style="font-family: monospace"> subsets_Mito_sum</span> </mark> and <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> mitoMapped</span> </mark>? What factor could explain that correlation? Why are <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> mitoMapped</span> </mark> numbers bigger than <mark style= "background-color: #DFF0FE"> <span style="font-family: monospace"> subsets_Mito_sum</span> </mark> values?
</p>
```{r exercise1_EDA, message=FALSE, warning=FALSE, eval=FALSE, echo=FALSE}
## Solution
multiple_QC_scatterplots("subsets_Mito_sum", "mitoMapped")
## Because in mitoMapped you take reads that mapped to the whole mt chr, in subsets_Mito_sum only reads that were aligned to mt genes. But there's almost a perfect correlation between these two metrics.
```
Remarkably, samples were not differentiated by any other variable (including Group!) but the differences between adult and pup metrics suggest that they must be splitted into two different groups for the following analyses.
### QC sample filtering
After assessing how different or similar are the QC values between samples, we can now proceed to sample filtering based precisely, on these metrics. For that, we will use `isOutlier()` from `r BiocStyle::Biocpkg("scater")` to identify outlier samples only at the lower end or the higher end, depending on the QC metric.
```{r "QC sample filtering", message=FALSE, warning=FALSE}
library("scater")
library("rlang")
library("ggrepel")
## Separate data by Age
rse_gene_pups <- rse_gene_filt[, which(rse_gene_filt$Age == "Pup")]
rse_gene_adults <- rse_gene_filt[, which(rse_gene_filt$Age == "Adult")]
## Find outlier samples based on their QC metrics (samples that are 3 median-absolute-deviations away from the median)
## Filter all samples together
## Drop samples with lower library sizes (sum), detected number of genes and totalAssignedGene
outliers_library_size <- isOutlier(rse_gene_filt$sum, nmads = 3, type = "lower")
outliers_detected_num <- isOutlier(rse_gene_filt$detected, nmads = 3, type = "lower")
outliers_totalAssignedGene <- isOutlier(rse_gene_filt$totalAssignedGene, nmads = 3, type = "lower")
## Drop samples with higher mitoRates and rRNA rates
outliers_mito <- isOutlier(rse_gene_filt$mitoRate, nmads = 3, type = "higher")
outliers_rRNArate <- isOutlier(rse_gene_filt$rRNA_rate, nmads = 3, type = "higher")
## Keep not outlier samples
not_outliers <- which(!(outliers_library_size | outliers_detected_num | outliers_totalAssignedGene | outliers_mito | outliers_rRNArate))
rse_gene_qc <- rse_gene_filt[, not_outliers]
## Number of samples retained
dim(rse_gene_qc)[2]
## Add new variables to rse_gene_filt with info of samples retained/dropped
rse_gene_filt$Retention_after_QC_filtering <- as.vector(sapply(rse_gene_filt$SAMPLE_ID, function(x) {
if (x %in% rse_gene_qc$SAMPLE_ID) {
"Retained"
} else {
"Dropped"
}
}))
## Filter adult samples
outliers_library_size <- isOutlier(rse_gene_adults$sum, nmads = 3, type = "lower")
outliers_detected_num <- isOutlier(rse_gene_adults$detected, nmads = 3, type = "lower")
outliers_totalAssignedGene <- isOutlier(rse_gene_adults$totalAssignedGene, nmads = 3, type = "lower")
outliers_mito <- isOutlier(rse_gene_adults$mitoRate, nmads = 3, type = "higher")
outliers_rRNArate <- isOutlier(rse_gene_adults$rRNA_rate, nmads = 3, type = "higher")
not_outliers <- which(!(outliers_library_size | outliers_detected_num | outliers_totalAssignedGene | outliers_mito | outliers_rRNArate))
rse_gene_adults_qc <- rse_gene_adults[, not_outliers]
## Number of samples retained
dim(rse_gene_adults_qc)[2]
rse_gene_adults$Retention_after_QC_filtering <- as.vector(sapply(rse_gene_adults$SAMPLE_ID, function(x) {
if (x %in% rse_gene_adults_qc$SAMPLE_ID) {
"Retained"
} else {
"Dropped"
}
}))
## Filter pup samples
outliers_library_size <- isOutlier(rse_gene_pups$sum, nmads = 3, type = "lower")
outliers_detected_num <- isOutlier(rse_gene_pups$detected, nmads = 3, type = "lower")
outliers_totalAssignedGene <- isOutlier(rse_gene_pups$totalAssignedGene, nmads = 3, type = "lower")
outliers_mito <- isOutlier(rse_gene_pups$mitoRate, nmads = 3, type = "higher")
outliers_rRNArate <- isOutlier(rse_gene_pups$rRNA_rate, nmads = 3, type = "higher")
not_outliers <- which(!(outliers_library_size | outliers_detected_num | outliers_totalAssignedGene | outliers_mito | outliers_rRNArate))
rse_gene_pups_qc <- rse_gene_pups[, not_outliers]
## Number of samples retained
dim(rse_gene_pups_qc)[2]
rse_gene_pups$Retention_after_QC_filtering <- as.vector(sapply(rse_gene_pups$SAMPLE_ID, function(x) {
if (x %in% rse_gene_pups_qc$SAMPLE_ID) {
"Retained"
} else {
"Dropped"
}
}))
```
We already filtered outlier samples ... but what have we removed?
It is always important to trace the QC metrics of the filtered samples to verify that they really are poor quality. We don't want to get rid of useful samples! So let's go back to the QC boxplots but now color samples according to whether they passed or not the filtering step and also distinguishing samples' groups by shape.
```{r message=FALSE, warning=FALSE}
## Boxplots of QC metrics after sample filtering
## Boxplots
boxplots_after_QC_filtering <- function(rse_gene, qc_metric, sample_var) {
## Color samples
colors <- c("Retained" = "deepskyblue", "Dropped" = "brown2")
## Sample shape by sample variables
if (sample_var == "Group") {
shapes <- c("Control" = 0, "Experimental" = 15)
} else if (sample_var == "Age") {
shapes <- c("Adult" = 16, "Pup" = 1)
} else if (sample_var == "Sex") {
shapes <- c("F" = 11, "M" = 19)
} else if (sample_var == "Pregnancy") {
shapes <- c("Yes" = 10, "No" = 1)
} else if (sample_var == "plate") {
shapes <- c("Plate1" = 12, "Plate2" = 5, "Plate3" = 4)
} else if (sample_var == "flowcell") {
shapes <- c(
"HKCG7DSXX" = 3, "HKCMHDSXX" = 8, "HKCNKDSXX" = 14,
"HKCTMDSXX" = 17
)
}
y_label <- str_replace_all(qc_metric, c("_" = " "))
data <- data.frame(colData(rse_gene))
## Median of the QC var values
median <- median(eval(parse_expr(paste("rse_gene$", qc_metric, sep = ""))))
## Median-absolute-deviation of the QC var values
mad <- mad(eval(parse_expr(paste("rse_gene$", qc_metric, sep = ""))))
plot <- ggplot(data = data, mapping = aes(
x = "", y = !!rlang::sym(qc_metric),
color = !!rlang::sym("Retention_after_QC_filtering")
)) +
geom_jitter(alpha = 1, size = 2, aes(shape = eval(parse_expr((sample_var))))) +
geom_boxplot(alpha = 0, size = 0.15, color = "black") +
scale_color_manual(values = colors) +
scale_shape_manual(values = shapes) +
labs(x = "", y = y_label, color = "Retention after QC filtering", shape = sample_var) +
theme_classic() +
## Median line
geom_hline(yintercept = median, size = 0.5) +
## Line of median + 3 MADs
geom_hline(yintercept = median + (3 * mad), size = 0.5, linetype = 2) +
## Line of median - 3 MADs
geom_hline(yintercept = median - (3 * mad), size = 0.5, linetype = 2) +
theme(
axis.title = element_text(size = (9)),
axis.text = element_text(size = (8)),
legend.position = "right",
legend.text = element_text(size = 8),
legend.title = element_text(size = 9)
)
return(plot)
}
```
In the following plot we can confirm that taking all samples together (both adults and pups) all adult samples are dropped by their high <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> mitoRate</span> </mark>, but that is not desirable because we want to analyze these samples too, so we need to analyze pup and adult samples separately.
```{r message=FALSE, warning=FALSE}
## Plots
## All samples together
p <- boxplots_after_QC_filtering(rse_gene_filt, "mitoRate", "Age")
p + theme(plot.margin = unit(c(2, 4, 2, 4), "cm"))
```
For adults, only 3 controls are dropped, 3 of them by their <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> mitoRate</span> </mark>.
```{r message=FALSE, warning=FALSE}
## Adult samples
p <- boxplots_after_QC_filtering(rse_gene_adults, "mitoRate", "Group")
p + theme(plot.margin = unit(c(2, 4, 2, 4), "cm"))
```
For pups, 1 experimental sample was dropped by its <mark style= "background-color: #FCF3CF"> <span style="font-family: monospace"> rRNA_rate</span> </mark>.
```{r message=FALSE, warning=FALSE}
## Pup samples
p <- boxplots_after_QC_filtering(rse_gene_pups, "rRNA_rate", "Group")
p + theme(plot.margin = unit(c(2, 4, 2, 4), "cm"))
```
<p class="exercise">
📑 **Exercise 2**: Were the filtered adult samples from pregnant or not pregnant mice? From which plates and flowcells? What were their library sizes?
</p>
```{r exercise2_EDA, message=FALSE, warning=FALSE, eval=FALSE, echo=FALSE}
## Solution
p <- boxplots_after_QC_filtering(rse_gene_adults, "mitoRate", "Pregnancy")
p + theme(plot.margin = unit(c(2, 4, 2, 4), "cm"))
p <- boxplots_after_QC_filtering(rse_gene_adults, "mitoRate", "plate")
p + theme(plot.margin = unit(c(2, 4, 2, 4), "cm"))
p <- boxplots_after_QC_filtering(rse_gene_adults, "mitoRate", "flowcell")
p + theme(plot.margin = unit(c(2, 4, 2, 4), "cm"))
p <- boxplots_after_QC_filtering(rse_gene_adults, "sum", "Group")
p + theme(plot.margin = unit(c(2, 4, 2, 4), "cm"))
```
<div class="alert alert-info">
Try the function yourself with different QC metrics and sample variables!
</div>
## Miscellaneous about CPM values
```{r}
## Why do we see log(CPM + 0.5) values smaller than -1?
log2(0.5)
## Maybe the prior count is getting shrunk
log2(0.05)
## logcounts values we see across all samples
summary(as.vector(assays(rse_gene)$logcounts))
## Sample 203 in particular has some cases like it
summary(assays(rse_gene)$logcounts[, 203])
## Where we find one gene in sample 203 with that property
i_gene <- which.min(assays(rse_gene)$logcounts[, 203])
## We can check the logcounts and counts
assays(rse_gene)$logcounts[i_gene, 203]
assays(rse_gene)$counts[i_gene, 203]
## What if we try to reverse engineer the number?
2^assays(rse_gene)$logcounts[i_gene, 203]
log2(0.01578469)
0.01578469 / 0.5
0.5 * 0.03156938
## Hm... 31 doesn't ring any bells
1 / 0.03156938
```
* Bioconductor website for browsing code: https://code.bioconductor.org/
* Bioconductor support website: https://support.bioconductor.org/
Links Leo showed:
* `smokingMouse_indirects` code for computing the `logcounts` https://github.com/LieberInstitute/smokingMouse_Indirects/blob/704692a357ec391348ebc3568188d41827328ba5/code/02_build_objects/02_build_objects.R#L100
* https://code.bioconductor.org/browse/edgeR/blob/devel/R/cpm.R#L9
* https://code.bioconductor.org/browse/edgeR/blob/devel/R/cpm.R#L14
* https://code.bioconductor.org/browse/edgeR/blob/devel/R/cpm.R#L78
* https://code.bioconductor.org/browse/edgeR/blob/devel/src/R_calculate_cpm.cpp#L30
* https://code.bioconductor.org/browse/edgeR/blob/devel/src/add_prior.cpp#L29
* https://support.bioconductor.org/post/search/?query=prior.count+edgeR
* https://support.bioconductor.org/p/59846/
* https://support.bioconductor.org/p/59846/
```{r}
## Check the documentation
## ?edgeR::cpm
## > If log-values are computed, then a small count, given by prior.count but scaled to be proportional to the library size, is added to y to avoid taking the log of zero.
## https://github.com/LieberInstitute/smokingMouse_Indirects/blob/704692a357ec391348ebc3568188d41827328ba5/code/02_build_objects/02_build_objects.R#L100C51-L100C135
## Let's save the output of calcNormFactors
DGElist_with_norm <- edgeR::calcNormFactors(rse_gene, method = "TMM")
class(DGElist_with_norm)
## https://code.bioconductor.org/browse/edgeR/blob/devel/R/cpm.R#L9
## We can see that it has the lib.size and norm.factors values there
head(DGElist_with_norm$samples[, seq_len(3)])
## Exploring lib.size across all samples
summary(DGElist_with_norm$samples$lib.size)
DGElist_with_norm$samples$lib.size[203]
## https://code.bioconductor.org/browse/edgeR/blob/devel/R/cpm.R#L14
## We can see there how edgeR::cpm() computes the adjusted library sizes
summary(DGElist_with_norm$samples$norm.factors)
DGElist_with_norm$samples$norm.factors[203]
## adjusted library sizes
DGElist_with_norm$samples$lib.size[203] * DGElist_with_norm$samples$norm.factors[203]
## Let's save these values for all samples
adj_lib_size <- DGElist_with_norm$samples$lib.size * DGElist_with_norm$samples$norm.factors
adj_lib_size[203]
## Proportional adjusted library size to the mean adjusted library size
adj_lib_size[203] / mean(adj_lib_size)
1 / (adj_lib_size[203] / mean(adj_lib_size))
## Hm.... we are missing something to get to the values we saw earlier when we
## tried to reverse engineer the issue.
## Hm.... I couldn't recalculate that -5.98 value manually
log2(0.5 / (adj_lib_size[203] / mean(adj_lib_size) * 2))
```
In the end, I (Leo) don't know enough C++ to fully understand the math of what `edgeR` is doing. But well, the point is that it is scaling that `prior.count` (`0.5` in the `smokingMouse` case instead of the default value of `2`) which results in a smaller number, and hence why the `log2()` value of it is smaller than `-1`.
<script defer class="speakerdeck-embed" data-slide="24" data-id="434ae0541ae64731a95f35431acb754d" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
## References {-}
1. Evans, C., Hardin, J., & Stoebel, D. M. (2018). Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions. *Briefings in bioinformatics*, 19(5), 776-792.
2. Bedre, R. (2023). Gene expression units explained: RPM, RPKM, FPKM, TPM, *DESeq*, TMM, SCnorm, GeTMM, and ComBat-Seq. Web site: https://www.reneshbedre.com/blog/expression_units.html