-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path09-VisualizingData.Rmd
702 lines (483 loc) · 38.7 KB
/
09-VisualizingData.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
# **VISUALIZING TAXONOMIC DATA**
You can build lots of different charts through QIIME2 via the compute canada clusters, like we did with the diversity plots - but I think the QIIME2 plots are not as versatile as the ones that can be built in R. Here, I walk you through the steps of exporting the appropriate data to R to use several different packages that handle QIIME2 artifacts. The figures made in these next visualization sections can be saved as a vector file and then edited in other software such as Illustrator, Power Point, Libre Office, etc. for publishing.
If you're brand new to using R and RStudio, you should become familiar with the interface first. Here are a few links to get started
[REducation Beginners Tutorial](https://education.rstudio.com/learn/beginner/)
[Quantitative Guide for Biology: Intro to R](https://ahurford.github.io/quant-guide-all-courses/install.html)
## **Exporting data**
To begin making the figures, let's export the data on to a local computer from the cluster using the safe file transfer protocol (SFTP). You can return to **Chapter 03 - Importing Data** to review how the SFTP works.
I exported the following QZA files onto my local computer and kept them in a file called `Desktop/QIIME_files`. **Make sure to put these files into a dedicated folder for the remainder of the analysis so they are easy to find**
|File type | Associated file names |
|---------|----------------------------|
| Final feature table | final_table-sediments.qza, final_table-birds.qza, final_merged-table.qza |
| Representative sequences | rep-seqs_sediments.qza, rep-seqs_birds.qza, rep_seqs.qza |
| Classification files | classification_sediments.qza, classification_birds.qza, classification.qza|
| Phylogenetic tree | asvs-tree_sediments.qza, insertion-placements_sediments.qza, asvs-tree_birds.qza, insertion-placements_birds.qza, asvs-tree.qza, insertion-placements.qza|
| Merged feature table | merged-table.qza |
| Grouped feature table | grouped-table.qza |
| Diversity folder for birds | includes all of the files from the `qiime core-metrics-phylogenetic` command run on the bird samples (put it into a folder named `/diversity-birds`)|
| Diversity folder for sediments | includes all of the files from the `qiime core-metrics-phylogenetic` command run on the sediment samples (put it into a folder named `/diversity-sediments`)|
| Diversity folder for merged files | includes all of the files from the `qiime core-metrics-phylogenetic` command run on the merged feature table (put it into a folder named `/diversity-merged`)|
## **Importing Q2 files to R**
A few different packages are used to load the QIIME2 artifacts into R. If you're new to R and haven't installed Rtools, do that first. RTools is a separate software that you have to download and install in your system (not in R). Follow this link to install RTools on your system - https://cran.rstudio.com/bin/windows/Rtools/. Sometimes you can get away without doing this step, it doesn't hurt to try!
Begin by installing and loading the following R packages. To install the qiime2R package, I used the remotes package because some versions of R will not allow you to install qiime2R through the normal `install.packages` method.
```{r, eval=FALSE}
#install the ggforce and remotes package
install.packages("remotes")
#now load the remotes package
library(remotes)
#install qiime2R
remotes::install_github("jbisanz/qiime2R")
```
Now install and load some other packages that will be used in the downstream analysis
```{r, eval=FALSE}
install.packages("ggplot2")
install.packages("ggforce")
install.packages("tidyverse")
```
```{r, eval=FALSE}
#load the packages
library(ggplot2)
library(ggforce)
library(tidyverse)
```
## **QIIME2R**
[QIIME2R](https://github.com/jbisanz/qiime2R) is an R package that was created to manipulate QIIME2 artifact files for downstream analysis. It allows users to extract data from an artifact file, manipulate the data within the files, and create publication quality figures, including taxonomic bar charts, diversity plots, and more.
Start by setting your working directory, here I use the file `Desktop/QIIME-files`
```{r, eval=FALSE}
setwd("C:/Users/Hanna/Desktop/QIIME_files")
#check what files are in your wd
list.files()
```
To begin using QIIME2R, make sure the package is loaded first.
```{r, eval=FALSE}
library(qiime2R)
```
### Load the feature table
Next, read the feature tables in .QZA format that we will be using for the downstream analysis. Here I use the `<-` to assign the term `bird_features` to the file containing the feature counts for every bird sample (found in the folder QIIME-files in my Desktop folder).
```{r, eval=FALSE}
bird_features<- read_qza("final_table-birds.qza")
```
Once the data is read, you can view it using this command:
```{r, eval=FALSE}
view(bird_features$data)
```
Now do the same for the sediment samples and the grouped feature table that was made in Ch-06 (Data Preprocessing):
```{r, eval=FALSE}
sed_features <- read_qza("final_table-sediments.qza")
view(sed_features$data)
grouped_features <- read_qza("grouped-table.qza")
view(grouped_features$data)
```
To view if the information stored in the files is correct you can also use the`names` command:
```{r, eval=FALSE}
names(sed_features) #shows first 5 samples and first 5 taxa
names(bird_features)
```
One handy command to know is the `write.csv()` command, which can be used to create .csv file of your feature table.
```{r, eval=FALSE}
write.csv(bird_features$data, "feature-table_birds.csv")
```
### Load the metadata
Now that the feature table is loaded into R, the metadata can also be loaded. Always ensure that your metadata file is properly formatted.
If it's unclear how metadata is used in this analysis, you should download the [metadata files](<https://github.com/johannabosch/QIIME2_for_Graham/blob/main>) titled `metadata.csv`, `birds_metadata.csv`, and `sediments_metadata.csv`. Take a look at the formatting of each file, and the type of data that's attached to each sample.
Metadata in microbiome analysis is akin to the footnotes in a research paper; while not essential for understanding the main text, it provides valuable information about the samples, such as site ID, sampling location, and species Latin names. This additional context enhances the comprehensibility and depth of taxonomic bar charts, diversity box plots, and other figures. The first column, labelled `#SampleID` is a special column header indicating the sample IDs, the other categories are the metadata. Each row represents a sample, with corresponding values for each metadata category. Remember that the specific metadata categories and their names can vary depending on your study and the information you want to associate with your samples. To learn more about using metadata in QIIME2, visit the [Metadata in QIIME2](https://docs.qiime2.org/2023.7/tutorials/metadata/#metadata-columns) page.
Load the metadata into R:
```{r, eval=FALSE}
#metadata file for only sediments
sediment_metadata<-read_q2metadata("metadata-sediments.csv")
#metadata file for only bird fecal samples
bird_metadata<-read_q2metadata("metadata-birds.csv")
#metadata file containing all the samples
metadata<-read_q2metadata("metadata.csv")
#metadata file containing all the grouped samples
#you do not have to use the QIIME2R's command,
grouped_metadata<-read_q2metadata("metadata-grouped.csv")
head(metadata) #used to show top lines of a metadata file
```
### Load the taxonomic data
The next step to visualizing data using QIIME2R is to load the taxonomy data into R. This is done using the same `read_qza` command that was used before.
```{r, eval=FALSE}
taxonomy<-read_qza("classification.qza")
head(taxonomy$data) #used to see the first lines of the taxonomy data file
view(taxonomy$data) #used to open to taxonomy file in a new tab
```
### Parsing taxonomic data
An important step that should be considered is parsing the `classification.qza` file to break up the taxonomic levels. This will separate each level of classification given to a feature. Here, the term `taxonomy` is being assigned to the part of the `classification.qza` file that contains the data (`taxonomy$data`). In other words, the term `taxonomy` now becomes the name of the variable where the parsed taxonomic data will be stored. `parse_taxonomy` is simply a function in the QIIME2R package specifically designed to extract taxonomic data from QIIME 2 artifacts. For example, a string of taxonomic classifications that looks like the first table shown here will end up looking like the second table:
**Table 1. Original taxonomic file:**
|FeatureID|Taxonomy|
|------|--------|
|4595dec6ed...| k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae;g__Roseburia;s__Roseburia intestinalis|
**Table 2. Parsed taxonomic data file:**
|FeatureID|Kingdom|Phylum|Class|Order|Family|Genus|Species|
|-------|--------|-------|--------|-------|-------|----------|
|4595dec6ed... | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Roseburia | Roseburia intestinalis|
To parse the taxonomy file, use the following command:
```{r, eval=FALSE}
taxonomy<-parse_taxonomy(taxonomy$data)
view(taxonomy)
#save a .csv formatted version of the file if you want
write.csv(taxonomy, "taxa.csv")
```
**NOTE:** when you do this, you will replace your original file assigned to the term `taxonomy` with the new parsed file that was also assigned to the term `taxonomy`. If you want to keep both files separate, you will have to change the term assigned to the parsed file to something like `parsed-taxonomy`.
### Create a taxa barplot
In this next section, I create a taxonomic bar chart which presents the top 10 most abundant phyla across all of my seabird and sediment samples together. This means that the resulting plot will show the top 10 most abundant phyla for every seabird species or sediment sampling site side by side. To do this we have to use the `grouped-table.qza` file that contains feature counts for both the sediment sampling sites and seabird species that was made in the last steps of Ch-06 (Data Preprocessing).
Make sure the following packages are loaded for this next section
```{r, eval=FALSE}
library(dplyr)
library(tidyr)
library(qiime2R)
```
Next, load the colors you'd like to use for your bar plot. This can be done in many different ways. You can always use a pre-made color palette that can be loaded through various R packages.
Let's try that first! Start by trying to load the RColorBrewer package and choosing the `Spectral` palette. Here I specify how many colors are needed for each sample group with `n=10`.
```{r, eval=FALSE}
# method 1: loading a color palette
install.packages("RColorBrewer")
library(RColorBrewer)
mycolors = c(brewer.pal(name="Spectral", n = 10), brewer.pal(name="BrBG", n = 10))
```
If you have more sampling groups than the color palette has colors, then this method will not work. You can also combine color palettes, but at that point I just prefer to specifiy the colors myself. And in many cases, assigning your own unique HEX codes to each group is really useful. In cases where you're working with a lot of different samples, you'll need the colors to be distinguishable from one another when sitting side by side in the barplot. By assigning your own colors to groups you can always go back and switch the order of colors in your plot.
```{r, eval=FALSE}
#method 2: assigning your own colors
colors <- c("#FF0000", "#FFA500", "#FFFF00", "#00FF00", "#00FFFF", "#0000FF", "#800080", "#FF00FF", "#FF4500", "#FF69B4", "#004445")
```
This next line of code specifies the dimensions of your plot, which is sometimes necessary for plots that are really large but may not necessary.
```{r, eval=FALSE}
par(mar=c(1000,1000,1000,1000))
```
Make sure the grouped-table.qza file (assigned to the term `grouped_features`) and the metadata.csv file (assigned to the term `metadata`) is loaded into R. Next, use the `summarize_taxa` command to summarize the taxonomy file by phylum. This line of code can also be used to summarize data by genera, or any other taxonomic level present in your parsed taxonomy file, simply change the term after the `$` in this code.
```{r, eval=FALSE}
taxasums <- summarize_taxa(grouped_features$data, taxonomy)$Phylum
```
Open this new `taxasums` file to see what happened here. This command merged the taxonomy file and the grouped feature table, and replaced the feature ID's with their associated taxa, meanwhile keeping the feature counts present for each group.
Now use this command to plot the bar graph. Here, I specify that the data should be plotted according to the `group` column in the file `metadata.csv`, and that 20 phyla should be plotted to the graph.
```{r, eval=FALSE}
taxa_barplot(taxasums, grouped_metadata, "group", ntoplot=10) +
scale_fill_manual(values = colors) + #add colors we chose to the plot
theme(text = element_text(size = 12)) + #add a theme to change font size
guides(fill=FALSE) #remove this line if you want a legend
```
```{r fig1, fig.cap="Raw version of the taxa barplot showing the top 10 most abundant phyla present in each sampling group", echo=FALSE}
knitr::include_graphics('images/taxa_barplot.png')
```
Save the figure as an .SVG file so that you can edit it in another software by ungrouping the single layer file. I like using LibreOffice Draw on Linux, or you can use PowerPoint, Illustrator, etc. If you don't know how to do this in your chosen software, Google it!
```{r, eval=FALSE}
ggsave("taxa_birds.svg", height = 50, width = 100, units="cm")
```
If you'd like to know the exact feature counts used in this graph, create a .csv file here that contains all of the feature counts from the taxasums file
```{r, eval=FALSE}
write.csv(taxasums, "top-taxa-phyla.csv")
```
### Create a taxa heatmap
Now let's make two heatmaps that show the top 50 most abundant genera in (1) the sediment samples and (2) the seabird fecal samples. Taxonomic heatmaps that highlight the most abundant genera in your data set are important in microbiome studies because they provide a quick and visually intuitive way to identify the key microbial taxa driving differences between samples or experimental groups. These heat maps help researchers pinpoint which genera are significantly contributing to variations in microbial composition, aiding in the discovery of potential biomarkers or ecological patterns. Additionally, they simplify the complex data by focusing on the most relevant taxa, making it easier to interpret and communicate findings to a broader audience.
Here's a review on how to set up the data for the taxonomic heatmaps. If you're having trouble understanding some of these concepts, revisit the previous section on taxonomic bar plots for a more thorough review on the commands.
```{r, eval=FALSE}
#summarize the taxa based on genera for each sample type
taxasums_birds <-summarize_taxa(bird_features$data, taxonomy)$Genus
taxasums_sediments <-summarize_taxa(sed_features$data, taxonomy)$Genus
#check the new taxasums files
view(taxasums_birds)
view(taxasums_sediments)
#save the files in .csv format for your analysis
write.csv(taxasums_birds, "taxa_counts_birds_genera.csv")
write.csv(taxasums_sediments, "taxa_counts_sediments_genera.csv")
#you can specify what metadata columns you want to use like this
species <- bird_metadata$species
site <- sediment_metadata$site
```
Now that we're all set up, let's create a taxonomic heatmap using the `taxa_heatmap` command for all of the seabird fecal samples. We start by using the summarized taxa file that was created above (`taxasums_birds`), and then specify the accompanying metadata file (`bird_metadata`) to be used, with the specific column we want to use for the x-axis of the heatmap (`species`). Next, we want to plot 50 taxa (`ntoplot=50`), and normalize the data with a `log10(percent)`, which emphasizes relative differences between taxa while reducing the influence of extreme values, making it easier to visualize patterns and compare abundance across taxa. Before running this command make sure you have enough space in your `Plots` quadrant on your screen has enough space to accommodate the heatmap.
```{r, eval=FALSE}
taxa_heatmap(taxasums_birds, bird_metadata, "species", ntoplot=50, normalize = "log10(percent)") +
scale_fill_gradient2(low="#9ac094", mid="#5ba14f", high="#003510", midpoint=0, na.value="white") +
scale_y_discrete(labels = function(x) str_replace(x, ".*;", "")) +
theme_light() + theme(axis.text.x = element_text(size = 12, angle =-75))
```
Make sure you have the `Plot` tab for the bottom right quadrant completely expanded to accommodate the size of the plot. After expanding the quadrant and running this code, you'll notice that the y-axis takes up most of the space on the plot. Here is what the y-axis labels look like right now:
```{r, echo=FALSE}
knitr::include_graphics('images/heatmap_screenshot.png')
```
If you do not want to include the entire taxonomic lineage in the y-axis of the figure, you can simply add the `scale_y_discrete` argument to your command. This line of code modifies the y-axis labels of a discrete scale in a plot by removing everything before and including the last semicolon in each label using the str_replace function from the stringr package.
```{r, eval=FALSE}
taxa_heatmap(taxasums_birds, bird_metadata, "species", ntoplot=50, normalize = "log10(percent)") +
scale_fill_gradient2(low="#9ac094", mid="#5ba14f", high="#003510", midpoint=0, na.value="white") +
scale_y_discrete(labels = function(x) str_replace(x, ".*;", "")) +
theme_light() + theme(axis.text.x = element_text(size = 12, angle =-75))
```
```{r fig2, fig.cap="Raw version of the taxa heatmap showing the top 50 most abundant genera present in each seabird fecal sampling group", echo=FALSE}
knitr::include_graphics('images/taxa_heatmap_birds.png')
```
Let's do the same with the sediment samples:
```{r, eval=FALSE}
taxa_heatmap(taxasums_sediments, sediment_metadata, "site", ntoplot=50, normalize = "log10(percent)") +
scale_fill_gradient2(low="#c99884", mid="#644133", high="#351b10", midpoint=0) +
scale_y_discrete(labels = function(x) str_replace(x, ".*;", "")) +
theme_light() + theme(axis.text.x = element_text(size=12, angle=-75))
```
```{r fig3, fig.cap="Raw version of the taxa heatmap showing the top 50 most abundant genera present in each sediment coring site group", echo=FALSE}
knitr::include_graphics('images/taxa_heatmap_sediments.png')
```
### Plot beta-diversity
PCA (Principal Component Analysis) plots are useful in microbiome analysis because they help visualize complex data by reducing it to a few key dimensions. In the context of comparing microbiomes between sites or species, PCA plots allow us to see patterns and similarities in microbial community composition, making it easier to identify which factors or variables are driving the differences in a data set.
While QIIME2 produces some nice interactive PCA plots, it's difficult to use these for publishing. With the QIIME2R package, you can create beautiful PCA plots and edit them to your liking.
Start by assigning the term `species` and `site` to the appropriate column from your `metadata-birds.csv` and `metadata-sediments.csv` file:
```{r, eval=FALSE}
species <- bird_metadata$species
site <- sediment_metadata$site
```
After running the `qiime diversity core-metrics-phylogenetic` command in Ch07 (Diversity Analyses), two files called `unweighted_unifrac_pcoa_results.qza` and `weighted_unifrac_pcoa_results.qza` were produced for both the sediment and seabird fecal samples (4 files total). These are the files we will use to create the weighted and unweighted PCA plots.
> **NOTE:** Weighted UniFrac in PCA analysis takes into account the abundance of organisms, highlighting the influence of the most abundant species, while unweighted UniFrac focuses on presence and absence, emphasizing less abundant organisms in community comparisons.
Check that these files are in the appropriate `/diversity` folders within your working directory, go back to the **Exporting data** section of this chapter for a quick refresher. Start by loading these files into R, where `wunifrac` refers to the weighted PCA files, and `unwunifrac` refers to the unweighted PCA files:
```{r, eval=FALSE}
wunifrac_birds<-read_qza("diversity-birds/weighted_unifrac_pcoa_results.qza")
unwunifrac_birds<-read_qza("diversity-birds/unweighted_unifrac_pcoa_results.qza")
wunifrac_seds<-read_qza("diversity-sediments/weighted_unifrac_pcoa_results.qza")
unwunifrac_seds<-read_qza("diversity-sediments/unweighted_unifrac_pcoa_results.qza")
```
If you want to add a little extra addition to your PCA plots, you can integrate the shannon diversity metrics into the plot. The addition of the Shannon data (shannon) in this code allows you to include the Shannon Diversity Index as a size aesthetic in the PCA plot. By doing so, you can visualize how the diversity of microbial communities (as measured by Shannon Diversity) relates to the distribution of samples in the PCA space. This can help you explore whether samples with higher or lower Shannon diversity values cluster differently in the PCA plot, providing insights into the relationship between diversity and community composition.
```{r, eval=FALSE}
# Shannon entropy data for sediment core samples
shannon_seds<-read_qza("diversity-sediments/shannon_vector.qza")$data %>% rownames_to_column("SampleID")
shannon_values_seds <- shannon_seds$shannon_entropy
# Shannon entropy data for seabird fecal samples
shannon_birds<-read_qza("diversity-birds/shannon_vector.qza")$data %>% rownames_to_column("SampleID")
shannon_values_birds <- shannon_birds$shannon_entropy
```
The next commands are used to plot the weighted (`PCA_w`) and unweighted (`PCA_unw`)PCA's of the sediment samples. This command can be broken down into several different parts that are explained below:
```{r, eval=FALSE}
#plotting weighted PCA for sediments
PCA_w_seds <- wunifrac_seds$data$Vectors %>%
select(SampleID, PC1, PC2) %>%
left_join(sediment_metadata) %>%
left_join(shannon_seds) %>%
ggplot(aes(x = PC1, y = PC2, color = site, size = shannon_values_seds)) +
geom_point(alpha = 0.5) +
stat_ellipse(type = "norm", level = 0.95, linetype = 1, geom = "polygon", aes(fill = site), alpha = 0.05) +
theme_q2r() +
theme(text = element_text(size = 12), panel.grid.major = element_line(), panel.grid.minor = element_line(), axis.text = element_text(size = 12)) +
xlab(paste(round(100 * wunifrac_seds$data$ProportionExplained[1], 2), "% - PC1")) +
ylab(paste(round(100 * wunifrac_seds$data$ProportionExplained[2], 2), "% - PC2")) +
labs(size = "Shannon Diversity")
PCA_w_seds
#plotting unweighted PCA for sediments
PCA_unw_seds <- unwunifrac_seds$data$Vectors %>%
select(SampleID, PC1, PC2) %>%
left_join(sediment_metadata) %>%
left_join(shannon_seds) %>%
ggplot(aes(x = PC1, y = PC2, color = site, size = shannon_values_seds)) +
geom_point(alpha = 0.5) +
stat_ellipse(type = "norm", level = 0.95, linetype = 12, geom = "polygon", aes(fill = site), alpha = 0.05) +
theme_q2r() +
theme(text = element_text(size = 12), panel.grid.major = element_line(), panel.grid.minor = element_line(), axis.text = element_text(size = 12)) +
xlab(paste(round(100 * unwunifrac_seds$data$ProportionExplained[1], 2), "% - PC1")) +
ylab(paste(round(100 * unwunifrac_seds$data$ProportionExplained[2], 2), "% - PC2")) +
labs(size = "Shannon Diversity") # Set the legend title for size
PCA_unw_seds
```
Here is an explanation of some of the arguments in the PCA plotting command:
* `PCA_w_seds <- wunifrac_seds$data$Vectors %>%` : This part of the command selects the `Vectors` data from the `wunifrac_seds` object and begins a chain of operations.
* `select(SampleID, PC1, PC2) %>%` : selects three columns from the data: "SampleID," "PC1," and "PC2."
*`left_join(sediment_metadata) %>%`: performs a left join with the data from `sediment_metadata`, combining the selected columns from the previous step with the `SampleID` column.
* `left_join(shannon_seds) %>%` : similar to the previous step, performs a left join with the "shannon_seds" data, combining the result of the previous join with this data.
* `ggplot(aes(x = PC1, y = PC2, color = site, size = shannon_values_seds))` : This part starts creating a ggplot, where the `color` aesthetic is determined by `site`, and the `size` aesthetic is determined by `shannon_values_seds`.
* `stat_ellipse(type = "norm", level = 0.95, linetype = 1, geom = "polygon", aes(fill = site), alpha = 0.05)`: This adds ellipses to the plot that represent normal distribution-based confidence intervals (at 95% level) around the points. I chose linetype 1 for weighted PCAs and linetype 12 for unweighted PCAs.
* `xlab(paste(round(100 * wunifrac_seds$data$ProportionExplained[1], 2), "% - PC1"))` : Sets the x-axis label (and y-axis label in the next line) using the info from the "wunifrac_seds" object that denotes the proportion explained by the first principal component (PC1, or PC2 for the y-axis), converting it to a percentage.
Now plot the same for the seabird fecal samples:
```{r, eval=FALSE}
#plotting weighted PCA for seabird fecal samples
PCA_w_birds <- wunifrac_birds$data$Vectors %>%
select(SampleID, PC1, PC2) %>%
left_join(bird_metadata) %>%
left_join(shannon_birds) %>%
ggplot(aes(x = PC1, y = PC2, color = species, size = shannon_values_birds)) +
geom_point(alpha = 0.5) +
stat_ellipse(type = "norm", level = 0.95, linetype = 1, geom = "polygon", aes(fill = species), alpha = 0.05) +
theme_q2r() +
theme(text = element_text(size = 12), panel.grid.major = element_line(), panel.grid.minor = element_line(), axis.text = element_text(size = 12)) +
xlab(paste(round(100 * wunifrac_birds$data$ProportionExplained[1], 2), "% - PC1")) +
ylab(paste(round(100 * wunifrac_birds$data$ProportionExplained[2], 2), "% - PC2")) +
labs(size = "Shannon Diversity") # Set the legend title for size
PCA_w_birds
#plotting weighted PCA for seabird fecal samples
PCA_unw_birds <- unwunifrac_birds$data$Vectors %>%
select(SampleID, PC1, PC2) %>%
left_join(bird_metadata) %>%
left_join(shannon_birds) %>%
ggplot(aes(x = PC1, y = PC2, color = species, size = shannon_values_birds)) +
geom_point(alpha = 0.5) +
stat_ellipse(type = "norm", level = 0.95, linetype = 12, geom = "polygon", aes(fill = species), alpha = 0.05) +
theme_q2r() +
theme(text = element_text(size = 12), panel.grid.major = element_line(), panel.grid.minor = element_line(), axis.text = element_text(size = 12)) +
xlab(paste(round(100 * unwunifrac_birds$data$ProportionExplained[1], 2), "% - PC1")) +
ylab(paste(round(100 * unwunifrac_birds$data$ProportionExplained[2], 2), "% - PC2")) +
labs(size = "Shannon Diversity") # Set the legend title for size
PCA_unw_birds
```
Now that we have all 4 plots done for both unweighted and weighted metrics, lets put them all on the same figure using the `patchwork` package provided by R.
```{r, eval=FALSE}
install.packages("patchwork")
library(patchwork)
wrap_plots(PCA_w_seds, PCA_unw_seds, PCA_w_birds, PCA_unw_birds)
```
Make sure you expand your bottom left quadrant in RStudio as much as possible to accommodate the plot size, as you can see in this screenshot. Your wrapped PCA figure will look something like this:
```{r, echo=FALSE}
knitr::include_graphics('images/PCA_1.png')
```
You can see that we don't really need two legends for both unweighted and weighted versions of the same plot. Let's quickly fix the code to remove a legend from the two weighted plots to the left, so that all we have is legends to the right hand side of this figure by adding `legend.position="none"` to the `theme()` argument we already have going.
```{r, eval=FALSE}
# removing the legend for the sediment sample's weighted PCA
PCA_w_seds <- wunifrac_seds$data$Vectors %>%
select(SampleID, PC1, PC2) %>%
left_join(sediment_metadata) %>%
left_join(shannon_seds) %>%
ggplot(aes(x = PC1, y = PC2, color = site, size = shannon_values_seds)) +
geom_point(alpha = 0.5) +
stat_ellipse(type = "norm", level = 0.95, linetype = 1, geom = "polygon", aes(fill = site), alpha = 0.05) +
theme_q2r() +
theme(text = element_text(size = 12), panel.grid.major = element_line(), panel.grid.minor = element_line(), axis.text = element_text(size = 12), legend.position = "none") +
xlab(paste(round(100 * wunifrac_seds$data$ProportionExplained[1], 2), "% - PC1")) +
ylab(paste(round(100 * wunifrac_seds$data$ProportionExplained[2], 2), "% - PC2")) +
labs(size = "Shannon Diversity")
# removing the legend for the seabird fecal sample's weighted PCA
PCA_w_birds <- wunifrac_birds$data$Vectors %>%
select(SampleID, PC1, PC2) %>%
left_join(bird_metadata) %>%
left_join(shannon_birds) %>%
ggplot(aes(x = PC1, y = PC2, color = species, size = shannon_values_birds)) +
geom_point(alpha = 0.5) +
stat_ellipse(type = "norm", level = 0.95, linetype = 1, geom = "polygon", aes(fill = species), alpha = 0.05) +
theme_q2r() +
theme(text = element_text(size = 12), panel.grid.major = element_line(), panel.grid.minor = element_line(), axis.text = element_text(size = 12), legend.position = "none") +
xlab(paste(round(100 * wunifrac_birds$data$ProportionExplained[1], 2), "% - PC1")) +
ylab(paste(round(100 * wunifrac_birds$data$ProportionExplained[2], 2), "% - PC2")) +
labs(size = "Shannon Diversity") # Set the legend title for size
```
Now wrap the plots one more time to see the final result!
```{r, eval=FALSE}
wrap_plots(PCA_w_seds, PCA_unw_seds, PCA_w_birds, PCA_unw_birds)
```
```{r fig4, fig.cap="Raw version of the weighted PCA (two left plots) and unweighted PCA (two right plots) for both sediment samples (top) and seabird fecal samples (bottom)", echo=FALSE}
knitr::include_graphics('images/PCA_2.png')
```
## **Phyloseq**
[Phyloseq](https://joey711.github.io/phyloseq/) is a handy tool in microbiome analysis that helps researchers work with and visualize microbial community data **[19]**. It can be used alongside QIIME2 artifact files by converting QIIME2 outputs into a Phyloseq object. Like QIIME2R, it allows for advanced analysis, visualization, and exploration of microbiome data within R's programming environment.
install and load the needed packages
```{r, eval=FALSE}
install.packages("phyloseq")
library(phyloseq)
library(ggplot2)
library(tidyverse)
library(qiime2R)
```
### Build a phyloseq object
Next, create a phyloseq object for your data. These objects can contain several different .QZA files. In this case, they include the feature table, taxonomy file and metadata file, but they can also include the trees, insert-placements and other files which were created throughout the preprocessing and diversity analysis stages.
In these next commands, I build 2 phyloseq objects for (1) the seabird fecal samples, (2) the sediment samples. Make sure you are in the correct working directory, in this case `/QIIME_files`.
```{r, eval=FALSE}
#building an object for seabird fecal samples
birds_physeq<-qza_to_phyloseq(
features="final_table-birds.qza",
tree="asvs-tree_birds.qza","classification_birds.qza",
metadata = "metadata-birds.csv")
birds_physeq
#building an object for sediment core samples
sediments_physeq<-qza_to_phyloseq(
features="final_table-sediments.qza",
tree="asvs-tree_sediments.qza","classification_sediments.qza",
metadata = "metadata-sediments.csv")
sediments_physeq
```
### Taxonomic characteristics
If you want to find out how many taxa have been classified at different taxonomic levels within these objects, use the following commands:
```{r, eval=FALSE}
#for seabird fecal samples
tax_table(birds_physeq) %>%
as("matrix") %>%
as_tibble(rownames = "OTU") %>%
gather("Rank", "Name", rank_names(birds_physeq)) %>%
na.omit() %>% # remove rows with NA value
group_by(Rank) %>%
summarize(ntaxa = length(unique(Name))) %>% # computes number of unique taxa
mutate(Rank = factor(Rank, rank_names(birds_physeq))) %>%
arrange(Rank)
#for sediment samples
tax_table(sediments_physeq) %>%
as("matrix") %>%
as_tibble(rownames = "OTU") %>%
gather("Rank", "Name", rank_names(sediments_physeq)) %>%
na.omit() %>%
group_by(Rank) %>%
summarize(ntaxa = length(unique(Name))) %>%
mutate(Rank = factor(Rank, rank_names(sediments_physeq))) %>%
arrange(Rank)
```
The above command takes the taxonomic table in the `birds_physeq` object and performs a series of data manipulation operations, using the dplyr and tidyr packages in R. It first converts the taxonomic table to a matrix, then converts it to a tibble with row names set as `OTU` (used as a generic term here even though I am using ASVs). It gathers the data to create two columns, `Rank` and `Name`, from the taxonomic ranks present in `birds_physeq`. The`na.omit()` line removes rows with missing values, and then `group_by()` line groups the data by `Rank`, then calculates the number of unique taxa for each rank using the `summarize()` command. Finally, the `mutate()` command converts `Rank` into a factor variable with predefined rank names, and finally arranges the data by `Rank` in ascending order using the `arrange()` command. The output for the command will look something like this (these numbers are from running the command on the sediment sample phyloseq object):
```{r, eval=FALSE}
# A tibble: 7 × 2
Rank ntaxa
<fct> <int>
1 Kingdom 2
2 Phylum 57
3 Class 125
4 Order 244
5 Family 309
6 Genus 393
7 Species 160
```
### Plot alpha-diversity
What's nice about Phyloseq is that you can calculate and plot diversity metrics all in one command, so you actually don't even have to run any qiime diversity commands if you don't want to! I like the plots that this Phyloseq package produces more anyways, and they are easier to save as an .SVG image to edit later on for publishing.
Here I plot three diversity metrics together for the sediment samples according to site, using the following `plot_richness` command:
```{r, eval=FALSE}
sediments_richness <- plot_richness(sediments_physeq, x = "site", color="depth", nrow=1,
measures = c("Simpson", "Shannon", "Observed")) +
geom_point(size=2, alpha=0.5) +
geom_boxplot(alpha=0.3, size=0.3, colour="black") +
theme_bw()+
theme(text = element_text(size = 10), panel.grid.major.x=element_line(FALSE), axis.text = element_text(size=10), axis.text.x=(element_text(angle=90, vjust=0.5)))
```
The above command includes metrics for Simpson, Shannon, and observed features as subplots. I assign the plot to the term `sediments_richness` so that I can merge the diversity plots for both sediment and seabird samples together afterwards. Here is a quick review of the arguments I used here:
* `x="site"` will split the plot up into the 3 coring sites listed in the metadata file of the phyloseq object
* `color=depth` will color the points on the box plot so that diversity can be visualized across different depths of the core.
* The `nrow` argument let's you adjust the orientation of the plots, with 1 row the plots will end up in a horizontal line, with 3 rows the plots will end up in a vertical line.
* The `geom_points()` and `geom_boxplot()` arguments let you manipulate the size and transparency (alpha) of the box plot's features.
> When adding a theme to boxplot, there are several options that you can choose from (to see a list of the themes visit [this link](https://ggplot2.tidyverse.org/reference/ggtheme.html)). I chose the black and white theme (`theme_bw()`) for my plot, and then on the last line I use the `theme()` argument to specify the overall text size of the boxplot and legend, as well as the axis text size.
* The `panel.grid.major.x=element_line(FALSE)` argument to removes the vertical grid lines as they aren't necessary here.
* The `panel.grid.major.y=element_line()` argument can also be used to specify the `color()`, `linetype()` and `size()` of your grid lines.
* The `axis.text.x=(element_text(angle=90, vjust=0.5)` argument changes the orientation of the x-axis text so that it all fits better in the plot using the `angle` option, and the `vjust` option to adjust the vertical position of the labels so they are not overlapping the axis.
Next, do the same for the seabird fecal samples. This time plotting by species, without any colors for the points.
```{r, eval=FALSE}
birds_richness <- plot_richness(birds_physeq, x = "species", nrow=3, measures = c("Simpson", "Shannon", "Observed")) +
geom_point(size=2, alpha=0.5) +
geom_boxplot(alpha=0.3, size=0.3, colour="black") +
theme_bw()+
theme(text = element_text(size = 10), panel.grid.major.x=element_line(FALSE), axis.text = element_text(size=10), axis.text.x=(element_text(angle=90, vjust=0.5)))
```
Now we can make a larger plot for publishing that puts all of these plots together in one figure. You can always do this with plots using the `patchwork` package in R.
```{r, eval=FALSE}
install.packages("patchwork")
library(patchwork)
wrap_plots(birds_richness, sediments_richness, nrow=1)
```
Your final box plot should look like this:
```{r fig5, fig.cap="Raw version of the taxa boxplot showing alpha-diversity metrics for seabird fecal samples (left) and sediment samples (right)", echo=FALSE}
knitr::include_graphics('images/boxplots.png')
```
If you want to use other diversity metrics for a plot like this, Phyloseq offers all of the following metrics:
* `Observed`: The number of distinct OTUs (Operational Taxonomic Units) or taxa observed in a sample, reflecting species richness.
* `Chao1`: An estimate of total species richness, taking into account both observed species and the number of rare or singleton species. It provides a more robust estimate of richness.
* `ACE` (Abundance-based Coverage Estimator): Similar to Chao1, it estimates total species richness while considering the abundance distribution of rare species. It is often used in cases where there are many rare species.
* `Shannon`: Measures both richness and evenness of species in a sample. Higher values indicate higher diversity, considering both species number and their relative abundances.
* `Simpson`: Focuses on the dominance of a few species in a sample. Lower values indicate higher diversity, with less dominance by a single species.
* `InvSimpson` (Inverse Simpson): Similar to Simpson but transformed to emphasize the dominance of species. Higher values indicate higher diversity.
* `Fisher`: A rarefaction-based diversity index that estimates species richness while considering the number of individuals sampled. It is sensitive to rare species and is used to compare diversity between samples with different sampling efforts.
___
<hr />
<p style="text-align: center;">Github: [johannabosch](https://github.com/johannabosch) </a></p>
<p style="text-align: center;"><span style="color: #808080;"><em>yohannabosch@gmail.com</em></span></p>
<!-- Add icon library -->
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css">
<!-- Add font awesome icons -->
<p style="text-align: center;">
<a href="https://www.instagram.com/yohannabosch/" class="fa fa-instagram"></a>
<a href="https://www.linkedin.com/in/yan-holtz-2477534a/" class="fa fa-linkedin"></a>
<a href="https://github.com/johannabosch/" class="fa fa-github"></a>
</p>