-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy path05-explore.Rmd
285 lines (208 loc) · 10.3 KB
/
05-explore.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
```{r, echo=FALSE}
library(mia)
library(miaViz)
library(dplyr)
se <- readRDS("data/se.rds")
tse <- readRDS("data/tse.rds")
```
# Microbiome data exploration
Now we have loaded the data set into R. Next, let us walk through some
basic operations for data exploration to confirm that the data has all
the necessary components.
## Data structure
Let us now investigate how taxonomic profiling data is organized in R.
Dimensionality tells us how many taxa and samples the data
contains. As we can see, there are `r dim(tse)[1]` taxa and `r dim(tse)[2]`
samples.
```{r}
dim(tse)
```
The `rowData` slot contains a taxonomic table. This includes taxonomic
information for each of the `r nrow(tse)` entries. With the `head()`
command, we can print just the beginning of the table.
The `rowData` seems to contain information from `r ncol(rowData(tse))`
different taxonomy classes.
```{r}
knitr::kable(head(rowData(tse))) %>%
kableExtra::kable_styling("striped",
latex_options="scale_down") %>%
kableExtra::scroll_box(width = "100%")
```
The colData slot contains sample metadata. It contains information for all `r ncol(tse)` samples.
However, here only the 6 first samples are shown as we use the `head()` command. There
are `r ncol(colData(tse))` columns, that contain information, e.g., about patients' status, and cohort.
```{r}
knitr::kable(head(colData(tse))) %>%
kableExtra::kable_styling("striped",
latex_options="scale_down") %>%
kableExtra::scroll_box(width = "100%")
```
From here, we can draw summaries of the sample (column) data, for
instance to see what is the patient status distribution.
The command `colData(tse)$patient_status` fetches the data from the
column, and `table()` creates a table that shows how many times each
class is present, and `sort()` sorts the table to ascending order.
There are `r length(colData(tse)$patient_status[colData(tse)$patient_status == "ADHD"])`
samples from patients having ADHD,
and `r length(colData(tse)$patient_status[colData(tse)$patient_status == "Control"])` control samples.
```{r}
sort(table(colData(tse)$patient_status))
```
### Transformations
Microbial abundances are typically 'compositional' (relative) in the
current microbiome profiling data sets. This is due to technical
aspects of the data generation process (see e.g. [Gloor et al.,
2017](https://www.frontiersin.org/articles/10.3389/fmicb.2017.02224/full)).
The next example calculates relative abundances as these are usually easier to
interpret than plain counts. For some statistical models we need to
transform the data into other formats as explained in above link (and
as we will see later).
```{r}
# Calculates relative abundances, and stores the table to assays
tse <- transformCounts(tse, method = "relabundance")
```
A variety of standard transformations for microbiome data are available for `TSE` data objects through [mia R package](https://microbiome.github.io/mia/reference/transformCounts.html).
### Aggregation
Microbial species can be called at multiple taxonomic resolutions. We
can easily agglomerate the data based on taxonomic ranks. Here, we
agglomerate the data at Phylum level.
```{r 05explore_agglomerateByRank}
tse_phylum <- agglomerateByRank(tse, rank = "Phylum")
# Show dimensionality
dim(tse_phylum)
```
Now there are `r dim(tse_phylum)[1]` taxa and `r dim(tse_phylum)[2]`
samples, meaning that there are `r dim(tse_phylum)[1]` different
Phylum level taxonomic groups. Looking at the `rowData` after
agglomeration shows all Firmicutes are combined together, and all
lower rank information is lost.
From the assay we can see that all abundances of taxa that belong to
Firmicutes are summed up.
```{r}
knitr::kable(head(rowData(tse_phylum))) %>%
kableExtra::kable_styling("striped",
latex_options="scale_down") %>%
kableExtra::scroll_box(width = "100%")
```
If you are sharp, you have by now noticed that all the aggregated
values in the above example are NA's (missing data). This is because
the agglomeration is missing abundances for certain taxa, and in that
case the sum is not defined by default (`na.rm = FALSE`). We can
ignore the missing values in summing up the data by setting `na.rm =
TRUE`; then the taxa that do not have information in specified level
will be removed. Those taxa that do not have information in specified
level are agglomerated at lowest possible level that is left after
agglomeration.
```{r}
temp <- rowData(agglomerateByRank(tse, rank = "Genus"))
# Prints those taxa that do not have information at the Genus level
knitr::kable(head(temp[temp$Genus == "",])) %>%
kableExtra::kable_styling("striped",
latex_options="scale_down") %>%
kableExtra::scroll_box(width = "100%")
```
Here agglomeration is done similarly, but na.rm = TRUE
```{r}
temp2 <- rowData(agglomerateByRank(tse, rank = "Genus", na.rm = TRUE))
print(paste0("Agglomeration with na.rm = FALSE: ", dim(temp)[1], " taxa."))
print(paste0("Agglomeration with na.rm = TRUE: ", dim(temp2)[1], " taxa."))
```
The [mia
package](https://microbiome.github.io/mia/reference/index.html)
contains further examples on various data agglomeration and splitting
options.
## Visualization
The [miaViz package](https://microbiome.github.io/miaViz/) facilitates
data visualization. Let us plot the Phylum level abundances.
```{r}
# Here we specify "relabundance" to be abundance table that we use for plotting.
# Note that we can use agglomerated or non-agglomerated tse as an input, because
# the function agglomeration is built-in option.
# Legend does not fit into picture, so its height is reduced.
plot_abundance <- plotAbundance(tse, abund_values="relabundance", rank = "Phylum") +
theme(legend.key.height = unit(0.5, "cm")) +
scale_y_continuous(label = scales::percent)
plot_abundance
```
**Density plot** shows the overall abundance distribution for a given
taxonomic group. Let us check the relative abundance of Firmicutes
across the sample collection. The density plot is a smoothened
version of a standard histogram.
The plot shows peak abundances around 30 %.
```{r}
# Subset data by taking only Firmicutes
tse_firmicutes <- tse_phylum["Firmicutes"]
# Gets the abundance table
abundance_firmicutes <- assay(tse_firmicutes, "relabundance")
# Creates a data frame object, where first column includes abundances
firmicutes_abund_df <- as.data.frame(t(abundance_firmicutes))
# Rename the first and only column
colnames(firmicutes_abund_df) <- "abund"
# Creates a plot. Parameters inside feom_density are optional. With
# geom_density(bw=1000), it is possible to adjust bandwidth.
firmicutes_abund_plot <- ggplot(firmicutes_abund_df, aes(x = abund)) +
geom_density(color="darkred", fill="lightblue") +
labs(x = "Relative abundance", title = "Firmicutes") +
theme_classic() + # Changes the background
scale_x_continuous(label = scales::percent)
firmicutes_abund_plot
```
```{r, echo=FALSE}
# # Does the same thing but differently
# # Calculates the density. Bandwidth can be adjusted; here, it is 0.065.
# # density() is from stats package
# density_firmicutes <- density(abundance_firmicutes, bw = 0.065)
#
# # Plots the density
# plot(density_firmicutes,
# xlab="Relative abundance",
# ylab="Density",
# main=paste0("Firmicutes (",density_firmicutes$n, " obs, ",
# density_firmicutes$bw, " bw)"))
```
For more visualization options and examples, see the [miaViz vignette](https://microbiome.github.io/miaViz/articles/miaViz.html).
## Exercises (optional)
Explore some of the following questions on your own by following
[online examples](https://microbiome.github.io/OMA/). Prepare a
reproducible report (Rmarkdown), and include the code that you use to
import the data and generate the analyses.
* **Abundance table** Retrieve the taxonomic abundance table from the
example data set (TSE object). Tip: check "assays" in [data import
section](https://microbiome.github.io/OMA/data-introduction.html#loading-experimental-microbiome-data)
* How many different samples and genus-level groups this phyloseq
object has? Tips: see dim(), rowData()
* What is the maximum abundance of Akkermansia in this data set? Tip:
aggregate the data to Genus level with agglomerateByRank, pick
abundance assay, and check a given genus (row) in the assay
* Draw a histogram of library sizes (total number of reads per
sample). Tip: Library size section in
[OMA](https://microbiome.github.io/OMA/quality-control.html). You
can use the available function, or count the sum of reads per
sample by using the colSums command applied on the abundance
table. Check [Vandeputte et
al. 2017](https://www.nature.com/articles/nature24460) for further
discussion on the differences between absolute and relative
quantification of microbial abundances.
* **Taxonomy table** Retrieve the taxonomy table and print out the
first few lines of it with the R command head(). Investigate how
many different phylum-level groups this phyloseq object has? Tips:
rowData, taxonomicRanks in
[OMA](https://microbiome.github.io/OMA/taxonomic-information.html#functions-to-access-taxonomic-information).
* **Sample metadata** Retrieve sample metadata. How many patient
groups this data set has? Draw a histogram of sample
diversities. Tips: colData
* **Subsetting** Pick a subset of the data object including only
ADHD individuals from Cohort 1. How many there are? Tips: subsetting in [OMA](https://microbiome.github.io/OMA/datamanipulation.html#subsetting)
* **Transformations** The data contains read counts. We can convert
these into relative abundances and other formats. Compare abundance
of a given taxonomic group using the example data before and after
the compositionality transformation (with a cross-plot, for
instance). You can also compare the results to CLR-transformed data
(see e.g. [Gloor et
al. 2017](https://www.frontiersin.org/articles/10.3389/fmicb.2017.02224/full))
* **Visual exploration** Visualize the population distribution of
abundances for certain taxonomic groups. Do the same for
CLR-transformed abundances. Tip: assays, transformCounts
* Experiment with other data manipulation tools from
[OMA](https://microbiome.github.io/OMA/taxonomic-information.html#functions-to-access-taxonomic-information).
* Example solution: [Solutions](06-3-ex-sol-ADHD.html)