Skip to content

Latest commit

 

History

History
825 lines (670 loc) · 33.6 KB

High-dimensional testing lab tutorial.md

File metadata and controls

825 lines (670 loc) · 33.6 KB

High-dimensional testing tutorial

Omnibus tests on data from the HMP2 baseline only data


  • What are omnibus tests and how do they differ from featurewise tests?
  • How might these complement each other?
Answer:

Omnibus tests are global tests, here of the microbial community diversity. Especially in smaller studies, it's possible that no individual features reach significance, but some microbiome association can nonetheless be statistically shown.


In this tutorial, we are going to take data from the second phase of the Human Microbiome Project, and look at some common exploratory analyses which search for evidence of general microbiome associations with cohort metadata. These tests are frequently the first we perform on a new feature table and can give a good idea of things such as the overall strength of an intervention or the degree to which covariates might confound the main microbiome outcomes of interest.

We are starting with the baseline timepoint, metagenome-derived species data, since species tables (+ associated metadata) are what you will likely most commonly encounter, and longitudinal designs require more advanced statistical techniques beyond the scope of this tutorial. See the full HMP2 manuscript for how within-subject correlations were studed (especially wrt PERMANOVA).

Start R and set the working directory. In the bioBakery VM, as follows:

R
setwd("~/Tutorials/highdimtesting")

Load the R packages needed:

library(vegan)

MGX taxonomy

Feature table and metadata table creation and formatting

Download the MGX taxonomy relative abundance data and put it into the data directory:

download.file("https://raw.githubusercontent.com/biobakery/omnibus-and-maaslin2-rscripts-and-hmp2-data/master/taxonomic_profiles_pcl_week0.csv", "./Data/taxonomic_profiles_pcl_week0.csv")

Note that check.names = F is supplied so that invalid or repeated column names are not flagged as errors or automatically "corrected". This is useful when you want to make sure your output always matches the original input file formating, but can be dangerous.

Read the taxonomic relative abundance data into your R environment:

tax = read.csv(file = "./Data/taxonomic_profiles_pcl_week0.csv", header = T, row.names = 1, check.names = F)

Take a look at tax:

tax[1:10,1:10]

  • Why not use head(tax)? Try it if you are unsure.
Answer:

head cuts off at a certain number of rows, but not columns.


Check the dimensions:

dim(tax)
[1]   96 1484

Extract the metadata from tax, the first 5 columns in this file:

metadata = tax[1:5]

  • What are the advantages and disadvantages of storing and accessing data like this?
Answer:

A single file can be easier to deal with and is required for some tools, but accessing data with static indicies can cause problems if the file is changed (e.g. if another metadata column is added then tax[1:5] is not correct).


Take a look at metadata:

head(metadata)
            site_name    sex  race consent_age diagnosis
CSM67UH7 Cedars-Sinai   Male White          69    nonIBD
CSM79HHW Cedars-Sinai Female White          36        CD
HSM67VDT   Cincinnati   Male White          11    nonIBD
HSM6XRQB   Cincinnati Female White          16        CD
HSM6XRR3   Cincinnati   Male White          13        CD
HSM7J4LP   Cincinnati Female White          12        CD

Check the structure:

str(metadata)
'data.frame':	96 obs. of  5 variables:
 $ site_name  : chr  "Cedars-Sinai" "Cedars-Sinai" "Cincinnati" "Cincinnati" ...
 $ sex        : chr  "Male" "Female" "Male" "Female" ...
 $ race       : chr  "White" "White" "White" "White" ...
 $ consent_age: int  69 36 11 16 13 12 16 14 30 57 ...
 $ diagnosis  : chr  "nonIBD" "CD" "nonIBD" "CD" ...

Check for NAs in metadata, since these will cause issues later in PERMANOVA tests:

sapply(metadata, function(x) sum(is.na(x)))
site_name         sex        race consent_age   diagnosis 
        0           0           0           6           0 

Age has 6 NA.

If this was a discrete variable we could classify the NAs as "unknown" and keep them in the model, but since age is a continuous variable typically we would either remove those samples from the data or impute.


  • What is the main drawback of keeping NA values for discrete variables?
  • Is there a case where this is totally justified?
Answer:

Adding another category means increasing the degrees of freedom in downstream statistical tests. You may however want to specifically investigate if missing values are meaningful (e.g. failure to report a value can indicate study attrition which may be caused by an adverse reaction).


In this case, let's impute with the median:

metadata$consent_age[is.na(metadata$consent_age)] = median(metadata$consent_age, na.rm = T)

Extract species data from tax and transpose the dataframe so that taxa are in rows:

species = as.data.frame(t(tax[6:ncol(tax)]))

Note that transpose returns a matrix, so we need to coerce it back into a dataframe.

Check that species is all numeric values:

all(sapply(species, is.numeric))

Take a look at the output:

species[1:8,1:2]
                                                                                      CSM67UH7
k__Archaea                                                                                                                                                                                        0
k__Archaea|p__Euryarchaeota                                                                                                                                                                       0
k__Archaea|p__Euryarchaeota|c__Methanobacteria                                                                                                                                                    0
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales                                                                                                                              0
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae                                                                                                       0
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter                                                                                 0
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_smithii                                                   0
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_smithii|t__Methanobrevibacter_smithii_unclassified        0
                                                                                                                                                                                           CSM79HHW
k__Archaea                                                                                                                                                                                        0
k__Archaea|p__Euryarchaeota                                                                                                                                                                       0
k__Archaea|p__Euryarchaeota|c__Methanobacteria                                                                                                                                                    0
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales                                                                                                                              0
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae                                                                                                       0
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter                                                                                 0
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_smithii                                                   0
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_smithii|t__Methanobrevibacter_smithii_unclassified        0

As we can see here, species currently has all taxonomic levels for each microbe. For now, we want to only keep the species level information.

First, grep the rows that do not include strain stratifications and store them in a new temporary vector:

tmp_ind = grep("\\|t__", row.names(species), invert = T)

Create a new dataframe with only those row numbers in tmp_ind:

tmp = species[tmp_ind,]

Verify that strains were removed and skim the output to make sure nothing unexpected happened:

grep("\\|t__", row.names(tmp), invert = F)
row.names(tmp)[1:10]
integer(0)
  [1] "k__Archaea"                                                                                                                                                                           
  [2] "k__Archaea|p__Euryarchaeota"                                                                                                                                                          
  [3] "k__Archaea|p__Euryarchaeota|c__Methanobacteria"                                                                                                                                       
  [4] "k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales"                                                                                                                 
  [5] "k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae"                                                                                          
  [6] "k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter"                                                                    
  [7] "k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_smithii"                                      
  [8] "k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_unclassified"                                 
  [9] "k__Bacteria"                                                                                                                                                                          
 [10] "k__Bacteria|p__Acidobacteria"

  • What will happen if there is no strain level information in the starting file?
Answer:

There's no issue, but it's always good to check that your code can handle different situations. MetaPhlAn doesn't alway return strain stratifications, so it's is particularly important in this case.


Now, let's select only the species level stratifications, removing all taxonomic levels before it:

tmp_ind = grep("\\|s__", row.names(tmp))

Create a new dataframe with only those row numbers in tmp_ind. Make sure to select from tmp, not species:

species = tmp[tmp_ind,]

Check the output to make sure that we only have species level stratifications:

row.names(species)[1:10]
  [1] "k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_smithii"                                      
  [2] "k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_unclassified"                                 
  [3] "k__Bacteria|p__Acidobacteria|c__Acidobacteriia|o__Acidobacteriales|f__Acidobacteriaceae|g__Granulicella|s__Granulicella_unclassified"                                                 
  [4] "k__Bacteria|p__Acidobacteria|c__Acidobacteriia|o__Acidobacteriales|f__Acidobacteriaceae|g__Terriglobus|s__Terriglobus_unclassified"                                                   
  [5] "k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinobaculum|s__Actinobaculum_schaalii"                                                    
  [6] "k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinobaculum|s__Actinobaculum_unclassified"                                                
  [7] "k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinomyces|s__Actinomyces_europaeus"                                                       
  [8] "k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinomyces|s__Actinomyces_graevenitzii"                                                    
  [9] "k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinomyces|s__Actinomyces_odontolyticus"                                                   
 [10] "k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinomyces|s__Actinomyces_turicensis"    

Now we have only species level information in species, but let's make the names shorter for display by trimming off all other taxonomic information:

row.names(species) = gsub(".*\\|", "", row.names(species))

See if the row names look as expected:

row.names(species)[1:6]
  [1] "s__Methanobrevibacter_smithii"                 "s__Methanobrevibacter_unclassified"            "s__Granulicella_unclassified"                 
  [4] "s__Terriglobus_unclassified"                   "s__Actinobaculum_schaalii"                     "s__Actinobaculum_unclassified"                

Now that we have successfully truncated the names in species, let's check the sample sums (colSums) to make sure they are in proportion format (0-1) and are all ~1:

colSums(species)[1:6]
   CSM67UH7    CSM79HHW    HSM67VDT    HSM6XRQB    HSM6XRR3    HSM7J4LP 
  0.9999752   1.0000004   1.0000005   0.9993355   0.9996256   1.0000003 

Our downstream analysis require relative abundances so these were previously (as is default for MetaPhlAn) total sum scaled.

It is often useful to filter out low prevalence/abundance features in 'omics data, so we will make a filtered copy of the species table.

Check the dimensions of species pre-filtering:

dim(species)
[1] 572  96

Filter:

species_filt = species[apply(species, 1, function(x) sum(x > 0.0001) > 0.1 * ncol(species)), ]

  • What exactly is this command doing? What does 0.0001 represent? 0.1? Note that you can break the command apart to poke at the output, e.g. run apply(species, 1, function(x) sum(x > 0.0001) first.
Answer:

This filters out species with less than 0.0001 relative abundance in 10% of samples.



  • Why perform this type of filtering? Are there situations were it might not be appropriate?
Answer:

Low prevalence/abundance features are often not meaningful and can reduce power of some tests. However, one needs to be certain this isn't removing real biological variation. Alpha diversity is sensitive to filtering (so we generally don't) and some datasets have important low prevalence taxa or outlier samples.


Check the dimensions of species post-filtering:

dim(species_filt)
[1] 115  96

Let's transpose the dataframes for easier use downstream, making the rows be the samples, like in the metadata:

species_filt = as.data.frame(t(species_filt), check.names = F)
species = as.data.frame(t(species), check.names = F)

Make sure the sample names are in the same order in both the metadata and the species dataframes:

all(row.names(metadata) == row.names(species_filt))
all(row.names(metadata) == row.names(species))

Not all tools will check that samples align correctly, so you can easily get results that "look correct" but aren't if you don't do this!

NOTE: These formatted files are also located in the Tutorials/highdimtesting directory of the bioBakery image. To work with them from there assign them in R with the following code:

metadata = read.csv(file = "metadata.csv", header = T, row.names = 1, check.names = F)
species = read.csv(file = "species.csv", header = T, row.names = 1, check.names = F)
species_filt = read.csv(file = "species_filt.csv", header = T, row.names = 1, check.names = F)

Alpha diversity


  • What is alpha diversity, and why is it important?
Answer:

Alpha diversity summarizes the structure of an ecological community with respect to its richness (roughly, number of features), evenness (roughly, distribution of features), or both. Many perturbations to the microbiome affect this.


Using the unfiltered species data, calculate a common alpha diversity index:

alpha_div = as.data.frame(diversity(species, index = "shannon"))

You can check which indices the vegan package supports with ?diversity. We often also use Inverse Simpson, which gives similar results. However, there are several more you might see in the literature, with varying pros and cons, e.g. Chao1.

Check the output:

head(alpha_div)
         diversity(species, index = "shannon")
CSM67UH7                             1.7906860
CSM79HHW                             2.5342475
HSM67VDT                             2.2925910
HSM6XRQB                             2.5677991
HSM6XRR3                             2.2772366
HSM7J4LP                             0.3959699

Rename the columns to shorter names:

names(alpha_div) = c("Shannon")

Collate with the metadata, so that a single dataframe can be provided to upcoming statistical tests:

alpha_meta <- merge(alpha_div, metadata, by = "row.names")

Note that the merge function makes the old row names into new column. This might not be what you expect, but you can see that this is intended with ?merge.

Restore the row names (sample IDs) and get rid of the column generated by the merge:

row.names(alpha_meta) = alpha_meta$Row.names
alpha_meta$Row.names = NULL

Confirm this looks as expected:

head(alpha_meta)
            Shannon    site_name    sex  race consent_age diagnosis
CSM5FZ3N_P 1.741766 Cedars-Sinai Female White          43        CD
CSM5FZ3T_P 1.634631 Cedars-Sinai Female White          76        CD
CSM5FZ4A_P 2.117305 Cedars-Sinai Female White          47        UC
CSM5MCTZ_P 2.665711 Cedars-Sinai   Male White          32        UC
CSM5MCU4_P 1.119533 Cedars-Sinai Female White          53        CD
CSM5MCV1_P 1.478073 Cedars-Sinai Female White          20        CD

Univariable

Alpha diversity is relatively well-behaved and generates a single value per sample, allowing for a number of statistical tests. Linear models are straightforward and can accommodate a large variety of experimental designs, so we will go with that.

Run a linear model on diagnosis and return as an ANOVA table:

anova(lm(Shannon ~ diagnosis, data = alpha_meta))
Analysis of Variance Table

Response: Shannon
          Df  Sum Sq Mean Sq F value Pr(>F)
diagnosis  2  0.4046 0.20231  0.6692 0.5146
Residuals 93 28.1163 0.30233               

Can do this with a for loop to quickly get results for each metadata variable:

for (col in names(metadata)) {
  alpha_shannon_univ = anova(lm(as.formula(paste("Shannon ~ ", col)), data = alpha_meta)) 
  print(col)
  print(alpha_shannon_univ)
}
[1] "site_name"
Analysis of Variance Table

Response: Shannon
          Df  Sum Sq Mean Sq F value Pr(>F)
site_name  4  1.6732 0.41831  1.4179 0.2344
Residuals 91 26.8477 0.29503               
[1] "sex"
Analysis of Variance Table

Response: Shannon
          Df  Sum Sq Mean Sq F value  Pr(>F)  
sex        1  0.9209 0.92088  3.1363 0.07981 .
Residuals 94 27.6001 0.29362                  
---
Signif. codes:  0 `***` 0.001 `**` 0.01 `*` 0.05 `.` 0.1 ` ` 1
[1] "race"
Analysis of Variance Table

Response: Shannon
          Df  Sum Sq Mean Sq F value Pr(>F)
race       3  0.9266 0.30886  1.0297 0.3832
Residuals 92 27.5944 0.29994               
[1] "consent_age"
Analysis of Variance Table

Response: Shannon
            Df  Sum Sq  Mean Sq F value Pr(>F)
consent_age  1  0.0105 0.010538  0.0347 0.8525
Residuals   94 28.5104 0.303302               
[1] "diagnosis"
Analysis of Variance Table

Response: Shannon
          Df  Sum Sq Mean Sq F value Pr(>F)
diagnosis  2  0.4046 0.20231  0.6692 0.5146
Residuals 93 28.1163 0.30233               

Multivariable


  • What are univariable and multivariable tests, and what benefits do both provide?
Answer:

Univariable tests have only one independent variable and multivariable have more. Note that multivariate formally refers to multiple dependent variables, though these terms are often used interchangeably.


Try a model with the age and diagnosis:

anova(lm(Shannon ~ diagnosis + sex, data = alpha_meta))
Analysis of Variance Table

Response: Shannon
          Df  Sum Sq Mean Sq F value Pr(>F)
diagnosis  2  0.4046 0.20231  0.6808 0.5087
sex        1  0.7781 0.77807  2.6184 0.1091
Residuals 92 27.3383 0.29716  

If all variables in the file are used, there's a nice shorthand in R:

anova(lm(Shannon ~ ., data = alpha_meta))
Analysis of Variance Table

Response: Shannon
            Df  Sum Sq Mean Sq F value Pr(>F)
site_name    4  1.6732 0.41831  1.4118 0.2371
sex          1  0.4001 0.40012  1.3504 0.2485
race         3  0.6939 0.23130  0.7807 0.5080
consent_age  1  0.7356 0.73559  2.4826 0.1189
diagnosis    2  0.1294 0.06472  0.2184 0.8042
Residuals   84 24.8887 0.29629               

  • This can be expanded to include random effects with the lme4 package. Is there anything in this demo data that might indicate a mixed model would be useful?
Answer:

site_name is the location where the sample was collected and is likely better considered a random effect, since we aren't necessarily interested in how the sites vary but rather controlling for the overall effect of sampling site (of theoretically many).


Beta Diversity


  • What is beta diversity, and why is it important?
Answer:

Beta diversity quantifies how similar (or dissimilar) pairs of samples are in composition. This can be used to create a map of distances between samples (e.g. PCoA plot) and the significance of grouping or gradients of samples can be tested.


Using the filtered species table, calculate Bray-Curtis dissimilarity:

bray_species = vegdist(species_filt, method = "bray")

Bray-Curtis is commonly used for microbiome data because it incorporates presence vs. absence and relative abundance of features when determining dissimilarities (i.e. is weighted). Another common index for species data is Weighted UniFrac, which additionally includes phylogenetic information.

Univariable

Unlike alpha diversity, beta diversity is not defined on a single sample level, and therefore, many common statistical tests are not usable.


  • Why bother with beta diversity then?
Answer:

While one could take a difference of alpha diversity between samples and get a number, this is not particularly robust or meaningful. Alpha diversity trends are also in general more sensitive to difficult to control or arbitrary factors, such as sampling depth and technical noise, and various alpha diveristy indicies can give very different results, with no clear best practice.


PERMANOVA is the most common omnibus beta diversity test and is implemented as adonis2 by the vegan package. Note that adonis is now officially deprecated and older code that uses this function may act strangely.

Run a PERMANOVA and print the results:

set.seed(123)
adonis2(bray_species ~ diagnosis, data = metadata)
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999

adonis2(formula = bray_species ~ diagnosis, data = metadata)
          Df SumOfSqs      R2      F Pr(>F)  
diagnosis  2   0.9399 0.03255 1.5646   0.02 *
Residual  93  27.9347 0.96745                
Total     95  28.8747 1.00000                
---
Signif. codes:  `***` 0.001 `**` 0.01 `*` 0.05 `.` 0.1 ` ` 1

As you can see, the results format looks quite similar to the more familiar ANOVA test.


  • Why is a seed set before calling adonis2? Try running it again without resetting the seed.
Answer:

PERMANOVA determines p-values via permutation tests, so unless a seed is set, the values will be different each time you run it.


Can do this with a for loop as well:

for (col in names(metadata)) {
  set.seed(123)
  adonis_univ <- adonis2(as.formula(paste("bray_species ~ ", col)), data = metadata)
  print(col)
  print(adonis_univ)
}
[1] "site_name"
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999

adonis2(formula = as.formula(paste("bray_species ~ ", col)), data = metadata)
          Df SumOfSqs      R2      F Pr(>F)  
site_name  4   1.6651 0.05767 1.3922  0.021 *
Residual  91  27.2095 0.94233                
Total     95  28.8747 1.00000                
---
Signif. codes:  0 `***` 0.001 `**` 0.01 `*` 0.05 `.` 0.1 ` ` 1
[1] "sex"
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999

adonis2(formula = as.formula(paste("bray_species ~ ", col)), data = metadata)
         Df SumOfSqs      R2      F Pr(>F)  
sex       1   0.5925 0.02052 1.9694  0.013 *
Residual 94  28.2821 0.97948                
Total    95  28.8747 1.00000                
---
Signif. codes:  0 `***` 0.001 `**` 0.01 `*` 0.05 `.` 0.1 ` ` 1
[1] "race"
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999

adonis2(formula = as.formula(paste("bray_species ~ ", col)), data = metadata)
         Df SumOfSqs      R2      F Pr(>F)
race      3   0.7778 0.02694 0.8489  0.771
Residual 92  28.0969 0.97306              
Total    95  28.8747 1.00000              
[1] "consent_age"
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999

adonis2(formula = as.formula(paste("bray_species ~ ", col)), data = metadata)
            Df SumOfSqs      R2      F Pr(>F)  
consent_age  1   0.4683 0.01622 1.5497  0.067 .
Residual    94  28.4063 0.98378                
Total       95  28.8747 1.00000                
---
Signif. codes:  0 `***` 0.001 `**` 0.01 `*` 0.05 `.` 0.1 ` ` 1
[1] "diagnosis"
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999

adonis2(formula = as.formula(paste("bray_species ~ ", col)), data = metadata)
          Df SumOfSqs      R2      F Pr(>F)  
diagnosis  2   0.9399 0.03255 1.5646   0.02 *
Residual  93  27.9347 0.96745                
Total     95  28.8747 1.00000                
---
Signif. codes:  0 `***` 0.001 `**` 0.01 `*` 0.05 `.` 0.1 ` ` 1

  • What does the R2 value tell us?
  • Biologically, when might you expect a large R2? How about a small but significant R2?
Answer:

R2 is proportion of the variance for a dependent variable that's explained by an independent variable. You might expect a large R2 for a distinct perturbation like recent antibiotic treatment and a small but significant R2 for a complex population level trend, such as urbanicity in childhood.


Multivariable

Try a model with age and diagnosis:

set.seed(123)
adonis2(bray_species ~ consent_age + diagnosis, data = metadata)
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999

adonis2(formula = bray_species ~ consent_age + diagnosis, data = metadata)
            Df SumOfSqs      R2      F Pr(>F)  
consent_age  1   0.4683 0.01622 1.5696  0.060 .
diagnosis    2   0.9569 0.03314 1.6035  0.015 *
Residual    92  27.4495 0.95064                
Total       95  28.8747 1.00000                
---
Signif. codes:  0 `***` 0.001 `**` 0.01 `*` 0.05 `.` 0.1 ` ` 1

PERMANOVA is by default sequential, but it's possible to do a marginal test:

set.seed(123)
adonis2(bray_species ~ consent_age + diagnosis, data = metadata, by = "margin")
Permutation test for adonis under reduced model
Marginal effects of terms
Permutation: free
Number of permutations: 999

adonis2(formula = bray_species ~ consent_age + diagnosis, data = metadata, by = "margin")
            Df SumOfSqs      R2      F Pr(>F)  
consent_age  1   0.4853 0.01681 1.6264  0.048 *
diagnosis    2   0.9569 0.03314 1.6035  0.015 *
Residual    92  27.4495 0.95064                
Total       95  28.8747 1.00000                
---
Signif. codes:  0 `***` 0.001 `**` 0.01 `*` 0.05 `.` 0.1 ` ` 1

Note that this can get very computationally expensive as more terms are added.


  • What is the difference between these two models?
Answer:

The marginal model is agnostic to the ordering of terms. For PERMANOVA, it can be generated by hand by running the sequential model twice (see that the second term, diagnosis, has the same output both times). The trade-off is that the R2 do not necessarily add up to 1.


Can use the same shorthand as with linear models to include all variables:

set.seed(123)
adonis2(bray_species ~ ., data = metadata)
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999

adonis2(formula = bray_species ~ ., data = metadata)
            Df SumOfSqs      R2      F Pr(>F)  
site_name    4   1.6651 0.05767 1.4195  0.018 *
sex          1   0.4848 0.01679 1.6531  0.044 *
race         3   0.8897 0.03081 1.0113  0.438  
consent_age  1   0.3545 0.01228 1.2088  0.230  
diagnosis    2   0.8470 0.02933 1.4441  0.051 .
Residual    84  24.6335 0.85312                
Total       95  28.8747 1.00000                
---
Signif. codes:  0 `***` 0.001 `**` 0.01 `*` 0.05 `.` 0.1 ` ` 1

Interaction terms are also possible:

set.seed(123)
adonis2(bray_species ~ diagnosis * sex, data = metadata)
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999

adonis2(formula = bray_species ~ diagnosis * sex, data = metadata)
              Df SumOfSqs      R2      F Pr(>F)  
diagnosis      2   0.9399 0.03255 1.5909  0.015 *
sex            1   0.5614 0.01944 1.9004  0.013 *
diagnosis:sex  2   0.7874 0.02727 1.3327  0.059 .
Residual      90  26.5860 0.92074                
Total         95  28.8747 1.00000                
---
Signif. codes:  0 `***` 0.001 `**` 0.01 `*` 0.05 `.` 0.1 ` ` 1

  • Is this result reasonable?
  • What factors might complicate this relationship?
Answer:

There are known epidemiological incidence and prevalence sex differences in IBD, but they are not straightforward and are confounded by other factors such as age and type of IBD (UC vs CD) as well as social conditions (i.e sexism). Also, some caution should be taken when using and interpreting interaction effects in PERMANOVA models since these are not as developed or rigorously tested as in ANOVA.


Next steps

Hopefully upon finishing this tutorial, you can apply these concepts and reuse some of this code on your own projects. We also encourage you to try visualizing your results with something like the tidyverse package set.

If you'd like to continue practing on the HMP2 data, metagenomic pathways and metatranscriptomic pathways are available:

download.file("https://raw.githubusercontent.com/biobakery/omnibus-and-maaslin2-rscripts-and-hmp2-data/master/dna_pathabundance_relab_pcl_week0.csv", "./Data/dna_pathabundance_relab_pcl_week0.csv")
download.file("https://raw.githubusercontent.com/biobakery/omnibus-and-maaslin2-rscripts-and-hmp2-data/master/rna_pathabundance_relab_pcl_week0.csv", "./Data/rna_pathabundance_relab_pcl_week0.csv")

Note that these will not exactly run with the above steps, but they should be close enough that you can figure out the differences. Some things to consider are:

  • How are the pathways formatted and do you want to keep all of this information or no?
  • Are metadata and data dimensions guaranteed to match and how can you handle this if not?