Skip to content

Differential abundance (DA) and correlation analyses for microbial absolute abundance data

Notifications You must be signed in to change notification settings

microbiome/ANCOMBC

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

57 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ANCOMBC

Author & Maintainer: Huang Lin: huanglinfrederick@gmail.com

ANCOMBC is a package containing differential abundance (DA) and correlation analyses for microbiome data. Specifically, the package includes Analysis of Compositions of Microbiomes with Bias Correction 2 (ANCOM-BC2, submitted), Analysis of Compositions of Microbiomes with Bias Correction (ANCOM-BC), and Analysis of Composition of Microbiomes (ANCOM) for DA analysis, and Sparse Estimation of Correlations among Microbiomes (SECOM) for correlation analysis. Microbiome data are typically subject to two sources of biases: unequal sampling fractions (sample-specific biases) and differential sequencing efficiencies (taxon-specific biases). Methodologies included in the ANCOMBC package are designed to correct these biases and construct statistically consistent estimators.

To install the latest release version of ANCOMBC

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("ANCOMBC")

Usage

library(ANCOMBC)
?ancombc2
?ancombc 
?ancom
?secom_linear
?secom_dist

Commonly asked questions

1. Q: What are the differences between the formula and group arguments in ancombc and ancombc2?

A: The formula and group arguments serve different purposes in the ancombc and ancombc2 functions. Here's a breakdown of their differences:

  1. formula: This argument is used to specify the variables in your experiment that can potentially influence microbial abundances. It is essential to include all relevant variables in the formula to ensure proper adjustment and accurate results. For example, if you have a continuous variable like age as your main variable of interest, and you have additional categorical variables that need adjustment but are not directly related to your research question, you can include them in the formula while leaving group as NULL.

  2. group: The group argument is optional and should only be specified if you want to detect structural zeros (presence/absence test) or perform multi-group comparisons, such as the global test, pairwise directional test, Dunnett's type of test, or trend test. If your variable of interest is a categorical variable with more than three levels and you want to conduct multi-group comparisons, you should include the group argument. It is important to note that group is not the same as main_var in ancom. In ancombc and ancombc2, group is used for multi-group comparisons and correction of p-values for multiple comparisons.

Remember not to include the main_var in the adj_formula in ancom, but always include group in the formula or fix_formula (in ancombc and ancombc2, respectively) if group is not NULL. This ensures that the appropriate adjustments and comparisons are made in the analysis.

2. Q: Why are some taxa absent from the primary results?

A: There are a couple of reasons why certain taxa may be absent from the primary results in ancombc or ancombc2. Here's an explanation:

  1. Prevalence Exclusion: Taxa with prevalences below the specified threshold (prv_cut) will be excluded from the analysis. The prv_cut value determines the minimum prevalence required for a taxon to be considered in the analysis. If a taxon's prevalence falls below this threshold, it will not be included in the primary results.

  2. Structural Zeros: Taxa that exhibit structural zeros, meaning they consistently have zero counts across all samples, will be considered significant only by the presence/absence test. The ANCOM-BC and ANCOM-BC2 methodologies are not designed to detect significant differences in taxa with structural zeros. As a result, these taxa are summarized separately and not included in the primary results of ancombc or ancombc2.

To access the results of the presence/absence test, you can refer to the zero_ind output. This will provide information on the taxa that exhibit structural zeros.

3. Q: In the primary results, what do lfc_(Intercept), lfc_groupB, and lfc_groupC represent if I have a group variable with categories A, B, and C?

A: In the primary results, the terms lfc_groupB and lfc_groupC represent the log fold changes (logFC) relative to the reference group, which is group A by default. These logFC values indicate the difference in abundance between group B and group A, and between group C and group A, respectively.

On the other hand, lfc_(Intercept) refers to the log fold change of the grand mean, which may not be a parameter of particular interest in this context.

It's worth mentioning that if you wish to change the reference group, you can use the factor function in R to rearrange the levels of the group variable accordingly.

4. Q: I encountered an error message stating "'rank' must be a value from 'taxonomyRanks()'. What does it mean and how can I resolve it?

A: The error message "'rank' must be a value from 'taxonomyRanks()'" typically occurs when the rank names in your tax_table are not properly labeled. In order to resolve this issue, it is recommended to use the taxonomyRanks(se) function, where se is your tse object.

Firstly, ensure that the rank names in your tax_table are correctly named as one of the standard taxonomic ranks, such as "Kingdom", "Phylum", "Class", "Order", "Family", "Genus", or "Species". If the rank names are currently labeled as something else, such as "ta1", "ta2", "ta3", and so on, you will need to update them accordingly.

This issue commonly occurs when a tax_table is formed from a data.frame instead of a matrix. Therefore, it's important to ensure that the rank names are correctly assigned before proceeding with the analysis.

Once you have updated the rank names in your tax_table, you can utilize the ancombc2 function, specifying the desired taxonomic level using the tax_level argument (e.g., "Genus"). This will enable you to perform statistical analyses on your microbiome data at the specified taxonomic level.

5. Q: I encountered an issue while using rand_formula in ancombc2. What is the correct syntax for specifying random effects?

A: When specifying random effects using rand_formula in ancombc2, it is important to follow the syntax conventions used in the lmerTest package. Pay close attention to the placement of parentheses and vertical bars.

To correctly specify a random subject effect, the syntax should be in the form of "(1|subjid)", where subjid represents the variable name for the subject identifier. This syntax ensures that the random subject effect is properly accounted for in the analysis.

On the other hand, it is incorrect to use rand_formula as "1|subjid" or "(subjid)" for specifying random effects. The correct syntax should always include parentheses around the random effect and a vertical bar to separate it from the fixed effects.

By using the correct syntax for specifying random effects, you will be able to accurately incorporate these effects into your ancombc2 analysis.

6. Q: What are the differences between the primary results and the results of Dunnett's type of test in ANCOM-BC2?

A: The primary results and the results of Dunnett's type of test in ANCOM-BC2 provide information on differentially abundant taxa, but there are differences in the correction of p-values.

In the primary results, the p-values are corrected across taxa, meaning that they account for multiple comparisons among different taxa. This correction helps control the false positive rate when determining the significance of individual taxa.

On the other hand, Dunnett's type of test not only corrects the p-values across taxa but also corrects for multiple comparisons between groups. Specifically, it compares the abundance of each taxon in groups B and C with the reference group A. The correction for multiple comparisons in Dunnett's type of test results in a more conservative outcome, reducing the likelihood of false positive results.

Therefore, while both the primary results and Dunnett's type of test provide information on differentially abundant taxa, the results of Dunnett's type of test offer additional control for multiple comparisons, making them more conservative and reliable.

7. Q: Can the ancombc or ancombc2 function handle interaction terms in the analysis?

A: Unfortunately, the inclusion of interaction terms in the fix_formula argument of ancombc or ancombc2 can lead to complexities and potential confusion in the multi-group comparisons. To address this, it is recommended to manually create the interaction term of interest outside of the formula and perform the analysis accordingly.

By manually creating the interaction term, you can ensure that the analysis accurately captures the interaction effect between variables. Once the interaction term is created, you can include it in the fix_formula argument or any other relevant part of the analysis, depending on your specific research question and design.

8. Q: Can the ANCOM-BC methodology be applied to other data types such as functional abundances, RNA-seq, or single-cell RNA data?

A: The ANCOM-BC methodology can be applied to other data types as long as they are considered compositional. However, it is essential to be aware that the methodology has been primarily benchmarked and validated using microbiome data.

9. Q: What does "not a positive definite matrix" mean in fitting the ancombc2 mixed effects model? How can I debug this issue?

A: The error message "not a positive definite matrix" indicates that the correlation matrix used in the mixed effects model is not positive definite. A positive definite matrix is a square matrix where all eigenvalues are positive. This error typically occurs when there is an issue with the data or model specification.

To debug this issue, I recommend fitting a mixed effects model to your RAW data using the lmerTest package in R. Use the same fixed effects and random effects specifications that you used in the ancombc2 function. By fitting the model directly, you may receive more informative error messages that can help diagnose the problem.

Here are the steps you can follow to debug the issue:

  1. Install and load the lmerTest package in R: install.packages("lmerTest") and library(lmerTest).
  2. Prepare your data in its raw format without any transformation or preprocessing.
  3. Specify the fixed effects and random effects in the model formula, similar to what you used in ancombc2.
  4. Fit the mixed effects model using the lmer() function from the lmerTest package.
  5. Check if the model fitting process encounters any errors or warnings. These messages can provide valuable insights into the issue.
  6. Analyze the error or warning messages to identify the underlying problem. It could be related to the data structure, model specification, or potential collinearity among variables.
  7. Address the issue based on the information provided in the error or warning messages. This may involve revising the model specification, examining the data for anomalies, or resolving any collinearity issues.

By following these steps, you can gain a better understanding of the problem causing the "not a positive definite matrix" error and take appropriate actions to address it.

If you continue to encounter difficulties or need further assistance, it may be helpful to seek advice from statisticians or experts in your specific field of research.

10. Q: If a higher LFC value corresponds to a larger abundance, why did some of my OTU/ASV counts show opposite directions?

A: It's important to note that the log-fold change (LFC) values in the context of ANCOM-BC or ANCOM-BC2 do not directly reflect the relative abundances (such as proportions) or observed abundances (such as OTU or ASV counts). The LFC values represent the difference in bias-corrected abundances between groups.

In ANCOM-BC or ANCOM-BC2, a higher LFC value indicates a larger difference in bias-corrected abundances between groups. However, this does not necessarily mean that the group with higher LFC has a higher relative abundance or larger observed counts for a specific OTU or ASV.

11. Q: Can you give me a more complicated example of performing ANCOM-BC2 trend test?

A: For example, when using the trend test with a group variable of 5 ordered categories (A, B, C, D, E) in R, we are actually estimating 4 contrasts, which are (B-A, C-A, D-A, E-A). Testing the trend of A < B < C < D < E is equivalent to testing 0 < B - A < C - A < D - A < E - A. Therefore, we can specify the contrast matrix as follows:

# B-A    C-A     D-A    E-A
    1      0      0     0
    -1     1      0     0
    0     -1     1      0
    0     0     -1      1

In R, it should be

matrix(c(1, 0, 0, 0,
  -1, 1, 0, 0,
   0, -1, 1, 0,
   0, 0, -1, 1),
   nrow = 4, 
   byrow = TRUE)

12. Q: OMG, I am still very confused at structural zeros. What are they? What do the struc_zero and neg_lb arguments do?

A: A taxon is considered to have structural zeros in some (>=1) groups if it is completely or nearly completely absent in those groups. For example, if there are three groups, g1, g2, and g3, and the counts of taxon A are 0 in g1 but non-zero in g2 and g3, taxon A will be considered to contain structural zeros in g1. In this scenario, taxon A is declared to be differentially abundant between g1 and g2, g1 and g3, and is consequently globally differentially abundant with respect to the group variable. Such taxa are not further analyzed using ANCOM-BC or ANCOM-BC2, but the results are summarized in the zero_ind. You can treat the detection of structural zeros as performing a presence/absence test.

The detection of structural zeros is based on a separate paper, ANCOM-II. Specifically, setting neg_lb = TRUE indicates that both criteria stated in section 3.2 of ANCOM-II are used to detect structural zeros. Alternatively, setting neg_lb = FALSE will only use equation 1 in section 3.2 of ANCOM-II to declare structural zeros, making it a more conservative approach. As the OTU/ASV table is usually very sparse, it is recommended to choose neg_lb = FALSE to prevent false discoveries. However, if you have a more dense table such as a family level table with a sufficiently large sample size, using neg_lb = TRUE may be a better idea. It is important to note that neg_lb has no function if struc_zero is set to FALSE. Therefore, there are three possible combinations: struc_zero = FALSE (regardless of neg_lb), struc_zero = TRUE, neg_lb = FALSE, or struc_zero = TRUE, neg_lb = TRUE.

About

Differential abundance (DA) and correlation analyses for microbial absolute abundance data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • R 95.6%
  • TeX 4.4%