Author & Maintainer: Huang Lin: huanglinfrederick@gmail.com
ANCOMBC is a package containing differential abundance (DA) and correlation analyses for microbiome data. Specifically, the package includes Analysis of Compositions of Microbiomes with Bias Correction 2 (ANCOM-BC2, submitted), Analysis of Compositions of Microbiomes with Bias Correction (ANCOM-BC), and Analysis of Composition of Microbiomes (ANCOM) for DA analysis, and Sparse Estimation of Correlations among Microbiomes (SECOM) for correlation analysis. Microbiome data are typically subject to two sources of biases: unequal sampling fractions (sample-specific biases) and differential sequencing efficiencies (taxon-specific biases). Methodologies included in the ANCOMBC package are designed to correct these biases and construct statistically consistent estimators.
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("ANCOMBC")
library(ANCOMBC)
?ancombc2
?ancombc
?ancom
?secom_linear
?secom_dist
1. Q: What are the differences between the formula
and group
arguments in ancombc
and ancombc2
?
A: The formula
and group
arguments serve different purposes in the ancombc
and ancombc2
functions. Here's a breakdown of their differences:
-
formula
: This argument is used to specify the variables in your experiment that can potentially influence microbial abundances. It is essential to include all relevant variables in theformula
to ensure proper adjustment and accurate results. For example, if you have a continuous variable likeage
as your main variable of interest, and you have additional categorical variables that need adjustment but are not directly related to your research question, you can include them in theformula
while leavinggroup
as NULL. -
group
: Thegroup
argument is optional and should only be specified if you want to detect structural zeros (presence/absence test) or perform multi-group comparisons, such as the global test, pairwise directional test, Dunnett's type of test, or trend test. If your variable of interest is a categorical variable with more than three levels and you want to conduct multi-group comparisons, you should include thegroup
argument. It is important to note thatgroup
is not the same asmain_var
inancom
. Inancombc
andancombc2
,group
is used for multi-group comparisons and correction of p-values for multiple comparisons.
Remember not to include the main_var
in the adj_formula
in ancom
, but always include group
in the formula
or fix_formula
(in ancombc
and ancombc2
, respectively) if group
is not NULL. This ensures that the appropriate adjustments and comparisons are made in the analysis.
2. Q: Why are some taxa absent from the primary results?
A: There are a couple of reasons why certain taxa may be absent from the primary results in ancombc
or ancombc2
. Here's an explanation:
-
Prevalence Exclusion: Taxa with prevalences below the specified threshold (
prv_cut
) will be excluded from the analysis. Theprv_cut
value determines the minimum prevalence required for a taxon to be considered in the analysis. If a taxon's prevalence falls below this threshold, it will not be included in the primary results. -
Structural Zeros: Taxa that exhibit structural zeros, meaning they consistently have zero counts across all samples, will be considered significant only by the presence/absence test. The ANCOM-BC and ANCOM-BC2 methodologies are not designed to detect significant differences in taxa with structural zeros. As a result, these taxa are summarized separately and not included in the primary results of
ancombc
orancombc2
.
To access the results of the presence/absence test, you can refer to the zero_ind
output. This will provide information on the taxa that exhibit structural zeros.
3. Q: In the primary results, what do lfc_(Intercept)
, lfc_groupB
, and lfc_groupC
represent if I have a group
variable with categories A
, B
, and C
?
A: In the primary results, the terms lfc_groupB
and lfc_groupC
represent the log fold changes (logFC) relative to the reference group, which is group A
by default. These logFC values indicate the difference in abundance between group B
and group A
, and between group C
and group A
, respectively.
On the other hand, lfc_(Intercept)
refers to the log fold change of the grand mean, which may not be a parameter of particular interest in this context.
It's worth mentioning that if you wish to change the reference group, you can use the factor
function in R to rearrange the levels of the group
variable accordingly.
4. Q: I encountered an error message stating "'rank' must be a value from 'taxonomyRanks()'. What does it mean and how can I resolve it?
A: The error message "'rank' must be a value from 'taxonomyRanks()'" typically occurs when the rank names in your tax_table
are not properly labeled. In order to resolve this issue, it is recommended to use the taxonomyRanks(se)
function, where se
is your tse
object.
Firstly, ensure that the rank names in your tax_table
are correctly named as one of the standard taxonomic ranks, such as "Kingdom", "Phylum", "Class", "Order", "Family", "Genus", or "Species". If the rank names are currently labeled as something else, such as "ta1", "ta2", "ta3", and so on, you will need to update them accordingly.
This issue commonly occurs when a tax_table
is formed from a data.frame
instead of a matrix
. Therefore, it's important to ensure that the rank names are correctly assigned before proceeding with the analysis.
Once you have updated the rank names in your tax_table
, you can utilize the ancombc2
function, specifying the desired taxonomic level using the tax_level
argument (e.g., "Genus"). This will enable you to perform statistical analyses on your microbiome data at the specified taxonomic level.
5. Q: I encountered an issue while using rand_formula
in ancombc2
. What is the correct syntax for specifying random effects?
A: When specifying random effects using rand_formula
in ancombc2
, it is important to follow the syntax conventions used in the lmerTest
package. Pay close attention to the placement of parentheses and vertical bars.
To correctly specify a random subject effect, the syntax should be in the form of "(1|subjid)"
, where subjid
represents the variable name for the subject identifier. This syntax ensures that the random subject effect is properly accounted for in the analysis.
On the other hand, it is incorrect to use rand_formula
as "1|subjid"
or "(subjid)"
for specifying random effects. The correct syntax should always include parentheses around the random effect and a vertical bar to separate it from the fixed effects.
By using the correct syntax for specifying random effects, you will be able to accurately incorporate these effects into your ancombc2
analysis.
6. Q: What are the differences between the primary results and the results of Dunnett's type of test in ANCOM-BC2?
A: The primary results and the results of Dunnett's type of test in ANCOM-BC2 provide information on differentially abundant taxa, but there are differences in the correction of p-values.
In the primary results, the p-values are corrected across taxa, meaning that they account for multiple comparisons among different taxa. This correction helps control the false positive rate when determining the significance of individual taxa.
On the other hand, Dunnett's type of test not only corrects the p-values across taxa but also corrects for multiple comparisons between groups. Specifically, it compares the abundance of each taxon in groups B
and C
with the reference group A
. The correction for multiple comparisons in Dunnett's type of test results in a more conservative outcome, reducing the likelihood of false positive results.
Therefore, while both the primary results and Dunnett's type of test provide information on differentially abundant taxa, the results of Dunnett's type of test offer additional control for multiple comparisons, making them more conservative and reliable.
7. Q: Can the ancombc
or ancombc2
function handle interaction terms in the analysis?
A: Unfortunately, the inclusion of interaction terms in the fix_formula
argument of ancombc
or ancombc2
can lead to complexities and potential confusion in the multi-group comparisons. To address this, it is recommended to manually create the interaction term of interest outside of the formula and perform the analysis accordingly.
By manually creating the interaction term, you can ensure that the analysis accurately captures the interaction effect between variables. Once the interaction term is created, you can include it in the fix_formula
argument or any other relevant part of the analysis, depending on your specific research question and design.
8. Q: Can the ANCOM-BC methodology be applied to other data types such as functional abundances, RNA-seq, or single-cell RNA data?
A: The ANCOM-BC methodology can be applied to other data types as long as they are considered compositional. However, it is essential to be aware that the methodology has been primarily benchmarked and validated using microbiome data.
9. Q: What does "not a positive definite matrix" mean in fitting the ancombc2
mixed effects model? How can I debug this issue?
A: The error message "not a positive definite matrix" indicates that the correlation matrix used in the mixed effects model is not positive definite. A positive definite matrix is a square matrix where all eigenvalues are positive. This error typically occurs when there is an issue with the data or model specification.
To debug this issue, I recommend fitting a mixed effects model to your RAW data using the lmerTest
package in R. Use the same fixed effects and random effects specifications that you used in the ancombc2
function. By fitting the model directly, you may receive more informative error messages that can help diagnose the problem.
Here are the steps you can follow to debug the issue:
- Install and load the
lmerTest
package in R:install.packages("lmerTest")
andlibrary(lmerTest)
. - Prepare your data in its raw format without any transformation or preprocessing.
- Specify the fixed effects and random effects in the model formula, similar to what you used in
ancombc2
. - Fit the mixed effects model using the
lmer()
function from thelmerTest
package. - Check if the model fitting process encounters any errors or warnings. These messages can provide valuable insights into the issue.
- Analyze the error or warning messages to identify the underlying problem. It could be related to the data structure, model specification, or potential collinearity among variables.
- Address the issue based on the information provided in the error or warning messages. This may involve revising the model specification, examining the data for anomalies, or resolving any collinearity issues.
By following these steps, you can gain a better understanding of the problem causing the "not a positive definite matrix" error and take appropriate actions to address it.
If you continue to encounter difficulties or need further assistance, it may be helpful to seek advice from statisticians or experts in your specific field of research.
10. Q: If a higher LFC value corresponds to a larger abundance, why did some of my OTU/ASV counts show opposite directions?
A: It's important to note that the log-fold change (LFC) values in the context of ANCOM-BC or ANCOM-BC2 do not directly reflect the relative abundances (such as proportions) or observed abundances (such as OTU or ASV counts). The LFC values represent the difference in bias-corrected abundances between groups.
In ANCOM-BC or ANCOM-BC2, a higher LFC value indicates a larger difference in bias-corrected abundances between groups. However, this does not necessarily mean that the group with higher LFC has a higher relative abundance or larger observed counts for a specific OTU or ASV.
11. Q: Can you give me a more complicated example of performing ANCOM-BC2 trend test?
A: For example, when using the trend test with a group
variable of 5 ordered categories (A, B, C, D, E
) in R, we are actually estimating 4 contrasts, which are (B-A, C-A, D-A, E-A
). Testing the trend of A < B < C < D < E
is equivalent to testing 0 < B - A < C - A < D - A < E - A
. Therefore, we can specify the contrast matrix as follows:
# B-A C-A D-A E-A
1 0 0 0
-1 1 0 0
0 -1 1 0
0 0 -1 1
In R, it should be
matrix(c(1, 0, 0, 0,
-1, 1, 0, 0,
0, -1, 1, 0,
0, 0, -1, 1),
nrow = 4,
byrow = TRUE)
12. Q: OMG, I am still very confused at structural zeros. What are they? What do the struc_zero
and neg_lb
arguments do?
A: A taxon is considered to have structural zeros in some (>=1) groups if it is completely or nearly completely absent in those groups. For example, if there are three groups, g1, g2, and g3, and the counts of taxon A are 0 in g1 but non-zero in g2 and g3, taxon A will be considered to contain structural zeros in g1. In this scenario, taxon A is declared to be differentially abundant between g1 and g2, g1 and g3, and is consequently globally differentially abundant with respect to the group variable. Such taxa are not further analyzed using ANCOM-BC or ANCOM-BC2, but the results are summarized in the zero_ind
. You can treat the detection of structural zeros as performing a presence/absence test.
The detection of structural zeros is based on a separate paper, ANCOM-II. Specifically, setting neg_lb = TRUE
indicates that both criteria stated in section 3.2 of ANCOM-II are used to detect structural zeros. Alternatively, setting neg_lb = FALSE
will only use equation 1 in section 3.2 of ANCOM-II to declare structural zeros, making it a more conservative approach. As the OTU/ASV table is usually very sparse, it is recommended to choose neg_lb = FALSE
to prevent false discoveries. However, if you have a more dense table such as a family level table with a sufficiently large sample size, using neg_lb = TRUE
may be a better idea. It is important to note that neg_lb
has no function if struc_zero
is set to FALSE
. Therefore, there are three possible combinations: struc_zero = FALSE
(regardless of neg_lb
), struc_zero = TRUE, neg_lb = FALSE
, or struc_zero = TRUE, neg_lb = TRUE
.