This is a simple script that performs Poisson, negative binomial and zero-inflated negative binomial regression. It also calculates AIC and BIC for each of the 3 models calculated, so the user can select the best model.
The script is adjusted such that it can take in any OTU table file, generated via QIIME 1.8.0 (stable public release), (in tab-delimited format) as input and standard mapping/metadata file compatible with QIIME.
Presently, the script can only perform a single category comparison for variables. For example, if the metadata have two variables such as diet and antibiotic exposure, the script will have to be run seperately for each variable. A joint model including both explanatory variables (i.e., diet and antibiotic exposure) cannot be currently calculated.
Furthermore, the script can only work with variables that that have two levels. For example, if the explanatory variable is temperature, it must only contain two levels such as High and Low for the script to run.
Work is in progress to include comparisons for more than 2 levels and to create models in which multiple combinations of explanatory variables can be included.
There is 1 script in the folder src
. It is called nb_regression_outlier_filtering.R
. The script is run via command line using the Rscript command (in terminal). There are 3 scripts in the folder old_code
but they can be ignored as they are just older versions, saved for trouble-shooting. There is also an additional Python script in the filtering
. It is called filter_mapper.py
. It is to be used to further results by lowest BIC of the 3 models tested and a user-defined q-value threshold.
To run the script, pass the command in following format:
Rscript nb_regression_outlier_filtering.R high_vs_low_otu_table.txt high_low_mapfile.txt High Low Treatment ZINB_NB_Output_result.txt 2
As seen from the command, the script takes in 7 commands. They are as follows:
-
OTU table generated via QIIME (which is called high_vs_low_otu_table.txt in the above example)
-
QIIME compatible mapping file (which is called high_low_mapfile.txt in the above example)
-
Level 1 of the category being compared (which is called High in the above example)
-
Level 2 of the category being compared (which is called Low in the above example)
-
Column name of the category being compared as labelled in the mapping file (which is called Treatment in the above example)
-
Output file contaiing result (which is called ZINB_NB_Output_result.txt in the above example)
-
No. of cores to use. More cores on machine, faster the analysis will complete (which is 2 in the above example)
Please ensure that all the 7 arguments are provided, in the correct order and format. Otherwise, the script will crash and cause problems.
Input of file format should be one compatabile with QIIME. However, please ensure that the sample IDs are not numeric. That is, the sample IDs should not be like: 1560.1, 1561.1, 1559.1, etc. If such is the case, please slightly modify the sample IDs in both the mapping file and OTU table by adding any alphabet. So, for example, sample ID 1560.1 will become p1560.1.
Also, please make sure that the mapping file has the same number of samples as the OTU tables, having the same sample IDs. If mapping file has more or less sample IDs than the samples in the OTU table, the script will crash.
The output of the script contains information for all of the OTUs tested. Currently, there are 51 columns in the output file (called ZINB_NB_Output_result.txt in the example above) as generated via this script. The columns and their descriptions of the output file are as follows:
-
OTU_ID: Indicates the OTU ID
-
Poiss_Coeff: Indicates the expoentiated regression coefficient estimated when fitting the Poisson model on the OTU count data. This value is the is the multiplicative change in OTU abundance, comparing one level of a specific treatment group to another. It is to be interpreted in a similar way to the NB_Coeff value, which is explained further.
-
Poiss_pval: p-value of estimated Poisson Coeff
-
Poiss_qval: q-value of the estimated Poisson Coeff
-
NB_Coeff: Indicates the exponentiated regression coefficient for the regular negative binomial model. To elaborate more, in our example dataset, OTU_26 has a NB_Coeff value of 1.329890281. This means that the abundance of OTU_26 is 1.329890281 times higher in the Low group of treatment compared to the High group of treatment. To find out which group is the base group, we look at the column called High_minus_Low_mean for our example dataset. (The name of this group will change from mapping file to mapping file as explained later in detail) In our example dataset, the High_minus_Low_mean column has the value of -176.8338789. This suggests that the mean OTU_26 is higher in Low group when compared to High group.
-
NB_pval: p-value of the estimated NB_Coeff
-
NB_qval: q-value of the estimated NB_Coeff
-
ZINB_Coeff: Indicates the exponentiated regression coefficient for the zero inflated negative binomial model. This value is the multiplicative change in OTU abundance, comparing one level of a specific treatment group to another. It is to be interpreted in a similar way as NB_Coeff value, which is explained earlier. For many OTUs, this value may turn out to be NA and it's okay if it does. It simply reflects that ZINB is not a good model to fit to data (mostly due to convergence issues).
-
ZINB_pval: p-value of the estimated ZINB Coeff
-
ZINB qval: q-value of the estimated ZINB Coeff
-
mean in group High: The mean of a specific OTU in a samples belonging to specific one treatment level. (The name of this column will change base on user defined mapping file).
-
mean in group Low: The mean of a specific OTU in a samples belonging to specific the other treatment level. (The name of this column will change base on user defined mapping file).
-
High_minus_Low_mean: The name of this column is specific to each dataset. In our metadata file, we had a column called Treatment which had two levels: High and Low. The ordering of the names of the treatment levels in this column will vary from mapping file to mapping file. However, they will always be consistent such that if the value is positive, then the first level as suggested by column name has higher mean than second one (High in our example). And if the value is negative, then the second level as suggested by column name has a higher mean than first one (Low in our example).
-
ttest_pval: p-value of the t-test.
-
ttest_qval: q-value of the t-test.
-
KW_pval: Kruskal-Wallis pvalue.
-
KW_qval: Kruskal-Wallis qvalue.
-
NB_Coeff_Estimate_Error: Indicates (with yes/no) whether there was in error in convergence in estimating the negative binomial regression coefficient.
-
# of 0's in High: Indicates the total # of zeroes in one treatment group (such as High).
-
# of 0's in Low: Indicates the total # of zeroes in the other treatment group (such as Low).
-
# of non-zeroes in High: Indicates the total number of non-zero entries in one treatment group (such as High).
-
# of non-zeroes in Low: Indicates the total number of non-zero entries in other treatment group (such as Low).
-
Total count in High: Sum of all values in one treatment group (such as High).
-
Total count in Low: Sum of all values in other treatment group (such as Low).
-
mean_otu: Mean across all treatment groups (both High and Low).
-
variance_otu: Variance across all treatment groups (both High and Low).
-
var/mean ratio: Variance to mean ratio across all treatment groups (such as High and Low).
-
Shapiro_Wilk_Normality_pvalue: Indicates whether the data is normally distributed or not, informing us about the validity of using the t-test. A significant p-value in this column indicates that data are not normally distributed and t-test may not be that appropriate.
-
taxonomy: Indicates the taxonomy/lineage of the specific OTU.
-
pois_filt_pval: p-value of the Poisson model with outlier(s) filtered. An outlier for a specific OTU is defined as that count value for the OTU which is greater than 5 times the inter-quartile range of the OTU across all samples.
-
pois_filt_qval: q-value of Poisson model with outlier(s) filtered.
-
nb_filt_pval: p-value of negative binomial model with outlier(s) filtered. Outlier defined as above.
-
nb_filt_qval: q-value of negative binomial model with outlier(s) filtered.
-
zinb_filt_pval: p-value of zero inflated negative binomial model with outlier(s) filtered. Outlier defined as above.
-
zinb_filt_qval: q-value of zero inflated negative binomial model with outlier(s) filtered.
-
aic.pois: [Akaike information criterion (AIC)] (http://en.wikipedia.org/wiki/Akaike_information_criterion) value for Poisson model.
-
aic.nb: [Akaike information criterion (AIC)] (http://en.wikipedia.org/wiki/Akaike_information_criterion) value for negative binomial model.
-
aic.zinb: [Akaike information criterion (AIC)] (http://en.wikipedia.org/wiki/Akaike_information_criterion) value for zero inflated negative binomial model.
-
bic.pois: [Bayesian information criterion (BIC)] (http://en.wikipedia.org/wiki/Bayesian_information_criterion) value for Poisson model.
-
bic.nb: [Bayesian information criterion (BIC)] (http://en.wikipedia.org/wiki/Bayesian_information_criterion) value for negative binomial model.
-
bic.zinb: [Bayesian information criterion (BIC)] (http://en.wikipedia.org/wiki/Bayesian_information_criterion) value for zero inflated negative binomial model.
-
aic.filt.pois: [Akaike information criterion (AIC)] (http://en.wikipedia.org/wiki/Akaike_information_criterion) value for Poisson model with outlier(s) filtered.
-
aic.filt.nb: [Akaike information criterion (AIC)] (http://en.wikipedia.org/wiki/Akaike_information_criterion) value for negative binomial model with outlier(s) filtered.
-
aic.filt.zinb: [Akaike information criterion (AIC)] (http://en.wikipedia.org/wiki/Akaike_information_criterion) value for zero inflated negative binomial model with outlier(s) filtered.
-
bic.filt.pois: [Bayesian information criterion (BIC)] (http://en.wikipedia.org/wiki/Bayesian_information_criterion) value for Poisson model with outlier(s) filtered.
-
bic.filt.nb: [Bayesian information criterion (BIC)] (http://en.wikipedia.org/wiki/Bayesian_information_criterion) value for negative binomial model with outlier(s) filtered.
-
bic.filt.zinb: [Bayesian information criterion (BIC)] (http://en.wikipedia.org/wiki/Bayesian_information_criterion) value for zero inflated negative binomial model with outlier(s) filtered.
-
aic.nonfilt.best: Column indicating which model is best (based on lowest AIC value) for a given OTU with no outliers filtered.
-
bic.nonfilt.best: Column indicating which model is best (based on lowest BIC value) for a given OTU with no outliers filtered.
-
aic.filt.best: Column indicating which model is best (based on lowest AIC value) for a given OTU with outliers filtered.
-
bic.filt.best: Column indicating which model is best (based on lowest BIC value) for a given OTU with outliers filtered.
This script is being included as part of a larger pipeline, I'm developing called [biological-convergence] (https://github.com/alifar76/biological-convergence). Since the output of this script produces too many columns, there is some additional filtering done based on user specified values to produce a more manageable file.
That manageable file, with far fewer columns, will be produced by the [biological-convergence] (https://github.com/alifar76/biological-convergence) pipeline and will look like this file. Users can, therefore, run that pipeline to get this analysis done.