MZILN: Conditional regression for microbiome analysis based on multivariate zero-inflated logistic normal model

Description

For estimating and testing the associations of abundance ratios with covariates.

Usage

MZILN(
  experiment_dat,
  refTaxa,
  allCov = NULL,
  sampleIDname = NULL,
  adjust_method = "BY",
  fdrRate = 0.15,
  paraJobs = NULL,
  bootB = 500,
  taxDropThresh = 0,
  standardize = FALSE,
  sequentialRun = TRUE,
  verbose = TRUE,
  seed = 1
)

Value

A list with two elements.

full_results: The main results for MZILN, containing the estimation and testing results for all associations between all taxa ratios (with refTaxa being the denominator) and all covariates in allCov. It is a dataframe with each row representing an association, and ten columns named "ref_tax", "taxon", "cov", "estimate", "SE.est", "CI.low", "CI.up", "adj.p.value", "unadj.p.value", and "sig_ind". The columns correspond to the denominator taxon, numerator taxon, covariate name, association estimate, standard error estimate, lower bound and upper bound of the 95% confidence interval, adjusted p value, unadjusted p value, and the indicator showing whether the association is significant after multiple testing adjustment.
metadata: The metadata is a list containing total time used in minutes, random seed used, FDR rate, and multiple testing adjustment method used.

Arguments

experiment_dat: A SummarizedExperiment object containing microbiome data and covariates (see the example on how to create a SummarizedExperiment object). The microbiome data can be absolute abundance or relative abundance, with one column per sample and one row per taxon/OTU/ASV (or any other unit). No imputation is needed for zero-valued data points. The covariates data contains covariates and confounders, with one row per sample and one column per variable. The covariates data has to be numeric or binary.
refTaxa: Denominator taxa names specified by the user for the targeted ratios. This could be a vector of names.
allCov: All covariates of interest (including confounders) for estimating and testing their associations with the targeted ratios. Default is NULL, meaning that all covariates in the covariates data of experiment_dat are of interest.
sampleIDname: Name of the sample ID variable in the data. In the case that the data does not have an ID variable, this can be ignored. Default is NULL.
adjust_method: The adjusting method for p value adjustment. Default is "BY" for dependent FDR adjustment. It can take any adjustment method for p.adjust function in R.
fdrRate: The false discovery rate for identifying taxa/OTU/ASV associated with allCov. Default is 0.15.
paraJobs: If sequentialRun is FALSE, this specifies the number of parallel jobs that will be registered to run the algorithm. If specified as NULL, it will automatically detect the cores to decide the number of parallel jobs. Default is NULL.
bootB: Number of bootstrap samples for obtaining confidence interval of estimates for the high dimensional regression. The default is 500.
taxDropThresh: The threshold on the number of non-zero sequencing reads for a taxon to be dropped from the analysis. The default is 0, which means taxa without any sequencing reads will be dropped from the analysis.
standardize: This takes a logical value TRUE or FALSE. If TRUE, the design matrix for X will be standardized in the analyses and the results. Default is FALSE.
sequentialRun: This takes a logical value TRUE or FALSE. Default is TRUE. It can be set to FALSE to increase speed if there are multiple taxa in the argument refTaxa.
verbose: Whether the process message is printed out to the console. The default is TRUE.
seed: Random seed for reproducibility. Default is 1. It can be set to be NULL to remove seeding.
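
The interplay between adjust_method and fdrRate can be illustrated with base R alone. The sketch below uses made-up p values (not output from MZILN) and assumes that an association is flagged when its adjusted p value does not exceed the FDR level; that decision rule is an assumption for illustration, not a statement about the package internals.

```r
# Toy unadjusted p values, made up for illustration only
p_unadj <- c(0.0004, 0.003, 0.012, 0.04, 0.2, 0.6)

# "BY" (Benjamini-Yekutieli) controls the FDR under dependence;
# "BH" (Benjamini-Hochberg) is shown for comparison
p_by <- p.adjust(p_unadj, method = "BY")
p_bh <- p.adjust(p_unadj, method = "BH")

# Assumed decision rule: flag an association when its adjusted
# p value does not exceed the chosen FDR level
fdr_rate <- 0.15
sig_by <- p_by <= fdr_rate

print(data.frame(p_unadj, p_bh, p_by, sig_by))
```

BY-adjusted p values are never smaller than BH-adjusted ones, so "BY" is the more conservative choice of the two.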

Details

Most of the time, users only need to supply the first three inputs to the function: experiment_dat, refTaxa and allCov. All other inputs can take their default values. The regression model for MZILN() can be expressed as follows:

\log\left(\frac{Y_i^k}{Y_i^{K+1}}\right) \Big| Y_i^k>0, Y_i^{K+1}>0 = \alpha^{0k} + X_i^T \alpha^k + \epsilon_i^k, \quad k=1,\ldots,K

where

Y_i^k is the absolute abundance (AA) of taxon k in subject i in the entire ecosystem.
Y_i^{K+1} is the AA of the reference taxon (specified by the user).
X_i is the covariate matrix for all covariates including confounders.
\alpha^k is the vector of regression coefficients, which the MZILN() function estimates along with their 95% confidence intervals.

High-dimensional X_i is handled by regularization.
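
To make the conditioning in the model concrete, here is a minimal sketch in base R for a single taxon and a single covariate. It uses simulated data and ordinary least squares via lm(); this is a stand-in for illustration only and is not the regularized estimation or bootstrap procedure used by MZILN(). All variable names (x, y_k, y_ref) are hypothetical.

```r
set.seed(1)
n <- 200
x <- rnorm(n)                          # one hypothetical covariate

# Simulate a log-ratio with intercept 0.5 and slope 1
log_ratio <- 0.5 + 1 * x + rnorm(n, sd = 0.5)
y_ref <- rexp(n) + 0.1                 # abundance of the reference taxon
y_k   <- y_ref * exp(log_ratio)        # abundance of the numerator taxon

# Inject zeros at random to mimic zero inflation
y_k[sample(n, 40)] <- 0
y_ref[sample(n, 20)] <- 0

# Condition on both taxa being non-zero, as in the model
keep <- y_k > 0 & y_ref > 0
fit <- lm(log(y_k[keep] / y_ref[keep]) ~ x[keep])

coef(fit)                              # intercept and slope estimates
confint(fit, level = 0.95)             # 95% confidence intervals
```

Because only samples with both taxa present enter the regression, no imputation of zeros is needed, which mirrors the note on experiment_dat above.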

References

Li et al. (2018). Conditional Regression Based on a Multivariate Zero-Inflated Logistic-Normal Model for Microbiome Relative Abundance Data. Statistics in Biosciences 10(3): 587-608.

Examples

library(IFAA)
library(SummarizedExperiment)

## If you already have a SummarizedExperiment format data, you can ignore 
## the data processing steps below.

## load the example microbiome data. This could be relative abundance or absolute 
## abundance data. If you have a csv or tsv file for the microbiome data, you 
## can use read.csv() function or read.table() function in R to read the 
## data file into R.
data(dataM)
dim(dataM)
dataM[1:5, 1:8]

## load the example covariates data. If you have a csv or tsv file for the 
## covariates data, you can use read.csv() function or read.table() function 
## in R to read the data file into R.
data(dataC)
dim(dataC)
dataC[1:5, ]

## Merge microbiome data and covariate data by id, to avoid unmatched observations.
data_merged<-merge(dataM,dataC,by="id",all=FALSE)

## Separate microbiome data and covariate data, drop id variable from the microbiome data
dataM_sub<-data_merged[,colnames(dataM)[!colnames(dataM)%in%c("id")]]
dataC_sub<-data_merged[,colnames(dataC)]

## Create SummarizedExperiment object 
test_dat<-SummarizedExperiment(assays=list(MicrobData=t(dataM_sub)), colData=dataC_sub)

## If you already have a SummarizedExperiment format data, you can 
## ignore the above steps.

## Run MZILN function
results <- MZILN(experiment_dat = test_dat,
                refTaxa=c("rawCount11"),
                allCov=c("v1","v2","v3"),
                sampleIDname=c("id"),
                fdrRate=0.05)
## to extract the results for all ratios with rawCount11 as the denominator:
summary_res<-results$full_results
## to extract results for the ratio of a specific taxon (e.g., rawCount45) over rawCount11:
target_ratio <- summary_res[summary_res$taxon == "rawCount45", ]
## to extract all of the ratios having significant associations:
sig_ratios <- subset(summary_res, sig_ind == TRUE)
