IFAA: Robust association identification and inference for absolute abundance in microbiome analyses

Description

Make inference on the association of the microbiome with covariates.

Usage

IFAA(
  experiment_dat,
  testCov = NULL,
  ctrlCov = NULL,
  sampleIDname = NULL,
  testMany = TRUE,
  ctrlMany = FALSE,
  nRef = 40,
  nRefMaxForEsti = 2,
  refTaxa = NULL,
  adjust_method = "BY",
  fdrRate = 0.15,
  paraJobs = NULL,
  bootB = 500,
  standardize = FALSE,
  sequentialRun = FALSE,
  refReadsThresh = 0.2,
  taxDropThresh = 0,
  SDThresh = 0.05,
  SDquantilThresh = 0,
  balanceCut = 0.2,
  verbose = TRUE,
  seed = 1
)

Value

A list containing 2 elements

full_results: The main results for IFAA containing the estimation and testing results for all associations between all taxa and all test covariates in testCov. It is a dataframe with each row representing an association, and eight columns named as "taxon", "cov", "estimate", "SE.est", "CI.low", "CI.up", "adj.p.value", and "sig_ind". The columns correspond to taxon name, covariate name, association estimates, standard error estimates, lower bound and upper bound of the 95% confidence interval, adjusted p value, and the indicator showing whether the association is significant after multiple testing adjustment.
metadata: The metadata is a list.
- covariatesData: A dataset containing covariates and confounders used in the analyses.
- final_ref_taxon: The final 2 reference taxa used for analysis.
- ref_taxon_count: The counts of selection for the associations of all taxa with test covariates in Phase 1.
- ref_taxon_est: The average magnitude estimates for the associations of all taxa with test covariates in Phase 1.
- totalTimeMins: Total time used for the entire analysis.
- seed: The seed used for the analysis for reproducibility.
- fdrRate: FDR rate used for the analysis.
- adjust_method: Multiple testing adjust method used for the analysis.
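
As a rough illustration of how the full_results data frame might be handled, the sketch below builds a toy stand-in with the eight documented columns and then keeps the significant associations for one covariate. The taxon names and all numbers are invented for illustration and are not IFAA output.

```r
# Toy stand-in for results$full_results; every value here is made up.
full_results <- data.frame(
  taxon       = c("taxonA", "taxonB", "taxonC", "taxonA"),
  cov         = c("v1", "v1", "v1", "v2"),
  estimate    = c(0.80, -0.10, 1.20, 0.05),
  SE.est      = c(0.20, 0.15, 0.30, 0.10),
  CI.low      = c(0.41, -0.39, 0.61, -0.15),
  CI.up       = c(1.19, 0.19, 1.79, 0.25),
  adj.p.value = c(0.010, 0.900, 0.002, 0.950),
  sig_ind     = c(TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Keep significant associations for covariate v1.
sig_v1 <- subset(full_results, sig_ind & cov == "v1")

# Order by adjusted p value, smallest first.
sig_v1 <- sig_v1[order(sig_v1$adj.p.value), ]

print(sig_v1[, c("taxon", "estimate", "CI.low", "CI.up", "adj.p.value")])
```

The same subsetting should carry over to a real results$full_results, since it relies only on the column names listed above.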

Arguments

experiment_dat: A SummarizedExperiment object containing microbiome data and covariates (see the example for how to create a SummarizedExperiment object). The microbiome data can be absolute abundance or relative abundance, with one column per sample and one row per taxon/OTU/ASV (or any other unit). No imputation is needed for zero-valued data points. The covariate data contain covariates and confounders, with one row per sample and one column per variable. The covariate data must be numeric or binary.
testCov: Covariates that are of primary interest for testing and estimating the associations. It corresponds to $X_i$ in the equation. Default is NULL which means all covariates are testCov.
ctrlCov: Potential confounders that will be adjusted in the model. It corresponds to $W_i$ in the equation. Default is NULL which means all covariates except those in testCov are adjusted as confounders.
sampleIDname: Name of the sample ID variable in the data. In the case that the data does not have an ID variable, this can be ignored. Default is NULL.
testMany: This takes a logical value TRUE or FALSE. If TRUE, testCov will contain all the covariates in the data, provided testCov is set to NULL. The default value is TRUE, which has no effect if testCov is not NULL.
ctrlMany: This takes logical value TRUE or FALSE. If TRUE, all variables except testCov are considered as control covariates provided ctrlCov is set to be NULL. The default value is FALSE.
nRef: The number of randomly picked reference taxa used in phase 1. Default number is 40.
nRefMaxForEsti: The maximum number of final reference taxa used in phase 2. The default is 2.
refTaxa: A vector of taxa or OTU or ASV names. These are reference taxa specified by the user to be used in phase 1. If the number of reference taxa is less than 'nRef', the algorithm will randomly pick extra reference taxa to make up 'nRef'. The default is NULL since the algorithm will pick reference taxa randomly.
adjust_method: The adjusting method for p value adjustment. Default is "BY" for dependent FDR adjustment. It can take any adjustment method for p.adjust function in R.
fdrRate: The false discovery rate for identifying taxa/OTU/ASV associated with testCov. Default is 0.15.
paraJobs: If sequentialRun is FALSE, this specifies the number of parallel jobs that will be registered to run the algorithm. If specified as NULL, it will automatically detect the cores to decide the number of parallel jobs. Default is NULL.
bootB: Number of bootstrap samples for obtaining confidence interval of estimates in phase 2 for the high dimensional regression. The default is 500.
standardize: This takes a logical value TRUE or FALSE. If TRUE, the design matrix for X will be standardized in the analyses and the results. Default is FALSE.
sequentialRun: This takes a logical value TRUE or FALSE. If TRUE, the analysis runs sequentially rather than as parallel jobs. Default is FALSE. This argument can be useful for debugging.
refReadsThresh: The threshold of proportion of non-zero sequencing reads for choosing the reference taxon in phase 2. The default is 0.2 which means at least 20% non-zero sequencing reads.
taxDropThresh: The threshold of number of non-zero sequencing reads for each taxon to be dropped from the analysis. The default is 0, which means taxa without any sequencing reads will be dropped from the analysis.
SDThresh: The threshold of standard deviation of sequencing reads for being chosen as the reference taxon in phase 2. The default is 0.05, which means the standard deviation of sequencing reads should be at least 0.05 for a taxon to be chosen as the reference taxon.
SDquantilThresh: The threshold of the quantile of the standard deviation of sequencing reads. Taxa above this quantile can be selected as the reference taxon. The default is 0.
balanceCut: The threshold of the proportion of non-zero sequencing reads in each group of a binary variable for choosing the final reference taxa in phase 2. The default number is 0.2 which means at least 20% non-zero sequencing reads in each group are needed to be eligible for being chosen as a final reference taxon.
verbose: Whether the process message is printed out to the console. The default is TRUE.
seed: Random seed for reproducibility. Default is 1. It can be set to be NULL to remove seeding.
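
To make the roles of adjust_method and fdrRate more concrete, the sketch below applies the "BY" adjustment from the base R p.adjust function to a few hypothetical raw p values and flags those at or below the FDR level. The p values are invented, and the simple threshold comparison is an assumption about how significance is flagged, not a description of IFAA internals.

```r
# Hypothetical raw p values, one per taxon-covariate association.
raw_p <- c(0.0001, 0.004, 0.03, 0.20, 0.65)

# "BY" is the Benjamini-Yekutieli adjustment, which controls the FDR
# under dependence between tests.
adj_p <- p.adjust(raw_p, method = "BY")

# Flag associations whose adjusted p value is at or below the FDR level
# (assumed comparison; 0.15 matches the documented default of fdrRate).
fdrRate <- 0.15
sig_ind <- adj_p <= fdrRate

print(data.frame(raw_p, adj_p, sig_ind))
```

Any other method accepted by p.adjust, such as "BH" or "bonferroni", can be substituted in the same call.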

Details

Most of the time, users just need to feed the first three inputs to the function: experiment_dat, testCov and ctrlCov. All other inputs can just take their default values. To model the association, the following equation is used:

$$\log(\mathcal{Y}_i^k) \mid \mathcal{Y}_i^k > 0 = \beta^{0k} + X_i^T \beta^k + W_i^T \gamma^k + Z_i^T b_i + \epsilon_i^k, \quad k = 1, \ldots, K+1$$ where

$\mathcal{Y}_i^k$ is the absolute abundance (AA) of taxon $k$ in subject $i$ in the entire ecosystem.
$X_i$ is the covariate matrix.
$W_i$ is the confounder matrix.
$Z_i$ is the design matrix for random effects.
$\beta^k$ is the vector of regression coefficients that will be estimated and tested with the IFAA() function.

The challenge in microbiome analysis is that $\mathcal{Y}_i^k$ cannot be observed. What is observed is a small proportion of it: $Y_i^k = C_i \mathcal{Y}_i^k$, where $C_i$ is an unknown number between 0 and 1 that denotes the observed proportion.

The IFAA method addresses this challenge. The IFAA() function estimates the parameters $\beta^k$ and their 95% confidence intervals. High-dimensional $X_i$ is handled by regularization.
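
One way to see why reference taxa are useful here: if the observed abundance is C_i times the absolute abundance for every taxon in subject i, then the log-ratio of any taxon to a reference taxon is the same on the observed and absolute scales, because C_i cancels. The sketch below is a toy numerical check of that cancellation with simulated values. It is not the IFAA estimation procedure, and all names and distributions in it are assumptions made for illustration.

```r
set.seed(1)
n_subj <- 6   # number of subjects
n_taxa <- 4   # number of taxa

# Hypothetical positive absolute abundances (rows = subjects, columns = taxa).
abs_abund <- matrix(rexp(n_subj * n_taxa, rate = 0.001), nrow = n_subj)

# Unknown subject-specific observed proportion C_i, strictly between 0 and 1.
C <- runif(n_subj, min = 0.001, max = 0.01)

# Observed data: row i is scaled by C[i]. Recycling goes down the columns,
# so a vector of length nrow(abs_abund) scales each row by its own C_i.
obs_abund <- abs_abund * C

# Log-ratios against a reference taxon (column 1, chosen arbitrarily).
log_ratio_abs <- log(abs_abund[, -1] / abs_abund[, 1])
log_ratio_obs <- log(obs_abund[, -1] / obs_abund[, 1])

# The two sets of log-ratios agree up to floating point error.
print(all.equal(log_ratio_abs, log_ratio_obs))
```

The check only covers strictly positive abundances, which mirrors the conditioning on positive values in the model above.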

References

Li et al. (2021) IFAA: Robust association identification and Inference For Absolute Abundance in microbiome analyses. Journal of the American Statistical Association. 116(536):1595-1608.

Examples

# \donttest{
library(IFAA)
library(SummarizedExperiment)

## If your data are already in a SummarizedExperiment object, you can skip
## the data processing steps below.

## load the example microbiome data. This could be relative abundance or absolute 
## abundance data. If you have a csv or tsv file for the microbiome data, you 
## can use read.csv() function or read.table() function in R to read the 
## data file into R.
data(dataM)
dim(dataM)
dataM[1:5, 1:8]

## load the example covariates data. If you have a csv or tsv file for the 
## covariates data, you can use read.csv() function or read.table() function 
## in R to read the data file into R.
data(dataC)
dim(dataC)
dataC[1:5, ]

## Merge microbiome data and covariate data by id to avoid unmatched observations.
data_merged <- merge(dataM, dataC, by = "id", all = FALSE)

## Separate microbiome data and covariate data; drop the id variable from the microbiome data
dataM_sub <- data_merged[, colnames(dataM)[!colnames(dataM) %in% c("id")]]
dataC_sub <- data_merged[, colnames(dataC)]

## Create SummarizedExperiment object (taxa in rows, samples in columns)
test_dat <- SummarizedExperiment(assays = list(MicrobData = t(dataM_sub)), colData = dataC_sub)

## If your data are already in a SummarizedExperiment object, you can
## skip the steps above.

## run IFAA function
results <- IFAA(experiment_dat = test_dat,
                testCov = c("v1", "v2"),
                ctrlCov = c("v3"),
                sampleIDname = c("id"),
                fdrRate = 0.05)

## to extract all results:
summary_res <- results$full_results
## to extract significant results:
sig_results <- subset(summary_res, sig_ind == TRUE)
# }
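
Since the covariate data must be numeric or binary, it can help to check the covariate table before building the SummarizedExperiment object. The sketch below uses a toy covariate table to find non-numeric columns and recode a two-level character variable as a 0/1 indicator. The table, the column name "sex", and the choice of coding are all hypothetical.

```r
# Toy covariate table; "sex" is a hypothetical two-level character variable.
covs <- data.frame(
  id  = 1:4,
  v1  = c(0.5, 1.2, -0.3, 0.8),
  sex = c("F", "M", "M", "F"),
  stringsAsFactors = FALSE
)

# Identify columns that are not numeric.
non_numeric <- names(covs)[!vapply(covs, is.numeric, logical(1))]
print(non_numeric)

# Recode the two-level variable as a 0/1 indicator (1 = "M").
covs$sex <- as.integer(covs$sex == "M")

# Every column should now be numeric.
stopifnot(all(vapply(covs, is.numeric, logical(1))))
```

Variables with more than two levels would need a different encoding, for example a set of indicator columns.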
