lrDA: Log-ratio DA algorithm

Description

This function implements a simulation-based Data Augmentation (DA) algorithm to impute left-censored values (e.g. values below detection limit, rounded zeros) via coordinates representation of compositional data sets which incorporate the information of the relative covariance structure. Alternatively, this function can be used to impute missing data. Multiple imputation estimates can be also obtained from the output.

Usage

lrDA(X, label = NULL, dl = NULL,
        ini.cov=c("lrEM","complete.obs","multRepl"), delta = 0.65,
        imp.missing = FALSE, n.iters = 1000, m = 1, store.mi = FALSE, closure = NULL)

Arguments

Compositional data set (matrix or data.frame class).

label

Unique label (numeric or character) used to denote unobserved left-censored values in X.

Numeric vector or matrix of detection limits/thresholds. These must be given on the same scale as X.

ini.cov

Initial estimation of the log-ratio covariance matrix. It can be based on lrEM estimation ("lrEM", default), complete observations ("complete.obs") or multiplicative simple replacement ("multRepl").

delta

If ini.cov="multRepl", delta parameter for initial multiplicative simple replacement (multRepl) in proportions (default = 0.65).

imp.missing

If TRUE then unobserved data identified by label are treated as missing data (default = FALSE).

n.iters

Number of iterations for the DA algorithm (default = 1000).

Number of multiple imputations (default = 1).

store.mi

Logical value. If m>1 creates a list with m imputed data matrices. (store.mi=FALSE, default).

closure

Closure value used to add a residual part if needed when multiplicative simple replacement is used to initiate the DA algorithm, either directly (ini.cov="multRepl") or as part of lrEM estimation (ini.cov="lrEM") (see ?multRepl).

Value

A data.frame object containing the imputed compositional data set in the same scale as the original, or a list of imputed data sets if multiple imputation is carried out (m>1) and store.mi=TRUE.

Details

After convergence of the Markov chain Monte Carlo (MCMC) iterative process to its steady state, this function imputes unobserved compositional parts by simulated values from their posterior predictive distributions through coordinates representation, given the information from the observed data. For left-censoring problems, it allows for either single (vector form) or multiple (matrix form, same size as X) limits of detection by component. Any threshold value can be set for non-censored elements (e.g. use 0 if no threshold for a particular column or element of the data matrix).

It produces imputed data sets on the same scale as the input data set. If X is not closed to a constant sum, then the results are adjusted to provide a compositionally equivalent data set, expressed in the original scale, which leaves the absolute values of the observed components unaltered.

The common conjugate normal inverted-Wishart distribution with non-informative Jeffreys prior has been assumed for the model parameters in the coordinates space. Under this setting, convergence is expected to be fast (n.iters set to 1000 by default). Besides, considering EM parameter estimates as initial point for the DA algorithm (ini.cov="lrEM") assures faster convergence by starting near the centre of the posterior distribution.

By setting m greater than 1, the procedure also allows for multiple imputations of the censored values drawn at regular intervals after convergence. In this case, in addition to the burn-in period for convergence, n.iters determines the gap, large enough to prevent from correlated values, between successive imputations. The total number of iterations is then n.iters*m. By default, a single imputed data set results from averaging the m imputations in the space of coordinates. If store.mi=TRUE, a list with m imputed data sets is generated instead.

In the case of censoring patterns involving samples containing only one observed component, these are imputed by multiplicative simple replacement (multRepl) and a warning message identifying them is printed.

Missing data imputation

This function can be employed to impute missing data by setting imp.missing = TRUE. For this case, the argument label indicates the unique label for missing values. The argument dl is ignored as it is meaningless here.

References

Palarea-Albaladejo J, Martin-Fernandez JA, Olea, RA. A bootstrap estimation scheme for chemical compositional data with nondetects. Journal of Chemometrics 2014; 28: 585-599.

Palarea-Albaladejo J. and Martin-Fernandez JA. zCompositions -- R package for multivariate imputation of left-censored data under a compositional approach. Chemometrics and Intelligence Laboratory Systems 2015; 143: 85-96.

Examples

Run this code

# NOT RUN {
# Data set closed to 100 (percentages, common dl = 1%)
X <- matrix(c(26.91,8.08,12.59,31.58,6.45,14.39,
              39.73,26.20,0.00,15.22,6.80,12.05,
              10.76,31.36,7.10,12.74,31.34,6.70,
              10.85,46.40,31.89,10.86,0.00,0.00,
              7.57,11.35,30.24,6.39,13.65,30.80,
              38.09,7.62,23.68,9.70,20.91,0.00,
              27.67,7.15,13.05,32.04,6.54,13.55,
              44.41,15.04,7.95,0.00,10.82,21.78,
              11.50,30.33,6.85,13.92,30.82,6.58,
              19.04,42.59,0.00,38.37,0.00,0.00),byrow=TRUE,ncol=6)

# Imputation by single simulated values
X_lrDA <- lrDA(X,label=0,dl=rep(1,6),ini.cov="multRepl",n.iters=150)

# Imputation by multiple imputation (m = 5, one imputation every 150 iterations)
X_milrDA <- lrDA(X,label=0,dl=rep(1,6),ini.cov="multRepl",m=5,n.iters=150)

# Multiple limits of detection by component
mdl <- matrix(0,ncol=6,nrow=10)
mdl[2,] <- rep(1,6)
mdl[4,] <- rep(0.75,6)
mdl[6,] <- rep(0.5,6)
mdl[8,] <- rep(0.5,6)
mdl[10,] <- c(0,0,1,0,0.8,0.7)

X_lrDA2 <- lrDA(X,label=0,dl=mdl,ini.cov="multRepl",n.iters=150)

# Non-closed compositional data set
data(LPdata) # data (ppm/micrograms per gram)
dl <- c(2,1,0,0,2,0,6,1,0.6,1,1,0,0,632,10) # limits of detection (0 for no limit)
LPdata2 <- subset(LPdata,select=-c(Cu,Ni,La))  # select a subset for illustration purposes
dl2 <- dl[-c(5,7,10)]

# }
# NOT RUN {
 # May take a little while
LPdata2_lrDA <- lrDA(LPdata2,label=0,dl=dl2)
# }
# NOT RUN {
# }
# NOT RUN {
 # May take a little while
# Treating zeros as missing data for illustration purposes only
LPdata2_lrDAmiss <- lrDA(LPdata2,label=0,imp.missing=TRUE,closure=10^6)
# }
# NOT RUN {
# }

Run the code above in your browser using DataLab