Hmisc (version 3.0-1)

aregImpute: Multiple Imputation using Additive Regression, Bootstrapping, and Predictive Mean Matching

Description

The transcan function creates flexible additive imputation models but provides only an approximation to true multiple imputation, as the imputation models are fixed before all multiple imputations are drawn. This ignores variability caused by having to fit the imputation models. aregImpute takes all aspects of uncertainty in the imputations into account by using the bootstrap to approximate the process of drawing predicted values from a full Bayesian predictive distribution. Different bootstrap resamples are used for each of the multiple imputations, i.e., for the ith imputation of a sometimes-missing variable, i=1,2,...,n.impute, a flexible additive model is fitted to a sample drawn with replacement from the original data, and this model is used to predict all of the original missing and non-missing values for the target variable.
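
As a rough conceptual sketch (not the internal Hmisc code), the bootstrap refitting idea for a single sometimes-missing continuous variable could look like the following, with lm standing in for the flexible additive model:

## Conceptual sketch only; lm stands in for the flexible ace/avas fit
set.seed(1)
d <- data.frame(y = rnorm(100), x = rnorm(100))
d$y[1:10] <- NA                      # make y sometimes missing
obs  <- which(!is.na(d$y))
miss <- which(is.na(d$y))
n.impute <- 5
imps <- matrix(NA, length(miss), n.impute)
for(i in 1:n.impute) {
  boot <- sample(obs, replace=TRUE)  # a different resample per imputation
  fit  <- lm(y ~ x, data=d[boot, ])  # refit the imputation model
  pred <- predict(fit, newdata=d)    # predict for all original rows
  imps[, i] <- pred[miss]            # aregImpute would now apply predictive
                                     # mean matching to these predictions
}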

Two methods are used to fit the imputation models, ace and avas. Unless the identity transformation is specified, these methods simultaneously find transformations of the target variable and of all of the predictors, to get a good fit assuming additivity. ace maximizes R-squared, and avas attempts to maximize R-squared while stabilizing the variance of residuals. When a categorical variable is being predicted, only ace is used. Like transcan's use of canonical regression, this is Fisher's optimum scoring method for categorical variables. For continuous variables, monotonic transformations of the target variable are assumed when avas is used. For ace, the default allows nonmonotonic transformations of target variables. When variables are used as predictors, the nonparametric transformations derived by ace or avas can be restricted by the user to be monotonic.
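
For illustration, the ace function itself (from the acepack package, which supplies the ace and avas functions) can be run directly. This small example only sketches how the transformations are estimated and is not part of aregImpute:

## Requires the acepack package
library(acepack)
set.seed(7)
x <- runif(300, 1, 10)
y <- (2 + 3*x)^2 + rnorm(300, 0, 5)
a <- ace(x, y)       # unrestricted transformations of y and x
plot(y, a$ty)        # estimated transformation of the target y
plot(x, a$tx)        # estimated transformation of the predictor x
a$rsq                # R-squared achieved on the transformed scales
## see ?ace and ?avas for arguments restricting transformations to be
## monotonic or linear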

Instead of taking random draws from fitted imputation models using random residuals as is done by transcan, aregImpute uses predictive mean matching with optional weighted probability sampling of donors rather than using only the closest match. Predictive mean matching works for binary, categorical, and continuous variables without the need for iterative maximum likelihood fitting for binary and categorical variables, and without the need for computing residuals or for curtailing imputed values to be in the range of actual data. Predictive mean matching is especially attractive when the variable being imputed is also being transformed automatically. See Details below for more information about the algorithm.
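
The following toy example sketches the two matching rules with made-up predicted values on the transformed scale; the exact internal weighting may differ, but it illustrates closest-donor matching versus tricube-weighted donor sampling:

## Hypothetical predictions for five donors and one missing value
pred.donors <- c(1.0, 1.2, 1.9, 2.4, 3.1)  # transformed predictions, observed rows
y.donors    <- c(10, 12, 19, 24, 31)       # observed target values of the donors
pred.target <- 2.0                         # prediction for the missing value

## match="closest": take the donor whose prediction is nearest
y.donors[which.min(abs(pred.donors - pred.target))]

## match="weighted" (default): sample a donor with tricube weights;
## fweighted scales the mean absolute difference (smaller values force
## the donor's prediction to be closer to the target's)
fweighted <- 0.2
d.abs <- abs(pred.donors - pred.target)
w     <- (1 - pmin(d.abs / (fweighted * mean(d.abs)), 1)^3)^3
y.donors[sample(length(y.donors), 1, prob=w)]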

A print method summarizes the results, and a plot method plots distributions of imputed values. Typically, fit.mult.impute will be called after aregImpute.

Usage

aregImpute(formula, data, subset, n.impute=5, group=NULL,
           method=c('ace','avas'), type=c('pmm','regression'),
           match=c('weighted','closest'), fweighted=0.2,
           defaultLinear=FALSE, x=FALSE, pr=TRUE, plotTrans=FALSE)
## S3 method for class 'aregImpute':
print(x, ...)
## S3 method for class 'aregImpute':
plot(x, nclass=NULL, type=c('ecdf','hist'), 
     diagnostics=FALSE, maxn=10, ...)

Arguments

formula
an S model formula. You can specify restrictions for transformations of variables. The function automatically determines which variables are categorical (i.e., factor, category, or character vectors). Binary variables are automatically restricted to be linear.
x
an object created by aregImpute. For aregImpute, set x to TRUE to save the data matrix containing the final (number n.impute) imputations in the result. This is needed if you want to
data
subset
These may also be specified. You may not specify na.action, as na.retain is always used.
n.impute
number of multiple imputations. n.impute=5 is frequently recommended but 10 or more doesn't hurt.
group
a character or factor variable the same length as the number of observations in data and containing no NAs. When group is present, causes a bootstrap sample of the observations corresponding to non-NAs of the target variable to have the same frequency distribution of group as in the original non-NA observations.
method
method ("ace", the default, or "avas") for modeling a variable to be imputed. As avas does not allow the response variable to be categorical, "ace" is always used for such variables.
type
The default is "pmn" for predictive mean matching, which is a more nonparametric approach that will work for categorical as well as continuous predictors. Alternatively, use "regression" when all variables that are sometim
match
Defaults to match="weighted" to do weighted multinomial probability sampling using the tricube function (similar to lowess) as the weights. The argument of the tricube function is the absolute difference in transformed predicted values
fweighted
Smoothing parameter (multiple of mean absolute difference) used when match="weighted", with a default value of 0.2. Set fweighted to a number between 0.02 and 0.2 to force the donor to have a predicted value closer to that of the observation being imputed.
defaultLinear
set to TRUE to force all continuous variables to be linear in any model. This is recommended when the sample size is small.
pr
set to FALSE to suppress printing of iteration messages
plotTrans
set to TRUE to plot ace or avas transformations for each variable for each of the multiple imputations. This is useful for determining whether transformations are reasonable. If transformations are too noisy, consider restricting them to be monotonic or linear (e.g., with defaultLinear=TRUE).
nclass
number of bins to use in drawing histogram
diagnostics
Specify diagnostics=TRUE to draw plots of imputed values against sequential imputation numbers, separately for each missing observation and variable.
maxn
Maximum number of observations shown for diagnostics. Default is maxn=10, which limits the number of observations plotted to at most the first 10.
...
other arguments that are ignored

Value

a list of class "aregImpute" containing the following elements:

  • call: the function call expression
  • formula: the formula specified to aregImpute
  • method: the method argument
  • n: total number of observations in the input dataset
  • p: number of variables
  • na: list of subscripts of observations for which values were originally missing
  • nna: named vector containing the numbers of missing values in the data
  • linear: vector of names of variables restricted to be linear
  • categorical: vector of names of categorical variables
  • monotone: vector of names of variables restricted to be monotonic
  • cat.levels: list containing character vectors specifying the levels of categorical variables
  • n.impute: number of multiple imputations per missing value
  • imputed: a list containing matrices of imputed values in the same format as those created by transcan. Categorical variables are coded using their integer codes. Variables having no missing values will have NULL matrices in the list.
  • rsq: for the last round of imputations, a vector containing the R-squares with which each sometimes-missing variable could be predicted from the others by ace or avas.
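
For example, assuming f is an object returned by aregImpute as in the Examples below, these components can be inspected directly:

f$na$x1             # subscripts of observations originally missing for x1
f$imputed$x1[1:5,]  # imputations (columns) for the first 5 missing x1 values
f$rsq               # R-squared for predicting each sometimes-missing variable
f$cat.levels        # levels of categorical variables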

Concepts

  • bootstrap
  • predictive mean matching
  • imputation
  • NA

Details

The aregImpute algorithm uses the following sequence of steps.

  1. For each variable containing m NAs where m > 0, initialize the NAs to values from a random sample (without replacement if a sufficient number of non-missing values exist) of size m from the non-missing values.
  2. For 3 + n.impute iterations, do steps 3 and 4. The first 3 iterations provide a burn-in, and imputations are saved only from the last n.impute iterations.
  3. For each variable containing any NAs, draw a sample with replacement from the observations in the entire dataset in which the current variable being imputed is non-missing. Fit a flexible additive model to predict this target variable while finding the optimum transformation of it (unless the identity transformation is forced). Use this fitted semiparametric model to predict the target variable in all of the original observations. Impute each missing value of the target variable with the observed value whose predicted transformed value is closest to the predicted transformed value of the missing value (if match="closest" and type="pmm"), or with a draw from a multinomial distribution with probabilities derived from the distance weights if match="weighted" (the default).
  4. After these imputations are computed, use these random-draw imputations the next time the current target variable is used as a predictor of other sometimes-missing variables.
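
A schematic sketch of this iteration scheme is given below. It is an illustration only: lm stands in for the flexible ace/avas fit and a simple closest-match rule stands in for the full predictive mean matching step.

## Schematic sketch of the aregImpute iteration scheme (not the internal code)
pmm.draw <- function(pred.obs, y.obs, pred.miss)   # closest-match version
  y.obs[sapply(pred.miss, function(p) which.min(abs(pred.obs - p)))]

set.seed(2)
d <- data.frame(x1=rnorm(60), x2=rnorm(60))
d$x1[1:8]   <- NA
d$x2[50:60] <- NA
n.impute <- 5
na <- lapply(d, function(v) which(is.na(v)))
## Step 1: initialize NAs with random draws from the non-missing values
for(v in names(d)) d[[v]][na[[v]]] <- sample(d[[v]][-na[[v]]], length(na[[v]]))
imputed <- list(x1=NULL, x2=NULL)
for(iter in 1:(3 + n.impute)) {             # Step 2: 3 burn-in iterations
  for(v in names(d)) {
    obs  <- setdiff(seq_len(nrow(d)), na[[v]])
    boot <- sample(obs, replace=TRUE)       # Step 3: bootstrap the non-missing rows
    fit  <- lm(reformulate(setdiff(names(d), v), v), data=d[boot, ])
    pred <- predict(fit, newdata=d)
    draw <- pmm.draw(pred[obs], d[[v]][obs], pred[na[[v]]])
    d[[v]][na[[v]]] <- draw                 # Step 4: reuse draws as predictor values
    if(iter > 3) imputed[[v]] <- cbind(imputed[[v]], draw)
  }
}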

When match="closest", predictive mean matching does not work well when fewer than 3 variables are used to predict the target variable, because many of the multiple imputations for an observation will be identical. In the extreme case of one right-hand-side variable and assuming that only monotonic transformations of left and right-side variables are allowed, every bootstrap resample will give predicted values of the target variable that are monotonically related to predicted values from every other bootstrap resample. The same is true for Bayesian predicted values. This causes predictive mean matching to always match on the same donor observation.

When the missingness mechanism for a variable is so systematic that the distribution of observed values is truncated, predictive mean matching does not work. It will only yield imputed values that are near observed values, so intervals in which no values are observed will not be populated by imputed values. For this case, the only hope is to make regression assumptions and use extrapolation. With type="regression", aregImpute will use linear extrapolation to obtain a (hopefully) reasonable distribution of imputed values. The "regression" option causes aregImpute to impute missing values by adding a random sample of residuals (with replacement if there are more NAs than measured values) on the scale of the ace or avas transformed target variable. After random residuals are added, predicted random draws are obtained on the original untransformed scale using reverse linear interpolation on the table of original and ace or avas transformed target values (linear extrapolation when a random residual is large enough to put the random draw prediction outside the range of observed values). The bootstrap is used as with type="pmm" to factor in the uncertainty of the imputation model.
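
A sketch of the type="regression" idea follows, with log used as a known stand-in for the ace or avas transformation and approxExtrap from Hmisc providing the extrapolating linear interpolation:

## Sketch only; log() stands in for the fitted ace/avas transformation
library(Hmisc)        # for approxExtrap
set.seed(4)
x <- runif(200, 1, 10)
y <- exp(1 + .3*x + rnorm(200, 0, .3))
y[1:15] <- NA
obs  <- !is.na(y)
dobs <- data.frame(x=x[obs], ty=log(y[obs]))
fit  <- lm(ty ~ x, data=dobs)
pred <- predict(fit, newdata=data.frame(x=x[!obs]))
## add random residuals on the transformed scale (with replacement only
## if there are more NAs than residuals)
draw <- pred + sample(resid(fit), sum(!obs), replace=sum(!obs) > length(resid(fit)))
## back-transform via linear interpolation/extrapolation on the ordered table
## of observed original vs. transformed values
o    <- order(dobs$ty)
imp  <- approxExtrap(dobs$ty[o], y[obs][o], xout=draw)$y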

See Also

fit.mult.impute, transcan, ace, naclus, naplot, mice, dotchart2, ecdf

Examples

# Multiple imputation and estimation of variances and covariances of
# regression coefficient estimates accounting for imputation
# Example 1: large sample size, much missing data, no overlap in
# NAs across variables
set.seed(3)
x1 <- factor(sample(c('a','b','c'),1000,TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(1000,0,2)
x3 <- rnorm(1000)
y  <- x2 + 1*(x1=='c') + .2*x3 + rnorm(1000,0,2)
orig.x1 <- x1[1:250]
orig.x2 <- x2[251:350]
x1[1:250] <- NA
x2[251:350] <- NA
d <- data.frame(x1,x2,x3,y)

# Use 100 imputations to better check against individual true values
f <- aregImpute(~y + x1 + x2 + x3, n.impute=100, data=d)
f
par(mfrow=c(2,1))
plot(f)
modecat <- function(u) {
 tab <- table(u)
 as.numeric(names(tab)[tab==max(tab)][1])
}
table(orig.x1,apply(f$imputed$x1, 1, modecat))
par(mfrow=c(1,1))
plot(orig.x2, apply(f$imputed$x2, 1, mean))
fmi <- fit.mult.impute(y ~ x1 + x2 + x3, lm, f, 
                       data=d)
sqrt(diag(Varcov(fmi)))
fcc <- lm(y ~ x1 + x2 + x3)
summary(fcc)   # SEs are larger than from mult. imputation


# Example 2: Very discriminating imputation models,
# x1 and x2 have some NAs on the same rows, smaller n
set.seed(5)
x1 <- factor(sample(c('a','b','c'),100,TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(100,0,.4)
x3 <- rnorm(100)
y  <- x2 + 1*(x1=='c') + .2*x3 + rnorm(100,0,.4)
orig.x1 <- x1[1:20]
orig.x2 <- x2[18:23]
x1[1:20] <- NA
x2[18:23] <- NA
#x2[21:25] <- NA
d <- data.frame(x1,x2,x3,y)
n <- naclus(d)
plot(n); naplot(n)  # Show patterns of NAs
# 100 imputations to study them; normally use 5 or 10
f  <- aregImpute(~y + x1 + x2 + x3, n.impute=100, defaultLinear=TRUE, data=d)
par(mfrow=c(2,3))
plot(f, diagnostics=TRUE, maxn=2)
# Note: diagnostics=TRUE makes graphs similar to those made by:
# r <- range(f$imputed$x2, orig.x2)
# for(i in 1:6) {  # use 1:2 to mimic maxn=2
#   plot(1:100, f$imputed$x2[i,], ylim=r,
#        ylab=paste("Imputations for Obs.",i))
#   abline(h=orig.x2[i],lty=2)
# }


table(orig.x1,apply(f$imputed$x1, 1, modecat))
par(mfrow=c(1,1))
plot(orig.x2, apply(f$imputed$x2, 1, mean))


fmi <- fit.mult.impute(y ~ x1 + x2, lm, f, 
                       data=d)
sqrt(diag(Varcov(fmi)))
fcc <- lm(y ~ x1 + x2)
summary(fcc)   # SEs are larger than from mult. imputation

# Study relationship between smoothing parameter for weighting function
# (multiplier of mean absolute distance of transformed predicted
# values, used in tricube weighting function) and standard deviation
# of multiple imputations.  SDs are computed from average variances
# across subjects.  match="closest" same as match="weighted" with
# small value of fweighted.
# This example also shows problems with predicted mean
# matching almost always giving the same imputed values when there is
# only one predictor (regression coefficients change over multiple
# imputations but predicted values are virtually 1-1 functions of each
# other)

set.seed(23)
x <- runif(200)
y <- x + runif(200, -.05, .05)
r <- resid(lsfit(x,y))
rmse <- sqrt(sum(r^2)/(200-2))   # sqrt of residual MSE

y[1:20] <- NA
d <- data.frame(x,y)
f <- aregImpute(~ x + y, n.impute=10, match='closest', data=d)
# As an aside here is how to create a completed dataset for imputation
# number 3 as fit.mult.impute would do automatically.  In this degenerate
# case changing 3 to 1-2,4-10 will not alter the results.
completed <- d
imputed <- impute.transcan(f, imputation=3, data=d, list.out=TRUE,
                           pr=FALSE, check=FALSE)
completed[names(imputed)] <- imputed
completed
sd <- sqrt(mean(apply(f$imputed$y, 1, var)))

ss <- c(0, .01, .02, seq(.05, 1, length=20))
sds <- ss; sds[1] <- sd

for(i in 2:length(ss)) {
  f <- aregImpute(~ x + y, n.impute=10, fweighted=ss[i], data=d)
  sds[i] <- sqrt(mean(apply(f$imputed$y, 1, var)))
}

plot(ss, sds, xlab='Smoothing Parameter', ylab='SD of Imputed Values',
     type='b')
abline(v=.2,  lty=2)  # default value of fweighted
abline(h=rmse, lty=2)  # root MSE of residuals from linear regression

# Do a similar experiment for the Titanic dataset
getHdata(titanic3)
h <- lm(age ~ sex + pclass + survived, data=titanic3)
rmse <- summary(h)$sigma
set.seed(21)
f <- aregImpute(~ age + sex + pclass + survived, n.impute=10,
                data=titanic3, match='closest')
sd <- sqrt(mean(apply(f$imputed$age, 1, var)))

ss <- c(0, .01, .02, seq(.05, 1, length=20))
sds <- ss; sds[1] <- sd

for(i in 2:length(ss)) {
  f <- aregImpute(~ age + sex + pclass + survived, data=titanic3,
                  n.impute=10, fweighted=ss[i])
  sds[i] <- sqrt(mean(apply(f$imputed$age, 1, var)))
}

plot(ss, sds, xlab='Smoothing Parameter', ylab='SD of Imputed Values',
     type='b')
abline(v=.2,   lty=2)  # default value of fweighted
abline(h=rmse, lty=2)  # root MSE of residuals from linear regression
