simputation (version 0.2.0)

impute_rhd: Impute missing data

Description

Fit an imputation model and use it to impute missing data.

Usage

impute_rhd(dat, formula, pool = c("complete", "univariate", "multivariate"), prob, backend = getOption("simputation.hdbackend", default = c("simputation", "VIM")), ...)
impute_shd(dat, formula, pool = c("complete", "univariate", "multivariate"), order = c("locf", "nocb"), backend = getOption("simputation.hdbackend", default = c("simputation", "VIM")), ...)
impute_pmm(dat, formula, predictor = impute_lm, pool = c("complete", "univariate", "multivariate"), ...)
impute_knn(dat, formula, pool = c("complete", "univariate", "multivariate"), k = 5, backend = getOption("simputation.hdbackend", default = c("simputation", "VIM")), ...)
impute_lm(dat, formula, add_residual = c("none", "observed", "normal"), na.action = na.omit, ...)
impute_rlm(dat, formula, add_residual = c("none", "observed", "normal"), na.action = na.omit, ...)
impute_const(dat, formula, add_residual = c("none", "observed", "normal"), ...)
impute_median(dat, formula, add_residual = c("none", "observed", "normal"), ...)
impute_proxy(dat, formula, add_residual = c("none", "observed", "normal"), ...)
impute_cart(dat, formula, add_residual = c("none", "observed", "normal"), cp, na.action = na.omit, ...)
impute_rf(dat, formula, add_residual = c("none", "observed", "normal"), na.action = na.omit, ...)

Arguments

dat
[data.frame], with variables to be imputed and their predictors.
formula
[formula] imputation model description (see Details below).
pool
Specify donor pool. See under 'Hot deck imputation'.
prob
[numeric] Sampling probability weights (passed through to sample). Must be of length nrow(dat).
backend
Choose the backend for imputation.
...
further arguments passed to
  • lm for impute_lm
  • rlm for impute_rlm
  • order for impute_shd
  • The predictor for impute_pmm
  • randomForest for impute_rf
order
Last Observation Carried Forward ("locf") or Next Observation Carried Backward ("nocb").
predictor
[function] Imputation to use for predictive part in predictive mean matching. Any of the impute_ functions of this package (it makes no sense to use a hot-deck imputation).
k
Number of nearest neighbours to draw the donor from.
add_residual
[character] Type of residual to add. "normal" means that the imputed value is drawn from N(mu,sd) where mu and sd are estimated from the model's residuals (mu should equal zero in most cases). If add_residual = "observed", residuals are drawn (with replacement) from the model's residuals. Ignored for non-numeric predicted variables.
na.action
[function] What to do with missings in training data. By default, cases with missing values in the predicted variable or in the predictors are omitted (see 'Missings in training data').
cp
The complexity parameter used to prune the CART model. If omitted, no pruning takes place. If a single number, the same complexity parameter is used for each imputed variable. If its length equals the number of imputed variables, the complexity parameters must be in the same order as the predicted variables in the model formula.
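
As a rough illustration of the add_residual argument, the sketch below imputes the same variable once without and once with a random residual. The data set irisNA and the seed are assumptions made up for this example; they are not part of the package.

```r
library(simputation)

# Hypothetical test data: blank out a few Sepal.Length values in iris
irisNA <- iris
irisNA[1:10, "Sepal.Length"] <- NA

# Deterministic imputation: only the model prediction is used
fit_none <- impute_lm(irisNA, Sepal.Length ~ Sepal.Width + Species,
                      add_residual = "none")

# Stochastic imputation: prediction plus a residual drawn from N(mu, sd)
set.seed(42)  # arbitrary seed, only to make the draw reproducible
fit_normal <- impute_lm(irisNA, Sepal.Length ~ Sepal.Width + Species,
                        add_residual = "normal")

# Compare the first imputed values of both variants
head(cbind(none = fit_none$Sepal.Length, normal = fit_normal$Sepal.Length), 10)
```

With "none" every rerun gives the same imputations; with "normal" or "observed" the imputed values vary between runs unless a seed is set.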

Value

dat, but imputed where possible.

Hot deck imputation

  • impute_rhd: The predictor variables in the formula are used to split the data set into groups prior to imputation (use ~ 1 to specify that no grouping is applied).
  • impute_shd: The predictor variables are used to sort the data.
  • impute_knn: The predictors are used to determine Gower's distance between records (see gower_topn).
The pool argument is used to specify the donor pool as follows.
  • "complete". Only records for which the variables on the left-hand-side of the model formula are complete are used as donors. If a record has multiple missings, all imputations are taken from a single donor.
  • "univariate". Imputed variables are treated one by one and independently so the order of variable imputation is unimportant. If a record has multiple missings, separate donors are drawn for each missing value.
  • "multivariate". A donor pool is created for each missing data pattern. If a record has multiple missings, all imputations are taken from a single donor.
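
The three hot-deck variants and the pool argument might be used as in the following sketch. The data set dat, the choice of predictors, and the seed are assumptions for illustration only.

```r
library(simputation)

# Hypothetical test data with missings in two variables
dat <- iris
dat[1:3, "Sepal.Length"] <- NA
dat[2:5, "Sepal.Width"]  <- NA

set.seed(1)  # arbitrary seed; random hot deck draws donors at random

# Random hot deck: donors are drawn within each Species group.
# pool = "complete": one donor supplies all missings of a record.
rhd <- impute_rhd(dat, Sepal.Length + Sepal.Width ~ Species,
                  pool = "complete")

# Same model, but a separate donor is drawn for each missing value
rhd_uni <- impute_rhd(dat, Sepal.Length + Sepal.Width ~ Species,
                      pool = "univariate")

# Sequential hot deck: sort on Petal.Length, carry last observation forward
shd <- impute_shd(dat, Sepal.Length ~ Petal.Length, order = "locf")

# k-nearest neighbours: Gower's distance computed on the two petal variables
knn <- impute_knn(dat, Sepal.Length ~ Petal.Length + Petal.Width, k = 3)
```

Because hot-deck methods copy observed values, every imputed value in rhd is a value that occurs in a donor record of the same Species.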

Using the VIM backend

The VIM package has efficient implementations of several popular imputation methods. In particular, its random and sequential hotdeck implementation is faster and more memory-efficient than that of the current package. Moreover, VIM offers more fine-grained control over the imputation process than simputation. If you have VIM installed, it can be used by setting backend="VIM" for functions supporting this option. Alternatively, one can set options(simputation.hdbackend="VIM") so it becomes the default. Simputation will map the simputation call to a function in the VIM package. In particular:
  • impute_rhd is mapped to VIM::hotdeck where imputed variables are passed to the variable argument and the union of predictor and grouping variables is passed to domain_var. Extra arguments in ... are passed to VIM::hotdeck as well. Argument pool is ignored.
  • impute_shd is mapped to VIM::hotdeck where imputed variables are passed to the variable argument, predictor variables to ord_var and grouping variables to domain_var. Extra arguments in ... are passed to VIM::hotdeck as well. Arguments pool and order are ignored. In VIM the donor pool is determined on a per-variable basis, equivalent to setting pool="univariate" with the simputation backend. VIM is LOCF-based. Differences between simputation and VIM are likely to occur when the sorting variables contain missings.
  • impute_knn is mapped to VIM::kNN where imputed variables are passed to variable, predictor variables are passed to dist_var and grouping variables are ignored with a message. Extra arguments in ... are passed to VIM::kNN as well. Argument pool is ignored. Note that simputation adheres strictly to Gower's original definition of the distance measure, while VIM uses a generalized variant that can take ordered factors into account.
By default, VIM's imputation functions add indicator variables to the original data to trace what values have been imputed. This is switched off by default for consistency with the rest of the simputation package, but it may be turned on again by setting imp_var=TRUE.

Specifying the imputation model

Formulas are of the form

IMPUTED_VARIABLES ~ MODEL_SPECIFICATION [ | GROUPING_VARIABLES ]

The left-hand-side of the formula object lists the variable or variables to be imputed. The interpretation of the independent variables on the right-hand-side depends on the underlying imputation model. If grouping variables are specified, the data set is split according to the values of those variables, and model estimation and imputation occur independently for each group.

Grouping using dplyr::group_by is also supported. If groups are defined in both the formula and using dplyr::group_by, the data is grouped by the union of grouping variables. Any missing value in one of the grouping variables results in an error. Grouping is ignored for impute_const.
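
A minimal sketch of the grouping syntax, assuming a made-up data set dat with one missing value per species: the first call fits one linear model for the whole data set, the second fits a separate model within each Species.

```r
library(simputation)

# Hypothetical test data: one missing Sepal.Length in each species
dat <- iris
dat[c(1, 51, 101), "Sepal.Length"] <- NA

# One model estimated on all records
pooled <- impute_lm(dat, Sepal.Length ~ Sepal.Width)

# Grouping variable after '|': model fitted and applied per Species
grouped <- impute_lm(dat, Sepal.Length ~ Sepal.Width | Species)

# The imputed values generally differ between the two specifications
cbind(pooled  = pooled$Sepal.Length[c(1, 51, 101)],
      grouped = grouped$Sepal.Length[c(1, 51, 101)])
```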

Details

The functions are designed to be robust against failing imputations. This means that rather than emitting an error, functions show the following behaviour.
  • If a value cannot be imputed because one of its predictors is missing, the value will remain missing after imputation.
  • If a model cannot be fitted, e.g. because the imputed model is missing, a warning is emitted and for that variable no imputation will take place.
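
The first behaviour can be seen in a small sketch like the one below, where dat is a made-up data set. Following up with a simpler method for the remaining missings is one possible pattern, not something the package prescribes.

```r
library(simputation)

# Hypothetical test data: row 1 misses both the target and its predictor,
# row 2 misses only the target
dat <- iris
dat[1, c("Sepal.Length", "Sepal.Width")] <- NA
dat[2, "Sepal.Length"] <- NA

out <- impute_lm(dat, Sepal.Length ~ Sepal.Width)

is.na(out$Sepal.Length[1])  # predictor was missing, so no imputation
is.na(out$Sepal.Length[2])  # predictor was available, so imputed

# Possible fallback: fill what is left with the group median per Species
out2 <- impute_median(out, Sepal.Length ~ Species)
```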

Missings in training data

For model-based imputation, including those based on (robust) linear models, cart models and random forests, there is an option called na.action that specifies what to do with missings in training data. The default action is to train models on data where both the predicted and predictor variables are available. Some of the interesting options are
  • na.omit: omit cases where predictor or predicted is missing. This is the default.
  • rpart::na.rpart: omit cases where the predicted is missing but keep cases where one or more predictors are missing. Relevant for impute_cart.
  • randomForest::na.roughfix: temporarily impute all predictors and predicted with the column median (for numeric data) or the mode (for categorical data) in order to fit the model.
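
As a hedged sketch of how na.action might be passed, the example below trains a CART model twice on a made-up data set dat: once with the default na.omit and once with rpart::na.rpart, which keeps training records that have a missing predictor. It assumes the rpart package is installed.

```r
library(simputation)

# Hypothetical test data: missings in the target and in one predictor
dat <- iris
dat[1:5,  "Sepal.Length"] <- NA   # values to impute
dat[6:20, "Sepal.Width"]  <- NA   # incomplete predictor in training data

# Default: training uses only records complete in target and predictors
cart_omit <- impute_cart(dat, Sepal.Length ~ Sepal.Width + Petal.Length)

# Keep training records with missing predictors (rpart handles them itself)
cart_keep <- impute_cart(dat, Sepal.Length ~ Sepal.Width + Petal.Length,
                         na.action = rpart::na.rpart)
```

The second call trains on more records, so its imputations may differ from the first.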

Model descriptions

Model           Description
impute_lm       Use stats::lm to train the imputation model.
impute_rlm      Use MASS::rlm to train the imputation model.
impute_median   Median imputation. Predictors are treated as grouping variables for computing medians.
impute_const    Impute a constant value.
impute_proxy    Copy a value from the predictor variable.
impute_rhd      Random hot deck. Predictors are used to group the donors.
impute_shd      Sequential hot deck. Predictors sort the data (use ~ 1 for no sorting).
impute_knn      k-nearest neighbour imputation. Predictors are used to determine Gower's distance.
impute_pmm      Predictive mean matching.
impute_cart     Use rpart::rpart to train a CART model.
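
The three simplest models above could be called as in this sketch. The data set dat and the constant 7 are arbitrary choices made for the example.

```r
library(simputation)

# Hypothetical test data: three missing Sepal.Length values (all setosa)
dat <- iris
dat[1:3, "Sepal.Length"] <- NA

# Median imputation, with the median computed within each Species
med <- impute_median(dat, Sepal.Length ~ Species)

# Constant imputation: the right-hand side is the value to fill in
const <- impute_const(dat, Sepal.Length ~ 7)

# Proxy imputation: copy the value of Sepal.Width into Sepal.Length
proxy <- impute_proxy(dat, Sepal.Length ~ Sepal.Width)
```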

See Also

Getting started with simputation, lm, rlm, rpart

Examples

library(simputation)

data(iris)
irisNA <- iris
irisNA[1:4, "Sepal.Length"] <- NA
irisNA[3:7, "Sepal.Width"] <- NA

# impute a single variable (Sepal.Length)
i1 <- impute_lm(irisNA, Sepal.Length ~ Sepal.Width + Species)

# impute both Sepal.Length and Sepal.Width, using robust linear regression
i2 <- impute_rlm(irisNA, Sepal.Length + Sepal.Width ~ Species + Petal.Length)
