impute_rhd(dat, formula, pool = c("complete", "univariate", "multivariate"), prob, backend = getOption("simputation.hdbackend", default = c("simputation", "VIM")), ...)
impute_shd(dat, formula, pool = c("complete", "univariate", "multivariate"), order = c("locf", "nocb"), backend = getOption("simputation.hdbackend", default = c("simputation", "VIM")), ...)
impute_pmm(dat, formula, predictor = impute_lm, pool = c("complete", "univariate", "multivariate"), ...)
impute_knn(dat, formula, pool = c("complete", "univariate", "multivariate"), k = 5, backend = getOption("simputation.hdbackend", default = c("simputation", "VIM")), ...)
impute_lm(dat, formula, add_residual = c("none", "observed", "normal"), na.action = na.omit, ...)
impute_rlm(dat, formula, add_residual = c("none", "observed", "normal"), na.action = na.omit, ...)
impute_const(dat, formula, add_residual = c("none", "observed", "normal"), ...)
impute_median(dat, formula, add_residual = c("none", "observed", "normal"), ...)
impute_proxy(dat, formula, add_residual = c("none", "observed", "normal"), ...)
impute_cart(dat, formula, add_residual = c("none", "observed", "normal"), cp, na.action = na.omit, ...)
impute_rf(dat, formula, add_residual = c("none", "observed", "normal"), na.action = na.omit, ...)
dat: [data.frame] with variables to be imputed and their predictors.
formula: [formula] imputation model description (see Details below).
prob: [numeric] Sampling probability weights (passed through to sample). Must be of length nrow(dat).
...: further arguments passed to lm for impute_lm, rlm for impute_rlm, order for impute_shd, predictor for impute_pmm, randomForest for impute_rf.
predictor: [function] Imputation to use for the predictive part in predictive mean matching. Any of the impute_ functions of this package (it makes no sense to use a hot-deck imputation).
add_residual: [character] Type of residual to add. "normal" means that a residual drawn from N(mu,sd) is added to the predicted value, where mu and sd are estimated from the model's residuals (mu should equal zero in most cases). If add_residual = "observed", residuals are drawn (with replacement) from the model's residuals. Ignored for non-numeric predicted variables.
na.action: [function] what to do with missings in training data. By default cases with missing values in predicted or predictors are omitted (see `Missings in training data').
cp: The complexity parameter used to prune the CART model. If omitted, no pruning takes place. If a single number, the same complexity parameter is used for each imputed variable. If of length # of variables imputed, the complexity parameters used must be in the same order as the predicted variables in the model formula.
Returns dat, but imputed where possible.
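As a hedged illustration of the add_residual argument described above, the sketch below imputes the same variable three ways. The data set irisNA and the result names are made up for this example, and the seed is only there to make the random draws repeatable.

```r
library(simputation)

# Example data: knock out a few values of one variable (illustrative only)
irisNA <- iris
irisNA[1:4, "Sepal.Length"] <- NA

set.seed(1)  # residual draws are random; fix the seed for repeatability

# "none" (the default): impute the model prediction itself
fit_only <- impute_lm(irisNA, Sepal.Length ~ Petal.Length + Species)

# "observed": prediction plus a residual resampled from the model's residuals
with_obs <- impute_lm(irisNA, Sepal.Length ~ Petal.Length + Species,
                      add_residual = "observed")

# "normal": prediction plus a residual drawn from N(mu, sd)
with_norm <- impute_lm(irisNA, Sepal.Length ~ Petal.Length + Species,
                       add_residual = "normal")

# Compare the four imputed values side by side
cbind(none     = fit_only$Sepal.Length[1:4],
      observed = with_obs$Sepal.Length[1:4],
      normal   = with_norm$Sepal.Length[1:4])
```

Adding a residual keeps the imputed values from sitting exactly on the regression surface, which matters when the variance of the imputed variable is of interest.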
impute_rhd: The predictor variables in the model formula are used to split the data set into groups prior to imputation (use ~ 1 to specify that no grouping is applied).
impute_shd: The predictor variables are used to sort the data.
impute_knn: The predictors are used to determine Gower's distance between records (see gower_topn).
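The different roles of the predictors can be sketched as follows. This is a minimal example under the assumption that the default simputation backend is in use; irisNA and the result names are illustrative.

```r
library(simputation)

irisNA <- iris
irisNA[1:4, "Sepal.Length"] <- NA

# Sequential hot deck: Petal.Length is used to sort the records,
# then the last observed value is carried forward ("locf")
shd <- impute_shd(irisNA, Sepal.Length ~ Petal.Length, order = "locf")

# k-nearest neighbours: the predictors define Gower's distance between
# records; the donor is taken from the k nearest complete records
set.seed(7)  # donor selection involves sampling
knn <- impute_knn(irisNA, Sepal.Length ~ Petal.Length + Petal.Width, k = 5)

head(knn$Sepal.Length)
```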
The pool argument is used to specify the donor pool as follows.
"complete". Only records for which the variables on the left-hand-side of the model formula are complete are used as donors. If a record has multiple missings, all imputations are taken from a single donor.
"univariate". Imputed variables are treated one by one and independently so the order of variable imputation is unimportant. If a record has multiple missings, separate donors are drawn for each missing value.
"multivariate". A donor pool is created for each missing data pattern. If a record has multiple missings, all imputations are taken from a single donor.
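To make the donor pool options concrete, here is a small sketch using random hot deck with two imputed variables. The data set and object names are assumptions for illustration; rows 3 and 4 are missing both variables, so they are the records where the choice of pool matters.

```r
library(simputation)

irisNA <- iris
irisNA[1:4, "Sepal.Length"] <- NA
irisNA[3:7, "Sepal.Width"]  <- NA

set.seed(42)  # donors are sampled at random

# "complete": donors are records with both Sepal.Length and Sepal.Width
# observed; a record missing both takes both values from one donor
rhd_complete <- impute_rhd(irisNA, Sepal.Length + Sepal.Width ~ Species,
                           pool = "complete")

# "univariate": each variable is imputed on its own, so a record missing
# both values may receive them from two different donors
rhd_univariate <- impute_rhd(irisNA, Sepal.Length + Sepal.Width ~ Species,
                             pool = "univariate")

# Inspect the records that had missing values
rhd_complete[1:7, c("Sepal.Length", "Sepal.Width", "Species")]
rhd_univariate[1:7, c("Sepal.Length", "Sepal.Width", "Species")]
```

Taking all values from a single donor preserves the relation between the imputed variables within a record, at the cost of a smaller donor pool.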
The VIM backend is selected by setting backend="VIM" for functions supporting this option. Alternatively, one can set options(simputation.hdbackend="VIM") so it becomes the default. Simputation will map the simputation call to a function in the VIM package. In particular:
impute_rhd is mapped to VIM::hotdeck, where imputed variables are passed to the variable argument and the union of predictor and grouping variables is passed to domain_var. Extra arguments in ... are passed to VIM::hotdeck as well. Argument pool is ignored.
impute_shd is mapped to VIM::hotdeck, where imputed variables are passed to the variable argument, predictor variables to ord_var and grouping variables to domain_var. Extra arguments in ... are passed to VIM::hotdeck as well. Arguments pool and order are ignored. In VIM the donor pool is determined on a per-variable basis, equivalent to setting pool="univariate" with the simputation backend. VIM is LOCF-based. Differences between simputation and VIM likely occur when the sorting variables contain missings.
impute_knn is mapped to VIM::kNN, where imputed variables are passed to variable, predictor variables are passed to dist_var and grouping variables are ignored with a message. Extra arguments in ... are passed to VIM::kNN as well. Argument pool is ignored.
Note that simputation adheres strictly to Gower's original definition of the distance measure, while VIM uses a generalized variant that can take ordered factors into account.
VIM can also add indicator variables that mark which values were imputed; pass imp_var=TRUE to request them.
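Switching backends might look like the sketch below. It assumes the VIM package is installed, so the call is guarded with requireNamespace; irisNA and knn_vim are illustrative names, and results may differ from the simputation backend for the reasons given above.

```r
library(simputation)

irisNA <- iris
irisNA[1:4, "Sepal.Length"] <- NA

# Only run when VIM is available (assumption: it may not be installed)
if (requireNamespace("VIM", quietly = TRUE)) {
  # Per-call selection of the backend; the call is mapped to VIM::kNN
  knn_vim <- impute_knn(irisNA, Sepal.Length ~ Petal.Length + Petal.Width,
                        k = 5, backend = "VIM")
  print(head(knn_vim))

  # Alternatively, make VIM the default hot deck backend for the session:
  # options(simputation.hdbackend = "VIM")
}
```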
Formulas are of the form
IMPUTED_VARIABLES ~ MODEL_SPECIFICATION [ | GROUPING_VARIABLES ]
The left-hand-side of the formula object lists the variable or variables to be imputed. The interpretation of the independent variables on the right-hand-side depends on the underlying imputation model. If grouping variables are specified, the data set is split according to the values of those variables, and model estimation and imputation occur independently for each group. Grouping using dplyr::group_by is also supported. If groups are defined in both the formula and using dplyr::group_by, the data is grouped by the union of grouping variables. Any missing value in one of the grouping variables results in an error. Grouping is ignored for impute_const.
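A grouped model specification can be sketched as follows, assuming the same kind of example data used elsewhere on this page (irisNA and the result names are illustrative). The part after | names the grouping variable, so a separate linear model is fitted and applied within each species.

```r
library(simputation)

irisNA <- iris
irisNA[1:4, "Sepal.Length"] <- NA

# One model for the whole data set
pooled <- impute_lm(irisNA, Sepal.Length ~ Petal.Length)

# One model per species: estimation and imputation happen within each group
by_species <- impute_lm(irisNA, Sepal.Length ~ Petal.Length | Species)

# The imputed values generally differ between the two specifications
cbind(pooled     = pooled$Sepal.Length[1:4],
      by_species = by_species$Sepal.Length[1:4])
```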
impute_lm, impute_rlm, impute_cart and impute_rf take an argument na.action that specifies what to do with missings in training data. The default action is to train models on data where both the predicted and predictor variables are available. Some of the interesting options are:
na.omit: omit cases where predictor or predicted is missing. This is the default.
rpart::na.rpart: omit cases where the predicted is missing but keep cases where one or more predictors are missing. Relevant for impute_cart.
randomForest::na.roughfix: temporarily impute all predictors and predicted with the column median (for numeric data) or the mode (for categorical data) in order to fit the model.
Model | Description
impute_lm | Use stats::lm to train the imputation model.
impute_rlm | Use MASS::rlm to train the imputation model.
impute_median | Median imputation. Predictors are treated as grouping variables for computing medians.
impute_const | Impute a constant value.
impute_proxy | Copy a value from the predictor variable.
impute_rhd | Random hot deck. Predictors are used to group the donors.
impute_shd | Sequential hot deck. Predictors sort the data (use ~ 1 for no sorting).
impute_knn | k-nearest neighbour imputation. Predictors are used to determine Gower's distance.
impute_pmm | Predictive mean matching.
impute_cart | Use rpart::rpart to train a CART model.
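The three simplest models in the table can be sketched side by side. The data set and object names are assumptions for illustration, and the constant 7 is an arbitrary value chosen for the example.

```r
library(simputation)

irisNA <- iris
irisNA[1:4, "Sepal.Length"] <- NA

# Median imputation: Species acts as a grouping variable for the medians
med <- impute_median(irisNA, Sepal.Length ~ Species)

# Constant imputation: every missing value is replaced by 7
const <- impute_const(irisNA, Sepal.Length ~ 7)

# Proxy imputation: copy the value of Petal.Length into the gaps
proxy <- impute_proxy(irisNA, Sepal.Length ~ Petal.Length)

# Compare the values imputed for the four missing records
cbind(median = med$Sepal.Length[1:4],
      const  = const$Sepal.Length[1:4],
      proxy  = proxy$Sepal.Length[1:4])
```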
See also: lm, rlm, rpart
data(iris)
irisNA <- iris
irisNA[1:4, "Sepal.Length"] <- NA
irisNA[3:7, "Sepal.Width"] <- NA
# impute a single variable (Sepal.Length)
i1 <- impute_lm(irisNA, Sepal.Length ~ Sepal.Width + Species)
# impute both Sepal.Length and Sepal.Width, using robust linear regression
i2 <- impute_rlm(irisNA, Sepal.Length + Sepal.Width ~ Species + Petal.Length)