impute_lm: (Robust) Linear Regression Imputation

Description

Regression imputation methods, including linear regression, robust linear regression with \(M\)-estimators, and regularized regression (lasso, elastic net, ridge).

Usage

impute_lm(
  dat,
  formula,
  add_residual = c("none", "observed", "normal"),
  na_action = na.omit,
  impute_all = FALSE,
  ...
)
impute_rlm(
  dat,
  formula,
  add_residual = c("none", "observed", "normal"),
  na_action = na.omit,
  impute_all = FALSE,
  ...
)
impute_en(
  dat,
  formula,
  add_residual = c("none", "observed", "normal"),
  na_action = na.omit,
  impute_all = FALSE,
  family = c("gaussian", "poisson"),
  s = 0.01,
  ...
)

Value

dat, but imputed where possible.

Arguments

dat

[data.frame], with variables to be imputed and their predictors.

formula

[formula] imputation model description (see Model specification)

add_residual

[character] Type of residual to add. "normal" means that the imputed value is drawn from N(mu,sd) where mu and sd are estimated from the model's residuals (mu should equal zero in most cases). If add_residual = "observed", residuals are drawn (with replacement) from the model's residuals. Ignored for non-numeric predicted variables.

na_action

[function] what to do with missings in training data. By default cases with missing values in predicted or predictors are omitted (see `Missings in training data').

impute_all

[logical] If FALSE (default) then only missings in predicted variables are imputed. If TRUE, predictions are imputed for all records and if a prediction cannot be made then NA is imputed.

...

further arguments passed to

lm for impute_lm
rlm for impute_rlm
glmnet for impute_en

family

Response type for elasticnet / lasso regression. For family="gaussian" the imputed variables are general numeric variables. For family="poisson" the imputed variables are nonnegative counts. See glmnet for details.

s

The value of \(\lambda\) to use when computing predictions for lasso/elasticnet regression (parameter s of predict.glmnet). For impute_en the (optional) parameter lambda is passed to glmnet when estimating the model (which is advised against).
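The effect of add_residual and impute_all can be illustrated with a small sketch. It assumes the functions come from the simputation package (this page does not name the package), and the model formula is chosen only for illustration.

```r
library(simputation)  # assumed source of impute_lm

# Missing values in the response only
irisNA <- iris
irisNA[1:4, "Sepal.Length"] <- NA

# Deterministic imputation: the model prediction is used as is
det <- impute_lm(irisNA, Sepal.Length ~ Petal.Length + Species)

# Stochastic imputation: a residual sampled (with replacement) from the
# fitted model's residuals is added to each prediction
set.seed(1)  # residuals are sampled, so fix the seed for reproducibility
obs <- impute_lm(irisNA, Sepal.Length ~ Petal.Length + Species,
                 add_residual = "observed")

# impute_all = TRUE replaces every value of Sepal.Length by its prediction,
# not only the missing ones
all_pred <- impute_lm(irisNA, Sepal.Length ~ Petal.Length + Species,
                      impute_all = TRUE)

# Compare the first rows of the three results side by side
head(cbind(none = det$Sepal.Length,
           observed = obs$Sepal.Length,
           all = all_pred$Sepal.Length))
```

With add_residual = "none" the imputed values lie exactly on the fitted regression surface, which tends to understate the variance of the imputed variable; the residual options add noise back in.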

Model specification

Formulas are of the form

IMPUTED_VARIABLES ~ MODEL_SPECIFICATION [ | GROUPING_VARIABLES ]

The left-hand side of the formula object lists the variable or variables to be imputed. The right-hand side, excluding the optional GROUPING_VARIABLES, is the model specification for the underlying predictor.

If grouping variables are specified, the data set is split according to the values of those variables, and model estimation and imputation occur independently for each group.

Grouping using dplyr::group_by is also supported. If groups are defined in both the formula and using dplyr::group_by, the data is grouped by the union of grouping variables. Any missing value in one of the grouping variables results in an error.

Grouping is ignored for impute_const.
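A grouped formula can be sketched as follows. The example assumes the simputation package and compares the grouped result with a model fitted by hand on one group; the choice of variables is illustrative only.

```r
library(simputation)  # assumed source of impute_lm

irisNA <- iris
irisNA[c(1:4, 51:53), "Sepal.Length"] <- NA

# One regression of Sepal.Length on Petal.Length per species
by_species <- impute_lm(irisNA, Sepal.Length ~ Petal.Length | Species)

# The same split done by hand for the setosa rows, for comparison
setosa <- subset(irisNA, Species == "setosa")
fit <- lm(Sepal.Length ~ Petal.Length, data = setosa)
manual <- predict(fit, newdata = setosa[1:4, ])

cbind(grouped = by_species$Sepal.Length[1:4], manual = manual)
```

Each group gets its own coefficients, so a group with few complete records yields a less stable model than a pooled fit with Species as a predictor.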

Methodology

Linear regression model imputation with impute_lm can be used to impute numerical variables based on numerical and/or categorical predictors. Several common imputation methods, including ratio and (group) mean imputation can be expressed this way. See lm for details on possible model specification.
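As a sketch of how such methods map onto model formulas, the example below expresses group mean imputation as an intercept-only model per group, and a ratio-type imputation as a regression through the origin. It assumes the simputation package. Note that an unweighted regression through the origin is not identical to the classical ratio estimator sum(y)/sum(x), so the second call is an approximation of that idea.

```r
library(simputation)  # assumed source of impute_lm

irisNA <- iris
irisNA[1:4, "Sepal.Length"] <- NA

# Group mean imputation: intercept-only model, fitted separately per species
gm <- impute_lm(irisNA, Sepal.Length ~ 1 | Species)
gm$Sepal.Length[1:4]
mean(irisNA$Sepal.Length[irisNA$Species == "setosa"], na.rm = TRUE)

# Ratio-type imputation: single predictor, no intercept
ra <- impute_lm(irisNA, Sepal.Length ~ 0 + Petal.Length)
ra$Sepal.Length[1:4] / irisNA$Petal.Length[1:4]  # constant ratio
```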

Robust linear regression through M-estimation with impute_rlm can be used to impute numerical variables employing numerical and/or categorical predictors. In \(M\)-estimation, the minimization of the squares of residuals is replaced with an alternative convex function of the residuals that decreases the influence of outliers.

Also see e.g. Huber (2011).
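For reference, the M-estimation problem can be written as follows. This is the textbook form with Huber's loss as one common choice; the tuning constant \(k\) and the scale estimate \(\hat\sigma\) are generic symbols here, not values taken from this page.

\[
\hat\beta = \arg\min_{\beta} \sum_{i=1}^{n} \rho\!\left(\frac{y_i - x_i^{\top}\beta}{\hat\sigma}\right),
\qquad
\rho_k(u) =
\begin{cases}
\tfrac{1}{2}u^2 & \text{if } |u| \le k,\\
k|u| - \tfrac{1}{2}k^2 & \text{if } |u| > k.
\end{cases}
\]

Choosing \(\rho(u) = u^2/2\) recovers ordinary least squares. Huber's \(\rho_k\) grows only linearly for large residuals, which is what limits the influence of outliers.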

Lasso/elastic net/ridge regression imputation with impute_en can be used to impute numerical variables employing numerical and/or categorical predictors. For this method, the regression coefficients are found by minimizing the sum of squared residuals augmented with a penalty term depending on the size of the coefficients. For lasso regression (Tibshirani, 1996), the penalty term is the sum of absolute values of the coefficients. For ridge regression (Hoerl and Kennard, 1970), the penalty term is the sum of squares of the coefficients. Elastic net regression (Zou and Hastie, 2005) allows switching between lasso and ridge by penalizing with a weighted sum of the sum-of-squares and sum-of-absolute-values terms.
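The three penalties can be written in a single objective. The sketch below uses the parametrization documented for glmnet with a Gaussian response, where the mixing parameter \(\alpha\) corresponds to glmnet's alpha argument; it is assumed here that impute_en passes alpha through ... unchanged.

\[
\min_{\beta_0,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2
+ \lambda\left[\alpha\sum_{j=1}^{p}|\beta_j| + \frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^2\right]
\]

Setting \(\alpha = 1\) gives the lasso, \(\alpha = 0\) gives ridge regression, and values in between give the elastic net. The argument s selects the value of \(\lambda\) at which predictions are computed.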

References

Huber, P.J., 2011. Robust statistics (pp. 1248-1251). Springer Berlin Heidelberg.

Hoerl, A.E. and Kennard, R.W., 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), pp.55-67.

Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp.267-288.

Zou, H. and Hastie, T., 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), pp.301-320.

Examples

library(simputation)  # provides impute_lm and impute_rlm

data(iris)
irisNA <- iris
irisNA[1:4, "Sepal.Length"] <- NA
irisNA[3:7, "Sepal.Width"] <- NA

# impute a single variable (Sepal.Length)
i1 <- impute_lm(irisNA, Sepal.Length ~ Sepal.Width + Species)

# impute both Sepal.Length and Sepal.Width, using robust linear regression
i2 <- impute_rlm(irisNA, Sepal.Length + Sepal.Width ~ Species + Petal.Length)
