missForest: Imputes a dataframe and returns imputation models to be used on new observations

Description

Imputes a dataframe and (if save_models = TRUE) returns imputation models to be used on new observations.

Usage

missForest(
  xmis,
  maxiter = 10,
  fixed_maxiter = FALSE,
  var_weights = NULL,
  decreasing = FALSE,
  initialization = "mean/mode",
  x_init = NULL,
  class.weights = NULL,
  return_integer_as_integer = FALSE,
  save_models = TRUE,
  predictor_matrix = NULL,
  proportion_usable_cases = c(1, 0),
  verbose = TRUE,
  convergence_error = "OOB",
  ...
)

Value

Object of class missForest with elements

ximp: dataframe with imputed values
init: x_init if custom initialization is used; otherwise list of mean/mode or median/mode for each variable
initialization: value of initialization parameter
impute_sequence: vector of variable names in the order in which imputation has been run
maxiter: maxiter parameter as passed to the function
models: list of random forest models for each iteration
return_integer_as_integer: Parameter return_integer_as_integer as passed to the function
integer_columns: list of columns of integer type in the data
OOB_err: dataframe with out-of-bag errors for each iteration and each variable

Arguments

xmis: dataframe containing missing values, of class data.frame ("tibble" class tbl_df is also supported). Matrix format is not supported. See details for column format.
maxiter: maximum number of iterations. By default the algorithm will stop when convergence is reached or after maxiter iterations, whichever occurs first.
fixed_maxiter: if set to TRUE, the algorithm will run for the exact number of iterations specified in maxiter, regardless of the convergence criteria. Default is FALSE.
var_weights: named vector of weights for each variable in the convergence criteria. The names should correspond to variable names. By default the weights are set to the proportion of missing values on each variable.
decreasing: (boolean) if TRUE, variables are imputed in order of decreasing amount of missing values (the variable with the highest amount of missing values is imputed first). If FALSE, the variable with the lowest amount of missing values is imputed first.
initialization: initialization method before running RF models; supported: mean/mode, median/mode and custom. Default is mean/mode.
x_init: if initialization = custom, a complete dataframe to be used as initialization (see vignette for example).
class.weights: a named list containing the class.weights parameter to be passed to ranger for categorical variables. The names of the list must match the names of the categorical variables in the dataframe. (See ranger function documentation in ranger package for details).
return_integer_as_integer: Internally, integer columns are treated as double (double precision floating point numbers). If TRUE, the imputations will be rounded to closest integer and returned as integer (This might be desirable for count variables). If FALSE, integer columns will be returned as double (This might be desirable, for example, for patient age imputation). Default is FALSE. The same behaviour will be applied to new observations when using missForestPredict.
save_models: if TRUE, imputation models are saved and a new observation (or a test set) can be imputed using the models learned; saving models on a dataset with a high number of variables will consume memory (RAM) on the machine. Default is TRUE.
predictor_matrix: predictor matrix indicating which variables to use in the imputation of each variable. See documentation for function create_predictor_matrix for details on the matrix format.
proportion_usable_cases: a vector with two components: the first one is a minimum threshold for p_obs and the second one is a maximum threshold for p_miss. Variables for which p_obs is greater than or equal to 1 (by default) will be filtered from the predictor matrix. Variables for which p_miss is lower than or equal to 0 (by default) will be filtered from the predictor matrix. For more details on p_obs and p_miss see the documentation for the prop_usable_cases function. If parameter predictor_matrix is specified, the proportion_usable_cases will be applied to this provided matrix.
verbose: (boolean) if TRUE then missForest reports OOB error estimates (MSE and NMSE) and runtime.
convergence_error: Which error should be used for the convergence criterion. Supported values: OOB and apparent. If a different value is provided, it defaults to OOB. See vignette for full details on convergence.
...: other arguments passed to ranger function (some arguments that are specific to each variable type are not supported). See vignette for num.trees example.

Details

An adaptation of the original missForest algorithm (Stekhoven et al. 2012) is used. Variables are initialized with a mean/mode, median/mode or custom imputation. Then, they are imputed iteratively "on the fly" for a maximum number of iterations or until the convergence criteria are met. The imputation sequence is ordered by either increasing or decreasing amount of missing values. At each iteration, a random forest model is built for each variable, using the observed (non-missing) values of that variable as the outcome. The predictors are the values of the other variables: from the previous iteration for the first variable in the sequence, and, for later variables in the sequence, from the current iteration for the variables already imputed (on-the-fly). The ranger package (Wright et al. 2017) is used for building the random forest models.
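The initialization and ordering steps described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's internal code: init_mean_mode and the toy dataframe are hypothetical, the sequence is assumed to contain only variables with missing values, and the random forest fitting itself is left out.

```r
# Hypothetical helper (not part of missForestPredict): mean/mode initialization.
init_mean_mode <- function(df) {
  for (v in names(df)) {
    miss <- is.na(df[[v]])
    if (!any(miss)) next
    if (is.numeric(df[[v]])) {
      # Continuous variable: fill with the mean of the observed values
      df[[v]][miss] <- mean(df[[v]], na.rm = TRUE)
    } else {
      # Categorical variable: fill with the most frequent observed category
      counts <- table(df[[v]])
      df[[v]][miss] <- names(counts)[which.max(counts)]
    }
  }
  df
}

toy <- data.frame(
  x = c(1, NA, 3, 5),
  g = factor(c("a", "b", NA, "b")),
  z = c(NA, NA, 2, 4)
)

# Imputation sequence for decreasing = FALSE: fewest missing values first.
# Assumption: only variables with at least one missing value are included.
n_miss <- colSums(is.na(toy))
impute_sequence <- names(sort(n_miss[n_miss > 0], decreasing = FALSE))

# Starting point for the first iteration of random forest models
toy_init <- init_mean_mode(toy)
```

In this toy data, z has the most missing values and is therefore last in the sequence; with decreasing = TRUE it would come first.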

The convergence criterion is based on the out-of-bag (OOB) error or the apparent error and uses NMSE (normalized mean squared error) for both continuous and categorical variables.
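As a rough illustration of the quantities involved, the sketch below computes an NMSE for one continuous variable and a weighted average across variables. It assumes NMSE is the MSE divided by the variance of the observed values, and that var_weights enter as a weighted mean; the helper nmse and all numbers are hypothetical. The exact definitions, including the categorical case and how consecutive iterations are compared, are given in the vignette.

```r
# Hypothetical helper: MSE normalized by the variance of the observed values.
nmse <- function(observed, predicted) {
  mse <- mean((observed - predicted)^2)
  mse / mean((observed - mean(observed))^2)
}

y    <- c(2, 4, 6, 8)   # observed values of one continuous variable
pred <- c(3, 4, 5, 8)   # e.g. OOB predictions for the same observations
nmse_y <- nmse(y, pred) # MSE 0.5 divided by variance 5 gives 0.1

# Combining per-variable errors with weights (made-up values).
# By default the weights are the proportion of missing values per variable.
nmse_by_var  <- c(x = 0.40, z = 0.10)
weights      <- c(x = 0.25, z = 0.50)
weighted_err <- sum(weights * nmse_by_var) / sum(weights)
```

An NMSE close to 0 means the model predicts much better than the variable's mean; a value near 1 means it does no better than the mean.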

Imputation models for all variables and all iterations are saved (if save_models is TRUE) and can be later applied to new observations.
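A minimal sketch of that workflow, assuming missForestPredict() accepts the fitted object and a newdata dataframe (check the missForestPredict help page for the exact signature). The train/test split and object names are illustrative only.

```r
library(missForestPredict)

data(iris)
set.seed(1)

# Illustrative split: 30 rows held out as "new observations"
test_rows      <- sample(nrow(iris), 30)
iris_train_mis <- produce_NA(iris[-test_rows, ], proportion = 0.1)
iris_test_mis  <- produce_NA(iris[test_rows, ], proportion = 0.1)

# Learn the imputation models on the training data and keep them
imp <- missForest(iris_train_mis, save_models = TRUE,
                  verbose = FALSE, num.threads = 2)

# Impute the held-out rows with the saved models (no refitting)
iris_test_imp <- missForestPredict(imp, newdata = iris_test_mis)
```

With save_models = FALSE the training data is still imputed, but there are no stored models to apply to new observations.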

Both dataframe and tibble (tbl_df class) are supported as input. The imputed dataframe will be returned with the same class. Numeric and integer columns are supported and treated internally as continuous variables. Factor and character columns are supported and treated internally as categorical variables. Other types (like boolean or dates) are not supported. NA values are considered missing values.
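Before calling missForest it can help to check column types against these rules. The base R sketch below uses a hypothetical helper, column_support, and a made-up dataframe; it is not part of the package and only mirrors the type rules stated above.

```r
# Hypothetical helper: classify each column by how it would be treated.
column_support <- function(df) {
  vapply(df, function(col) {
    if (is.numeric(col)) {
      "continuous"      # numeric and integer columns
    } else if (is.factor(col) || is.character(col)) {
      "categorical"     # factor and character columns
    } else {
      "unsupported"     # e.g. logical or Date columns
    }
  }, character(1))
}

toy <- data.frame(
  age    = c(34L, NA, 51L),
  weight = c(70.5, 82.1, NA),
  sex    = factor(c("f", NA, "m")),
  city   = c("Ghent", "Leuven", NA),
  smoker = c(TRUE, NA, FALSE),
  visit  = as.Date(c("2020-01-01", NA, "2020-03-01")),
  stringsAsFactors = FALSE
)

support     <- column_support(toy)
unsupported <- names(support)[support == "unsupported"]  # "smoker" "visit"

# One possible fix for a logical column: convert it to a factor
toy$smoker <- factor(toy$smoker)
```

Date columns have no obvious counterpart among the supported types; whether to drop them or derive numeric features from them depends on the analysis.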

References

Stekhoven, D. J., & Bühlmann, P. (2012). MissForest-non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118. doi:10.1093/bioinformatics/btr597
Wright, M. N., & Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), 1-17. doi:10.18637/jss.v077.i01

Examples

library(missForestPredict)

data(iris)
iris_mis <- produce_NA(iris, proportion = 0.1)
imputation_object <- missForest(iris_mis, num.threads = 2)
iris_imp <- imputation_object$ximp
