autotune_missForest: Perform imputation using missForest from the missForest package.

Description

The function uses the missForest package for data imputation. The out-of-bag error, OOBerror (more in autotune_mice), is used to perform the grid search.

Usage

autotune_missForest(
  df,
  col_type = NULL,
  percent_of_missing = NULL,
  cores = NULL,
  ntree_set = c(100, 200, 500, 1000),
  mtry_set = NULL,
  parallel = FALSE,
  col_0_1 = FALSE,
  optimize = TRUE,
  ntree = 100,
  mtry = NULL,
  verbose = FALSE,
  maxiter = 20,
  maxnodes = NULL,
  out_file = NULL
)

Value

Returns a data.frame with imputed values.

Arguments

df: data.frame. Data frame to impute, with column names.
col_type: character vector. Vector containing the type name of each column.
percent_of_missing: numeric vector. Vector containing the percentage of missing data in each column, for example c(0,1,0,0,11.3,..)
cores: integer. Number of threads used for parallel calculations. By default, approximately half of the available CPU cores.
ntree_set: integer vector. Numbers of trees to try in the grid search.
mtry_set: integer vector. Numbers of variables randomly sampled at each split to try in the grid search.
parallel: logical. If TRUE, parallel calculation is used.
col_0_1: logical. Decides whether to add an extra column indicating where imputation was done. 0 - value was in the dataset, 1 - value was imputed. Default FALSE.
optimize: logical. Whether to run the optimization (grid search) inside the function.
ntree: ntree argument of the missForest function.
mtry: mtry argument of the missForest function.
verbose: logical. If FALSE, the function does not print to the console.
maxiter: maxiter argument of the missForest function.
maxnodes: maxnodes argument of the missForest function.
out_file: Location of the output log file. If the file already exists, log messages are appended to it. If NULL, no log is produced.
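The effect of col_0_1 can be pictured with a small base R sketch. This is only an illustration and not the package's implementation: the helper name add_imputation_flags, the "_where" suffix, and the choice to add one indicator column per variable that had missing values are all assumptions made for this example.

```r
# Hypothetical helper (not part of the package): append one 0/1 indicator
# column for every variable that contained missing values before imputation.
# The "_where" suffix is an assumption made for this sketch.
add_imputation_flags <- function(original, imputed) {
  stopifnot(identical(dim(original), dim(imputed)))
  was_missing <- is.na(original)  # logical matrix, same shape as the data
  for (name in colnames(original)[colSums(was_missing) > 0]) {
    # 0 - value was in the dataset, 1 - value was imputed
    imputed[[paste0(name, "_where")]] <- as.integer(was_missing[, name])
  }
  imputed
}

original <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
imputed  <- data.frame(x = c(1, 2, 3),  y = c("a", "b", "a"))
flagged  <- add_imputation_flags(original, imputed)
print(flagged)
```

Keeping the indicators as plain integer columns lets later modelling steps treat "was imputed" as an ordinary feature.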

Author

Daniel J. Stekhoven (2013), Stekhoven D. J., & Buehlmann, P. (2012).

Details

The function tries to use a parallel backend when possible. Half of the available cores are used, or the number passed in the cores parameter. The number of cores used cannot be higher than the number of variables in df; if it is, the number of cores is set to ncol(df)-2, unless that number is <= 0, in which case cores = 1. To perform parallel calculation, the function uses registerDoParallel to create a parallel backend. Creating the backend can take significant time, so for a very small df, cores = 1 can speed up the calculation. After the calculation, the function turns off the parallel backend.
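The core-selection rule described above can be sketched in base R. The helper name choose_cores is hypothetical and the code is only one reading of the rule, not the package's source; "approximately half" is taken here to mean integer division by two.

```r
library(parallel)

# Hypothetical helper mirroring the rule in the Details text.
choose_cores <- function(df, cores = NULL) {
  if (is.null(cores)) {
    available <- parallel::detectCores()
    if (is.na(available)) available <- 1L  # detectCores() may return NA
    cores <- max(1L, available %/% 2L)     # roughly half of the cores
  }
  if (cores > ncol(df)) {
    cores <- ncol(df) - 2L                 # more cores than variables: reduce
  }
  if (cores <= 0) {
    cores <- 1L                            # never go below one core
  }
  as.integer(cores)
}

small_df <- data.frame(a = 1:3, b = 4:6)
wide_df  <- as.data.frame(matrix(0, nrow = 3, ncol = 10))

choose_cores(small_df, cores = 8)   # 2 columns: 2 - 2 = 0, falls back to 1
choose_cores(wide_df, cores = 16)   # 10 columns: reduced to 10 - 2 = 8
choose_cores(wide_df, cores = 4)    # within the limit: stays 4
```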

Grid search is used to choose the number of variables sampled at each split and the number of trees; it can be turned off. The parameters in the grid search have a significant influence on imputation quality, but the function should work with any reasonable values of these parameters.
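A grid search over ntree and mtry of the kind described above might look like the following sketch. The helper names grid_search and oob_score are hypothetical, and summing the components of OOBerror into a single score is an assumption of this example, not necessarily what autotune_missForest does. The missForest part runs only if the package is installed.

```r
# Generic grid search: evaluate score_fn on every (ntree, mtry) pair and
# return the row with the lowest score. score_fn is supplied by the caller.
grid_search <- function(ntree_set, mtry_set, score_fn) {
  grid <- expand.grid(ntree = ntree_set, mtry = mtry_set)
  grid$score <- mapply(score_fn, grid$ntree, grid$mtry)
  grid[which.min(grid$score), ]
}

# Scorer based on the out-of-bag error reported by missForest. OOBerror can
# hold more than one value (NRMSE and PFC for mixed-type data), so this
# sketch simply sums them; that aggregation is an assumption.
oob_score <- function(df) {
  function(ntree, mtry) {
    fit <- missForest::missForest(df, ntree = ntree, mtry = mtry)
    sum(fit$OOBerror)
  }
}

if (requireNamespace("missForest", quietly = TRUE)) {
  set.seed(1)
  iris_mis <- iris
  iris_mis[sample(nrow(iris_mis), 15), "Sepal.Length"] <- NA
  best <- grid_search(c(10, 20), c(1, 2), oob_score(iris_mis))
  print(best)
}
```

Passing the scorer as a function keeps the search loop independent of the imputation backend, so it can be checked with a cheap stand-in before running the slower random forest fits.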

References

Daniel J. Stekhoven (2013). missForest: Nonparametric Missing Value Imputation using Random Forest. R package version 1.4.
Stekhoven D. J., & Buehlmann, P. (2012). MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118.

Examples


{
  raw_data <- data.frame(
    a = as.factor(sample(c("red", "yellow", "blue", NA), 1000, replace = TRUE)),
    b = as.integer(1:1000),
    c = as.factor(sample(c("YES", "NO", NA), 1000, replace = TRUE)),
    d = runif(1000, 1, 10),
    e = as.factor(sample(c("YES", "NO"), 1000, replace = TRUE)),
    f = as.factor(sample(c("male", "female", "trans", "other", NA), 1000, replace = TRUE)))

  # Preparing col_type
  col_type <- c("factor", "integer", "factor", "numeric", "factor", "factor")

  percent_of_missing <- 1:6
  for (i in percent_of_missing) {
    percent_of_missing[i] <- 100 * (sum(is.na(raw_data[, i])) / nrow(raw_data))
  }


  imp_data <- autotune_missForest(raw_data, col_type, percent_of_missing,
    optimize = FALSE, parallel = FALSE)

  # Check that all missing values were imputed
  sum(is.na(imp_data)) == 0
  # TRUE
}
