NADIA (version 0.4.2)

autotune_missForest: Perform imputation using missForest from the missForest package.

Description

The function uses the missForest package for data imputation. OOBerror (the out-of-bag error; see autotune_mice for details) is used to perform grid search.

Usage

autotune_missForest(
  df,
  col_type = NULL,
  percent_of_missing = NULL,
  cores = NULL,
  ntree_set = c(100, 200, 500, 1000),
  mtry_set = NULL,
  parallel = FALSE,
  col_0_1 = FALSE,
  optimize = TRUE,
  ntree = 100,
  mtry = NULL,
  verbose = FALSE,
  maxiter = 20,
  maxnodes = NULL,
  out_file = NULL
)

Value

A data.frame with imputed values.

Arguments

df

data.frame. Data frame to impute, with column names.

col_type

character vector. Vector containing column type names.

percent_of_missing

numeric vector. Vector containing the percent of missing data in each column, for example c(0,1,0,0,11.3,..)

cores

integer. Number of threads used by parallel calculations. By default approximately half of available CPU cores.

ntree_set

integer vector. Numbers of trees to try in the grid search.

mtry_set

integer vector. Numbers of variables randomly sampled at each split to try in the grid search.

parallel

logical. If TRUE, parallel calculation is used.

col_0_1

logical. Whether to add an additional column indicating where imputation was done: 0 - value was in the dataset, 1 - value was imputed. Default FALSE.

optimize

logical. If TRUE, parameters are optimized inside the function.

ntree

ntree from the missForest function.

mtry

mtry from the missForest function.

verbose

logical. If FALSE, the function does not print to the console.

maxiter

maxiter from the missForest function.

maxnodes

maxnodes from the missForest function.

out_file

Output log file location. If the file already exists, log messages are appended to it. If NULL, no log is produced.
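
To make the 0/1 convention described for col_0_1 concrete, here is a minimal base R sketch of how such indicators could be derived from a data frame before imputation. The one-indicator-per-variable layout and the "_imputed" suffix are assumptions chosen for illustration; the names and layout that NADIA actually produces may differ.

```r
# Toy data with missing values (illustrative only).
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))

# 1 where a value is missing (and would be imputed), 0 where it is present.
indicator <- as.data.frame(lapply(df, function(col) as.integer(is.na(col))))

# Hypothetical naming scheme; the package may name these columns differently.
names(indicator) <- paste0(names(df), "_imputed")

indicator
```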

Author

Daniel J. Stekhoven (2013), Stekhoven D. J., & Buehlmann, P. (2012).

Details

The function tries to use a parallel backend when possible. It uses half of the available cores, or the number passed as the cores parameter. The number of cores used cannot be higher than the number of variables in df; if it is, the number of cores is set to ncol(df)-2, or to 1 if that value is <= 0. For parallel calculation the function uses registerDoParallel to create the parallel backend. Creating the backend can have a significant time cost, so for a very small df, cores=1 can speed up the calculation. After the calculation the function turns off the parallel backend.
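
The core-selection rule described above can be sketched in base R as follows. The helper choose_cores and its available argument are hypothetical names introduced here for illustration; this is a reading of the rule as stated, not NADIA's actual implementation.

```r
# Hypothetical helper mirroring the rule described above; not NADIA's code.
choose_cores <- function(df, cores = NULL, available = parallel::detectCores()) {
  # detectCores() can return NA on some platforms; fall back to one core.
  if (is.na(available)) available <- 1L
  # Default: roughly half of the available cores, but at least one.
  if (is.null(cores)) cores <- max(1L, available %/% 2L)
  # More cores than variables: fall back to ncol(df) - 2, or 1 if that is <= 0.
  if (cores > ncol(df)) {
    cores <- ncol(df) - 2L
    if (cores <= 0) cores <- 1L
  }
  as.integer(cores)
}

small_df <- data.frame(a = 1:3, b = 4:6)
wide_df <- as.data.frame(matrix(0, nrow = 3, ncol = 10))

choose_cores(small_df, cores = 8)   # 2 columns, so this falls back to 1
choose_cores(wide_df, cores = 16)   # 10 columns, so this becomes 8
choose_cores(wide_df, cores = 4)    # within the limit, stays 4
```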

Grid search is used to choose the number of variables sampled at each split and the number of trees; it can be turned off. The grid search parameters have a significant influence on imputation quality, but the function should work with any reasonable values of these parameters.
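
As a rough sketch of what a grid search over the number of trees and the number of variables per split looks like, the base R code below evaluates every combination and keeps the one with the lowest error. The mock_oob_error function and the mtry_set values are made up for illustration; a real run would fit missForest with each setting and use its out-of-bag error instead, and the package's internal search may be organised differently.

```r
# Candidate values: ntree_set matches the default above, mtry_set is made up.
ntree_set <- c(100, 200, 500, 1000)
mtry_set <- c(1, 2, 3)

# Every (ntree, mtry) combination to evaluate.
grid <- expand.grid(ntree = ntree_set, mtry = mtry_set)

# Stand-in for the out-of-bag error of one imputation run (assumption:
# a real search would run the imputation and read the OOB error here).
mock_oob_error <- function(ntree, mtry) {
  1 / sqrt(ntree) + abs(mtry - 2) * 0.05
}

grid$error <- mapply(mock_oob_error, grid$ntree, grid$mtry)

# Keep the combination with the lowest error.
best <- grid[which.min(grid$error), ]
best
```

With optimize = FALSE, as in the example below, this search is skipped and the ntree and mtry arguments are used directly.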

References

Daniel J. Stekhoven (2013). missForest: Nonparametric Missing Value Imputation using Random Forest. R package version 1.4. Stekhoven D. J., & Buehlmann, P. (2012). MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118.

Examples

{
  raw_data <- data.frame(
    a = as.factor(sample(c("red", "yellow", "blue", NA), 1000, replace = TRUE)),
    b = as.integer(1:1000),
    c = as.factor(sample(c("YES", "NO", NA), 1000, replace = TRUE)),
    d = runif(1000, 1, 10),
    e = as.factor(sample(c("YES", "NO"), 1000, replace = TRUE)),
    f = as.factor(sample(c("male", "female", "trans", "other", NA), 1000, replace = TRUE)))

  # Prepare col_type
  col_type <- c("factor", "integer", "factor", "numeric", "factor", "factor")

  # Percent of missing values in each column
  percent_of_missing <- unname(100 * colMeans(is.na(raw_data)))

  imp_data <- autotune_missForest(raw_data, col_type, percent_of_missing,
    optimize = FALSE, parallel = FALSE)

  # Check that all missing values were imputed
  sum(is.na(imp_data)) == 0
  # TRUE
}