imputes a data.frame with mixed variable types using automated machine learning (AutoML)
mlim(
data = NULL,
m = 1,
algos = c("ELNET"),
postimpute = FALSE,
stochastic = m > 1,
ignore = NULL,
tuning_time = 900,
max_models = NULL,
maxiter = 10L,
cv = 10L,
matching = "AUTO",
autobalance = TRUE,
balance = NULL,
seed = NULL,
verbosity = NULL,
report = NULL,
tolerance = 0.001,
doublecheck = TRUE,
preimpute = "RF",
cpu = -1,
ram = NULL,
flush = FALSE,
preimputed.data = NULL,
save = NULL,
load = NULL,
shutdown = TRUE,
java = NULL,
...
)
a data.frame with the missing values imputed, carrying the estimated imputation error from cross-validation within the data.frame's attributes
a data.frame (strictly) with missing data to be imputed. if the 'load' argument is provided, this argument will be ignored.
integer, specifying the number of multiple imputations. the default value is 1, which carries out a single imputation.
character vector, specifying algorithms to be used for missing data imputation. supported algorithms are "ELNET", "RF", "GBM", "DL", "XGB", and "Ensemble". if more than one algorithm is specified, mlim changes behavior to save on runtime. for example, the default is "ELNET", which fine-tunes an Elastic Net model. In general, "ELNET" is expected to be the best algorithm because it fine-tunes very fast, it is very robust to over-fitting, and hence, it generalizes very well. However, if your data has many factor variables, each with several levels, it is recommended to have c("ELNET", "RF") as your imputation algorithms (and possibly add "Ensemble" as well, to make the most out of tuning the models).
Note that "XGB" is only available on Mac OS and Linux. moreover, "GBM", "DL", and "XGB" take the full given "tuning_time" (see below) to tune the best model for imputing the given variable, whereas "ELNET" will produce only one fine-tuned model, often in less time than other algorithms need for developing a single model, which is why "ELNET" is the workhorse of the mlim imputation package.
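as a minimal sketch (assuming 'dfNA' is a data.frame with missing values, as in the examples at the end of this page), several algorithms can be combined and stacked:
# fine-tune ELNET and RF models and stack them in an Ensemble
MLIM <- mlim(dfNA, algos = c("ELNET", "RF", "Ensemble"))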
(EXPERIMENTAL FEATURE) logical. if TRUE, mlim uses the specified algorithms (rather than 'ELNET') for carrying out the postimputation optimization. if FALSE, all specified algorithms will be used together in the process of 'reimputation'. the 'Ensemble' algorithm is encouraged when other algorithms are used. however, for general users unspecialized in machine learning, postimpute is NOT recommended because this feature is currently experimental, prone to over-fitting, and computationally intensive.
logical. by default it is set to TRUE for multiple imputation and FALSE for single imputation. the 'stochastic' argument is currently under testing and is intended to avoid inflating the correlation between imputed variables.
character vector of column names or indices of columns that should be ignored in the process of imputation.
integer. maximum runtime (in seconds) for fine-tuning the imputation model for each variable in each iteration. the default is 900 seconds, but for a large dataset you might need to allow a longer model development time. this argument also influences 'max_models' (see below, and the sketch after the 'max_models' description). if you are using the 'ELNET' algorithm (default), you can be generous with the 'tuning_time' argument because 'ELNET' tunes much faster than the rest and will only produce one model.
integer. maximum number of models that can be generated in the process of fine-tuning the parameters. this value defaults to 100, meaning that for imputing each variable in each iteration, up to 100 models can be fine-tuned. increasing this value should go together with increasing 'tuning_time', allowing the model to spend more time in the process of individualized fine-tuning. as a result, the better tuned the model, the more accurate the imputed values are expected to be.
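as a rough sketch (the values below are illustrative, not recommendations), 'tuning_time' and 'max_models' can be raised together so that the extra runtime is actually spent on additional models:
# allow up to 30 minutes and up to 200 models per variable per iteration
MLIM <- mlim(dfNA, algos = c("ELNET", "RF"), tuning_time = 1800, max_models = 200)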
integer. maximum number of iterations. the default value is 10, but it can be reduced to 3 (not recommended, see below).
integer. specifies the number of folds for k-fold cross-validation (CV). values of 10 or higher are recommended. the default is 10.
character or logical. if TRUE, imputed values are coerced to the closest value among the non-missing values of the variable. if set to "AUTO", 'mlim' decides whether to match or not based on the variable classes. the default is "AUTO".
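a minimal sketch of forcing matched imputations rather than leaving the decision to 'mlim':
# coerce imputed values to the closest observed values of each variable
MLIM <- mlim(dfNA, matching = TRUE)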
logical. if TRUE (default), binary and multinomial factor variables will be balanced before the imputation to obtain fairer and less-biased imputations; otherwise, imputations are typically biased in favor of the majority class. if FALSE, imputation fairness will be sacrificed for overall accuracy, which is not recommended, although it is commonly practiced in other missing data imputation software. MLIM is highly concerned with imputation fairness for factor variables, and autobalancing is generally recommended. in fact, higher overall accuracy does not mean a better imputation if minority classes are neglected, which increases the bias in favor of the majority class. if you do not wish to autobalance all the factor variables, you can manually specify the variables that should be balanced using the 'balance' argument (see below).
character vector, specifying variable names that should be balanced before imputation. balancing the prevalence might decrease the overall accuracy of the imputation because it attempts to ensure the representation of the rare outcome. this argument is optional and intended for advanced users who impute a severely imbalanced categorical (nominal) variable.
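for example, a sketch of balancing a single factor variable instead of autobalancing all of them (the variable name 'Species' is borrowed from the iris example at the end of this page):
# balance only the 'Species' variable before imputation
MLIM <- mlim(dfNA, autobalance = FALSE, balance = "Species")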
integer. specifies the seed of the random number generator.
character. controls how much information is printed to the console. the value can be "warn" (default), "info", "debug", or NULL.
filename. if a filename is specified (e.g. report = "mlim.md"), the "md.log" R package is used to generate a Markdown progress report for the imputation. the format of the report is adapted to the 'verbosity' argument: the higher the verbosity, the more technical the report becomes. if verbosity equals "debug", a log file is generated that includes a time stamp and shows which function generated each message. otherwise, a reduced markdown-like report is generated. the default is NULL.
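a sketch of requesting a detailed progress report (the file name is arbitrary):
# write a technical progress log to 'mlim.md'
MLIM <- mlim(dfNA, report = "mlim.md", verbosity = "debug")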
numeric. the minimum rate of improvement in the estimated error metric of a variable to qualify the imputation for another round of iteration, if 'maxiter' is not yet reached. any improvement of the imputation is desirable. however, specifying values above 0 can reduce the number of required iterations at a marginal increase of imputation error. for larger datasets, a value of 1e-3 is recommended to reduce the number of iterations. the default value is 1e-3.
logical. the default is TRUE (which is conservative). if FALSE, a variable whose estimated imputation error does not improve will not be reimputed in the following iterations. in general, deactivating this argument will slightly reduce the imputation accuracy; however, it significantly reduces the computation time. if your dataset is large, you are advised to set this argument to FALSE. (EXPERIMENTAL: consider that by avoiding several iterations that marginally improve the imputation accuracy, you might gain higher accuracy by investing your computational resources in fine-tuning better algorithms such as "GBM".)
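for a large dataset, a sketch of trading a marginal amount of accuracy for speed might look like this:
# stop iterating early and do not reimpute variables that stopped improving
MLIM <- mlim(dfNA, tolerance = 1e-3, doublecheck = FALSE)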
character. specifies the 'primary' procedure for handling the missing data. before 'mlim' begins imputing the missing observations, they must be prepared for the imputation algorithms and thus replaced with some values. the default procedure is a quick "RF", which models the missing data with a parallel Random Forest model. this is a very fast procedure, and its values are later replaced within the "reimputation" procedure (see below). another possible alternative is "mm", which carries out mean/mode replacement, as practiced by most imputation algorithms. "mm" is much faster than "RF". if your dataset is very large, consider pre-imputing it beforehand using the 'mlim.preimpute()' function and passing the preimputed dataset to mlim (see the 'preimputed.data' argument).
integer. number of CPUs to be dedicated to the imputation. the default (-1) takes all of the available CPUs.
integer. specifies the maximum size, in gigabytes, of the memory allocated to the imputation. by default, all the available memory is used. a large memory allocation is particularly advised, especially for multicore processes: the more you give, the more you get!
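a sketch of capping the hardware resources (the numbers are illustrative):
# use 4 CPU cores and at most 8 gigabytes of RAM
MLIM <- mlim(dfNA, cpu = 4, ram = 8)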
logical (experimental). if TRUE, the server is cleaned after each model to retrieve RAM. this feature is in testing mode and is currently set to FALSE by default, but it is recommended if you have a limited amount of RAM or a large dataset.
data.frame. if you have used another software for missing data imputation, you can still optimize the imputation by passing the imputed data.frame to this argument, which will bypass the "preimpute" procedure.
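a sketch of pre-imputing the data first and then passing it to 'mlim' (see the help page of 'mlim.preimpute()' for its exact arguments; the call below assumes its defaults):
# quick preimputation, then optimization with mlim
pre <- mlim.preimpute(dfNA)
MLIM <- mlim(dfNA, preimputed.data = pre)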
filename (with .mlim extension). if a filename is specified, an mlim object is saved after the imputation of each variable is completed. this object not only includes the imputed data.frame and the estimated cross-validation error, but also the information needed for continuing the imputation, which is a very useful feature for imputing large datasets with a long runtime. this argument is activated by default, and an mlim object named "mlim.rds" is stored in the local working directory.
filename (with .mlim extension). an object of class "mlim", which includes the data, arguments, and settings for re-running the imputation from where it was previously stopped. the "mlim" object saves the current state of the imputation and is particularly recommended for large datasets or when computationally intensive settings are specified (e.g. several algorithms, a longer tuning time, etc.).
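a sketch of saving the imputation state and resuming it later (the file name is arbitrary):
# save the state of the imputation after each variable
MLIM <- mlim(dfNA, save = "imputation.mlim")
# later, resume the imputation from where it stopped
MLIM <- mlim(load = "imputation.mlim")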
logical. if TRUE, the h2o server is closed after the imputation. the default is TRUE.
character, specifying the path to the executable 64bit Java JDK on Microsoft Windows machines, if JDK is installed but the path environment variable is not set.
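on Windows, a sketch of pointing 'mlim' to a JDK installation (the path below is hypothetical; adjust it to your machine):
# path to the 64bit Java JDK executable
MLIM <- mlim(dfNA, java = "C:/Program Files/Java/jdk-17/bin/java.exe")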
arguments that are used internally between 'mlim' and 'mlim.postimpute'. these arguments are not documented in the help file and are not intended to be used by the end user.
E. F. Haghish
if (FALSE) {
data(iris)
# add stratified missing observations to the data. to make the example run
# faster, I add NAs only to a single variable.
dfNA <- iris
dfNA$Species <- mlim.na(dfNA$Species, p = 0.1, stratify = TRUE, seed = 2022)
# run the ELNET single imputation (fastest imputation via 'mlim')
MLIM <- mlim(dfNA, shutdown = FALSE)
# in single imputation, you can estimate the imputation accuracy via cross validation RMSE
mlim.summarize(MLIM)
### or if you want to carry out ELNET multiple imputation with 5 datasets.
### next, to carry out analysis on the multiple imputation, use the 'mlim.mids' function
### minimum of 5 datasets
MLIM2 <- mlim(dfNA, m = 5)
mids <- mlim.mids(MLIM2, dfNA)
fit <- with(data = mids, exp = glm(Species ~ Sepal.Length, family = "binomial"))
res <- mice::pool(fit)
summary(res)
# you can check the accuracy of the imputation, if you have the original dataset
mlim.error(MLIM2, dfNA, iris)
}