impute_missings: Impute Missing Values Using Specified Method

Description

Fills in missing values (NA) in numeric data using a specified imputation method. Provides a unified interface to univariate, multivariate, ensemble, and diagnostic imputation approaches. The function automatically handles method-specific parameters and error recovery.

Usage

impute_missings(
  x,
  method = "rf_missForest",
  ImputationRepetitions = 10,
  seed = NULL,
  x_orig = NULL
)

Value

Returns a data frame with the same dimensions and column names as the input x, but with missing values filled in according to the specified method. If imputation fails, returns a data frame with all values set to NA.
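
The return contract above can be checked programmatically. The following is a minimal sketch, not part of the package: check_imputation_result is a hypothetical helper, and the small data frames are stand-ins for real output of impute_missings so that the code runs without the package installed.

```r
# Hypothetical helper (an assumption, not provided by the package): classifies
# a result according to the contract described in the Value section.
check_imputation_result <- function(x, imputed) {
  x <- as.data.frame(x)
  # Same dimensions and column names as the input
  stopifnot(identical(dim(x), dim(imputed)),
            identical(colnames(x), colnames(imputed)))
  if (all(is.na(imputed))) {
    return("failed")       # documented failure signal: every value is NA
  }
  if (any(is.na(imputed))) {
    return("incomplete")   # some gaps remain
  }
  "ok"
}

# Stand-in objects in place of real impute_missings() output
x <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))
imputed_ok <- data.frame(a = c(1, 2, 3), b = c(5.5, 5, 6))
imputed_failed <- data.frame(a = rep(NA_real_, 3), b = rep(NA_real_, 3))

check_imputation_result(x, imputed_ok)
check_imputation_result(x, imputed_failed)
```

Testing for an all-NA result before using the output avoids silently passing a failed imputation into later analysis steps.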

Arguments

x: Data frame or matrix containing numeric data with missing values (NA). All columns must be numeric.
method: Character string specifying which imputation method to use. Default is "rf_missForest". See Details for all available methods.
ImputationRepetitions: Integer. Number of repetitions for methods ending with "_repeated". These methods perform multiple imputations and return the median across repetitions for increased stability. Default is 10. Ignored for non-repeated methods.
seed: Integer. Random seed for reproducibility. If NULL (the default), the current system seed is used. Setting this parameter is recommended. For reproducible results, it must be the same seed as the one set in compare_imputation_methods.
x_orig: Data frame or matrix. Original complete data, required only for the poisoned and calibrating methods (used for validation and benchmarking). Must have the same dimensions as x. Default is NULL.

Author

Jorn Lotsch, Alfred Ultsch

Details

This function provides access to multiple imputation algorithms through a single interface. Simply specify the desired method name via the method parameter.

Available Methods:

Univariate methods (replace each missing value independently):

"median" - Column median
"mean" - Column mean
"mode" - Column mode (most frequent value)
"rSample" - Random sample from observed values

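To illustrate what "replace each missing value independently" means, here is a minimal base R sketch of the four univariate strategies. It is not the package's implementation: impute_univariate is a hypothetical helper, and both the tie-breaking rule for the mode and sampling with replacement for the random variant are assumptions.

```r
# Hypothetical helper (not the package's code): fills the gaps of one numeric
# vector using only that vector's own observed values.
impute_univariate <- function(v, how = c("median", "mean", "mode", "rSample")) {
  how <- match.arg(how)
  miss <- is.na(v)
  observed <- v[!miss]
  # Nothing to do, or nothing to learn from
  if (!any(miss) || length(observed) == 0) return(v)
  v[miss] <- switch(how,
    median = median(observed),
    mean = mean(observed),
    mode = {
      # Most frequent observed value; ties go to the value seen first (assumption)
      ux <- unique(observed)
      ux[which.max(tabulate(match(observed, ux)))]
    },
    # Draw from observed values with replacement (assumption)
    rSample = observed[sample.int(length(observed), sum(miss), replace = TRUE)]
  )
  v
}

v <- c(2, 4, 4, NA, 10)
impute_univariate(v, "median")  # gap filled with 4
impute_univariate(v, "mean")    # gap filled with 5
impute_univariate(v, "mode")    # gap filled with 4

# Column-wise application to a data frame
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 7))
as.data.frame(lapply(df, impute_univariate, how = "median"))
```

Because each column is treated on its own, these strategies ignore relationships between variables, which is the main difference from the multivariate methods listed below.
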
Bagging methods (bootstrap aggregating with decision trees):

"bag" - Single bagged tree imputation
"bag_repeated" - Repeated bagging with median aggregation

Random forest methods (ensemble of decision trees):

"rf_mice" - Random forest via mice package
"rf_mice_repeated" - Repeated RF via mice
"rf_missForest" - Random forest via missForest package (recommended)
"rf_missForest_repeated" - Repeated RF via missForest
"miceRanger" - Random forest via miceRanger package
"miceRanger_repeated" - Repeated RF via miceRanger

Tree-based methods:

"cart" - Classification and regression trees
"cart_repeated" - Repeated CART with median aggregation

Regression methods:

"linear" - Lasso regression (L1-regularized linear model)
"pmm" - Predictive mean matching
"pmm_repeated" - Repeated PMM with median aggregation

k-Nearest neighbors methods:

"knn3", "knn5", "knn7", "knn9", "knn10" - k-NN with specified number of neighbors

Multiple imputation methods:

"ameliaImp" - Single imputation via Amelia II
"ameliaImp_repeated" - Multiple imputations via Amelia II
"miImp" - Multiple imputation via mi package

Poisoned methods (require x_orig, for validation only):

"plus" - Add systematic positive offset
"plusminus" - Add alternating positive/negative offset
"factor" - Multiply by constant factor

Calibrating methods (require x_orig, for benchmarking):

"tinyNoise_0.000001" through "tinyNoise_1" - Add small random noise with specified magnitude (available magnitudes: 0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.5, 1)

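The idea behind the poisoned and calibrating methods can be sketched as follows: missing cells are filled from the known true values in x_orig and then deliberately perturbed, so that the resulting error is known in advance. This is only an illustration under stated assumptions. fill_from_truth is a hypothetical helper, and the offset of 1 and the Gaussian noise with standard deviation 0.001 are invented for the example; the actual offsets and noise distribution used by the package are not specified here.

```r
# Hypothetical helper (not the package's code): fills missing cells of x with
# a perturbed copy of the corresponding true values from x_orig.
fill_from_truth <- function(x, x_orig, perturb) {
  stopifnot(identical(dim(x), dim(x_orig)))  # x_orig must match x in shape
  x <- as.matrix(x)
  x_orig <- as.matrix(x_orig)
  miss <- is.na(x)
  x[miss] <- perturb(x_orig[miss])
  as.data.frame(x)
}

x_orig <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
x <- x_orig
x$a[2] <- NA
x$b[3] <- NA

# "plus"-like behaviour: systematic positive offset (size 1 is an assumption)
plus_like <- fill_from_truth(x, x_orig, function(v) v + 1)

# "tinyNoise"-like behaviour: small random noise (Gaussian, sd 0.001 assumed)
set.seed(1)
noise_like <- fill_from_truth(x, x_orig, function(v) v + rnorm(length(v), sd = 0.001))
```

Methods of this kind are useful only as reference points when benchmarking, since they depend on the complete data that real imputation does not have.
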
Repeated methods: Methods ending with "_repeated" perform multiple independent imputations and return the median value across all repetitions. This typically provides more stable and robust results but requires more computation time. The number of repetitions is controlled by the ImputationRepetitions parameter.
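
The repeat-and-aggregate idea can be sketched in base R as follows. This is a sketch, not the package's code: impute_once is a hypothetical stochastic imputer standing in for any single-run method, impute_repeated is a hypothetical wrapper, and the sketch assumes every column has at least one observed value.

```r
# Hypothetical single-run imputer: fills each gap with a random draw from the
# observed values of the same column.
impute_once <- function(df) {
  as.data.frame(lapply(df, function(v) {
    miss <- is.na(v)
    observed <- v[!miss]
    v[miss] <- observed[sample.int(length(observed), sum(miss), replace = TRUE)]
    v
  }))
}

# Hypothetical wrapper: runs the imputer several times and returns the
# cell-wise median across runs.
impute_repeated <- function(df, repetitions = 10, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  runs <- lapply(seq_len(repetitions), function(i) as.matrix(impute_once(df)))
  # Stack the runs into a rows x columns x repetitions array
  stacked <- array(unlist(runs), dim = c(nrow(df), ncol(df), repetitions))
  result <- as.data.frame(apply(stacked, c(1, 2), median))
  names(result) <- names(df)
  result
}

df <- data.frame(a = c(1, NA, 3, 5), b = c(10, 20, NA, 40))
impute_repeated(df, repetitions = 10, seed = 42)
```

Observed cells are identical in every run, so their median is unchanged; only the imputed cells are smoothed by the aggregation.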

Method selection guidance:

For quick results: Use "median" or "mean"
For moderate amounts of missing data: Use "rf_missForest" or "knn5"
For high-quality results: Use "rf_missForest_repeated" or "pmm_repeated"
For systematic comparison: Use compare_imputation_methods

References

Lotsch J, Ultsch A. (2025). A model-agnostic framework for dataset-specific selection of missing value imputation methods in pain-related numerical data. Can J Pain (in minor revision)

Examples


# Load example data
data_iris <- iris[,1:4]

# Add some missing values (about 5% of each column)
set.seed(42)
for(i in 1:4) data_iris[sample(1:nrow(data_iris), 0.05*nrow(data_iris)), i] <- NA

# Simple univariate imputation with median
data_iris_imputed_median <- impute_missings(
  data_iris,
  method = "median"
)

# Show data
head(data_iris_imputed_median)
