compare_imputation_methods: Compare Imputation Methods for Missing Value Analysis

Description

Compares imputation methods on a dataset by artificially inserting missing values, applying each imputation technique, and evaluating performance with multiple metrics and visualizations. Optionally produces a final imputed dataset using the best-performing method.

Usage

    compare_imputation_methods(
      data,
      imputation_methods = all_imputation_methods,
      imputation_repetitions = 20,
      perfect_methods_in_ABC = FALSE,
      n_iterations = 20,
      n_proc = getOption("mc.cores", 2L),
      percent_missing = 0.1,
      seed,
      mnar_shape = 1,
      mnar_ity = 0,
      low_only = FALSE,
      fixed_seed_for_inserted_missings = FALSE,
      max_attempts = 1000,
      overall_best_z_delta = FALSE,
      produce_final_imputations = TRUE,
      plot_results = TRUE,
      verbose = TRUE
    )

Value

Returns a list containing:

all_imputation_runs: List containing all imputation results generated across repeated simulation runs and missing-data patterns.
zdelta_metrics: Standardized z-delta error metrics, including raw values, medians, and variable-wise summaries quantifying deviations between original and imputed data.
method_performance_summary: Comprehensive performance summary of all imputation methods, including ranking metrics and Activity-Based Classification (ABC) results.
best_overall_method: Character. Name of the best-performing imputation method for the analyzed dataset.
best_univariate_method: Character. Name of the top-performing univariate (single-variable) imputation method.
best_multivariate_method: Character. Name of the top-performing multivariate (multi-variable) imputation method.
best_uni_or_multivariate_method: Character. Name of the leading combined uni/multivariate imputation method.
best_poisoned_method: Character. Name of the top-performing stress-test (formerly "poisoned") method.
abc_results_table: Data frame containing the ABC (Activity-Based Classification) analysis results, including method categories and performance scores.
fig_zdelta_distributions: ggplot object displaying the distribution of standardized z-delta values for the best-performing methods.
fig_summary_comparison: ggplot object providing a combined summary figure integrating ABC classification and z-delta plots for comparative visualization.
final_imputed_data: Data frame containing the final dataset with all missing values filled in using the best-performing method (only if produce_final_imputations = TRUE). Returns NULL if no complete dataset could be produced or if imputation was disabled.
final_imputation_method: Character. Name of the imputation algorithm automatically selected and applied to create the final complete dataset. Returns NULL if imputation was disabled or failed.

Arguments

data: Data frame or matrix containing numeric data. May contain existing missing values (NA).
imputation_methods: Character vector of imputation method names to compare. Default is all_imputation_methods; using all methods is recommended for a complete comparison. Must include at least two non-calibrating methods. Available options:
  Univariate methods: "median", "mean", "mode", "rSample"
  Multivariate methods: "bag", "bag_repeated", "rf_mice", "rf_mice_repeated", "rf_missForest", "rf_missForest_repeated", "miceRanger", "miceRanger_repeated", "cart", "cart_repeated", "linear", "pmm", "pmm_repeated", "knn3", "knn5", "knn7", "knn9", "knn10", "ameliaImp", "ameliaImp_repeated", "miImp"
  Diagnostic methods: "plus", "plusminus", "factor"
  Calibrating methods: "tinyNoise_0.000001", "tinyNoise_0.00001", "tinyNoise_0.0001", "tinyNoise_0.001", "tinyNoise_0.01", "tinyNoise_0.05", "tinyNoise_0.1", "tinyNoise_0.2", "tinyNoise_0.5", "tinyNoise_1"
imputation_repetitions: Integer. Number of times each imputation method is repeated for each iteration. Default is 20.
perfect_methods_in_ABC: Logical. If TRUE, include perfect imputation methods in comparative selections. Default is FALSE.
n_iterations: Integer. Number of different missing data patterns to test. Default is 20.
n_proc: Integer. Number of processor cores to use for parallel processing. Default is getOption("mc.cores", 2L).
percent_missing: Numeric. Proportion of values to randomly set as missing in each iteration (0 to 1). Default is 0.1 (10%).
seed: Integer. Random seed for reproducibility. If missing, the current system seed is used. Setting this parameter is recommended for reproducibility.
mnar_shape: Numeric. Shape parameter for MNAR (Missing Not At Random) mechanism. Default is 1 (MCAR - Missing Completely At Random).
mnar_ity: Numeric. Degree to which missingness is not at random (0 to 1). Default is 0 (completely random).
low_only: Logical. If TRUE, insert missing values only among lower values. Default is FALSE.
fixed_seed_for_inserted_missings: Logical. If TRUE, use the same seed for inserting missing values across all iterations. Default is FALSE.
max_attempts: Integer. Maximum number of attempts to create a valid missing-value pattern without completely empty cases. Default is 1000.
overall_best_z_delta: Logical. If TRUE, compare all methods against the overall best; if FALSE, compare against best within category. Default is FALSE.
produce_final_imputations: Logical. If TRUE, produce final imputed dataset using the best-performing univariate or multivariate method from the ABC analysis. The function will try methods in order of their ranking until one succeeds in producing a complete dataset with no missing values. Default is TRUE.
plot_results: Logical. If TRUE, show summary plots. Default is TRUE.
verbose: Logical. If TRUE, print best method information and turn on messaging. Default is TRUE.

Author

Jorn Lotsch, Alfred Ultsch

Details

This function implements a model-agnostic framework for dataset-specific selection of missing value imputation methods. The analysis workflow:

1. Artificially inserts missing values into complete data
2. Applies multiple imputation methods
3. Calculates performance metrics (zDelta values)
4. Ranks methods using ABC analysis
5. Generates comprehensive visualizations
6. Optionally produces final imputed dataset using the best method
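
The first three steps can be pictured with a minimal base-R sketch. It is not the package's implementation: the helpers `insert_missings()` and `impute_median()` are hypothetical names introduced here, only median imputation is shown, and the retry loop merely mirrors the idea behind `max_attempts` (reject patterns that leave a case completely empty).

```r
# Hypothetical helper (not part of the package): set a proportion of cells
# to NA, retrying until no row is completely empty.
insert_missings <- function(x, percent_missing = 0.1, max_attempts = 1000) {
  x <- as.matrix(x)
  n_missing <- round(percent_missing * length(x))
  for (attempt in seq_len(max_attempts)) {
    x_na <- x
    x_na[sample(length(x), n_missing)] <- NA
    if (!any(rowSums(!is.na(x_na)) == 0)) return(x_na)
  }
  stop("No valid missing-value pattern found within max_attempts")
}

# Hypothetical helper: replace each NA by the median of its column.
impute_median <- function(x) {
  for (j in seq_len(ncol(x))) {
    x[is.na(x[, j]), j] <- median(x[, j], na.rm = TRUE)
  }
  x
}

set.seed(42)
original <- as.matrix(iris[, 1:4])
with_na  <- insert_missings(original, percent_missing = 0.1)
imputed  <- impute_median(with_na)

# The error can only be evaluated where values were removed artificially,
# because only there is the true value known.
removed   <- is.na(with_na)
abs_error <- abs(original[removed] - imputed[removed])
summary(abs_error)
```

In the actual function this cycle would be repeated over `n_iterations` missing-value patterns and for every method in `imputation_methods`.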

The zDelta metric represents standardized absolute differences between original and imputed values, providing a robust measure of imputation quality.
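
As a rough illustration of such a metric, the sketch below standardizes original and imputed values with the mean and standard deviation of the original variable and takes the absolute difference. The exact standardization used by the package is not stated on this page, so this scaling and the function name `z_delta_sketch()` are assumptions.

```r
# Hypothetical sketch of a zDelta-like error. Scaling by the mean and SD of
# the original variable is an assumption, not the package's definition.
z_delta_sketch <- function(original, imputed) {
  mu    <- mean(original, na.rm = TRUE)
  sigma <- sd(original, na.rm = TRUE)
  z_original <- (original - mu) / sigma
  z_imputed  <- (imputed - mu) / sigma
  abs(z_imputed - z_original)
}

original <- c(4.9, 5.1, 5.8, 6.3, 7.0)
imputed  <- c(4.9, 5.8, 5.8, 6.3, 5.8)   # positions 2 and 5 were imputed

z <- z_delta_sketch(original, imputed)
median(z[c(2, 5)])   # summarize over the imputed positions only
```

Because the difference is expressed in standard-deviation units, values from variables on different scales can be summarized together, for example by a median per method.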

The MNAR mechanism allows testing methods under realistic scenarios:

mnar_ity = 0: Missing Completely At Random (MCAR)
mnar_ity > 0: Missing Not At Random with specified degree
low_only = TRUE: Missings preferentially in lower values
mnar_shape: Controls shape of missingness probability distribution
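
One way to picture how these parameters could interact is the sketch below, which mixes uniform (MCAR) sampling weights with value-dependent weights. The weighting formula, the choice to target both tails when `low_only = FALSE`, and the function name `missing_weights_sketch()` are purely illustrative assumptions; the package may compute missingness probabilities differently.

```r
# Illustrative assumption only: sampling weights that decide which values
# are set to missing. mnar_ity = 0 yields uniform weights (MCAR); larger
# values shift weight toward extreme (or, with low_only = TRUE, low) values.
missing_weights_sketch <- function(x, mnar_ity = 0, mnar_shape = 1,
                                   low_only = FALSE) {
  r <- (rank(x) - 1) / (length(x) - 1)              # ranks scaled to [0, 1]
  extremeness <- if (low_only) 1 - r else abs(2 * r - 1)
  w <- (1 - mnar_ity) + mnar_ity * extremeness^mnar_shape
  w / sum(w)
}

set.seed(1)
x <- rnorm(200)

w_mcar <- missing_weights_sketch(x, mnar_ity = 0)
w_mnar <- missing_weights_sketch(x, mnar_ity = 1, mnar_shape = 2,
                                 low_only = TRUE)

# Draw 10% of the positions to be set missing under each mechanism
idx_mcar <- sample(length(x), 20, prob = w_mcar)
idx_mnar <- sample(length(x), 20, prob = w_mnar)
c(mean_removed_mcar = mean(x[idx_mcar]),
  mean_removed_mnar = mean(x[idx_mnar]))
```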

Final Imputation Process: When produce_final_imputations = TRUE, the function automatically:

1. Extracts the ranked list of methods from the ABC analysis results
2. Filters to only univariate and multivariate methods (excludes stress-test ("poisoned") and calibrating methods)
3. Tries each method in order of performance ranking
4. Stops at the first method that successfully produces a complete dataset with no missing values
5. Prints informative console output showing which method was used, its ABC category, score, and ranking

If all methods fail to produce a complete dataset, the function returns NULL for both final_imputed_data and final_imputation_method and prints a warning message.
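
The fallback logic of the final imputation process might look like the following sketch. The `imputers` list, its method names, and `first_complete_imputation()` are hypothetical stand-ins introduced for illustration; the deliberately failing first method only demonstrates the fall-through to the next-ranked method.

```r
# Hypothetical stand-ins for imputation methods, ordered best first.
imputers <- list(
  failing_method = function(x) x,   # leaves NAs in place, so it "fails"
  median = function(x) {
    x[] <- lapply(x, function(v) { v[is.na(v)] <- median(v, na.rm = TRUE); v })
    x
  },
  mean = function(x) {
    x[] <- lapply(x, function(v) { v[is.na(v)] <- mean(v, na.rm = TRUE); v })
    x
  }
)

# Try the methods in ranked order and stop at the first complete result.
first_complete_imputation <- function(data, imputers) {
  for (method in names(imputers)) {
    result <- tryCatch(imputers[[method]](data), error = function(e) NULL)
    if (!is.null(result) && !any(is.na(result))) {
      return(list(final_imputed_data = result,
                  final_imputation_method = method))
    }
  }
  warning("No method produced a complete dataset")
  list(final_imputed_data = NULL, final_imputation_method = NULL)
}

set.seed(42)
data_iris <- iris[, 1:4]
data_iris[sample(nrow(data_iris), 8), 1] <- NA

final <- first_complete_imputation(data_iris, imputers)
final$final_imputation_method
```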

References

Lotsch J, Ultsch A. (2025). A model-agnostic framework for dataset-specific selection of missing value imputation methods in pain-related numerical data. Can J Pain (in minor revision)

Examples


    # Load example data
    data_iris <- iris[,1:4]

    # Add some missings
    set.seed(42)
    for (i in 1:4) {
      # Remove 5% of the values per column (rounded down to whole rows)
      data_iris[sample(nrow(data_iris), floor(0.05 * nrow(data_iris))), i] <- NA
    }

    # Basic comparison with a subset of methods
    results <- compare_imputation_methods(
      data = data_iris,
      imputation_methods = c("mean", "median", "rSample"),
      n_iterations = 2,
      imputation_repetitions = 2,
      produce_final_imputations = FALSE,
      plot_results = FALSE,
      verbose = FALSE
    )

    # Print results
    # print(results)

    # Cleanup to avoid open sockets during R CMD check
    future::plan(future::sequential)
