opImputation (version 0.6)

compare_imputation_methods: Compare Imputation Methods for Missing Value Analysis

Description

Performs a comprehensive comparative analysis of different imputation methods on a dataset by artificially inserting missings, applying various imputation techniques, and evaluating their performance through multiple metrics and visualizations. Optionally produces a final imputed dataset using the best-performing method.

Usage

compare_imputation_methods(
      data,
      imputation_methods = all_imputation_methods,
      imputation_repetitions = 20,
      perfect_methods_in_ABC = FALSE,
      n_iterations = 20,
      n_proc = getOption("mc.cores", 2L),
      percent_missing = 0.1,
      seed,
      mnar_shape = 1,
      mnar_ity = 0,
      low_only = FALSE,
      fixed_seed_for_inserted_missings = FALSE,
      max_attempts = 1000,
      overall_best_z_delta = FALSE,
      produce_final_imputations = TRUE,
      plot_results = TRUE,
      verbose = TRUE
    )

Value

Returns a list containing:

all_imputation_runs

List containing all imputation results generated across repeated simulation runs and missing-data patterns.

zdelta_metrics

Standardized z-delta error metrics, including raw values, medians, and variable-wise summaries quantifying deviations between original and imputed data.

method_performance_summary

Comprehensive performance summary of all imputation methods, including ranking metrics and Activity-Based Classification (ABC) results.

best_overall_method

Character. Name of the best-performing imputation method for the analyzed dataset.

best_univariate_method

Character. Name of the top-performing univariate (single-variable) imputation method.

best_multivariate_method

Character. Name of the top-performing multivariate (multi-variable) imputation method.

best_uni_or_multivariate_method

Character. Name of the leading combined uni/multivariate imputation method.

best_poisoned_method

Character. Name of the top-performing stress-test (formerly "poisoned") method.

abc_results_table

Data frame containing the ABC (Activity-Based Classification) analysis results, including method categories and performance scores.

fig_zdelta_distributions

ggplot object displaying the distribution of standardized z-delta values for the best-performing methods.

fig_summary_comparison

ggplot object providing a combined summary figure integrating ABC classification and z-delta plots for comparative visualization.

final_imputed_data

Data frame containing the final dataset with all missing values filled in using the best-performing method (only if produce_final_imputations = TRUE). Returns NULL if no complete dataset could be produced or if imputation was disabled.

final_imputation_method

Character. Name of the imputation algorithm automatically selected and applied to create the final complete dataset. Returns NULL if imputation was disabled or failed.

Arguments

data

Data frame or matrix containing numeric data. May contain existing missing values (NA).

imputation_methods

Character vector of imputation method names to compare. Default is all_imputation_methods. Must include at least two non-calibrating methods. It is recommended that all imputation methods be used for a complete comparison (the default). Available options include:

  • Univariate methods: "median", "mean", "mode", "rSample"

  • Multivariate methods: "bag", "bag_repeated", "rf_mice", "rf_mice_repeated", "rf_missForest", "rf_missForest_repeated", "miceRanger", "miceRanger_repeated", "cart", "cart_repeated", "linear", "pmm", "pmm_repeated", "knn3", "knn5", "knn7", "knn9", "knn10", "ameliaImp", "ameliaImp_repeated", "miImp"

  • Diagnostic methods: "plus", "plusminus", "factor"

  • Calibrating methods: "tinyNoise_0.000001", "tinyNoise_0.00001", "tinyNoise_0.0001", "tinyNoise_0.001", "tinyNoise_0.01", "tinyNoise_0.05", "tinyNoise_0.1", "tinyNoise_0.2", "tinyNoise_0.5", "tinyNoise_1"

imputation_repetitions

Integer. Number of times each imputation method is repeated for each iteration. Default is 20.

perfect_methods_in_ABC

Logical. If TRUE, include perfect imputation methods in comparative selections. Default is FALSE.

n_iterations

Integer. Number of different missing data patterns to test. Default is 20.

n_proc

Integer. Number of processor cores to use for parallel processing. Default is getOption("mc.cores", 2L).
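
Because the default is read from the mc.cores option, it can be changed for the whole session before calling the function. The following is a minimal sketch in base R, not taken from the package; leaving one core free is only a common habit assumed here for illustration.

```r
# Use all but one of the detected cores, never fewer than 1.
# detectCores() can return NA on some platforms, hence na.rm = TRUE.
n_cores <- max(1L, parallel::detectCores() - 1L, na.rm = TRUE)
options(mc.cores = n_cores)

# This is the value the n_proc default would now pick up
getOption("mc.cores", 2L)
```

Passing n_proc explicitly in the call overrides the option for that call only.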

percent_missing

Numeric. Proportion of values to randomly set as missing in each iteration (0 to 1). Default is 0.1 (10%).

seed

Integer. Random seed for reproducibility. If not supplied, the current system seed is used. Setting this parameter explicitly is recommended for reproducible results.

mnar_shape

Numeric. Shape parameter for MNAR (Missing Not At Random) mechanism. Default is 1 (MCAR - Missing Completely At Random).

mnar_ity

Numeric between 0 and 1. Degree to which the inserted missings are not at random. Default is 0 (completely random, i.e. MCAR).

low_only

Logical. If TRUE, only insert missings in lower values. Default is FALSE.

fixed_seed_for_inserted_missings

Logical. If TRUE, use same seed for inserting missings across all iterations. Default is FALSE.

max_attempts

Integer. Maximum number of attempts to create a valid missing pattern, i.e. one without completely empty cases. Default is 1000.
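
The retry logic that this argument bounds can be pictured roughly as follows. This is a minimal base R sketch, not the package's implementation: insert_missings_with_retry is a hypothetical helper, and a "valid" pattern is assumed to mean that no row ends up entirely NA.

```r
# Hypothetical helper: randomly blank out a proportion of cells, retrying
# until no row is entirely NA or max_attempts is exhausted.
insert_missings_with_retry <- function(data, percent_missing = 0.1,
                                       max_attempts = 1000) {
  x <- as.matrix(data)
  n_missing <- floor(percent_missing * length(x))
  for (attempt in seq_len(max_attempts)) {
    candidate <- x
    candidate[sample(length(x), n_missing)] <- NA
    # Accept the pattern only if every row keeps at least one observed value
    if (all(rowSums(!is.na(candidate)) > 0)) {
      return(list(data = as.data.frame(candidate), attempts = attempt))
    }
  }
  stop("No valid missing pattern found within max_attempts")
}

set.seed(1)
res <- insert_missings_with_retry(iris[, 1:4], percent_missing = 0.1)
sum(is.na(res$data))  # 60 cells set to NA (10% of 600)
```

With few columns or a high percent_missing, empty rows become more likely, which is when a generous max_attempts matters.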

overall_best_z_delta

Logical. If TRUE, compare all methods against the overall best; if FALSE, compare against best within category. Default is FALSE.

produce_final_imputations

Logical. If TRUE, produce final imputed dataset using the best-performing univariate or multivariate method from the ABC analysis. The function will try methods in order of their ranking until one succeeds in producing a complete dataset with no missing values. Default is TRUE.

plot_results

Logical. If TRUE, show summary plots. Default is TRUE.

verbose

Logical. If TRUE, print best method information and turn on messaging. Default is TRUE.

Author

Jorn Lotsch, Alfred Ultsch

Details

This function implements a model-agnostic framework for dataset-specific selection of missing value imputation methods. The analysis workflow:

  1. Artificially inserts missing values into complete data

  2. Applies multiple imputation methods

  3. Calculates performance metrics (zDelta values)

  4. Ranks methods using ABC analysis

  5. Generates comprehensive visualizations

  6. Optionally produces final imputed dataset using the best method

The zDelta metric represents standardized absolute differences between original and imputed values, providing a robust measure of imputation quality.
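
Steps 1 to 3 of the workflow can be sketched in base R as shown here. The exact standardization used by the package is not given on this page, so the sketch assumes zDelta is the absolute difference divided by the standard deviation of the complete variable; impute_with and z_delta are hypothetical helpers, and only two simple univariate methods are compared.

```r
# Sketch only: assumes zDelta = |original - imputed| / sd(variable).
set.seed(42)
x <- as.matrix(iris[, 1:4])

# Step 1: hide 10% of the cells of a complete data set
holes <- sample(length(x), floor(0.1 * length(x)))
x_missing <- x
x_missing[holes] <- NA

# Step 2: impute each column with a summary statistic of its observed values
impute_with <- function(m, fun) {
  for (j in seq_len(ncol(m))) {
    m[is.na(m[, j]), j] <- fun(m[, j], na.rm = TRUE)
  }
  m
}
imputed <- list(mean = impute_with(x_missing, mean),
                median = impute_with(x_missing, median))

# Step 3: standardized absolute error at the artificially removed cells
col_sd <- apply(x, 2, sd)
z_delta <- function(imp) sweep(abs(x - imp), 2, col_sd, "/")[holes]

# Median zDelta per method; smaller means closer to the original values
med_z <- sapply(imputed, function(imp) median(z_delta(imp)))
med_z
```

Dividing by the per-variable standard deviation puts errors from variables on different scales on a comparable footing, which is what allows a single ranking across all columns.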

The MNAR mechanism allows testing methods under realistic scenarios:

  • mnar_ity = 0: Missing Completely At Random (MCAR)

  • mnar_ity > 0: Missing Not At Random with specified degree

  • low_only = TRUE: Missings preferentially in lower values

  • mnar_shape: Controls shape of missingness probability distribution
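
To make the interplay of these settings concrete, here is one plausible way to turn them into sampling weights for a single variable. It is purely illustrative: missing_weights is a hypothetical function and the blending formula is an assumption, so the package's actual mechanism may differ.

```r
# Illustrative weighting only; not the package's formula.
missing_weights <- function(v, mnar_ity = 0, mnar_shape = 1, low_only = FALSE) {
  z <- as.numeric(scale(v))
  # Value-dependent part: low values only, or both tails
  extremeness <- if (low_only) pmax(-z, 0) else abs(z)
  value_part <- extremeness^mnar_shape
  value_part <- value_part / sum(value_part)
  uniform_part <- rep(1 / length(v), length(v))
  # mnar_ity = 0 gives MCAR; mnar_ity = 1 gives fully value-driven missingness
  (1 - mnar_ity) * uniform_part + mnar_ity * value_part
}

set.seed(7)
v <- iris$Sepal.Length
w_mcar <- missing_weights(v, mnar_ity = 0)                   # all weights equal
w_mnar <- missing_weights(v, mnar_ity = 1, low_only = TRUE)  # low values only

# Draw 15 positions to set to NA under the value-driven weights
idx <- sample(seq_along(v), size = 15, prob = w_mnar)
all(v[idx] < mean(v))  # TRUE: only below-mean values carry weight here
```

In this sketch a larger mnar_shape concentrates the weight on the most extreme values, while intermediate mnar_ity values mix random and value-driven missingness.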

Final Imputation Process: When produce_final_imputations = TRUE, the function automatically:

  1. Extracts the ranked list of methods from ABC analysis results

  2. Filters to only univariate and multivariate methods (excludes poisoned/calibrating methods)

  3. Tries each method in order of performance ranking

  4. Stops at the first method that successfully produces a complete dataset with no missing values

  5. Prints informative console output showing which method was used, its ABC category, score, and ranking

If all methods fail to produce a complete dataset, the function returns NULL for both final_imputed_data and final_imputation_method and prints a warning message.
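
The fallback behaviour described above can be sketched as a simple loop over ranked methods. This is an assumption-laden illustration rather than the package's code: ranked_methods holds hypothetical stand-in functions (best first), and first_complete_imputation is a hypothetical helper.

```r
# Hypothetical stand-ins for ranked imputation methods, best first.
ranked_methods <- list(
  failing_method = function(d) d,  # leaves the NAs in place, so it "fails"
  median = function(d) {
    d[] <- lapply(d, function(col) {
      col[is.na(col)] <- median(col, na.rm = TRUE)
      col
    })
    d
  }
)

# Try each method in rank order; stop at the first complete result.
first_complete_imputation <- function(data, methods) {
  for (name in names(methods)) {
    result <- tryCatch(methods[[name]](data), error = function(e) NULL)
    if (!is.null(result) && !any(is.na(result))) {
      return(list(final_imputed_data = result, final_imputation_method = name))
    }
  }
  warning("No method produced a complete dataset")
  list(final_imputed_data = NULL, final_imputation_method = NULL)
}

d <- iris[, 1:4]
d[c(3, 10), 2] <- NA
out <- first_complete_imputation(d, ranked_methods)
out$final_imputation_method  # "median", because the first method left NAs
```

Wrapping each attempt in tryCatch means a method that errors is treated the same way as one that leaves missing values behind.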

References

Lotsch J, Ultsch A. (2025). A model-agnostic framework for dataset-specific selection of missing value imputation methods in pain-related numerical data. Can J Pain (in minor revision)

See Also

impute_missings for single imputation operations

create_diagnostic_missings for creating diagnostic missing values

Examples

    # Load example data
    data_iris <- iris[,1:4]

    # Add some missings: 5% of the rows in each column (7 of 150)
    set.seed(42)
    n_na <- floor(0.05 * nrow(data_iris))
    for (i in 1:4) data_iris[sample(nrow(data_iris), n_na), i] <- NA

    # Basic comparison with a subset of methods
    results <- compare_imputation_methods(
      data = data_iris,
      imputation_methods = c("mean", "median", "rSample"),
      n_iterations = 2,
      imputation_repetitions = 2,
      produce_final_imputations = FALSE,
      plot_results = FALSE,
      verbose = FALSE
    )

    # Print results
    # print(results)

    # Cleanup to avoid open sockets during R CMD check
    future::plan(future::sequential)
