opImputation: A model-agnostic framework for dataset-specific selection of missing value imputation methods in pain-related numerical data

opImputation is an R package for comparing, benchmarking, and applying missing‑value imputation strategies to numerical tabular data. It is developed for biomedical and clinical research but is broadly applicable to any numerical dataset containing missing values.


Features

  • Model‑agnostic benchmarking: Compare imputation methods from diverse algorithmic families
  • Dataset‑specific selection: Automatically identify the best‑performing method for your data
  • Automated imputation: Optionally produce a final imputed dataset using the top method
  • Parallel processing: Efficient computation using the future framework (future.apply, progressr)
  • Reproducible analysis: Seedable, standardized workflows
  • Extensible integration: Add new methods or external benchmarking data easily

Installation

# From GitHub
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("JornLotsch/opImputation")

# From CRAN (once available; submitted on October 24, 2025, release pending incoming checks)
install.packages("opImputation")

Package metadata:

  • Type: R Package
  • Title: Optimal Selection of Imputation Methods for Bio‑Medical Data
  • Version: 0.4
  • Depends: R (≥ 3.5.0)
  • Imports: parallel, Rfit, methods, stats, caret, ABCanalysis, ggplot2, future, future.apply, progressr, missForest, utils, mice, miceRanger, multiUS, Amelia, mi, reshape2, DataVisualizations, abind, cowplot, twosamples, ggh4x, ggrepel, tools
  • License: GPL‑3
  • Authors: Jörn Lötsch, Alfred Ultsch
  • Creator, Maintainer: Jörn Lötsch
  • Repository: https://github.com/JornLotsch/opImputation
  • Date: 2025‑05‑03

Usage

Basic example

library(opImputation)

# Load example data
data_iris <- iris[,1:4]

# Add some missing values (about 5% per column)
set.seed(42)
n_missing <- floor(0.05 * nrow(data_iris))
for (i in 1:4) data_iris[sample(nrow(data_iris), n_missing), i] <- NA

# Basic comparison with a subset of methods
results <- compare_imputation_methods(
  data = data_iris,
  imputation_methods = c("mean", "median", "knn5", "rf_missForest"),
  n_iterations = 10,
  imputation_repetitions = 10,
  seed = 42
)

# Retrieve the automatically generated final imputation
imputed_data <- results$final_imputed_data
print(results$final_imputation_method)

Main functions

compare_imputation_methods

Call

Arguments:

  • data: Numeric data frame or matrix. May contain existing missing values.
  • imputation_methods: Character vector of imputation method names to compare. Default: all_imputation_methods. Must include at least two non‑calibrating methods.
  • imputation_repetitions: Integer. Number of repeated imputations for each method and iteration (default = 20).
  • perfect_methods_in_ABC: Logical. If TRUE, calibration methods are included in the final categorization of methods (default = FALSE). For testing purposes only; do not set to TRUE in real analyses.
  • n_iterations: Integer. Number of missing data patterns to test (default = 20).
  • n_proc: Integer. Number of CPU cores for parallel processing (default: getOption("mc.cores", 2L)).
  • percent_missing: Numeric. Proportion of data to randomly set missing (0–1; default = 0.1).
  • seed: Integer. Random seed for reproducibility (recommended).
  • mnar_shape: Shape parameter for the Missing Not At Random (MNAR) mechanism (default = 1).
  • mnar_ity: Degree of MNAR dependency (0–1; default = 0, i.e., MCAR).
  • low_only: Logical. If TRUE, insert missings only in lower‑valued observations.
  • fixed_seed_for_inserted_missings: Logical. If TRUE, repeat the identical random pattern across iterations.
  • max_attempts: Integer. Maximum number of attempts to avoid creating empty rows (default = 1000).
  • overall_best_z_delta: Logical. Selects whether methods are compared to the global best or the category best method (default = FALSE).
  • produce_final_imputations: Logical. If TRUE, generates the final imputed dataset using the best‑ranked valid method (default = TRUE).
  • plot_results: Logical. If TRUE, create summary plots (default = TRUE).
  • verbose: Logical. If TRUE, print best method information and turn on messaging (default = TRUE).
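
The missingness-related arguments above (percent_missing, low_only, max_attempts) can be pictured with a small base-R sketch. The helper insert_missings_sketch() is hypothetical and is not the package's implementation; in particular, treating "lower-valued" as "at or below the column median" is an assumption made only for illustration.

```r
# Hypothetical helper, NOT the package's implementation: a base-R sketch of
# inserting a given proportion of missing values into each column.
insert_missings_sketch <- function(x, percent_missing = 0.1, low_only = FALSE,
                                   max_attempts = 1000, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  x <- as.data.frame(x)
  n_missing <- floor(percent_missing * nrow(x))
  for (attempt in seq_len(max_attempts)) {
    x_miss <- x
    for (j in seq_along(x_miss)) {
      # Assumption: "lower-valued" means at or below the column median.
      candidates <- if (low_only) {
        which(x[[j]] <= median(x[[j]], na.rm = TRUE))
      } else {
        which(!is.na(x[[j]]))
      }
      k <- min(n_missing, length(candidates))
      rows <- candidates[sample.int(length(candidates), k)]
      x_miss[rows, j] <- NA
    }
    # Accept the pattern only if no row has lost all of its values.
    if (!any(rowSums(is.na(x_miss)) == ncol(x_miss))) return(x_miss)
  }
  stop("No valid missingness pattern found within max_attempts.")
}

x_miss <- insert_missings_sketch(iris[, 1:4], percent_missing = 0.1, seed = 42)
colSums(is.na(x_miss))  # number of inserted missings per column
```

Retrying the whole pattern, rather than repairing single rows, keeps the inserted proportion identical across columns; this is a design choice of the sketch, not a statement about how the package proceeds.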

Returns

Return elements:

  • all_imputation_runs: List containing all imputation results generated across repeated simulation runs and missing‑data patterns.
  • zdelta_metrics: Standardized Δz (z‑delta) error metrics, including raw values, medians, and variable‑wise summaries quantifying deviations between original and imputed data.
  • method_performance_summary: Comprehensive performance summary of all imputation methods, including ranking metrics and Activity‑Based Classification (ABC) results.
  • best_overall_method: Name of the best‑performing imputation method for the analyzed dataset.
  • best_univariate_method: Name of the top‑performing univariate (single‑variable) imputation method.
  • best_multivariate_method: Name of the top‑performing multivariate (multi‑variable) imputation method.
  • best_uni_or_multivariate_method: Name of the leading combined uni/multivariate imputation method.
  • best_poisoned_method: Name of the top‑performing stress‑test method.
  • abc_results_table: Data frame containing the ABC (Activity‑Based Classification) analysis results, including method categories and performance scores.
  • fig_zdelta_distributions: Figure displaying the distribution of standardized Δz values for the best‑performing methods.
  • fig_summary_comparison: Combined figure integrating ABC classification and summary Δz plots for comparative visualization.
  • final_imputed_data: Final dataset with all missing values filled in using the best‑performing method (if produce_final_imputations = TRUE).
  • final_imputation_method: Name of the imputation algorithm automatically selected and applied to create the final complete dataset.

impute_missings

Call

Arguments:

  • x: Numeric data frame or matrix with missing values.
  • method: Imputation method name (default = "rf_missForest").
  • ImputationRepetitions: Number of repetitions for methods ending with _repeated (default = 10).
  • seed: Random seed for reproducibility (recommended).
  • x_orig: Original dataset required for “poisoned” or “calibrating” methods.

Returns

A numeric data frame of the same dimensions and column names, with all missing values imputed.
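
A minimal usage sketch for impute_missings, built only from the arguments documented above. It assumes the package is installed and is guarded so that the call is skipped otherwise; the object names x and x_imputed are illustrative.

```r
# Prepare a small numeric dataset with some missing values.
set.seed(42)
x <- iris[, 1:4]
x[sample(nrow(x), 10), "Sepal.Width"] <- NA

# Guarded call: runs only if opImputation is installed.
if (requireNamespace("opImputation", quietly = TRUE)) {
  x_imputed <- opImputation::impute_missings(
    x = x,
    method = "median",  # a method name that appears in the basic example
    seed = 42
  )
  # Per the description above, no missing values should remain.
  print(anyNA(x_imputed))
}
```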


Output and diagnostics

Performance evaluation is based on the standardized Δz (z‑delta) metric, a robust measure of the absolute deviation between true and imputed values.
ABC (Activity‑Based Classification) categorizes imputation methods by their relative performance, highlighting “A‑class” methods as top performers.
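
The idea of scoring an imputation by its deviation from known values can be sketched in base R. The scaling used here (mean absolute deviation at the masked positions divided by the standard deviation of the observed values) is an assumption chosen for illustration and is not necessarily the exact Δz definition implemented in the package.

```r
# Illustrative only: one plausible reading of a standardized absolute deviation,
# not necessarily the z-delta definition used by the package.
set.seed(42)
x_true <- iris$Sepal.Length
masked <- sample(length(x_true), 15)  # positions of the diagnostic missings
x_miss <- x_true
x_miss[masked] <- NA

# Two simple candidate imputations.
imputations <- list(
  mean   = replace(x_miss, masked, mean(x_miss, na.rm = TRUE)),
  median = replace(x_miss, masked, median(x_miss, na.rm = TRUE))
)

# Mean absolute deviation at the masked positions, scaled by the SD of the
# observed (non-missing) values.
z_delta_sketch <- sapply(imputations, function(x_imp) {
  mean(abs(x_imp[masked] - x_true[masked])) / sd(x_miss, na.rm = TRUE)
})
z_delta_sketch  # smaller values indicate imputations closer to the truth
```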

Example output table from res_abc$df_abc_results[,1:3] (generic dataset):

abc_score  abc_category  method
36.5755    A             plusminus
21.8880    A             cart_repeated
17.5513    A             pmm_repeated
16.8750    A             rf_mice_repeated
16.0810    A             miceRanger
15.3061    A             miceRanger_repeated
 9.7959    A             cart
 9.0947    B             pmm
 7.2345    B             rf_missForest
 7.0602    B             amelia_repeated
 4.8430    B             miImp
 4.8430    B             rf_mice
 3.2741    C             plus
 3.2153    C             rf_missForest_repeated
 1.9199    C             amelia
 1.4161    C             knn3
 1.3021    C             linear
 0.0000    C             bag
 0.0000    C             bag_repeated
 0.0000    C             factor
 0.0000    C             knn10
 0.0000    C             knn5
 0.0000    C             knn7
 0.0000    C             knn9
 0.0000    C             mean
 0.0000    C             median
 0.0000    C             mode
 0.0000    C             rSample

Legend:

  • abc_score: zDelta values. Quantitative measure of imputation performance (higher = better).
  • abc_category: ABC‑derived ranking class (“A” = top, “B” = medium, “C” = low).
  • method: Name of the evaluated imputation algorithm.

“A‑class” methods (top seven in this example) represent the highest‑performing algorithms for the tested dataset.
Lower tiers correspond to progressively weaker or calibration‑only approaches.
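
Selecting the top tier from such a table is a plain data-frame filter. The sketch below rebuilds a few rows of the example table by hand, using the column names shown above; it does not call the package, and the object names are illustrative.

```r
# A few rows copied from the example table above (not the full table).
abc_results <- data.frame(
  abc_score    = c(36.5755, 21.8880, 9.7959, 9.0947, 4.8430, 3.2741, 0.0000),
  abc_category = c("A", "A", "A", "B", "B", "C", "C"),
  method       = c("plusminus", "cart_repeated", "cart", "pmm",
                   "rf_mice", "plus", "mean"),
  stringsAsFactors = FALSE
)

# Keep only the top tier, ordered by score (highest first).
a_class <- abc_results[abc_results$abc_category == "A", ]
a_class <- a_class[order(a_class$abc_score, decreasing = TRUE), ]
a_class$method
# "plusminus" "cart_repeated" "cart"
```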


Example summary plot

Diagnostic summary from the Iris dataset: ABC curves and variable‑specific Δz distributions (a different dataset from the one used in the table above).

A: Standardized mean ranks for all imputation methods with ABC category coloring.
B: Mean standardized Δz deviations for diagnostic missings.
C: Variable‑level Δz distributions across methods.


When to use opImputation

  • Biomedical or clinical datasets with incomplete numerical data
  • Multivariate analysis or machine‑learning preprocessing
  • Benchmarking and transparent method selection
  • Fully automated, reproducible imputation pipelines

Citation

If you use opImputation, please cite:

Lötsch J, Ultsch A. (2025).
A model‑agnostic framework for dataset‑specific selection of missing value imputation methods in pain‑related numerical data.
Can J Pain (in minor revision)


Authors and license

  • Jörn Lötsch (author, creator, maintainer)
  • Alfred Ultsch (author)
  • License: GPL‑3

About this project

opImputation is an R-based framework for automated, transparent, and reproducible selection of suitable missing‑value imputation methods for a given dataset. It provides a model‑agnostic approach that evaluates multiple candidate algorithms directly on the dataset to identify the most appropriate technique for that specific data structure, missingness pattern, and variable type.

The framework includes commonly used imputation families such as statistical, regression‑based, ensemble, tree‑based, and multiple‑imputation approaches. Its modular structure enables easy extension with additional or emerging methods, including deep‑learning‑based algorithms, ensuring long‑term flexibility across research domains.

Quantitative evaluation of candidate methods is based on computed ABC (cABC) analysis, which identifies statistically top‑performing algorithms through standardized performance metrics. For details on the theoretical background of cABC analysis, see:

  • Ultsch A, Lötsch J. PLoS One. 2015; 10(6): e0129767.
  • Lötsch J, Ultsch A. Sci Rep. 2023; 13(1): 5470.

Version: 0.6
License: GPL-3
Maintainer: Jörn Lötsch
Last published: November 7th, 2025
Monthly downloads: 175

Functions in opImputation (0.6)

  • create_diagnostic_missings: Create Diagnostic Missing Values in Data
  • LipidsPsychiatricPat: Chromatography mass spectrometry of lipid mediators measured in blood samples.
  • CodeinLogMetabolitesUrine: Codeine metabolite concentrations in urine.
  • QSTpainEJPtransf: Psychophysical data from a clinical quantitative sensory testing study.
  • PainThresholds: Psychophysical data from an investigation of pain thresholds.
  • compare_imputation_methods: Compare Imputation Methods for Missing Value Analysis
  • impute_missings: Impute Missing Values Using Specified Method