run_missingness_benchmark: Run missingness benchmark

Description

Benchmarks model performance under feature missingness. The function:

Filters to complete cases for target_col and feature_cols (the complete baseline data),
Splits the data into training and validation sets,
Masks feature values at each rate in mask_rates using Bernoulli (cell-wise) missingness,
Imputes missing features with MICE fitted on the training data, then applies the fitted imputation model to the validation data via mice::mice.mids(newdata = ...), which reduces leakage from validation into training,
Trains a Random Forest (ranger) and a kNN regression (FNN::knn.reg),
Returns MAPE and R-squared for each model and mask rate.
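
The first three steps can be sketched in base R. This is only an illustration under stated assumptions, not the package's internal code: the toy data, the column names y, x1 and x2, and the helper mask_cells() are hypothetical, and the sketch assumes a simple random split and independent masking of each feature cell.

```r
set.seed(42)

# Toy data standing in for a real dataset (hypothetical column names).
toy <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = runif(100))
toy$x1[c(3, 17)] <- NA  # pre-existing gaps, dropped by the complete-case filter

target_col   <- "y"
feature_cols <- c("x1", "x2")

# 1. Keep only rows that are complete for the target and all features.
complete_rows <- stats::complete.cases(toy[, c(target_col, feature_cols)])
base <- toy[complete_rows, ]

# 2. Random train/validation split (test_size = 0.2).
n_valid   <- floor(0.2 * nrow(base))
valid_idx <- sample(nrow(base), n_valid)
train <- base[-valid_idx, ]
valid <- base[valid_idx, ]

# 3. Bernoulli (cell-wise) masking: each feature cell is set to NA
#    independently with probability `rate`. The target is left untouched.
mask_cells <- function(df, cols, rate) {
  x <- as.matrix(df[, cols, drop = FALSE])
  hit <- matrix(stats::runif(length(x)) < rate, nrow = nrow(x))
  x[hit] <- NA
  df[, cols] <- x
  df
}

train_masked  <- mask_cells(train, feature_cols, rate = 0.2)
observed_rate <- mean(is.na(as.matrix(train_masked[, feature_cols])))
```

Because each cell is masked independently, observed_rate only approximates the requested rate, and the gap shrinks as the number of feature cells grows.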

Feature columns must be numeric (or coercible to numeric without introducing new missing values). This mirrors workflows where features are treated as numeric arrays.
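
A check of this kind might look like the sketch below. The helper coerce_numeric_strict() is hypothetical and is not part of the package; it only illustrates what "coercible without introducing new missing values" could mean in practice.

```r
# Hypothetical helper: coerce a column to numeric and fail if coercion
# would create missing values that were not already present.
coerce_numeric_strict <- function(x, name = "column") {
  if (is.factor(x)) x <- as.character(x)  # avoid returning factor level codes
  out <- suppressWarnings(as.numeric(x))
  new_na <- is.na(out) & !is.na(x)
  if (any(new_na)) {
    stop(sprintf("'%s' has %d value(s) that are not numeric", name, sum(new_na)))
  }
  out
}

# Numeric strings and existing NAs pass through.
ok <- coerce_numeric_strict(c("1.5", "2", NA), "x1")

# A non-numeric string would introduce a new NA, so the helper stops.
bad <- tryCatch(
  coerce_numeric_strict(c("1", "abc"), "x2"),
  error = function(e) conditionMessage(e)
)
```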

Usage

run_missingness_benchmark(
  data,
  target_col,
  feature_cols = NULL,
  mask_rates = c(0.05, 0.1, 0.2, 0.3),
  rf_n_estimators = 200,
  knn_k = 5,
  test_size = 0.2,
  seed = 42
)

Value

A data.frame with columns MaskRate, Model, MAPE, and R2.

Arguments

data: A data.frame (or object coercible to data.frame) containing the dataset.
target_col: Single character string: name of the outcome column.
feature_cols: Character vector of feature column names. If NULL, uses all columns except target_col.
mask_rates: Numeric vector of values in (0, 1): the proportion of feature entries to mask at each rate.
rf_n_estimators: Integer: number of trees for the random forest.
knn_k: Integer: number of neighbors for kNN regression.
test_size: Numeric in (0, 1): fraction of rows assigned to the validation split.
seed: Integer: seed for data split and model reproducibility.

Author

Shubh Saraswat, Hasin Shahed Shad, and Xiaohua Douglas Zhang

Details

Validation imputation is performed using mice::mice.mids(newdata = ...), which generates imputations for new data according to the model stored in the training mids object.
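
The train-then-apply pattern can be sketched as follows. This is a minimal illustration, not the function's actual code: the toy data, the choice of m = 1 and maxit = 5, and the variable names are assumptions, and the block is skipped when the mice package is not installed.

```r
if (requireNamespace("mice", quietly = TRUE)) {
  set.seed(1)

  # Toy numeric features with roughly 10% of cells set to NA (hypothetical data).
  n  <- 120
  x1 <- rnorm(n)
  toy <- data.frame(x1 = x1, x2 = 0.6 * x1 + rnorm(n), x3 = rnorm(n))
  hit <- matrix(runif(n * 3) < 0.1, nrow = n)
  toy[hit] <- NA

  train <- toy[1:96, ]
  valid <- toy[97:120, ]

  # Fit the imputation model on the training rows only.
  imp_train <- mice::mice(train, m = 1, maxit = 5, printFlag = FALSE, seed = 1)

  # Impute the validation rows according to the model stored in imp_train;
  # the validation rows are not used to build the imputation model.
  imp_valid <- mice::mice.mids(imp_train, newdata = valid,
                               maxit = 1, printFlag = FALSE)

  valid_completed <- mice::complete(imp_valid, 1)
} else {
  message("Package 'mice' is not installed; skipping the sketch.")
}
```

Fitting on the training rows first is what keeps information from the validation rows out of the imputation model.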

MAPE is computed using Metrics::mape() on observations with non-zero targets only, because the percentage error divides by the actual value and is unstable or undefined when that value is zero.
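
The two metrics can be sketched in base R as below. The helpers mape_nonzero() and r_squared() are hypothetical; the first reproduces the mean absolute percentage error formula after dropping zero targets, and the second assumes R-squared is defined as one minus the ratio of residual to total sum of squares, which the page does not state explicitly.

```r
# Hypothetical helper: MAPE restricted to observations with a non-zero target.
mape_nonzero <- function(actual, predicted) {
  keep <- actual != 0
  mean(abs((actual[keep] - predicted[keep]) / actual[keep]))
}

# Hypothetical helper: R-squared as 1 - SSE / SST (an assumed definition).
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

actual    <- c(0, 100, 200, 400)
predicted <- c(5, 110, 180, 400)

mape_value <- mape_nonzero(actual, predicted)  # first pair is excluded
r2_value   <- r_squared(actual, predicted)     # all four pairs are used
```

Without the filter, the first pair would divide by zero and the mean would become infinite.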

Examples

data("CGMExampleData")
run_missingness_benchmark(
  CGMExampleData,
  target_col = "LBORRES",
  feature_cols = c("TimeDifferenceMinutes", "TimeSeries", "USUBJID"),
  mask_rates = c(0.05, 0.10),
  rf_n_estimators = 100,
  knn_k = 3
)
