HCPclust (version 0.1.1)

fit_missingness_propensity: Fit missingness propensity model P(delta=1 | X) from pooled data

Description

Fits the missingness propensity \(\pi(x)=\mathbb{P}(\delta=1\mid x)\) under a marginal missingness model using pooled observations. Estimation can be carried out using logistic regression, Generalized Random Forests (GRF), or gradient boosting (xgboost). Both continuous and discrete covariates are supported; categorical variables are automatically expanded into dummy variables via model.matrix().
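The dummy-expansion step can be illustrated with base R's model.matrix(); the data frame and formula below are a hypothetical sketch, not the package's internal code, which may use a different formula:

```r
## Hypothetical covariate frame with one numeric and one factor column
df <- data.frame(X1 = c(0.5, -1.2, 0.3),
                 g  = factor(c("a", "b", "a")))

## model.matrix() expands the factor g into dummy columns;
## "~ . - 1" drops the intercept so every factor level gets its own column
X <- model.matrix(~ . - 1, data = df)
colnames(X)  # "X1" "ga" "gb"
```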

Usage

fit_missingness_propensity(
  dat,
  delta_col = "delta",
  x_cols,
  method = c("logistic", "grf", "boosting"),
  eps = 1e-06,
  ...
)

Value

A list containing:

method

The estimation method used.

fit

The fitted missingness propensity model.

predict

A function predict(x_new) that returns the estimated missingness propensity \(\hat\pi(x)=\mathbb{P}(\delta=1\mid x)\) evaluated at new covariate values x_new, with predictions clipped to \([\epsilon,1-\epsilon]\).

Arguments

dat

A data.frame containing delta_col and x_cols. Can be any user-supplied dataset; generate_clustered_mar() is used only in examples.

delta_col

Name of the missingness indicator column (1 = observed, 0 = missing).

x_cols

Character vector of covariate column names used to predict missingness.

method

One of "logistic", "grf", or "boosting".

eps

Clipping level applied to the estimated missingness propensity \(\hat\pi(x)\), truncating predictions to \([\epsilon,1-\epsilon]\).
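The clipping rule can be sketched in one line (clip_pi is a hypothetical helper, not an exported function of the package):

```r
## Truncate estimated propensities to [eps, 1 - eps] so that downstream
## inverse-propensity weights 1 / pi-hat(x) stay finite
clip_pi <- function(p, eps = 1e-6) pmin(pmax(p, eps), 1 - eps)
clip_pi(c(0, 0.42, 1))  # 1e-06 0.42 0.999999
```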

...

Extra arguments passed to the learner:

logistic

passed to stats::glm.

grf

passed to grf::probability_forest.

boosting

passed to xgboost::xgb.train via params= and nrounds=.

Examples

dat <- generate_clustered_mar(
  n = 80, m = 4, d = 2,
  alpha0 = -0.4, alpha = c(-1.0, 0.8),
  target_missing = 0.30,
  seed = 1
)
x_cols <- c("X1", "X2")

## Logistic regression
fit_log <- fit_missingness_propensity(dat, "delta", x_cols, method = "logistic")
p_log <- fit_log$predict(dat[, x_cols, drop = FALSE])
head(p_log)
# \donttest{
## Compare with other methods
## True propensity under the generator; alpha_shift is the intercept
## adjustment the generator applies to alpha0 to hit target_missing
s <- attr(dat, "alpha_shift")
eta <- (-0.4 + s) + (-1.0) * dat$X1 + 0.8 * dat$X2
pi_true <- 1 / (1 + exp(-pmin(pmax(eta, -30), 30)))

fit_grf <- fit_missingness_propensity(
  dat, "delta", x_cols,
  method = "grf", num.trees = 800, num.threads = 1
)
fit_xgb <- fit_missingness_propensity(
  dat, "delta", x_cols,
  method = "boosting",
  nrounds = 300,
  params = list(max_depth = 3, eta = 0.05, subsample = 0.8, colsample_bytree = 0.8),
  nthread = 1
)

p_grf <- fit_grf$predict(dat[, x_cols, drop = FALSE])
p_xgb <- fit_xgb$predict(dat[, x_cols, drop = FALSE])

op <- par(mfrow = c(1, 3))
plot(pi_true, p_log, pch = 16, cex = 0.5,
     xlab = "True pi(x)", ylab = "Estimated pi-hat(x)", main = "Logistic"); abline(0, 1, lwd = 2)
plot(pi_true, p_grf, pch = 16, cex = 0.5,
     xlab = "True pi(x)", ylab = "Estimated pi-hat(x)", main = "GRF"); abline(0, 1, lwd = 2)
plot(pi_true, p_xgb, pch = 16, cex = 0.5,
     xlab = "True pi(x)", ylab = "Estimated pi-hat(x)", main = "Boosting"); abline(0, 1, lwd = 2)
par(op)
# }
