hcp_conformal_region: HCP conformal prediction region with repeated subsampling and repeated data splitting

Description

Constructs a marginal conformal prediction region for a new covariate value $x_{n+1}$ under clustered data with missing outcomes, following the HCP framework:

(1) Model fitting. Fit a pooled conditional density model $\widehat\pi(y\mid x)$ using fit_cond_density_quantile, together with a marginal missingness propensity model $\widehat p(x)=\mathbb{P}(\delta=1\mid x)$ using fit_missingness_propensity, both estimated on a subject-level training split.
(2) Subsampled calibration. Repeatedly construct calibration sets by randomly drawing one observation per subject from the calibration split.
(3) Weighted conformal scoring. Compute weighted conformal $p$-values over a candidate grid using the nonconformity score $R(x,y)=-\widehat\pi(y\mid x)$ and inverse-propensity weights $w(x)=1/\widehat p(x)$ under a missing-at-random (MAR) assumption.
(4) Aggregation. Aggregate dependent $p$-values across subsamples (B) and data splits (S) using either the Cauchy combination test (CCT/ACAT) or the arithmetic mean.
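Steps (2) and (3) can be illustrated with a minimal base-R sketch on toy data. It is not the package's internal code: the names `draw_one_per_subject`, `weighted_pvalue`, `pi_hat` and `p_hat` are hypothetical, the two "fitted" models are fixed stand-ins rather than outputs of fit_cond_density_quantile and fit_missingness_propensity, and the $p$-value uses one common weighted-conformal normalization that may differ in detail from the one implemented here.

```r
set.seed(1)

# Hypothetical clustered data: 30 subjects, 3 rows each, one covariate.
n_subj <- 30
m <- 3
dat <- data.frame(id = rep(seq_len(n_subj), each = m),
                  X1 = rnorm(n_subj * m))
dat$Y <- dat$X1 + rnorm(n_subj * m)
dat$delta <- rbinom(nrow(dat), 1, plogis(1 + dat$X1))  # MAR: depends on X1 only
dat$Y[dat$delta == 0] <- NA

# Stand-ins for the fitted models (assumptions; normally estimated on the training split).
pi_hat <- function(y, x) dnorm(y, mean = x, sd = 1)  # conditional density
p_hat  <- function(x) pmax(plogis(1 + x), 1e-6)      # clipped propensity

# Step (2): draw one row per subject.
draw_one_per_subject <- function(d) {
  rows <- vapply(split(seq_len(nrow(d)), d$id),
                 function(idx) idx[sample.int(length(idx), 1)],
                 integer(1))
  d[rows, , drop = FALSE]
}

# Step (3): weighted conformal p-value at each candidate y,
# using observed calibration rows only, weighted by 1 / p_hat(x).
weighted_pvalue <- function(cal, x_new, y_grid) {
  obs   <- cal[cal$delta == 1, , drop = FALSE]
  w_cal <- 1 / p_hat(obs$X1)
  r_cal <- -pi_hat(obs$Y, obs$X1)        # R(x, y) = -pi_hat(y | x)
  w_new <- 1 / p_hat(x_new)
  vapply(y_grid, function(y) {
    r_new <- -pi_hat(y, x_new)
    (sum(w_cal[r_cal >= r_new]) + w_new) / (sum(w_cal) + w_new)
  }, numeric(1))
}

sub    <- draw_one_per_subject(dat)
y_grid <- seq(-4, 4, length.out = 81)
p_vals <- weighted_pvalue(sub, x_new = 0.2, y_grid = y_grid)
```

In this sketch a candidate $y$ receives a large $p$-value when its estimated density at the test point is high relative to the calibration scores, so `p_vals` peaks near the centre of the stand-in density and decays in the tails.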

The prediction region is returned as a subset of the supplied grid: $$\widehat C(x_{n+1};\alpha)=\{y\in\mathcal Y:\ p_{\text{final}}(y)>\alpha\}.$$
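As a small illustration of the thresholding rule above, the following sketch extracts a region and its outer envelope from a made-up $p$-value curve. The numbers are hypothetical and are not output of hcp_conformal_region.

```r
# Hypothetical final p-values on a 9-point grid (illustration only).
y_grid  <- seq(-4, 4, length.out = 9)
p_final <- c(0.01, 0.04, 0.20, 0.60, 0.90, 0.55, 0.15, 0.08, 0.02)
alpha   <- 0.1

# Keep the grid points whose p-value exceeds alpha.
region <- y_grid[p_final > alpha]

# Outer envelope of the region; NA when nothing is kept.
lo_hi <- if (length(region) > 0) {
  c(lo = min(region), hi = max(region))
} else {
  c(lo = NA_real_, hi = NA_real_)
}
```

Because the region is a subset of the grid, it need not be an interval: a multimodal $p$-value curve can leave gaps, in which case the `lo`/`hi` pair only describes the outer envelope.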

Usage

hcp_conformal_region(
  dat,
  id_col,
  y_col = "Y",
  delta_col = "delta",
  x_cols,
  x_test,
  y_grid,
  alpha = 0.1,
  train_frac = 0.5,
  S = 5,
  B = 5,
  combine_B = c("cct", "mean"),
  combine_S = c("cct", "mean"),
  seed = NULL,
  return_details = FALSE,
  dens_method = c("rq", "qrf"),
  dens_taus = seq(0.05, 0.95, by = 0.02),
  dens_h = NULL,
  enforce_monotone = TRUE,
  tail_decay = TRUE,
  prop_method = c("logistic", "grf", "boosting"),
  prop_eps = 1e-06,
  ...
)

Value

If return_details=FALSE (default), a list with:

region: Length-K list; region[[k]] is the subset of y_grid with p_final[k, ] > alpha.
lo_hi: K x 2 matrix with columns c("lo","hi") giving min/max of region[[k]] (NA if empty).
p_final: K x length(y_grid) matrix of final p-values on y_grid.
y_grid: The candidate grid used.

If return_details=TRUE, also includes:

p_split: An array with dimensions c(S, K, length(y_grid)) of split-level p-values.
split_meta: Train subject IDs for each split.
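To make the shape of p_split concrete, the sketch below builds a hypothetical array with the documented dimensions and averages it over the split dimension. The values are random placeholders, and the averaging only mirrors what combine_S = "mean" is described as doing; it is not taken from the package source.

```r
set.seed(1)
S <- 3  # splits
K <- 2  # test points
G <- 5  # grid length

# Placeholder split-level p-values with dimensions c(S, K, G).
p_split <- array(runif(S * K * G), dim = c(S, K, G))

# Average over splits: the result is a K x G matrix, like p_final.
p_mean <- apply(p_split, c(2, 3), mean)

# Split-level curves for the first test point: an S x G matrix.
p_first <- p_split[, 1, ]
```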

Arguments

dat: A data.frame containing clustered observations. Must include id_col, y_col, delta_col, and all columns in x_cols.
id_col: Subject/cluster identifier column name.
y_col: Outcome column name.
delta_col: Missingness indicator column name (1 observed, 0 missing).
x_cols: Covariate column names used for both density estimation and missingness propensity.
x_test: New covariate value(s). A numeric vector (treated as one row), or a numeric matrix/data.frame with nrow(x_test)=K test points and ncol(x_test)=length(x_cols) covariates.
y_grid: Numeric vector of candidate $y$ values at which to evaluate conformal $p$-values.
alpha: Miscoverage level in (0,1). The region keeps candidate values $y$ with $p_{\text{final}}(y)>\alpha$.
train_frac: Fraction of subjects assigned to training in each split.
S: Number of independent subject-level splits.
B: Number of subsamples per split (one observation per subject per subsample).
combine_B: Combine $p$-values across B subsamples: "cct" (default) or "mean".
combine_S: Combine $p$-values across S splits: "cct" (default) or "mean".
seed: Optional seed for reproducibility.
return_details: Logical; if TRUE, also return split-level p-values and split metadata.
dens_method: Density/quantile engine for fit_cond_density_quantile: "rq" or "qrf".
dens_taus: Quantile grid passed to fit_cond_density_quantile.
dens_h: Bandwidth(s) passed to fit_cond_density_quantile.
enforce_monotone: Passed to fit_cond_density_quantile.
tail_decay: Passed to fit_cond_density_quantile.
prop_method: Missingness propensity method for fit_missingness_propensity: "logistic", "grf", or "boosting".
prop_eps: Clipping level for propensity predictions used by fit_missingness_propensity.
...: Extra arguments passed to fit_missingness_propensity.
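The two options for combine_B and combine_S can be compared with a short sketch of the equal-weight Cauchy combination next to the arithmetic mean. The helper name `cct_combine` and the clipping level `1e-15` are assumptions made for this illustration; the weights and numerical safeguards used inside the package may differ.

```r
# Equal-weight Cauchy combination of a vector of p-values (illustrative helper).
cct_combine <- function(p) {
  p <- pmin(pmax(p, 1e-15), 1 - 1e-15)    # keep tan() finite at 0 and 1
  t_stat <- mean(tan((0.5 - p) * pi))     # Cauchy-transformed average
  pcauchy(t_stat, lower.tail = FALSE)     # back to the p-value scale
}

# Hypothetical p-values for one candidate y across four subsamples.
p_sub <- c(0.02, 0.30, 0.45, 0.08)
combined <- c(cct = cct_combine(p_sub), mean = mean(p_sub))

# For a B x G matrix of p-values, combine column by column.
p_mat  <- rbind(p_sub, rev(p_sub))
p_cols <- apply(p_mat, 2, cct_combine)
```

The Cauchy combination is pulled towards the smallest $p$-values, whereas the mean treats all subsamples evenly, so the two choices can yield regions of different size on the same data.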

Examples


dat <- generate_clustered_mar(n = 200, m = 4, d = 2, target_missing = 0.30, seed = 1)
y_grid <- seq(-4, 4, length.out = 200)
x_test <- matrix(c(0.2, -1.0), nrow = 1); colnames(x_test) <- c("X1", "X2")

res <- hcp_conformal_region(
  dat, id_col = "id",
  y_col = "Y", delta_col = "delta",
  x_cols = c("X1", "X2"),
  x_test = x_test,
  y_grid = y_grid,
  alpha = 0.1,
  S = 2, B = 2,
  seed = 1
)

## interval endpoints on the y-grid (outer envelope of the region; NA if the region is empty)
res$lo_hi[1, ]
