generate_clustered_mar: Simulate clustered continuous outcomes with covariate-dependent MAR missingness

Description

Simulates clustered data $\{(X_{i,j},Y_{i,j},\delta_{i,j})\}$, for clusters (subjects) $i=1,\ldots,n$ and measurements $j=1,\ldots,m_i$, under a hierarchical subject-level model with covariate-dependent Missing at Random (MAR) missingness: $\delta \perp Y \mid X$. Covariates $X_{i,j}$ are fully observed, while outcomes $Y_{i,j}$ may be missing; $\delta_{i,j}=1$ indicates that $Y_{i,j}$ is observed.

Data are generated according to the following mechanisms:

Between-subject level: subject random intercepts $b_i\sim N(0,\sigma_b^2)$ induce within-cluster dependence, corresponding to latent subject-specific laws $P_i$.
Outcomes: for each measurement $j=1,\ldots,m_i$, $$ Y_{i,j} = X_{i,j}^\top \beta + b_i + \varepsilon_{i,j}, $$ where, for each subject i, the within-cluster errors $\{\varepsilon_{i,j}\}_{j=1}^{m_i}$ are mutually independent with $\varepsilon_{i,j}\sim N(0,\sigma_\varepsilon^2)$ when rho = 0. When rho != 0, they follow a stationary first-order autoregressive process (AR(1)) within the cluster: $$ \varepsilon_{i,j} = \rho\,\varepsilon_{i,j-1} + \eta_{i,j}, \quad \eta_{i,j}\sim N\!\left(0,\sigma_\varepsilon^2(1-\rho^2)\right), $$ which implies $\mathrm{Var}(\varepsilon_{i,j})=\sigma_\varepsilon^2$ and $\mathrm{Cov}(\varepsilon_{i,j},\varepsilon_{i,j+k}) = \sigma_\varepsilon^2\rho^{|k|}$ for all k.
MAR missingness: outcomes are observed with probability $$ \Pr(\delta_{i,j}=1\mid X_{i,j}) = \mathrm{logit}^{-1}(\alpha_0+\alpha^\top X_{i,j}), $$ which depends only on covariates, ensuring $\delta \perp Y \mid X$. If target_missing is provided, a global shift $s$ is added to the intercept, giving the effective intercept $\alpha_0 + s$. The shift is calibrated (via a deterministic root-finding procedure on the expected missing proportion) so that the marginal missing proportion is close to target_missing.
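
The AR(1) error mechanism above can be illustrated with a minimal base-R sketch. The helper simulate_ar1_errors() is hypothetical and is not part of this package. It assumes the first error in each cluster is drawn from the stationary marginal $N(0,\sigma_\varepsilon^2)$, which this page does not state explicitly, and it requires $|\rho|<1$.

```r
# Sketch only: simulate_ar1_errors() is a hypothetical helper, not package code.
simulate_ar1_errors <- function(m, sigma_eps = 1, rho = 0) {
  stopifnot(m >= 1, sigma_eps > 0, abs(rho) < 1)
  eps <- numeric(m)
  # Assumption: the first error comes from the stationary marginal N(0, sigma_eps^2).
  eps[1] <- rnorm(1, mean = 0, sd = sigma_eps)
  # Innovation SD chosen so that Var(eps_j) stays equal to sigma_eps^2.
  sd_eta <- sigma_eps * sqrt(1 - rho^2)
  if (m > 1) {
    for (j in 2:m) {
      eps[j] <- rho * eps[j - 1] + rnorm(1, mean = 0, sd = sd_eta)
    }
  }
  eps
}

set.seed(1)
# One row per simulated cluster, one column per within-cluster index j.
eps_mat <- t(replicate(20000, simulate_ar1_errors(m = 4, sigma_eps = 1, rho = 0.6)))
emp_cov <- cov(eps_mat)                               # empirical covariance
theo_cov <- 1^2 * 0.6^abs(outer(1:4, 1:4, "-"))       # sigma_eps^2 * rho^|k|
round(emp_cov, 2)
round(theo_cov, 2)
```

With many simulated clusters, the empirical covariance matrix should be close to the theoretical one, with constant diagonal $\sigma_\varepsilon^2$ and off-diagonal entries decaying as $\rho^{|k|}$.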

Usage

generate_clustered_mar(
  n,
  m = 4L,
  d = 2L,
  beta = NULL,
  sigma_b = 0.7,
  sigma_eps = 1,
  rho = 0,
  hetero_gamma = 0,
  x_dist = c("normal", "bernoulli", "uniform"),
  x_params = NULL,
  alpha0 = -0.2,
  alpha = NULL,
  target_missing = NULL,
  seed = NULL
)

Value

A data.frame in long format with one row per measurement:

id: Cluster index.
j: Within-cluster index.
Y: Observed outcome; NA if missing.
Y_full: Latent complete outcome.
delta: Observation indicator (1 observed, 0 missing).
X1..Xd: Covariates.

Attributes:

m_i: Integer vector of cluster sizes $(m_1,\ldots,m_n)$.
target_missing: Target marginal missing proportion used for calibration, defined as the empirical average of missing probabilities over all observations.
alpha_shift: Calibrated global intercept shift $s$ added to the missingness linear predictor $\alpha_0 + s + \alpha^\top X_{i,j}$ (present only when target_missing is provided).
missing_rate: Sample missing rate $N^{-1}\sum I(\delta_{i,j}=0)$. This may deviate from target_missing due to Bernoulli sampling variability.
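
The relationship between target_missing, alpha_shift, and missing_rate can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: calibrate_shift() is a hypothetical helper, and the use of uniroot() with the search interval c(-20, 20) is an assumption, since this page only says that a deterministic root-finding procedure is used.

```r
# Sketch only: calibrate_shift() is a hypothetical helper, not package code.
calibrate_shift <- function(X, alpha0, alpha, target_missing,
                            interval = c(-20, 20)) {
  stopifnot(target_missing > 0, target_missing < 1)
  lin_pred <- alpha0 + as.vector(X %*% alpha)
  # Expected missing proportion minus the target, as a function of the shift s.
  # The function is monotone decreasing in s, so the root is unique.
  gap <- function(s) mean(1 - plogis(lin_pred + s)) - target_missing
  uniroot(gap, interval = interval, tol = 1e-10)$root
}

set.seed(1)
N <- 1000
X <- matrix(rnorm(N * 2), ncol = 2)
alpha0 <- -0.2
alpha <- c(-1, 0)
s <- calibrate_shift(X, alpha0, alpha, target_missing = 0.30)

p_obs <- plogis(alpha0 + s + as.vector(X %*% alpha))  # Pr(delta = 1 | X)
expected_missing <- mean(1 - p_obs)   # matches the target up to solver tolerance
delta <- rbinom(N, size = 1, prob = p_obs)
sample_missing <- mean(delta == 0)    # differs from the target by sampling noise
```

In this sketch, expected_missing plays the role of the calibrated target and sample_missing plays the role of missing_rate: the former matches the target by construction, while the latter varies with the Bernoulli draws.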

Arguments

n: Number of clusters (subjects).
m: Cluster size. Either a single positive integer (common $m_i=m$) or an integer vector of length n specifying $m_i$ for each subject.
d: Covariate dimension.
beta: Population regression coefficients for $Y\mid X$ (length d). If NULL, defaults to seq(0.5, 0.5 + 0.1*(d-1), by=0.1).
sigma_b: SD of subject random intercept $b_i$.
sigma_eps: Marginal SD of within-subject errors $\varepsilon_{i,j}$.
rho: AR(1) correlation parameter within cluster for $\varepsilon_{i,j}$. A value of 0 gives independent errors; the stationary process described above requires $|\rho|<1$.
hetero_gamma: Optional heteroskedasticity parameter; a value of 0 yields the standard homoskedastic model, while nonzero values induce covariate-dependent error variance through the first covariate $X_1$.
x_dist: Distribution for covariates: "normal", "bernoulli", or "uniform".
x_params: Optional list of distribution parameters for x_dist.
alpha0: Missingness intercept $\alpha_0$. If target_missing is not NULL, the effective intercept becomes $\alpha_0 + s$, where $s$ is a calibrated shift.
alpha: Missingness slopes (length d). If NULL, defaults to zeros.
target_missing: Target marginal missing proportion, defined as the empirical average of the model-implied missing probabilities $1-\pi(X_{i,j})$ over all observations, where $\pi(x)=\Pr(\delta=1\mid X=x)$. If NULL, no calibration is performed and alpha0 is used as given.
seed: Optional RNG seed.
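
As a small illustration of the argument conventions above, the following base-R sketch evaluates the documented defaults for beta and alpha and expands m into per-cluster sizes. The variable names are illustrative only, and the way id and j are built here is an assumption about the long-format layout, not the package's internal code.

```r
d <- 3L
# Documented default for beta when beta = NULL.
beta_default <- seq(0.5, 0.5 + 0.1 * (d - 1), by = 0.1)
# Documented default for alpha when alpha = NULL: all slopes are zero, so the
# observation probability is the same for every row.
alpha_default <- rep(0, d)

n <- 4L
m <- c(2L, 5L, 3L, 4L)                       # one cluster size per subject
# A scalar m is recycled to all clusters; a length-n vector is used as is.
m_i <- if (length(m) == 1L) rep(m, n) else m
id <- rep(seq_len(n), times = m_i)           # cluster index, one entry per row
j <- sequence(m_i)                           # within-cluster index 1..m_i
data.frame(id = id, j = j)
```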

Examples


dat <- generate_clustered_mar(
  n = 200, m = 5, d = 2,
  alpha0 = -0.2, alpha = c(-1.0, 0.0),
  target_missing = 0.30,
  seed = 1
)
mean(dat$delta == 0)      # ~0.30
attr(dat, "alpha_shift")  # calibrated shift
