Simulates clustered data \(\{(X_{i,j},Y_{i,j},\delta_{i,j})\}\) under a hierarchical subject-level model with covariate-dependent Missing at Random (MAR) missingness: \(\delta \perp Y \mid X\). Covariates \(X_{i,j}\) are fully observed, while outcomes \(Y_{i,j}\) may be missing.
Data are generated according to the following mechanisms:
Between-subject level: subject random intercepts \(b_i\sim N(0,\sigma_b^2)\) induce within-cluster dependence, corresponding to latent subject-specific laws \(P_i\).
Outcomes: for each measurement \(j=1,\ldots,m_i\),
$$
Y_{i,j} = X_{i,j}^\top \beta + b_i + \varepsilon_{i,j},
$$
where, for each subject \(i\), the within-cluster errors
\(\{\varepsilon_{i,j}\}_{j=1}^{m_i}\) are mutually independent with
\(\varepsilon_{i,j}\sim N(0,\sigma_\varepsilon^2)\) when rho = 0.
When rho != 0, they follow a stationary first-order autoregressive, AR(1),
process within the cluster:
$$
\varepsilon_{i,j} = \rho\,\varepsilon_{i,j-1} + \eta_{i,j}, \quad
\eta_{i,j}\sim N\!\left(0,\sigma_\varepsilon^2(1-\rho^2)\right),
$$
which, provided \(|\rho|<1\), implies \(\mathrm{Var}(\varepsilon_{i,j})=\sigma_\varepsilon^2\) and
\(\mathrm{Cov}(\varepsilon_{i,j},\varepsilon_{i,j+k})
= \sigma_\varepsilon^2\rho^{|k|}\) for all lags \(k\).
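The AR(1) error structure described above can be illustrated with a minimal sketch. The helper sim_ar1_errors is hypothetical and not part of the package; it assumes the first error in each cluster is drawn from the stationary distribution, which is one way to obtain the stated variance and covariance, and it may differ from what generate_clustered_mar does internally.

```r
# Hypothetical helper (not a package function): AR(1) errors for one cluster.
sim_ar1_errors <- function(m, sigma_eps = 1, rho = 0) {
  stopifnot(m >= 1, abs(rho) < 1)
  eps <- numeric(m)
  # Assumption: start from the stationary law so Var(eps_j) = sigma_eps^2 for all j.
  eps[1] <- rnorm(1, mean = 0, sd = sigma_eps)
  if (m > 1) {
    # Innovation SD sigma_eps * sqrt(1 - rho^2) keeps the marginal variance fixed.
    eta <- rnorm(m - 1, mean = 0, sd = sigma_eps * sqrt(1 - rho^2))
    for (j in 2:m) {
      eps[j] <- rho * eps[j - 1] + eta[j - 1]
    }
  }
  eps
}

set.seed(1)
# 20000 independent clusters of size 4, one cluster per column.
E <- replicate(20000, sim_ar1_errors(m = 4, sigma_eps = 1, rho = 0.6))
# Empirical covariance across positions; entries should be near rho^|k|.
round(cov(t(E)), 2)
```

With rho = 0 the recursion reduces to independent \(N(0,\sigma_\varepsilon^2)\) draws, matching the independent-error case.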
MAR missingness: outcomes are observed with probability
$$
\Pr(\delta_{i,j}=1\mid X_{i,j}) = \mathrm{logit}^{-1}(\alpha_0+\alpha^\top X_{i,j}),
$$
which depends only on covariates, ensuring \(\delta \perp Y \mid X\).
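As a rough illustration of this missingness mechanism, the sketch below draws observation indicators from the logistic model and masks a placeholder outcome. All object names (X, pi_obs, y_full, y_obs) are assumptions chosen for illustration, and the outcome here is a simplified stand-in rather than the full hierarchical model.

```r
set.seed(2)
N <- 1000
d <- 2
X <- matrix(rnorm(N * d), nrow = N, ncol = d)

alpha0 <- -0.2
alpha <- c(-1.0, 0.0)

# Observation probability: inverse logit of the linear predictor (plogis).
pi_obs <- plogis(alpha0 + drop(X %*% alpha))

# delta = 1 means the outcome is observed, delta = 0 means it is missing.
# delta depends on X only, never on the outcome, which is what makes this MAR.
delta <- rbinom(N, size = 1, prob = pi_obs)

# Placeholder latent outcome (illustrative only), masked where delta = 0.
y_full <- drop(X %*% c(0.5, 0.6)) + rnorm(N)
y_obs <- ifelse(delta == 1, y_full, NA_real_)

mean(delta == 0)  # realized missing rate in this draw
```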
If target_missing is provided, the intercept \(\alpha_0\) is automatically
calibrated (via a deterministic root-finding procedure on the expected missing proportion)
so that the marginal missing proportion is close to target_missing.
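One way such a calibration could work is sketched below, assuming the shift \(s\) is found by solving for the expected missing proportion with base R's uniroot. The helper calibrate_shift and the search interval are hypothetical; the package's actual root-finding procedure is not specified here and may differ.

```r
# Hypothetical helper (not a package function): solve for the intercept shift s
# such that mean(1 - plogis(alpha0 + s + X %*% alpha)) equals target_missing.
calibrate_shift <- function(X, alpha0, alpha, target_missing) {
  stopifnot(target_missing > 0, target_missing < 1)
  lin <- alpha0 + drop(X %*% alpha)
  # Expected missing proportion minus the target; strictly decreasing in s.
  gap <- function(s) mean(1 - plogis(lin + s)) - target_missing
  # Assumed bracket: wide enough that gap changes sign for moderate predictors.
  uniroot(gap, interval = c(-20, 20), tol = 1e-10)$root
}

set.seed(3)
X <- matrix(rnorm(1000 * 2), ncol = 2)
s <- calibrate_shift(X, alpha0 = -0.2, alpha = c(-1, 0), target_missing = 0.30)

# Expected missing proportion after applying the shift; close to 0.30.
mean(1 - plogis(-0.2 + s + drop(X %*% c(-1, 0))))
```

Because the calibration targets the average missing probability rather than the realized indicators, the sampled missing rate can still differ slightly from the target.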
generate_clustered_mar(
n,
m = 4L,
d = 2L,
beta = NULL,
sigma_b = 0.7,
sigma_eps = 1,
rho = 0,
hetero_gamma = 0,
x_dist = c("normal", "bernoulli", "uniform"),
x_params = NULL,
alpha0 = -0.2,
alpha = NULL,
target_missing = NULL,
seed = NULL
)
A data.frame in long format with one row per measurement:
Cluster index.
Within-cluster index.
Observed outcome; NA if missing.
Latent complete outcome.
Observation indicator (1 observed, 0 missing).
Covariates.
Attributes:
m_i: Integer vector of cluster sizes \((m_1,\ldots,m_n)\).
target_missing: Target marginal missing proportion used for calibration, defined as the empirical average of missing probabilities over all observations.
alpha_shift: Calibrated global intercept shift \(s\) added to the missingness linear predictor
\(\alpha_0 + s + \alpha^\top X_{i,j}\) (present only when target_missing is provided).
missing_rate: Sample missing rate \(N^{-1}\sum I(\delta_{i,j}=0)\).
This may deviate from target_missing due to Bernoulli sampling variability.
n: Number of clusters (subjects).
m: Cluster size. Either a single positive integer (common \(m_i=m\)) or
an integer vector of length n specifying \(m_i\) for each subject.
d: Covariate dimension.
beta: Population regression coefficients for \(Y\mid X\) (length d).
If NULL, defaults to seq(0.5, 0.5 + 0.1*(d-1), by=0.1).
sigma_b: SD of subject random intercept \(b_i\).
sigma_eps: Marginal SD of within-subject errors \(\varepsilon_{i,j}\).
rho: AR(1) correlation parameter within cluster for \(\varepsilon_{i,j}\).
hetero_gamma: Optional heteroskedasticity parameter; a value of 0 yields the standard homoskedastic model, while nonzero values induce covariate-dependent error variance through the first covariate \(X_1\).
x_dist: Distribution for covariates: "normal", "bernoulli", or "uniform".
x_params: Optional list of distribution parameters for x_dist.
alpha0: Missingness intercept \(\alpha_0\). If target_missing is not NULL,
the effective intercept becomes \(\alpha_0 + s\), where \(s\) is a calibrated shift.
alpha: Missingness slopes (length d). If NULL, defaults to zeros.
target_missing: Target marginal missing proportion, defined as the empirical
average of the missing probabilities \(1-\pi(X_{i,j})\) over all observations,
where \(\pi(x)=\Pr(\delta=1\mid X=x)\).
If NULL, no calibration is performed.
seed: Optional RNG seed.
dat <- generate_clustered_mar(
n = 200, m = 5, d = 2,
alpha0 = -0.2, alpha = c(-1.0, 0.0),
target_missing = 0.30,
seed = 1
)
mean(dat$delta == 0) # ~0.30
attr(dat, "alpha_shift") # calibrated shift