HCPclust (version 0.1.1)

generate_clustered_mar: Simulate clustered continuous outcomes with covariate-dependent MAR missingness

Description

Simulates clustered data \(\{(X_{i,j},Y_{i,j},\delta_{i,j})\}\) under a hierarchical subject-level model with covariate-dependent Missing at Random (MAR) missingness: \(\delta \perp Y \mid X\). Covariates \(X_{i,j}\) are fully observed, while outcomes \(Y_{i,j}\) may be missing.

Data are generated according to the following mechanisms:

  • Between-subject level: subject random intercepts \(b_i\sim N(0,\sigma_b^2)\) induce within-cluster dependence, corresponding to latent subject-specific laws \(P_i\).

  • Outcomes: for each measurement \(j=1,\ldots,m_i\), $$ Y_{i,j} = X_{i,j}^\top \beta + b_i + \varepsilon_{i,j}, $$ where, for each subject i, the within-cluster errors \(\{\varepsilon_{i,j}\}_{j=1}^{m_i}\) are mutually independent with \(\varepsilon_{i,j}\sim N(0,\sigma_\varepsilon^2)\) when rho = 0. When rho != 0 (with \(|\rho|<1\), as stationarity requires), they follow a stationary first-order autoregressive process (AR(1)) within the cluster: $$ \varepsilon_{i,j} = \rho\,\varepsilon_{i,j-1} + \eta_{i,j}, \quad \eta_{i,j}\sim N\!\left(0,\sigma_\varepsilon^2(1-\rho^2)\right), $$ which implies \(\mathrm{Var}(\varepsilon_{i,j})=\sigma_\varepsilon^2\) and \(\mathrm{Cov}(\varepsilon_{i,j},\varepsilon_{i,j+k}) = \sigma_\varepsilon^2\rho^{|k|}\) for all lags k.

  • MAR missingness: outcomes are observed with probability $$ \Pr(\delta_{i,j}=1\mid X_{i,j}) = \mathrm{logit}^{-1}(\alpha_0+\alpha^\top X_{i,j}), $$ which depends only on covariates, ensuring \(\delta \perp Y \mid X\). If target_missing is provided, the intercept \(\alpha_0\) is automatically calibrated (via a deterministic root-finding procedure on the expected missing proportion) so that the marginal missing proportion is close to target_missing.
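The AR(1) error recursion described above can be sketched in base R. This is only an illustration of the stated model, not the package's internal code: simulate_ar1_errors is a hypothetical helper, and drawing the first error from \(N(0,\sigma_\varepsilon^2)\) is an assumption made here so that the process starts in its stationary distribution.

```r
# Hypothetical helper (not part of HCPclust): AR(1) errors for one cluster.
simulate_ar1_errors <- function(m, sigma_eps = 1, rho = 0) {
  stopifnot(m >= 1, abs(rho) < 1)
  eps <- numeric(m)
  # Assumption: start from the stationary law so Var(eps_j) = sigma_eps^2 for every j.
  eps[1] <- rnorm(1, mean = 0, sd = sigma_eps)
  if (m > 1) {
    # Innovation SD chosen so the marginal variance stays at sigma_eps^2.
    innov_sd <- sigma_eps * sqrt(1 - rho^2)
    for (j in 2:m) {
      eps[j] <- rho * eps[j - 1] + rnorm(1, mean = 0, sd = innov_sd)
    }
  }
  eps
}

set.seed(1)
# Rows are independent clusters of size 4; columns are positions j = 1..4.
E <- t(replicate(20000, simulate_ar1_errors(m = 4, sigma_eps = 1, rho = 0.6)))

round(apply(E, 2, var), 2)     # each should be close to sigma_eps^2 = 1
round(cor(E[, 1], E[, 3]), 2)  # should be close to rho^2 = 0.36
```

With rho = 0 the innovation SD equals sigma_eps and the recursion reduces to independent draws, which matches the independent-errors case in the description.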

Usage

generate_clustered_mar(
  n,
  m = 4L,
  d = 2L,
  beta = NULL,
  sigma_b = 0.7,
  sigma_eps = 1,
  rho = 0,
  hetero_gamma = 0,
  x_dist = c("normal", "bernoulli", "uniform"),
  x_params = NULL,
  alpha0 = -0.2,
  alpha = NULL,
  target_missing = NULL,
  seed = NULL
)

Value

A data.frame in long format with one row per measurement:

id

Cluster index.

j

Within-cluster index.

Y

Observed outcome; NA if missing.

Y_full

Latent complete outcome.

delta

Observation indicator (1 observed, 0 missing).

X1..Xd

Covariates.

Attributes:

m_i

Integer vector of cluster sizes \((m_1,\ldots,m_n)\).

target_missing

Target marginal missing proportion used for calibration, defined as the empirical average of missing probabilities over all observations.

alpha_shift

Calibrated global intercept shift \(s\) added to the missingness linear predictor \(\alpha_0 + s + \alpha^\top X_{i,j}\) (present only when target_missing is provided).

missing_rate

Sample missing rate \(N^{-1}\sum_{i,j} I(\delta_{i,j}=0)\), where \(N=\sum_{i=1}^{n} m_i\) is the total number of measurements. This may deviate from target_missing due to Bernoulli sampling variability.
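The relationship between target_missing, alpha_shift, and missing_rate can be illustrated with a small base R sketch. It assumes the calibration solves for the shift \(s\) with a bracketing root finder such as uniroot(); the page only says "deterministic root-finding", so the solver, the interval, and all variable names below are illustrative assumptions rather than the package's implementation.

```r
# Illustration only: calibrate an intercept shift s so that the average
# missing probability equals a target. Not HCPclust's internal code.
set.seed(1)
N <- 1000
X <- matrix(rnorm(N * 2), ncol = 2)   # stand-in covariates
alpha0 <- -0.2
alpha <- c(-1, 0)
target_missing <- 0.30

# Expected missing proportion as a function of s:
# the average over observations of 1 - logit^{-1}(alpha0 + s + alpha' x).
expected_missing <- function(s) {
  mean(1 - plogis(alpha0 + s + drop(X %*% alpha)))
}

# expected_missing() is strictly decreasing in s, so the root is unique
# on a wide bracketing interval.
fit <- uniroot(function(s) expected_missing(s) - target_missing,
               interval = c(-20, 20), tol = 1e-10)
s_hat <- fit$root

expected_missing(s_hat)  # matches target_missing up to solver tolerance

# Drawing the indicators adds Bernoulli noise, so the sample missing rate
# only approximates the target.
delta <- rbinom(N, size = 1, prob = plogis(alpha0 + s_hat + drop(X %*% alpha)))
mean(delta == 0)
```

Because the calibration targets the average of the missing probabilities, it is deterministic given the covariates; only the final Bernoulli draw introduces the gap between missing_rate and target_missing.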

Arguments

n

Number of clusters (subjects).

m

Cluster size. Either a single positive integer (common \(m_i=m\)) or an integer vector of length n specifying \(m_i\) for each subject.

d

Covariate dimension.

beta

Population regression coefficients for \(Y\mid X\) (length d). If NULL, defaults to seq(0.5, 0.5 + 0.1*(d-1), by=0.1).

sigma_b

SD of subject random intercept \(b_i\).

sigma_eps

Marginal SD of within-subject errors \(\varepsilon_{i,j}\).

rho

AR(1) correlation parameter within cluster for \(\varepsilon_{i,j}\).

hetero_gamma

Optional heteroskedasticity parameter; a value of 0 yields the standard homoskedastic model, while nonzero values induce covariate-dependent error variance through the first covariate \(X_1\).

x_dist

Distribution for covariates: "normal", "bernoulli", or "uniform".

x_params

Optional list of distribution parameters for x_dist.

alpha0

Missingness intercept \(\alpha_0\). If target_missing is not NULL, the effective intercept becomes \(\alpha_0 + s\), where \(s\) is a calibrated shift.

alpha

Missingness slopes (length d). If NULL, defaults to zeros.

target_missing

Target marginal missing proportion defined as the empirical average of the fitted missing probabilities \(1-\pi(X_{i,j})\) over all observations, where \(\pi(x)=\Pr(\delta=1\mid X=x)\). If NULL, no calibration.

seed

Optional RNG seed.
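The documented defaults can be reproduced in base R to see what an otherwise unconfigured call implies. The expansion of a scalar m into a per-cluster vector is an assumption about how the argument is handled, based on the description of m above; the variable names are illustrative.

```r
d <- 3L
n <- 4L

# Documented default for beta when beta = NULL.
beta_default <- seq(0.5, 0.5 + 0.1 * (d - 1), by = 0.1)
beta_default

# Documented default for alpha when alpha = NULL: all slopes are zero.
alpha_default <- rep(0, d)

# With zero slopes and no target_missing, the observation probability is the
# same for every measurement and depends only on alpha0.
alpha0 <- -0.2
p_obs <- plogis(alpha0)
p_obs        # roughly 0.45, so about 55% of outcomes would be missing
1 - p_obs

# Assumed handling of m: a scalar means a common cluster size.
m <- 4L
m_i <- if (length(m) == 1L) rep(m, n) else m
m_i
```

One practical consequence is that the default alpha0 = -0.2 corresponds to a fairly high missing proportion, so setting target_missing or alpha0 explicitly is worth considering when a specific missing rate is intended.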

Examples

dat <- generate_clustered_mar(
  n = 200, m = 5, d = 2,
  alpha0 = -0.2, alpha = c(-1.0, 0.0),
  target_missing = 0.30,
  seed = 1
)
mean(dat$delta == 0)      # ~0.30
attr(dat, "alpha_shift")  # calibrated shift
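A few follow-up checks on the long-format output can be sketched as below. So that the snippet runs without the package installed, it builds dat_demo, a hypothetical stand-in with the documented columns (id, j, Y, Y_full, delta, X1, X2); the coefficients used to fill it are arbitrary. With HCPclust available, the same checks should apply to dat from the example.

```r
# Stand-in data frame with the documented column layout (illustration only).
set.seed(1)
n <- 50
m <- 5
dat_demo <- data.frame(
  id = rep(seq_len(n), each = m),
  j  = rep(seq_len(m), times = n),
  X1 = rnorm(n * m),
  X2 = rnorm(n * m)
)
b <- rnorm(n, sd = 0.7)  # subject random intercepts
dat_demo$Y_full <- 0.5 * dat_demo$X1 + 0.6 * dat_demo$X2 +
  b[dat_demo$id] + rnorm(n * m)
dat_demo$delta <- rbinom(n * m, size = 1, prob = plogis(-0.2 - dat_demo$X1))
dat_demo$Y <- ifelse(dat_demo$delta == 1, dat_demo$Y_full, NA_real_)

# Y should be NA exactly where delta == 0.
stopifnot(all(is.na(dat_demo$Y) == (dat_demo$delta == 0)))

# Missing rate within each cluster.
miss_by_cluster <- tapply(dat_demo$delta == 0, dat_demo$id, mean)
summary(as.numeric(miss_by_cluster))

# Observed rows only, e.g. as input to a complete-case analysis.
obs <- dat_demo[dat_demo$delta == 1, ]
nrow(obs)
```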
