simulation_prediction_binary: Simulate Binary Longitudinal Data for Prediction

Description

Generates synthetic longitudinal data with binary outcomes, designed for evaluating classification and prediction models. The function builds a latent continuous variable from covariates and random effects, maps it to a probability through the CDF selected by the residual argument (probit, logit, or Student's t), and samples binary training outcomes from that probability.

Usage

simulation_prediction_binary(
  train_prop = 0.7,
  n_subject = 1000,
  n_obs_per_sub = 5,
  seed = NULL,
  nonlinear = FALSE,
  residual = c("normal", "logistic", "t3", "t2"),
  randeff = c("MVN", "MVN_mixture", "skewed_MVN", "MVT3", "MVT2")
)

Value

A list containing the following components:

subject_id_train: A vector of subject IDs for the training set.
Z_train: The random-effects design matrix (intercept and time) for the training set.
X_train: A matrix of covariates for the training set.
Y_train: A vector of observed binary outcomes (0 or 1) for the training set.
subject_id_test: A vector of subject IDs for the testing set.
Z_test: The random-effects design matrix (intercept and time) for the testing set.
X_test: A matrix of covariates for the testing set.
Y_test: A vector of true probabilities for the testing set. These represent the ground truth propensity scores (0 to 1) used for evaluation.
X_pop: A matrix of covariates for the entire population.
y_pop: A vector of true probabilities for the entire population.
I: A logical vector indicating which observations belong to the training set.
X_src: Duplicate of X_train, provided for convenience.
Y_src: Vector of true probabilities for the training set (unlike Y_train which is binary).

Arguments

train_prop

A numeric value between 0 and 1 indicating the proportion of the population to be used for the training set. Default: 0.7.

n_subject

An integer specifying the total number of subjects in the population. Default: 1000.

n_obs_per_sub

An integer specifying the number of observations per subject. Default: 5.

seed

An optional integer for setting the random seed to ensure reproducibility. Default: NULL.

nonlinear

A logical value. If TRUE, the latent variable is generated using a complex nonlinear function of the covariates. If FALSE, it is a linear combination. Default: FALSE.

residual

A character string selecting the CDF that maps the latent variable to a probability, i.e. the inverse of the link function. In Generalized Linear Mixed Model (GLMM) terms, this corresponds to the assumed distribution of the latent residual error:

"normal": Uses the standard normal CDF (Probit link).
"logistic": Uses the logistic CDF (Logit link).
"t3": Uses the Student's t (df=3) CDF.
"t2": Uses the Student's t (df=2) CDF.
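
As a rough illustration of how the four options differ, the sketch below maps a latent value to a probability with the matching base-R CDF. The helper name latent_to_prob is hypothetical and not part of the package, and the assumption that each option corresponds to pnorm, plogis, or pt applied directly to the latent value is inferred from the descriptions above rather than from the package source.

```r
# Hypothetical helper (not part of the package): apply the CDF that
# matches each `residual` option to a latent value `eta`.
latent_to_prob <- function(eta, residual = c("normal", "logistic", "t3", "t2")) {
  residual <- match.arg(residual)
  switch(residual,
    normal   = pnorm(eta),        # probit: standard normal CDF
    logistic = plogis(eta),       # logit: logistic CDF
    t3       = pt(eta, df = 3),   # Student's t CDF, 3 degrees of freedom
    t2       = pt(eta, df = 2)    # Student's t CDF, 2 degrees of freedom
  )
}

# Compare the options at a few latent values (rows: eta, columns: option)
eta <- c(-2, 0, 2)
probs <- sapply(c("normal", "logistic", "t3", "t2"),
                function(r) latent_to_prob(eta, r))
rownames(probs) <- paste0("eta=", eta)
print(round(probs, 3))
```

All four CDFs give 0.5 at a latent value of 0; the logistic and t options place more mass in the tails than the normal, so extreme latent values yield less extreme probabilities.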

randeff

A character string specifying the distribution of the random effects added to the latent variable. Options are:

"MVN": Multivariate Normal distribution.
"MVN_mixture": Mixture of Multivariate Normal distributions.
"skewed_MVN": Multivariate Skew-normal distribution.
"MVT3": Multivariate t-distribution with 3 degrees of freedom.
"MVT2": Multivariate t-distribution with 2 degrees of freedom.
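
To give a feel for what these choices mean, here is a minimal base-R sketch that draws bivariate random effects (intercept and slope) from three of the families. The covariance matrix, mixture weights, and component shift are all hypothetical values chosen for illustration; the package's actual parameters are not documented here. The skew-normal case is omitted because it needs a more involved construction.

```r
set.seed(42)

# Hypothetical 2 x 2 covariance for (intercept, slope); not the package's values
Sigma <- matrix(c(1, 0.3, 0.3, 1), nrow = 2)
R <- chol(Sigma)  # upper-triangular factor with t(R) %*% R == Sigma

# "MVN": rows are independent N(0, Sigma) draws
rmvn <- function(n) matrix(rnorm(n * 2), ncol = 2) %*% R

# "MVT3"/"MVT2": scale each normal draw by sqrt(chi-square / df)
rmvt <- function(n, df) rmvn(n) / sqrt(rchisq(n, df) / df)

# "MVN_mixture": equal-weight two-component mixture with shifted means
# (weights and shift are assumptions for illustration)
rmvn_mixture <- function(n, shift = c(2, 0)) {
  comp <- rbinom(n, size = 1, prob = 0.5)
  rmvn(n) + outer(2 * comp - 1, shift)
}

n_subject <- 1000
b_mvn <- rmvn(n_subject)
b_mix <- rmvn_mixture(n_subject)
b_t3  <- rmvt(n_subject, df = 3)
b_t2  <- rmvt(n_subject, df = 2)

# Heavier tails show up as larger extreme values
sapply(list(MVN = b_mvn, mixture = b_mix, MVT3 = b_t3, MVT2 = b_t2),
       function(b) max(abs(b)))
```

The t-distributed options produce occasional very large subject effects, which is what makes them useful for stress-testing models that assume normal random effects.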

Details

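The function simulates a latent continuous variable \(Y^*\) as the sum of fixed effects (a linear or nonlinear function of X) and random effects (Z * Bi, the random-effects design matrix times the subject-specific coefficients drawn from the randeff distribution). This latent variable is scaled and then transformed into a probability \(p\) using the CDF specified by residual.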

For the training set, the observed outcome Y_train is sampled from a Bernoulli distribution with probability \(p\). For the testing set, the function returns the probability \(p\) itself (Y_test), allowing for precise evaluation of the model's ability to estimate propensity scores or risk.
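
The steps described in this section can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the coefficients, the covariate distribution, the use of independent normal random effects, standardization as the scaling step, and splitting by subject rather than by observation are all hypothetical choices.

```r
set.seed(1)
n_subject <- 200
n_obs_per_sub <- 5
n <- n_subject * n_obs_per_sub
subject_id <- rep(seq_len(n_subject), each = n_obs_per_sub)

# Hypothetical design: two covariates; random intercept and random slope on time
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
Z <- cbind(intercept = 1, time = rep(seq_len(n_obs_per_sub), times = n_subject))

beta <- c(1, -0.5)                                    # assumed fixed effects
B <- matrix(rnorm(n_subject * 2, sd = 0.5), ncol = 2) # one row of effects per subject

# Latent variable: fixed part plus subject-specific random part
y_star <- drop(X %*% beta) + rowSums(Z * B[subject_id, ])
y_star <- as.numeric(scale(y_star))                   # assumed scaling step

# Probability via the chosen CDF (here: residual = "logistic")
p <- plogis(y_star)

# Assumed split: 70% of subjects go to the training set
train_subjects <- sample(n_subject, size = round(0.7 * n_subject))
I <- subject_id %in% train_subjects

Y_train <- rbinom(sum(I), size = 1, prob = p[I])      # observed 0/1 outcomes
Y_test  <- p[!I]                                      # true probabilities
```

Because Y_test holds probabilities rather than labels, a fitted model's predicted probabilities can be compared with it directly, for example by mean squared error.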

Examples

# Simulate data with logistic link (Logit) and mixture of normal random effects
sim_bin <- simulation_prediction_binary(
  train_prop = 0.7,
  n_subject = 500,
  residual = "logistic",
  randeff = "MVN_mixture",
  seed = 123
)
