Generates synthetic longitudinal data with binary outcomes, designed for evaluating
classification and prediction models. The function creates a latent continuous variable based on
covariates and random effects, then converts it into binary outcomes through the CDF selected by
the residual argument.
simulation_prediction_binary(
  train_prop = 0.7,
  n_subject = 1000,
  n_obs_per_sub = 5,
  seed = NULL,
  nonlinear = FALSE,
  residual = c("normal", "logistic", "t3", "t2"),
  randeff = c("MVN", "MVN_mixture", "skewed_MVN", "MVT3", "MVT2")
)
A list containing the following components:
A vector of subject IDs for the training set.
A matrix of random predictors (time/intercept) for the training set.
A matrix of covariates for the training set.
A vector of observed binary outcomes (0 or 1) for the training set.
A vector of subject IDs for the testing set.
A matrix of random predictors for the testing set.
A matrix of covariates for the testing set.
A vector of true probabilities for the testing set. These represent the ground truth propensity scores (0 to 1) used for evaluation.
A matrix of covariates for the entire population.
A vector of true probabilities for the entire population.
A logical vector indicating which observations belong to the training set.
Duplicate of X_train, provided for convenience.
Vector of true probabilities for the training set (unlike Y_train which is binary).
train_prop: A numeric value between 0 and 1 indicating the proportion of the population to be used
for the training set. Default: 0.7.
n_subject: An integer specifying the total number of subjects in the population. Default: 1000.
n_obs_per_sub: An integer specifying the number of observations per subject. Default: 5.
seed: An optional integer for setting the random seed to ensure reproducibility. Default: NULL.
nonlinear: A logical value. If TRUE, the latent variable is generated using a complex
nonlinear function of the covariates. If FALSE, it is a linear combination. Default: FALSE.
residual: A character string specifying the CDF (the inverse link function) used to map the latent variable to probabilities. In a Generalized Linear Mixed Model (GLMM) context, this choice corresponds to the assumed error distribution of the latent variable:
"normal": Uses the standard normal CDF (Probit link).
"logistic": Uses the logistic CDF (Logit link).
"t3": Uses the Student's t (df=3) CDF.
"t2": Uses the Student's t (df=2) CDF.
randeff: A character string specifying the distribution of the random effects added to the latent variable. Options are:
"MVN": Multivariate Normal distribution.
"MVN_mixture": Mixture of Multivariate Normal distributions.
"skewed_MVN": Multivariate Skew-normal distribution.
"MVT3": Multivariate t-distribution with 3 degrees of freedom.
"MVT2": Multivariate t-distribution with 2 degrees of freedom.
The function simulates a latent continuous variable \(Y^*\) based on fixed effects (linear or nonlinear X)
and random effects (Z * Bi). This latent variable is scaled and then transformed into a probability \(p\)
using the CDF specified by residual.
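To make the role of residual concrete, the sketch below applies the four CDFs named in the argument description to the same latent values, using base R only. It is an illustration, not the package's internal code: the grid of latent values is arbitrary, and the scaling step the function applies before the transformation is not reproduced here.

```r
# Arbitrary latent values, chosen only for illustration
latent <- c(-3, -1, 0, 1, 3)

# One column per residual option: each is a CDF applied to the latent value
probs <- cbind(
  normal   = pnorm(latent),       # standard normal CDF (probit)
  logistic = plogis(latent),      # logistic CDF (logit)
  t3       = pt(latent, df = 3),  # Student's t CDF, 3 degrees of freedom
  t2       = pt(latent, df = 2)   # Student's t CDF, 2 degrees of freedom
)
rownames(probs) <- latent

round(probs, 3)
```

All four CDFs map a latent value of 0 to a probability of 0.5. They differ in the tails: the t-based options keep probabilities further from 0 and 1 at extreme latent values than the normal CDF does.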
For the training set, the observed outcome Y_train is sampled from a Bernoulli distribution
with probability \(p\). For the testing set, the function returns the probability \(p\) itself (Y_test),
allowing for precise evaluation of the model's ability to estimate propensity scores or risk.
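The generating process described above can be sketched in base R as follows. This is a simplified stand-in under stated assumptions, not the function's actual implementation: the single covariate, the coefficient 0.8, the independent normal random intercept and slope, the standardization used as the "scaling" step, and the subject-level train/test split are all hypothetical choices made for the illustration.

```r
set.seed(42)
n_subject     <- 200
n_obs_per_sub <- 5

# Long-format layout: one row per subject-by-time observation
id   <- rep(seq_len(n_subject), each = n_obs_per_sub)
time <- rep(seq_len(n_obs_per_sub) - 1, times = n_subject)
x    <- rnorm(length(id))          # one hypothetical covariate

# Subject-level random intercept and slope (assumed independent normals)
b0 <- rnorm(n_subject, sd = 1)
b1 <- rnorm(n_subject, sd = 0.3)

# Latent variable: fixed effect plus random effects (Z * b_i)
y_star <- 0.8 * x + b0[id] + b1[id] * time
y_star <- as.numeric(scale(y_star))  # assumed form of the scaling step

# Probability via the chosen CDF (here the logistic CDF)
p <- plogis(y_star)

# Assumed subject-level split: 70% of subjects go to the training set
train_ids <- sample(n_subject, size = round(0.7 * n_subject))
is_train  <- id %in% train_ids

# Training outcomes are Bernoulli draws; testing "outcomes" are the true p
y_train <- rbinom(sum(is_train), size = 1, prob = p[is_train])
y_test  <- p[!is_train]
```

Keeping the true probabilities for the testing rows means a fitted model can be scored against p directly rather than against a noisy 0/1 realization of it.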
# Simulate data with logistic link (Logit) and mixture of normal random effects
sim_bin <- simulation_prediction_binary(
  train_prop = 0.7,
  n_subject = 500,
  residual = "logistic",
  randeff = "MVN_mixture",
  seed = 123
)
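Because the testing set carries true probabilities rather than binary labels, predictions can be scored by their distance from those probabilities. The sketch below shows one way to do that. The helper prob_rmse is hypothetical (it is not part of the package), and the two vectors are made-up stand-ins so the snippet runs on its own.

```r
# Hypothetical helper: root mean squared error between predicted and
# true probabilities
prob_rmse <- function(p_hat, p_true) {
  stopifnot(length(p_hat) == length(p_true))
  sqrt(mean((p_hat - p_true)^2))
}

# Made-up stand-ins; in practice p_true would be the true testing
# probabilities (referred to as Y_test in the text above) and p_hat the
# predictions of a fitted model for the same rows
p_true <- c(0.10, 0.40, 0.75, 0.90)
p_hat  <- c(0.20, 0.35, 0.70, 0.95)

prob_rmse(p_hat, p_true)
```

Other distance measures, such as mean absolute error, could be substituted in the same way. The choice of RMSE here is only for illustration.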