Generates synthetic multivariate categorical data from a latent class model with L latent classes.
Each observed variable follows a multinomial distribution within classes, with flexible control over
class separation via the IQ parameter and class size distributions.
sim.LCA(
  N = 1000,
  I = 10,
  L = 3,
  poly.value = 5,
  IQ = "random",
  distribution = "random",
  params = NULL,
  is.sort = TRUE
)
A list containing:
Integer matrix (\(N \times I\)) of simulated observations. Rows are observations (named "O1", "O2", ...),
columns are variables named "I1", "I2", ... Values range from 0 to poly.value[i]-1.
Array (\(L \times I \times K\)) of true class-specific category probabilities,
where \(K = \text{max}(poly.value)\) (i.e., the maximum number of categories across variables).
Dimensions: classes x variables x categories.
Note: For variables with poly.value[i] < K, unused category dimensions contain NA.
Dimension names: "L1", "L2", ... (classes); "I1", "I2", ... (variables);
"poly0", "poly1", ... (categories).
Integer vector (length \(N\)) of true class assignments (1 to L). Named with observation IDs (e.g., "O1").
Numeric vector (length \(L\)) of true class proportions. Named with class labels (e.g., "L1").
Integer vector (length \(I\)) specifying number of categories per variable.
Binary matrix (\(N \times L\)) of true class membership indicators (one-hot encoded).
Row i, column l = 1 if observation i belongs to class l, else 0.
Row/column names match Z and class labels.
A list containing all input arguments.
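As a small illustration of the one-hot membership matrix described above, the following base R sketch derives such a matrix from a vector of class assignments. The object names (`Z`, `L`, `membership`) and the toy assignments are illustrative assumptions, not necessarily the names used in the returned list.

```r
# Hypothetical class assignments for 5 observations and L = 3 classes
L <- 3
Z <- c(O1 = 1L, O2 = 3L, O3 = 2L, O4 = 3L, O5 = 1L)

# Row i, column l is 1 if observation i belongs to class l, else 0
membership <- outer(Z, seq_len(L), FUN = "==") * 1L
dimnames(membership) <- list(names(Z), paste0("L", seq_len(L)))

membership
rowSums(membership)  # each observation belongs to exactly one class
```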
N: Integer; total number of observations to simulate. Must be \(> L\). Default: 1000.
I: Integer; number of categorical observed variables. Must be \(\geq 1\). Default: 10.
L: Integer; number of latent classes. Must be \(\geq 2\) when IQ is numeric. Default: 3.
poly.value: Integer or integer vector; number of categories (levels) for each observed variable.
If scalar, all variables share the same number of categories. If vector, must have length I.
Minimum valid value is 2 when IQ is numeric. Default: 5.
IQ: Character or numeric; controls category probability distributions:
"random" (default): Dirichlet-distributed probabilities (\(\alpha=3\)).
Numeric in \((0.5, 1)\): Forces high discriminative power (see details in section below).
distribution: Character; distribution of class sizes. Options: "random" (default) or "uniform".
params: List with fixed parameters for simulation:
par: \(L \times I \times K_{\max}\) array of conditional response probabilities per latent class.
P.Z: Vector of length \(L\) with latent class prior probabilities.
Z: Vector of length \(N\) containing the latent classes of the observations. A fixed Z
is used directly to simulate data only when P.Z is NULL and Z is a vector
of length \(N\).
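A minimal sketch of how such a `params` list might be assembled in base R is shown here. It assumes that unused category slots can be left as NA, mirroring the layout of the returned probability array; that assumption, the toy probabilities, and the commented-out call are illustrative and have not been checked against the package.

```r
L <- 2
I <- 3
poly.value <- c(2, 3, 2)
K <- max(poly.value)  # K_max: largest number of categories

# L x I x K_max array; slots beyond poly.value[i] stay NA (assumed convention)
par <- array(
  NA_real_,
  dim = c(L, I, K),
  dimnames = list(
    paste0("L", seq_len(L)),
    paste0("I", seq_len(I)),
    paste0("poly", seq_len(K) - 1)
  )
)
par["L1", "I1", 1:2] <- c(0.8, 0.2)
par["L2", "I1", 1:2] <- c(0.3, 0.7)
par["L1", "I2", 1:3] <- c(0.6, 0.3, 0.1)
par["L2", "I2", 1:3] <- c(0.1, 0.2, 0.7)
par["L1", "I3", 1:2] <- c(0.9, 0.1)
par["L2", "I3", 1:2] <- c(0.2, 0.8)

params <- list(par = par, P.Z = c(0.6, 0.4))

# Sanity check: each class-by-variable slice sums to 1 over its used categories
apply(par, c(1, 2), sum, na.rm = TRUE)

# Assumed call; requires the package that provides sim.LCA
# sim_fixed <- sim.LCA(N = 500, I = I, L = L, poly.value = poly.value, params = params)
```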
is.sort: Logical. If TRUE (default), the latent classes are ordered by descending P.Z,
and all other parameters are adjusted to match the reordered latent classes.
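The reordering can be pictured with a short base R sketch. This is an illustration under assumptions (toy values, a single-variable probability matrix, and relabeling of classes as "L1", "L2", ... after sorting), not the package's internal code.

```r
# Toy inputs: class proportions, class assignments, and one variable's
# class-by-category probabilities (all hypothetical)
P.Z <- c(L1 = 0.2, L2 = 0.5, L3 = 0.3)
Z <- c(1L, 2L, 2L, 3L, 1L)
par <- matrix(
  c(0.9, 0.1,
    0.4, 0.6,
    0.2, 0.8),
  nrow = 3, byrow = TRUE,
  dimnames = list(names(P.Z), c("poly0", "poly1"))
)

# Order classes by descending proportion: old class 2, then 3, then 1
ord <- order(P.Z, decreasing = TRUE)

P.Z.sorted <- P.Z[ord]
names(P.Z.sorted) <- paste0("L", seq_along(ord))

par.sorted <- par[ord, , drop = FALSE]
rownames(par.sorted) <- names(P.Z.sorted)

# Relabel assignments: the new label is the position of the old label in ord
Z.sorted <- match(Z, ord)

P.Z.sorted
Z.sorted
```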
Controls the discriminative power of observed variables:
IQ = "random" (default): Category probabilities for each variable-class combination are drawn from a symmetric Dirichlet distribution (\(\alpha = 3\)), resulting in moderate class separation.
IQ = numeric (0.5 < IQ < 1): Forces high discriminative power for each variable:
For each variable, two categories per class are assigned extreme probabilities: one category gets probability \(IQ\), another gets \(1-IQ\).
Remaining categories share the residual probability \(1 - IQ - (1-IQ) = 0\).
Note: This requires poly.value >= 2 for all variables.
Category assignments are randomized within classes to avoid structural patterns.
Higher IQ values (closer to 1) yield stronger class separation but increase simulation failure risk.
"random" (default): Class proportions drawn from a Dirichlet distribution (\(\alpha = 3\) for all classes),
ensuring no empty classes. Sizes are rounded to integers with adjustment for exact N.
"uniform": Equal probability of class membership (\(1/L\) per class), sampled with replacement.
May produce empty classes if N is small relative to L.
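The two class-size options can be sketched in base R as follows. The Dirichlet draw uses normalized Gamma variates, and the rounding adjustment shown here (adding the rounding error to the largest class) is an assumption for illustration; the package may adjust sizes differently.

```r
set.seed(123)
N <- 100
L <- 3

# "random": Dirichlet(3, ..., 3) proportions via normalized Gamma draws
g <- rgamma(L, shape = 3)
p <- g / sum(g)

# Round to integer class sizes, then absorb the rounding error in the
# largest class so that the sizes sum to exactly N (assumed adjustment)
sizes <- round(p * N)
j <- which.max(sizes)
sizes[j] <- sizes[j] + (N - sum(sizes))
Z_random <- sample(rep(seq_len(L), times = sizes))

# "uniform": each observation is assigned to a class with probability 1/L
Z_uniform <- sample(seq_len(L), size = N, replace = TRUE)

table(Z_random)
table(factor(Z_uniform, levels = seq_len(L)))  # zero counts would show empty classes
```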
The simulation enforces a critical constraint: every category of every observed variable must appear at least once in the dataset. If initial generation violates this (e.g., a rare category is missing), parameters and responses are regenerated until satisfied. This ensures compatibility with standard LCA estimation.
Probability Generation:
Dirichlet Sampling (IQ="random"):
For each variable-class combination, probabilities are drawn from
\(\text{Dirichlet}(\alpha_1=3, \dots, \alpha_k=3)\) where \(k = \text{poly.value}[i]\).
High-Discrimination Mode (IQ=numeric):
For each variable i:
Generate special probabilities par.special of length L containing:
\(IQ\), \(1-IQ\), and \(L-2\) values uniformly sampled from \([1-IQ, IQ]\).
For each class l, assign par.special[l] to one category, distribute
remaining probability \(1 - \text{par.special}[l]\) uniformly (via Dirichlet) across other categories.
Shuffle category assignments to avoid position bias.
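The two probability-generation modes above can be sketched in base R. This is a hedged illustration, not the package's implementation: the helper names (`rdirichlet1`, `gen_item_random`, `gen_item_iq`) are hypothetical, the concentration used for the residual mass (`alpha_rest`) is an assumption, and the sketch places the special probability in the same category for every class before shuffling category positions, which the text does not confirm.

```r
# One Dirichlet(alpha, ..., alpha) draw of length k via normalized Gamma variates
rdirichlet1 <- function(k, alpha = 3) {
  g <- rgamma(k, shape = alpha)
  g / sum(g)
}

# IQ = "random": one Dirichlet(3, ..., 3) row per class (L x k matrix)
gen_item_random <- function(L, k) {
  t(replicate(L, rdirichlet1(k, alpha = 3)))
}

# IQ numeric: special probabilities IQ, 1 - IQ, and L - 2 uniform draws
gen_item_iq <- function(L, k, IQ, alpha_rest = 1) {
  stopifnot(IQ > 0.5, IQ < 1, L >= 2, k >= 2)
  par.special <- sample(c(IQ, 1 - IQ, runif(L - 2, min = 1 - IQ, max = IQ)))
  probs <- matrix(0, nrow = L, ncol = k)
  for (l in seq_len(L)) {
    # Spread the remaining mass over the other k - 1 categories
    rest <- (1 - par.special[l]) * rdirichlet1(k - 1, alpha = alpha_rest)
    probs[l, ] <- c(par.special[l], rest)
  }
  # Shuffle category positions to avoid position bias
  probs[, sample(k), drop = FALSE]
}

set.seed(1)
round(gen_item_random(L = 3, k = 4), 3)
round(gen_item_iq(L = 3, k = 4, IQ = 0.85), 3)
```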
Data Generation:
Class assignments Z are generated first according to distribution.
For each observation p and variable i:
Retrieve cumulative probabilities for class Z[p]
Draw uniform random number \(u \sim \text{Uniform}(0,1)\)
Assign category k where \(P(\text{category} \leq k-1) < u \leq P(\text{category} \leq k)\)
Entire dataset is regenerated if any category of any variable has zero observations.
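The data-generation steps above amount to inverse-CDF sampling followed by a coverage check. A minimal base R sketch, with hypothetical helper names (`simulate_responses`, `all_categories_seen`) and toy probabilities, might look like this; here only the responses are redrawn on failure, whereas the text notes that parameters may also be regenerated.

```r
# Draw responses (coded 0 .. poly.value[i] - 1) by inverse-CDF sampling
simulate_responses <- function(Z, par, poly.value) {
  N <- length(Z)
  I <- length(poly.value)
  resp <- matrix(0L, nrow = N, ncol = I)
  for (p in seq_len(N)) {
    for (i in seq_len(I)) {
      cum <- cumsum(par[Z[p], i, seq_len(poly.value[i])])
      u <- runif(1)
      # Number of cumulative probabilities below u gives the 0-based category;
      # the cap guards against floating-point shortfall in the last cumsum
      resp[p, i] <- as.integer(min(sum(u > cum), poly.value[i] - 1))
    }
  }
  resp
}

# TRUE if every category of every variable appears at least once
all_categories_seen <- function(resp, poly.value) {
  all(vapply(seq_along(poly.value), function(i) {
    all((seq_len(poly.value[i]) - 1) %in% resp[, i])
  }, logical(1)))
}

set.seed(42)
poly.value <- c(2L, 3L)
par <- array(NA_real_, dim = c(2, 2, 3))  # classes x variables x categories
par[1, 1, 1:2] <- c(0.7, 0.3)
par[2, 1, 1:2] <- c(0.2, 0.8)
par[1, 2, ] <- c(0.5, 0.3, 0.2)
par[2, 2, ] <- c(0.1, 0.2, 0.7)
Z <- sample(1:2, size = 200, replace = TRUE)

# Redraw until every category is observed (bounded number of attempts)
for (attempt in seq_len(100)) {
  resp <- simulate_responses(Z, par, poly.value)
  if (all_categories_seen(resp, poly.value)) break
}
head(resp)
```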
Critical Constraints:
When IQ is numeric: \(0.5 < IQ < 1\) and min(poly.value) >= 2
N must be sufficiently large to observe all categories, especially when IQ is high
or poly.value is large. Simulation may fail for small N.
For distribution="uniform", empty classes are possible when N is small relative to L.
# Example 1: Default settings (moderate separation, random class sizes)
sim_data <- sim.LCA(N = 30, I = 5, L = 3)
# Example 2: High-discrimination items (IQ=0.85), uniform class sizes
sim_high_disc <- sim.LCA(
  N = 30,
  I = 4,
  L = 2,
  poly.value = c(3, 4, 3, 5), # Number of categories per variable
  IQ = 0.85,
  distribution = "uniform"
)
# Example 3: Binary items (poly.value=2) with high separation
sim_binary <- sim.LCA(N = 300, I = 10, L = 2, poly.value = 2, IQ = 0.9)