sim.LCA: Simulate Data for Latent Class Analysis

Description

Generates synthetic multivariate categorical data from a latent class model with L latent classes. Each observed variable follows a multinomial distribution within classes, with flexible control over class separation via the IQ parameter and class size distributions.

Usage

sim.LCA(
  N = 1000,
  I = 10,
  L = 3,
  poly.value = 5,
  IQ = "random",
  distribution = "random",
  params = NULL,
  is.sort = TRUE
)

Value

A list containing:

response: Integer matrix (\(N \times I\)) of simulated observations. Rows are observations (named "O1", "O2", ...), columns are variables named "I1", "I2", ... Values range from 0 to poly.value[i]-1.
par: Array (\(L \times I \times K\)) of true class-specific category probabilities, where \(K = \text{max}(poly.value)\) (i.e., the maximum number of categories across variables). Dimensions: classes x variables x categories. Note: For variables with poly.value[i] < K, unused category dimensions contain NA. Dimension names: "L1", "L2", ... (classes); "I1", "I2", ... (variables); "poly0", "poly1", ... (categories).
Z: Integer vector (length \(N\)) of true class assignments (1 to L). Named with observation IDs (e.g., "O1").
P.Z: Numeric vector (length \(L\)) of true class proportions. Named with class labels (e.g., "L1").
poly.value: Integer vector (length \(I\)) specifying number of categories per variable.
P.Z.Xn: Binary matrix (\(N \times L\)) of true class membership indicators (one-hot encoded). Row i, column l = 1 if observation i belongs to class l, else 0. Row/column names match Z and class labels.
arguments: A list containing all input arguments.
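
The relationship between Z and P.Z.Xn described above can be illustrated with a small base R sketch. The toy values of Z and L are hypothetical, and the construction shown is only one way to build a one-hot matrix, not necessarily how the function does it internally.

```r
# Hypothetical toy assignments for N = 5 observations and L = 3 classes
Z <- c(O1 = 1L, O2 = 3L, O3 = 2L, O4 = 1L, O5 = 3L)
L <- 3L

# Row i, column l is 1 when Z[i] == l, else 0
P.Z.Xn <- outer(Z, seq_len(L), FUN = "==") * 1L
dimnames(P.Z.Xn) <- list(names(Z), paste0("L", seq_len(L)))

# The class label can be recovered as the column holding the 1
Z.back <- max.col(P.Z.Xn)
```

Each row of P.Z.Xn sums to 1, so Z and P.Z.Xn carry the same information in two layouts.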

Arguments

N

Integer; total number of observations to simulate. Must be \(> L\). Default: 1000.

I

Integer; number of categorical observed variables. Must be \(\geq 1\). Default: 10.

L

Integer; number of latent classes. Must be \(\geq 2\) when IQ is numeric. Default: 3.

poly.value

Integer or integer vector; number of categories (levels) for each observed variable. If scalar, all variables share the same number of categories. If vector, must have length I. Minimum valid value is 2 when IQ is numeric. Default: 5.

IQ

Character or numeric; controls category probability distributions:

"random": (default) Dirichlet-distributed probabilities (\(\alpha=3\)).

Numeric in \((0.5, 1)\): Forces high discriminative power (see the Item Quality (IQ) Parameter section below).

distribution

Character; distribution of class sizes. Options: "random" (default) or "uniform".

params

List with fixed parameters for simulation:

par: \(L \times I \times K_{\max}\) array of conditional response probabilities per latent class.

P.Z: Vector of length \(L\) with latent class prior probabilities.

Z: Vector of length \(N\) giving the latent class of each observation. A fixed Z is used directly to simulate data only when P.Z is NULL and Z is a vector of length \(N\).

is.sort

A logical value. If TRUE (Default), the latent classes will be ordered in descending order according to P.Z. All other parameters will be adjusted accordingly based on the reordered latent classes.
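
As a rough illustration of what reordering by P.Z involves, the sketch below relabels toy classes so that the largest class becomes class 1. All values are hypothetical, and the relabeling via order() and match() is an assumption about how such a sort could be done, not the function's actual code.

```r
# Hypothetical class proportions, assignments, and a small L x I x K array
P.Z <- c(L1 = 0.2, L2 = 0.5, L3 = 0.3)
Z   <- c(1L, 2L, 2L, 3L, 1L, 2L)
par <- array(seq_len(12) / 12, dim = c(3, 2, 2))

# Old class indices listed from largest to smallest proportion
ord <- order(P.Z, decreasing = TRUE)

# Reorder proportions and relabel them L1, L2, ... in the new order
P.Z.sorted <- setNames(P.Z[ord], paste0("L", seq_along(ord)))

# Map each old label to its new position, and reorder the class dimension of par
Z.sorted   <- match(Z, ord)
par.sorted <- par[ord, , , drop = FALSE]
```

After this step, every object indexed by class (P.Z, Z, par) refers to the same reordered labels.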

Item Quality (IQ) Parameter

Controls the discriminative power of observed variables:

IQ = "random"

(Default) Category probabilities for each variable-class combination are drawn from a symmetric Dirichlet distribution (\(\alpha = 3\)), resulting in moderate class separation.

IQ = numeric

(0.5 < IQ < 1) Forces high discriminative power for each variable:

For each variable, two categories per class are assigned extreme probabilities: one category gets probability \(IQ\), another gets \(1-IQ\).
Remaining categories share the residual probability \(1 - IQ - (1-IQ) = 0\). Note: This requires poly.value >= 2 for all variables.
Category assignments are randomized within classes to avoid structural patterns.

Higher IQ values (closer to 1) yield stronger class separation but increase the risk that the simulation fails, because low-probability categories may go unobserved (see Response Validation).
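
For the IQ = "random" case, a symmetric Dirichlet draw can be sketched in base R by normalizing independent gamma variates. The helper name rdirichlet_sym is hypothetical and not part of the package; this is a minimal sketch of the sampling idea only.

```r
# Hypothetical helper: one draw from a symmetric Dirichlet(alpha, ..., alpha)
rdirichlet_sym <- function(k, alpha = 3) {
  g <- rgamma(k, shape = alpha, rate = 1)
  g / sum(g)
}

set.seed(1)
L <- 3      # number of classes
poly <- 5   # number of categories for one variable

# One probability vector per class for a single variable (L x poly matrix)
par_item <- t(replicate(L, rdirichlet_sym(poly)))
dimnames(par_item) <- list(paste0("L", seq_len(L)),
                           paste0("poly", 0:(poly - 1)))
```

With alpha = 3 the draws tend to stay away from 0 and 1, which is consistent with the moderate separation described for this mode.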

Class Size Distribution

"random": (Default) Class proportions drawn from Dirichlet distribution (\(\alpha = 3\) for all classes), ensuring no empty classes. Sizes are rounded to integers with adjustment for exact N.

"uniform"

Equal probability of class membership (\(1/L\) per class), sampled with replacement. May produce empty classes if N is small relative to L.
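
The two options can be contrasted with a short base R sketch. The rounding adjustment (forcing at least one observation per class and absorbing the rounding drift in the largest class) is an assumption made for illustration; the function's own adjustment rule may differ.

```r
set.seed(2)
N <- 100
L <- 3

# "random": proportions from Dirichlet(3, ..., 3) via normalized gamma draws
g <- rgamma(L, shape = 3)
P.Z <- g / sum(g)

# Round to integer sizes; assumption: keep every class non-empty
sizes <- pmax(round(P.Z * N), 1)

# Assumption: absorb any rounding drift in the largest class so sizes sum to N
idx <- which.max(sizes)
sizes[idx] <- sizes[idx] + (N - sum(sizes))
Z.random <- sample(rep(seq_len(L), times = sizes))

# "uniform": each observation drawn independently with probability 1/L per class
Z.uniform <- sample(seq_len(L), size = N, replace = TRUE)
```

Under "random" the class sizes are fixed before shuffling, whereas under "uniform" they are themselves random, which is why a class can end up empty.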

Response Validation

The simulation enforces a critical constraint: every category of every observed variable must appear at least once in the dataset. If initial generation violates this (e.g., a rare category is missing), parameters and responses are regenerated until satisfied. This ensures compatibility with standard LCA estimation.
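
The check-and-regenerate idea can be sketched as follows. The generator draw_responses is a hypothetical stand-in that samples uniformly over categories, and the retry cap max.tries is an assumption added to keep the loop finite; neither is taken from the package.

```r
set.seed(3)
N <- 200
poly.value <- c(2, 3, 3, 4)

# Hypothetical stand-in generator: each variable drawn uniformly over its categories
draw_responses <- function() {
  sapply(poly.value, function(k) sample(0:(k - 1), N, replace = TRUE))
}

# TRUE only if every category 0..poly.value[i]-1 appears in column i
all_categories_seen <- function(response, poly.value) {
  all(vapply(seq_along(poly.value), function(i) {
    all((0:(poly.value[i] - 1)) %in% response[, i])
  }, logical(1)))
}

max.tries <- 100  # assumption: cap the number of regeneration attempts
for (attempt in seq_len(max.tries)) {
  response <- draw_responses()
  if (all_categories_seen(response, poly.value)) break
}
```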

Details

Probability Generation:

Dirichlet Sampling (IQ="random"): For each variable-class combination, probabilities are drawn from \(\text{Dirichlet}(\alpha_1=3, \dots, \alpha_k=3)\) where \(k = \text{poly.value}[i]\).
High-Discrimination Mode (IQ=numeric): For each variable i:
1. Generate special probabilities par.special of length L containing: \(IQ\), \(1-IQ\), and \(L-2\) values uniformly sampled from \([1-IQ, IQ]\).
2. For each class l, assign par.special[l] to one category, distribute remaining probability \(1 - \text{par.special}[l]\) uniformly (via Dirichlet) across other categories.
3. Shuffle category assignments to avoid position bias.
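
The three steps of the high-discrimination mode can be sketched for a single variable as below. This is an illustrative reading of the steps, not the package's code; in particular, the Dirichlet concentration used for the remaining categories (shape = 3) is an assumption, since the text does not state it.

```r
set.seed(4)
IQ <- 0.85
L  <- 3   # number of classes
k  <- 4   # poly.value[i] for the variable being generated

# Step 1: one special probability per class: IQ, 1 - IQ, and L - 2 uniform values
par.special <- sample(c(IQ, 1 - IQ, runif(L - 2, min = 1 - IQ, max = IQ)))

par_item <- matrix(NA_real_, nrow = L, ncol = k)
for (l in seq_len(L)) {
  # Step 2: split the remaining mass over the other k - 1 categories
  # (assumption: symmetric Dirichlet with alpha = 3, via normalized gammas)
  g <- rgamma(k - 1, shape = 3)
  rest <- (1 - par.special[l]) * g / sum(g)

  # Step 3: place the special probability at a random category position
  pos <- sample.int(k, 1)
  par_item[l, pos]  <- par.special[l]
  par_item[l, -pos] <- rest
}
```

Each row of par_item is a valid probability vector, with one class receiving IQ and another receiving 1 - IQ for their special categories.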

Data Generation:

Class assignments Z are generated first according to distribution.
For each observation p and variable i:
1. Retrieve cumulative probabilities for class Z[p]
2. Draw uniform random number \(u \sim \text{Uniform}(0,1)\)
3. Assign category k where \(P(\text{category} \leq k-1) < u \leq P(\text{category} \leq k)\)
Entire dataset is regenerated if any category of any variable has zero observations.
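
The per-observation draw described above is inverse-CDF sampling. A minimal base R sketch for one variable follows; the probabilities, assignments, and the helper name draw_category are hypothetical.

```r
set.seed(5)

# Hypothetical class-specific probabilities for one variable with 3 categories
par_item <- rbind(L1 = c(0.7, 0.2, 0.1),
                  L2 = c(0.1, 0.3, 0.6))
Z <- c(1L, 2L, 2L, 1L, 2L)

draw_category <- function(prob) {
  cum <- cumsum(prob)   # cumulative probabilities for the observation's class
  u <- runif(1)         # u ~ Uniform(0, 1)
  # First category whose cumulative probability reaches u; falls back to the
  # last category if rounding leaves cum[length(prob)] slightly below u.
  # Categories are coded from 0, hence the - 1L.
  min(which(u <= cum), length(prob)) - 1L
}

response <- vapply(Z, function(z) draw_category(par_item[z, ]), integer(1))
```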

Critical Constraints:

When IQ is numeric: \(0.5 < IQ < 1\) and min(poly.value) >= 2
N must be sufficiently large to observe all categories, especially when IQ is high or poly.value is large. Simulation may fail for small N.
For distribution="uniform", empty classes are possible when N is small relative to L.
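
To get a feel for how N affects the chance of an unobserved category, one can use a simple approximation: if a category has marginal probability q and observations are treated as independent, the probability that it never appears in N draws is \((1-q)^N\). The helper p_missing is hypothetical, and the independence assumption ignores the class structure, so this is a rough guide only.

```r
# Probability that a category with marginal probability q is absent in N draws,
# assuming independent observations (an approximation)
p_missing <- function(q, N) (1 - q)^N

# Compare a rare category (q = 0.05) across several sample sizes
p_missing(q = 0.05, N = c(30, 100, 300))
```

The value shrinks geometrically in N, which is why small samples combined with rare categories make regeneration or failure more likely.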

Examples


# Example 1: Default settings (moderate separation, random class sizes)
sim_data <- sim.LCA(N = 30, I = 5, L = 3)

# Example 2: High-discrimination items (IQ=0.85), uniform class sizes
sim_high_disc <- sim.LCA(
  N = 30,
  I = 4,
  L = 2,
  poly.value = c(3,4,3,5),  # Variable category counts
  IQ = 0.85,
  distribution = "uniform"
)

# Example 3: Binary items (poly.value=2) with high separation
sim_binary <- sim.LCA(N = 300, I = 10, L = 2, poly.value = 2, IQ = 0.9)
