gendata_simu_multi: Generate Simulated Multi-Study Factor Analysis Data

Description

Generate simulated data for multi-study factor analysis under different error distributions. The data follows a factor model with common factors (shared across studies) and study-specific factors (unique to each study), plus noise.

Usage

gendata_simu_multi(
  seed = 1,
  nvec = c(100, 300),
  p = 50,
  q = 3,
  qs = rep(2, length(nvec)),
  err.type = c("gaussian", "mvt", "exp", "t", "mixnorm", "pareto"),
  rho = c(1, 1),
  sigma2_eps = 0.1,
  nu = 1
)

Value

A list containing the simulated data and true parameter values (for model evaluation):

Xlist: List of matrices. Each element is a data matrix (ns × p) for study s, where ns = `nvec[s]` (sample size of study s), p = number of variables.
mu0: Matrix (p × S). True mean vector for each variable (row) in each study (column), where S = `length(nvec)` (number of studies).
A0: Matrix (p × q). True common factor loadings (shared across all studies), constructed as the first q columns of an orthogonal matrix (`A1`) generated internally. This is the "ground truth" that modeling functions (e.g., MultiRFM) aim to estimate.
Blist0: List of matrices. Each element is a true study-specific factor loadings matrix (p × qs[s]) for study s. Constructed from orthogonal matrices (similar to `A0`) and scaled by `rho[2]`. Another "ground truth" for model evaluation.
Flist: List of matrices. Each element is a true common factor score matrix (ns × q) for study s, generated from a standard normal distribution. These are the latent common factor values used to generate `Xlist`.
Hlist: List of matrices. Each element is a true study-specific factor score matrix (ns × qs[s]) for study s, generated from a standard normal distribution. Latent specific factor values used to generate `Xlist`.
q: Integer. Number of common factors used for data generation (same as input `q`, for reference).
qs: Numeric vector. Number of study-specific factors used for data generation (same as input `qs`, for reference).
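
The shapes listed above can be checked mechanically. The helper below is a hypothetical sketch (`check_sim_dims` is not part of the package); it relies only on the field names and dimensions documented in this section and is demonstrated on a hand-built mock list, so it runs without the package installed.

```r
# Hypothetical helper (not part of the package): verifies that a list has
# the documented fields with the documented dimensions.
check_sim_dims <- function(sim, nvec, p) {
  S <- length(nvec)
  stopifnot(
    length(sim$Xlist) == S,
    all(dim(sim$mu0) == c(p, S)),
    all(dim(sim$A0) == c(p, sim$q))
  )
  for (s in seq_len(S)) {
    stopifnot(
      all(dim(sim$Xlist[[s]])  == c(nvec[s], p)),
      all(dim(sim$Blist0[[s]]) == c(p, sim$qs[s])),
      all(dim(sim$Flist[[s]])  == c(nvec[s], sim$q)),
      all(dim(sim$Hlist[[s]])  == c(nvec[s], sim$qs[s]))
    )
  }
  invisible(TRUE)
}

# Mock list with the documented layout (all zeros, for illustration only)
nvec <- c(4, 6); p <- 5; q <- 2; qs <- c(1, 1)
mock <- list(
  Xlist  = lapply(nvec, function(n) matrix(0, n, p)),
  mu0    = matrix(0, p, length(nvec)),
  A0     = matrix(0, p, q),
  Blist0 = lapply(qs, function(k) matrix(0, p, k)),
  Flist  = lapply(nvec, function(n) matrix(0, n, q)),
  Hlist  = Map(function(n, k) matrix(0, n, k), nvec, qs),
  q = q, qs = qs
)
check_sim_dims(mock, nvec, p)
```

With the package loaded, the same helper could be applied to the output of `gendata_simu_multi()`, passing the `nvec` and `p` used in the call.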

Arguments

seed

Integer, default = 1. Random seed for reproducibility of simulated data.

nvec

Numeric vector (length >= 2), default = `c(100, 300)`. Sample size of each study (e.g., `c(150, 200)` for 2 studies with 150 and 200 samples).

p

Integer, default = 50. Number of variables (features) in the data.

q

Integer, default = 3. Number of common factors (shared across all studies).

qs

Numeric vector with length equal to `length(nvec)`, default = `rep(2, length(nvec))`. Number of study-specific factors for each study (e.g., `c(2,2)` for 2 studies each with 2 specific factors).

err.type

Character, default = "gaussian". Error distribution type, one of:

- "gaussian": Gaussian (normal) distribution;

- "mvt": Multivariate t-distribution;

- "exp": Exponential distribution (centered to mean 0);

- "t": Univariate t-distribution (independent across variables);

- "mixnorm": Mixture of two normal distributions;

- "pareto": Pareto distribution (centered to mean 0).
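
To make the error types above concrete, here is a base-R sketch of how zero-centered noise of each kind could be drawn. It is an illustration only: the rate, shape, mixture weights, and degrees of freedom are assumptions chosen for the example, not the parameterization used inside `gendata_simu_multi()`. It also shows the difference between "t" (independent draws per cell) and "mvt" (one shared scaling per observation).

```r
set.seed(1)
n <- 1000; p <- 5
nu <- 3       # assumed degrees of freedom, for illustration
alpha <- 3    # assumed Pareto shape (> 1 so the mean exists)

# "exp": Exp(1) has mean 1, so subtracting 1 centers it at 0
e_exp <- matrix(rexp(n * p, rate = 1) - 1, n, p)

# "t": an independent univariate t draw in every cell
e_t <- matrix(rt(n * p, df = nu), n, p)

# "mvt": each Gaussian row is divided by one shared chi-square scale,
# so the coordinates within a row are dependent
z <- matrix(rnorm(n * p), n, p)
w <- rchisq(n, df = nu) / nu
e_mvt <- z / sqrt(w)   # w is recycled down the columns, i.e. one value per row

# "pareto": inverse-CDF sampling with scale 1; the mean is alpha / (alpha - 1)
pareto_mean <- alpha / (alpha - 1)
e_par <- matrix(runif(n * p)^(-1 / alpha) - pareto_mean, n, p)

# "mixnorm": equal-weight mixture of N(0, 1) and N(0, 9), which has mean 0
comp <- rbinom(n * p, 1, 0.5)
e_mix <- matrix(rnorm(n * p, sd = ifelse(comp == 1, 3, 1)), n, p)
```

Centering matters for the skewed cases ("exp", "pareto"): without it, the noise would shift the study means rather than act as zero-mean error.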

rho

Numeric vector of length 2, default = `c(1, 1)`. Scaling factors for the loadings:

- `rho[1]`: common factor loadings (matrix `A0`);

- `rho[2]`: study-specific factor loadings (matrix list `Blist0`).

sigma2_eps

Numeric, default = 0.1. Variance of the error term (controls noise level).

nu

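Integer, default = 1. Degrees of freedom for the t-distribution (used when `err.type` is "mvt" or "t"); ignored for other error distributions. A t-distribution with 1 degree of freedom is the Cauchy distribution, which has no finite mean or variance, so the default gives very heavy-tailed errors.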

Author

Wei Liu

Details

The simulated data follows the multi-study factor model:

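$$X_s = 1_{n_s}\mu_{0s}^\top + F_s A_0^\top + H_s B_{0s}^\top + \varepsilon_s, \qquad s = 1, \dots, S$$

Here $X_s$ is the $n_s \times p$ data matrix of study $s$ (`Xlist[[s]]`), $\mu_{0s}$ is column $s$ of `mu0`, $F_s$ and $H_s$ are the common and study-specific factor scores (`Flist[[s]]`, `Hlist[[s]]`), $A_0$ and $B_{0s}$ are the loadings (`A0`, `Blist0[[s]]`), and $\varepsilon_s$ is the $n_s \times p$ error matrix.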

True parameters (`A0`, `Blist0`, `mu0`) are generated with orthogonal constraints to ensure identifiability.
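
The model above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the package's implementation: orthonormal loadings are taken from the QR decomposition of a random Gaussian matrix, the means are drawn from a standard normal, and the errors are Gaussian. The helper `orth_cols` is hypothetical.

```r
set.seed(1)
nvec <- c(100, 300); p <- 50; q <- 3; qs <- c(2, 2)
rho <- c(1, 1); sigma2_eps <- 0.1
S <- length(nvec)

# Hypothetical helper: p x k matrix with orthonormal columns (via QR)
orth_cols <- function(p, k) qr.Q(qr(matrix(rnorm(p * k), p, k)))

A0     <- rho[1] * orth_cols(p, q)                          # p x q common loadings
Blist0 <- lapply(qs, function(k) rho[2] * orth_cols(p, k))  # p x qs[s] specific loadings
mu0    <- matrix(rnorm(p * S), p, S)                        # assumed mean distribution

Xlist <- lapply(seq_len(S), function(s) {
  n   <- nvec[s]
  Fs  <- matrix(rnorm(n * q), n, q)            # common factor scores
  Hs  <- matrix(rnorm(n * qs[s]), n, qs[s])    # study-specific factor scores
  eps <- matrix(rnorm(n * p, sd = sqrt(sigma2_eps)), n, p)
  # Each row gets the study mean, then the two low-rank parts and the noise
  matrix(mu0[, s], n, p, byrow = TRUE) +
    Fs %*% t(A0) + Hs %*% t(Blist0[[s]]) + eps
})

sapply(Xlist, dim)   # one column of (n_s, p) per study
```

Because the columns of `A0` are orthonormal when `rho[1] = 1`, `crossprod(A0)` is the identity matrix up to rounding error, which is the kind of constraint that makes the loadings identifiable up to rotation.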

Examples


# Example 1: Gaussian error (2 studies, 100/200 samples, 50 variables)
set.seed(123)
sim_data <- gendata_simu_multi(
  seed = 123,
  nvec = c(100, 200),
  p = 50,
  q = 3,          # 3 common factors
  qs = c(2, 2),   # 2 specific factors per study
  err.type = "gaussian",
  rho = c(1, 1),
  sigma2_eps = 0.1
)
str(sim_data)  # Check structure of simulated data

# Extract true parameters for model evaluation
true_A <- sim_data$A0        # True common loadings
true_B1 <- sim_data$Blist0[[1]]  # True specific loadings (study 1)
