simul.dedic.facmod: Generate synthetic data from a dedicated factor model

Description

This function simulates data from a dedicated factor model. The parameters of the model are either passed by the user or simulated by the function.

Usage

simul.dedic.facmod(N, dedic, alpha, sigma, R, R.corr = TRUE,
  max.corr = 0.85, R.max.trial = 1000)

Value

The function returns a data frame with N observations simulated from the corresponding dedicated factor model. The parameters used to generate the data are saved as attributes: dedic, alpha, sigma and R.

Arguments

N: Number of observations in data set.
dedic: Vector of indicators. The number of manifest variables is equal to the length of this vector, and the number of factors is equal to the number of unique nonzero elements. Each integer element indicates on which latent factor the corresponding variable loads uniquely.
alpha: Vector of factor loadings, should be of same length as dedic. If missing, values are simulated (see details below).
sigma: Idiosyncratic variances, should be of same length as dedic. If missing, values are simulated (see details below).
R: Covariance matrix of the latent factors. If missing, values are simulated (see details below).
R.corr: If TRUE, covariance matrix R is rescaled to be a correlation matrix.
max.corr: Maximum correlation allowed between the latent factors.
R.max.trial: Maximum number of trials allowed to sample from the truncated distribution of the covariance matrix of the latent factors (accept/reject sampling scheme, to make sure max.corr is not exceeded).

Author

Rémi Piatek <remi.piatek@gmail.com>

Details

The function simulates data from the following dedicated factor model, for $i = 1, ..., N$: $$Y_i = \alpha \theta_i + \epsilon_i$$ $$\theta_i \sim \mathcal{N}(0, R)$$ $$\epsilon_i \sim \mathcal{N}(0, \Sigma)$$ where the $K$-vector $\theta_i$ contains the latent factors, and $\alpha$ is the $(M \times K)$-matrix of factor loadings. Each row $m$ of $\alpha$ contains only zeros, except for the element in the column indicated by the $m$th element of dedic, which is equal to the $m$th element of alpha (denoted $\alpha_m^\Delta$ below). The $M$-vector $\epsilon_i$ is the vector of error terms, with $\Sigma = \mathrm{diag}(\mathtt{sigma})$. $M$ is equal to the length of the vector dedic, and $K$ is equal to the maximum value of this vector.
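The model equations above can be sketched in base R as follows. This is an illustrative reconstruction, not the package's internal code: the parameter values and the helper names A, theta, eps and Y are assumptions chosen for the example.

```r
set.seed(42)

N     <- 500
dedic <- c(1, 1, 1, 2, 2, 2)                 # M = 6 variables, K = 2 factors
alpha <- c(0.8, 0.6, -0.5, 0.7, 0.4, -0.6)   # one loading per manifest variable
sigma <- rep(0.5, length(dedic))             # idiosyncratic variances
R     <- matrix(c(1, 0.3, 0.3, 1), 2, 2)     # factor correlation matrix

M <- length(dedic)
K <- max(dedic)

# (M x K) loading matrix: row m is zero except in column dedic[m]
A <- matrix(0, M, K)
rows <- which(dedic > 0)
A[cbind(rows, dedic[rows])] <- alpha[rows]

# theta_i ~ N(0, R): standard normal draws times the upper Cholesky factor of R
theta <- matrix(rnorm(N * K), N, K) %*% chol(R)

# epsilon_i ~ N(0, diag(sigma)): scale each column by its standard deviation
eps <- matrix(rnorm(N * M), N, M) %*% diag(sqrt(sigma))

# Y_i = alpha theta_i + epsilon_i, stacked row-wise into an (N x M) matrix
Y <- theta %*% t(A) + eps
```

Each row of A has at most one nonzero entry, which is what makes the factor structure "dedicated".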

Only N and dedic are required; all other parameters can be missing, completely or partially. Missing values (NA) are independently sampled from the following distributions, for each manifest variable $m = 1, ..., M$:

Factor loadings: $$\alpha_m^\Delta = (-1)^{\phi_m}\sqrt{a_m}$$ $$\phi_m \sim \mathcal{B}er(0.5)$$ $$a_m \sim \mathcal{U}nif (0.04, 0.64)$$

Idiosyncratic variances: $$\sigma^2_m \sim \mathcal{U}nif (0.2, 0.8)$$

For the variables that do not load on any factors (i.e., for which the corresponding elements of dedic are equal to 0), it is specified that $\alpha_m^\Delta = 0$ and $\sigma^2_m = 1$.
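The default sampling of missing loadings and idiosyncratic variances described above could look like the following base R sketch. It only mirrors the stated distributions; the input vectors and the helper names na_a, na_s, phi and a are assumptions, and whether the package overrides user-supplied values for variables with dedic equal to 0 is not stated in this page (the sketch does so for simplicity).

```r
set.seed(123)

dedic <- c(1, 1, 2, 2, 0)          # last variable loads on no factor
alpha <- c(1, NA, NA, 0.5, NA)     # NA entries are to be sampled
sigma <- c(NA, 0.5, NA, NA, NA)

M <- length(dedic)

# factor loadings: random sign times square root of a Unif(0.04, 0.64) draw
na_a <- is.na(alpha)
phi  <- rbinom(M, size = 1, prob = 0.5)
a    <- runif(M, min = 0.04, max = 0.64)
alpha[na_a] <- ((-1)^phi * sqrt(a))[na_a]

# idiosyncratic variances: Unif(0.2, 0.8)
na_s <- is.na(sigma)
sigma[na_s] <- runif(sum(na_s), min = 0.2, max = 0.8)

# variables that do not load on any factor
alpha[dedic == 0] <- 0
sigma[dedic == 0] <- 1
```

Under these distributions, sampled loadings lie between 0.2 and 0.8 in absolute value, since they are square roots of draws from the interval (0.04, 0.64).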

Covariance matrix of the latent factors: $$\Omega \sim \mathcal{I}nv-\mathcal{W}ishart(K+5, I_K)$$ which is rescaled to be a correlation matrix if R.corr = TRUE: $$R = \Lambda^{-1/2} \Omega \Lambda^{-1/2}$$ $$\Lambda = diag(\Omega)$$

Note that the distribution of the covariance matrix is truncated such that all the off-diagonal elements of the implied correlation matrix $R$ are below max.corr in absolute value. The truncation is also applied if the covariance matrix is used instead of the correlation matrix (i.e., if R.corr = FALSE).
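The accept/reject scheme for the factor covariance matrix can be sketched as below. This is an assumption-laden illustration rather than the package's implementation: sample_R is a hypothetical helper, the inverse-Wishart draw is obtained by inverting a Wishart draw from stats::rWishart, and stopping with an error once R.max.trial is exhausted is an assumed behaviour.

```r
# Hypothetical helper: draw Omega ~ Inv-Wishart(K + 5, I_K), keep it only if
# all implied correlations are below max.corr in absolute value.
sample_R <- function(K, R.corr = TRUE, max.corr = 0.85, R.max.trial = 1000) {
  for (trial in seq_len(R.max.trial)) {
    # inverse of a Wishart(K + 5, I_K) draw is Inv-Wishart(K + 5, I_K)
    W     <- rWishart(1, df = K + 5, Sigma = diag(K))[, , 1]
    Omega <- solve(W)

    # implied correlation matrix: Lambda^{-1/2} Omega Lambda^{-1/2}
    Rcorr <- cov2cor(Omega)
    off   <- Rcorr[upper.tri(Rcorr)]

    # the truncation is checked on the correlations even if R.corr = FALSE
    if (all(abs(off) < max.corr)) {
      return(if (R.corr) Rcorr else Omega)
    }
  }
  stop("no admissible matrix found in ", R.max.trial, " trials")
}

set.seed(7)
R <- sample_R(K = 3)
```

A lower max.corr makes rejections more frequent, which is why a cap on the number of trials is useful.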

The distributions and the corresponding default values used to simulate the model parameters are specified as in the Monte Carlo study of Conti, Frühwirth-Schnatter, Heckman and Piatek (2014, abbreviated CFSHP), see Section 4.1 (p. 43).

References

G. Conti, S. Frühwirth-Schnatter, J.J. Heckman, R. Piatek (2014): "Bayesian Exploratory Factor Analysis", Journal of Econometrics, 183(1), pages 31-57, doi:10.1016/j.jeconom.2014.06.008.

Examples

# generate 1000 observations from model with 4 factors and 20 variables
# (5 variables loading on each factor)
dat <- simul.dedic.facmod(N = 1000, dedic = rep(1:4, each = 5))

# generate data set with 5000 observations from the following model:
dedic <- rep(1:3, each = 4)        # 3 factors and 12 manifest variables
alpha <- rep(c(1, NA, NA, NA), 3)  # set first loading to 1 for each factor,
                                   #   sample remaining loadings from default
sigma <- rep(0.5, 12)              # idiosyncratic variances all set to 0.5
R <- toeplitz(c(1, .6, .3))        # Toeplitz matrix
dat <- simul.dedic.facmod(N = 5000, dedic, alpha, sigma, R)
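Since the parameters used to generate the data are stored as attributes of the returned data frame, they can be retrieved afterwards, for instance to compare estimates with the true values. The following is a small sketch that assumes the function is exported by the BayesFM package and that the attribute names match those listed in the Value section; it is skipped if the package is not installed.

```r
if (requireNamespace("BayesFM", quietly = TRUE)) {
  # 2 factors, 3 manifest variables each; all parameters simulated
  dat <- BayesFM::simul.dedic.facmod(N = 200, dedic = rep(1:2, each = 3))

  # parameters actually used in the simulation
  true_dedic <- attr(dat, "dedic")
  true_alpha <- attr(dat, "alpha")
  true_sigma <- attr(dat, "sigma")
  true_R     <- attr(dat, "R")

  print(dim(dat))
  print(true_R)
}
```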
