Last chance! 50% off unlimited learning
Sale ends in
This function extends the techniques of Headrick and Beasley (2004, 10.1081/SAC-120028431) to
create correlated systems of statistical equations containing continuous variables with normal,
non-normal, or mixture distributions. The method allows the user to control the distributions
for the stochastic disturbance (error) terms calc_betas
. If the system contains mixture variables and corr.yx
is specified in terms of non-mixture
and mixture variables, the betas
will be calculated in terms of non-mixture and mixture independent variables. If corr.yx
Finally, the user specifies the correlations across error terms. The assumptions are that
1) there are at least 2 equations and a total of at least 1 independent variable, 2) the independent variables are uncorrelated
with the error terms, 3) each equation has an error term,
and 4) all error terms have either a non-mixture or mixture distribution. The outcomes corrsys
or corrsys2
). However, the user does not have to provide the beta coefficients
(except for the intercepts). Since the equations do not include random slopes (i.e. for the checkpar
. Summaries of the simulation results can be found with summary_sys
.
The functions calc_corr_y
, calc_corr_yx
, and calc_corr_ye
use equations
from Headrick and Beasley (2004) to calculate the expected correlations for the outcomes, among a given outcome and covariates from
the other outcomes, and for the error terms. The vignette Theory and Equations for Correlated Systems of Continuous Variables
gives the equations, and the vignette Correlated Systems of Statistical Equations with Non-Mixture and Mixture Continuous
Variables gives examples. There are also vignettes in SimCorrMix
which provide more details on continuous
non-mixture and mixture variables.
nonnormsys(n = 10000, M = NULL, method = c("Fleishman", "Polynomial"),
error_type = c("non_mix", "mix"), means = list(), vars = list(),
skews = list(), skurts = list(), fifths = list(), sixths = list(),
Six = list(), mix_pis = list(), mix_mus = list(), mix_sigmas = list(),
mix_skews = list(), mix_skurts = list(), mix_fifths = list(),
mix_sixths = list(), mix_Six = list(), same.var = NULL,
betas.0 = NULL, corr.x = list(), corr.yx = list(), corr.e = NULL,
seed = 1234, use.nearPD = TRUE, eigmin = 0, adjgrad = FALSE,
B1 = NULL, tau = 0.5, tol = 0.1, steps = 100, msteps = 10,
errorloop = FALSE, epsilon = 0.001, maxit = 1000, quiet = FALSE)
the sample size (i.e. the length of each simulated variable; default = 10000)
the number of dependent variables
the PMT method used to generate all continuous variables, including independent variables (covariates) and error terms; "Fleishman" uses Fleishman's third-order polynomial transformation and "Polynomial" uses Headrick's fifth-order transformation
"non_mix" if all error terms have continuous non-mixture distributions, "mix" if all error terms have continuous mixture distributions
a list of length M
of vectors of means for the non-mixture (corr.x
(considering the components of mixture variables)
a list of length M
of vectors of variances for means
a list of length M
of vectors of skew values for error_type = "non_mix"
);
same order as in corr.x
and means
a list of length M
of vectors of standardized kurtoses (kurtosis - 3) for error_type = "non_mix"
);
same order and dimension as skews
a list of length M
of vectors of standardized fifth cumulants for error_type = "non_mix"
);
same order and dimension as skews
; not necessary for method = "Fleishman"
a list of length M
of vectors of standardized sixth cumulants for error_type = "non_mix"
);
same order and dimension as skews
; not necessary for method = "Fleishman"
a list of length M
, where Six[[p]][[j]]
is a vector of sixth cumulant correction values to aid in finding a valid PDF for
Six[[p]]
is for error_type = "non_mix"
); use Six[[p]][[j]] = NULL
if no correction desired for Six[[p]] = NULL
if no correction desired for any non-mixture covariate or error term in equation p;
keep Six = list()
if no corrections desired for all covariates or if method = "Fleishman"
a list of length M
, where mix_pis[[p]][[j]]
is a vector of mixing probabilities that sum to 1 for mix_pis[[p]]
is for error_type = "mix"
);
if mix_pis[[p]] = NULL
; components should be ordered as in corr.x
a list of length M
, where mix_mus[[p]][[j]]
is a vector of means of the component distributions for
mix_mus[[p]]
is for error_type = "mix"
);
if mix_mus[[p]] = NULL
a list of length M
, where mix_sigmas[[p]][[j]]
is a vector of standard deviations of the component distributions for
mix_sigmas[[p]]
is for error_type = "mix"
);
if mix_sigmas[[p]] = NULL
a list of length M
, where mix_skews[[p]][[j]]
is a vector of skew values of the component distributions for
mix_skews[[p]]
is for error_type = "mix"
);
if mix_skews[[p]] = NULL
a list of length M
, where mix_skurts[[p]][[j]]
is a vector of standardized kurtoses of the component distributions for
mix_skurts[[p]]
is for error_type = "mix"
);
if mix_skurts[[p]] = NULL
a list of length M
, where mix_fifths[[p]][[j]]
is a vector of standardized fifth cumulants of the component distributions for
mix_fifths[[p]]
is for error_type = "mix"
);
if mix_fifths[[p]] = NULL
; not necessary for method = "Fleishman"
a list of length M
, where mix_sixths[[p]][[j]]
is a vector of standardized sixth cumulants of the component distributions for
mix_sixths[[p]]
is for error_type = "mix"
);
if mix_sixths[[p]] = NULL
; not necessary for method = "Fleishman"
a list of length M
, where mix_Six[[p]]
is a list of length equal to the total number of component distributions for
the error_type = "mix"
);
mix_Six[[p]][[j]]
is a vector of sixth cumulant corrections for the j-th component distribution (i.e., if there are 2
continuous mixture independent variables for length(mix_Six[[p]]) = 5
and mix_Six[[p]][[3]]
would correspond to the 1st component of
mix_Six[[p]][[j]] = NULL
if no correction desired for that component;
use mix_Six[[p]] = NULL
if no correction desired for any component of mix_Six = list()
if no corrections desired for all covariates or if method = "Fleishman"
either a vector or a matrix; if a vector, same.var
includes column numbers of corr.x[[1]][[1]]
corresponding to independent variables that should be identical across equations; these terms must have the same indices for all p = 1, ..., M
;
i.e., if the 1st variable represents height which should be the same for each equation, then
same.var[1] = 1
and the 1st term for all other outcomes must also be height;
if a matrix, columns 1 and 2 are outcome p and column index in corr.x[[p]][[p]]
for 1st instance of variable,
columns 3 and 4 are outcome q and column index in corr.x[[q]][[q]]
for subsequent instances of variable; i.e., if
1st term for all outcomes is height and M = 3
, then same.var = matrix(c(1,
1, 2, 1, 1, 1, 3, 1), 2, 4, byrow = TRUE)
; the
independent variable index corresponds to continuous non-mixture and component of continuous mixture covariate
vector of length M
containing intercepts, if NULL
all set equal to 0; if length 1, all intercepts set to betas.0
list of length M
, each component a list of length M
; corr.x[[p]][[q]]
is matrix of correlations
for independent variables in equations p (skews
) and 2nd components of continuous mixture (same order as in mix_pis
);
if p = q, corr.x[[p]][[q]]
is a correlation matrix with nrow(corr.x[[p]][[q]])
= # of non-mixture + # of mixture components for outcome corr.x[[p]][[q]]
is a non-symmetric matrix of correlations where rows correspond to covariates for nrow(corr.x[[p]][[q]])
= # of non-mixture + # of mixture components for outcome ncol(corr.x[[p]][[q]])
= # of non-mixture + # of mixture components for outcome corr.x[[p]][[q]] = NULL
if equation q has no corr.x[[p]] = NULL
if equation p has no
a list of length M
, where the p-th component is a 1 row matrix of correlations between betas
are desired in terms of these (and not the components), then corr.yx
should be specified in terms of correlations between outcomes and non-mixture or mixture variables, and the number of columns of the matrices
of corr.yx
should not match the dimensions of the matrices in corr.x
; if the betas
are desired in terms of
the components, then corr.yx
should be specified in terms of correlations between outcomes and non-mixture or components of
mixture variables, and the number of columns of the matrices of corr.yx
should match the dimensions of the matrices in corr.x
;
use corr.yx[[p]] = NULL
if equation p has no
correlation matrix for continuous non-mixture or components of mixture error terms
the seed value for random number generation (default = 1234)
TRUE to convert the overall intermediate correlation matrix formed by the Matrix::nearPD
if necessary;
if FALSE and adjgrad = FALSE
the negative eigenvalues are replaced with eigmin
if necessary
minimum replacement eigenvalue if overall intermediate correlation matrix is not positive-definite (default = 0)
TRUE to use adj_grad
to convert overall intermediate correlation matrix to a positive-definite matrix and next
5 inputs can be used
the initial matrix for algorithm; if NULL, uses a scaled initial matrix with diagonal elements sqrt(nrow(Sigma))/2
parameter used to calculate theta (default = 0.5)
maximum error for Frobenius norm distance between new matrix and original matrix (default = 0.1)
maximum number of steps for k (default = 100)
maximum number of steps for m (default = 10)
if TRUE, uses corr_error
to attempt to correct the correlation of the independent
variables within and across outcomes to be within epsilon
of the target correlations corr.x
until the number of iterations
reaches maxit
(default = FALSE)
the maximum acceptable error between the final and target correlation matrices (default = 0.001) in the error loop
the maximum number of iterations to use (default = 1000) in the error loop
if FALSE prints messages, if TRUE suppresses messages
A list with the following components:
Y
matrix with n
rows and M
columns of outcomes
X
list of length M
containing
X_all
list of length M
containing
E
matrix with n
rows containing continuous non-mixture or components of continuous mixture error terms
E_mix
matrix with n
rows containing continuous mixture error terms
betas
a matrix of the slope coefficients calculated with calc_betas
, rows represent the
outcomes
constants
a list of length M
with data.frames of the constants for the
SixCorr
a list of length M
of lists of sixth cumulant correction values used to obtain valid
pdf's for the
valid.pdf
a list of length M
of vectors where the i-th element is "TRUE" if the constants for the i-th
continuous variable generate a valid pdf, else "FALSE"
Sigma_X0
matrix of intermediate correlations calculated by intercorr
Sigma_X
matrix of intermediate correlations after nearPD
or adj_grad
function has been used;
applied to generate the normal variables transformed to get the desired distributions
Error_Time
the time in minutes required to use the error loop
Time
the total simulation time in minutes
niter
a matrix of the number of iterations used in the error loop
Mixture distributions describe random variables that are drawn from more than one component distribution. For a random variable
The
All continuous variables are simulated using either Fleishman's third-order (method
= "Fleishman",
10.1007/BF02293811) or Headrick's fifth-order (method
= "Polynomial", 10.1016/S0167-9473(02)00072-5)
power method transformation (PMT). It works by matching standardized cumulants -- the first four (mean,
variance, skew, and standardized kurtosis) for Fleishman's method, or the first six (mean, variance,
skew, standardized kurtosis, and standardized fifth and sixth cumulants) for Headrick's method. The transformation is expressed
as follows:
where find_constants
for non-mixture and components of mixture variables. Continuous mixture
variables are generated componentwise and then transformed to the desired mixture variables using random multinomial variables
generated based on mixing probabilities. The correlation matrices are specified in terms of correlations with components of the
mixture variables.
Using the fifth-order approximation allows additional control over the fifth and sixth moments of the generated distribution, improving
accuracy. In addition, the range of feasible standardized kurtosis values, given skew and standardized fifth (calc_lower_skurt
).
For example, the Fleishman method can not be used to generate a non-normal distribution with a ratio of
1) The most likely cause for function errors is that the parameter inputs are mispecified. Using checkpar
prior to simulation can help decrease these errors.
2) No solutions to fleish
or
poly
converged when using find_constants
. If this happens,
the simulation will stop. It may help to first use find_constants
for each continuous variable to
determine if a sixth cumulant correction value is needed. If the standardized cumulants are obtained from calc_theory
, the user
may need to use rounded values as inputs (i.e. skews = round(skews, 8)
). For example, in order to ensure that skew is exactly 0 for symmetric distributions.
3) The kurtosis for a continuous variable may be outside the region of possible values. There is an associated lower kurtosis boundary for
associated with a given skew (for Fleishman's method) or skew and fifth and sixth cumulants (for Headrick's method). Use
calc_lower_skurt
to determine the boundary for a given set of cumulants.
4) No solutions to calc_betas
converged when trying to find the beta coefficients. Try different correlation
matrices.
Davenport JW, Bezder JC, & Hathaway RJ (1988). Parameter Estimation for Finite Mixture Distributions. Computers & Mathematics with Applications, 15(10):819-28.
Everitt BS (1996). An Introduction to Finite Mixture Distributions. Statistical Methods in Medical Research, 5(2):107-127. 10.1177/096228029600500202.
Fialkowski AC (2017). SimMultiCorrData: Simulation of Correlated Data with Multiple Variable Types. R package version 0.2.1. https://CRAN.R-project.org/package=SimMultiCorrData.
Fialkowski AC (2018). SimCorrMix: Simulation of Correlated Data of Multiple Variable Types including Continuous and Count Mixture Distributions. R package version 0.1.0. https://github.com/AFialkowski/SimCorrMix
Fleishman AI (1978). A Method for Simulating Non-normal Distributions. Psychometrika, 43:521-532. 10.1007/BF02293811.
Headrick TC (2002). Fast Fifth-order Polynomial Transforms for Generating Univariate and Multivariate Non-normal Distributions. Computational Statistics & Data Analysis, 40(4):685-711. 10.1016/S0167-9473(02)00072-5. (ScienceDirect)
Headrick TC (2004). On Polynomial Transformations for Simulating Multivariate Nonnormal Distributions. Journal of Modern Applied Statistical Methods, 3(1):65-71. 10.22237/jmasm/1083370080.
Headrick TC, Beasley TM (2004). A Method for Simulating Correlated Non-Normal Systems of Linear Statistical Equations. Communications in Statistics - Simulation and Computation, 33(1). 10.1081/SAC-120028431
Headrick TC, Kowalchuk RK (2007). The Power Method Transformation: Its Probability Density Function, Distribution Function, and Its Further Use for Fitting Data. Journal of Statistical Computation and Simulation, 77:229-249. 10.1080/10629360600605065.
Headrick TC, Sawilowsky SS (1999). Simulating Correlated Non-normal Distributions: Extending the Fleishman Power Method. Psychometrika, 64:25-35. 10.1007/BF02294317.
Headrick TC, Sheng Y, & Hodis FA (2007). Numerical Computing and Graphics for the Power Method Transformation Using Mathematica. Journal of Statistical Software, 19(3):1 - 17. 10.18637/jss.v019.i03.
Higham N (2002). Computing the nearest correlation matrix - a problem from finance; IMA Journal of Numerical Analysis 22:329-343.
McCulloch CE, Searle SR, Neuhaus JM (2008). Generalized, Linear, and Mixed Models (2nd ed.). Wiley Series in Probability and Statistics. Hoboken, New Jersey: John Wiley & Sons, Inc.
Pearson RK (2011). Exploring Data in Engineering, the Sciences, and Medicine. In. New York: Oxford University Press.
Schork NJ, Allison DB, & Thiel B (1996). Mixture Distributions in Human Genetics Research. Statistical Methods in Medical Research, 5:155-178. 10.1177/096228029600500204.
Vale CD & Maurelli VA (1983). Simulating Multivariate Nonnormal Distributions. Psychometrika, 48:465-471. 10.1007/BF02293687.
calc_betas
, calc_corr_y
, calc_corr_yx
,
calc_corr_ye
, checkpar
, summary_sys
# NOT RUN {
M <- 3
B <- calc_theory("Beta", c(4, 1.5))
skews <- lapply(seq_len(M), function(x) c(0, B[3]))
skurts <- lapply(seq_len(M), function(x) c(0, B[4]))
fifths <- lapply(seq_len(M), function(x) c(0, B[5]))
sixths <- lapply(seq_len(M), function(x) c(0, B[6]))
Six <- lapply(seq_len(M), function(x) list(NULL, 0.03))
corr.e <- matrix(c(1, 0.4, 0.4^2, 0.4, 1, 0.4, 0.4^2, 0.4, 1), M, M,
byrow = TRUE)
means <- lapply(seq_len(M), function(x) c(0, B[1]))
vars <- lapply(seq_len(M), function(x) c(1, B[2]^2))
corr.x <- list(list(matrix(1, 1, 1), matrix(0.4, 1, 1), matrix(0.4, 1, 1)),
list(matrix(0.4, 1, 1), matrix(1, 1, 1), matrix(0.4, 1, 1)),
list(matrix(0.4, 1, 1), matrix(0.4, 1, 1), matrix(1, 1, 1)))
corr.yx <- list(matrix(0.4, 1), matrix(0.5, 1), matrix(0.6, 1))
Sys1 <- nonnormsys(10000, M, "Polynomial", "non_mix", means, vars,
skews, skurts, fifths, sixths, Six, corr.x = corr.x, corr.yx = corr.yx,
corr.e = corr.e)
# }
# NOT RUN {
# Example: system of three equations for 2 independent variables, where each
# error term has unit variance, from Headrick & Beasley (2002)
# Y_1 = beta_10 + beta_11 * X_11 + beta_12 * X_12 + sigma_1 * e_1
# Y_2 = beta_20 + beta_21 * X_21 + beta_22 * X_22 + sigma_2 * e_2
# Y_3 = beta_30 + beta_31 * X_31 + beta_32 * X_32 + sigma_3 * e_3
# X_11 = X_21 = X_31 = Exponential(2)
# X_12 = X_22 = X_32 = Laplace(0, 1)
# e_1 = e_2 = e_3 = Cauchy(0, 1)
seed <- 1234
M <- 3
Stcum1 <- calc_theory("Exponential", 2)
Stcum2 <- calc_theory("Laplace", c(0, 1))
Stcum3 <- c(0, 1, 0, 25, 0, 1500) # taken from paper
means <- lapply(seq_len(M), function(x) c(0, 0, 0))
vars <- lapply(seq_len(M), function(x) c(1, 1, 1))
skews <- lapply(seq_len(M), function(x) c(Stcum1[3], Stcum2[3], Stcum3[3]))
skurts <- lapply(seq_len(M), function(x) c(Stcum1[4], Stcum2[4], Stcum3[4]))
fifths <- lapply(seq_len(M), function(x) c(Stcum1[5], Stcum2[5], Stcum3[5]))
sixths <- lapply(seq_len(M), function(x) c(Stcum1[6], Stcum2[6], Stcum3[6]))
# No sixth cumulant corrections will be used in order to match the results
# from the paper.
corr.yx <- list(matrix(c(0.4, 0.4), 1), matrix(c(0.5, 0.5), 1),
matrix(c(0.6, 0.6), 1))
corr.x <- list()
corr.x[[1]] <- corr.x[[2]] <- corr.x[[3]] <- list()
corr.x[[1]][[1]] <- matrix(c(1, 0.1, 0.1, 1), 2, 2)
corr.x[[1]][[2]] <- matrix(c(0.1974318, 0.1859656, 0.1879483, 0.1858601),
2, 2, byrow = TRUE)
corr.x[[1]][[3]] <- matrix(c(0.2873190, 0.2589830, 0.2682057, 0.2589542),
2, 2, byrow = TRUE)
corr.x[[2]][[1]] <- t(corr.x[[1]][[2]])
corr.x[[2]][[2]] <- matrix(c(1, 0.35, 0.35, 1), 2, 2)
corr.x[[2]][[3]] <- matrix(c(0.5723303, 0.4883054, 0.5004441, 0.4841808),
2, 2, byrow = TRUE)
corr.x[[3]][[1]] <- t(corr.x[[1]][[3]])
corr.x[[3]][[2]] <- t(corr.x[[2]][[3]])
corr.x[[3]][[3]] <- matrix(c(1, 0.7, 0.7, 1), 2, 2)
corr.e <- matrix(0.4, nrow = 3, ncol = 3)
diag(corr.e) <- 1
# Check the parameter inputs
checkpar(M, "Polynomial", "non_mix", means, vars, skews,
skurts, fifths, sixths, corr.x = corr.x, corr.yx = corr.yx,
corr.e = corr.e)
# Generate the system
Sys1 <- nonnormsys(10000, M, "Polynomial", "non_mix", means, vars, skews,
skurts, fifths, sixths, corr.x = corr.x, corr.yx = corr.yx,
corr.e = corr.e, seed = seed)
# Summarize the results
Sum1 <- summary_sys(Sys1$Y, Sys1$E, E_mix = NULL, Sys1$X, X_all = list(), M,
"Polynomial", means, vars, skews, skurts, fifths, sixths, corr.x = corr.x,
corr.e = corr.e)
# Calculate theoretical correlations for comparison to simulated values
calc_corr_y(Sys1$betas, corr.x, corr.e, vars)
Sum1$rho.y
calc_corr_ye(Sys1$betas, corr.x, corr.e, vars)
Sum1$rho.ye
calc_corr_yx(Sys1$betas, corr.x, vars)
Sum1$rho.yx
# }
Run the code above in your browser using DataLab