Stores information necessary to simulate datasets based on the NORTA procedure (Cario and Nelson 1997).
simdesign_norta(
cor_target_final = NULL,
cor_initial = NULL,
dist = list(),
tol_initial = 0.001,
n_obs_initial = 10000,
seed_initial = 1,
conv_norm_type = "O",
method = "svd",
name = "NORTA based simulation design",
...
)
List object with class attribute "simdesign_norta" (S3 class), inheriting
from "simdesign". It contains the same entries as a simdesign
object but in addition the following entries:
cor_target_final
cor_initial
Initial correlation matrix of multivariate normal distribution
dist
tol_initial
n_obs_initial
conv_norm_type
method
cor_target_final
Target correlation matrix for simulated datasets. At least one of
cor_target_final or cor_initial must be specified.
cor_initial
Correlation matrix of the underlying multivariate standard normal distribution
on which the final data is based. At least one of cor_target_final or
cor_initial must be specified. If NULL, then cor_initial will be numerically
optimized by simulation for the NORTA procedure using cor_target_final.
dist
List of functions of marginal distributions for simulated variables.
Must have the same length as the specified correlation matrix
(cor_target_final and / or cor_initial), and the order of the entries
must correspond to the variables in the correlation matrix. See details for
the specification of the marginal distributions.
tol_initial
If cor_initial is numerically optimized, specifies the tolerance for the
difference to the target correlation cor_target_final. Parameter passed to
optimize_cor_for_pair.
n_obs_initial
If cor_initial is numerically optimized, specifies the number of draws in
simulation during optimization used to estimate correlations.
Parameter passed to optimize_cor_for_pair.
seed_initial
Seed used for draws of the initial distribution used during optimization to estimate correlations.
conv_norm_type
If cor_initial is numerically optimized and found not to be a proper
correlation matrix (i.e. not positive-definite), specifies the metric used to
find the nearest positive-definite correlation matrix.
Parameter passed to Matrix::nearPD (conv.norm.type), see there for details.
method
method argument of mvtnorm::rmvnorm.
name
Character, optional name of the simulation design.
...
Further arguments are passed to the simdesign constructor.
Data will be generated using the following procedure:
An underlying data matrix Z is sampled from a multivariate standard Normal
distribution with correlation structure given by cor_initial.
Z is then transformed into a dataset X by applying the functions given in dist
to the columns of Z. The resulting dataset X will then have the desired
marginal distributions, and approximate the target correlation
cor_target_final, if specified.
X is further transformed by the transformation transform_initial (note that
this may affect the correlation of the final dataset and is not taken into
account by the optimization procedure), and post-processed if specified.
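The first two steps above can be sketched in base R as follows. This is a minimal illustration rather than the package's implementation: the correlation matrix and the marginals are hypothetical, Z is drawn with a Cholesky factor instead of mvtnorm::rmvnorm, and the columns of Z are mapped to (0, 1) with pnorm before the quantile functions are applied, as in the standard NORTA formulation (an assumption about the intermediate step).

```r
set.seed(1)

# Hypothetical initial correlation matrix for the underlying normal Z
cor_initial <- matrix(c(1.0, 0.5,
                        0.5, 1.0), nrow = 2)

# Hypothetical marginal distributions, given as quantile functions
dist <- list(
  function(p) qexp(p, rate = 2),
  function(p) qbinom(p, size = 1, prob = 0.3)
)

n_obs <- 5000

# Step 1: sample Z from N(0, cor_initial) using a Cholesky factor
Z <- matrix(rnorm(n_obs * 2), ncol = 2) %*% chol(cor_initial)

# Step 2: map each column to (0, 1) with pnorm, then apply its quantile function
U <- pnorm(Z)
X <- sapply(seq_along(dist), function(j) dist[[j]](U[, j]))

# The marginals follow dist; the correlation of X generally differs from cor_initial
round(cor(X), 2)
```

The gap between cor(X) and cor_initial is the reason cor_initial has to be tuned when a specific final correlation is wanted.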
A list of functions dist is used to define the marginal distributions of
the variables. Each entry must be a quantile function, i.e. a function
that maps [0, 1] to the domain of a probability distribution. Each entry
must take a single input vector, and return a single numeric vector.
Examples for acceptable entries include all standard quantile functions
implemented in R (e.g. qnorm, qbinom, ...), user defined functions
wrapping these (e.g. function(x) qnorm(x, mean = 10, sd = 4)), or
empirical quantile functions. The helper function
quantile_functions_from_data can be used to automatically
estimate empirical quantile functions from a given dataset in order to
reproduce it using the NORTA approach. See the example in the NORTA vignette
of this package for workflow details.
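As an illustration of the kinds of entries described above, the following sketch builds a dist list in base R. The observed vector is hypothetical, and the empirical quantile function is written by hand with stats::quantile rather than with quantile_functions_from_data, whose exact output is not shown here.

```r
# Hypothetical observed data used to build an empirical quantile function
observed <- c(2.1, 3.4, 3.9, 5.0, 7.2, 8.8)

dist <- list(
  qnorm,                                        # standard quantile function
  function(x) qnorm(x, mean = 10, sd = 4),      # wrapper fixing the parameters
  function(x) qbinom(x, size = 1, prob = 0.3),  # binary variable
  function(x) unname(quantile(observed, probs = x, type = 1))  # empirical
)

# Each entry takes one vector of probabilities and returns one numeric vector
p <- c(0.1, 0.5, 0.9)
res <- lapply(dist, function(f) f(p))
res
```

The order of the entries matters: the first function applies to the first variable of the correlation matrix, the second to the second, and so on.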
Not every valid correlation matrix (i.e. symmetric, positive-definite matrix
with elements in [-1, 1] and unity diagonal) for a number of variables is
feasible for given desired marginal distributions (see e.g. Ghosh and
Henderson 2003). Therefore, if cor_target_final is specified as target
correlation, this class optimizes cor_initial so that the final simulated
dataset has a correlation which approximates cor_target_final. However, the
actual correlation in the end may differ if cor_target_final is infeasible for
the given specification, or the NORTA procedure cannot exactly reproduce the
target correlation. In general, however, approximations should be acceptable
if target correlations and marginal structures are derived from real datasets.
See e.g. Ghosh and Henderson 2003 for the motivation why this works.
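The idea of tuning the initial correlation by simulation can be sketched for a single pair of variables. This is not the package's optimize_cor_for_pair routine; it is a simplified stand-in that uses base R's uniroot on a simulated estimate, with hypothetical exponential marginals and a hypothetical target of 0.5.

```r
# Quantile functions of two hypothetical marginals
q1 <- function(p) qexp(p, rate = 1)
q2 <- function(p) qexp(p, rate = 1)

# Fixed draws, so the estimated correlation varies smoothly with r
set.seed(1)
n_obs <- 10000
z1 <- rnorm(n_obs)
e  <- rnorm(n_obs)

# Correlation of the transformed pair when the underlying normals have correlation r
final_cor <- function(r) {
  z2 <- r * z1 + sqrt(1 - r^2) * e
  cor(q1(pnorm(z1)), q2(pnorm(z2)))
}

target <- 0.5

# Search for the initial correlation that yields the target final correlation
fit <- uniroot(function(r) final_cor(r) - target,
               interval = c(-0.99, 0.99), tol = 0.001)
fit$root             # initial correlation to use for Z
final_cor(fit$root)  # close to the target
```

With these marginals the search has no solution for strongly negative targets, because two exponential variables cannot be more negatively correlated than about -0.645; this is one concrete case of an infeasible target correlation.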
This S3 class implements a simulation design based on the NORmal-To-Anything (NORTA) procedure by Cario and Nelson (1997). See the corresponding NORTA vignette for usage examples showing how to approximate real datasets.
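A minimal usage sketch, assuming the package is installed under the name simdata and that simulate_data accepts n_obs and seed arguments (both are assumptions not stated on this page). The target correlation and the marginals are hypothetical.

```r
library(simdata)

# Hypothetical target correlation between two variables
cor_target <- matrix(c(1.0, 0.4,
                       0.4, 1.0), nrow = 2)

# Marginals as quantile functions, in the same order as the correlation matrix
dist <- list(
  function(x) qnorm(x, mean = 10, sd = 4),
  function(x) qbinom(x, size = 1, prob = 0.3)
)

# cor_initial is left NULL, so it is optimized from cor_target_final
design <- simdesign_norta(
  cor_target_final = cor_target,
  dist = dist,
  seed_initial = 1
)

# n_obs and seed are assumed argument names of simulate_data
X <- simulate_data(design, n_obs = 200, seed = 2)
cor(X)
```

The estimated correlation of X should be near the target but will not match it exactly, both because of sampling variation and because of the tolerance used during optimization.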
Cario, M. C. and Nelson, B. L. (1997) Modeling and generating random vectors with arbitrary marginal distributions and correlation matrix. Technical Report, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois.
Ghosh, S. and Henderson, S. G. (2003) Behavior of the NORTA method for correlated random vector generation as the dimension increases. ACM Transactions on Modeling and Computer Simulation.
simdesign, simulate_data, simulate_data_conditional, quantile_functions_from_data