Creates a data frame of discrete and continuous variables based on several arguments.
questionnaire_gen(
  n_obs,
  cat_prop = NULL,
  n_vars = NULL,
  n_X = NULL,
  n_W = NULL,
  cor_matrix = NULL,
  cov_matrix = NULL,
  c_mean = NULL,
  c_sd = NULL,
  theta = FALSE,
  family = NULL,
  full_output = FALSE,
  verbose = TRUE
)
By default, the function returns a data.frame object where the
first column ("subject") is a \(1,\ldots,n\) ordered list of the \(n\)
observations and the other columns correspond to the questionnaire answers.
If theta = TRUE, the first column after "subject" will be the latent
variable \(\theta\); in any case, the continuous variables always come
before the categorical ones.
If full_output = TRUE, the output will be a list containing the
following objects:
a data frame containing the background questionnaire answers (i.e., the same object as described above).
identical to the input argument of the same name. Read the Details section for more information.
identical to the input argument of the same name. Read the Details section for more information.
identical to the input argument of the same name. Read the Details section for more information.
a list containing the probabilities for each category
of the categorical variables (cat_prop_W contains the cumulative
probabilities).
identical to the input argument of the same name. Read the Details section for more information.
identical to the input argument of the same name. Read the Details section for more information.
identical to the input argument of the same name.
identical to the input argument of the same name.
named vector containing the total number of variables, the number of continuous background variables (i.e., the continuous variables other than \(\theta\)) and the number of categorical variables.
vector containing the number of categorical variables.
vector containing the number of continuous variables (except \(\theta\)).
vector with the standard deviations of all the variables.
vector containing the standard deviations of \(\theta\), the background continuous variables (\(X\)) and the Normally-distributed variables \(Z\) which will generate the background categorical variables (\(W\)).
identical to the input argument of the same name.
list containing the variances of the categorical variables.
list containing the variances of the continuous variables (including \(\theta\)).
This list is included only if `theta = TRUE`, `family = "gaussian"` and `full_output = TRUE`. It contains one vector named `betas` and one table named `cov_YXW`. The former contains the true linear regression coefficients of \(\theta\) on the background questionnaire answers; the latter contains the covariance matrix between all these variables.
n_obs: number of observations to generate.
cat_prop: list of cumulative proportions for each item. If theta = TRUE, the first element of cat_prop must be a scalar 1, which corresponds to \(\theta\).
n_vars: total number of variables in the questionnaire, including the continuous and the discrete covariates (\(X\) and \(W\), respectively), as well as the latent trait (\(Y\), which is equivalent to \(\theta\)).
n_X: number of continuous background variables. If not provided, a random number of continuous variables will be generated.
n_W: either a scalar corresponding to the number of categorical background variables or a list of scalars representing the number of categories for each categorical variable. If not provided, a random number of categorical variables will be generated.
cor_matrix: latent correlation matrix. The first row/column corresponds to the latent trait (\(Y\)). The other rows/columns correspond to the continuous (\(X\) or \(Z\)) or the discrete (\(W\)) background variables, in the same order as cat_prop.
cov_matrix: latent covariance matrix, formatted as cor_matrix.
c_mean: vector of population means for each continuous variable (\(Y\) and \(X\)). Defaults to 0.
c_sd: vector of population standard deviations for each continuous variable (\(Y\) and \(X\)). Defaults to 1.
theta: if TRUE, the first continuous variable will be labeled 'theta'. Otherwise, it will be labeled 'q1'.
family: distribution of the background variables. Can be NULL (default) or 'gaussian'.
full_output: if TRUE, output will be a list containing the questionnaire data as well as several objects that might be of interest for further analysis of the data.
verbose: if `FALSE`, output messages will be suppressed (useful for simulations). Defaults to `TRUE`.
In essence, this function begins by checking the validity of the
arguments provided and randomly generating those that are not. Then, it
calls one of two internal functions,
questionnaire_gen_polychoric or questionnaire_gen_family. The
former corresponds to the exact functionality of questionnaire_gen in
lsasim 1.0.1, where polychoric correlations are used to generate the
background questionnaire data. If family != NULL, however,
questionnaire_gen_family is called to generate data based on a joint
probability distribution. Additionally, if full_output == TRUE, the
external function beta_gen is called to generate the regression
coefficients based on the true covariance matrix. Setting
full_output = TRUE also changes the class of the output of this function
from a data.frame to a list.
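To make the link between a covariance matrix and regression coefficients concrete, here is a minimal base-R sketch. It uses the standard identity \(\beta = \Sigma_{XX}^{-1}\Sigma_{XY}\) for continuous variables only. The matrix values and the names `vcov_yx`, `cov_xx`, `cov_xy` and `beta` are made up for illustration; this is not the code of beta_gen, which may handle categorical covariates differently.

```r
# Hypothetical covariance matrix of (Y, X1, X2); values chosen for illustration.
vcov_yx <- matrix(c(1.0, 0.5, 0.3,
                    0.5, 1.0, 0.2,
                    0.3, 0.2, 1.0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("Y", "X1", "X2"), c("Y", "X1", "X2")))

cov_xx <- vcov_yx[-1, -1]  # covariances among the predictors
cov_xy <- vcov_yx[-1, 1]   # covariances between each predictor and Y

# Slopes of the linear regression of Y on X1 and X2.
beta <- solve(cov_xx, cov_xy)

# Intercept, assuming (hypothetically) that all population means are zero.
means <- c(Y = 0, X1 = 0, X2 = 0)
intercept <- unname(means["Y"] - sum(beta * means[-1]))

print(c(intercept = intercept, beta))
```

Multiplying `cov_xx` by `beta` should reproduce `cov_xy`, which is a quick way to check the computation.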
What follows are some notes on the input parameters.
cat_prop is a list where length(cat_prop) is the number of
items to be generated. Each element of the list is a vector containing the
marginal cumulative proportions for each category, so its last entry must
be 1. For continuous items, the associated element in the list should be 1.
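As an illustration of this format, the following base-R sketch builds a cat_prop list from per-category probabilities using cumsum(). The item names (`q_score`, `q_gender`, `q_books`) and the probabilities are hypothetical.

```r
# Hypothetical per-category probabilities for two categorical items.
probs <- list(
  q_gender = c(0.5, 0.5),
  q_books  = c(0.25, 0.35, 0.40)
)

# Convert to cumulative proportions; a continuous item is a single 1.
cat_prop <- c(
  list(q_score = 1),
  lapply(probs, cumsum)
)

# Basic validity checks: each vector is non-decreasing and ends at 1.
is_valid <- vapply(cat_prop, function(p) {
  isTRUE(all.equal(p[length(p)], 1)) && !is.unsorted(p)
}, logical(1))
stopifnot(all(is_valid))

str(cat_prop)
```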
cor_matrix and cov_matrix are the correlation and covariance
matrices; both are square, with as many rows and columns as
length(cat_prop). The correlations refer to the relationships between
variables on the latent scale.
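The relationship between the two matrices can be sketched in base R as follows. The correlation values and standard deviations are made up for illustration; cov2cor() is part of the stats package.

```r
# Hypothetical correlation matrix for three variables and their standard deviations.
cor_mat <- matrix(c(1.0, 0.6, 0.3,
                    0.6, 1.0, 0.4,
                    0.3, 0.4, 1.0), nrow = 3, byrow = TRUE)
sds <- c(1.5, 1, 2)

# Covariance = D %*% R %*% D, where D is the diagonal matrix of standard deviations.
cov_mat <- diag(sds) %*% cor_mat %*% diag(sds)

# cov2cor() recovers the correlation matrix from the covariance matrix.
stopifnot(isTRUE(all.equal(cov2cor(cov_mat), cor_mat)))

# A usable matrix should be symmetric and positive definite.
stopifnot(isSymmetric(cov_mat),
          all(eigen(cov_mat, symmetric = TRUE)$values > 0))
```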
c_mean and c_sd are each vectors whose length is equal to the number
of continuous variables as specified by cat_prop. The default is to
keep the continuous variables with mean zero and standard deviation of one.
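A short base-R sketch of the length requirement and of what the two vectors mean, assuming a hypothetical cat_prop with two continuous items. The rescaling step only illustrates the role of a mean and a standard deviation; it is not taken from the package's internals.

```r
# Hypothetical cat_prop: two continuous items, then two categorical ones.
cat_prop <- list(1, 1, c(0.3, 1), c(0.2, 0.7, 1))

# Continuous items are the elements consisting of a single value.
is_continuous <- vapply(cat_prop, function(p) length(p) == 1, logical(1))
n_continuous <- sum(is_continuous)

# c_mean and c_sd need one entry per continuous item.
c_mean <- c(50, 100)
c_sd   <- c(10, 15)
stopifnot(length(c_mean) == n_continuous, length(c_sd) == n_continuous)

# Rescale standard-normal draws column by column: x = z * sd + mean.
set.seed(1)
z <- matrix(rnorm(1000 * n_continuous), ncol = n_continuous)
x <- sweep(sweep(z, 2, c_sd, `*`), 2, c_mean, `+`)

print(round(colMeans(x), 1))
print(round(apply(x, 2, sd), 1))
```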
theta is a logical indicator that determines if the first continuous
item should be labeled theta. If theta == TRUE but there are
no continuous variables generated, a random number of background variables
will be generated.
If cat_prop is a named list, those names will be used as variable
names for the returned data.frame. Generic names will be provided
to the variables if cat_prop is not named.
As an alternative to providing cat_prop, the user can call this
function by specifying the total number of variables using n_vars or
the specific number of continuous and categorical variables through
n_X and n_W. All three arguments should be provided as
scalars; n_W may also be provided as a list, where each element
contains the number of categories for one background variable.
Alternatively, n_W may be provided as a one-element list, in which
case it will be interpreted as all the categorical variables having the
same number of categories.
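A call of this kind might look like the sketch below, which uses only arguments from the usage section above. It assumes the lsasim package is installed and is skipped otherwise; the seed and the chosen numbers of categories are arbitrary.

```r
bg <- NULL
if (requireNamespace("lsasim", quietly = TRUE)) {
  set.seed(1234)
  # Two continuous variables and two categorical ones (2 and 3 categories);
  # cat_prop and the correlation matrix are left for the function to generate.
  bg <- lsasim::questionnaire_gen(n_obs = 5, n_X = 2, n_W = list(2, 3),
                                  verbose = FALSE)
  print(bg)
}
```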
If family == "gaussian", the questionnaire will be generated
assuming that all the variables are jointly-distributed as a multivariate
normal. The default behavior is family == NULL, where the data is
generated using the polychoric correlation matrix, with no distributional
assumptions.
When data is generated using the Gaussian distribution, the matrices
provided correspond to the relations between the latent variable
\(\theta\), the continuous covariates \(X\) and the continuous
covariates \(Z \sim N(0, 1)\) that will later be discretized into
categorical covariates \(W\). That is why the labels and dimensions of
cov_matrix and vcov_YXW differ.
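The discretization of \(Z\) into \(W\) can be illustrated with a base-R sketch. It assumes the category thresholds are the standard normal quantiles of the cumulative proportions. The covariance values, the proportions and all variable names are hypothetical, and the package's internal implementation may differ.

```r
set.seed(42)
n_obs <- 1000

# Hypothetical latent covariance of (theta, Z); Z has unit variance.
latent_cov <- matrix(c(1.0, 0.5,
                       0.5, 1.0), nrow = 2, byrow = TRUE)

# Draw from a bivariate normal via the Cholesky factor of the covariance matrix.
draws <- matrix(rnorm(n_obs * 2), ncol = 2) %*% chol(latent_cov)
theta <- draws[, 1]
z     <- draws[, 2]

# Cumulative proportions for a 3-category item, in the cat_prop format.
cum_prop <- c(0.2, 0.8, 1)

# Thresholds on the standard normal scale; qnorm(1) is Inf, so all draws are covered.
cuts <- c(-Inf, qnorm(cum_prop))
w <- cut(z, breaks = cuts, labels = seq_along(cum_prop))

# Observed category shares should be close to 0.2, 0.6 and 0.2.
print(round(prop.table(table(w)), 2))
```

Here `theta` stays continuous while `z` is replaced by the categorical `w`, which is why a matrix describing \((\theta, X, W)\) has different labels and size than one describing \((\theta, X, Z)\).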
For more information, check the references cited later in this document.
Matta, T. H., Rutkowski, L., Rutkowski, D., & Liaw, Y. L. (2018). lsasim: an R package for simulating large-scale assessment data. Large-scale Assessments in Education, 6(1), 15.
beta_gen
# Using polychoric correlations
props <- list(c(1), c(.25, .6, 1)) # one continuous, one with 3 categories
questionnaire_gen(n_obs = 10, cat_prop = props,
                  cor_matrix = matrix(c(1, .6, .6, 1), nrow = 2),
                  c_mean = 2, c_sd = 1.5, theta = TRUE)

# Using the multivariate normal distribution (family = "gaussian")
# two categorical variables W: one has 2 categories, the other has 3
props <- list(1, c(.25, 1), c(.2, .8, 1))
yw_cov <- matrix(c(1, .5, .5, .5, 1, .8, .5, .8, 1), nrow = 3)
questionnaire_gen(n_obs = 10, cat_prop = props, cov_matrix = yw_cov,
                  family = "gaussian")

# Not providing covariance matrix
questionnaire_gen(n_obs = 10,
                  cat_prop = list(c(.25, 1), c(.6, 1), c(.2, 1)),
                  family = "gaussian")