cluster_gen: Generate cluster sample

Description

Generate cluster sample

Usage

cluster_gen(
  n,
  N = 1,
  cluster_labels = NULL,
  resp_labels = NULL,
  cat_prop = NULL,
  n_X = NULL,
  n_W = NULL,
  c_mean = NULL,
  sigma = NULL,
  cor_matrix = NULL,
  separate_questionnaires = TRUE,
  collapse = "none",
  sum_pop = sapply(N, sum),
  calc_weights = TRUE,
  sampling_method = "mixed",
  rho = NULL,
  theta = FALSE,
  verbose = TRUE,
  print_pop_structure = verbose,
  ...
)

Value

list with background questionnaire data, grouped by level or not

Arguments

n: numeric vector or list with the number of sampled observations (clusters or subjects) on each level
N: population size of each sampled cluster element on each level. Either a numeric vector or a list of numeric vectors. If N is a list, it must have the same length as n and each element of N must have the same length as the corresponding element of n
cluster_labels: character vector with the names of each cluster level
resp_labels: character vector with the names of the questionnaire respondents on each level
cat_prop: list of cumulative proportions for each item. If theta = TRUE, the first element of cat_prop must be a scalar 1, which corresponds to the theta.
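n_X: list with the value of n_X (the number of continuous variables, as in questionnaire_gen()) for each cluster level
n_W: list with the value of n_W (the categorical variables, as in questionnaire_gen()) for each cluster level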
c_mean: vector of means for the continuous variables or list of vectors for the continuous variables for each level. Defaults to 0, but may change if rho is set.
sigma: vector of standard deviations for the continuous variables or list of vectors for the continuous variables for each level. Defaults to 1, but may change if rho is set.
cor_matrix: Correlation matrix between all variables (except weights). By default, correlations are randomly generated.
separate_questionnaires: if TRUE, each level will have its own questionnaire
collapse: if TRUE, the function output contains only one data frame with all answers. It can also be "none", "partial" or "full" for finer control with 3 or more levels
sum_pop: total population at each level (sampled or not)
calc_weights: if TRUE, sampling weights are calculated
sampling_method: can be "SRS" for Simple Random Sampling, "PPS" for Probabilities Proportional to Size, "mixed" to use PPS for schools and SRS otherwise, or a vector with the sampling method for each level
rho: intraclass correlation (scalar, vector or list, as appropriate)
theta: if TRUE, the first continuous variable will be labeled 'theta'. Otherwise, it will be labeled 'q1'.
verbose: if TRUE, prints output messages
print_pop_structure: if TRUE, prints the population hierarchical structure (as long as it differs from the sample structure)
...: Additional parameters to be passed to questionnaire_gen()
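
Since cat_prop takes cumulative proportions, it can help to build it from plain category proportions with cumsum(). The following is a minimal base R sketch; the item proportions and the names p_item1 and p_item2 are made up for illustration, and the leading 1 is only required when theta = TRUE.

```r
# Hypothetical category proportions for two categorical items
p_item1 <- c(0.2, 0.3, 0.5)  # three categories
p_item2 <- c(0.6, 0.4)       # two categories

# cat_prop holds cumulative proportions, so each vector ends at 1
cat_prop <- list(
  1,                # scalar 1 for theta (needed when theta = TRUE)
  cumsum(p_item1),
  cumsum(p_item2)
)
str(cat_prop)
```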

Details

This function relies heavily on two sub-functions, cluster_gen_separate and cluster_gen_together, which can be called independently. This does not make cluster_gen a simple wrapper function, as it performs several operations before calling its sub-functions, such as randomly generating n_X and n_W if they are not set by the user.

An element of n can be a single number, in which case all clusters at that level will have the same size. N is not the population size across all elements of a level, but the population size for each element of one level.

The additional parameters passed on to questionnaire_gen() can be given either in the same format as in questionnaire_gen() or as more complex objects that contain information for each cluster level.
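
The difference between N (sizes per element) and the total population per level can be checked in base R with the default expression for sum_pop shown under Usage. A minimal sketch, reusing the N from the examples:

```r
# Population size of each element: 5 schools in the population,
# followed by the number of students in each of those schools
N <- list(5, c(200, 500, 400, 100, 100))

# Default value of sum_pop: total population at each level
sum_pop <- sapply(N, sum)
sum_pop
```

The first entry is the number of schools and the second is the total number of students across those schools.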

Examples

# Simple structure of 3 schools with 5 students each
cluster_gen(c(3, 5))

# Complex structure of 3 schools with different numbers of students,
# sampling weights and custom number of questions
n <- list(3, c(20, 15, 25))
N <- list(5, c(200, 500, 400, 100, 100))
cluster_gen(n, N, n_X = 5, n_W = 2)

# Condensing the output
set.seed(0); cluster_gen(c(2, 4))
set.seed(0); cluster_gen(c(2, 4), collapse=TRUE) # same, but in one dataset

# Condensing the output: 3 levels
str(cluster_gen(c(2, 2, 1), collapse="none"))
str(cluster_gen(c(2, 2, 1), collapse="partial"))
str(cluster_gen(c(2, 2, 1), collapse="full"))

# Controlling the intra-class correlation and the grand mean
x <- cluster_gen(c(5, 1000), rho = .9, n_X = 2, n_W = 0, c_mean = 10)
sapply(1:5, function(s) mean(x$school[[s]]$q1))  # means per school != 10
mean(sapply(1:5, function(s) mean(x$school[[s]]$q1))) # closer to c_mean

# Making the intraclass variance explode by forcing "incompatible" rho and c_mean
x <- cluster_gen(c(5, 1000), rho = .5, n_X = 2, n_W = 0, c_mean = 1:5)
anova(x)
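
For reference, an intraclass correlation is conventionally the share of the total variance that lies between clusters. The base R sketch below does not call cluster_gen() and is not a description of its internals. It simulates two-level data under an assumed variance split (between-school variance rho, within-school variance 1 - rho) and recovers the intraclass correlation with the usual one-way ANOVA estimator for balanced data. All object names are illustrative.

```r
set.seed(1)
rho        <- 0.9    # target intraclass correlation (assumed)
n_schools  <- 5
n_students <- 1000   # students per school (balanced design)

tau2   <- rho        # between-school variance, total variance fixed at 1
sigma2 <- 1 - rho    # within-school variance

# One random effect per school, repeated for each of its students
school_id     <- rep(seq_len(n_schools), each = n_students)
school_effect <- rnorm(n_schools, mean = 0, sd = sqrt(tau2))
q1 <- 10 + school_effect[school_id] +
  rnorm(n_schools * n_students, mean = 0, sd = sqrt(sigma2))

# One-way ANOVA mean squares between and within schools
fit <- anova(lm(q1 ~ factor(school_id)))
msb <- fit[1, "Mean Sq"]
msw <- fit[2, "Mean Sq"]

# ANOVA estimator of the intraclass correlation for balanced data
icc_hat <- (msb - msw) / (msb + (n_students - 1) * msw)
icc_hat
```

With only 5 schools the between-school variance is estimated from very few units, so the estimate can sit noticeably away from the target value.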