cluster_gen: Generate cluster sample

Description

Generate cluster sample

Usage

cluster_gen(
  n,
  N = 1,
  cluster_labels = NULL,
  resp_labels = NULL,
  cat_prop = NULL,
  n_X = NULL,
  n_W = NULL,
  c_mean = NULL,
  sigma = NULL,
  cor_matrix = NULL,
  separate_questionnaires = TRUE,
  collapse = "none",
  sum_pop = sapply(N, sum),
  calc_weights = TRUE,
  sampling_method = "mixed",
  rho = NULL,
  theta = FALSE,
  verbose = TRUE,
  print_pop_structure = verbose,
  ...
)

Value

list with background questionnaire data, grouped by level or not

Arguments

n: numeric vector with the number of sampled observations (clusters or subjects) on each level
N: list of numeric vectors with the population size of each *sampled* cluster element on each level
cluster_labels: character vector with the names of each cluster level
resp_labels: character vector with the names of the questionnaire respondents on each level
cat_prop: list of cumulative proportions for each item. If theta = TRUE, the first element of cat_prop must be a scalar 1, which corresponds to theta.
n_X: list with the number of continuous variables (`n_X`) per cluster level
n_W: list with the number of categorical variables (`n_W`) per cluster level
c_mean: vector of means for the continuous variables or list of vectors for the continuous variables for each level. Defaults to 0, but can change if `rho` is set.
sigma: vector of standard deviations for the continuous variables or list of vectors for the continuous variables for each level. Defaults to 1, but can change if `rho` is set.
cor_matrix: Correlation matrix between all variables (except weights). By default, correlations are randomly generated.
separate_questionnaires: if `TRUE`, each level will have its own questionnaire
collapse: if `TRUE`, function output contains only one data frame with all answers. It can also be "none", "partial" and "full" for finer control on 3+ levels
sum_pop: total population at each level (sampled or not)
calc_weights: if `TRUE`, sampling weights are calculated
sampling_method: can be "SRS" for Simple Random Sampling, "PPS" for Probabilities Proportional to Size, or "mixed" (the default shown in Usage)
rho: estimated intraclass correlation
theta: if TRUE, the first continuous variable will be labeled 'theta'. Otherwise, it will be labeled 'q1'.
verbose: if `TRUE`, prints output messages
print_pop_structure: if `TRUE`, prints the population hierarchical structure (as long as it differs from the sample structure)
...: Additional parameters to be passed to `questionnaire_gen()`
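
The cumulative format expected by `cat_prop` can be illustrated in base R. This is a minimal sketch: the marginal proportions below are made-up values for a hypothetical three-category item, and the leading scalar 1 follows the rule stated above for `theta = TRUE`.

```r
# Hypothetical marginal proportions for one 3-category item (assumed values)
marginal <- c(0.2, 0.5, 0.3)

# Each element of cat_prop holds cumulative proportions, ending at 1
cumulative <- cumsum(marginal)

# With theta = TRUE, the first element must be the scalar 1 (for theta)
cat_prop <- list(1, cumulative)
print(cat_prop[[2]])  # 0.2 0.7 1.0
```

A list built this way could then be supplied as `cat_prop` together with `theta = TRUE`; how it interacts with `n_X` and `n_W` is not covered by this sketch.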

Details

This function relies heavily on two subfunctions, `cluster_gen_separate` and `cluster_gen_together`, which can be called independently. This does not make `cluster_gen` a simple wrapper function, as it performs several operations before calling its subfunctions, such as randomly generating `n_X` and `n_W` if they are not specified by the user. `n` can have unitary length, in which case all clusters will have the same size. `N` is *not* the population size across all elements of a level, but the population size for each element of one level. The additional parameters for `questionnaire_gen()` can be passed either in the same format used by `questionnaire_gen()` or as more complex objects that contain information for each cluster level.
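
To make the distinction between `N` and the level totals concrete, the default `sum_pop = sapply(N, sum)` from Usage can be evaluated by hand. The sketch below uses only base R and reuses the `N` from the Examples section; it does not call `cluster_gen` itself.

```r
# Population structure: 5 schools, each with its own student population
N <- list(5, c(200, 500, 400, 100, 100))

# Each element of N[[2]] is the size of ONE school, not of the whole level
school_sizes <- N[[2]]

# The default for sum_pop aggregates those sizes into one total per level
sum_pop <- sapply(N, sum)
print(sum_pop)  # 5 1300
```

Here 1300 is the total student population across the five schools, whereas `N` itself only records the size of each school separately.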

Examples

# Simple structure of 3 schools with 5 students each
cluster_gen(c(3, 5))

# Complex structure of 3 schools with different numbers of students,
# sampling weights and custom number of questions
n <- list(3, c(20, 15, 25))
N <- list(5, c(200, 500, 400, 100, 100))
cluster_gen(n, N, n_X = 5, n_W = 2)

# Condensing the output
set.seed(0); cluster_gen(c(2, 4))
set.seed(0); cluster_gen(c(2, 4), collapse=TRUE) # same, but in one dataset

# Condensing the output: 3 levels
str(cluster_gen(c(2, 2, 1), collapse="none"))
str(cluster_gen(c(2, 2, 1), collapse="partial"))
str(cluster_gen(c(2, 2, 1), collapse="full"))

# Controlling the intra-class correlation and the grand mean
x <- cluster_gen(c(5, 1000), rho = .9, n_X = 2, n_W = 0, c_mean = 10)
sapply(1:5, function(s) mean(x$school[[s]]$q1))  # means per school != 10
mean(sapply(1:5, function(s) mean(x$school[[s]]$q1))) # closer to c_mean

# Making the intraclass variance explode by forcing "incompatible" rho and c_mean
x <- cluster_gen(c(5, 1000), rho = .5, n_X = 2, n_W = 0, c_mean = 1:5)
anova(x)
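
The intraclass correlation targeted by `rho` in the examples above can be checked with a textbook one-way ANOVA estimator. The following is a minimal base-R sketch, assuming a balanced design (equal cluster sizes). `icc_anova` is a hypothetical helper, not part of the package, and it is not necessarily what `anova()` reports for `cluster_gen` output.

```r
# Hypothetical helper: ANOVA-based ICC estimate for a balanced design
icc_anova <- function(y, g) {
  g <- factor(g)
  k <- nlevels(g)                 # number of clusters
  n <- length(y) / k              # observations per cluster (balanced)
  group_means <- tapply(y, g, mean)
  # Between-cluster and within-cluster mean squares
  msb <- n * sum((group_means - mean(y))^2) / (k - 1)
  msw <- sum((y - ave(y, g))^2) / (length(y) - k)
  (msb - msw) / (msb + (n - 1) * msw)
}

# Tiny toy data set: two clusters with two observations each
y <- c(1, 3, 5, 7)
g <- c("A", "A", "B", "B")
icc_anova(y, g)  # (16 - 2) / (16 + 2) = 0.7778 (rounded)
```

For data generated by `cluster_gen`, the values of one variable and the corresponding cluster labels would first have to be extracted into two vectors like `y` and `g`.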