gen_informative_sample: Generate a finite population and take an informative single or two-stage sample.

Description

Used to compare performance of sample design-weighted and unweighted estimation procedures.
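
As a minimal base R sketch of the kind of comparison this function is meant to support, the following draws an informative sample and contrasts an unweighted mean with a design-weighted (inverse inclusion probability) mean. It does not call gen_informative_sample(); the gamma-distributed response, the Poisson sampling scheme and every variable name are assumptions made only for illustration.

```r
set.seed(1)
N <- 10000   # population size (illustrative)
n <- 500     # expected sample size (illustrative)

# Hypothetical positive response for each population unit
y <- rgamma(N, shape = 4, rate = 0.4)

# Informative design: inclusion probability proportional to the response itself
pi_i <- n * y / sum(y)
stopifnot(max(pi_i) < 1)

# Poisson sampling: each unit enters the sample independently with probability pi_i
sampled <- runif(N) < pi_i
y_s <- y[sampled]
w_s <- 1 / pi_i[sampled]   # design weights

pop_mean        <- mean(y)
unweighted_mean <- mean(y_s)                     # ignores the design
weighted_mean   <- sum(w_s * y_s) / sum(w_s)     # Hajek-type weighted mean

print(c(population = pop_mean,
        unweighted = unweighted_mean,
        weighted   = weighted_mean))
```

Because large responses are over-sampled, the unweighted mean tends to overshoot the population mean, while the weighted mean corrects for the unequal inclusion probabilities.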

Usage

gen_informative_sample(
  clustering = TRUE,
  two_stage = FALSE,
  theta = c(0.2, 0.7, 1),
  M = 3,
  theta_star = matrix(c(0.3, 0.3, 0.3, 0.31, 0.72, 2.04, 0.58, 0.83, 1), 3, 3, byrow =
    TRUE),
  gp_type = "rq",
  N = 10000,
  T = 15,
  L = 10,
  R = 8,
  I = 4,
  n = 750,
  noise_to_signal = 0.05,
  incl_gradient = "medium"
)

Value

A list object named dat_sim containing objects related to the generated finite population, the informative sample and the non-informative, iid sample. Important objects include:

H: A vector of length N, the population size, with the cluster assignment of each establishment (unit) to one of the clusters 1, ..., M.
map.tot: A data.frame object including unit label identifiers (under establishment), the cluster assignment (if clustering = TRUE), the block (if two_stage = TRUE) and stratum assignments and the sample inclusion probabilities.
map.obs: A data.frame object configured the same as map.tot, only confined to those establishments/units selected into the informative sample of size n.
map.iid: A data.frame object configured the same as map.tot, only confined to those establishments/units selected into the non-informative, iid sample of size n.
(y,bb): N x T matrix objects containing data responses and de-noised functions, respectively, for each of the N population units. The order of the N units is consistent with map.tot.
(y_obs,bb_obs): n x T matrix objects containing observed responses and de-noised functions, respectively, for each of the n units sampled under an informative sampling design. The order of the n units is consistent with map.obs.
(y_iid,bb_iid): n x T matrix objects containing observed responses and de-noised functions, respectively, for each of the n units sampled under a non-informative / iid sampling design. The order of the n units is consistent with map.iid.

Arguments

clustering: Boolean input indicating whether to generate the population from clusters of covariance parameters. Defaults to clustering = TRUE.
two_stage: Boolean input indicating whether to use two-stage sampling, in which the first stage defines a set of L blocks, with block membership determined by quantiles of the observation unit variance functions. (The blocks are structured like strata, though they are sub-sampled.) Defaults to two_stage = FALSE.
theta: A numeric vector of global covariance parameters used when clustering = FALSE. The length, P, of theta must be consistent with the selected gp_type. Defaults to theta = c(0.2,0.7,1.0).
M: Scalar input denoting number of clusters to employ if clustering = TRUE. Defaults to M = 3
theta_star: A P x M matrix of cluster location values associated with the choice of M and the selected gp_type. Defaults to matrix(c(0.3,0.3,0.3,0.31,0.72,2.04,0.58,0.83,1.00),3,3,byrow=TRUE).
gp_type: Input of choice for covariance matrix formulation to be used to generate the functions for the N population units. Choices are c("se","rq"), where "se" denotes the squared exponential covariance function and "rq" denotes the rational quadratic. Defaults to gp_type = "rq".
N: A scalar input denoting the number of population units (or establishments). Defaults to N = 10000.
T: A scalar input denoting the number of time points in each of N, T x 1 functions that contribute to the N x T population data matrix, y. Defaults to T = 15.
L: A scalar input that denotes the number of blocks in which to assign the population units to be sub-sampled in the first stage of sampling. Defaults to L = 10.
R: A scalar input that denotes the number of blocks to sample from the L blocks, with probability proportional to the average variance of member functions in each block. Defaults to R = 8.
I: A scalar input denoting the number of strata to form within each block. Population units are divided into equally-sized strata based on variance quantiles. Defaults to I = 4.
n: Sample size to be generated. An informative sample is taken under either a single-stage (two_stage = FALSE) or 2-stage (two_stage = TRUE) design, along with a non-informative, iid sample of the same size (n) from the finite population (generated with (clustering = TRUE) or without clustering). Defaults to n = 750.
noise_to_signal: A numeric input in the interval, (0,1), denoting the ratio of noise variance to the average variance of the generated functions, bb_i. Defaults to noise_to_signal = 0.05
incl_gradient: A character input that sets how steeply the stratum inclusion probabilities increase from the lowest to the highest stratum. If set to "high", the inclusion probabilities are proportional to the exponential of the cluster number. If set to "medium", the inclusion probabilities are proportional to the square of the cluster number. Note that population units are assigned to strata by progressively increasing variance quantiles. The incl_gradient setting is used both for two_stage = TRUE, in which case it is applied to strata within each block, and for two_stage = FALSE, in which case a single-stage stratified random sample is drawn. Defaults to incl_gradient = "medium".
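
The stratification and inclusion gradient described by I and incl_gradient can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's internal code: the "number" that drives the gradient is taken to be the stratum index 1, ..., I, the stand-in functions are plain Gaussian noise with unit-specific scales, and stratum_share() is a hypothetical helper.

```r
set.seed(42)
N <- 1000; T <- 15; I <- 4; n <- 300   # illustrative sizes

# Stand-in N x T matrix of unit functions with unit-specific scales
scale_i <- rgamma(N, shape = 2, rate = 2)
bb <- matrix(rnorm(N * T), N, T) * scale_i   # row i is scaled by scale_i[i]

# Per-unit variance, then I equally sized strata from variance quantiles
v <- apply(bb, 1, var)
stratum <- cut(v, breaks = quantile(v, probs = seq(0, 1, length.out = I + 1)),
               include.lowest = TRUE, labels = FALSE)

# Hypothetical helper: share of the sample allotted to each stratum
stratum_share <- function(I, incl_gradient = c("medium", "high")) {
  incl_gradient <- match.arg(incl_gradient)
  k <- seq_len(I)
  w <- if (incl_gradient == "high") exp(k) else k^2
  w / sum(w)
}

p_medium <- stratum_share(I, "medium")
p_high   <- stratum_share(I, "high")

# Stratified random sample without replacement under the "medium" gradient
n_h <- round(n * p_medium)
idx <- unlist(lapply(seq_len(I), function(h) {
  pool <- which(stratum == h)
  pool[sample.int(length(pool), n_h[h])]
}))

# The sample over-represents high-variance units relative to the population
print(c(population = mean(v), sample = mean(v[idx])))
```

Since the strata are equally sized, allotting the sample in proportion to the squared (or exponentiated) stratum index makes the within-stratum inclusion probabilities follow the same gradient.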

Author

Terrance Savitsky tds151@gmail.com

Examples

if (FALSE) {
library(growfunctions)
## use gen_informative_sample() to generate an 
## N X T population drawn from a dependent GP
## By default, 3 clusters are used to generate 
## the population.
## A single stage stratified random sample of size n 
## is drawn from the population using I = 4 strata. 
## The resulting sample is informative in that the 
## distribution for this sample is
## different from the population from which 
## it was drawn because the strata inclusion
## probabilities are proportional to a feature 
## of the response, y (in this case, the variance).
## The stratified random sample over-samples 
## large variance strata).
## (The user may also select a 2-stage 
## sample with the first stage
## sampling "blocks" of the population and 
## the second stage sampling strata within blocks). 
dat_sim        <- gen_informative_sample(N = 10000, 
                                n = 500, T = 10,
                                noise_to_signal = 0.1)

## extract n x T observed sample under informative
## stratified sampling design.
y_obs                       <- dat_sim$y_obs
T                           <- ncol(y_obs)
}
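
The 2-stage design mentioned in the comments above (blocks sampled first, then strata within the sampled blocks) can be illustrated with a base R sketch. It does not call the package; the gamma-distributed stand-in variances, the fixed per-block sample size n_per_block and the helper draw_block() are hypothetical, and how gen_informative_sample() actually spreads n across blocks is not assumed here.

```r
set.seed(7)
N <- 2000; L <- 10; R <- 8; I <- 4   # illustrative sizes
n_per_block <- 30                    # hypothetical per-block sample size

# Stand-in for the variance of each population unit's function
v <- rgamma(N, shape = 2, rate = 1)

# Stage 1: form L blocks from variance quantiles, then sample R of them
# with selection weights given by the average variance in each block.
# (sample() without replacement draws successively, so the realised
# inclusion probabilities are only approximately proportional.)
block <- cut(v, breaks = quantile(v, probs = seq(0, 1, length.out = L + 1)),
             include.lowest = TRUE, labels = FALSE)
block_avg <- tapply(v, block, mean)
sampled_blocks <- sample(seq_len(L), size = R, prob = block_avg)

# Stage 2: within each sampled block, form I strata from variance quantiles
# and allot the block's sample in proportion to the squared stratum index.
share <- seq_len(I)^2 / sum(seq_len(I)^2)
n_h <- round(n_per_block * share)

draw_block <- function(b) {
  units <- which(block == b)
  s <- cut(v[units],
           breaks = quantile(v[units], probs = seq(0, 1, length.out = I + 1)),
           include.lowest = TRUE, labels = FALSE)
  unlist(lapply(seq_len(I), function(h) {
    pool <- units[s == h]
    pool[sample.int(length(pool), n_h[h])]
  }))
}

idx <- unlist(lapply(sampled_blocks, draw_block))
length(idx)   # total number of sampled units
```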