rescaled.bootstrap: Draw bootstrap replicates

Description

Draw bootstrap replicates from survey data using either the rescaled bootstrap for stratified multistage sampling, presented by J. Preston (2009) or the Rao-Wu boostrap by J. N. K. Rao and C. F. J. Wu (1988)

Usage

rescaled.bootstrap(
  dat,
  method = c("Preston", "Rao-Wu"),
  REP = 1000,
  strata = "DB050>1",
  cluster = "DB060>DB030",
  fpc = "N.cluster>N.households",
  single.PSU = c("merge", "mean"),
  return.value = c("data", "replicates"),
  run.input.checks = TRUE,
  already.selected = NULL,
  seed = NULL
)

Value

returns the complete data set including the bootstrap replicates or just the bootstrap replicates, depending on return.value="data" or return.value="replicates" respectively.

Arguments

dat: either data frame or data table containing the survey sample
method: for bootstrap replicates, either "Preston" or "Rao-Wu"
REP: integer indicating the number of bootstraps to be drawn
strata: string specifying the column name in dat that is used for stratification. For multistage sampling multiple column names can be specified by strata=c("strata1","strata2","strata3") or strata=c("strata1>strata2>strata3"). See Details for more information.
cluster: string specifying the column name in dat that is used for clustering. For instance given a household sample the column containing the household ID should be supplied. For multistage sampling multiple column names can be specified by cluster=c("cluster1","cluster2","cluster3") or cluster=c("cluster1>cluster2>cluster3"). See Details for more information.
fpc: string specifying the column name in dat that contains the number of PSUs at the first stage. For multistage sampling the number of PSUs at each stage must be specified by strata=c("fpc1","fpc2","fpc3") or strata=c("fpc1>fpc2>fpc3").
single.PSU: either "merge" or "mean" defining how single PSUs need to be dealt with. For single.PSU="merge" single PSUs at each stage are merged with the strata or cluster with the next least number of PSUs. If multiple of those exist one will be select via random draw. For single.PSU="mean" single PSUs will get the mean over all bootstrap replicates at the stage which did not contain single PSUs.
return.value: either "data", "replicates" and/or "selection" specifying the return value of the function. For "data" the survey data is returned as class data.table, for "replicates" only the bootstrap replicates are returned as data.table. For "selection" list of data.tables with length of length(strata) is returned containing 1:REP 0-1 columns indicating if a PSU was selected for each sampling stage.
run.input.checks: logical, if TRUE the input will be checked before applying the bootstrap procedure
already.selected: list of data.tables or NULL where each data.table contains columns in cluster, strata and additionally 1:REP columns containing 0-1 values which indicate if a PSU was selected for each bootstrap replicate. Each of the data.tables corresponds to one of the sampling stages. First entry in the list corresponds to the first sampling stage and so on.
seed: integer specifying the seed for the random number generator.

Author

Johannes Gussenbauer, Eileen Vattheuer, Statistics Austria

Details

For specifying multistage sampling designs the column names in strata,cluster and fpc need to be seperated by ">".
For multistage sampling the strings are read from left to right meaning that the first vector entry or column name before the first ">" is taken as the column for stratification/clustering/number of PSUs at the first and the last vector entry or column after the last ">" is taken as the column for stratification/clustering/number of PSUs at the last stage. If for some stages the sample was not stratified or clustered one must specify this by "1" or "I", e.g. strata=c("strata1","I","strata3") or strata=c("strata1>I>strata3") if there was no stratification at the second stage or cluster=c("cluster1","cluster2","I") respectively cluster=c("cluster1>cluster2>I") if there were no clusters at the last stage.
The number of PSUs at each stage is not calculated internally and must be specified for any sampling design. For single stage sampling using stratification this can usually be done by adding over all sample weights of each PSU by each strata-code.
Spaces in each of the strings will be removed, so if column names contain spaces they should be renamed before calling this procedure!
If already.selected is supplied the sampling of bootstrap replicates considers if speficif PSUs have already been selected by a previous survey wave. For a specific strata and cluster this could lead to more than floor(n/2) records selected. In that case records will be de-selected such that floor(n/2) records, with n as the total number of records, are selected for each strata and cluster. This parameter ist mostly used by draw.bootstrap in order to consider the rotation of the sampling units over time.

References

Preston, J. (2009). Rescaled bootstrap for stratified multistage sampling. Survey Methodology. 35. 227-234. Rao, J. N. K., and C. F. J. Wu. (1988). Resampling Inference with Complex Survey Data. Journal of the American Statistical Association 83 (401): 231–41.

Examples

Run this code


library(surveysd)
library(data.table)
setDTthreads(1)
set.seed(1234)
eusilc <- demo.eusilc(n = 1,prettyNames = TRUE)

eusilc[,N.households := sum(pWeight[!duplicated(hid)]),
                                  by = .(region, year)]
eusilc.bootstrap <- rescaled.bootstrap(eusilc, REP = 10, strata = "region",
                                       cluster = "hid", fpc = "N.households")

eusilc[,new_strata := paste(region,hsize,sep="_")]
eusilc[,N.housholds := sum(pWeight[!duplicated(hid)]), by = new_strata]
eusilc.bootstrap <- rescaled.bootstrap(eusilc, REP = 10, strata = c("new_strata"),
                                       cluster = "hid", fpc = "N.households")

Run the code above in your browser using DataLab