varDT: Variance approximation with Deville-Tillé (2005) formula

Description

varDT estimates the variance of the estimator of a total in the case of a balanced sampling design with equal or unequal probabilities. Without balancing variables, it falls back to Deville's (1993) classical approximation. Without balancing variables and with equal probabilities, it falls back to the classical Horvitz-Thompson variance estimator for the total in the case of simple random sampling. Stratification is natively supported.

var_srs is a convenience wrapper for the (stratified) simple random sampling case.
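
The fall-back to simple random sampling described above can be checked numerically. The sketch below is written in base R and does not call gustave; `deville_reduced()` is a hypothetical helper, and the form assumed for the coefficients, c_k = (1 - pi_k) * n / (n - 1), is an assumption about the reduced formula rather than a copy of the package source.

```r
# Reduced form without balancing variables (fixed-size design).
# Assumption: c_k = (1 - pi_k) * n / (n - 1); illustrative only.
deville_reduced <- function(y, pik) {
  n <- length(y)
  ck <- (1 - pik) * n / (n - 1)
  z <- y / pik                     # expanded values y_k / pi_k
  zhat <- sum(ck * z) / sum(ck)    # c_k-weighted mean of the expanded values
  sum(ck * (z - zhat)^2)
}

set.seed(1)
N <- 1000
n <- 100
y <- rnorm(n)
pik <- rep(n / N, n)               # equal probabilities

v_reduced <- deville_reduced(y, pik)
v_srs <- N^2 * (1 - n / N) * var(y) / n   # classical SRS variance estimator
all.equal(v_reduced, v_srs)
```

With equal probabilities the coefficients are constant, so the weighted mean becomes the plain mean and the expression collapses to the textbook formula.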

Usage

varDT(y = NULL, pik, x = NULL, strata = NULL, w = NULL,
  collinearity.check = NULL, precalc = NULL)
var_srs(y, pik, strata = NULL, w = NULL, precalc = NULL)

Arguments

y

A numerical matrix of the variable(s) of interest, i.e. those for which the variance of the estimated total is to be computed. May be a Matrix::TsparseMatrix.

pik

A numerical vector of first-order inclusion probabilities.

x

An optional numerical matrix of balancing variable(s). May be a Matrix::TsparseMatrix.

strata

An optional categorical vector (factor or character) when variance estimation is to be conducted within strata.

w

An optional numerical vector of row weights (see Details).

collinearity.check

A boolean (TRUE or FALSE) or NULL indicating whether to perform a check for collinearity or not (see Details).

precalc

A list of pre-calculated results (see Details).

Value

If y is not NULL (calculation step): the estimated variances, as a numerical vector whose length is the number of columns of y.
If y is NULL (pre-calculation step): a list containing pre-calculated data:
- pik: the numerical vector of first-order inclusion probabilities.
- A: the numerical matrix denoted A in (Deville, Tillé, 2005).
- ck: the numerical vector denoted ck2 in (Deville, Tillé, 2005).
- inv: the inverse of A %*% Matrix::Diagonal(x = ck) %*% t(A)
- diago: the diagonal term of the variance estimator

Difference with varest from package sampling

varDT differs from sampling::varest in several ways:

The formula implemented in varDT is more general and encompasses balanced sampling.
Even in its reduced form (without balancing variables), the formula implemented in varDT differs slightly from the one implemented in sampling::varest. Caron, Deville and Sautory (1998, pp. 7-8) compare the two estimators (sampling::varest implements V_2, varDT implements V_1).
varDT introduces several optimizations:
- matrixwise operations allow the variance to be estimated for several variables of interest at once.
- Matrix::TsparseMatrix capability and the native integration of stratification yield significant performance gains.
- the ability to pre-calculate some time-consuming operations speeds up the estimation at execution time.
varDT does not natively implement the calibration estimator (i.e. the sampling variance estimator that takes into account the effect of calibration). In the context of the gustave package, rescal could be called before varDT in order to achieve the same result.

Details

varDT aims at being the workhorse of most variance estimation conducted with the gustave package. It may be used to estimate the variance of the estimator of a total in the case of (stratified) simple random sampling, (stratified) unequal probability sampling and (stratified) balanced sampling. The native integration of stratification based on Matrix::TsparseMatrix allows for significant performance gains compared to higher level vectorizations (*apply especially).
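
One way stratification can be folded into a single matrix computation is to give each stratum its own indicator column, so that all strata are handled in one pass instead of a loop. The sketch below illustrates the idea in base R with dense matrices (the package is described as relying on Matrix::TsparseMatrix instead). The helper `reduced()` and the assumed coefficients c_k = (1 - pi_k) * n_h / (n_h - 1) are illustrative assumptions, not the package's code.

```r
set.seed(1)
n <- 60
pik <- runif(n, 0.1, 0.9)
y <- rnorm(n)
strata <- factor(sample(c("a", "b", "c"), n, replace = TRUE))

# Stratum-by-stratum computation (hypothetical helper, assumed c_k)
reduced <- function(y, pik) {
  n <- length(y)
  ck <- (1 - pik) * n / (n - 1)
  z <- y / pik
  sum(ck * (z - sum(ck * z) / sum(ck))^2)
}
v_loop <- sum(sapply(levels(strata), function(h) {
  reduced(y[strata == h], pik[strata == h])
}))

# Same quantity in one pass, using one indicator column per stratum
ind <- model.matrix(~ strata - 1)                    # n x H matrix of 0/1
n_h <- as.vector(table(strata)[as.character(strata)])  # stratum size per unit
ck <- (1 - pik) * n_h / (n_h - 1)
z <- y / pik
A <- t(ind)                                          # H x n
inv <- solve(A %*% (ck * t(A)))                      # inverse of A diag(ck) A'
zhat <- t(A) %*% inv %*% (A %*% (ck * z))            # stratum-wise weighted means
v_matrix <- sum(ck * (z - zhat)^2)

all.equal(v_loop, v_matrix)
```

Because the indicator matrix is mostly zeros, a sparse representation avoids storing and multiplying those zeros, which is where the performance gain over `*apply`-style loops would come from.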

Several time-consuming operations (e.g. collinearity-check, matrix inversion) can be pre-calculated in order to speed up the estimation at execution time. This is determined by the value of the parameters y and precalc:

If y is not NULL and precalc is NULL: on-the-fly calculation (no pre-calculation).
If y is NULL and precalc is NULL: pre-calculation, whose results are returned as a list of pre-calculated data.
If y is not NULL and precalc is not NULL: calculation using the list of pre-calculated data.
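
The split between pre-calculation and calculation can be pictured with two small functions. This is a base R sketch under stated assumptions: `precompute()` and `estimate()` are hypothetical helpers (not gustave functions), A is taken to be t(x / pik), and c_k is assumed to be (1 - pi_k) * n / (n - p). Only the definition of inv follows the Value section directly.

```r
# Step 1: everything that depends only on the design (pik, x)
precompute <- function(pik, x) {
  n <- length(pik)
  p <- ncol(x)
  A <- t(x / pik)                      # p x n (assumed definition of A)
  ck <- (1 - pik) * n / (n - p)        # assumed form of c_k
  inv <- solve(A %*% (ck * t(A)))      # inverse of A diag(ck) A'
  list(pik = pik, A = A, ck = ck, inv = inv)
}

# Step 2: the part that depends on y, for one or several columns at once
estimate <- function(y, pre) {
  z <- as.matrix(y) / pre$pik
  zhat <- t(pre$A) %*% pre$inv %*% (pre$A %*% (pre$ck * z))
  colSums(pre$ck * (z - zhat)^2)
}

set.seed(1)
n <- 200
pik <- runif(n, 0.1, 0.9)
x <- cbind(pik, rnorm(n))              # two balancing variables
y <- matrix(rnorm(n * 3), ncol = 3)    # three variables of interest

pre <- precompute(pik, x)              # done once
v_all <- estimate(y, pre)              # three variances in one call
v_one <- estimate(y[, 2], pre)         # reuses the same pre-calculated list
all.equal(v_all[2], v_one)
```

The matrix inversion happens once in the first step, so repeated calls on new variables of interest only pay for matrix products.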

If collinearity.check is NULL, a test for collinearity in the independent variables (x) is conducted only if det(t(x) %*% x) == 0.
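
To make the trigger condition concrete, the base R sketch below builds a small matrix with an exactly collinear column and evaluates the determinant of t(x) %*% x. How varDT then carries out its collinearity test is not shown here; the `qr()` rank check is a swapped-in, commonly used alternative, included only for comparison.

```r
# Third column is exactly twice the second one
x <- cbind(1, 1:5, 2 * (1:5))

# Determinant of t(x) %*% x: zero in exact arithmetic for collinear columns,
# but floating-point rounding may leave a tiny non-zero value
d <- det(crossprod(x))

# Alternative diagnostic (not necessarily what varDT does): numerical rank
r <- qr(x)$rank
is_collinear <- r < ncol(x)
is_collinear
```

Setting collinearity.check explicitly to TRUE or FALSE bypasses the determinant-based decision altogether.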

w is a row weight used at the final summation step. It is useful when varDT or var_srs is used on the second stage of a two-stage sampling design, when applying the Rao (1975) formula.
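
The role of w can be sketched in base R on the reduced formula (no balancing variables). `weighted_reduced()` is a hypothetical helper; the assumed coefficients c_k = (1 - pi_k) * n / (n - 1) and the exact place where w multiplies each term are assumptions made for illustration. The values that w should take under the Rao (1975) formula are not shown.

```r
# Hypothetical helper: w only enters the final sum
weighted_reduced <- function(y, pik, w = rep(1, length(y))) {
  n <- length(y)
  ck <- (1 - pik) * n / (n - 1)        # assumed form of c_k
  z <- y / pik
  zhat <- sum(ck * z) / sum(ck)        # not affected by w
  sum(w * ck * (z - zhat)^2)           # row weights applied at summation
}

set.seed(1)
n <- 50
pik <- runif(n, 0.1, 0.9)
y <- rnorm(n)

v_unweighted <- weighted_reduced(y, pik)
v_doubled <- weighted_reduced(y, pik, w = rep(2, n))
all.equal(v_doubled, 2 * v_unweighted)
```

Because the weights act after the residuals are formed, a constant weight simply rescales the result, while unit-specific weights reweight each unit's contribution.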

References

Caron, N., Deville, J.-C., Sautory, O. (1998), Estimation de précision de données issues d'enquêtes : document méthodologique sur le logiciel POULPE, Insee working paper, n°9806.

Deville, J.-C. (1993), Estimation de la variance pour les enquêtes en deux phases, Manuscript, INSEE, Paris.

Deville, J.-C., Tillé, Y. (2005), "Variance approximation under balanced sampling", Journal of Statistical Planning and Inference, 128(2), 569-591.

Rao, J.N.K. (1975), "Unbiased variance estimation for multistage designs", Sankhya, C n°37.

Examples

# NOT RUN {
library(sampling)
set.seed(1)

# Simple random sampling case
N <- 1000
n <- 100
y <- rnorm(N)[as.logical(srswor(n, N))]
pik <- rep(n/N, n)
varDT(y, pik)
sampling::varest(y, pik = pik)
N^2 * (1 - n/N) * var(y) / n

# Unequal probability sampling case
N <- 1000
n <- 100
pik <- runif(N)
s <- as.logical(UPsystematic(pik))
y <- rnorm(N)[s]
pik <- pik[s]
varDT(y, pik)
varest(y, pik = pik)
# The small difference is expected (see above).

# Balanced sampling case
N <- 1000
n <- 100
pik <- runif(N)
x <- matrix(rnorm(N*3), ncol = 3)
s <- as.logical(samplecube(x, pik))
y <- rnorm(N)[s]
pik <- pik[s]
x <- x[s, ]
varDT(y, pik, x)

# Balanced sampling case (variable of interest
# among the balancing variables)
N <- 1000
n <- 100
pik <- runif(N)
y <- rnorm(N)
x <- cbind(matrix(rnorm(N*3), ncol = 3), y)
s <- as.logical(samplecube(x, pik))
y <- y[s]
pik <- pik[s]
x <- x[s, ]
varDT(y, pik, x)
# As expected, the total of the variable of interest is perfectly estimated.

# strata argument
n <- 100
H <- 2
pik <- runif(n)
y <- rnorm(n)
strata <- letters[sample.int(H, n, replace = TRUE)]
all.equal(
 varDT(y, pik, strata = strata)
 , varDT(y[strata == "a"], pik[strata == "a"]) + varDT(y[strata == "b"], pik[strata == "b"])
)

# precalc argument
n <- 1000
H <- 50
pik <- runif(n)
y <- rnorm(n)
strata <- sample.int(H, n, replace = TRUE)
precalc <- varDT(y = NULL, pik, strata = strata)
identical(
 varDT(y, precalc = precalc)
 , varDT(y, pik, strata = strata)
)

# }
