survey_var: Calculate the population variance and its variation using survey methods

Description

Calculate population variance from complex survey data. A wrapper around svyvar. survey_var should always be called from summarise.

Usage

survey_var(x, na.rm = FALSE, vartype = c("se", "ci", "var"),
  level = 0.95, df = Inf, .svy = current_svy(), ...)
survey_sd(x, na.rm = FALSE, .svy = current_svy(), ...)

Arguments

x

A variable or expression, or empty

na.rm

A logical value to indicate whether missing values should be dropped

vartype

Report variability as one or more of: standard error ("se", default), confidence interval ("ci"), or variance ("var") (coefficient of variation not available).

level

(For vartype = "ci" only) A single number or vector of numbers indicating the confidence level.

df

(For vartype = "ci" only) A numeric value indicating the degrees of freedom for the t-distribution. The default (Inf) is equivalent to using the normal distribution; for population variance statistics there is little reason to use any other value (see Details).

.svy

A tbl_svy object. When called from inside a summarize function the default automatically sets the survey to the current survey.

...

Ignored

Details

Be aware that confidence intervals for the population variance statistic are computed by the survey package using a t-distribution or, with df = Inf, a normal distribution (i.e. symmetric distributions). This can be a very poor approximation if any of the following conditions holds:

there are few sampling design degrees of freedom,
the analyzed variable isn't normally distributed,
sampling probabilities vary widely across the survey design.

Because of this, be very careful when using confidence intervals for population variance statistics, especially when analyzing subsets of data or using grouped survey objects.

The sampling distribution of the variance statistic is in general asymmetric (chi-squared in the case of simple random sampling of a normally distributed variable). If the analyzed variable isn't normally distributed, or sampling probabilities vary widely across the survey design (or both), it may converge to normality only very slowly as the number of survey design degrees of freedom grows.
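
To make the asymmetry concrete, here is a short worked comparison for the textbook case only: simple random sampling of size n from a normal population, with sample variance s^2. The plug-in standard error used for the symmetric interval below is an assumption that follows from normality; it is not necessarily the estimator the survey package applies to a complex design.

```latex
% Exact sampling distribution under SRS from N(\mu, \sigma^2)
\frac{(n-1)\,s^2}{\sigma^2} \sim \chi^2_{n-1}

% Exact (asymmetric) 1-\alpha interval; \chi^2_{\nu,p} denotes the p-quantile
\left[ \frac{(n-1)\,s^2}{\chi^2_{n-1,\,1-\alpha/2}},\;
       \frac{(n-1)\,s^2}{\chi^2_{n-1,\,\alpha/2}} \right]

% Symmetric normal-approximation interval, using
% \operatorname{Var}(s^2) = 2\sigma^4/(n-1) with s^2 plugged in for \sigma^2
s^2 \pm z_{1-\alpha/2}\, s^2 \sqrt{\frac{2}{n-1}}

% Example: n = 11 (10 degrees of freedom), \alpha = 0.05,
% \chi^2_{10,\,0.025} \approx 3.247, \quad \chi^2_{10,\,0.975} \approx 20.483
\text{exact: } \left[\, 0.488\,s^2,\; 3.080\,s^2 \,\right]
\qquad
\text{symmetric: } \left[\, 0.123\,s^2,\; 1.877\,s^2 \,\right]
```

With only 10 degrees of freedom the two intervals disagree substantially: the symmetric one extends much too close to zero and stops well short of the exact upper bound.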

Examples

# NOT RUN {
library(survey)
library(srvyr)
data(api)

dstrata <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

dstrata %>%
  summarise(api99_var = survey_var(api99),
            api99_sd = survey_sd(api99))

dstrata %>%
  group_by(awards) %>%
  summarise(api00_var = survey_var(api00),
            api00_sd = survey_sd(api00))

# standard error and variance of the population variance estimator
# are available via the vartype argument
# (but not for the population standard deviation estimator)
dstrata %>%
  summarise(api99_variance = survey_var(api99, vartype = c("se", "var")))
# }