rake_to_benchmarks: Re-weight data to match population benchmarks, using raking or post-stratification

Description

Adjusts weights in the data to ensure that estimated population totals for grouping variables match known population benchmarks. If there is only one grouping variable, simple post-stratification is used. If there are multiple grouping variables, raking (also known as iterative post-stratification) is used.

Usage

rake_to_benchmarks(
  survey_design,
  group_vars,
  group_benchmark_vars,
  max_iterations = 100,
  epsilon = 5e-06
)

Value

A survey design object with raked or post-stratified weights.

Arguments

survey_design: A survey design object created with the survey package.
group_vars: Names of grouping variables in the data dividing the sample into groups for which benchmark data are available. These variables cannot have any missing values.
group_benchmark_vars: Names of group benchmark variables in the data corresponding to group_vars. For each category of a grouping variable, the group benchmark variable gives the population benchmark (i.e. population size) for that category.
max_iterations: If there are multiple grouping variables, then raking is used rather than post-stratification. The parameter max_iterations controls the maximum number of iterations to use in raking.
epsilon: If raking is used, convergence for a given margin is declared if the maximum change in a re-weighted total is less than epsilon times the total sum of the original weights in the design.

Details

Raking adjusts the weight assigned to each sample member so that, after reweighting, the weighted sample percentages for population subgroups match their known population percentages. In a sense, raking causes the sample to more closely resemble the population in terms of variables for which population sizes are known.
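
For a single grouping variable, the adjustment can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the toy data frame, its columns `group` and `base_weight`, and the benchmark totals are all made up for this sketch and do not come from the nrba package. Each weight is multiplied by the ratio of its group's benchmark to the group's current weighted total.

```r
# Toy respondent data (hypothetical values, for illustration only)
toy <- data.frame(
  group = c("A", "A", "A", "B", "B"),
  base_weight = c(10, 10, 10, 10, 10),
  stringsAsFactors = FALSE
)

# Assumed known population sizes for each group
benchmarks <- c(A = 40, B = 60)

# Current weighted total for each group, as a named numeric vector
weighted_totals <- vapply(split(toy$base_weight, toy$group), sum, numeric(1))

# Adjustment factor: benchmark divided by current weighted total
adjustment <- benchmarks[names(weighted_totals)] / weighted_totals

# Apply each respondent's group-specific factor
toy$adjusted_weight <- toy$base_weight * as.numeric(adjustment[toy$group])

# Weighted shares of each group before and after the adjustment
shares_before <- prop.table(vapply(split(toy$base_weight, toy$group), sum, numeric(1)))
shares_after <- prop.table(vapply(split(toy$adjusted_weight, toy$group), sum, numeric(1)))
print(shares_before)
print(shares_after)
```

After the adjustment, the weighted group totals equal the assumed benchmarks, so the weighted shares match the population shares.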

Raking can help reduce nonresponse bias that arises when some groups are overrepresented in the responding sample relative to their population size. If the population subgroups differ systematically on the outcome variables of interest, raking can also reduce sampling variances. However, when the subgroups do not differ on those outcome variables, raking may increase sampling variances.

There are two basic requirements for raking.

Basic Requirement 1 - Values of the grouping variable(s) must be known for all respondents.
Basic Requirement 2 - The population size of each group must be known (or precisely estimated).

When there is effectively only one grouping variable (though this variable can be defined as a combination of other variables), raking amounts to simple post-stratification. For example, simple post-stratification would be used if the grouping variable is "Age x Sex x Race" and the population size of each combination of age, sex, and race is known. The method of "iterative post-stratification" (also known as "iterative proportional fitting") is used when there are multiple grouping variables and population sizes are known for each grouping variable but not for combinations of grouping variables. For example, iterative proportional fitting would be necessary if population sizes are known for age groups and for gender categories but not for combinations of age groups and gender categories.
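
The iterative procedure can be sketched in base R as follows. This is a minimal illustration, not the implementation used by rake_to_benchmarks: the helper `rake_sketch`, its `margins` argument, and the toy data are hypothetical, and the stopping rule is one reading of the epsilon criterion described under Arguments. Each pass post-stratifies on one grouping variable at a time and repeats until no margin needs a meaningful correction.

```r
# Hypothetical helper illustrating iterative proportional fitting.
# `margins` is a named list: each name is a column of `data`, and each
# element is a named vector of population totals for that column's categories.
rake_sketch <- function(data, weights, margins,
                        max_iterations = 100, epsilon = 5e-06) {
  original_total <- sum(weights)
  for (iteration in seq_len(max_iterations)) {
    max_change <- 0
    for (variable in names(margins)) {
      groups <- as.character(data[[variable]])
      current_totals <- vapply(split(weights, groups), sum, numeric(1))
      targets <- margins[[variable]][names(current_totals)]
      # Record how far this margin was from its benchmarks before adjusting
      max_change <- max(max_change, abs(targets - current_totals))
      # Post-stratify on this one variable
      weights <- weights * as.numeric((targets / current_totals)[groups])
    }
    # Stop once the largest correction is small relative to the original weights
    if (max_change < epsilon * original_total) break
  }
  weights
}

# Toy sample with two grouping variables (made-up values)
toy <- data.frame(
  age = c("Young", "Young", "Old", "Old", "Old", "Young"),
  sex = c("Female", "Male", "Female", "Male", "Female", "Female"),
  stringsAsFactors = FALSE
)

# Benchmarks are known for each variable separately, not for their combinations
raked <- rake_sketch(
  data = toy,
  weights = rep(10, nrow(toy)),
  margins = list(
    age = c(Young = 45, Old = 55),
    sex = c(Female = 60, Male = 40)
  )
)

# Weighted totals for each margin after raking
print(tapply(raked, toy$age, sum))
print(tapply(raked, toy$sex, sum))
```

With a single entry in `margins`, the loop reduces to one post-stratification step, which mirrors the distinction drawn above.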

Examples

# Load the survey data

data(involvement_survey_srs, package = "nrba")

# Specify the known population benchmarks for each grouping variable
population_benchmarks <- list(
  "PARENT_HAS_EMAIL" = data.frame(
    PARENT_HAS_EMAIL = c("Has Email", "No Email"),
    PARENT_HAS_EMAIL_POP_BENCHMARK = c(17036, 2964)
  ),
  "STUDENT_RACE" = data.frame(
    STUDENT_RACE = c(
      "AM7 (American Indian or Alaska Native)", "AS7 (Asian)",
      "BL7 (Black or African American)",
      "HI7 (Hispanic or Latino Ethnicity)", "MU7 (Two or More Races)",
      "PI7 (Native Hawaiian or Other Pacific Islander)",
      "WH7 (White)"
    ),
    STUDENT_RACE_POP_BENCHMARK = c(206, 258, 3227, 1097, 595, 153, 14464)
  )
)

# Add the population benchmarks as variables in the data
involvement_survey_srs <- merge(
  x = involvement_survey_srs,
  y = population_benchmarks$PARENT_HAS_EMAIL,
  by = "PARENT_HAS_EMAIL"
)
involvement_survey_srs <- merge(
  x = involvement_survey_srs,
  y = population_benchmarks$STUDENT_RACE,
  by = "STUDENT_RACE"
)

# Create a survey design object
library(survey)

survey_design <- svydesign(
  weights = ~BASE_WEIGHT,
  id = ~UNIQUE_ID,
  fpc = ~N_STUDENTS,
  data = involvement_survey_srs
)

# Subset data to only include respondents
survey_respondents <- subset(
  survey_design,
  RESPONSE_STATUS == "Respondent"
)

# Rake to the benchmarks
raked_survey_design <- rake_to_benchmarks(
  survey_design = survey_respondents,
  group_vars = c("PARENT_HAS_EMAIL", "STUDENT_RACE"),
  group_benchmark_vars = c(
    "PARENT_HAS_EMAIL_POP_BENCHMARK",
    "STUDENT_RACE_POP_BENCHMARK"
  )
)

# Inspect estimates from respondents, before and after raking

svymean(
  x = ~PARENT_HAS_EMAIL,
  design = survey_respondents
)
svymean(
  x = ~PARENT_HAS_EMAIL,
  design = raked_survey_design
)

svymean(
  x = ~WHETHER_PARENT_AGREES,
  design = survey_respondents
)
svymean(
  x = ~WHETHER_PARENT_AGREES,
  design = raked_survey_design
)