county_turnout: U.S. county-level turnout and demographic context (MIT Election Lab 2018 Election Analysis Dataset + additions)

Description

A U.S. county-level dataset with voter turnout and sociodemographic covariates.

Usage

data(county_turnout)

Format

A tibble with 3,107 rows and 22 variables:

state: State name.
county: County name.
fips: County FIPS code.
turnout: Observed turnout (proportion). Calculated as total votes cast divided by total population (not voting-age population).
total_population: Total county population.
nonwhite_pct: Percent non-white population.
foreignborn_pct: Percent foreign-born population.
female_pct: Percent female population.
age29andunder_pct: Percent of population aged 29 or under.
age65andolder_pct: Percent of population aged 65 or older.
median_hh_inc: Median household income.
clf_unemploy_pct: Percent unemployed in the civilian labor force.
lesscollege_pct: Percent with less than college education.
lesshs_pct: Percent with less than high school education.
rural_pct: Percent rural.
ruralurban_cc: Rural–urban continuum code.
predicted_turnout: LOO-CV random-forest prediction of `turnout` (see Details).
division: U.S. Census division.
region: U.S. Census region.
geo_group: Additional coarse geographic grouping variable (added).
longitude: County centroid longitude (added).
latitude: County centroid latitude (added).

Details

The dataset is based on the MIT Election Lab "2018 Election Analysis dataset" file, with four additions: (1) `turnout`, calculated as the number of votes cast divided by the total population, (2) `geo_group`, a coarse geographic grouping variable for the counties, (3) county centroid coordinates (`longitude`, `latitude`), and (4) `predicted_turnout`.

The variable `predicted_turnout` is generated using leave-one-out cross-validation (LOO-CV). For each county, a random forest model is fit on the remaining counties with `turnout` as the outcome and all available *non-geographic* covariates as predictors. The fitted model is then used to predict turnout for the held-out county.

Concretely, identifiers and geographic variables (`state`, `county`, `fips`, `division`, `region`, `geo_group`, `longitude`, `latitude`) are excluded from the predictor set so that no spatial information leaks into the predictions.

Below is example code (using `foreach`) to reproduce `predicted_turnout`. LOO-CV is computationally expensive because it requires one model fit per county; parallel execution is recommended.

library(dplyr)
library(ranger)
library(foreach)
library(pintervals)

dat <- county_turnout # replace with your object name

# Choose predictors: all covariates except geographic/id variables.
# Any existing predicted_turnout column is dropped as well, so that it
# is not used as a predictor of turnout.
dat2 <- dat |>
  select(
    -c(state, county, fips, division, region, geo_group, longitude, latitude),
    -any_of("predicted_turnout")
  )

set.seed(101010) # The meaning of life in binary

# %do% runs sequentially; with a registered parallel backend,
# %dopar% can be used instead.
pred_loo <- foreach(.i = seq_len(nrow(dat)), .final = unlist) %do% {
  train <- dat2[-.i, , drop = FALSE]
  test  <- dat2[ .i, , drop = FALSE]

  fit <- ranger(formula = turnout ~ ., data = train)

  predict(fit, data = test)$predictions[[1]]
}

dat <- dat |>
  mutate(predicted_turnout = pred_loo)
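Since parallel execution is recommended, the loop above could be parallelised roughly as follows. This is a minimal sketch, not the code used to build the dataset: the helper name `loo_rf_parallel`, the use of the `doParallel` backend, the per-iteration seeds, and the default of two cores are all assumptions made for illustration. Because the seeding differs from the sequential loop, the values need not match the shipped `predicted_turnout` exactly.

```r
library(foreach)
library(doParallel)
library(ranger)

# Hypothetical helper (not part of any package): LOO-CV random-forest
# predictions of `outcome`, using every other column of `data` as a predictor.
loo_rf_parallel <- function(data, outcome = "turnout", n_cores = 2, seed = 101010) {
  cl <- parallel::makeCluster(n_cores)
  # Always stop the workers and fall back to the sequential backend on exit.
  on.exit({
    parallel::stopCluster(cl)
    foreach::registerDoSEQ()
  }, add = TRUE)
  doParallel::registerDoParallel(cl)

  fml <- stats::reformulate(".", response = outcome) # e.g. turnout ~ .

  foreach(i = seq_len(nrow(data)), .combine = c, .packages = "ranger") %dopar% {
    fit <- ranger(
      formula     = fml,
      data        = data[-i, , drop = FALSE],
      num.threads = 1,        # one thread per worker to avoid oversubscription
      seed        = seed + i  # assumed seeding scheme; set.seed() does not reach workers
    )
    predict(fit, data = data[i, , drop = FALSE])$predictions[[1]]
  }
}
```

With the predictor table `dat2` from the example above, a call such as `pred_loo <- loo_rf_parallel(dat2, n_cores = 4)` would return one prediction per county, in row order.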