A county-level dataset (U.S.) with voter turnout and sociodemographic covariates.
data(county_turnout)
A tibble with 3,107 rows and 22 variables:
State name.
County name.
County FIPS code.
Observed turnout (proportion). Calculated as total votes cast divided by total population (not voting-age population).
Total county population.
Percent non-white population.
Percent foreign-born population.
Percent female population.
Percent of population aged 29 or under.
Percent of population aged 65 or older.
Median household income.
Percent unemployed in the civilian labor force.
Percent with less than college education.
Percent with less than high school education.
Percent rural.
Rural–urban continuum code.
LOO-CV random-forest prediction of `turnout` (see Details).
U.S. Census division.
U.S. Census region.
Additional coarse geographic grouping variable (added).
County centroid longitude (added).
County centroid latitude (added).
The dataset is based on the MIT Election Lab "2018 Election Analysis dataset" file, with four additions: (1) `turnout`, calculated as the number of votes cast divided by the total population, (2) `geo_group`, a coarse geographic grouping variable for the counties, (3) county centroid coordinates (`longitude`, `latitude`), and (4) `predicted_turnout`.
The variable `predicted_turnout` is generated using leave-one-out cross-validation (LOO-CV). For each county, a random forest model is fit on the remaining counties with `turnout` as the outcome and all available *non-geographic* covariates as predictors. The fitted model is then used to predict turnout for the held-out county. Identifiers and geographic variables (`state`, `county`, `fips`, `division`, `region`, `geo_group`, `longitude`, `latitude`) are excluded from the predictor set to avoid leaking spatial information into the predictions.
Below is example code (using `foreach`) to reproduce `predicted_turnout`. LOO-CV fits one random forest per county (3,107 fits in total), so this is computationally expensive; parallel execution is recommended.
library(dplyr)
library(ranger)
library(foreach)
library(pintervals)

dat <- county_turnout # replace with your object name

# Choose predictors: all covariates except identifiers and geographic variables.
# The stored predicted_turnout column is dropped as well so it cannot leak into the fit.
dat2 <- dat |>
  select(-c(state, county, fips, division, region, geo_group, longitude, latitude)) |>
  select(-any_of("predicted_turnout"))

set.seed(101010) # The meaning of life in binary

# Leave-one-out loop: fit on every county except i, then predict county i.
# %do% runs sequentially; use %dopar% with a registered backend to run in parallel.
pred_loo <- foreach(i = seq_len(nrow(dat)), .final = unlist) %do% {
  train <- dat2[-i, , drop = FALSE]
  test <- dat2[i, , drop = FALSE]
  fit <- ranger(formula = turnout ~ ., data = train)
  predict(fit, data = test)$predictions[[1]]
}

dat <- dat |> mutate(predicted_turnout = pred_loo)
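Because parallel execution is recommended, the same leave-one-out loop can be run with `%dopar%`. The following is a minimal sketch, not the procedure used to build the dataset. It assumes the `doParallel` package as the backend (any `foreach` backend should work), an arbitrary choice of two workers, and holds out only the first 10 counties to keep the run short. The names `drop_vars`, `n_holdout`, and `pred_par` are illustrative. A `set.seed()` call in the main session does not carry over to the workers, so the sketch passes a per-county `seed` to `ranger()` instead; the resulting values will therefore not match the stored `predicted_turnout` exactly.

```r
library(ranger)
library(foreach)
library(doParallel)  # assumed backend; not part of the original example
library(pintervals)  # provides county_turnout

# Drop identifiers, geographic variables, and the stored prediction.
drop_vars <- c("state", "county", "fips", "division", "region",
               "geo_group", "longitude", "latitude", "predicted_turnout")
dat2 <- as.data.frame(
  county_turnout[, setdiff(names(county_turnout), drop_vars)]
)

cl <- makeCluster(2)  # assumed worker count; adjust to your machine
registerDoParallel(cl)

n_holdout <- 10  # illustrative; use nrow(dat2) for the full LOO-CV

# Each iteration fits on all counties except i and predicts county i.
pred_par <- foreach(i = seq_len(n_holdout), .combine = c,
                    .packages = "ranger") %dopar% {
  fit <- ranger(formula = turnout ~ ., data = dat2[-i, , drop = FALSE],
                seed = i)  # per-fit seed, since set.seed() is not shared
  predict(fit, data = dat2[i, , drop = FALSE])$predictions[[1]]
}

stopCluster(cl)
```

Setting `n_holdout` to `nrow(dat2)` gives one prediction per county, in row order, which can then be attached with `mutate()` as in the example above.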