pintervals (version 1.0.1)

county_turnout: U.S. county-level turnout and demographic context (MIT Election Lab 2018 Election Analysis Dataset + additions)

Description

A county-level dataset (U.S.) with voter turnout and sociodemographic covariates.

Usage

data(county_turnout)

Format

A tibble with 3,107 rows and 22 variables:

state

State name.

county

County name.

fips

County FIPS code.

turnout

Observed turnout (proportion). Calculated as total votes cast divided by total population (not voting-age population).

total_population

Total county population.

nonwhite_pct

Percent non-white population.

foreignborn_pct

Percent foreign-born population.

female_pct

Percent female population.

age29andunder_pct

Percent of population aged 29 or under.

age65andolder_pct

Percent of population aged 65 or older.

median_hh_inc

Median household income.

clf_unemploy_pct

Percent unemployed in the civilian labor force.

lesscollege_pct

Percent with less than college education.

lesshs_pct

Percent with less than high school education.

rural_pct

Percent rural.

ruralurban_cc

Rural–urban continuum code.

predicted_turnout

LOO-CV random-forest prediction of `turnout` (see Details).

division

U.S. Census division.

region

U.S. Census region.

geo_group

Additional coarse geographic grouping variable (added).

longitude

County centroid longitude (added).

latitude

County centroid latitude (added).

Details

The dataset is based on the MIT Election Lab "2018 Election Analysis Dataset" file, with four additions: (1) `turnout`, calculated as the number of votes cast divided by the total population; (2) `geo_group`, a coarse geographic grouping variable for the counties; (3) county centroid coordinates (`longitude`, `latitude`); and (4) `predicted_turnout`.

The variable `predicted_turnout` is generated using leave-one-out cross-validation (LOO-CV). For each county, a random forest model is fit on the remaining counties with `turnout` as the outcome and all available *non-geographic* covariates as predictors. The fitted model is then used to predict turnout for the held-out county. Identifiers and geographic variables (e.g., `state`, `county`, `fips`, `division`, `region`, `geo_group`, `longitude`, `latitude`) are excluded from the predictor set so that spatial information does not leak into the predictions.
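In symbols, the leave-one-out procedure described above can be written as follows. The notation is introduced here only for illustration and does not come from the original data documentation.

```latex
\hat{y}_i = \hat{f}^{(-i)}(x_i), \qquad i = 1, \dots, n
```

Here $n$ is the number of counties (3,107 rows), $x_i$ holds the non-geographic covariates of county $i$, $\hat{f}^{(-i)}$ is the random forest fit on all counties except county $i$, and $\hat{y}_i$ is the value stored in `predicted_turnout`.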

Below is example code (using `foreach`) to reproduce `predicted_turnout`. LOO-CV fits one random forest per county (3,107 fits in total), so this is computationally expensive; parallel execution is recommended.

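library(dplyr)
library(ranger)
library(foreach)
library(pintervals)

dat <- county_turnout # replace with your object name

# Choose predictors: keep turnout (the outcome) and all non-geographic covariates.
# Drop identifiers, geographic variables, and any existing predicted_turnout column
# (otherwise the stored prediction would be used as a predictor).
dat2 <- dat |>
  select(-c(state, county, fips, division, region, geo_group, longitude, latitude),
         -any_of("predicted_turnout"))

set.seed(101010) # The meaning of life in binary

# Fit one random forest per held-out county. %do% runs sequentially; to run in
# parallel, register a backend (e.g. doParallel::registerDoParallel()) and
# replace %do% with %dopar%.
pred_loo <- foreach(.i = seq_len(nrow(dat)), .final = unlist) %do% {

  train <- dat2[-.i, , drop = FALSE] # all counties except the held-out one
  test <- dat2[.i, , drop = FALSE]   # the held-out county

  fit <- ranger(formula = turnout ~ ., data = train)

  predict(fit, data = test)$predictions[[1]]

}

dat <- dat |> mutate(predicted_turnout = pred_loo)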
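Once `predicted_turnout` is available (it also ships with the dataset), a natural next step is to check how far the leave-one-out predictions are from the observed values. The following is a minimal base-R sketch, not part of the package: the helper name `loo_error_summary` is hypothetical, and the toy numbers are made up purely to show the call.

```r
# Hypothetical helper (not provided by pintervals): summarise the errors of
# leave-one-out predictions stored alongside the observed outcome.
loo_error_summary <- function(data, observed = "turnout",
                              predicted = "predicted_turnout") {
  stopifnot(is.data.frame(data),
            all(c(observed, predicted) %in% names(data)))
  resid <- data[[observed]] - data[[predicted]]
  c(
    n    = length(resid),
    bias = mean(resid),         # mean signed error (observed - predicted)
    mae  = mean(abs(resid)),    # mean absolute error
    rmse = sqrt(mean(resid^2))  # root mean squared error
  )
}

# Toy illustration with made-up numbers (NOT taken from county_turnout)
toy <- data.frame(
  turnout           = c(0.30, 0.40, 0.50),
  predicted_turnout = c(0.35, 0.40, 0.45)
)
toy_summary <- loo_error_summary(toy)

# With the packaged data (requires pintervals to be installed):
# data(county_turnout, package = "pintervals")
# loo_error_summary(county_turnout)
```

Because each prediction comes from a model that never saw the held-out county, these summaries describe out-of-sample error rather than in-sample fit.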