bisg: Performs Bayesian Improved Surname Geocoding (BISG)

Description

Performs BISG, i.e., computes the probability that a person belongs to a specific racial group, conditional on their surname and geolocation.
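
To make the description concrete, here is a minimal base-R sketch of the standard BISG update, in which P(race | surname, geography) is proportional to P(race | surname) times P(geography | race). It is an illustration only, not the package's internal code: every number and variable name below is made up, and it assumes surname and geography are independent given race.

```r
# Toy illustration of the standard BISG update.
# All numbers are hypothetical; this is not the package's implementation.

# P(race | surname): assumed share of each group among people with one surname
p_race_given_surname <- c(whi = 0.05, bla = 0.01, his = 0.90, asi = 0.02, oth = 0.02)

# Assumed population counts by group in the voter's geographic unit
geo_unit_counts <- c(whi = 800, bla = 50, his = 100, asi = 30, oth = 20)

# Assumed population counts by group across the whole study area
total_counts <- c(whi = 60000, bla = 5000, his = 20000, asi = 10000, oth = 5000)

# P(geography | race): share of each group that lives in this geographic unit
p_geo_given_race <- geo_unit_counts / total_counts

# Bayes' rule: multiply the two terms, then normalize so the result sums to 1
unnormalized <- p_race_given_surname * p_geo_given_race
posterior <- unnormalized / sum(unnormalized)

print(round(posterior, 3))
```

In this toy case the surname points strongly to one group while the geographic unit is mostly another, so the posterior sits between the two sources of evidence.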

Usage

bisg(
  voter_file,
  surname_col,
  geo_col,
  geo_counts = NULL,
  surname_counts = NULL,
  geography = NULL,
  state = NULL,
  county = NULL,
  year = NULL,
  geo_col_counts = "fips",
  surname_col_counts = "surname",
  race_cols = c("whi", "bla", "his", "asi", "oth"),
  impute_missing = TRUE,
  verbose = FALSE,
  cache = FALSE
)

Value

A tibble with rows denoting voters and columns denoting the probability that each voter is of a particular racial group.

Arguments

voter_file: A tibble with one voter per row, including a column that contains each voter's surname.
surname_col: A string denoting which column contains the voter surname.
geo_col: A string denoting which column contains the geographic unit ID.
geo_counts: A tibble containing population counts, divided among constituent groups, with one row per geographic unit. If NULL, these counts are obtained using the eiCompare helper function and the other parameters.
surname_counts: A data frame denoting the frequency with which surnames correspond to different races/ethnicities. If NULL, the Census surname list is used with categories and merging functions from wru. The data frame should contain one column with surnames (specified with the surname_col_counts parameter) and one column for each race/ethnicity group (specified with the race_cols parameter).
geography: The geographic level at which to obtain Census data. For decennial Census data, the finest level available is "block". For ACS data, the finest level available is "block group".
state: The state from which to obtain Census data, as a string.
county: The county(ies) from which to obtain Census data. If NULL, data is obtained from all counties in the state.
year: The year to obtain Census data from. If 2010, uses decennial data. Otherwise, uses the 5-year ACS summary data.
geo_col_counts: A string denoting the column in the geo_counts tibble that contains the geographic unit ID.
surname_col_counts: A string denoting the column in the surname_counts tibble that contains the surname.
race_cols: A character vector naming the columns that contain the racial groups.
impute_missing: A boolean denoting whether voter file entries that do not match at the geographic level should be imputed, using either the surname probabilities or probabilities calculated at a broader geographic unit.
verbose: A boolean denoting whether to produce verbose output.
cache: A boolean denoting whether Census data should be cached.
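
As a sketch of the table layout described for surname_counts, the base-R snippet below builds a small custom table and checks that it has the surname column and one column per group. The surnames and values are invented for illustration, and the column check is a hypothetical sanity test, not something the package is documented to provide.

```r
# Hypothetical custom surname table; the surnames and values are made up.
surname_counts <- data.frame(
  surname = c("GARCIA", "NGUYEN", "SMITH"),
  whi = c(0.05, 0.01, 0.71),
  bla = c(0.01, 0.01, 0.23),
  his = c(0.92, 0.01, 0.02),
  asi = c(0.01, 0.96, 0.01),
  oth = c(0.01, 0.01, 0.03),
  stringsAsFactors = FALSE
)

# Names that would be passed to bisg() alongside the table
surname_col_counts <- "surname"
race_cols <- c("whi", "bla", "his", "asi", "oth")

# Columns required by the documented layout that are absent from the table
missing_cols <- setdiff(c(surname_col_counts, race_cols), names(surname_counts))
stopifnot(length(missing_cols) == 0)
```

A table like this would be supplied through the surname_counts argument, with surname_col_counts and race_cols set to the matching column names.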

Examples

library(eiExpand)
library(dplyr)  # provides the %>% pipe used below

# Load WA example data from eiExpand #
data("wa_geocoded")
data("wa_block_data")

# Use lat/long to merge the voter file with block shapes
# and bring in the block identifier #
voter_file_geo <- eiCompare::merge_voter_file_to_shape(
  voter_file = wa_geocoded,
  shape_file = wa_block_data,
  coords = c("longitude", "latitude"),
  voter_id = "id") %>%
  dplyr::as_tibble() %>%
  dplyr::select("id", "county", "precinct", "surname", "GEOID20")

# Run BISG #
bisg_vf <- bisg(
  voter_file = voter_file_geo,
  surname_col = "surname",
  geo_col = "GEOID20",
  geo_counts = wa_block_data %>% as.data.frame(),
  geography = "block",
  state = "WA",
  geo_col_counts = "GEOID20",
  impute_missing = FALSE
)

# Preview individual race predictions #
dplyr::glimpse(bisg_vf)