predict_race makes probabilistic estimates of individual-level race/ethnicity.
predict_race(
  voter.file,
  census.surname = TRUE,
  surname.only = FALSE,
  census.geo = c("tract", "block", "block_group", "county", "place", "zcta"),
  census.key = Sys.getenv("CENSUS_API_KEY"),
  census.data = NULL,
  age = FALSE,
  sex = FALSE,
  year = "2020",
  party = NULL,
  retry = 3,
  impute.missing = TRUE,
  skip_bad_geos = FALSE,
  use.counties = FALSE,
  model = "BISG",
  race.init = NULL,
  name.dictionaries = NULL,
  names.to.use = "surname",
  control = NULL
)
Output will be an object of class data.frame. It will consist of the original user-input voter.file with additional columns containing predicted probabilities for each of the five major racial categories: pred.whi for White, pred.bla for Black, pred.his for Hispanic/Latino, pred.asi for Asian/Pacific Islander, and pred.oth for Other/Mixed.
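As a rough illustration of working with this output, the sketch below builds a small mock data frame with the five pred.* columns and picks the most probable category per row using base R. The probabilities are invented for the example and were not produced by predict_race; the column most_likely is a hypothetical name.

```r
# Mock output: invented probabilities standing in for predict_race() results.
preds <- data.frame(
  surname  = c("Khanna", "Imai", "Smith"),
  pred.whi = c(0.05, 0.03, 0.70),
  pred.bla = c(0.01, 0.01, 0.22),
  pred.his = c(0.02, 0.02, 0.03),
  pred.asi = c(0.87, 0.91, 0.01),
  pred.oth = c(0.05, 0.03, 0.04)
)

prob_cols <- c("pred.whi", "pred.bla", "pred.his", "pred.asi", "pred.oth")

# Each row of probabilities should sum to (approximately) one.
row_totals <- rowSums(preds[, prob_cols])

# Index of the largest probability in each row, mapped back to a short label.
best <- max.col(as.matrix(preds[, prob_cols]), ties.method = "first")
preds$most_likely <- sub("^pred\\.", "", prob_cols[best])
```

Collapsing to a single label discards the uncertainty in the estimates, so keeping the full probability columns is usually preferable for downstream analysis.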
voter.file
An object of class data.frame. Must contain a row for each individual being predicted, as well as a field named surname containing each individual's surname. If using geolocation in predictions, voter.file must contain a field named state, which contains the two-character abbreviation for each individual's state of residence (e.g., "nj" for New Jersey). If using Census geographic data in race/ethnicity predictions, voter.file must also contain at least one of the following fields: county, tract, block_group, block, place, and/or zcta. These fields should contain character strings matching U.S. Census categories: county is three characters (e.g., "031", not "31"), tract is six characters, block group is usually a single character, block is four characters, and place is five characters. See below for other optional fields.
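Geography codes that were read as numbers lose their leading zeros, which breaks the fixed-width format described above. The following sketch, using invented records and base R only, shows one way to restore the padding before passing the data to predict_race.

```r
# Hypothetical raw records where geography codes were read as numbers,
# which drops leading zeros (e.g. county 031 becomes 31).
raw <- data.frame(
  surname = c("Khanna", "Imai"),
  state   = c("NJ", "NY"),
  county  = c(31, 61),
  tract   = c(10300, 15400),
  block   = c(1002, 3001)
)

# sprintf() with a zero-padded integer format restores fixed-width strings.
voter.file <- data.frame(
  surname = raw$surname,
  state   = raw$state,
  county  = sprintf("%03d", as.integer(raw$county)),
  tract   = sprintf("%06d", as.integer(raw$tract)),
  block   = sprintf("%04d", as.integer(raw$block)),
  stringsAsFactors = FALSE
)
```

Reading the codes as character columns in the first place avoids the problem altogether.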
census.surname
A TRUE/FALSE object. If TRUE, the function will call merge_surnames to merge in Pr(Race | Surname) from the U.S. Census Surname List (2000, 2010, or 2020) and the Spanish Surname List. If FALSE, the user must provide name.dictionaries (see below). Default is TRUE.
surname.only
A TRUE/FALSE object. If TRUE, race predictions will use only surname data and calculate Pr(Race | Surname). Default is FALSE.
census.geo
An optional character vector specifying what level of geography to use to merge in U.S. Census geographic data. Currently "county", "tract", "block_group", "block", "place", and "zcta" are supported.
Note: sufficient information must be present in the user-defined voter.file object.
If census.geo = "county", then voter.file must have a column named county.
If census.geo = "tract", then voter.file must have columns named county and tract.
If census.geo = "block", then voter.file must have columns named county, tract, and block.
If census.geo = "place", then voter.file must have a column named place.
If census.geo = "zcta", then voter.file must have a column named zcta.
Specifying census.geo will call the census_helper function to merge Census geographic data at the specified level of geography.
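The column requirements listed above can be checked before calling predict_race. The sketch below is a hypothetical helper, not part of the package; it encodes only the rules stated in this section and reports which columns are missing for a chosen census.geo value.

```r
# Hypothetical lookup, not part of the package: columns that the text above
# says voter.file needs for each geography level.
required_geo_cols <- list(
  county = "county",
  tract  = c("county", "tract"),
  block  = c("county", "tract", "block"),
  place  = "place",
  zcta   = "zcta"
)

# Return the required columns that are absent from voter.file.
missing_geo_cols <- function(voter.file, census.geo) {
  needed <- required_geo_cols[[census.geo]]
  if (is.null(needed)) stop("No rule listed for census.geo = ", census.geo)
  setdiff(needed, names(voter.file))
}

vf <- data.frame(surname = "Smith", state = "NJ", county = "031")
missing_geo_cols(vf, "county")  # nothing missing
missing_geo_cols(vf, "block")   # tract and block are missing
```

The helper covers only the levels for which this section states a rule; "block_group" is left out because no column requirement is given for it here.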
census.key
A character object specifying the user's Census API key. Required if census.geo is specified, because a valid Census API key is required to download Census geographic data. If no key is supplied, the function attempts to find one stored in an environment variable named CENSUS_API_KEY; the default value shown in the usage is Sys.getenv("CENSUS_API_KEY").
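A minimal sketch of storing the key in the CENSUS_API_KEY environment variable for the current R session, using base R only. The key string is a placeholder, not a working key.

```r
# Placeholder value; substitute a real key obtained from the Census Bureau.
Sys.setenv(CENSUS_API_KEY = "your-key-here")

# The same lookup that appears as the default value of census.key.
key <- Sys.getenv("CENSUS_API_KEY")

# An unset variable comes back as an empty string, so check before use.
key_is_set <- nzchar(key)
```

Placing the variable in a user-level .Renviron file instead keeps the key out of scripts that may be shared.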
census.data
A list indexed by two-letter state abbreviations, which contains pre-saved Census geographic data. Can be generated using the get_census_data function.
age
An optional TRUE/FALSE object specifying whether to condition race predictions on age (in addition to surname and geolocation). Default is FALSE. Must match the age setting used to build the census.data object. May only be set to TRUE if the census.geo option is specified. If TRUE, voter.file should include a numerical variable age.
sex
An optional TRUE/FALSE object specifying whether to condition race predictions on sex (in addition to surname and geolocation). Default is FALSE. Must match the sex setting used to build the census.data object. May only be set to TRUE if the census.geo option is specified. If TRUE, voter.file should include a numerical variable sex, where sex is coded as 0 for males and 1 for females.
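Source files often store sex as text labels rather than the 0/1 coding described above. The sketch below shows one possible recode in base R; the labels "M" and "F" are an assumption about the input, not something the package prescribes.

```r
# Hypothetical raw column with text labels and one missing value.
raw_sex <- c("M", "F", "F", "M", NA)

# Map labels to the numeric coding described above: 0 = male, 1 = female.
# Labels not in the lookup (including NA) become NA.
sex <- unname(c(M = 0, F = 1)[raw_sex])
```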
year
An optional character vector specifying the year of U.S. Census geographic data to be downloaded. Use "2010" or "2020". Default is "2020".
party
An optional character object specifying the party registration field in voter.file, e.g., party = "PartyReg". If specified, race/ethnicity predictions will be conditioned on the individual's party registration (in addition to geolocation). Whatever the name of the party registration field in voter.file, it should be coded as 1 for Democrat, 2 for Republican, and 0 for Other.
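The 1/2/0 coding can be produced from text labels with a nested ifelse. In this sketch the column name PartyReg follows the example in the text, while the labels "DEM", "REP", "IND", and "GRN" are invented for illustration.

```r
# Hypothetical registration labels in a column named as in the text's example.
voter.file <- data.frame(
  surname  = c("Smith", "Garcia", "Lee", "Jones"),
  PartyReg = c("DEM", "REP", "IND", "GRN"),
  stringsAsFactors = FALSE
)

# 1 = Democrat, 2 = Republican, 0 = anything else.
voter.file$PartyReg <- ifelse(voter.file$PartyReg == "DEM", 1,
                       ifelse(voter.file$PartyReg == "REP", 2, 0))
```

The recoded column would then be referenced by name, as in party = "PartyReg".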
retry
The number of retries at the Census website if a network interruption occurs.
impute.missing
Logical, defaults to TRUE. Should missing values be imputed?
skip_bad_geos
Logical. Option to have the function skip any geolocations that are not present in the Census data, returning a partial data set. Default is FALSE, in which case the function stops with an error message listing the offending geolocations.
use.counties
Logical, defaults to FALSE. Should Census data be filtered by the counties available in census.data?
model
Character string, either "BISG" (default) or "fBISG" (for the error-correcting, fully Bayesian model).
race.init
Vector of initial race for each observation in voter.file. Must be an integer vector, with 1 = white, 2 = black, 3 = hispanic, 4 = asian, and 5 = other. Defaults to values obtained using model = "BISG_surname".
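If starting values are held as text labels, the integer coding above can be built with match. This is a sketch under the assumption that the labels are spelled exactly as shown; the label vector itself is invented.

```r
# Hypothetical starting labels, one per row of voter.file.
init_labels <- c("white", "hispanic", "asian", "black", "other")

# Position in this vector gives the integer code described above.
race_levels <- c("white", "black", "hispanic", "asian", "other")

# match() returns an integer vector; unrecognised labels become NA.
race.init <- match(init_labels, race_levels)
```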
name.dictionaries
Optional named list of data.frames containing counts of names by race. Any of the following named elements are allowed: "surname", "first", "middle". When present, the objects must follow the same structure as last_c, first_c, and mid_c, respectively.
names.to.use
One of 'surname', 'surname, first', or 'surname, first, middle'. Defaults to 'surname'.
control
List of control arguments only used when model = "fBISG", including:
iter: Number of MCMC iterations. Defaults to 1000.
burnin: Number of iterations discarded as burn-in. Defaults to half of iter.
verbose: Print progress information. Defaults to TRUE.
me.correct: Boolean. Should the model correct measurement error for races|geo? Defaults to TRUE.
seed: RNG seed. If NULL, a seed is generated and returned as an attribute for reproducibility.
This function implements the Bayesian race prediction methods outlined in Imai and Khanna (2015). The function produces probabilistic estimates of individual-level race/ethnicity, based on surname, geolocation, and party.
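The core calculation can be sketched for a single person as an application of Bayes' rule: Pr(Race | Surname, Geography) is proportional to Pr(Race | Surname) times Pr(Geography | Race), assuming surname and geography are independent given race. The numbers below are invented for illustration and the variable names are hypothetical; this is not the package's internal code.

```r
# Invented inputs for one person; real values would come from Census tables.

# Pr(Race | Surname): share of people with this surname in each group.
p_race_given_surname <- c(whi = 0.60, bla = 0.25, his = 0.05,
                          asi = 0.05, oth = 0.05)

# Pr(Geography | Race): share of each group's population living in this area.
p_geo_given_race <- c(whi = 0.001, bla = 0.004, his = 0.002,
                      asi = 0.001, oth = 0.002)

# Bayes' rule: multiply elementwise, then normalise so the values sum to one.
unnormalised <- p_race_given_surname * p_geo_given_race
posterior <- unnormalised / sum(unnormalised)
```

In this made-up case the surname alone favours White, but the geography term shifts the largest posterior probability to Black, which is the kind of update the geographic data is meant to provide.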
# \donttest{
library(wru)
data(voters)

# Surname-only prediction; no Census API key needed.
try(predict_race(voter.file = voters, surname.only = TRUE))

# The remaining examples are not run: they download Census data and
# therefore need a valid Census API key.
if (FALSE) {
  try(predict_race(voter.file = voters, census.geo = "tract"))
}
if (FALSE) {
  try(predict_race(
    voter.file = voters, census.geo = "place", year = "2020"
  ))
}
if (FALSE) {
  CensusObj <- try(get_census_data(state = c("NY", "DC", "NJ")))
  try(predict_race(
    voter.file = voters, census.geo = "tract", census.data = CensusObj, party = "PID"
  ))
}
if (FALSE) {
  CensusObj2 <- try(get_census_data(state = c("NY", "DC", "NJ"), age = TRUE, sex = TRUE))
  try(predict_race(
    voter.file = voters, census.geo = "tract", census.data = CensusObj2, age = TRUE, sex = TRUE
  ))
}
if (FALSE) {
  CensusObj3 <- try(get_census_data(state = c("NY", "DC", "NJ"), census.geo = "place"))
  try(predict_race(voter.file = voters, census.geo = "place", census.data = CensusObj3))
}
# }