Calculates individual probabilities of belonging to racial groups given last
name, location, and other covariates (optional). The standard function bisg()
treats the input tables as fixed. An alternative function, bisg_me(), assumes
that the input tables are subject to measurement error and uses a Gibbs
sampler to impute the individual race probabilities, using the model of
Imai et al. (2022).
bisg(
  formula,
  data = NULL,
  p_r = p_r_natl(),
  p_rgx = NULL,
  p_rs = NULL,
  save_rgx = TRUE
)

bisg_me(
  formula,
  data = NULL,
  p_r = p_r_natl(),
  p_rgx = NULL,
  p_rs = NULL,
  iter = 1000,
  warmup = 100,
  cores = 1L
)
# S3 method for bisg
summary(object, p_r = NULL, ...)
# S3 method for bisg
predict(object, adj = NULL, ...)
# S3 method for bisg
simulate(object, nsim = 1, seed = NULL, ...)
An object of class bisg, which is just a data frame with some additional
attributes. The data frame has rows matching the input data and columns for
the race probabilities.
formula: A formula specifying the BISG model. Must include the special term
nm() to identify the surname variable. Certain geographic variables can be
identified similarly: zip() for ZIP codes and state() for states. If no other
predictor variables are provided, then bisg() will automatically build a
table of census data to use in inference. If other predictor variables are
included, or if other geographic identifiers are used, then the user must
specify the p_rgx argument below. The left-hand side of the formula is
ignored. See the examples section below for sample formulas.
data: The data frame containing the variables in formula.
p_r: The prior distribution of race in the sample, as a numeric vector.
Defaults to U.S. demographics as provided by p_r_natl(). Can also be set to
"est" or "estimate" to estimate the prior from the geographic distribution.
Since the prior distribution on race strongly affects the calibration of the
BISG probabilities, and thus the accuracy of downstream estimates, users are
encouraged to think carefully about an appropriate value for p_r. If no prior
information on the racial makeup of the sample is available, and yet the
sample is very different from the overall U.S. population, then
p_r="estimate" will likely produce superior results.
p_rgx: The distribution of race given location (G) and other covariates (X)
specified in formula. Should be provided as a data frame, with columns
matching the predictors in formula, and additional columns for each racial
group containing the conditional probability for that racial group given the
predictors. For example, if Census tracts are the only predictors, p_rgx
should be a data frame with a tract column and columns white, black, etc.
containing the racial distribution of each tract.

If formula contains only labeled terms (like zip()), then by default p_rgx
will be constructed automatically from the most recent Census data.

This table will be normalized by row, so it can be provided as population
counts as well. Counts are required for bisg_me().

The census_race_geo_table() function can be helpful to prepare tables, as can
the build_dec() and build_acs() functions in the censable package.
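As a rough illustration of the row normalization described above, the following base R sketch converts a small table of population counts into conditional probabilities. The tract identifiers, column names, and counts are invented for illustration, and this is not the package's internal code.

```r
# Hypothetical tract-level population counts (made-up numbers)
counts <- data.frame(
  tract = c("A", "B"),
  white = c(600, 100),
  black = c(300, 250),
  other = c(100, 150)
)

# Every column except the geographic identifier is a racial group
race_cols <- setdiff(names(counts), "tract")

# Divide each row by its total so the race columns sum to 1 within a tract
p_rgx_example <- counts
p_rgx_example[race_cols] <- counts[race_cols] / rowSums(counts[race_cols])

p_rgx_example
```

Because the normalization only rescales each row, a table of counts and a table of proportions lead to the same conditional distribution. The counts themselves carry extra information about sample size, which is presumably why bisg_me() asks for them.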
p_rs: The distribution of race given last name. As with p_rgx, should be
provided as a data frame, with a column of names and additional columns for
each racial group. Users should not have to specify this argument in most
cases, as the table will be built from published Census surname tables
automatically. Counts are required for bisg_me().
save_rgx: If TRUE, save the p_rgx table (matched to each individual) as the
"p_rgx" and "gx" attributes of the output. Necessary for some sensitivity
analyses.
iter: The number of sampling iterations in the Gibbs sampler.
warmup: The number of burn-in iterations in the Gibbs sampler.
cores: The number of parallel cores to use in computation. Around 4 seems to
be optimal, even if more are available.
object: An object of class bisg, the result of running bisg().
...: Additional arguments to generic methods (ignored).
adj: A point in the simplex that describes how BISG probabilities will be
thresholded to produce point predictions. The probabilities are divided by
adj, then the racial category with the highest probability is predicted. Can
be used to trade off types of prediction error. Must be nonnegative; it will
be normalized to sum to 1. The default is to make no adjustment.
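The thresholding rule described for adj can be sketched in base R as follows. The probability matrix, group labels, and adjustment vector are hypothetical, and the code mirrors only the verbal description above (divide by adj, then take the most probable category), not necessarily the package's implementation.

```r
# Hypothetical race probabilities for three individuals (rows sum to 1)
probs <- matrix(
  c(0.50, 0.40, 0.10,
    0.20, 0.70, 0.10,
    0.60, 0.25, 0.15),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("white", "black", "other"))
)

# Hypothetical adjustment vector: nonnegative, then normalized to sum to 1
adj <- c(2, 1, 1)
adj <- adj / sum(adj)

# Divide each column by its adjustment weight, then take the row-wise argmax
adjusted <- sweep(probs, 2, adj, "/")
pred <- factor(
  colnames(probs)[max.col(adjusted, ties.method = "first")],
  levels = colnames(probs)
)

pred
```

A larger entry in adj raises the bar for predicting that category: in this toy example the first individual's unadjusted argmax is "white", but dividing by the adjustment tips the prediction to "black".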
nsim: The number of vectors to simulate. Defaults to 1.
seed: Used to seed the random number generator. See stats::simulate().
summary(bisg): Summarize predicted race probabilities. Returns a vector of
individual entropies.
predict(bisg): Create point predictions of individual race. Returns a factor
vector of individual race labels. Strongly discouraged for any kind of
inferential purpose, as biases may be extreme and in unpredictable
directions.
simulate(bisg): Simulate race from the Pr(R | G, X, S) distribution.
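To make the quantities behind summary() and simulate() concrete, here is a small base R sketch that computes a per-individual entropy and draws one race label per individual from a matrix of probabilities. The matrix is hypothetical, and the use of the natural logarithm for the entropy is an assumption; the package may use a different base or scaling.

```r
# Hypothetical race probabilities for three individuals (rows sum to 1)
probs <- matrix(
  c(0.50, 0.40, 0.10,
    1.00, 0.00, 0.00,
    0.34, 0.33, 0.33),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("white", "black", "other"))
)

# Shannon entropy per row (natural log assumed), treating 0 * log(0) as 0
entropy <- -rowSums(ifelse(probs > 0, probs * log(probs), 0))

# Draw one label per individual from that individual's own distribution
set.seed(1)
draws <- apply(probs, 1, function(p) {
  sample(colnames(probs), size = 1, prob = p)
})

entropy
draws
```

An entropy of zero marks an individual whose race is predicted with certainty, while values near log(3) indicate probabilities spread almost evenly over the three hypothetical groups.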
bisg(): The standard BISG model.
bisg_me(): The measurement error BISG model.
Elliott, M. N., Fremont, A., Morrison, P. A., Pantoja, P., and Lurie, N. (2008). A new method for estimating race/ethnicity and associated disparities where administrative records lack self-reported race/ethnicity. Health Services Research, 43(5p1):1722–1736.
Fiscella, K. and Fremont, A. M. (2006). Use of geocoding and surname analysis to estimate race and ethnicity. Health Services Research, 41(4p1):1482–1500.
Imai, K., Olivella, S., and Rosenman, E. T. (2022). Addressing census data problems in race imputation via fully Bayesian Improved Surname Geocoding and name supplements. Science Advances, 8(49):eadc9824.
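data(pseudo_vf)

# Surname-only model
bisg(~ nm(last_name), data=pseudo_vf)

# Surname plus ZIP code; the census table is built automatically
r_probs = bisg(~ nm(last_name) + zip(zip), data=pseudo_vf)
summary(r_probs)
head(predict(r_probs))

# Measurement-error model (runs a Gibbs sampler)
bisg_me(~ nm(last_name) + zip(zip), data=pseudo_vf)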