merge_surnames
Merges surnames in a user-supplied dataset with the corresponding
race/ethnicity probabilities from the U.S. Census Surname List and the Spanish Surname List.
merge_surnames(
voter.file,
surname.year = 2020,
name.data,
clean.surname = TRUE,
impute.missing = TRUE
)
Output will be an object of class data.frame. It will
consist of the original user-input data with additional columns that
specify the part of the name matched with Census data (surname.match),
and the probabilities Pr(Race | Surname) for each racial group
(p_whi for White, p_bla for Black, p_his for Hispanic/Latino,
p_asi for Asian and Pacific Islander, and p_oth for Other/Mixed).
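To illustrate how the returned columns might be used, the sketch below builds a small data frame shaped like the output described above and picks the most probable group for each row. The rows and probabilities are invented for illustration (they are not Census values), and `most_likely` is a hypothetical column name, not something the function returns.

```r
# Toy data frame mimicking the documented output layout (values are made up).
out <- data.frame(
  surname       = c("Smith", "Garcia", "Nguyen"),
  surname.match = c("SMITH", "GARCIA", "NGUYEN"),
  p_whi = c(0.70, 0.05, 0.02),
  p_bla = c(0.23, 0.01, 0.01),
  p_his = c(0.02, 0.92, 0.01),
  p_asi = c(0.01, 0.01, 0.95),
  p_oth = c(0.04, 0.01, 0.01),
  stringsAsFactors = FALSE
)

prob_cols <- c("p_whi", "p_bla", "p_his", "p_asi", "p_oth")

# In this toy example each row of Pr(Race | Surname) sums to 1.
row_totals <- rowSums(out[, prob_cols])

# Name of the column holding the highest probability in each row.
best <- max.col(as.matrix(out[, prob_cols]), ties.method = "first")
out$most_likely <- prob_cols[best]
```

Taking the single most probable group discards the uncertainty in the other four columns, so whether to do this depends on the analysis at hand.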
voter.file
An object of class data.frame. Must contain a field
named 'surname' containing the list of surnames to be merged with the Census lists.
surname.year
An object of class numeric indicating which year the
Census Surname List is from. Accepted values are 2020, 2010, and 2000.
Default is 2020.
name.data
An object of class data.frame. Must contain a leading
column of surnames, and 5 subsequent columns, with Pr(Race | Surname) for each
of the five major racial categories.
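A custom surname table for name.data could be laid out as follows. This is a minimal sketch: the column names and probabilities are placeholders chosen for illustration (the description above only specifies a leading surname column followed by five probability columns), and `check_name_data` is a hypothetical helper, not part of the package.

```r
# Hypothetical custom surname table: surname first, then five probability columns.
my_names <- data.frame(
  surname = c("ALVAREZ", "KIM", "JONES"),
  p_whi = c(0.06, 0.03, 0.55),
  p_bla = c(0.01, 0.01, 0.38),
  p_his = c(0.91, 0.01, 0.02),
  p_asi = c(0.01, 0.93, 0.01),
  p_oth = c(0.01, 0.02, 0.04),
  stringsAsFactors = FALSE
)

# Hypothetical sanity check for the documented layout.
# Assumption: each row's five probabilities should sum to 1.
check_name_data <- function(x) {
  is.data.frame(x) &&
    ncol(x) == 6 &&
    is.character(x[[1]]) &&
    all(vapply(x[-1], is.numeric, logical(1))) &&
    all(abs(rowSums(x[-1]) - 1) < 1e-8)
}

check_name_data(my_names)
```

Running a check like this before passing the table in can catch a missing column or rows that are not proper probability distributions.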
clean.surname
A TRUE/FALSE object. If TRUE,
any surnames in voter.file that cannot initially be matched
to surname lists will be cleaned, according to U.S. Census specifications,
in order to increase the chance of finding a match. Default is TRUE.
impute.missing
A TRUE/FALSE object. If TRUE,
race/ethnicity probabilities will be imputed for unmatched names using
the race/ethnicity distribution for all other names (i.e., names not on the Census List).
Default is TRUE.
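The imputation idea can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: `impute_unmatched` is a hypothetical helper, unmatched names are assumed to be marked by NA probabilities, and the fallback distribution is made up rather than taken from Census data.

```r
prob_cols <- c("p_whi", "p_bla", "p_his", "p_asi", "p_oth")

# Toy merge result: NA probabilities mark a surname that was not matched.
merged <- data.frame(
  surname = c("SMITH", "ZZYZX"),
  p_whi = c(0.70, NA), p_bla = c(0.23, NA), p_his = c(0.02, NA),
  p_asi = c(0.01, NA), p_oth = c(0.04, NA),
  stringsAsFactors = FALSE
)

# Made-up fallback distribution standing in for "all other names".
fallback <- c(p_whi = 0.60, p_bla = 0.10, p_his = 0.15, p_asi = 0.10, p_oth = 0.05)

# Hypothetical helper: fill unmatched rows with the fallback distribution.
impute_unmatched <- function(df, fallback, impute = TRUE) {
  unmatched <- is.na(df[[names(fallback)[1]]])
  if (impute && any(unmatched)) {
    for (col in names(fallback)) df[unmatched, col] <- fallback[[col]]
  }
  df
}

imputed <- impute_unmatched(merged, fallback)
```

In this sketch, calling the helper with `impute = FALSE` leaves the unmatched row as NA, so downstream code has to handle missing probabilities itself.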
This function allows users to match surnames in their dataset with the U.S. Census Surname List (from 2000, 2010, or 2020) and Spanish Surname List to obtain Pr(Race | Surname) for each of the five major racial groups.
By default, the function matches surnames to the Census list as follows:
1. Search raw surnames in Census surname list;
2. Remove any punctuation and search again;
3. Remove any spaces and search again;
4. Remove suffixes (e.g., Jr) and search again;
5. Split double-barreled surnames into two parts and search first part of name;
6. Split double-barreled surnames into two parts and search second part of name;
7. For any remaining names, impute probabilities using distribution for all names not appearing on Census list.
Each step only applies to surnames not matched in a previous step.
Steps 2 through 6 are not applied if clean.surname is FALSE; the final imputation step is governed by impute.missing.
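The matching cascade described above can be sketched in base R. This is a simplified illustration under stated assumptions, not the package's code: `census` is a toy lookup vector, `match_surname` is a hypothetical helper, the suffix list is an assumed subset, and the imputation step is left out.

```r
# Toy lookup standing in for the Census surname list (uppercase keys).
census <- c("SMITH", "OBRIEN", "DELACRUZ", "GARCIA", "LOPEZ")

# Hypothetical helper: returns the matched key, or NA if no step finds one.
match_surname <- function(surname, lookup = census, clean = TRUE) {
  raw <- toupper(trimws(surname))
  if (raw %in% lookup) return(raw)                 # search the raw surname
  if (!clean) return(NA_character_)                # cleaning switched off

  no_punct  <- gsub("[[:punct:]]", "", raw)        # remove punctuation
  no_space  <- gsub("[[:space:]]", "", no_punct)   # remove spaces
  # Remove a trailing suffix (assumed list; a real list would be longer).
  no_suffix <- sub("(JR|SR|III|II|IV)$", "", no_space)

  # Split double-barreled names on the hyphen; try first part, then second.
  parts  <- strsplit(raw, "-", fixed = TRUE)[[1]]
  parts  <- gsub("[[:punct:][:space:]]", "", parts)
  halves <- if (length(parts) >= 2) parts[1:2] else character(0)

  # Candidates are ordered by step, so the first hit is the earliest step.
  candidates <- c(no_punct, no_space, no_suffix, halves)
  hits <- candidates[candidates %in% lookup]
  if (length(hits) > 0) hits[1] else NA_character_
}

surnames <- c("Smith", "O'Brien", "De La Cruz", "Smith Jr.", "Garcia-Lopez", "Zzyzx")
matched  <- vapply(surnames, match_surname, character(1), USE.NAMES = FALSE)
```

Ordering the candidates by step means a name is never re-matched by a later, more aggressive cleaning step once an earlier one has succeeded, which mirrors the rule that each step only applies to surnames not yet matched.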
Note: Any name appearing only on the Spanish Surname List is assigned a probability of 1 for Hispanics/Latinos and 0 for all other racial groups.
library(wru)
data(voters)
# Wrapped in if (FALSE) so it is not run automatically; remove the guard to run it.
if (FALSE) try(merge_surnames(voters))