parseName: Parse surname and given name

Description

Identify the presumed surname in each character string assumed to represent a name, and return the result as a character matrix with columns "surname" and "givenName".

Usage

parseName(x, surnameFirst=(median(regexpr(',', x))>0),
          suffix=c('Jr.', 'I', 'II', 'III', 'IV', 'Sr.'),
          fixNonStandard=subNonStandardNames, ...)

Arguments

x

a character vector

surnameFirst

logical: If TRUE, the surname comes first followed by a comma (","), then the given name. If FALSE, parse the surname from a standard Western "John Smith, Jr." format. If missing(surnameFirst), use the default shown in Usage, median(regexpr(',', x)) > 0, which is TRUE when roughly half or more of the elements of x contain a comma.

suffix

character vector of strings that are NOT a surname but might appear at the end without a comma that would otherwise identify it as a suffix.

fixNonStandard

function to look for and repair nonstandard names, such as names containing accented characters that are sometimes mangled by different software. Use identity if this is not desired.

...

optional arguments passed to fixNonStandard
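
As a rough illustration of the surnameFirst default from Usage, the base-R sketch below evaluates median(regexpr(',', x)) > 0 on two small vectors. The vectors commaNames and plainNames are hypothetical inputs invented for this illustration; only regexpr() and median() are used, so the package itself is not needed.

```r
# Hypothetical inputs, chosen only to illustrate the default.
commaNames <- c('Brown, John, Jr.', 'Smith-Johnson, Linda Rosa', 'Joe Smith')
plainNames <- c('Joe Smith', 'John Brown, Jr.', 'Linda Rosa Smith-Johnson')

# regexpr() returns the position of the first comma in each element,
# or -1 when an element contains no comma.
commaPos <- regexpr(',', commaNames)
plainPos <- regexpr(',', plainNames)

# The default is TRUE when the median of those positions is positive.
surnameFirstComma <- median(commaPos) > 0  # two of three elements have a comma
surnameFirstPlain <- median(plainPos) > 0  # only one of three has a comma
```

Because a single suffix comma (as in "John Brown, Jr.") also counts, it may be safer to pass surnameFirst explicitly when many names carry a comma-separated suffix.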

Value

a character matrix with two columns: surname and givenName

Details

If surnameFirst is FALSE:

1. If the last character is ")" and the matching "(" is 3 characters earlier, drop the parenthetical. Thus, "John Smith (AL)" becomes "John Smith".
2. Look for commas to identify a suffix like Jr. or III; remove it and call the rest x2.
3. split <- strsplit(x2, " ")
4. Take the last element of split as the surname.
5. If the "surname" found in step 4 is in suffix, save it to append to the givenName and recurse to get the actual surname.
6. Recompose the rest with any suffix as the givenName.

NOTE: This gives the wrong answer for double surnames written without a hyphen in the Spanish tradition. In "Anastasio Somoza Debayle", for example, "Somoza" and "Debayle" are the (first) surnames of Anastasio's father and mother, respectively, so the surname is "Somoza Debayle". The current algorithm returns "Debayle" as the surname, which is incorrect.
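
The steps described in Details for surnameFirst = FALSE can be sketched in base R roughly as follows. This is a minimal illustration, not the package's implementation: sketchParse is a hypothetical helper, it handles one name at a time, it uses a loop where the description says "recurse", and it skips the fixNonStandard repair entirely.

```r
# Hypothetical helper, NOT the package's parseName(): a base-R sketch of
# steps 1-6 for a single name in "given name first" format.
sketchParse <- function(name,
                        suffix = c('Jr.', 'I', 'II', 'III', 'IV', 'Sr.')) {
  # Step 1: drop a trailing two-character parenthetical such as " (AL)".
  name <- sub('[ ]*\\([^()]{2}\\)$', '', name)

  # Step 2: anything after a comma is treated as a suffix; the rest is x2.
  parts <- strsplit(name, ',', fixed = TRUE)[[1]]
  x2 <- trimws(parts[1])
  sfx <- trimws(parts[-1])  # character(0) when there is no comma

  # Step 3: split x2 into words.
  words <- strsplit(x2, ' +')[[1]]

  # Steps 4-5: the last word is the candidate surname; if it is a known
  # suffix, set it aside and look again (a loop standing in for recursion).
  while (length(words) > 1 && words[length(words)] %in% suffix) {
    sfx <- c(words[length(words)], sfx)
    words <- words[-length(words)]
  }
  surname <- words[length(words)]

  # Step 6: recompose the remaining words, plus any suffix, as the given name.
  given <- paste(words[-length(words)], collapse = ' ')
  if (length(sfx) > 0) given <- paste(c(given, sfx), collapse = ', ')

  c(surname = surname, givenName = given)
}

# Apply to a few of the names used in the Examples section.
sketchNames <- c('Joe Smith (AL)', 'John W. Brown III', 'John Q. Brown,I',
                 'Anastasio Somoza Debayle')
sketchResult <- t(sapply(sketchNames, sketchParse, USE.NAMES = FALSE))
```

As with the documented algorithm, this sketch returns "Debayle" as the surname for "Anastasio Somoza Debayle", so it reproduces the limitation described for unhyphenated double surnames.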

Examples

##
## 1.  Parse standard first-last name format
##
tst <- c('Joe Smith (AL)', 'Teresa Angelica Sanchez de Gomez',
         'John Brown, Jr.', 'John Brown Jr.',
         'John W. Brown III', 'John Q. Brown,I',
         'Linda Rosa Smith-Johnson', 'Anastasio Somoza Debayle',
         'Ra_l Vel_zquez')
library(Ecdat)
parsed <- parseName(tst)

tst2 <- matrix(c('Smith', 'Joe', 'Gomez', 'Teresa Angelica Sanchez de',
  'Brown', 'John, Jr.', 'Brown', 'John, Jr.',
  'Brown', 'John W., III', 'Brown', 'John Q., I',
  'Smith-Johnson', 'Linda Rosa', 'Debayle', 'Anastasio Somoza',
  'Velazquez', 'Raul'),
  ncol=2, byrow=TRUE)
# NOTE:  The second-to-last example is in the Spanish tradition
# and is handled incorrectly by the current algorithm.
# The correct answer should be "Somoza Debayle", "Anastasio".
# However, fixing that would complicate the algorithm excessively for now.
colnames(tst2) <- c("surname", 'givenName')

stopifnot(
all.equal(parsed, tst2)
)

##
## 2.  Parse "surname, given name" format
##
tst3 <- c('Smith (AL),Joe', 'Sanchez de Gomez, Teresa Angelica',
     'Brown, John, Jr.', 'Brown, John W., III', 'Brown, John Q., I',
     'Smith-Johnson, Linda Rosa', 'Somoza Debayle, Anastasio',
     'Vel_zquez, Ra_l')
tst4 <- parseName(tst3)

tst5 <- matrix(c('Smith', 'Joe', 'Sanchez de Gomez', 'Teresa Angelica',
  'Brown', 'John, Jr.', 'Brown', 'John W., III', 'Brown', 'John Q., I',
  'Smith-Johnson', 'Linda Rosa', 'Somoza Debayle', 'Anastasio',
  'Velazquez', 'Raul'),
  ncol=2, byrow=TRUE)
colnames(tst5) <- c("surname", 'givenName')

stopifnot(
all.equal(tst4, tst5)
)