subNonStandardNames: sub for nonstandard names

Description

sub(nonStandardNames[, 1], nonStandardNames[, 2], x)

Accented characters, common in many non-English languages, are often mangled in different ways by different software. For example, the "e" in "Andre" may carry an accent that one program replaces with one character and another program replaces with a different one.

This function first converts "Andr*" to "Andr_" for any character "*" not in standardCharacters. It then looks for "Andr_" in nonStandardNames. By default, it will find that and replace it with "Andre".
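The two stages described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: maskNonStandard, standard, and lookup are hypothetical names, and the one-row lookup table stands in for the much larger Ecdat::nonEnglishNames.

```r
# Hypothetical stand-ins for the standardCharacters and
# nonStandardNames arguments.
standard <- c(letters, LETTERS, " ", ".", ",", "-", "_")
lookup <- data.frame(nonEnglish = "Andr_", English = "Andre",
                     stringsAsFactors = FALSE)

# Stage 1: replace every character not in `standard` with `replacement`.
maskNonStandard <- function(x, standard, replacement = "_") {
  chars <- strsplit(x, "", fixed = TRUE)
  vapply(chars, function(ch) {
    ch[!(ch %in% standard)] <- replacement
    paste(ch, collapse = "")
  }, character(1))
}

x <- c("Andr\u00e9", "Andr`", "Andre")
masked <- maskNonStandard(x, standard)

# Stage 2: substitute each pattern in column 1 with the name in column 2.
for (i in seq_len(nrow(lookup))) {
  masked <- sub(lookup[i, 1], lookup[i, 2], masked, fixed = TRUE)
}
masked
```

Both mangled spellings collapse to the same masked form, "Andr_", so a single row of the lookup table handles them.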

Usage

subNonStandardNames(x,
  standardCharacters=c(letters, LETTERS, ' ', 
    '.', '?', '!', ',', 0:9,   '/', '*', '$', 
    '%', '\"', "\'", '-', '+', '&', '_', ';', 
    '(', ')', '[', ']', '\n'),
  replacement='_',
  gsubList=list(list(pattern=
        '\\\\\\\\|\\\\',
      replacement='\"')),
  removeSecondLine=TRUE,
  nonStandardNames=Ecdat::nonEnglishNames, 
  namesNotFound="attr.replacement", ...)

Arguments

x

character vector, matrix, or data.frame of character vectors in which to replace each match of nonStandardNames[, 1] in subNonStandardCharacters(x, ...) with the corresponding element of nonStandardNames[, 2].

standardCharacters, replacement, gsubList, …

arguments passed to subNonStandardCharacters

removeSecondLine

logical: If TRUE, delete anything following "\n" and return it as an attribute secondLine.
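A minimal base R sketch of this behavior, assuming that elements without a "\n" get NA in the attribute (as the expected values in the Examples suggest). The variable names are illustrative only, and the function's internals may differ.

```r
x <- c("Ed  \n --Vacancy", "Raul")

# Position of the first newline in each element (-1 if there is none).
nl <- regexpr("\n", x, fixed = TRUE)
hasSecond <- nl > 0

# Keep only the text before the newline.
firstLine <- x
firstLine[hasSecond] <- substring(x[hasSecond], 1, nl[hasSecond] - 1)

# Store the text after the newline, NA where there was none.
secondLine <- rep(NA_character_, length(x))
secondLine[hasSecond] <- substring(x[hasSecond], nl[hasSecond] + 1)
attr(firstLine, "secondLine") <- secondLine
firstLine
```

In this sketch the trailing blanks of "Ed  " are still present; the function removes leading and trailing blanks in a later step.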

nonStandardNames

data.frame or character matrix with two columns: any substring of x matching nonStandardNames[, 1] is replaced with the corresponding element of nonStandardNames[, 2].

namesNotFound

character vector describing how to treat substitutions not found in nonStandardNames[, 1]:

"attr.replacement": Return an attribute namesNotFound with grep(replacement, subNonStandardCharacters(...)), if any.
"attr.notFound": Return an attribute namesNotFound with x != subNonStandardCharacters(...), if any.
"print": Print the elements of x not found, per either "attr.replacement" or "attr.notFound", as requested.
"": Do not report any elements of x that were not found.

NOTE: x = "_" will be identified by "attr.replacement" but not by "attr.notFound", assuming the default value for replacement.
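The difference between the two detection rules can be illustrated in base R. This is a sketch, not the function's code: masked is a hand-written stand-in for the output of subNonStandardCharacters, none of these strings is assumed to appear in nonStandardNames[, 1], and exactly what the function stores in the attribute is not shown here.

```r
replacement <- "_"
x      <- c("Zz`q", "xx_x", "_", "Raul")
# Hypothetical result of masking the non-standard characters of x:
masked <- c("Zz_q", "xx_x", "_", "Raul")

# Rule behind "attr.replacement": elements that contain the
# replacement character after masking.
byReplacement <- grep(replacement, masked, fixed = TRUE, value = TRUE)

# Rule behind "attr.notFound": elements that masking changed.
byNotFound <- x[x != masked]

byReplacement
byNotFound
```

"xx_x" and "_" already contain the replacement character, so the first rule flags them even though masking left them unchanged; the second rule flags only the element that was actually altered.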

Value

a character vector with all nonStandardCharacters replaced first by replacement and then by the second column of nonStandardNames for any that match the first column. If a secondLine is found on any elements, it is returned as a secondLine attribute.

If any names with nonStandardCharacters are not found in nonStandardNames[, 1], they are identified in an optional attribute per the namesNotFound argument.

Details

1. If removeSecondLine is TRUE, delete anything following "\n" and save it as the secondLine attribute.

2. x. <- subNonStandardCharacters(x, standardCharacters, replacement, ...)

3. Loop over all rows of nonStandardNames substituting anything matching nonStandardNames[i, 1] with nonStandardNames[i, 2].

4. Eliminate leading and trailing blanks.

5. if(is.matrix(x)) return a matrix; if(is.data.frame(x)) return a data.frame(..., stringsAsFactors=FALSE)
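Step 5 can be sketched as a thin wrapper that applies an element-wise cleaner and then restores the shape of the input. Both cleanVec and cleanAny are hypothetical names, and cleanVec is only a toy stand-in for steps 1 to 4 (it knows a single substitution and trims blanks); the actual function may be organized differently.

```r
# Toy stand-in for steps 1-4: one substitution, then trim blanks.
cleanVec <- function(v) trimws(sub("Ra_l", "Raul", v, fixed = TRUE))

cleanAny <- function(x) {
  if (is.data.frame(x)) {
    # Clean column by column and keep character columns as character.
    out <- lapply(x, function(col) cleanVec(as.character(col)))
    return(as.data.frame(out, stringsAsFactors = FALSE))
  }
  if (is.matrix(x)) {
    out <- cleanVec(x)
    # Restore the matrix shape and names explicitly.
    dim(out) <- dim(x)
    dimnames(out) <- dimnames(x)
    return(out)
  }
  cleanVec(x)
}

# Column names here are illustrative only.
m <- matrix(c("Torres ", "Ra_l"), nrow = 1,
            dimnames = list(NULL, c("surname", "givenName")))
cleanAny(m)
cleanAny(as.data.frame(m, stringsAsFactors = FALSE))
```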

NOTE: On 13 May 2013 Jeff Newmiller at the University of California, Davis, wrote, 'I think it is a fools errand to think that you can automatically "normalize" arbitrary Unicode characters to an ASCII form that everyone will agree on.' (This was a reply on r-help@r-project.org, subject: "Re: [R] Matching names with non-English characters".) Doubtless someone has software to do a better job of this than what this function does, but I've so far been unable to find it in R. If you know of a better solution to this problem, I'd be pleased to hear from you. Spencer Graves

Examples


library(Ecfun)  # assumed to provide subNonStandardNames and parseName

##
## 1.  Example 
##
tstSNSN <- c('Raul', 'Ra`l', 'Torres,Raul', 
    'Torres, Ra`l', "Robert C. \\Bobby\\\\", 
    'Ed  \n --Vacancy', '', '  ')
#  ('\\' is converted to '\' before testing this in R CMD check)
#  confusion in character sets can create
#  names like tstSNSN[2]

##
## 2.  subNonStandardNames(vector)
##
SNS2 <- subNonStandardNames(tstSNSN)
SNS2

# check 
SNS2. <- c('Raul', 'Raul', 'Torres,Raul', 'Torres, Raul',
            'Robert C. "Bobby"', 'Ed', '', '')
attr(SNS2., 'secondLine') <- c(rep(NA, 5), ' --Vacancy',
        NA, NA)
all.equal(SNS2, SNS2.)

##
## 3.  subNonStandardNames(matrix)
##
tstmat <- parseName(tstSNSN, surnameFirst=TRUE)
submat <- subNonStandardNames(tstmat)

# check 
SNSmat <- parseName(SNS2., surnameFirst=TRUE)
all.equal(submat, SNSmat)

##
## 4.  subNonStandardNames(data.frame)
##
tstdf <- as.data.frame(tstmat)
subdf <- subNonStandardNames(tstdf)

# check 
SNSdf <- as.data.frame(SNSmat, stringsAsFactors=FALSE)
all.equal(subdf, SNSdf)

##
## 5.  namesNotFound 
##
noSub <- subNonStandardNames('xx_x')

# check 
noSub. <- 'xx_x'
attr(noSub., 'namesNotFound') <- 'xx_x'
all.equal(noSub, noSub.)
