subNonStandardCharacters: sub nonstandard characters with replacement

Description

First convert x to ASCII, stripping standard accents and special characters. Then, in each string, find the first and last characters not in standardCharacters and replace everything from the first through the last with replacement. For example, a string like "Ruben" in which the "e" carries an accent that some software has mangled would become something like "Rub_n" with the default values of standardCharacters and replacement.
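
The transliteration step can be tried on its own. The sketch below assumes the stringi package is installed and uses made-up input names written with Unicode escapes; it shows only the accent stripping, not the replacement of mangled characters.

```r
# Illustration only: strip accents from correctly encoded names.
# The input vector is a hypothetical example, not data from the package.
if (requireNamespace("stringi", quietly = TRUE)) {
  accented <- c("Rub\u00e9n", "\u00c1vila", "Jos\u00e9")
  ascii <- stringi::stri_trans_general(accented, "Latin-ASCII")
  print(ascii)   # "Ruben" "Avila" "Jose"
}
```

When the accented character has already been mangled into something that is not a recognizable Latin letter, transliteration cannot recover it, and that is where replacement comes in.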

Usage

subNonStandardCharacters(x,
   standardCharacters=c(letters, LETTERS, ' ','.', '?', '!', 
      ',', 0:9, '/', '*', '$', '%', '"', "'", '-', '+', '&', 
      '_', ';', '(', ')', '[', ']', ''),
   replacement='_',
   gsubList=list(list(pattern='\\\\\\\\|\\\\',
      replacement='"')), ... )

Arguments

x
    character vector in which it is desired to find the first and
    last character not in standardCharacters and replace that
    substring by replacement.

standardCharacters
    a character vector of acceptable characters to keep.

replacement
    a character to replace the substring starting and ending with
    characters not in standardCharacters.

gsubList
    list of lists of pattern and replacement arguments to be passed
    to gsub in succession before looking for nonstandard characters.

...
    optional arguments passed to strsplit

Details

1.  for(il in seq_along(gsubList)) x <- gsub(
      gsubList[[il]][["pattern"]], gsubList[[il]][["replacement"]], x)

2.  x <- stringi::stri_trans_general(x, "Latin-ASCII")
3.  nx <- length(x)
4.  x. <- strsplit(x, "", ...)
5.  for(ix in 1:nx) find the first and last characters of x.[[ix]] that
    are not in standardCharacters and substitute replacement for
    everything from the first through the last.
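
The splitting and replacement steps (4 and 5) can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: replaceNonStandardSpan is a hypothetical helper, its standardCharacters default is a shortened stand-in for the real one, and the gsub and transliteration steps are left out.

```r
# Hypothetical helper: replace the span from the first through the last
# nonstandard character of a single string with 'replacement'.
replaceNonStandardSpan <- function(s,
    standardCharacters = c(letters, LETTERS, 0:9, " ", ".", ",", "$", "%", "-"),
    replacement = "_") {
  # NA and empty strings have nothing to inspect; return them unchanged
  if (is.na(s) || !nzchar(s)) return(s)
  chars <- strsplit(s, "")[[1]]                   # one element per character
  bad <- which(!(chars %in% standardCharacters))  # positions of nonstandard characters
  if (length(bad) == 0) return(s)                 # all characters are standard
  first <- bad[1]
  last <- bad[length(bad)]
  paste0(
    paste(chars[seq_len(first - 1)], collapse = ""),             # text before the span
    replacement,
    paste(chars[seq_len(length(chars) - last) + last], collapse = "")  # text after it
  )
}

input <- c("Ra`l", "Ra`", "`l", "Torres, Raul", "", "$12", "a#b#c", NA)
out <- vapply(input, replaceNonStandardSpan, character(1), USE.NAMES = FALSE)
print(out)
```

Note that "a#b#c" becomes "a_c": the standard character "b" is lost because it lies between the first and last nonstandard characters.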
  
    
NOTES:

** To find the elements of x that have changed, use either
subNonStandardCharacters(x) != x or
grep(replacement, subNonStandardCharacters(x)), where replacement is
the replacement argument ("_" by default).

** On 13 May 2013 Jeff Newmiller at the University of California,
Davis, wrote, 'I think it is a fools errand to think that you can
automatically "normalize" arbitrary Unicode characters to an ASCII
form that everyone will agree on.'  (This was a reply on
r-help@r-project.org, subject:  "Re: [R] Matching names with
non-English characters".)

** On 2014-12-15 Ista Zahn suggested stri_trans_general.  (This was a
reply on r-help@r-project.org, subject:  "[R] Comparing Latin
characters with and without accents?".)
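
The two ways of finding changed elements mentioned in the first note do not always agree. The sketch below uses a hand-written vector, cleaned, as a stand-in for the result of subNonStandardCharacters(original), so it runs without the package; both vectors are made up for illustration.

```r
original <- c("Ra`l", "Torres, Raul", NA, "a_b")
# Stand-in for subNonStandardCharacters(original); written by hand here
cleaned  <- c("Ra_l", "Torres, Raul", NA, "a_b")

# Comparison: which() drops the NA produced by comparing NA with NA
changed <- which(original != cleaned)

# Search for the replacement character; fixed = TRUE treats "_" literally
flagged <- grep("_", cleaned, fixed = TRUE)

print(changed)
print(flagged)
```

The comparison reports only element 1, while grep also reports element 4, which already contained "_" before any substitution. The comparison is therefore the safer check when the input may legitimately contain the replacement character.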

Value

a character vector with everything from the first through the last
character not in standardCharacters replaced by replacement.

See Also

sub, strsplit, grepNonStandardCharacters, subNonStandardNames,
encoded_text_to_latex

iconv in the base package does some conversion, but is not consistent
across platforms, at least using R 3.1.2 on 2015-01-25.
stri_trans_general in the stringi package seems better.

Examples

##
## 1. Consider Names = Ruben, Avila and Jose, where "e" and "A" in
##    these examples carry an accent.  With the default values
##    for standardCharacters and replacement, these might be 
##    converted to something like Rub_n, _vila, and Jos_, with 
##    different software possibly mangling the names differently.  
##    (The standard checks for R packages in an English locale 
##    complain about non-ASCII characters, because they are 
##    not portable.)
##
nonstdNames <- c('Ra`l', 'Ra`', '`l', 'Torres, Raul',
           "Robert C. \\Bobby\\\\", NA, '', '  ', 
           '$12', '12%')
#  confusion in character sets can create
#  names like nonstdNames[2]
Name2 <- subNonStandardCharacters(nonstdNames)
str(Name2)
# check 
Name2. <- c('Ra_l', 'Ra_', '_l', nonstdNames[4],
            'Robert C. "Bobby"', NA, '', '  ', 
            '$12', '12%')
str(Name2.)
stopifnot(
all.equal(Name2, Name2.)
)
##
## 2.  Example from iconv
##
icx <- c("Ekstr\xf8m", "J\xf6reskog", 
         "bi\xdfchen Z\xfcrcher")
icx2 <- subNonStandardCharacters(icx)
# check 
icx. <- c('Ekstrom', 'Joreskog', 'bisschen Zurcher')
stopifnot(
all.equal(icx2, icx.)
)

Keywords

manip
