standardCharacters and replace all between them with replacement.
For example, a string like "Ruben" where "e" carries an accent and
is mangled by some software would become something like "Rub_n"
using the default values for standardCharacters and replacement.
subNonStandardCharacters(x,
standardCharacters=c(letters, LETTERS, ' ','.', '?', '!',
',', 0:9, '/', '*', '$', '%', '"', "'", '-', '+', '&',
'_', ';', '(', ')', '[', ']', ''),
replacement='_',
gsubList=list(list(pattern='\\\\\\\\|\\\\',
replacement='"')), ... )
- x: character vector in which it is desired to find the first and
last character not in standardCharacters and replace that
substring by replacement.
- standardCharacters: a character vector of acceptable characters
to keep.
- replacement: a character to replace the substring starting and
ending with characters not in standardCharacters.
- gsubList: list of lists of pattern and replacement arguments to
be passed to gsub in succession before looking for characters not
in standardCharacters.
- ...: optional arguments passed to strsplit.
1. for(il in 1:length(gsubList))x <- gsub(
gsubList[[il]][["pattern"]], gsubList[[il]][['replacement']], x)
2. x <- stringi::stri_trans_general(x, "Latin-ASCII")
3. nx <- length(x)
4. x. <- strsplit(x, "", ...)
5. for(ix in 1:nx) find the first and last characters in x.[[ix]]
that are not in standardCharacters and substitute replacement for
everything from the first through the last.
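The span replacement in steps 3 to 5 can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: the name subSpanSketch, the shortened default for standardCharacters, and the empty default for gsubList are all hypothetical, and the transliteration in step 2 is left out so that the sketch does not depend on stringi.

```r
# Hypothetical sketch (not the package's code): replace the span that runs
# from the first to the last non-standard character with `replacement`.
subSpanSketch <- function(x,
    standardCharacters = c(letters, LETTERS, 0:9, " ", ".", ",", "-", "'"),
    replacement = "_",
    gsubList = list()) {
  # Step 1: apply each pattern/replacement pair in succession.
  for (g in gsubList) {
    x <- gsub(g[["pattern"]], g[["replacement"]], x)
  }
  # Step 4: split every string into single characters.
  chars <- strsplit(x, "")
  # Step 5: seq_along() is safe when x has length zero.
  for (i in seq_along(x)) {
    if (is.na(x[i]) || !nzchar(x[i])) next   # leave NA and "" untouched
    ch <- chars[[i]]
    bad <- which(!(ch %in% standardCharacters))
    if (length(bad) == 0) next               # nothing to replace
    first <- bad[1]
    last <- bad[length(bad)]
    head <- paste(ch[seq_len(first - 1)], collapse = "")
    tail <- paste(ch[seq_len(length(ch) - last) + last], collapse = "")
    x[i] <- paste0(head, replacement, tail)
  }
  x
}

subSpanSketch(c("Ra`l", "a`b`c", NA))
# "Ra_l" "a_c" NA
```

Note that standard characters lying between two non-standard ones are swallowed by the replacement, as in "a`b`c" becoming "a_c".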
NOTES:
** To find the elements of x that have changed, use either
subNonStandardCharacters(x) != x or
grep(replacement, subNonStandardCharacters(x)), where replacement
is the replacement argument = "_" by default.
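The two checks can behave differently, which the following sketch tries to show. The vectors x and out are hand-written stand-ins (assumed values, not actual output of subNonStandardCharacters), chosen to include an NA and a name that already contains "_".

```r
# Hand-written stand-ins for an input and its converted result (assumed).
x   <- c("Ra`l", "Torres, Raul", NA, "snake_case")
out <- c("Ra_l", "Torres, Raul", NA, "snake_case")

# Elementwise comparison: NA wherever x is NA, so wrap it in which().
changed <- which(out != x)

# Searching for the replacement also flags names that had "_" originally.
hasRepl <- grep("_", out, fixed = TRUE)

changed   # 1
hasRepl   # 1 4
```

The comparison is the more precise test when the input may already contain the replacement character; grep is convenient when only the converted vector is at hand.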
** On 13 May 2013 Jeff Newmiller at the University of California,
Davis, wrote, 'I think it is a fools errand to think that you can
automatically "normalize" arbitrary Unicode characters to an ASCII
form that everyone will agree on.' (This was a reply on
r-help@r-project.org, subject: "Re: [R] Matching names with
non-English characters".)
** On 2014-12-15 Ista Zahn suggested stri_trans_general. (This was
a reply on r-help@r-project.org, subject: "[R] Comparing Latin
characters with and without accents?".)
a character vector with everything between the first and last
character not in standardCharacters replaced by replacement.
sub, strsplit, grepNonStandardCharacters, subNonStandardNames,
encoded_text_to_latex
iconv in the base package does some conversion, but is not
consistent across platforms, at least using R 3.1.2 on 2015-01-25.
stri_trans_general seems better.
##
## 1. Consider Names = Ruben, Avila and Jose, where "e" and "A" in
## these examples carry an accent. With the default values
## for standardCharacters and replacement, these might be
## converted to something like Rub_n, _vila, and Jos_, with
## different software possibly mangling the names differently.
## (The standard checks for R packages in an English locale
## complain about non-ASCII characters, because they are
## not portable.)
##
nonstdNames <- c('Ra`l', 'Ra`', '`l', 'Torres, Raul',
                 "Robert C. \\Bobby\\\\", NA, '', ' ',
                 '$12', '12%')
# confusion in character sets can create
# names like Names[2]
Name2 <- subNonStandardCharacters(nonstdNames)
str(Name2)# check
Name2. <- c('Ra_l', 'Ra_', '_l', nonstdNames[4],
'Robert C. "Bobby"', NA, '', ' ',
'$12', '12%')
str(Name2.)
stopifnot(
all.equal(Name2, Name2.)
)
##
## 2. Example from iconv
##
icx <- c("Ekstr\xf8m", "J\xf6reskog",
"bi\xdfchen Z\xfcrcher")
icx2 <- subNonStandardCharacters(icx)
# check
icx. <- c('Ekstrom', 'Joreskog', 'bisschen Zurcher')
stopifnot(
all.equal(icx2, icx.)
)