Strings are converted to UTF-32 (unsigned integer)
by default prior to any further computation. This means that results are
encoding-independent and that strings are interpreted as a sequence of
symbols, not as a sequence of pure bytes. In functions where this is
relevant, this may be switched by setting the useBytes option to
TRUE. However, keep in mind that results will then likely depend on the
system R is running on, except when your strings are pure ASCII.
Also, for multi-byte encodings, results for byte-wise computations
will usually differ from results using encoded computations.
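
The difference between the symbol-wise and the byte-wise view can be sketched as follows. This is a minimal illustration rather than a description of the package internals: the base R part only inspects a single string, the variable names are made up for the example, and the stringdist call assumes the package is installed and is skipped otherwise.

```r
# "\u00fc" is u-umlaut; enc2utf8() ensures it is stored as UTF-8.
x <- enc2utf8("\u00fc")

# One symbol, but two bytes in UTF-8.
n_chars <- nchar(x, type = "chars")
n_bytes <- nchar(x, type = "bytes")

# What a symbol-wise comparison works with (the code point) versus
# what a byte-wise comparison works with (the raw bytes).
code_point <- utf8ToInt(x)
raw_bytes  <- charToRaw(x)

# Assumes the stringdist package is installed; skipped otherwise.
if (requireNamespace("stringdist", quietly = TRUE)) {
  d_symbols <- stringdist::stringdist(x, "u", method = "lv")
  d_bytes   <- stringdist::stringdist(x, "u", method = "lv", useBytes = TRUE)
  print(c(symbols = d_symbols, bytes = d_bytes))
}
```

Symbol-wise, the two strings differ in a single character, while byte-wise the accented character occupies two bytes, so the two distances need not agree.
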
Prior to version 0.9, setting useBytes=TRUE could
give a significant performance enhancement. Since version 0.9, translation
to integer is done by C code internal to the stringdist package, and the
difference in performance is now negligible.

In UTF-8, the same (accented) character may be represented by several byte sequences. For example, a u-umlaut
can be represented with a single code point, or as the code point for 'u' followed by a combining modifier
that adds the umlaut.

Transliteration to ASCII can be done with iconv(x, to="ASCII//TRANSLIT"),
where x is your character vector. See the documentation of iconv for details.
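
A small sketch of the iconv call above, using base R only. The example strings are invented for illustration, and the replacement chosen for an accented character is decided by the system's iconv implementation, so the printed result may differ between platforms.

```r
# Character vector with accented characters, stored as UTF-8.
x <- enc2utf8(c("M\u00fcller", "caf\u00e9", "plain"))

# Transliterate to ASCII with base R. The exact replacement chosen for
# an accented character depends on the iconv implementation of the system;
# entries that cannot be converted become NA.
ascii <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")
print(ascii)

# Strings that are already pure ASCII pass through unchanged.
print(ascii[3])
```
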
The stringi package (Gagolewski and Tartanus) should work on any system. The command
stringi::stri_trans_general(x,"Latin-ASCII") transliterates character vector x to ASCII.
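
The stringi route can be sketched in the same way. The block below assumes stringi is installed and skips the stringi calls otherwise. Besides the transliteration command quoted above, it uses stri_trans_nfc, one of stringi's Unicode normalisation functions, on the two representations of u-umlaut discussed earlier. The variable names and example strings are illustrative.

```r
# The same u-umlaut, once as a single code point and once as 'u'
# followed by a combining diaeresis (U+0308).
composed   <- "\u00fc"
decomposed <- "u\u0308"

# Base R: the two strings consist of different code points.
cp_composed   <- utf8ToInt(composed)
cp_decomposed <- utf8ToInt(decomposed)
same_as_is <- identical(cp_composed, cp_decomposed)

# Assumes the stringi package is installed; skipped otherwise.
if (requireNamespace("stringi", quietly = TRUE)) {
  # Transliteration, as in the command quoted in the text.
  print(stringi::stri_trans_general(c("M\u00fcller", "caf\u00e9"), "Latin-ASCII"))

  # Normalisation to NFC composes 'u' + modifier into one code point,
  # so both representations can then be compared on equal terms.
  print(identical(stringi::stri_trans_nfc(decomposed),
                  stringi::stri_trans_nfc(composed)))
}
```

Normalising both inputs before computing string distances is one way to keep visually identical strings from being counted as different.
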
The documentation of Encoding and iconv gives a good overview of base R's
encoding conversion options. The capabilities of iconv depend on the system R is running on.
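
As a rough illustration of these base R tools, the sketch below inspects the declared encoding of a string, converts it with iconv, and queries which encodings the local iconv knows about. The variable names are illustrative, and the number of available encodings will vary by system.

```r
x <- enc2utf8("\u00fc")

# Declared encoding of a string: "UTF-8", "latin1", "bytes" or "unknown".
enc_x     <- Encoding(x)
enc_ascii <- Encoding("abc")   # pure ASCII strings are never marked

# Convert between encodings with iconv(); in latin1 the u-umlaut
# is a single byte.
x_latin1     <- iconv(x, from = "UTF-8", to = "latin1")
bytes_latin1 <- charToRaw(x_latin1)

# Encodings supported by the local iconv; the list is system-dependent
# and may be unavailable on some platforms, hence the tryCatch().
available <- tryCatch(iconvlist(), error = function(e) character(0))
print(length(available))
```
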
See also: stringdist, stringdistmatrix, amatch, ain, qgrams, printable_ascii