utf8
utf8 is an R package for manipulating and printing UTF-8 text that fixes multiple bugs in R’s UTF-8 handling.
Installation
Stable version
utf8 is available on CRAN. To install the latest released version, run the following command in R:
install.packages("utf8")
Development version
To install the latest development version, run the following:
devtools::install_github("patperry/r-utf8")
Usage
library(utf8)
Validate character data and convert to UTF-8
Use as_utf8() to validate input text and convert to UTF-8 encoding. The function alerts you if the input text has the wrong declared encoding:
# second entry is encoded in latin-1, but declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")
as_utf8(x) # fails
#> Error in as_utf8(x): entry 2 has wrong Encoding; marked as "UTF-8" but leading byte 0xE7 followed by invalid continuation byte (0x69) at position 4
# mark the correct encoding
Encoding(x[2]) <- "latin1"
as_utf8(x) # succeeds
#> [1] "façile" "façile" "façile"
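If you want to find the offending entries before calling as_utf8(), a base R check can help. The following is a minimal sketch, not part of the utf8 package: it relies only on the base functions validUTF8() and iconv(), and it assumes that any entry failing validation is really latin-1, which you would need to confirm for your own data. The names `bad` and `repaired` are illustrative.

```r
# Same vector as above: entry 2 holds latin-1 bytes but is declared as UTF-8.
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")

# validUTF8() looks at the raw bytes and ignores the declared encoding,
# so it flags entry 2 and accepts entry 3 (valid UTF-8 marked as "bytes").
bad <- which(!validUTF8(x))

# Assumption: the invalid entries are latin-1. Convert just those entries.
repaired <- iconv(x[bad], from = "latin1", to = "UTF-8")
```

After this, `bad` identifies which entries need their Encoding() corrected, as done by hand in the example above.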
Normalize data
Use utf8_normalize() to convert to Unicode composed normal form (NFC). Optionally apply compatibility maps for NFKC normal form or case-fold.
# three ways to encode an angstrom character
(angstrom <- c("\u00c5", "\u0041\u030a", "\u212b"))
#> [1] "Å" "Å" "Å"
utf8_normalize(angstrom) == "\u00c5"
#> [1] TRUE TRUE TRUE
# perform full Unicode case-folding
utf8_normalize("Größe", map_case = TRUE)
#> [1] "grösse"
# apply compatibility maps to NFKC normal form
# (example from https://twitter.com/aprilarcus/status/367557195186970624)
utf8_normalize("