Conversion of UTF-8 encoded character vectors to and from integer vectors representing a UTF-32 encoding.
utf8ToInt(x)
intToUtf8(x, multiple = FALSE)
x: object to be converted.
multiple: logical; should the conversion be to a single character string or multiple individual characters?
utf8ToInt converts a length-one character string encoded in UTF-8 to an
integer vector of Unicode code points. It checks validity of the input.
(Currently it accepts UTF-8 encodings of code points greater than 0x10FFFF:
these are no longer regarded as valid by the UTF-8 RFC and will in future be
mapped to NA. Following ‘Corrigendum 9’ the UTF-8 encodings of the
‘noncharacters’ 0xFFFE and 0xFFFF are regarded as valid as from R 3.4.3.)
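As a minimal sketch of the conversion described above, the following calls show utf8ToInt returning one integer code point per character, regardless of how many bytes each character occupies in UTF-8. The variable names are illustrative only.

```r
# A single ASCII character maps to its code point.
a <- utf8ToInt("A")           # 65

# A multi-byte character still yields exactly one code point.
sz <- utf8ToInt("\u00df")     # 223 (0xDF, LATIN SMALL LETTER SHARP S)

# A longer string yields one integer per character, not per byte.
cps <- utf8ToInt("a\u00dfb")  # 97 223 98
```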
intToUtf8 converts a numeric vector of Unicode code points either (default)
to a single character string or a character vector of single characters.
Non-integral numeric values are truncated to integers: values above the
maximum are mapped to NA. For a single character string 0 is silently
omitted: otherwise 0 is mapped to "". The Encoding of a non-NA return value
is declared as "UTF-8".
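The effect of the multiple argument, the handling of 0 and the truncation of non-integral values can be illustrated with a short sketch. It follows the description above; the variable names are hypothetical, and a non-ASCII code point is used for the Encoding check because that is where the declared encoding is visible.

```r
cp <- c(72, 0, 105)                      # "H", a zero, "i"

# Default: one string; the 0 is silently dropped.
one <- intToUtf8(cp)                     # "Hi"

# multiple = TRUE: one element per code point; the 0 becomes "".
many <- intToUtf8(cp, multiple = TRUE)   # "H" "" "i"

# Non-integral input is truncated before conversion.
trunc_a <- intToUtf8(65.9)               # "A"

# The result for a non-ASCII code point is declared as UTF-8.
enc <- Encoding(intToUtf8(0x3B2))        # "UTF-8"
```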
Invalid and NA inputs are mapped to NA output.
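A brief sketch of the NA handling, assuming the behaviour stated above. Note that utf8ToInt expects a character argument, so the character NA (NA_character_) is used; the variable names are illustrative.

```r
# NA in, NA out.
na_int <- utf8ToInt(NA_character_)                   # NA
na_chr <- intToUtf8(NA_integer_)                     # NA

# With multiple = TRUE only the NA element is affected.
mixed <- intToUtf8(c(72, NA, 105), multiple = TRUE)  # "H" NA "i"

# A value above 0x10FFFF lies outside the code point range.
too_big <- intToUtf8(0x110000)                       # NA
```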
Both functions work in any locale, including on platforms that do not otherwise support multi-byte character sets.
Unicode defines a name and a number for each of the glyphs it encompasses;
the numbers are called code points. Since RFC 3629 they run from 0 to
0x10FFFF (with about 12% being assigned by version 10.0 of the Unicode
standard).
intToUtf8 does not handle surrogate pairs (which should not occur in
UTF-32): inputs in the surrogate ranges are mapped to NA.
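The following sketch, under the behaviour stated above, contrasts a surrogate value with a supplementary-plane character supplied as a single code point. U+1F600 is chosen here purely as an example of a code point above 0xFFFF.

```r
# Values in the surrogate range U+D800..U+DFFF are not converted.
hi_sur <- intToUtf8(0xD800)   # NA
lo_sur <- intToUtf8(0xDFFF)   # NA

# A character beyond U+FFFF is given as one code point,
# not as a UTF-16 style surrogate pair.
s <- intToUtf8(0x1F600)
back <- utf8ToInt(s)          # 128512, i.e. 0x1F600 again
```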
https://tools.ietf.org/html/rfc3629, the current standard for UTF-8.
http://www.unicode.org/versions/corrigendum9.html for non-characters.
## Not run:
## will only display in some locales and fonts
intToUtf8(0x03B2L) # Greek beta
## End(Not run)

utf8ToInt("bi\u00dfchen")          # string containing U+00DF
utf8ToInt("\xfa\xb4\xbf\xbf\x9f")  # five-byte sequence for a code point above 0x10FFFF