utf8Conversion

Convert Integer Vectors to or from UTF-8-encoded Character Vectors

Conversion of UTF-8 encoded character vectors to and from integer vectors representing a UTF-32 encoding.

Keywords
utilities, character
Usage
utf8ToInt(x)
intToUtf8(x, multiple = FALSE)
Arguments
x

object to be converted.

multiple

logical: should the conversion be to a single character string or multiple individual characters?

Details

These functions work in any locale, including on platforms that do not otherwise support multi-byte character sets.

Unicode defines a name and a number for each of the glyphs it encompasses. The numbers are called code points; since RFC 3629 they run from 0 to 0x10FFFF (with about 12% being assigned by version 10.0 of the Unicode standard).

intToUtf8 does not handle surrogate pairs (which should not occur in UTF-32): inputs in the surrogate ranges are mapped to NA.
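A minimal sketch of the surrogate behaviour described above, assuming the default arguments. The values 0xD800 and 0x1F600 are illustrative choices, not taken from the original page.

```r
# A value in the surrogate range (0xD800 to 0xDFFF) is not a valid
# UTF-32 code point, so intToUtf8() maps it to NA.
surrogate <- intToUtf8(0xD800)
is.na(surrogate)

# A character outside the Basic Multilingual Plane is passed as a single
# code point rather than as a pair of surrogates.
astral <- intToUtf8(0x1F600)
utf8ToInt(astral) == 0x1F600
```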

Value

utf8ToInt converts a length-one character string encoded in UTF-8 to an integer vector of Unicode code points. It checks validity of the input. (Currently it accepts UTF-8 encodings of code points greater than 0x10FFFF: these are no longer regarded as valid by the UTF-8 RFC and will in future be mapped to NA. Following ‘Corrigendum 9’ the UTF-8 encodings of the ‘noncharacters’ 0xFFFE and 0xFFFF are regarded as valid as from R 3.4.3.)
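As a rough illustration of the length-one requirement, the sketch below converts one string and then wraps utf8ToInt in a helper for longer vectors. The helper name utf8_to_int_each is hypothetical and is not part of base R.

```r
# utf8ToInt() takes exactly one string and returns its code points.
cp <- utf8ToInt("bi\u00dfchen")
cp   # 98 105 223 99 104 101 110

# Hypothetical helper (not in base R): apply utf8ToInt() to each element
# of a character vector and return a list of integer vectors.
utf8_to_int_each <- function(strings) lapply(strings, utf8ToInt)
each <- utf8_to_int_each(c("a", "\u03b2"))

# Converting back recovers the original string.
round_trip <- intToUtf8(cp)
```

A list is returned by the helper because the strings may contain different numbers of code points.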

intToUtf8 converts a numeric vector of Unicode code points either (default) to a single character string or a character vector of single characters. Non-integral numeric values are truncated to integers: values above the maximum are mapped to NA. For a single character string 0 is silently omitted: otherwise 0 is mapped to "". The Encoding of a non-NA return value is declared as "UTF-8".
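The rules in the paragraph above can be sketched as follows. The input values are made up for illustration.

```r
codes <- c(72, 105, 0x03B2)   # "H", "i", Greek beta

one_string <- intToUtf8(codes)                   # one string of three characters
per_char   <- intToUtf8(codes, multiple = TRUE)  # character vector of length 3

# Non-integral values are truncated, so 72.9 is treated as 72 ("H").
truncated <- intToUtf8(72.9)

# 0 is dropped from a single string but becomes "" when multiple = TRUE.
zero_single   <- intToUtf8(c(72, 0, 105))
zero_multiple <- intToUtf8(c(72, 0, 105), multiple = TRUE)

# Values above 0x10FFFF are mapped to NA.
too_large <- intToUtf8(0x110000)

# A non-ASCII result is declared as UTF-8.
enc <- Encoding(intToUtf8(0x03B2))
```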

Invalid and NA inputs are mapped to NA output.
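A short sketch of the NA case in both directions, using illustrative inputs. Invalid UTF-8 input is not shown here.

```r
na_codes  <- utf8ToInt(NA_character_)              # NA (integer)
na_string <- intToUtf8(NA)                         # NA (character)

# With multiple = TRUE only the NA element is affected.
mixed <- intToUtf8(c(72, NA), multiple = TRUE)     # "H" NA
```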

References

https://tools.ietf.org/html/rfc3629, the current standard for UTF-8.

http://www.unicode.org/versions/corrigendum9.html for non-characters.

Aliases
  • utf8ToInt
  • intToUtf8
  • Unicode
  • code point
Examples
library(base)
# NOT RUN {
## will only display in some locales and fonts
intToUtf8(0x03B2L) # Greek beta
# }
# NOT RUN {
utf8ToInt("bi\u00dfchen")
utf8ToInt("\xfa\xb4\xbf\xbf\x9f")
# }
Documentation reproduced from package base, version 3.4.3, License: Part of R 3.4.3
