stri_enc_isutf8
). Most of the computations in
We have observed that R correctly handles UTF-8 strings
regardless of your platform's native encoding (see
below). Therefore, we decided that most functions in
Note that some Unicode characters may have an ambiguous
representation. For example, "a with ogonek" (one
code point) and "a"+"ogonek" (two code points) are
semantically the same. See stri_enc_nfc for discussion. However,
denormalized strings appear very rarely in typical
string processing activities.
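The ambiguity can be seen with a small sketch that uses only base R. The variable names are illustrative and not part of any package. It builds both representations of "a with ogonek" and shows that they differ code point by code point, even though they render identically.

```r
# Precomposed form: a single code point, U+0105.
composed <- "\u0105"
# Decomposed form: "a" followed by COMBINING OGONEK, U+0328.
decomposed <- "a\u0328"

# The two strings are different byte sequences, so a naive comparison fails.
identical(composed, decomposed)

# Code points of each representation.
cp_composed   <- utf8ToInt(composed)     # 261
cp_decomposed <- utf8ToInt(decomposed)   # 97 808
```

Normalizing both strings to the same form before comparing them removes this discrepancy.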
Basically, R has a very simple encoding-marking
mechanism, see Encoding. There is an implicit
assumption that your platform's default (native) encoding
is always an 8-bit one and it is a superset of ASCII --
stri_enc_set
.
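As a minimal sketch of the marking mechanism, the following uses only base R (Encoding, enc2utf8, charToRaw, rawToChar). The variable names are illustrative. It shows how a mark is read, how it can be set by hand, and that changing a mark does not change the underlying bytes.

```r
# Strings created with \u escapes are marked as UTF-8.
x <- "\u0105"
mark_x <- Encoding(x)          # "UTF-8"

# ASCII-only strings carry no explicit mark.
mark_ascii <- Encoding("abc")  # "unknown"

# A single byte 0xe9 declared to be latin1, then converted to UTF-8.
w <- rawToChar(as.raw(0xe9))
Encoding(w) <- "latin1"
w_utf8 <- enc2utf8(w)

# Re-marking as "bytes" leaves the bytes themselves untouched.
z <- x
Encoding(z) <- "bytes"
bytes_z <- charToRaw(z)        # c4 85
```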
Character strings in R (internally) can be declared to be in:
"UTF-8";
"latin1", i.e. ISO-8859-1 (Western European);
"bytes" -- strings should be manipulated as bytes; encoding is not set;
"unknown" (quite misleading name: no explicit encoding mark) -- strings are assumed to be in your platform's native (default) encoding.
Native strings often appear as a result of inputting a
string from the keyboard or a file. This makes sense: your
operating system works in some encoding and provides R with some data. Each time when a stri_enc_get
(default encoding
should only be changed if autodetect fails on
Functions which allow "bytes"
encoding markings
are very rare in stri_enc_toutf8
(with
argument is_unknown_8bit=TRUE
),
stri_enc_toascii
, and
stri_encode
.
stri_enc_list
for the list of encodings
supported by The stri_encode
function allows you to
convert between any given encodings (in some cases you
will obtain "bytes"-marked strings, or even lists
of raw vectors, e.g. for UTF-16). There are also some
useful, more specialized functions, like
stri_enc_toutf32 (converts a character
vector to a list of integers, where one code point is
exactly one numeric value) or
stri_enc_toascii (substitutes all non-ASCII
bytes with the SUBSTITUTE CHARACTER, which plays a
role similar to R's NA value).
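A short sketch of the conversion functions named above follows. It assumes the package is installed and that the argument names (from, to, to_raw) match the installed version. The sample string and variable names are illustrative.

```r
library(stringi)

x <- "a\u0105z"   # "a", a with ogonek (U+0105), "z"

# One integer per code point; the result is a list with one element per string.
cp <- stri_enc_toutf32(x)

# The inverse operation rebuilds the string from its code points.
back <- stri_enc_fromutf32(cp)

# The non-ASCII code point is replaced with SUBSTITUTE (byte 0x1a).
ascii <- stri_enc_toascii(x)
ascii_bytes <- charToRaw(ascii)

# Converting to UTF-16 with to_raw = TRUE gives a list of raw vectors.
utf16 <- stri_encode(x, from = "UTF-8", to = "UTF-16BE", to_raw = TRUE)
```

Raw vectors are returned for UTF-16 because such data may contain zero bytes, which an R character string cannot hold.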
There are also some routines for automated encoding
detection, see e.g. stri_enc_detect
(for
stri_enc_detect2
for our own,
locale-sensitive solution.
Encoding detection is always an imprecise operation and needs a considerable amount of data. However, in the case of some encodings (like UTF-8, ASCII, or UTF-32), a "false positive" byte sequence is statistically quite rare.
Check out stri_enc_detect
and
stri_enc_detect2
(among others) for useful
functions from this category.
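The checks can be sketched as follows, assuming the installed version accepts raw vectors and lists of raw vectors as input. The sample data is illustrative. The validity checks give a definite answer, whereas the detector only returns ranked guesses, so no particular result is shown for it.

```r
library(stringi)

# ASCII check on ordinary character strings.
is_ascii <- stri_enc_isascii(c("abc", "\u0105"))   # TRUE FALSE

# UTF-8 validity check on raw byte sequences.
valid   <- as.raw(c(0xc4, 0x85))   # complete two-byte sequence for U+0105
invalid <- as.raw(0xc4)            # lead byte with no continuation byte
is_utf8 <- stri_enc_isutf8(list(valid, invalid))   # TRUE FALSE

# Automated detection returns candidate encodings; longer input
# generally gives more reliable guesses.
sample_bytes <- charToRaw("Za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144")
guess <- stri_enc_detect(sample_bytes)
```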
"Unicode provides a single character set that covers the major languages of the world, and a small number of machine-friendly encoding forms and schemes to fit the needs of existing applications and protocols. It is designed for best interoperability with both ASCII and ISO-8859-1 (the most widely used character sets) to make it easier for Unicode to be used in almost all applications and protocols" (see the ICU User Guide).
The Unicode Standard determines the way to map any possible character to a numeric value -- a so-called code point. Such code points, however, have to be stored somehow in a computer's memory. The Unicode Standard encodes characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space. Depending on the encoding form (UTF-8, UTF-16, or UTF-32), each character will then be represented either as a sequence of one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit integer (cf. the ICU FAQ).
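The storage sizes described above can be checked with a base R sketch. The four sample characters are chosen for illustration, and the UTF-16 step assumes that the platform's iconv supports the "UTF-16BE" name.

```r
# One character from each UTF-8 length class:
# U+0041, U+0105, U+20AC, and U+1F600 (outside the Basic Multilingual Plane).
chars <- c("A", "\u0105", "\u20ac", "\U0001F600")

# UTF-8: one to four bytes per character.
utf8_bytes <- nchar(chars, type = "bytes")   # 1 2 3 4

# Each string is still a single code point.
n_chars <- nchar(chars, type = "chars")      # 1 1 1 1

# UTF-16: one or two 16-bit code units (two bytes each).
utf16_units <- lengths(iconv(chars, "UTF-8", "UTF-16BE", toRaw = TRUE)) / 2

# UTF-32: the code point itself, as a single integer.
code_points <- vapply(chars, utf8ToInt, integer(1), USE.NAMES = FALSE)
```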
In most cases, Unicode is a superset of the characters supported by any given codepage.
Conversion -- ICU User Guide,
Converters -- ICU User Guide,
UTF-8, UTF-16, UTF-32 & BOM -- ICU FAQ,
stri_conv, stri_encode; stri_enc_fromutf32; stri_enc_toascii; stri_enc_toutf32; stri_enc_toutf8
Other encoding_detection: stri_enc_detect2; stri_enc_detect; stri_enc_isascii; stri_enc_isutf16be, stri_enc_isutf16le, stri_enc_isutf32be, stri_enc_isutf32le; stri_enc_isutf8
Other encoding_management: stri_enc_get, stri_enc_set; stri_enc_info; stri_enc_list
Other encoding_normalization: stri_enc_isnfc, stri_enc_isnfd, stri_enc_isnfkc, stri_enc_isnfkc_casefold, stri_enc_isnfkd, stri_enc_nfc, stri_enc_nfd, stri_enc_nfkc, stri_enc_nfkc_casefold, stri_enc_nfkd
Other stringi_general_topics: stringi-arguments; stringi-locale; stringi-package; stringi-search-charclass; stringi-search-fixed; stringi-search-regex; stringi-search