In particular you should note that a character vector printed with print, cat, etc. is silently re-encoded so that each string can be properly shown, e.g. in R's console.
See also stri_enc_isutf8.
Most of the computations in stringi [...]
We have observed that R correctly handles UTF-8 strings regardless of your
platform's native encoding (see below). Therefore, we decided that most
functions in stringi return their results in UTF-8.
Note that some Unicode characters may have an ambiguous representation.
For example, "a with ogonek" (one code point) and "a"+"ogonek"
(two code points) are semantically the same. See stri_trans_nfc
for discussion. However, denormalized strings
appear very rarely in typical string processing activities.
Additionally, do note that [...] (see stri_enc_toutf8).
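The ambiguity described above can be illustrated with a minimal sketch, assuming the stringi package is installed. The variable names are illustrative only; NFC is just one of the normalization forms discussed under stri_trans_nfc.

```r
library(stringi)

composed   <- "\u0105"   # LATIN SMALL LETTER A WITH OGONEK (one code point)
decomposed <- "a\u0328"  # "a" followed by COMBINING OGONEK (two code points)

stri_length(composed)    # number of code points in the composed form
stri_length(decomposed)  # number of code points in the decomposed form
composed == decomposed   # the underlying code point sequences differ

# After NFC normalization, the decomposed form becomes the composed one
nfc <- stri_trans_nfc(decomposed)
nfc == composed
```

Both strings are displayed identically, which is why such differences tend to go unnoticed until two strings are compared byte by byte.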
Basically, R has a very simple encoding marking mechanism,
see stri_enc_mark. There is an implicit assumption
that your platform's default (native) encoding is always a superset
of ASCII (see also stri_enc_set).
Character strings in R (internally) can be declared to be in:
- UTF-8;
- latin1, i.e. ISO-8859-1 (Western European);
- bytes -- for strings that should be manipulated as sequences of bytes;
- native (a.k.a. unknown in Encoding; quite a misleading name: no explicit encoding mark) -- for strings that are assumed to be in your platform's native (default) encoding. This can represent UTF-8 if you are an OS X user, or some 8-bit Windows code page, for example. The native encoding used by R may be determined by examining the LC_CTYPE category, see Sys.getlocale.

Intuitively, "native" strings result from inputting a string, e.g. via a keyboard. This makes sense: your operating system works in some encoding and provides R with some data.
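The encoding marks can be inspected directly. The following is a small sketch, assuming stringi is installed; the variable names are made up for illustration, and the byte values were chosen only because they lie above 127.

```r
library(stringi)

ascii <- "abc"
utf8  <- "\u0105"                  # \u escapes yield a UTF-8-marked string
lat1  <- rawToChar(as.raw(0xb1))   # a single byte above 127 ...
Encoding(lat1) <- "latin1"         # ... declared to be ISO-8859-1
byt   <- rawToChar(as.raw(0xff))
Encoding(byt) <- "bytes"           # to be treated as a plain byte sequence

x <- c(ascii, utf8, lat1, byt)
Encoding(x)                        # base R view: no mark shows up as "unknown"
marks <- stri_enc_mark(x)          # stringi view of the same marks
marks

Sys.getlocale("LC_CTYPE")          # platform-dependent
stri_enc_get()                     # default encoding assumed by stringi
```

Note that base R's Encoding reports a pure-ASCII string as "unknown", whereas stri_enc_mark distinguishes ASCII from other unmarked (native) strings.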
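Each time a stringi function encounters a string in the native encoding,
it assumes that the data are in the default encoding, i.e. the one returned
by stri_enc_get (unless you know what you are doing, the default encoding
should only be changed if the automatic encoding detection process fails
when stringi is loaded).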
Functions which allow "bytes" encoding markings are very rare in stringi.
These are: stri_enc_toutf8 (with argument is_unknown_8bit=TRUE),
stri_enc_toascii, and stri_encode.
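As a hedged illustration of one of these functions, the sketch below passes a bytes-marked string to stri_enc_toutf8 with is_unknown_8bit=TRUE. It assumes stringi is installed; the input bytes are arbitrary and chosen only to include one value above 127.

```r
library(stringi)

x <- rawToChar(as.raw(c(0x61, 0xff)))  # "a" followed by a byte above 127
Encoding(x) <- "bytes"

# Bytes above 127 are replaced with U+FFFD (REPLACEMENT CHARACTER)
y <- stri_enc_toutf8(x, is_unknown_8bit = TRUE)

stri_enc_mark(y)
stri_enc_isutf8(y)
y == "a\ufffd"
```

This is a lossy operation: the original byte value cannot be recovered from the result.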
Finally, note that R lets strings in ASCII, UTF-8, and your platform's
native encoding coexist peacefully. A character vector printed with
print, cat, etc. is silently re-encoded so that each
string can be properly shown, e.g. on the console.
See stri_enc_list for the list of encodings supported by ICU.
The stri_encode function allows you to convert between any given encodings
(in some cases you will obtain bytes-marked strings, or even lists of
raw vectors, e.g. for UTF-16).
There are also some useful, more specialized functions,
like stri_enc_toutf32 (converts a character vector to a list
of integers, where one code point is exactly one numeric value)
or stri_enc_toascii (substitutes all non-ASCII
bytes with the SUBSTITUTE CHARACTER,
which plays a role similar to R's NA value).
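The conversion functions mentioned above can be tried out as follows. This is a minimal sketch, assuming stringi is installed; ISO-8859-2 is used here only as an example of an 8-bit target encoding, and the variable names are illustrative.

```r
library(stringi)

x <- "a\u0105"   # "a" followed by "a with ogonek"

# One integer per code point
cp <- stri_enc_toutf32(x)
cp[[1]]

# Non-ASCII content is replaced with SUBSTITUTE CHARACTER (0x1A)
asc <- stri_enc_toascii(x)
charToRaw(asc)

# Conversion to an 8-bit encoding, returned as a list of raw vectors
raw_l2 <- stri_encode(x, from = "UTF-8", to = "ISO-8859-2", to_raw = TRUE)
raw_l2[[1]]

# ... and back to UTF-8
back <- stri_encode(raw_l2, from = "ISO-8859-2", to = "UTF-8")
back == x
```

Requesting raw output with to_raw=TRUE avoids having to reason about how R would mark a string in an encoding it has no mark for.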
There are also some routines for automated encoding detection,
see e.g. stri_enc_detect.
Encoding detection is always an imprecise operation and needs a considerable amount of data. However, in case of some encodings (like UTF-8, ASCII, or UTF-32) a "false positive" byte sequence is quite rare (statistically speaking).
Check out stri_enc_detect (among others) for a useful
function in this category.
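A short sketch of the detection helpers follows, assuming stringi is installed. The sample text and variable names are made up for illustration. No particular guess is claimed for stri_enc_detect, since its result depends on the input and on the ICU version.

```r
library(stringi)

# Exact checks: is the input valid ASCII / valid UTF-8?
stri_enc_isascii(c("abc", "\u0105"))
stri_enc_isutf8("\u0105")

bad <- as.raw(c(0x61, 0xff))   # 0xFF never occurs in valid UTF-8
stri_enc_isutf8(bad)           # a raw vector is treated as a single string

# Heuristic detection on bytes in an 8-bit encoding (ISO-8859-2 here)
txt   <- "Za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"
bytes <- stri_encode(txt, from = "UTF-8", to = "ISO-8859-2", to_raw = TRUE)
guess <- stri_enc_detect(bytes)
guess[[1]]                     # candidate encodings with confidence values
```

With such a short input the candidates may well be wrong or ambiguous, which is in line with the remark that detection needs a considerable amount of data.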
"Unicode provides a single character set that covers the major languages of the world, and a small number of machine-friendly encoding forms and schemes to fit the needs of existing applications and protocols. It is designed for best interoperability with both ASCII and ISO-8859-1 (the most widely used character sets) to make it easier for Unicode to be used in almost all applications and protocols" (see the ICU User Guide).
The Unicode Standard determines the way to map any possible character to a numeric value -- a so-called code point. Such code points, however, have to be stored somehow in a computer's memory. The Unicode Standard encodes characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space. Depending on the encoding form (UTF-8, UTF-16, or UTF-32), each character is then represented as a sequence of one to four 8-bit bytes, as one or two 16-bit code units, or as a single 32-bit integer, respectively (cf. the ICU FAQ).
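The storage sizes of the three encoding forms can be compared with a small sketch, assuming stringi is installed. The four sample characters were picked so that their UTF-8 representations take one, two, three, and four bytes; the variable names are illustrative.

```r
library(stringi)

x <- c("a", "\u0105", "\u20ac", "\U0001F600")

# UTF-8: number of 8-bit bytes per character
utf8_bytes <- stri_numbytes(x)

# UTF-16: number of 16-bit code units (two bytes each, no BOM in UTF-16BE)
utf16_raw   <- stri_encode(x, from = "UTF-8", to = "UTF-16BE", to_raw = TRUE)
utf16_units <- lengths(utf16_raw) / 2

# UTF-32: one integer per code point
utf32_units <- lengths(stri_enc_toutf32(x))

data.frame(utf8_bytes, utf16_units, utf32_units)
```

Only the last character, which lies outside the Basic Multilingual Plane, needs two UTF-16 code units (a surrogate pair).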
In most cases, Unicode is a superset of the characters supported by any given code page.
Conversion -- ICU User Guide
Converters -- ICU User Guide
UTF-8, UTF-16, UTF-32 & BOM -- ICU FAQ

stri_conv, stri_encode; stri_enc_fromutf32; stri_enc_toascii; stri_enc_tonative; stri_enc_toutf32; stri_enc_toutf8
Other encoding_detection: stri_enc_detect2; stri_enc_detect; stri_enc_isascii; stri_enc_isutf16be, stri_enc_isutf16le, stri_enc_isutf32be, stri_enc_isutf32le; stri_enc_isutf8
Other encoding_management: stri_enc_get, stri_enc_set; stri_enc_info; stri_enc_list; stri_enc_mark
Other stringi_general_topics: stringi-arguments; stringi-locale; stringi-search-boundaries; stringi-search-charclass; stringi-search-coll; stringi-search-fixed; stringi-search-regex; stringi-search; stringi, stringi-package