stri_enc_isutf8: Check If a Data Stream Is Possibly in UTF-8
Description
The function checks whether each given sequence of bytes forms
a proper UTF-8 string.
Usage
stri_enc_isutf8(str)
Arguments
str
character vector, a raw vector, or a list of
raw vectors
Value
Returns a logical vector. Its i-th element indicates
whether the i-th string corresponds to a valid UTF-8 byte
sequence.
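As a minimal sketch of the three accepted input types, the snippet below calls the function on a character vector, a single raw vector, and a list of raw vectors. The variable names (`res_chr`, `res_raw_single`, `res_raw_list`) and the sample byte values are illustrative choices, not part of the original documentation.

```r
library(stringi)

# Character vector: one logical value per string.
# "\u" escapes yield UTF-8 strings in R.
res_chr <- stri_enc_isutf8(c("abc", "za\u017c\u00f3\u0142\u0107"))

# A single raw vector is treated as one byte sequence.
res_raw_single <- stri_enc_isutf8(as.raw(c(0xc4, 0x85)))

# List of raw vectors: one logical value per list element.
res_raw_list <- stri_enc_isutf8(list(
  as.raw(c(0x61, 0x62, 0x63)),  # "abc", pure ASCII
  as.raw(c(0xc4, 0x85)),        # complete two-byte sequence
  as.raw(0xc4),                 # lead byte with no continuation byte
  as.raw(c(0xff, 0xfe))         # 0xff never occurs in UTF-8
))
```

The first two list elements should be reported as valid and the last two as invalid.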
Details
A negative answer means that a string is certainly not valid
UTF-8. A positive answer does not guarantee that the data are
actually in UTF-8. For example, the bytes (c4,85) properly represent
("Polish a with ogonek") in UTF-8 as well as ("A umlaut",
"Ellipsis") in WINDOWS-1250. Also note that UTF-8, like
most 8-bit encodings, has ASCII as a subset (so
stri_enc_isascii implies
stri_enc_isutf8).
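The ambiguity of the bytes (c4,85) can be illustrated with a short sketch. It assumes that stri_encode is available to decode the same raw bytes under two different source encodings and that the encoding name "windows-1250" is recognized by the ICU build in use. The variable names are illustrative.

```r
library(stringi)

bytes <- as.raw(c(0xc4, 0x85))

# The byte pair passes the UTF-8 validity check ...
is_utf8 <- stri_enc_isutf8(bytes)

# ... yet it decodes to different text depending on the assumed encoding.
as_utf8   <- stri_encode(bytes, from = "UTF-8", to = "UTF-8")         # a with ogonek
as_cp1250 <- stri_encode(bytes, from = "windows-1250", to = "UTF-8")  # A umlaut, ellipsis

# ASCII-only strings are valid in both checks.
ascii_only  <- c("hello", "R 4.x")
ascii_check <- stri_enc_isascii(ascii_only)
utf8_check  <- stri_enc_isutf8(ascii_only)
```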
However, the longer the sequence, the greater the
probability that a positive result really indicates UTF-8,
because not all sequences of bytes are valid UTF-8.
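To see why longer inputs are more telling, consider a sketch in which a Polish phrase is converted to WINDOWS-1250 and then checked. The sample phrase and variable names are illustrative, and the use of stri_encode with `to_raw = TRUE` to obtain the bytes is an assumption about how one might prepare such data.

```r
library(stringi)

# A phrase with several non-ASCII letters, written with escapes
# so that the source file itself stays ASCII.
text <- "za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"

# Re-encode to WINDOWS-1250; to_raw = TRUE returns a list of raw vectors.
cp1250_bytes <- stri_encode(text, from = "UTF-8", to = "windows-1250",
                            to_raw = TRUE)

# The original text is valid UTF-8, but its WINDOWS-1250 bytes are not:
# the first non-ASCII byte (0xbf) is a continuation byte with no lead byte.
utf8_ok   <- stri_enc_isutf8(text)
cp1250_ok <- stri_enc_isutf8(cp1250_bytes)
```

With more non-ASCII bytes in the input, it becomes increasingly unlikely that text in another 8-bit encoding happens to form valid UTF-8 by accident.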
This function is independent of the way R marks encodings
in character strings (see Encoding and
stringi-encoding).
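A small sketch of this independence: the same two bytes are given different encoding marks through base R's `Encoding<-`, and the result of the check is expected to stay the same because only the underlying bytes are inspected. The helper `check_with_mark` is hypothetical and introduced only for this illustration.

```r
library(stringi)

# Build a string directly from bytes.
x <- rawToChar(as.raw(c(0xc4, 0x85)))

# Hypothetical helper: apply an encoding mark, then run the check.
check_with_mark <- function(mark) {
  y <- x
  Encoding(y) <- mark
  stri_enc_isutf8(y)
}

# Same bytes, three different marks.
res_marks <- vapply(c("unknown", "latin1", "UTF-8"),
                    check_with_mark, logical(1))
```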