[DRAFT API] Detect Locale-Sensitive Character Encoding
This function tries to detect the character encoding of given text in the case where the language of the text is known.
[THIS IS AN EXPERIMENTAL FUNCTION]
stri_enc_detect2(str, locale = NULL)
str: a character vector, a raw vector, or a list of raw vectors.
locale: NULL or "" for the default locale, NA for checking only the UTF-* family, or a single string with a locale identifier.
First, the text is checked whether it is valid
UTF-32BE, UTF-32LE, UTF-16BE, UTF-16LE, UTF-8
(this is loosely based on ICU's approach, but we do it in our own way) or ASCII.
If locale is not NA and the above checks fail,
the text is converted to all possible 8-bit encodings
that fully cover the indicated language, and each candidate is checked
for the number of occurrences of language-specific code points
(data provided by the ICU library).
The encoding with the greatest total number of such occurrences is selected.
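The scoring idea described above can be sketched in base R. This is an illustration only, not the actual implementation: the candidate encodings, the set of "language-specific" letters, and the Polish sample are assumptions made for the example.

```r
# Sketch: score candidate 8-bit encodings for Polish text by counting
# language-specific letters after conversion to UTF-8.
# The bytes below encode "za\u017c\u00f3\u0142\u0107" ("zazolc" with
# diacritics) in ISO-8859-2.
bytes <- as.raw(c(0x7a, 0x61, 0xbf, 0xf3, 0xb3, 0xe6))
candidates <- c("ISO-8859-1", "ISO-8859-2", "CP1250")
# A small, hypothetical set of Polish-specific letters:
polish_specific <- c("\u0105", "\u0107", "\u0119", "\u0142", "\u0144",
                     "\u00f3", "\u015b", "\u017c", "\u017a")
scores <- sapply(candidates, function(enc) {
  # Interpret the raw bytes as `enc`; iconv() yields NA if invalid.
  txt <- iconv(rawToChar(bytes), from = enc, to = "UTF-8")
  if (is.na(txt)) return(0L)
  sum(strsplit(txt, "")[[1]] %in% polish_specific)
})
names(which.max(scores))  # the encoding with the most hits wins
```

Here ISO-8859-2 decodes all four accented letters and scores highest, while ISO-8859-1 happens to produce only one Polish-specific letter; ties between compatible encodings (e.g. ISO-8859-2 vs. CP1250) are possible with such short input, which is one reason the real heuristic needs a few hundred bytes.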
The guess is of course imprecise [This is DRAFT API - it still does not always work as expected], as it is obtained using statistics and heuristics. Because of this, detection works best if you supply at least a few hundred bytes of character data that is in a single language.
If you have no initial guess as to the language and encoding, try
stri_enc_detect (which uses ICU facilities).
However, it turns out that, empirically, this function
works better than the ICU-based one if UTF-* text
is provided. Try it yourself.
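For instance, a call might look as follows (assuming the stringi package is available; the sample bytes and the "pl_PL" locale identifier are illustrative choices, encoding a short Polish string in ISO-8859-2):

```r
library(stringi)

# Guess the encoding of Polish text supplied as raw bytes:
raw_pl <- as.raw(c(0x7a, 0x61, 0xbf, 0xf3, 0xb3, 0xe6))
res <- stri_enc_detect2(raw_pl, locale = "pl_PL")

res[[1]]$Encoding    # guessed encodings, ordered best first
res[[1]]$Confidence  # corresponding confidences in [0, 1]
```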
This function returns a list of length equal to the length of str.
Each list element is itself a list with the following three named components:
Encoding -- string; guessed encodings; NA on failure (iff Confidence is NA);
Language -- always NA;
Confidence -- numeric in [0,1]; the higher the value, the more confidence there is in the match; NA on failure.
The guesses are ordered by decreasing confidence.