stri_enc_detect(str, filter_angle_brackets = FALSE)
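str may be a character vector, a raw vector, or a list of raw vectors.
The function returns a list with one element per input string.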
Each list element is a list with the following three named vectors representing all guesses:
Encoding -- string; guessed encodings; NA on failure,
Language -- string; guessed languages; NA if the language could not be determined (e.g. in case of UTF-8),
Confidence -- numeric in [0,1]; the higher the value, the more confidence there is in the match; NA on failure.
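As an illustration of the return structure described above, the following minimal sketch inspects the guesses for one input. The sample sentences and the variable names (`sample1`, `inputs`, `guesses`, `top`) are made up for this example, and the encodings actually reported depend on the ICU heuristics, so no particular output is assumed.

```r
library(stringi)

# Hypothetical sample inputs: two raw vectors wrapped in a list
sample1 <- paste(rep("This is a plain English sentence used only as sample input. ", 5),
                 collapse = "")
sample2 <- paste(rep("Another short sample, repeated to give the detector more bytes. ", 5),
                 collapse = "")
inputs <- list(charToRaw(sample1), charToRaw(sample2))

res <- stri_enc_detect(inputs)

# One result per input; each holds parallel Encoding/Language/Confidence vectors
guesses <- as.data.frame(res[[1]], stringsAsFactors = FALSE)
print(guesses)

# Pick the guess with the highest confidence; which.max() ignores NA values
top <- which.max(guesses$Confidence)
if (length(top) == 1L) {
  cat("Best guess:", guesses$Encoding[top],
      "with confidence", guesses$Confidence[top], "\n")
}
```

Selecting by `which.max()` rather than by position avoids relying on any particular ordering of the guesses.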
Vectorized over str and filter_angle_brackets.
This is, at best, an imprecise operation based on statistics and heuristics. Detection therefore works best when you supply at least a few hundred bytes of character data that is mostly in a single language. However, because the detection examines only a limited amount of the input byte data, some of the returned charsets may fail to handle all of the input data. Note that in some cases, the language can be determined along with the encoding.
Several different techniques are used for character set detection. For multi-byte encodings, the sequence of bytes is checked for legal patterns. The detected characters are also checked against a list of frequently used characters in that encoding. For single-byte encodings, the data is checked against a list of the most commonly occurring three-letter groups for each language that can be written using that encoding.
The detection process can be configured to optionally ignore HTML or XML style markup (using ICU's internal facilities), which can interfere with the detection process by changing the statistics.
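A minimal sketch of the markup filtering, assuming the `filter_angle_brackets` argument from the usage line is the switch that controls it. The HTML snippet and the helper `top_guess()` are hypothetical and exist only for this example; whether the two calls actually disagree depends on the input, so no result is claimed.

```r
library(stringi)

# Hypothetical input: short German text wrapped in HTML markup
body <- paste(rep("Dies ist ein kurzer deutscher Beispieltext, der nur zur Veranschaulichung dient. ", 5),
              collapse = "")
html <- paste0("<html><body><p class=\"intro\">", body, "</p></body></html>")

# Detect once with the markup left in, once with text inside <...> removed first
with_markup    <- stri_enc_detect(html, filter_angle_brackets = FALSE)[[1]]
without_markup <- stri_enc_detect(html, filter_angle_brackets = TRUE)[[1]]

# Hypothetical helper: return the most confident guess as a one-row data frame
top_guess <- function(g) {
  g <- as.data.frame(g, stringsAsFactors = FALSE)
  i <- which.max(g$Confidence)
  if (length(i) == 0L) g[1, , drop = FALSE] else g[i, , drop = FALSE]
}

print(top_guess(with_markup))
print(top_guess(without_markup))
```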
This function should most often be used for byte-marked input strings, especially after loading them from text files and before the main conversion with stri_encode.
The input encoding is of course not taken into account here, even if marked.
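The detect-then-convert workflow mentioned above could look roughly like the sketch below. It is only an illustration: the temporary file, the Polish sample text, and the choice of windows-1250 for writing it are assumptions made to have some non-UTF-8 bytes to work with, and the detector is free to report a different (related) encoding.

```r
library(stringi)

# Hypothetical test data: Polish text written to a temp file as windows-1250 bytes
txt <- paste(rep("Za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144. ", 20),
             collapse = "")
bytes <- stri_encode(txt, from = "UTF-8", to = "windows-1250", to_raw = TRUE)[[1]]
tmp <- tempfile(fileext = ".txt")
writeBin(bytes, tmp)

# Load the file as raw bytes, without assuming any encoding
raw_in <- readBin(tmp, "raw", n = file.info(tmp)$size)

# Guess the encoding and take the most confident candidate
guess <- stri_enc_detect(raw_in)[[1]]
top <- which.max(guess$Confidence)

# Convert to UTF-8 only if the detector produced a usable guess
converted <- NA_character_
if (length(top) == 1L && !is.na(guess$Encoding[top])) {
  converted <- stri_encode(raw_in, from = guess$Encoding[top], to = "UTF-8")
}
print(guess$Encoding[top])
unlink(tmp)
```

Because the guess is heuristic, it is worth inspecting `converted` (or the lower-ranked guesses) before trusting the result.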
The following table shows all the encodings that can be detected:
Character_Set | Languages |
UTF-8 | -- |
UTF-16BE | -- |
UTF-16LE | -- |
UTF-32BE | -- |
UTF-32LE | -- |
Shift_JIS | Japanese |
ISO-2022-JP | Japanese |
ISO-2022-CN | Simplified Chinese |
ISO-2022-KR | Korean |
GB18030 | Chinese |
Big5 | Traditional Chinese |
EUC-JP | Japanese |
EUC-KR | Korean |
ISO-8859-1 | Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish |
ISO-8859-2 | Czech, Hungarian, Polish, Romanian |
ISO-8859-5 | Russian |
ISO-8859-6 | Arabic |
ISO-8859-7 | Greek |
ISO-8859-8 | Hebrew |
ISO-8859-9 | Turkish |
windows-1250 | Czech, Hungarian, Polish, Romanian |
windows-1251 | Russian |
windows-1252 | Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish |
windows-1253 | Greek |
windows-1254 | Turkish |
windows-1255 | Hebrew |
windows-1256 | Arabic |
KOI8-R | Russian |
IBM420 | Arabic |
IBM424 | Hebrew |
If you have an initial guess about the language and encoding, try stri_enc_detect2.
See also: stri_enc_detect2, stri_enc_isascii, stri_enc_isutf16be, stri_enc_isutf8, stringi-encoding
## Not run:
# Read up to 100000 bytes of test.txt and guess their encoding
f <- rawToChar(readBin("test.txt", "raw", 100000))
stri_enc_detect(f)
## End(Not run)