quanteda (version 0.9.2-0)

encoding: detect the encoding of texts

Description

Detects the encoding of texts in a character vector, corpus, or corpusSource-class object and reports the most likely encoding. This is useful for identifying the encoding of input texts, so that the source encoding can be specified when (re)constructing a corpus using corpus.

Usage

encoding(x, verbose = TRUE, ...)

## S3 method for class 'character'
encoding(x, verbose = TRUE, ...)

## S3 method for class 'corpus'
encoding(x, verbose = TRUE, ...)

## S3 method for class 'corpusSource'
encoding(x, verbose = TRUE, ...)

Arguments

x
character vector, corpus, or corpusSource object whose texts' encodings will be detected.
verbose
if FALSE, do not print the diagnostic report
...
additional arguments passed to stri_enc_detect

Details

Based on stri_enc_detect, which is in turn based on the ICU libraries. See the ICU User Guide, http://userguide.icu-project.org/conversion/detection.
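
To illustrate the underlying detection step, the following minimal sketch calls stri_enc_detect from the stringi package directly. The sample sentence and the variable names are assumptions made for illustration and are not part of quanteda; the detector's guesses depend on the ICU version and on the amount of text supplied, so no particular result is claimed here.

```r
library(stringi)

# Illustrative sample text (an assumption, not package data): a short Russian
# sentence written with \u escapes so that this script itself stays ASCII.
txt_utf8 <- paste(
  "\u042d\u0442\u043e \u043f\u0440\u043e\u0441\u0442\u043e\u0439",
  "\u0440\u0443\u0441\u0441\u043a\u0438\u0439 \u0442\u0435\u043a\u0441\u0442",
  "\u0434\u043b\u044f \u043f\u0440\u043e\u0432\u0435\u0440\u043a\u0438",
  "\u043a\u043e\u0434\u0438\u0440\u043e\u0432\u043a\u0438."
)

# Re-encode the text to Windows-1251 and keep the result as raw bytes,
# mimicking a file that was read without a declared encoding.
txt_cp1251 <- stri_encode(txt_utf8, from = "UTF-8", to = "windows-1251",
                          to_raw = TRUE)

# stri_enc_detect returns a list with one data frame per input text;
# each data frame lists candidate encodings with a confidence score.
guess <- stri_enc_detect(txt_cp1251)
print(guess[[1]])
```

Short texts give the detector little evidence to work with, so the confidence values are best read as a ranking of candidates rather than as a definitive answer.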

Examples

encoding(encodedTexts)
# show detected value for each text, versus known encoding
data.frame(labelled = names(encodedTexts), detected = encoding(encodedTexts)$all)

encoding(ukimmigTexts)
encoding(inaugCorpus)
encoding(ie2010Corpus)

# Russian text, Windows-1251
mytextfile <- textfile("http://www.kenbenoit.net/files/01_er_5.txt", cache = FALSE)
encoding(mytextfile)
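
Once a source encoding such as Windows-1251 has been identified, the texts can be converted to UTF-8 before a corpus is built. The sketch below shows one way to do this with stri_encode from the stringi package; it is an assumption-laden illustration rather than quanteda's own conversion mechanism, and the sample bytes stand in for text read from a file.

```r
library(stringi)

# Stand-in for file contents (an assumption for illustration): the word
# "Rossiya" in Cyrillic, stored as Windows-1251 bytes.
word_utf8 <- "\u0420\u043e\u0441\u0441\u0438\u044f"
raw_cp1251 <- stri_encode(word_utf8, from = "UTF-8", to = "windows-1251",
                          to_raw = TRUE)

# Convert the bytes from the identified source encoding to UTF-8.
txt <- stri_encode(raw_cp1251, from = "windows-1251", to = "UTF-8")

# Check that the converted text is valid UTF-8 and matches the original.
stri_enc_isutf8(txt)
identical(txt, word_utf8)
```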