check.encoding: Check character encoding in corpus folder

Description

Using non-ASCII characters is never trivial, but sometimes unavoidable: most of the world's languages use non-Latin alphabets or add diacritics to the standard Latin script. The default character encoding in stylo is UTF-8, and deviating from it can cause problems. This function allows users to check the character encoding of the files in a corpus. A summary is printed to the terminal, and a detailed list reporting the most probable encoding of each text file in the folder can be written to a csv file. The function is essentially a wrapper around the function guess_encoding() from the 'readr' package by Wickham et al. (2017). To change the encoding to UTF-8, try the change.encoding() function.

Usage

check.encoding(corpus.dir = "corpus/", output.file = NULL)

Arguments

corpus.dir

path to the folder containing the corpus.

output.file

path to a csv file that reports the most probable encoding for each text file in the corpus.

Value

The function prints a summary message to the terminal and, if output.file is specified, writes the detailed results to a csv file.

Details

If no arguments are passed, the function checks the text files in the default subdirectory 'corpus' of the current working directory.
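
Since the function is described as a wrapper around guess_encoding() from 'readr', the underlying idea can be illustrated with a minimal sketch. This is not the actual stylo implementation: the helper name guess.folder.encodings() and the layout of the returned data frame are assumptions made for illustration only.

```r
library(readr)

# Hypothetical helper (not part of stylo): guess the encoding of every
# file in a folder and keep the most confident candidate for each file.
guess.folder.encodings <- function(corpus.dir = "corpus/") {
  files <- list.files(corpus.dir, full.names = TRUE)
  results <- lapply(files, function(f) {
    # guess_encoding() returns a table with columns 'encoding' and 'confidence'
    guess <- guess_encoding(f)
    if (nrow(guess) == 0) {
      # no candidate passed the confidence threshold
      return(data.frame(file = basename(f), encoding = NA_character_,
                        confidence = NA_real_))
    }
    best <- which.max(guess$confidence)
    data.frame(file = basename(f), encoding = guess$encoding[best],
               confidence = guess$confidence[best])
  })
  # one row per file; NULL if the folder contains no files
  do.call(rbind, results)
}
```

Calling guess.folder.encodings("corpus/") would then return one row per file, and the result could be saved with write.csv(), loosely mirroring what the output.file argument is documented to do. Encoding detection is a statistical guess, so short or ASCII-only files may be reported with low confidence or as plain ASCII.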

References

Wickham, H., Hester, J., Francois, R., Jylanki, J., and Jørgensen, M. (2017). Package: 'readr'. <https://cran.r-project.org/web/packages/readr/readr.pdf>.

Examples

# NOT RUN {
# standard usage from stylo working directory with a 'corpus' subfolder:
check.encoding()

# specifying another folder:
check.encoding("~/corpora/example1/")

# specifying an output file:
check.encoding(output.file = "~/experiments/charencoding/example1.csv")

# }
