enc v0.2.0

Portable Tools for 'UTF-8' Character Data

Implements an S3 class for storing 'UTF-8' strings, based on regular character vectors. Also contains routines to portably read and write 'UTF-8' encoded text files, to convert all strings in an object to 'UTF-8', and to create character vectors with various encodings.

Readme

enc

Portable tools for UTF-8 character data

R and character encoding

The character encoding of a text determines how its letters, digits, and other codepoints (the atomic components of a text) are translated into a sequence of bytes. A byte sequence may translate into valid text in one character encoding, but give nonsense in others.
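To make this concrete, here is a minimal sketch in base R (it does not use the enc package). It assumes the letter "ä" as sample text, written as the escape `\u00e4` so that the script itself stays ASCII, and shows how the same character maps to different bytes in UTF-8 and latin1, and what happens when UTF-8 bytes are misread as latin1.

```r
# The letter "ä" (U+00E4), written with an escape so the script stays ASCII.
x <- "\u00e4"

# In UTF-8, this single character is stored as two bytes.
utf8_bytes <- charToRaw(x)
utf8_bytes
#> [1] c3 a4

# In latin1, the same character is stored as a single byte.
latin1_bytes <- charToRaw(iconv(x, from = "UTF-8", to = "latin1"))
latin1_bytes
#> [1] e4

# Misreading the UTF-8 bytes as latin1 yields two unrelated characters.
misread <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")
nchar(misread, type = "chars")
#> [1] 2
```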

For historical reasons, R can store strings in different ways:

  1. in the "native" encoding, the default encoding of the operating system
  2. in UTF-8, the most prevalent and versatile encoding nowadays
  3. in "latin1", a popular encoding in Western Europe
  4. as "bytes", leaving the interpretation to the user

On OS X and Linux, the "native" encoding is often UTF-8, but on Windows it is not. To add to the confusion, the encoding is a property of individual strings in a character vector, and not of the entire vector.
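The per-string nature of the encoding can be observed with base R alone. The following sketch assumes a small sample vector and uses only `Encoding()`, `iconv()` and `enc2utf8()`; it is an illustration, not part of the enc package.

```r
# Three elements of one character vector, each marked differently.
x <- c(
  "abc",                                           # pure ASCII
  "\u00e4",                                        # marked as UTF-8
  iconv("\u00e4", from = "UTF-8", to = "latin1")   # marked as latin1
)

# Encoding() reports the mark of each element; ASCII strings are never marked.
Encoding(x)
#> [1] "unknown" "UTF-8"   "latin1"

# enc2utf8() converts every element, after which no latin1 mark remains.
y <- enc2utf8(x)
Encoding(y)
#> [1] "unknown" "UTF-8"   "UTF-8"

# R compares the text, not the bytes, so the UTF-8 and latin1 copies are equal.
x[2] == x[3]
#> [1] TRUE
```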

Why UTF-8?

When working with text, it is advisable to use UTF-8, because it can encode virtually any text, including text in languages whose symbols cannot be represented in your system's native encoding. The UTF-8 encoding possesses several nice technical properties, and is by far the predominant encoding on the Web. Standardization on a "universal" encoding facilitates data exchange.

Because of R's special handling of strings, some care must be taken to make sure that you're actually using the UTF-8 encoding. Many functions in R will hide encoding issues from you, and transparently convert to UTF-8 as necessary. However, some functions (such as reading and writing files) will stubbornly prefer the native encoding.
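For file input and output, one way to sidestep the native encoding in base R is sketched below. This is an assumption-laden illustration rather than the implementation used by this package: the temporary file is hypothetical, the connection is opened in binary mode so that the LF terminator is written unchanged, and `useBytes = TRUE` asks `writeLines()` to emit the UTF-8 bytes as they are.

```r
# Hypothetical temporary file, used only for this illustration.
path <- tempfile(fileext = ".txt")
text <- c("a", "\u00e4")

# Write the UTF-8 bytes unchanged instead of re-encoding to the native encoding.
con <- file(path, open = "wb")
writeLines(enc2utf8(text), con, sep = "\n", useBytes = TRUE)
close(con)

# Read the lines back and declare them as UTF-8.
back <- readLines(path, encoding = "UTF-8")
Encoding(back)
#> [1] "unknown" "UTF-8"

identical(back, text)
#> [1] TRUE

unlink(path)
```

Without `encoding = "UTF-8"`, the lines read back would be left unmarked and interpreted in the native encoding, which is only correct on systems where that encoding happens to be UTF-8.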

The enc package provides helpers for converting all textual components of an object to UTF-8, and for reading and writing files in UTF-8 (with an LF end-of-line terminator by default). It also defines an S3 class for tagging all-UTF-8 character vectors and ensuring that updates maintain the UTF-8 encoding. Other packages use UTF-8 by default as well.

Example

```r
library(enc)
utf8(c("a", "ä"))
#> [1] "a" "ä"
as_utf8(1)
#> [1] "1"

a <- utf8("ä")
a[2] <- "ö"
class(a)
#> [1] "utf8"

data.frame(abc = letters[1:3], utf8 = utf8(letters[1:3]))
#>   abc utf8
#> 1   a    a
#> 2   b    b
#> 3   c    c
```

Install the package from GitHub:

```r
# install.packages("devtools")
devtools::install_github("krlmlr/enc")
```

Functions in enc

| Name | Description |
|------|-------------|
| write_lines_enc | Writes to a text file |
| to_encoding | Deep conversion to an encoding |
| utf8 | A simple class for storing UTF-8 strings |
| native_eol | The native end-of-line identifier on the current platform |
| encoding | Encoding information |
| transform_lines_enc | Transform a text file |
| read_lines_enc | Reads from a text file |
| enc-package | enc: Portable Tools for 'UTF-8' Character Data |

Details

Date 2018-02-24
License GPL-3
Encoding UTF-8
LazyData true
BugReports https://github.com/krlmlr/enc/issues
URL https://github.com/krlmlr/enc
RoxygenNote 6.0.1.9000
NeedsCompilation yes
Packaged 2018-03-03 00:03:50 UTC; muelleki
Repository CRAN
Date/Publication 2018-03-03 14:03:13 UTC
