utf8 v1.1.2

Unicode Text Processing

Processing and printing 'UTF-8' encoded international text (Unicode). Functions to input, validate, normalize, encode, format, and display.

utf8

utf8 is an R package for manipulating and printing UTF-8 text that fixes multiple bugs in R's UTF-8 handling.

Installation

Stable version

utf8 is available on CRAN. To install the latest released version, run the following command in R:

install.packages("utf8")

Development version

To install the latest development version, run the following:

tmp <- tempfile()
system2("git", c("clone", "--recursive", shQuote("https://github.com/patperry/r-utf8.git"), shQuote(tmp)))
devtools::install(tmp)

Note that utf8 uses a git submodule, so you cannot use devtools::install_github.

Usage

Validate character data and convert to UTF-8

Use as_utf8 to validate input text and convert to UTF-8 encoding. The function alerts you if the input text has the wrong declared encoding:

# second entry is encoded in latin-1, but declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")
as_utf8(x) # fails
#> Error in as_utf8(x): entry 2 has wrong Encoding; marked as "UTF-8" but leading byte 0xE7 followed by invalid continuation byte (0x69) at position 4

# mark the correct encoding
Encoding(x[2]) <- "latin1"
as_utf8(x) # succeeds
#> [1] "façile" "façile" "façile"
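When the input comes from a source you do not control, it can help to locate the offending entries before converting. The sketch below assumes that utf8_valid, the package's companion to as_utf8, returns a logical vector marking which entries can be translated to valid UTF-8. It also assumes, purely for illustration, that every invalid entry is really Latin-1; the helper name fix_latin1 is hypothetical and not part of the package.

```r
library(utf8)

# Same input as above: the second entry is Latin-1 but declared as UTF-8.
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")

# Hypothetical helper: re-declare invalid entries as Latin-1, then convert.
# Assumption: every entry that fails validation really is Latin-1.
fix_latin1 <- function(x) {
  invalid <- !utf8_valid(x)
  Encoding(x[invalid]) <- "latin1"
  as_utf8(x)
}

# Indices of the entries that as_utf8 would reject.
bad <- which(!utf8_valid(x))

# Converted vector; all entries should now be valid UTF-8.
y <- fix_latin1(x)
```

In real data the Latin-1 assumption may not hold, so inspecting the entries listed in bad before re-declaring their encoding is the safer route.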

Normalize data

Use utf8_normalize to convert text to Unicode composed normal form (NFC). Optionally, apply compatibility maps for NFKC normal form, or apply case folding.
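Normalization matters most when comparing or deduplicating strings, because text that looks identical on screen can differ byte for byte. The following is a minimal sketch on made-up sample data, using only utf8_normalize and its map_case argument:

```r
library(utf8)

# Made-up sample: the same word written with a precomposed "e acute" (U+00E9),
# with "e" followed by a combining acute accent (U+0301), and in upper case.
words <- c("caf\u00e9", "cafe\u0301", "CAF\u00c9")

# Without normalization, all three spellings count as distinct strings.
n_raw <- length(unique(words))

# NFC normalization merges the two lower-case spellings.
nfc <- utf8_normalize(words)
n_nfc <- length(unique(nfc))

# Case folding on top of normalization merges all three.
key <- utf8_normalize(words, map_case = TRUE)
n_key <- length(unique(key))
```

Keeping the normalized form as a separate comparison key, rather than overwriting the original text, preserves the input as the user typed it.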

```r
# three ways to encode an angstrom character
(angstrom <- c("\u00c5", "\u0041\u030a", "\u212b"))
#> [1] "Å" "Å" "Å"
utf8_normalize(angstrom) == "\u00c5"
#> [1] TRUE TRUE TRUE

# perform full Unicode case-folding
utf8_normalize("Größe", map_case = TRUE)
#> [1] "grösse"

# apply compatibility maps to NFKC normal form
# (example from https://twitter.com/aprilarcus/status/367557195186970624)
utf8_normalize("
```

Functions in utf8

| Name | Description |
| --- | --- |
| as_utf8 | UTF-8 Character Encoding |
| output_utf8 | Output Capabilities |
| utf8_format | UTF-8 Text Formatting |
| utf8_print | Print UTF-8 Text |
| utf8_encode | Encode Character Object as for UTF-8 Printing |
| utf8-package | The utf8 Package |
| utf8_normalize | Text Normalization |
| utf8_width | Measure the Character String Width |

Vignettes of utf8

Name
utf8.Rmd