epitrix (version 0.4.0)

clean_labels: Standardise labels

Description

This function standardises labels, e.g. those used as variable names or character string values, by removing non-ASCII characters, replacing diacritics (e.g. é, ô) with their closest ASCII equivalents, and standardising separating characters. See Details for more information on label transformation.

Usage

clean_labels(
  x,
  sep = "_",
  transformation = "Any-Latin; Latin-ASCII",
  protect = ""
)

Arguments

x

A vector of labels, normally provided as characters.

sep

A character string used as separator, defaulting to '_'.

transformation

A string passed on to stringi::stri_trans_general() for conversion. The default, "Any-Latin; Latin-ASCII", first converts any non-Latin characters to Latin and then converts all accented characters to ASCII characters. See stringi::stri_trans_list() for a full list of options.
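As a rough illustration of what the transformation string controls, the sketch below calls stringi::stri_trans_general() directly, outside of clean_labels(). The inputs are made up for this example, and the non-ASCII characters are written as Unicode escapes so the snippet does not depend on the file encoding.

```r
library(stringi)

# "\u00eb" is e with diaeresis, so x is "steven" with an accented first "e"
x <- "st\u00ebven"

# "Latin-ASCII" alone only strips accents from text already in Latin script
ascii_only <- stri_trans_general(x, "Latin-ASCII")

# "Any-Latin" first romanises other scripts (here Greek alpha, beta, gamma),
# then "Latin-ASCII" reduces the result to plain ASCII
romanised <- stri_trans_general("\u03b1\u03b2\u03b3", "Any-Latin; Latin-ASCII")

# Peek at some of the transform identifiers available in the local ICU build
head(stri_trans_list())
```

Which identifiers are available depends on the ICU library that stringi was built against, so the list may differ between machines.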

protect

A character string defining the punctuation that should be protected. This helps prevent meaningful symbols like > and < from being removed.
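One way to picture protection is as an exclusion from the set of characters that get replaced by the separator. The sketch below is an assumption-laden illustration in base R, not the package's actual implementation: sketch_protect() is a hypothetical helper, it skips transliteration and separator trimming, and it assumes that protect contains no characters that are special inside a regular-expression character class (such as ], ^, - or \).

```r
# Hypothetical helper, for illustration only (not part of epitrix)
sketch_protect <- function(x, protect = "><", sep = "_") {
  x <- tolower(x)
  # Match runs of anything that is neither alphanumeric nor protected
  pattern <- paste0("[^a-z0-9", protect, "]+")
  gsub(pattern, sep, x, perl = TRUE)
}

protected <- sketch_protect(c("energy > 9000", "energy < 9000"))
protected
```

With an empty protect string, the comparison symbols would be treated like any other punctuation and replaced by the separator.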

Author

Thibaut Jombart thibautjombart@gmail.com, Zhian N. Kamvar

Details

The following changes are performed:

  • all non-ASCII characters are removed

  • all diacritics are replaced with their unaccented equivalents, e.g. 'é', 'ê' and 'è' become 'e'.

  • all characters are set to lower case

  • separators are standardised to the use of a single character provided in sep (defaults to '_'); leading and trailing separators are removed.
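The changes listed above can be approximated in a few lines with stringi. This is a minimal sketch under stated assumptions, not the code used by clean_labels(): sketch_clean_labels() is a hypothetical name, it ignores the protect argument, and it does not treat NA values specially.

```r
# Hypothetical helper, for illustration only (not part of epitrix)
sketch_clean_labels <- function(x, sep = "_") {
  # Replace diacritics and transliterate to ASCII where possible
  x <- stringi::stri_trans_general(x, "Any-Latin; Latin-ASCII")
  # Set everything to lower case
  x <- stringi::stri_trans_tolower(x)
  # Split on runs of non-alphanumeric characters; dropping empty tokens
  # also discards leading and trailing separators and any leftover
  # non-ASCII characters
  tokens <- stringi::stri_split_regex(x, "[^a-z0-9]+", omit_empty = TRUE)
  # Rejoin the tokens with a single separator
  vapply(tokens, paste, character(1), collapse = sep)
}

# "\u00eb" and "\u00ea" are accented forms of "e"
cleaned <- sketch_clean_labels("P\u00ebt\u00ear and St\u00ebven  _-")
cleaned
```

Splitting and rejoining avoids having to escape sep in a regular expression, which matters for separators such as "." that are regex metacharacters.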

Examples

if (FALSE) {
clean_labels("-_-This is; A    WeÏrD**./sêntënce...")
clean_labels("-_-This is; A    WeÏrD**./sêntënce...", sep = ".")
input <- c("Peter and stëven",
           "peter-and.stëven",
           "pëtêr and stëven  _-")
input
clean_labels(input)

# Don't transliterate non-latin words
clean_labels(input, transformation = "Latin-ASCII")

# protect useful symbols
clean_labels(c("energy > 9000", "energy < 9000"), protect = "><")

# if you only want to clean accents, transform to lower, and transliterate,
# you can specify "[:punct:][:space:]" for protect:
clean_labels(input, protect = "[:punct:][:space:]")

# appropriately transliterate German umlauts
if (stringi::stri_info()$ICU.system) {
  # This will only be true if you have the correct version of ICU installed

  clean_labels("'é', 'ê' and 'è' become 'e', 'ö' becomes 'oe', etc.", 
               transformation = "Any-Latin; de-ASCII; Latin-ASCII")
}
}