clean: Clean column data to a class

Description

Use any of these functions to quickly clean columns in your data set. Use clean() to have a cleaning function picked automatically: the one that returns the smallest proportion of NAs. The clean_*() functions always return the class in their function name (e.g. clean_Date() always returns class Date).

Usage

clean(x)
# S3 method for data.frame
clean(x)
clean_logical(x, true = regex_true(), false = regex_false(),
  na = NULL, fixed = FALSE, ignore.case = TRUE)
clean_factor(x, levels = unique(x), ordered = FALSE,
  droplevels = FALSE, fixed = FALSE, ignore.case = TRUE)
clean_numeric(x, remove = "[^0-9.,]", fixed = FALSE)
clean_character(x, remove = "[^a-z \t\r\n]", fixed = FALSE,
  ignore.case = TRUE, trim = TRUE)
clean_currency(x, currency_symbol = NULL, ...)
clean_Date(x, format = NULL, ...)
clean_POSIXct(x, remove = "[^.0-9 :/-]", fixed = FALSE, ...)

Arguments

x

data to clean

true

regex to interpret values as TRUE (which defaults to regex_true), see Details

false

regex to interpret values as FALSE (which defaults to regex_false), see Details

na

regex to force values to be interpreted as NA, i.e. not as TRUE or FALSE

fixed

logical to indicate whether regular expressions should be turned off

ignore.case

logical to indicate whether matching should be case-insensitive

levels

new factor levels, may be named with regular expressions to match existing values, see Details

ordered

logical to indicate whether the factor levels should be ordered

droplevels

logical to indicate whether non-existing factor levels should be dropped

remove

regex to define the character(s) that should be removed, see Details

trim

logical to indicate whether the result should be trimmed with trimws

currency_symbol

the currency symbol to use, which will be guessed based on the input and otherwise defaults to the current system locale setting (see Sys.localeconv)

...

other parameters passed on to as.Date or as.POSIXct

format

a date format that will be passed on to format_datetime, see Details

Value

The clean functions always return the class from the function name:

clean_logical(): class logical
clean_factor(): class factor
clean_numeric(): class numeric
clean_character(): class character
clean_currency(): class currency
clean_Date(): class Date
clean_POSIXct(): classes POSIXct/POSIXt

Details

Using clean() on a vector will guess a cleaning function based on the number of NAs each candidate function would return. Using clean() on a data.frame applies this guessed cleaning to every column.
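
The NA-based guess described above can be pictured with a small base R sketch. The helper guess_cleaner() and the candidate list are hypothetical and only illustrate the idea; the package's own selection logic may differ.

```r
# Hypothetical helper, not part of the package: pick the candidate
# conversion that yields the smallest share of NAs.
guess_cleaner <- function(x, candidates) {
  na_share <- vapply(
    candidates,
    function(f) mean(is.na(suppressWarnings(f(x)))),
    numeric(1)
  )
  names(candidates)[which.min(na_share)]
}

# Illustrative candidates built from base R coercion functions only
candidates <- list(
  numeric = function(x) as.numeric(x),
  Date    = function(x) as.Date(x, format = "%Y-%m-%d"),
  logical = function(x) as.logical(x)
)

guess_cleaner(c("2013-04-02", "2013-04-03"), candidates)  # "Date"
guess_cleaner(c("1.5", "2", "n/a"), candidates)           # "numeric"
```

Comparing the share of NAs rather than the raw count keeps the comparison meaningful when the same helper is applied to columns of different lengths.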

Info about the different functions:

clean_logical(): Use parameters true and false to match values using case-insensitive regular expressions (regex). Unmatched values are considered NA. By default, values are matched with regex_true and regex_false, which supports the values "Yes" and "No" in the following languages: Arabic, Bengali, Chinese (Mandarin), Dutch, English, French, German, Hindi, Indonesian, Japanese, Malay, Portuguese, Russian, Spanish, Telugu, Turkish and Urdu. Use parameter na to force values to NA that would otherwise be matched by true or false. See Examples.
clean_factor(): Use parameter levels to set new factor levels. They can be case-insensitive regular expressions to match existing values of x. For matching, the new values for levels are temporarily sorted internally by descending text length. See Examples.
clean_numeric() and clean_character(): Use parameter remove to match values that must be removed from the input, using regular expressions (regex). In the case of clean_numeric(), commas will be read as dots and only the last dot will be kept. By default, clean_character() keeps the spaces between words. See Examples.
clean_currency(): This function works like clean_numeric(), but transforms the result with as.currency to the new class currency. The currency symbol is guessed based on the most traded currencies by value (see Source): the United States dollar, Euro, Japanese yen, Pound sterling, Swiss franc, Renminbi, Swedish krona, Mexican peso, South Korean won, Turkish lira, Russian ruble, Indian rupee and the South African rand. See Examples.
clean_Date(): Use parameter format to define a date format, or leave it empty to have the format guessed. Use "Excel" to read values as Microsoft Excel dates. The format parameter will be evaluated with format_datetime, which means that a format like "d-mmm-yy" will be translated internally to "%e-%b-%y" for convenience. See Examples.
clean_POSIXct(): Use parameter remove to match values that must be removed from the input, using regular expressions (regex). The resulting string will be coerced to a date/time element with class POSIXct, using as.POSIXct(). See Examples.
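
The comma and dot handling described for clean_numeric() can be sketched in base R as follows. The function to_number() is a hypothetical stand-in written for illustration; it is not the package's implementation and only mirrors the documented behaviour (remove unwanted characters, read commas as dots, keep only the last dot).

```r
# Hypothetical stand-in for the documented behaviour of clean_numeric()
to_number <- function(x, remove = "[^0-9.,]") {
  x <- gsub(remove, "", x)              # drop everything matched by 'remove'
  x <- gsub(",", ".", x, fixed = TRUE)  # read commas as dots
  # delete every dot that is still followed by another dot,
  # so that only the last dot remains as the decimal separator
  x <- gsub("\\.(?=.*\\.)", "", x, perl = TRUE)
  as.numeric(x)
}

to_number("Positive (0.143)")  # 0.143
to_number("0,143")             # 0.143
to_number("1.234,56")          # 1234.56
```

Keeping only the last dot means that thousands separators and decimal separators can be mixed in the input, at the cost of misreading values such as "1.234" when the dot was meant as a thousands separator.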

Unlike base R, none of the above functions throws an error when given an invalid regular expression. The expression is interpreted as a fixed value instead, and a warning is thrown.
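
One way to picture this fallback is a wrapper around gsub() that retries with fixed = TRUE when the pattern does not compile. The name safe_gsub() is hypothetical and the sketch is an assumption about how such a fallback could look, not the package's actual code.

```r
# Hypothetical helper illustrating the fallback: on an invalid regex,
# warn and treat the pattern as a fixed string instead of failing.
safe_gsub <- function(pattern, replacement, x) {
  tryCatch(
    gsub(pattern, replacement, x),
    error = function(e) {
      warning("invalid regular expression '", pattern,
              "', interpreting it as a fixed value", call. = FALSE)
      gsub(pattern, replacement, x, fixed = TRUE)
    }
  )
}

# "(" is not a valid regular expression on its own
suppressWarnings(safe_gsub("(", "", "Positive (0.143)"))  # "Positive 0.143)"
```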

Examples

# NOT RUN {
clean_logical(c("Yes", "No"))   # English
clean_logical(c("Oui", "Non"))  # French
clean_logical(c("ya", "tidak")) # Indonesian
clean_logical(x = c("Positive", "Negative", "Unknown", "Some value"),
              true = "pos", false = "neg")

gender_age <- c("male 0-50", "male 50+", "female 0-50", "female 50+")
clean_factor(gender_age, c("M", "F"))
clean_factor(gender_age, c("Male", "Female"))
clean_factor(gender_age, c("0-50", "50+"), ordered = TRUE)

clean_Date("13jul18", "ddmmmyy")
clean_Date("12 august 2010")
clean_Date("12 06 2012")
clean_Date(36526) # Excel date
clean_Date("43658")
clean_Date("14526", "Excel") # "1939-10-08"

clean_POSIXct("Created log on 2019/04/11 11:23 by user Joe")

clean_numeric("qwerty123456")
clean_numeric("Positive (0.143)")
clean_numeric("0,143")

clean_character("qwerty123456")
clean_character("Positive (0.143)")

clean_currency(c("Received $ 25", "Received $ 31.40"))
clean_currency(c("Jack sent £ 25", "Bill sent £ 31.40"))
 
clean("12 06 2012")
clean(data.frame(dates = "2013-04-02", 
                 logicals = c("yes", "no")))
# }
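
The Excel dates in the examples above can be cross-checked with base R alone. The sketch below assumes the common convention that Excel serial numbers (1900 date system, dates after February 1900) correspond to days since 1899-12-30; excel_to_date() is a hypothetical helper, and clean_Date() may handle more cases than this.

```r
# Hypothetical helper: convert an Excel serial number to class Date,
# assuming the 1900 date system with an effective origin of 1899-12-30.
excel_to_date <- function(serial) {
  as.Date(as.numeric(serial), origin = "1899-12-30")
}

excel_to_date("14526")  # 1939-10-08
excel_to_date(36526)    # 2000-01-01
```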
