messy

When teaching with examples in R, instructors often use clean datasets, but these aren't very realistic and aren't what students will later encounter in the real world. Real datasets have typos, missing values encoded in strange ways, and stray whitespace. The {messy} R package takes a clean dataset and randomly adds these problems, giving students the opportunity to practice their data cleaning and wrangling skills without you having to change all of your examples.

Installation

Install from CRAN using:

install.packages("messy")

Install the development version from GitHub using:

remotes::install_github("nrennie/messy")

Usage

For more in-depth usage instructions, see the package documentation at nrennie.rbind.io/messy, which has examples of each function.

The simplest way to use the {messy} package is to apply the messy() function to a data frame:

library(messy)
set.seed(1234)
messy(ToothGrowth[1:10, ])
    len supp dose
1   4.2   VC  0.5
2  11.5 <NA> <NA>
3  7.3    VC  0.5
4   5.8  (VC  0.5
5   6.4   VC <NA>
6    10   VC  0.5
7  11.2 <NA>  0.5
8  11.2   VC  0.5
9  5.2    VC  0.5
10    7   VC 0.5 
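
To make the idea of "randomly adding" problems more concrete, here is a rough base R sketch of how values might be swapped for a missing-value marker. It is not the package's implementation: `sketch_make_missing` and its arguments are hypothetical names chosen for illustration, and the real functions may work quite differently.

```r
# Hypothetical helper, NOT part of {messy}: each value in the chosen
# columns is independently replaced by `missing` with probability `messiness`.
sketch_make_missing <- function(data, cols, messiness = 0.1, missing = NA) {
  for (col in cols) {
    # One uniform draw per row; TRUE marks a value to overwrite.
    hit <- stats::runif(nrow(data)) < messiness
    data[[col]][hit] <- missing
  }
  data
}

set.seed(1234)
clean <- ToothGrowth[1:10, ]
messy_sketch <- sketch_make_missing(clean, cols = c("len", "dose"), messiness = 0.3)

# Count how many values were replaced in each column.
colSums(is.na(messy_sketch))
```

Because the replacement is random, setting a seed first keeps the messy version reproducible between teaching sessions.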

You can vary the amount of messiness for each function (for example, through the messiness argument), and chain together different functions to create customised messy data:

set.seed(1234)
ToothGrowth[1:10,] |> 
  make_missing(cols = "supp", missing = " ") |> 
  make_missing(cols = c("len", "dose"), missing = c(NA, 999)) |> 
  add_whitespace(cols = "supp", messiness = 0.5) |> 
  add_special_chars(cols = "supp")
    len supp dose
1   4.2   VC  0.5
2  11.5  VC    NA
3   7.3   VC  0.5
4   5.8 *VC   0.5
5   6.4  VC   0.5
6  10.0   VC  0.5
7  11.2       0.5
8  11.2  V#C   NA
9   5.2  !VC  0.5
10  7.0 VC*   0.5
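
On the student side, cleaning data like this might look something like the base R sketch below. It uses a small hand-typed data frame as a stand-in for messy output, so it runs without {messy} installed. The names `messy_df` and `clean_df` are made up for illustration. It assumes 999 and blank strings are the only missing-value markers, and that valid supp values contain only letters.

```r
# Hand-typed stand-in for messy output (illustrative values only).
messy_df <- data.frame(
  len  = c(4.2, 11.5, 7.3, 999, 6.4),
  supp = c("VC", " VC ", "*VC", " ", "V#C"),
  dose = c(0.5, NA, 0.5, 0.5, 999),
  stringsAsFactors = FALSE
)

clean_df <- messy_df

# 1. Drop every non-letter character, which removes both the special
#    characters and the stray whitespace in one pass.
clean_df$supp <- gsub("[^A-Za-z]", "", clean_df$supp)

# 2. Values that are now empty were blank placeholders, so mark them missing.
clean_df$supp[clean_df$supp == ""] <- NA

# 3. Recode the 999 sentinel as NA in the numeric columns.
num_cols <- c("len", "dose")
clean_df[num_cols] <- lapply(
  clean_df[num_cols],
  function(x) replace(x, x %in% 999, NA)
)

clean_df
```

Stripping all non-letters is a blunt choice that suits a column like supp. Columns where punctuation or spaces are meaningful would need a narrower pattern.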

Version: 0.1.0
License: CC BY 4.0
Maintainer: Nicola Rennie
Last Published: December 3rd, 2024
Monthly Downloads: 167

Functions in messy (0.1.0)

messy_datetime_tzones: Change the timezone of datetime columns
split_datetimes: Splits date(time) column(s) into multiple columns
duplicate_rows: Duplicate rows and insert them into the dataframe in order or at random
messy: Messy
messy_colnames: Make column names messy
change_case: Change case
add_special_chars: Add special characters to strings
add_whitespace: Add whitespaces
messy_datetime_formats: Make date(time) formats inconsistent
make_missing: Make missing