janitor
janitor has simple little functions for examining and cleaning dirty data. An intermediate R user can already do all of this, but with janitor you can save your brainpower for the fun stuff.
janitor is not yet on CRAN. Install the development version from GitHub:
# install.packages("devtools")
devtools::install_github("sfirke/janitor")
Using janitor
Below are quick examples of how janitor functions are commonly used. A full description of each function can be found in janitor's catalog of functions.
Janitor is a #tidyverse-oriented package. Specifically, it plays nicely with the %>%
pipe and is optimized for cleaning data brought in with the readr and readxl packages.
Cleaning dirty data
Take this roster of teachers at a fictional American high school, stored in the Microsoft Excel file dirty_data.xlsx:
Dirtiness includes:
- Dreadful column names
- Rows and columns containing Excel formatting but no data
- Dates stored as numbers
- Values spread inconsistently over the "Certification" columns
Here's that data after being read in to R:
library(pacman) # for loading packages
p_load(readxl, janitor, dplyr)
roster_raw <- read_excel("dirty_data.xlsx") # available at http://github.com/sfirke/janitor
glimpse(roster_raw)
#> Observations: 17
#> Variables: 12
#> $ First Name <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-Shiung", "Chien-Shiung", NA,...
#> $ Last Name <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu", "Wu", NA, "Joyce", "Lamar...
#> $ Employee Status <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Administration", "Teacher", "Teach...
#> $ Subject <chr> "PE", "Drafting", "Music", NA, "Dean", "Physics", "Chemistry", NA, "English", "...
#> $ Hire Date <dbl> 39690, 39690, 37118, 27515, 41431, 11037, 11037, NA, 32994, 27919, 42221, 34700...
#> $ % Allocated <dbl> 0.75, 0.25, 1.00, 1.00, 1.00, 0.50, 0.50, NA, 0.50, 0.50, NA, NA, 0.80, NA, NA,...
#> $ Full time? <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA, "No", "No", "No", "No", "N...
#> $ do not edit! ---> <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
#> $ Certification <chr> "Physical ed", "Physical ed", "Instr. music", "PENDING", "PENDING", "Science 6-...
#> $ Certification <chr> "Theater", "Theater", "Vocal music", "Computers", NA, "Physics", "Physics", NA,...
#> $ Certification <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
#> $ <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
Excel formatting led to an untitled empty column and 5 empty rows at the bottom of the table (only 12 records have any actual data). Bad column names are preserved.
Clean it with janitor functions:
roster <- roster_raw %>%
clean_names() %>%
remove_empty_rows() %>%
remove_empty_cols() %>%
convert_to_NA(c("TBD", "PENDING")) %>%
mutate(hire_date = excel_numeric_to_date(hire_date),
main_cert = use_first_valid_of(certification, certification_2)) %>%
select(-certification, -certification_2) # drop unwanted columns
roster
#> # A tibble: 12 × 8
#> first_name last_name employee_status subject hire_date percent_allocated full_time main_cert
#> <chr> <chr> <chr> <chr> <date> <dbl> <chr> <chr>
#> 1 Jason Bourne Teacher PE 2008-08-30 0.75 Yes Physical ed
#> 2 Jason Bourne Teacher Drafting 2008-08-30 0.25 Yes Physical ed
#> 3 Alicia Keys Teacher Music 2001-08-15 1.00 Yes Instr. music
#> 4 Ada Lovelace Teacher <NA> 1975-05-01 1.00 Yes Computers
#> 5 Desus Nice Administration Dean 2013-06-06 1.00 Yes <NA>
#> 6 Chien-Shiung Wu Teacher Physics 1930-03-20 0.50 Yes Science 6-12
#> 7 Chien-Shiung Wu Teacher Chemistry 1930-03-20 0.50 Yes Science 6-12
#> 8 James Joyce Teacher English 1990-05-01 0.50 No English 6-12
#> 9 Hedy Lamarr Teacher Science 1976-06-08 0.50 No <NA>
#> 10 Carlos Boozer Coach Basketball 2015-08-05 NA No Physical ed
#> 11 Young Boozer Coach <NA> 1995-01-01 NA No Political sci.
#> 12 Micheal Larsen Teacher English 2009-09-15 0.80 No Vocal music
The core janitor cleaning function is clean_names()
- call it whenever you load data into R.
Examining dirty data
Finding duplicates
Use get_dupes()
to identify and examine duplicate records during data cleaning. Let's see if any teachers are listed more than once:
roster %>% get_dupes(first_name, last_name)
#> # A tibble: 4 × 9
#> first_name last_name dupe_count employee_status subject hire_date percent_allocated full_time
#> <chr> <chr> <int> <chr> <chr> <date> <dbl> <chr>
#> 1 Chien-Shiung Wu 2 Teacher Physics 1930-03-20 0.50 Yes
#> 2 Chien-Shiung Wu 2 Teacher Chemistry 1930-03-20 0.50 Yes
#> 3 Jason Bourne 2 Teacher PE 2008-08-30 0.75 Yes
#> 4 Jason Bourne 2 Teacher Drafting 2008-08-30 0.25 Yes
#> # ... with 1 more variables: main_cert <chr>
Yes, some teachers appear twice. We ought to address this before counting employees.
Tabulating tools
janitor has several functions for quick counts. The big ones are tabyl()
and crosstab()
.
Notably, they can be called two ways:
- On vectors - e.g.,
tabyl(roster$subject)
- On a piped-in data.frame:
roster %>% tabyl(subject)
.- This allows for dplyr commands earlier in the pipeline
tabyl()
Like table()
, but pipe-able and more functional.
roster %>%
tabyl(subject)
#> subject n percent valid_percent
#> 1 Basketball 1 0.08333333 0.1
#> 2 Chemistry 1 0.08333333 0.1
#> 3 Dean 1 0.08333333 0.1
#> 4 Drafting 1 0.08333333 0.1
#> 5 English 2 0.16666667 0.2
#> 6 Music 1 0.08333333 0.1
#> 7 PE 1 0.08333333 0.1
#> 8 Physics 1 0.08333333 0.1
#> 9 Science 1 0.08333333 0.1
#> 10 <NA> 2 0.16666667 NA
crosstab()
roster %>%
filter(hire_date > as.Date("1950-01-01")) %>%
crosstab(employee_status, full_time)
#> employee_status No Yes
#> 1 Administration 0 1
#> 2 Coach 2 0
#> 3 Teacher 3 4
Prettifying
Other janitor functions dress up the results of these tabulation calls for fast, basic reporting. Here are some of the functions that augment a summary table for reporting:
roster %>%
tabyl(employee_status, sort = TRUE) %>%
add_totals_row()
#> employee_status n percent
#> 1 Teacher 9 0.75000000
#> 2 Coach 2 0.16666667
#> 3 Administration 1 0.08333333
#> 4 Total 12 1.00000000
roster %>%
crosstab(full_time, employee_status) %>%
adorn_crosstab(denom = "col", show_totals = TRUE)
#> full_time Administration Coach Teacher Total
#> 1 No 0.0% (0) 100.0% (2) 33.3% (3) 41.7% (5)
#> 2 Yes 100.0% (1) 0.0% (0) 66.7% (6) 58.3% (7)
Together, these tabulation functions reduce R's deficit against Excel and SPSS when it comes to quick, informative counts.
Contact me
You are welcome to:
- submit suggestions and bug-reports: https://github.com/sfirke/janitor/issues
- send a pull request: https://github.com/sfirke/janitor/
- let me know what you think on twitter @samfirke
- compose a friendly e-mail to: