tidytable
Why tidytable?
tidyverse-like syntax withdata.tablespeedrlangcompatibility - See here- Includes functions that
dtplyris missing, including manytidyrfunctions
Note: tidytable functions do not use data.table’s
modify-by-reference, and instead use the copy-on-modify principles
followed by the tidyverse and base R.
Installation
Install the released version from CRAN with:
install.packages("tidytable")Or install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("markfairbanks/tidytable")tidytable functions
tidytable helpers
dt(): Pipeabledata.tablesyntax. See hereget_dummies.()%notin%
dplyr
Core verbs
arrange.()filter.()mutate.()&mutate_across.()- The
_across.()helper is new and can replace the_if.()/_at.()/_all.()helpers See here
- The
select.()summarize.()&summarize_across.()- Group by specifications called inside. See here
Other dplyr functions
bind_cols.()&bind_rows.()case.(): Similar todplyr::case_when(). See?case.for syntaxcount.()distinct.()ifelse.()- Joins:
left_join.(),inner_join.(),right_join.(),full_join.(), &anti_join.()
lags.()&leads.()pull.()relocate.()rename.()&rename_with.()row_number.()slice.():_head.()/_tail.()/_max.()/_min.()- The
slice_*()helpers are liketop_n.(), but are a bit easier to use
- The
top_n.()transmute.()
tidyr
drop_na.()fill.(): Works on character/factor/logical types (data.table::nafill()does not)group_split.()- Nesting:
nest_by.()&unnest.() pivot_longer.()&pivot_wider.()replace_na.()separate.()unite.()
purrr
map.(),map2.(),map_*.()variants, &map2_*.()variants
General syntax
tidytable uses verb.() syntax to replicate tidyverse functions:
library(tidytable)
test_df <- data.table(x = c(1,2,3), y = c(4,5,6), z = c("a","a","b"))
test_df %>%
select.(x, y, z) %>%
filter.(x < 4, y > 1) %>%
arrange.(x, y) %>%
mutate.(double_x = x * 2,
double_y = y * 2)
#> x y z double_x double_y
#> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1: 1 4 a 2 8
#> 2: 2 5 a 4 10
#> 3: 3 6 b 6 12Using “group by”
Group by calls are done from inside any function that has group by
functionality (such as summarize.() & mutate.())
- A single column can be passed with
by = z - Multiple columns can be passed with
by = c(y, z) tidyselectcan also be used, including using predicates:- Single predicate:
by = where(is.character) - Multiple predicates:
by = c(where(is.character), where(is.factor)) - A combination of predicates and column names:
by = c(where(is.character), y)
- Single predicate:
test_df %>%
summarize.(avg_x = mean(x),
count = n.(),
by = z)
#> z avg_x count
#> <chr> <dbl> <int>
#> 1: a 1.5 2
#> 2: b 3.0 1tidyselect support
tidytable allows you to select/drop columns just like you would in the
tidyverse.
Normal selection can be mixed with:
- Predicates:
where(is.numeric),where(is.character), etc. - Select helpers:
everything(),starts_with(),ends_with(),contains(),any_of(), etc.
test_df <- data.table(a = c(1,2,3),
b = c(4,5,6),
c = c("a","a","b"),
d = c("a","b","c"))
test_df %>%
select.(where(is.numeric), d)
#> a b d
#> <dbl> <dbl> <chr>
#> 1: 1 4 a
#> 2: 2 5 b
#> 3: 3 6 cYou can also use this format to drop columns:
test_df %>%
select.(-where(is.numeric))
#> c d
#> <chr> <chr>
#> 1: a a
#> 2: a b
#> 3: b cThese same ideas can be used whenever selecting columns in tidytable
functions - for example when using count.(), drop_na.(),
pivot_longer.(), pivot_wider.(), etc.
New helper: mutate_across.()
tidyselect allows the user to replace mutate_if.(), mutate_at.(),
and mutate_all.() with one helper - mutate_across.().
Using _across.() instead of _if.():
test_df <- data.table(a = c(1,1,1),
b = c(1,1,1),
c = c("a","a","b"),
d = c("a","b","c"))
test_df %>%
mutate_across.(where(is.numeric), as.character)
#> a b c d
#> <chr> <chr> <chr> <chr>
#> 1: 1 1 a a
#> 2: 1 1 a b
#> 3: 1 1 b cUsing _across.() instead of _at.():
test_df %>%
mutate_across.(c(a, b), ~ .x + 1)
#> a b c d
#> <dbl> <dbl> <chr> <chr>
#> 1: 2 2 a a
#> 2: 2 2 a b
#> 3: 2 2 b cUsing _across.() instead of _all.():
test_df %>%
mutate_across.(everything(), as.factor)
#> a b c d
#> <fct> <fct> <fct> <fct>
#> 1: 1 1 a a
#> 2: 1 1 a b
#> 3: 1 1 b crlang compatibility
rlang can be used to write custom functions with tidytable
functions.
Custom function with mutate.()
df <- data.table(x = c(1,1,1), y = c(1,1,1), z = c("a","a","b"))
# Using enquo() with !!
add_one <- function(data, add_col) {
add_col <- enquo(add_col)
data %>%
mutate.(new_col = !!add_col + 1)
}
# Using the {{ }} shortcut
add_one <- function(data, add_col) {
data %>%
mutate.(new_col = {{add_col}} + 1)
}
df %>%
add_one(x)
#> x y z new_col
#> <dbl> <dbl> <chr> <dbl>
#> 1: 1 1 a 2
#> 2: 1 1 a 2
#> 3: 1 1 b 2Custom function with summarize.()
df <- data.table(x = 1:10, y = c(rep("a", 6), rep("b", 4)), z = c(rep("a", 6), rep("b", 4)))
find_mean <- function(data, grouping_cols, col) {
data %>%
summarize.(avg = mean({{col}}),
by = {{grouping_cols}})
}
df %>%
find_mean(grouping_cols = c(y, z), col = x)
#> y z avg
#> <chr> <chr> <dbl>
#> 1: a a 3.5
#> 2: b b 8.5Auto-conversion
All tidytable functions automatically convert data.frame and
tibble inputs to a data.table:
library(dplyr)
library(data.table)
test_df <- tibble(x = c(1,2,3), y = c(4,5,6), z = c("a","a","b"))
test_df %>%
mutate.(double_x = x * 2) %>%
is.data.table()
#> [1] TRUEdt() helper
The dt() function makes regular data.table syntax pipeable, so you
can easily mix tidytable syntax with data.table syntax:
df <- data.table(x = c(1,2,3), y = c(4,5,6), z = c("a", "a", "b"))
df %>%
dt(, list(x, y, z)) %>%
dt(x < 4 & y > 1) %>%
dt(order(x, y)) %>%
dt(, ':='(double_x = x * 2,
double_y = y * 2)) %>%
dt(, list(avg_x = mean(x)), by = z)
#> z avg_x
#> <chr> <dbl>
#> 1: a 1.5
#> 2: b 3.0Speed Comparisons
Below are some speed comparisons of various functions. More functions will get added to the speed comps over time.
A few notes:
- Comparing times from separate functions won’t be very useful. For
example - the
summarize()tests were performed on a different dataset fromcase_when(). setDTthreads(4)was used fordata.table&tidytabletimings.- Modify-by-reference was used in
data.tablewhen being compared tomutate.()&dplyr::mutate() fill.()&tidyr::fill()both work with character/factor/logical columns, whereasdata.table::nafill()does not. Testing only included numeric columns due to this constraint.- Currently
data.tabledoesn’t have its owncase_when()translation, so a multiple nestedfifelse()was used. - All tests can be found in the source code of the README.
pandascomparisons are in the process of being added - more will be added soon.- Lastly I’d like to mention that these tests were not rigorously created to cover all angles equally. They are just meant to be used as general insight into the performance of these packages.
all_marks
#> # A tibble: 13 x 6
#> function_tested data.table tidytable tidyverse pandas tidytable_vs_tidyverse
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 arrange 58.81ms 58.07ms 422.61ms 355ms 13.7%
#> 2 case_when 69.45ms 69.37ms 448.81ms 59.2ms 15.5%
#> 3 distinct 45.59ms 39.74ms 99.58ms 309ms 39.9%
#> 4 fill 49.86ms 50.49ms 126.32ms 846ms 40.0%
#> 5 filter 244.35ms 232.87ms 308.63ms 707ms 75.5%
#> 6 inner_join 99.18ms 97.69ms 82.47ms <NA> 118.5%
#> 7 left_join 64.18ms 70.53ms 83.69ms <NA> 84.3%
#> 8 mutate 65.67ms 81ms 64.7ms 86.4ms 125.2%
#> 9 nest 16.45ms 17.14ms 33.35ms <NA> 51.4%
#> 10 pivot_longer 13.06ms 14.43ms 50.31ms <NA> 28.7%
#> 11 pivot_wider 113.57ms 137.96ms 72.29ms <NA> 190.8%
#> 12 summarize 287.99ms 265.42ms 599.98ms 834ms 44.2%
#> 13 unnest 26.98ms 21.97ms 938.07ms <NA> 2.3%