tidytable
Why tidytable
?
tidyverse
-like syntax withdata.table
speedrlang
compatibility - See here- Includes functions that
dtplyr
is missing, including manytidyr
functions
Note: tidytable
functions do not use data.table
’s
modify-by-reference, and instead use the copy-on-modify principles
followed by the tidyverse
and base R.
Installation
Install the released version from CRAN with:
install.packages("tidytable")
Or install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("markfairbanks/tidytable")
tidytable functions
tidytable helpers
dt()
: Pipeabledata.table
syntax. See hereget_dummies.()
%notin%
dplyr
Core verbs
arrange.()
filter.()
mutate.()
&mutate_across.()
- The
_across.()
helper is new and can replace the_if.()
/_at.()
/_all.()
helpers See here
- The
select.()
summarize.()
&summarize_across.()
- Group by specifications called inside. See here
Other dplyr functions
bind_cols.()
&bind_rows.()
case.()
: Similar todplyr::case_when()
. See?case.
for syntaxcount.()
distinct.()
ifelse.()
- Joins:
left_join.()
,inner_join.()
,right_join.()
,full_join.()
, &anti_join.()
lags.()
&leads.()
pull.()
relocate.()
rename.()
&rename_with.()
row_number.()
slice.()
:_head.()
/_tail.()
/_max.()
/_min.()
- The
slice_*()
helpers are liketop_n.()
, but are a bit easier to use
- The
top_n.()
transmute.()
tidyr
drop_na.()
fill.()
: Works on character/factor/logical types (data.table::nafill()
does not)group_split.()
- Nesting:
nest_by.()
&unnest.()
pivot_longer.()
&pivot_wider.()
replace_na.()
separate.()
unite.()
purrr
map.()
,map2.()
,map_*.()
variants, &map2_*.()
variants
General syntax
tidytable
uses verb.()
syntax to replicate tidyverse
functions:
library(tidytable)
test_df <- data.table(x = c(1,2,3), y = c(4,5,6), z = c("a","a","b"))
test_df %>%
select.(x, y, z) %>%
filter.(x < 4, y > 1) %>%
arrange.(x, y) %>%
mutate.(double_x = x * 2,
double_y = y * 2)
#> x y z double_x double_y
#> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1: 1 4 a 2 8
#> 2: 2 5 a 4 10
#> 3: 3 6 b 6 12
Using “group by”
Group by calls are done from inside any function that has group by
functionality (such as summarize.()
& mutate.()
)
- A single column can be passed with
by = z
- Multiple columns can be passed with
by = c(y, z)
tidyselect
can also be used, including using predicates:- Single predicate:
by = where(is.character)
- Multiple predicates:
by = c(where(is.character), where(is.factor))
- A combination of predicates and column names:
by = c(where(is.character), y)
- Single predicate:
test_df %>%
summarize.(avg_x = mean(x),
count = n.(),
by = z)
#> z avg_x count
#> <chr> <dbl> <int>
#> 1: a 1.5 2
#> 2: b 3.0 1
tidyselect
support
tidytable
allows you to select/drop columns just like you would in the
tidyverse.
Normal selection can be mixed with:
- Predicates:
where(is.numeric)
,where(is.character)
, etc. - Select helpers:
everything()
,starts_with()
,ends_with()
,contains()
,any_of()
, etc.
test_df <- data.table(a = c(1,2,3),
b = c(4,5,6),
c = c("a","a","b"),
d = c("a","b","c"))
test_df %>%
select.(where(is.numeric), d)
#> a b d
#> <dbl> <dbl> <chr>
#> 1: 1 4 a
#> 2: 2 5 b
#> 3: 3 6 c
You can also use this format to drop columns:
test_df %>%
select.(-where(is.numeric))
#> c d
#> <chr> <chr>
#> 1: a a
#> 2: a b
#> 3: b c
These same ideas can be used whenever selecting columns in tidytable
functions - for example when using count.()
, drop_na.()
,
pivot_longer.()
, pivot_wider.()
, etc.
New helper: mutate_across.()
tidyselect
allows the user to replace mutate_if.()
, mutate_at.()
,
and mutate_all.()
with one helper - mutate_across.()
.
Using _across.()
instead of _if.()
:
test_df <- data.table(a = c(1,1,1),
b = c(1,1,1),
c = c("a","a","b"),
d = c("a","b","c"))
test_df %>%
mutate_across.(where(is.numeric), as.character)
#> a b c d
#> <chr> <chr> <chr> <chr>
#> 1: 1 1 a a
#> 2: 1 1 a b
#> 3: 1 1 b c
Using _across.()
instead of _at.()
:
test_df %>%
mutate_across.(c(a, b), ~ .x + 1)
#> a b c d
#> <dbl> <dbl> <chr> <chr>
#> 1: 2 2 a a
#> 2: 2 2 a b
#> 3: 2 2 b c
Using _across.()
instead of _all.()
:
test_df %>%
mutate_across.(everything(), as.factor)
#> a b c d
#> <fct> <fct> <fct> <fct>
#> 1: 1 1 a a
#> 2: 1 1 a b
#> 3: 1 1 b c
rlang
compatibility
rlang
can be used to write custom functions with tidytable
functions.
Custom function with mutate.()
df <- data.table(x = c(1,1,1), y = c(1,1,1), z = c("a","a","b"))
# Using enquo() with !!
add_one <- function(data, add_col) {
add_col <- enquo(add_col)
data %>%
mutate.(new_col = !!add_col + 1)
}
# Using the {{ }} shortcut
add_one <- function(data, add_col) {
data %>%
mutate.(new_col = {{add_col}} + 1)
}
df %>%
add_one(x)
#> x y z new_col
#> <dbl> <dbl> <chr> <dbl>
#> 1: 1 1 a 2
#> 2: 1 1 a 2
#> 3: 1 1 b 2
Custom function with summarize.()
df <- data.table(x = 1:10, y = c(rep("a", 6), rep("b", 4)), z = c(rep("a", 6), rep("b", 4)))
find_mean <- function(data, grouping_cols, col) {
data %>%
summarize.(avg = mean({{col}}),
by = {{grouping_cols}})
}
df %>%
find_mean(grouping_cols = c(y, z), col = x)
#> y z avg
#> <chr> <chr> <dbl>
#> 1: a a 3.5
#> 2: b b 8.5
Auto-conversion
All tidytable
functions automatically convert data.frame
and
tibble
inputs to a data.table
:
library(dplyr)
library(data.table)
test_df <- tibble(x = c(1,2,3), y = c(4,5,6), z = c("a","a","b"))
test_df %>%
mutate.(double_x = x * 2) %>%
is.data.table()
#> [1] TRUE
dt()
helper
The dt()
function makes regular data.table
syntax pipeable, so you
can easily mix tidytable
syntax with data.table
syntax:
df <- data.table(x = c(1,2,3), y = c(4,5,6), z = c("a", "a", "b"))
df %>%
dt(, list(x, y, z)) %>%
dt(x < 4 & y > 1) %>%
dt(order(x, y)) %>%
dt(, ':='(double_x = x * 2,
double_y = y * 2)) %>%
dt(, list(avg_x = mean(x)), by = z)
#> z avg_x
#> <chr> <dbl>
#> 1: a 1.5
#> 2: b 3.0
Speed Comparisons
Below are some speed comparisons of various functions. More functions will get added to the speed comps over time.
A few notes:
- Comparing times from separate functions won’t be very useful. For
example - the
summarize()
tests were performed on a different dataset fromcase_when()
. setDTthreads(4)
was used fordata.table
&tidytable
timings.- Modify-by-reference was used in
data.table
when being compared tomutate.()
&dplyr::mutate()
fill.()
&tidyr::fill()
both work with character/factor/logical columns, whereasdata.table::nafill()
does not. Testing only included numeric columns due to this constraint.- Currently
data.table
doesn’t have its owncase_when()
translation, so a multiple nestedfifelse()
was used. - All tests can be found in the source code of the README.
pandas
comparisons are in the process of being added - more will be added soon.- Lastly I’d like to mention that these tests were not rigorously created to cover all angles equally. They are just meant to be used as general insight into the performance of these packages.
all_marks
#> # A tibble: 13 x 6
#> function_tested data.table tidytable tidyverse pandas tidytable_vs_tidyverse
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 arrange 58.81ms 58.07ms 422.61ms 355ms 13.7%
#> 2 case_when 69.45ms 69.37ms 448.81ms 59.2ms 15.5%
#> 3 distinct 45.59ms 39.74ms 99.58ms 309ms 39.9%
#> 4 fill 49.86ms 50.49ms 126.32ms 846ms 40.0%
#> 5 filter 244.35ms 232.87ms 308.63ms 707ms 75.5%
#> 6 inner_join 99.18ms 97.69ms 82.47ms <NA> 118.5%
#> 7 left_join 64.18ms 70.53ms 83.69ms <NA> 84.3%
#> 8 mutate 65.67ms 81ms 64.7ms 86.4ms 125.2%
#> 9 nest 16.45ms 17.14ms 33.35ms <NA> 51.4%
#> 10 pivot_longer 13.06ms 14.43ms 50.31ms <NA> 28.7%
#> 11 pivot_wider 113.57ms 137.96ms 72.29ms <NA> 190.8%
#> 12 summarize 287.99ms 265.42ms 599.98ms 834ms 44.2%
#> 13 unnest 26.98ms 21.97ms 938.07ms <NA> 2.3%