This is a general purpose complement to the specialised manipulation
functions filter(), select(), mutate(), summarise() and arrange(). You can
use do() to perform arbitrary computation, returning either a data frame or
arbitrary objects which will be stored in a list. This is particularly
useful when working with models: you can fit models per group with do()
and then flexibly extract components with either another do() or
summarise().
do(.data, ...)
do_(.data, ..., .dots)
# S3 method for tbl_sql
do_(.data, ..., .dots, .chunk_size = 10000L)
.data: A tbl.
...: Expressions to apply to each group. If named, results will be stored in a new column. If unnamed, the expression should return a data frame. You can use . to refer to the current group. You cannot mix named and unnamed arguments.
.dots: Used to work around non-standard evaluation. See vignette("nse") for details.
.chunk_size: The size of each chunk to pull into R. If this number is too big, the process will be slow because R has to allocate and free a lot of memory. If it's too small, it will be slow because of the overhead of talking to the database.
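As a rough illustration of how . stands for the current group inside an unnamed expression, the sketch below assumes dplyr is attached and uses the built-in mtcars data. The object per_group and the columns n and mean_mpg are names made up for this example.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

# `.` is the data for one group at a time. The unnamed expression must
# return a data frame, here a one-row summary per group (illustrative
# column names n and mean_mpg).
per_group <- do(by_cyl, data.frame(n = nrow(.), mean_mpg = mean(.$mpg)))

per_group$cyl  # one label per group: 4, 6, 8
per_group$n    # rows in each group: 11, 7, 14
```

Mixing this unnamed expression with a named one in the same call is not allowed, as noted for the ... argument.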
do() always returns a data frame. The first columns in the data frame
will be the labels, the others will be computed from the ... arguments.
Named arguments become list-columns, with one element for each group;
unnamed elements must be data frames and labels will be duplicated
accordingly.
Groups are preserved for a single unnamed input. This is different to
summarise() because do() generally does not reduce the complexity of the
data; it just expresses it in a special way. For multiple named inputs,
the output is grouped by row with rowwise(). This allows other verbs to
work in an intuitive way.
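The two result shapes described above can be checked with a small sketch. It assumes dplyr is attached and uses the built-in mtcars data; first_two and fits are illustrative names, and the class checks reflect the behaviour described in this section rather than a guarantee for every dplyr version.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

# Unnamed argument: each group returns a data frame, the pieces are
# row-bound, and the group label (cyl) is repeated for every returned row.
first_two <- do(by_cyl, head(., 2))

# Named argument: one row per group, with the result in a list-column.
fits <- do(by_cyl, mod = lm(mpg ~ disp, data = .))

nrow(first_two)                   # 3 groups x 2 rows each
inherits(first_two, "grouped_df") # grouping is preserved
nrow(fits)                        # one row per group
is.list(fits$mod)                 # models are stored in a list-column
inherits(fits, "rowwise_df")      # output is grouped by row
```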
If you're familiar with plyr, do() with named arguments is basically
equivalent to dlply(), and do() with a single unnamed argument is
basically equivalent to ldply(). However, instead of storing labels in a
separate attribute, the result is always a data frame. This means that
summarise() applied to the result of do() can act like ldply().
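To make the last point concrete without loading plyr, here is a hedged sketch of the two-step pattern: store one model per group with a named argument, then flatten the list-column into an ordinary data frame with summarise(). It assumes dplyr is attached; fits, slopes and the slope column are names invented for this example.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

# Step 1: one model per group, kept in the list-column `mod`.
fits <- do(by_cyl, mod = lm(mpg ~ disp, data = .))

# Step 2: summarise() works row by row on the result, so each model can be
# reduced to a plain value, giving one flat row per group.
slopes <- summarise(fits, cyl = cyl, slope = coef(mod)[["disp"]])

names(slopes)  # "cyl" "slope"
nrow(slopes)   # one row per group
```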
For an empty data frame, the expressions will be evaluated once, even in the presence of a grouping. This makes sure that the format of the resulting data frame is the same for both empty and non-empty input.
library(dplyr)
# Group by number of cylinders, then keep the first two rows of each group
by_cyl <- group_by(mtcars, cyl)
do(by_cyl, head(., 2))
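# Fit one linear model per group; the named argument becomes a list-column
models <- by_cyl %>% do(mod = lm(mpg ~ disp, data = .))
models
# Extract components from the stored models
summarise(models, rsq = summary(mod)$r.squared)
models %>% do(data.frame(coef = coef(.$mod)))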
models %>% do(data.frame(
  var = names(coef(.$mod)),
  coef(summary(.$mod))
))
# Several named arguments give one list-column per model
models <- by_cyl %>% do(
  mod_linear = lm(mpg ~ disp, data = .),
  mod_quad = lm(mpg ~ poly(disp, 2), data = .)
)
models
compare <- models %>% do(aov = anova(.$mod_linear, .$mod_quad))
# compare %>% summarise(p.value = aov$`Pr(>F)`)
if (require("nycflights13")) {
  # You can use do() for any arbitrary computation, like fitting a linear
  # model. Let's explore how carrier arrival delays vary with departure time
  carriers <- group_by(flights, carrier)
  group_size(carriers)

  mods <- do(carriers, mod = lm(arr_delay ~ dep_time, data = .))
  mods %>% do(as.data.frame(coef(.$mod)))
  mods %>% summarise(rsq = summary(mod)$r.squared)

  # This longer example shows the progress bar in action
  by_dest <- flights %>% group_by(dest) %>% filter(n() > 100)
  library(mgcv)
  by_dest %>% do(smooth = gam(arr_delay ~ s(dep_time) + month, data = .))
}