dplyr::group_by()This works the exact same as dplyr::group_by() and typically
performs around the same speed but uses slightly less memory.
f_group_by(
data,
...,
.add = FALSE,
order = df_group_by_order_default(data),
.by = NULL,
.cols = NULL,
.drop = df_group_by_drop_default(data)
)group_ordered(data)
f_group_by() returns a grouped_df that can be used
for further for grouped calculations.
group_ordered() returns TRUE if the group data are sorted,
i.e if attr(attr(data, "groups"), "ordered") == TRUE. If sorted,
which is usually the default, this leads to summary calculations
like f_summarise() or dplyr::summarise() producing sorted groups.
If FALSE they are returned based on order-of-first appearance in the data.
data frame.
Variables to group by.
Should groups be added to existing groups?
Default is FALSE.
Should groups be ordered? If FALSE
groups will be ordered based on first-appearance.
Typically, setting order to FALSE is faster.
(Optional). A selection of columns to group by for this operation.
Columns are specified using tidyselect.
(Optional) alternative to ... that accepts
a named character vector or numeric vector.
If speed is an expensive resource, it is recommended to use this.
Should unused factor levels be dropped? Default is TRUE.
f_group_by() works almost exactly like the 'dplyr' equivalent.
An attribute "ordered" (TRUE or FALSE) is added to the group data to
signify if the groups are sorted or not.
The distinction between ordered and sorted is somewhat subtle.
Functions in fastplyr that use a sort argument generally refer
to the top-level dataset being sorted in some way, either by sorting
the group columns like in f_expand() or f_distinct(), or
some other columns, like the count column in f_count().
The order argument, when set to TRUE (the default),
is used to mean that the group data will be calculated
using a sort-based algorithm, leading to sorted group data.
When order is FALSE, the group data will be returned based on
the order-of-first appearance of the groups in the data.
This order-of-first appearance may still naturally be sorted
depending on the data.
For example, group_id(1:3, order = T) results in the same group IDs
as group_id(1:3, order = F) because 1, 2, and 3 appear in the data in
ascending sequence whereas group_id(3:1, order = T) does not equal
group_id(3:1, order = F)
Part of the reason for the distinction is that internally fastplyr
can in theory calculate group data
using the sort-based algorithm and still return unsorted groups,
though this combination is only available to the user in limited places like
f_distinct(order = TRUE, sort = FALSE).
The other reason is to prevent confusion in the meaning
of sort and order so that order always refers to the
algorithm specified, resulting in sorted groups, and sort implies a
physical sorting of the returned data. It's also worth mentioning that
in most functions, sort will implicitly utilise the sort-based algorithm
specified via order = TRUE.
In many situations (not all) it can be faster to use the
order-of-first appearance algorithm, specified via order = FALSE.
This can generally be accessed by first calling
f_group_by(data, ..., order = FALSE) and then
performing your calculations.
To utilise this algorithm more globally and package-wide,
set the '.fastplyr.order.groups' option to FALSE using the code:
options(.fastplyr.order.groups = FALSE).