BY is an S3 generic that efficiently applies functions over vectors, or over the columns of matrices and data frames, by groups. Similar to dapply, it seeks to retain the structure and attributes of the data, but can also output to various standard formats. Simple parallelism is also available.
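As a rough illustration of what grouped parallel application amounts to, here is a minimal base R sketch. It does not call BY; it assumes only the parallel package that ships with R and uses the iris data for illustration. The variable names (cores, par_res, seq_res) are hypothetical, and how BY dispatches work internally may differ.

```r
library(parallel)   # ships with base R

v <- iris$Sepal.Length   # numeric vector
f <- iris$Species        # grouping factor

# mclapply() relies on forking, which is not available on Windows,
# so fall back to a single core there (or if the core count is unknown)
n <- detectCores()
cores <- if (is.na(n) || .Platform$OS.type == "windows") 1L else min(2L, n)

# Apply sum() to each group, spreading the groups over the chosen cores
par_res <- unlist(mclapply(split(v, f), sum, mc.cores = cores))

# Sequential version for comparison
seq_res <- unlist(lapply(split(v, f), sum))

print(par_res)
```

With only three small groups the forking overhead outweighs any gain; parallel execution tends to pay off only when FUN is expensive or there are many groups.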
BY(x, ...)
# S3 method for default
BY(x, g, FUN, ..., use.g.names = TRUE, sort = TRUE,
   expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
   return = c("same", "vector", "list"))
# S3 method for matrix
BY(x, g, FUN, ..., use.g.names = TRUE, sort = TRUE,
   expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
   return = c("same", "matrix", "data.frame", "list"))
# S3 method for data.frame
BY(x, g, FUN, ..., use.g.names = TRUE, sort = TRUE,
   expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
   return = c("same", "matrix", "data.frame", "list"))
# S3 method for grouped_df
BY(x, FUN, ..., use.g.names = FALSE, keep.group_vars = TRUE,
   expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
   return = c("same", "matrix", "data.frame", "list"))
x: an atomic vector, matrix, data frame or similar object.
FUN: a function, can be scalar- or vector-valued.
...: further arguments to FUN.
use.g.names: logical. Make group names and add them to the result as names (default method) or row names (matrix and data frame methods). No row names are generated for data.tables.
expand.wide: logical. If FUN is a vector-valued function returning a vector of fixed length > 1 (such as the quantile function), expand.wide can be used to return the result in a wider format (instead of stacking the resulting vectors of fixed length above each other in each output column).
mc.cores: integer. Argument to mclapply indicating the number of cores to use for parallel execution. Can use detectCores() to select all available cores.
return: an integer or string indicating the type of object to return. The default 1 - "same" returns the same object type (i.e. class and other attributes are retained, just the names for the dimensions are adjusted). 2 - "matrix" always returns the output as a matrix, 3 - "data.frame" always returns a data frame, and 4 - "list" returns the raw (uncombined) output. Note: 4 - "list" works together with expand.wide to return a list of matrices.
keep.group_vars: grouped_df method: logical. FALSE removes grouping variables after computation.
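To make the difference between stacked and wide output concrete, the following base R sketch builds both layouts by hand for a vector-valued function (quantile). It does not call BY, and the exact names and dimension labels BY produces may differ; the object names (q, stacked, wide) are hypothetical.

```r
v <- iris$Sepal.Length   # numeric vector
f <- iris$Species        # grouping factor

# quantile() returns a named vector of length 5 for each group
q <- lapply(split(v, f), quantile)

# Stacked layout: the per-group vectors are placed one after another
stacked <- unlist(q)
print(stacked)

# Wide layout: one row per group, one column per element of the result
wide <- do.call(rbind, q)
print(wide)
```

The wide layout is usually easier to read and to index, since each group's results sit in a single row.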
x, with FUN applied to every column split by g.
BY is a frugal re-implementation of the Split-Apply-Combine computing paradigm. It is generally faster than tapply, by, aggregate and plyr, and preserves data attributes just like dapply.
It is, however, principally a wrapper around lapply(split(x, g), FUN, ...) that strongly optimizes on attribute checking compared to base R functions. For more details, see the documentation for dapply, which works very similarly (apart from the splitting performed in BY). For larger tasks requiring split-apply-combine computing on data frames, use dplyr or data.table, or try to work with the Fast Statistical Functions.
BY is used internally in collap for functions that are not Fast Statistical Functions.
dapply, collap, Fast Statistical Functions, Data Transformations, Collapse Overview
# NOT RUN {
library(collapse) # Provides BY, qM and num_vars
v <- iris$Sepal.Length # A numeric vector
f <- iris$Species # A factor. Vectors/lists will internally be converted to factor
## default vector method
BY(v, f, sum) # Sum by species
head(BY(v, f, scale)) # Scale by species (please use fscale instead)
head(BY(v, f, scale, use.g.names = FALSE)) # Omitting auto-generated names
BY(v, f, quantile) # Species quantiles: by default stacked
BY(v, f, quantile, expand.wide = TRUE) # Wide format
## matrix method
m <- qM(num_vars(iris))
BY(m, f, sum) # Also returns a matrix
BY(m, f, sum, return = "data.frame") # Return as data.frame.. also works for computations below
head(BY(m, f, scale))
head(BY(m, f, scale, use.g.names = FALSE))
BY(m, f, quantile)
BY(m, f, quantile, expand.wide = TRUE)
BY(m, f, quantile, expand.wide = TRUE, # Return as list of matrices
return = "list")
## data.frame method
BY(num_vars(iris), f, sum) # Also returns a data.frame
BY(num_vars(iris), f, sum, return = 2) # Return as matrix.. also works for computations below
head(BY(num_vars(iris), f, scale))
head(BY(num_vars(iris), f, scale, use.g.names = FALSE))
BY(num_vars(iris), f, quantile)
BY(num_vars(iris), f, quantile, expand.wide = TRUE)
BY(num_vars(iris), f, quantile, # Return as list of matrices
expand.wide = TRUE, return = "list")
# }
# NOT RUN {
## grouped data frame method (faster than dplyr only for small data)
library(dplyr)
giris <- group_by(iris, Species)
giris |> BY(sum) # Compute sum
giris |> BY(sum, use.g.names = TRUE, # Use row.names and
keep.group_vars = FALSE) # remove 'Species' and groups attribute
giris |> BY(sum, return = "matrix") # Return matrix
giris |> BY(sum, return = "matrix", # Matrix with row.names
use.g.names = TRUE)
giris |> BY(quantile) # Compute quantiles (output is stacked)
giris |> BY(quantile, # Much better, also keeps 'Species'
expand.wide = TRUE)
# }