fast-statistical-functions: Fast (Grouped, Weighted) Statistical Functions for Matrix-Like Objects

Description

With fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, fnth, ffirst, flast, fnobs and fndistinct, collapse provides a coherent set of extremely fast and flexible statistical functions (S3 generics) for column-wise, grouped and weighted computations on vectors, matrices and data frames, with special support for grouped data frames / tibbles (dplyr) and data.tables.

Value

x suitably aggregated or transformed. Data frame column-attributes and overall attributes are generally preserved if the output is of the same data type.

Usage

## All functions (FUN) follow a common syntax in 4 methods:
FUN(x, ...)
## Default S3 method:
FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
    use.g.names = TRUE, [nthreads = 1L,] ...)
## S3 method for class 'matrix'
FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
    use.g.names = TRUE, drop = TRUE, [nthreads = 1L,] ...)
## S3 method for class 'data.frame'
FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
    use.g.names = TRUE, drop = TRUE, [nthreads = 1L,] ...)
## S3 method for class 'grouped_df'
FUN(x, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
    use.g.names = FALSE, keep.group_vars = TRUE,
    [keep.w = TRUE,] [nthreads = 1L,] ...)

Arguments

`x`		a vector, matrix, data frame or grouped data frame (class 'grouped_df').
`g`		a factor, `GRP` object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a `GRP` object) used to group `x`.
`w`		a numeric vector of (non-negative) weights, may contain missing values. Supported by `fsum`, `fprod`, `fmean`, `fmedian`, `fnth`, `fvar`, `fsd` and `fmode`.
`TRA`		an integer or quoted operator indicating the transformation to perform: 0 - "replace_NA" | 1 - "replace_fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See `TRA`.
`na.rm`		logical. Skip missing values in `x`. Defaults to `TRUE` in all functions and implemented at very little computational cost. Not available for `fnobs`.
`use.g.names`		logical. Generate group names and add them to the result as names (default method) or row names (matrix and data frame methods). No row names are generated for data.tables.
`nthreads`		integer. The number of threads to utilize. Supported by `fsum`, `fmean`, `fmedian`, `fmode` and `fndistinct`.
`drop`		matrix and data.frame methods: Logical. `TRUE` drops dimensions and returns an atomic vector if `g = NULL` and `TRA = NULL`.
`keep.group_vars`		grouped_df method: Logical. `FALSE` removes grouping variables after computation. By default grouping variables are added, even if not present in the grouped_df.
`keep.w`		grouped_df method: Logical. `TRUE` (default) also aggregates weights and saves them in a column, `FALSE` removes weighting variable after computation (if contained in `grouped_df`).
`…`		arguments to be passed to or from other methods. If `TRA` is used, passing `set = TRUE` will transform data by reference and return the result invisibly (except for the grouped_df method which always returns visible output).
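
The following is a minimal sketch of how the `TRA`, `use.g.names` and `drop` arguments described above behave, using the built-in `mtcars` data. The variable names (`shares`, `manual`, `as_df`, and so on) are illustrative only, and the comparison against `ave()` is an assumption about what `TRA = "/"` computes with `fsum`, namely division by the group sum.

```r
library(collapse)

x <- mtcars$mpg
g <- mtcars$cyl

# TRA = "/" divides each value by its group's statistic (here the group sum)
shares <- fsum(x, g, TRA = "/")
manual <- x / ave(x, g, FUN = sum)   # base R equivalent, for comparison

# The integer code 5 and the string "/" select the same transformation
shares_int <- fsum(x, g, TRA = 5L)

# use.g.names = TRUE (default) labels the aggregated result with group names
named   <- fsum(x, g)
unnamed <- fsum(x, g, use.g.names = FALSE)

# drop = TRUE (default) returns a named vector when g = NULL and TRA = NULL;
# drop = FALSE keeps a one-row data frame
as_vec <- fsum(mtcars)
as_df  <- fsum(mtcars, drop = FALSE)
```

Because a `TRA` result has the same length as `x`, it can be assigned straight back into a data frame column, whereas the aggregated result has one element per group.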

Related Functionality

Panel-decomposed (i.e. between and within) statistics as well as grouped and weighted skewness and kurtosis are implemented in qsu.
Function frange efficiently computes the minimum and maximum on atomic vectors.
The vector-valued functions and operators fcumsum, fscale/STD, fbetween/B, fhdbetween/HDB, fwithin/W, fhdwithin/HDW, flag/L/F, fdiff/D/Dlog and fgrowth/G are grouped under Data Transformations and Time Series and Panel Series. These functions also support indexed data (plm).
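
As a rough illustration of how these related functions connect to the ones documented here, the sketch below compares `frange` with `base::range` and `fwithin` with `fmean(..., TRA = "-")` on `mtcars`. The object names are illustrative, and the equivalences shown are assumed to hold for the default arguments only.

```r
library(collapse)

x <- mtcars$mpg
g <- mtcars$cyl

# frange() returns c(min, max), like base::range()
rng <- frange(x)

# fwithin() centers on group means, the same arithmetic as fmean(..., TRA = "-")
centered_within <- fwithin(x, g)
centered_tra    <- fmean(x, g, TRA = "-")

# qsu() computes grouped summary statistics; higher = TRUE requests
# skewness and kurtosis in addition to the basic statistics
summ <- qsu(x, g, higher = TRUE)
```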

Examples

library(collapse)
## default vector method
mpg <- mtcars$mpg
fsum(mpg)                         # Simple sum
fsum(mpg, TRA = "/")              # Simple transformation: divide all values by the sum
fsum(mpg, mtcars$cyl)             # Grouped sum
fmean(mpg, mtcars$cyl)            # Grouped mean
fmean(mpg, w = mtcars$hp)         # Weighted mean, weighted by hp
fmean(mpg, mtcars$cyl, mtcars$hp) # Grouped mean, weighted by hp
fsum(mpg, mtcars$cyl, TRA = "/")  # Proportions / division by group sums
fmean(mpg, mtcars$cyl, mtcars$hp, # Subtract weighted group means, see also ?fwithin
      TRA = "-")
## data.frame method
fsum(mtcars)
fsum(mtcars, TRA = "%")                  # This computes percentages
fsum(mtcars, mtcars[c(2,8:9)])           # Grouped column sum
g <- GRP(mtcars, ~ cyl + vs + am)        # Here precomputing the groups!
fsum(mtcars, g)                          # Faster !!
fmean(mtcars, g, mtcars$hp)
fmean(mtcars, g, mtcars$hp, "-")         # Demeaning by weighted group means..
fmean(fgroup_by(mtcars, cyl, vs, am), hp, "-")  # Another way of doing it..
fmode(wlddev, drop = FALSE)              # Compute statistical modes of variables in this data
fmode(wlddev, wlddev$income)             # Grouped statistical modes ..
## matrix method
m <- qM(mtcars)
fsum(m)
fsum(m, g) # ..
## method for grouped data frames - created with dplyr::group_by or fgroup_by
library(dplyr)
mtcars %>% group_by(cyl, vs, am) %>% select(mpg, carb) %>% fsum()
mtcars %>% fgroup_by(cyl, vs, am) %>% fselect(mpg, carb) %>% fsum() # equivalent and faster !!
mtcars %>% fgroup_by(cyl, vs, am) %>% fsum(TRA = "%")
mtcars %>% fgroup_by(cyl, vs, am) %>% fmean(hp)         # weighted grouped mean, save sum of weights
mtcars %>% fgroup_by(cyl, vs, am) %>% fmean(hp, keep.group_vars = FALSE)

Benchmark

## This compares fsum with data.table (2 threads) and base::rowsum
# Starting with small data
mtcDT <- qDT(mtcars)
f <- qF(mtcars$cyl)
library(data.table)      # needed for the mtcDT[, ..., by = f] syntax below
library(microbenchmark)
microbenchmark(mtcDT[, lapply(.SD, sum), by = f],
               rowsum(mtcDT, f, reorder = FALSE),
               fsum(mtcDT, f, na.rm = FALSE), unit = "relative")
                              expr        min         lq      mean    median        uq       max neval cld
 mtcDT[, lapply(.SD, sum), by = f] 145.436928 123.542134 88.681111 98.336378 71.880479 85.217726   100   c
 rowsum(mtcDT, f, reorder = FALSE)   2.833333   2.798203  2.489064  2.937889  2.425724  2.181173   100  b
     fsum(mtcDT, f, na.rm = FALSE)   1.000000   1.000000  1.000000  1.000000  1.000000  1.000000   100 a
# Now larger data
tdata <- qDT(replicate(100, rnorm(1e5), simplify = FALSE)) # 100 columns with 100,000 obs
f <- qF(sample.int(1e4, 1e5, TRUE))                        # A factor with 10,000 groups
microbenchmark(tdata[, lapply(.SD, sum), by = f],
               rowsum(tdata, f, reorder = FALSE),
               fsum(tdata, f, na.rm = FALSE), unit = "relative")
                              expr      min       lq     mean   median       uq       max neval cld
 tdata[, lapply(.SD, sum), by = f] 2.646992 2.975489 2.834771 3.081313 3.120070 1.2766475   100   c
 rowsum(tdata, f, reorder = FALSE) 1.747567 1.753313 1.629036 1.758043 1.839348 0.2720937   100  b
     fsum(tdata, f, na.rm = FALSE) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100 a
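
The benchmark above compares speed only. As a small sanity check, the sketch below confirms that `fsum` and `base::rowsum` return the same grouped column sums on `mtcars`. It uses `reorder = TRUE` (the `rowsum` default) rather than the `reorder = FALSE` of the benchmark, so that both results are in sorted group order; the object names are illustrative.

```r
library(collapse)

f <- qF(mtcars$cyl)

# Grouped column sums from collapse and from base R
by_fsum   <- fsum(mtcars, f, na.rm = FALSE)
by_rowsum <- rowsum(mtcars, f, reorder = TRUE)

# Compare the numeric values only, ignoring attribute differences
same <- isTRUE(all.equal(as.matrix(by_fsum), as.matrix(by_rowsum),
                         check.attributes = FALSE))
```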

Details

Please see the documentation of individual functions.