collap: Advanced Data Aggregation

Description

collap is a fast, easy-to-use, multi-purpose data aggregation command.

It performs simple aggregations, multi-type data aggregations applying different functions to numeric and categorical data, weighted aggregations (including weighted multi-type aggregations), aggregations applying multiple functions to each column (which can be performed in parallel), and fully customized aggregations where the user passes a list mapping functions to columns.

collap works with collapse's Fast Statistical Functions, providing extremely fast conventional and weighted aggregation. It also works with other functions, but these are not fast on large data and do not support weighted aggregations.
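As a rough illustration of the weighted aggregation described above, the sketch below compares a weighted mean computed by collap with the same quantity computed in base R. It assumes the collapse package is installed and uses the built-in iris data. Using Sepal.Length as the weight is arbitrary and only for demonstration.

```r
library(collapse)

# Weighted mean of Sepal.Width per species, with Sepal.Length as the weight
agg <- collap(iris, Sepal.Width ~ Species, fmean, w = ~ Sepal.Length)

# The same quantity computed group by group in base R, for comparison
manual <- sapply(split(iris, iris$Species),
                 function(d) weighted.mean(d$Sepal.Width, d$Sepal.Length))

# Both approaches should agree up to numerical tolerance
all.equal(as.numeric(agg$Sepal.Width), as.numeric(manual))
```

The base R version is only there as a cross-check; on large data the fast function is the reason to use collap in the first place.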

Usage

# Main function: allows formula and data input to `by` and `w` arguments
collap(X, by, FUN = fmean, catFUN = fmode, cols = NULL, w = NULL, wFUN = fsum,
       custom = NULL, keep.by = TRUE, keep.w = TRUE, keep.col.order = TRUE,
       sort.row = TRUE, parallel = FALSE, mc.cores = 1L,
       return = c("wide","list","long","long_dupl"), give.names = "auto", ...)
# Programmer function: allows column names and indices input to `by` and `w` arguments
collapv(X, by, FUN = fmean, catFUN = fmode, cols = NULL, w = NULL, wFUN = fsum,
        custom = NULL, keep.by = TRUE, keep.w = TRUE, keep.col.order = TRUE,
        sort.row = TRUE, parallel = FALSE, mc.cores = 1L,
        return = c("wide","list","long","long_dupl"), give.names = "auto", ...)
# Auxiliary function: for grouped tibble ('grouped_df') input + non-standard evaluation
collapg(X, FUN = fmean, catFUN = fmode, cols = NULL, w = NULL, wFUN = fsum, custom = NULL,
        keep.group_vars = TRUE, keep.w = TRUE, keep.col.order = TRUE, sort.row = TRUE,
        parallel = FALSE, mc.cores = 1L,
        return = c("wide","list","long","long_dupl"), give.names = "auto", ...)

Arguments

X

a data.frame, or an object coercible to data.frame using qDF.

by

for collap: a one- or two-sided formula, i.e. ~ group1 or var1 + var2 ~ group1 + group2, or an atomic vector, list of vectors or GRP object used to group X. For collapv: names or indices of grouping columns, or a logical vector or selector function such as is.categorical selecting grouping columns.

FUN

a function, list of functions (e.g. list(fsum, fmean, fsd) or list(myfun1 = function(x).., sd = sd)), or a character vector of function names, which are automatically applied only to numeric variables.

catFUN

same as FUN, but applied only to categorical (non-numeric) typed columns (is.categorical).

cols

select columns to aggregate using a function, column names, indices or logical vector. Note: cols is ignored if a two-sided formula is passed to by.

w

weights. Can be passed as a numeric vector or alternatively as a formula, i.e. ~ weightvar in collap, or as a column name / index etc., i.e. "weightvar" in collapv. collapg supports non-standard evaluation, so weightvar can be indicated without quotes if found in X.

wFUN

same as FUN: Function(s) to aggregate weight variable if keep.w = TRUE. By default the sum of the weights is computed in each group.

custom

a named list specifying a fully customized aggregation task. The names of the list are function names and the content columns to aggregate using this function (same input as cols). For example custom = list(fmean = 1:6, fsd = 7:9, fmode = 10:11) tells collap to aggregate columns 1-6 of X using the mean, columns 7-9 using the standard deviation etc. Note: custom lets collap ignore any inputs passed to FUN, catFUN or cols.

keep.by, keep.group_vars

logical. FALSE will omit grouping variables from the output. TRUE keeps the variables, even if passed externally in a list or vector (unlike other collapse functions).

keep.w

logical. FALSE will omit weight variable from the output i.e. no aggregation of the weights. TRUE aggregates and adds weights, even if passed externally as a vector (unlike other collapse functions).

keep.col.order

logical. Retain original column order post-aggregation.

sort.row

logical. Sort rows by the groups. From collapse 1.2.0 this only applies to character grouping variables.

parallel

logical. Use parallel::mclapply instead of lapply for multi-function or custom aggregation.

mc.cores

integer. Argument to parallel::mclapply setting the number of cores to use.

return

character. Control the output format when aggregating with multiple functions or performing custom aggregation. "wide" (default) returns a wider data frame with added columns for each additional function. "list" returns a list of data frames, one for each function. "long" adds a column "Function" and row-binds the results from different functions using data.table::rbindlist. "long_dupl" is a special option for aggregating multi-type data using multiple FUN but only one catFUN or vice-versa. In that case the format is long and data aggregated using only one function is duplicated. See Examples.

give.names

logical. Create unique names of aggregated columns by adding a prefix 'FUN.'. 'auto' will automatically create such prefixes whenever multiple functions are applied to a column or custom is used.

...

additional arguments passed to all functions supplied to FUN, catFUN, wFUN or custom. The behavior of Fast Statistical Functions when they receive unused arguments is regulated by options("collapse_unused_arg_action") and defaults to "warning".
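To make the interaction of the return and give.names arguments more concrete, here is a small sketch using the built-in iris data. It assumes collapse is installed, and only inspects the shape of the results rather than relying on exact values.

```r
library(collapse)

# Two functions applied to every numeric column
wide <- collap(iris, ~ Species, list(fmean, fmedian))                   # return = "wide" (default)
long <- collap(iris, ~ Species, list(fmean, fmedian), return = "long")  # one block of rows per function

names(wide)   # with give.names = "auto", columns carry a function prefix such as fmean.Sepal.Length
names(long)   # original column names plus a "Function" column
dim(wide)     # one row per group
dim(long)     # groups stacked once per function
```

The wide format is usually more convenient for further computation, while the long format is easier to filter by function.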

Value

X aggregated by groups supplied to the by argument.

Details

collap automatically checks whether each function passed to it is a Fast Statistical Function (i.e. whether the function name is contained in .FAST_STAT_FUN). If the function is a fast function, collap only does the grouping and then calls the function to carry out the grouped computations. If the function is not one of .FAST_STAT_FUN, BY is called internally to perform the computation. The resulting computations from each function are put into a list and recombined to produce the desired output format as controlled by the return argument. When multiple functions are used with collap, setting parallel = TRUE and the number of cores with mc.cores will instruct collap to execute these function calls in parallel using parallel::mclapply. If only a single function is used which is not a .FAST_STAT_FUN, the parallel and mc.cores arguments are handed down to BY. See Examples.
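The dispatch described above can be observed with a minimal sketch, assuming collapse is installed: fmean is listed in .FAST_STAT_FUN and is called directly on the grouped data, whereas base::mean is not listed and is therefore evaluated through BY. Both routes should give the same numbers and differ only in speed.

```r
library(collapse)

# Check which of the two function names counts as a Fast Statistical Function
c(fmean = "fmean" %in% .FAST_STAT_FUN,
  mean  = "mean"  %in% .FAST_STAT_FUN)

fast <- collap(iris, ~ Species, fmean)  # fast path: grouped computation inside fmean
slow <- collap(iris, ~ Species, mean)   # fallback path: computed through BY

# Compare the aggregated numeric columns from both paths
num <- names(iris)[1:4]
all.equal(unlist(fast[num], use.names = FALSE),
          unlist(slow[num], use.names = FALSE))
```

On a data set as small as iris the difference in run time is negligible; it only becomes relevant with many groups or large data.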

Examples

# NOT RUN {
## A Simple Introduction --------------------------------------
head(iris)
collap(iris, ~ Species)                                        # Default: FUN = fmean for numeric
collapv(iris, 5)                                               # Same using collapv
collap(iris, ~ Species, fmedian)                               # Using the median
collap(iris, ~ Species, fmedian, keep.col.order = FALSE)       # Groups in-front
collap(iris, Sepal.Width + Petal.Width ~ Species, fmedian)     # Only '.Width' columns
collapv(iris, 5, cols = c(2, 4))                               # Same using collapv
collap(iris, ~ Species, list(fmean, fmedian))                  # Two functions
collap(iris, ~ Species, list(fmean, fmedian), return = "long") # Long format
collapv(iris, 5, custom = list(fmean = 1:2, fmedian = 3:4))    # Custom aggregation
collapv(iris, 5, custom = list(fmean = 1:2, fmedian = 3:4),    # Raw output, no column reordering
        return = "list")
collapv(iris, 5, custom = list(fmean = 1:2, fmedian = 3:4),    # A strange choice...
        return = "long")
collap(iris, ~ Species, w = ~ Sepal.Length)                    # Using Sepal.Length as weights, ..
weights <- abs(rnorm(fnrow(iris)))
collap(iris, ~ Species, w = weights)                           # Some random weights..
collap(iris, iris$Species, w = weights)                        # Note this behavior...
collap(iris, iris$Species, w = weights,
       keep.by = FALSE, keep.w = FALSE)
library(dplyr) # Needed for "%>%"
iris %>% fgroup_by(Species) %>% collapg                        # dplyr style, but faster

## Multi-Type Aggregation --------------------------------------
head(wlddev)                                                    # World Development Panel Data
head(collap(wlddev, ~ country + decade))                        # Aggregate by country and decade
head(collap(wlddev, ~ country + decade, fmedian, ffirst))       # Different functions
head(collap(wlddev, ~ country + decade, cols = is.numeric))     # Aggregate only numeric columns
head(collap(wlddev, ~ country + decade, cols = 9:12))           # Only the 4 series
head(collap(wlddev, PCGDP + LIFEEX ~ country + decade))         # Only GDP and life-expectancy
head(collap(wlddev, PCGDP + LIFEEX ~ country + decade, fsum))   # Using the sum instead
head(collap(wlddev, PCGDP + LIFEEX ~ country + decade, sum,     # Same using base::sum -> slower!!
            na.rm = TRUE))
head(collap(wlddev, wlddev[c("country","decade")], fsum,        # same, exploring different inputs
            cols = 9:10))
head(collap(wlddev[9:10], wlddev[c("country","decade")], fsum))
head(collapv(wlddev, c("country","decade"), fsum))              # ... names/indices with collapv
head(collapv(wlddev, c(1,5), fsum))

g <- GRP(wlddev, ~ country + decade)                            # Precomputing the grouping
head(collap(wlddev, g, keep.by = FALSE))                        # This is slightly faster now
# Aggregate categorical data using not the mode but the last element
head(collap(wlddev, ~ country + decade, fmean, flast))
head(collap(wlddev, ~ country + decade, catFUN = flast,         # Aggregate only categorical data
            cols = is.categorical))


## Weighted aggregation ----------------------------------------
weights <- abs(rnorm(fnrow(wlddev)))                            # Random weight vector
head(collap(wlddev, ~ country + decade, w = weights))           # Takes weighted mean for numeric..
# ..and weighted mode for categorical data. The weight vector is aggregated using fsum
wlddev$weights <- weights                                       # Adding to data
head(collap(wlddev, ~ country + decade, w = ~ weights))         # Keeps column order
head(collap(wlddev, ~ country + decade, w = ~ weights,          # Aggregating weights using sum
            wFUN = list(fsum, fmax)))                           # and max (corresponding to mode)
wlddev$weights <- NULL


## Multi-Function Aggregation ----------------------------------
head(collap(wlddev, ~ country + decade, list(fmean, fNobs),     # Saving mean and Nobs
            cols = 9:12))

head(collap(wlddev, ~ country + decade,                         # same using base R -> slower
            list(mean = mean,
                 Nobs = function(x,...) sum(!is.na(x))),
            cols = 9:12, na.rm = TRUE))

head(collap(wlddev, ~ country + decade,                         # list output format
            list(fmean, fNobs), cols = 9:12, return = "list"))

head(collap(wlddev, ~ country + decade,                         # long output format
            list(fmean, fNobs), cols = 9:12, return = "long"))

head(collap(wlddev, ~ country + decade,                         # also aggregating categorical data,
            list(fmean, fNobs), return = "long_dupl"))          # and duplicating it 2 times

head(collap(wlddev, ~ country + decade,                         # now also using 2 functions on
            list(fmean, fNobs), list(fmode, flast),             # categorical data
            keep.col.order = FALSE))

head(collap(wlddev, ~ country + decade,                         # more functions, string input,
            c("fmean","fsum","fNobs","fsd","fvar"),             # parallelized execution
            c("fmode","ffirst","flast","fNdistinct"),           # (choose more than 1 core,
            parallel = TRUE, mc.cores = 1L,                     # depending on your machine)
            keep.col.order = FALSE))


## Custom Aggregation ------------------------------------------
head(collap(wlddev, ~ country + decade,                         # custom aggregation
            custom = list(fmean = 9:12, fsd = 9:10, fmode = 7:8)))

head(collap(wlddev, ~ country + decade,                         # using column names
            custom = list(fmean = "PCGDP", fsd = c("LIFEEX","GINI"),
                          flast = "date")))

head(collap(wlddev, ~ country + decade,                         # weighted parallelized custom
            custom = list(fmean = 9:12, fsd = 9:10,             # aggregation
                          fmode = 7:8), w = weights,
            wFUN = list(fsum, fmax),
            parallel = TRUE, mc.cores = 1L))

head(collap(wlddev, ~ country + decade,                         # No column reordering
            custom = list(fmean = 9:12, fsd = 9:10,
                          fmode = 7:8), w = weights,
            wFUN = list(fsum, fmax),
            parallel = TRUE, mc.cores = 1L, keep.col.order = FALSE))


## Piped use --------------------------------------------------
wlddev %>% fgroup_by(country, decade) %>% collapg
wlddev %>% fgroup_by(country, decade) %>% collapg(w = ODA)
wlddev %>% fgroup_by(country, decade) %>% collapg(fmedian, flast)
wlddev %>% fgroup_by(country, decade) %>%
  collapg(custom = list(fmean = 9:12, fmode = 5:7, flast = 3))

# }