aggregateData: Aggregate data by categorical variables

Description

Aggregates a dataframe by grouping it on the specified categorical variables and computing summaries of its numeric variables. Returns the result along with the tidyverse code used to generate it.

Usage

aggregateData(
  .data,
  vars,
  summaries,
  summary_vars,
  varnames = NULL,
  quantiles = c(0.25, 0.75),
  custom_funs = NULL
)

Value

An aggregated dataframe containing the summaries, with the tidyverse code attached.

Arguments

.data: a dataframe or survey design object to aggregate
vars: a character vector of categorical variables in .data to group by
summaries: summaries to generate for the groups defined by vars. See details.
summary_vars: names of variables in the dataset to calculate summaries of
varnames: name templates for created variables (see details).
quantiles: if requesting quantiles, specify the desired quantiles here
custom_funs: a list of custom functions (see details).

Calculating variable summaries

The aggregateData function accepts any R function which returns a single value (such as mean, var, sd, sum, IQR). The default name of new variables will be {var}_{fun}, where {var} is the variable name and {fun} is the summary function used. You may pass new names via the varnames argument, which should be either a vector the same length as summary_vars, or a named list (where the names are the summary functions). In either case, use {var} to represent the variable name, for example {var}_mean or min_{var}.
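
The naming rule above can be illustrated with a minimal base R sketch. It does not call aggregateData and is not the package's implementation; fill_template and summarise_by are hypothetical helpers written only to show how a {var}_{fun} template might expand when single-value summaries are computed per group.

```r
# Hypothetical helper (not part of the package): fill a {var}/{fun} template.
fill_template <- function(template, var, fun) {
  out <- gsub("{var}", var, template, fixed = TRUE)
  gsub("{fun}", fun, out, fixed = TRUE)
}

# Hypothetical helper: one row per group, one column per variable/summary pair.
summarise_by <- function(data, group, vars, funs, template = "{var}_{fun}") {
  groups <- split(data, data[[group]], drop = TRUE)
  rows <- lapply(names(groups), function(g) {
    out <- list()
    out[[group]] <- g
    for (v in vars) {
      for (f in names(funs)) {
        # Each summary function must return a single value.
        out[[fill_template(template, v, f)]] <- funs[[f]](groups[[g]][[v]])
      }
    }
    as.data.frame(out, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

res <- summarise_by(
  iris, "Species",
  vars = c("Sepal.Length", "Sepal.Width"),
  funs = list(mean = mean, sd = sd)
)
names(res)
```

Passing template = "min_{var}" together with funs = list(min = min) would produce columns such as min_Sepal.Length in this sketch.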

You can also include the summary missing, which will count the number of missing values in the variable. It has default name {var}_missing.
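
As a rough illustration of what a per-group missing count looks like, the following base R sketch counts NA values within each group. The data frame df and its columns are made up for the example, and the code does not use aggregateData itself.

```r
# Made-up data: two groups with some missing heights.
df <- data.frame(
  group  = c("a", "a", "b", "b", "b"),
  height = c(1.2, NA, 3.4, NA, NA)
)

# Count missing values of height within each group.
height_missing <- tapply(df$height, df$group, function(x) sum(is.na(x)))
height_missing
```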

For the quantile summary, there is the additional argument quantiles. A new variable will be created for each specified quantile 'p'. To name these variables, use {p} in varnames (the default is {var}_q{p}).
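
A minimal base R sketch of the one-variable-per-quantile idea follows. It assumes that {p} is rendered as a percentage (25 for 0.25); the package may format it differently, so treat the resulting names as illustrative only.

```r
x <- iris$Sepal.Length
probs <- c(0.25, 0.75)
template <- "{var}_q{p}"

# One value per requested quantile.
q <- quantile(x, probs = probs, names = FALSE)

# Assumption: {p} is filled in as a percentage; the package may differ.
names(q) <- vapply(probs, function(p) {
  out <- gsub("{var}", "Sepal.Length", template, fixed = TRUE)
  gsub("{p}", format(100 * p), out, fixed = TRUE)
}, character(1))
q
```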

Custom functions can be passed via the custom_funs argument. This should be a list, and each element should have a name and either an expr or fun element. Expressions should operate on a variable x. The function should be a function of x and return a single value.

# na.rm belongs to range(), not diff()
cust_funs <- list(name = '{var}_width', expr = diff(range(x, na.rm = TRUE)))
cust_funs <- list(name = '{var}_stderr',
  fun = function(x) {
    s <- sd(x)
    n <- length(x)
    s / sqrt(n)
  }
)
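
To check that a custom summary returns a single value before passing it on, it can help to try it on a plain vector first. The sketch below is an assumption about how an expression in x and a function of x could each be evaluated; it uses quote() and eval() for illustration and does not reflect how the package handles custom_funs internally.

```r
x <- c(2, 5, NA, 9)

# Expression form: written in terms of a variable named x.
width_expr <- quote(diff(range(x, na.rm = TRUE)))
width_val <- eval(width_expr, list(x = x))

# Function form: a function of x returning a single value.
# This variant drops missing values first, unlike the example above.
stderr_fun <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}
stderr_val <- stderr_fun(x)

# Both results should have length 1.
c(length(width_val), length(stderr_val))
```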

Author

Tom Elliott, Owen Jin

Examples

aggregated <-
    aggregateData(iris,
        vars = c("Species"),
        summaries = c("mean", "sd", "iqr")
    )
cat(code(aggregated))
head(aggregated)