group: Create groups from your data

Description

Lifecycle: stable.

Divides data into groups by a wide range of methods. Creates a grouping factor with 1s for group 1, 2s for group 2, etc. Returns a data.frame grouped by the grouping factor for easy use in magrittr `%>%` pipelines.

By default, the data points in a group are connected sequentially (e.g. c(1, 1, 2, 2, 3, 3)) and splitting is done from top to bottom. The "every" method is the exception.

There are five types of grouping methods:

The "n_*" methods split the data into a given number of groups. They differ in how they handle excess data points.

The "greedy" method uses a group size to split the data into groups, greedily grabbing `n` data points from the top. The last group may thus differ in size (e.g. c(1, 1, 2, 2, 3)).

The "l_*" methods use a list of either starting points ("l_starts") or group sizes ("l_sizes"). The "l_starts" method can also auto-detect group starts (when a value differs from the previous value).

The "every" method puts every `n`th data point into the same group (e.g. c(1, 2, 3, 1, 2, 3)).

The step methods "staircase" and "primes" increase the group size by a step for each group.

Note: To create groups balanced by a categorical and/or numerical variable, see the fold() and partition() functions.

Usage

group(
  data,
  n,
  method = "n_dist",
  starts_col = NULL,
  force_equal = FALSE,
  allow_zero = FALSE,
  return_factor = FALSE,
  descending = FALSE,
  randomize = FALSE,
  col_name = ".groups",
  remove_missing_starts = FALSE
)

Value

data.frame grouped by existing grouping variables and the new grouping factor.

Arguments

data

data.frame or vector. When a grouped data.frame, the function is applied group-wise.

n

Depends on `method`.

Number of groups (default), group size, list of group sizes, list of group starts, number of data points between group members, step size or prime number to start at. See `method`.

Passed as whole number(s) and/or percentage(s) (0 < n < 1) and/or character.

Method "l_starts" allows 'auto'.

method

"greedy", "n_dist", "n_fill", "n_last", "n_rand", "l_sizes", "l_starts", "every", "staircase", or "primes".

Note: examples are sizes of the generated groups based on a vector with 57 elements.

greedy

Divides up the data greedily given a specified group size (e.g. 10, 10, 10, 10, 10, 7).

`n` is group size.
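The greedy split can be illustrated with a few lines of base R. This is a minimal sketch of the idea, not the groupdata2 implementation; the variable names `n_points`, `group_size`, `greedy_ids` and `greedy_sizes` are made up for the illustration.

```r
# Base R sketch of greedy grouping (illustration only, not groupdata2's code)
n_points <- 57L    # length of the example vector used in this section
group_size <- 10L  # plays the role of `n`

# Consecutive blocks of `group_size` points share a group id
greedy_ids <- ceiling(seq_len(n_points) / group_size)

# Count the members of each group; the last group holds the leftover points
greedy_sizes <- as.integer(table(greedy_ids))
print(greedy_sizes)
```

With 57 points and a group size of 10, the sizes should match the example above, with the 7 leftover points forming the final group.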

n_dist (default)

Divides the data into a specified number of groups and distributes excess data points across groups (e.g. 11, 11, 12, 11, 12).

`n` is number of groups.
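One way to spread excess points across groups is to scale each index onto the range of group ids and round up. The sketch below is an assumption about how such a distribution can be produced in base R; it happens to reproduce the example sizes above, but it is not claimed to be the package's actual code. All variable names are hypothetical.

```r
# Base R sketch of distributing excess points (illustration only)
n_points <- 57L
n_groups <- 5L   # plays the role of `n`

# Map index i to ceiling(i * n_groups / n_points), so group borders
# fall as evenly as possible along the data
dist_ids <- ceiling(seq_len(n_points) * n_groups / n_points)

dist_sizes <- as.integer(table(dist_ids))
print(dist_sizes)
```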

n_fill

Divides the data into a specified number of groups and fills up groups with excess data points from the beginning (e.g. 12, 12, 11, 11, 11).

`n` is number of groups.
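Filling from the beginning can be sketched with integer division and the remainder. This is a base R illustration under the assumption that each of the first groups receives exactly one excess point; the names used are hypothetical.

```r
# Base R sketch of filling groups from the beginning (illustration only)
n_points <- 57L
n_groups <- 5L   # plays the role of `n`

base_size <- n_points %/% n_groups  # size every group gets at minimum
excess <- n_points %% n_groups      # points left over after the even split

# The first `excess` groups each get one extra point
fill_sizes <- base_size + as.integer(seq_len(n_groups) <= excess)
fill_ids <- rep(seq_len(n_groups), times = fill_sizes)
print(fill_sizes)
```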

n_last

Divides the data into a specified number of groups. It finds the most equal group sizes possible, using all data points. Only the last group is able to differ in size (e.g. 11, 11, 11, 11, 13).

`n` is number of groups.
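The same arithmetic, with all excess points assigned to the final group, gives the sizes in the example above. This is a base R sketch with hypothetical variable names, not the package's implementation.

```r
# Base R sketch where only the last group differs in size (illustration only)
n_points <- 57L
n_groups <- 5L   # plays the role of `n`

base_size <- n_points %/% n_groups

# All groups but the last get `base_size`; the last takes whatever remains
last_sizes <- c(
  rep(base_size, n_groups - 1L),
  n_points - base_size * (n_groups - 1L)
)
print(last_sizes)
```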

n_rand

Divides the data into a specified number of groups. Excess data points are placed randomly in groups (max. 1 per group) (e.g. 12, 11, 11, 11, 12).

`n` is number of groups.
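Random placement with at most one excess point per group can be sketched by sampling group ids without replacement. This is a base R illustration with hypothetical names; which groups receive the extra points depends on the random seed, so no particular arrangement is implied.

```r
# Base R sketch of random placement of excess points (illustration only)
set.seed(1)      # only to make the sketch reproducible
n_points <- 57L
n_groups <- 5L   # plays the role of `n`

base_size <- n_points %/% n_groups
excess <- n_points %% n_groups

# Sampling without replacement guarantees at most one extra point per group
lucky_groups <- sample(seq_len(n_groups), size = excess)

rand_sizes <- rep(base_size, n_groups)
rand_sizes[lucky_groups] <- rand_sizes[lucky_groups] + 1L
print(rand_sizes)
```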

l_sizes

Divides up the data by a list of group sizes. Excess data points are placed in an extra group at the end.

E.g. n = list(0.2, 0.3) outputs groups with sizes (11, 17, 29).

`n` is a list of group sizes.
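The arithmetic behind the example can be sketched in base R. The sketch assumes that percentages are multiplied by the number of data points and rounded down, which reproduces the sizes above; the rounding rule and all variable names are assumptions for illustration, not the package's code.

```r
# Base R sketch of grouping by a list of sizes (illustration only)
n_points <- 57L
n <- list(0.2, 0.3)  # percentages of the data, as in the example above

# Assumption: a percentage s becomes floor(s * n_points); whole numbers pass through
sizes <- vapply(
  n,
  function(s) if (s < 1) floor(s * n_points) else s,
  numeric(1)
)

# Excess data points go into an extra group at the end
leftover <- n_points - sum(sizes)
if (leftover > 0) sizes <- c(sizes, leftover)

l_sizes_ids <- rep(seq_along(sizes), times = sizes)
print(sizes)
```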

l_starts

Starts new groups at specified values in the `starts_col` vector.

`n` is a list of starting positions. Skip values with c(value, skip_to_number), where skip_to_number is the nth appearance of the value in the vector after the previous group start. The first data point is automatically a starting position.

E.g. n = c(1, 3, 7, 25, 50) outputs groups with sizes (2, 4, 18, 25, 8).

To skip: given vector c("a", "e", "o", "a", "e", "o"), n = list("a", "e", c("o", 2)) outputs groups with sizes (1, 4, 1).

If passing `n = 'auto'`, the starting positions are found automatically such that a group is started whenever a value differs from the previous value (see find_starts()). Note that all NAs are first replaced by a single unique value, meaning that they will also cause group starts. See differs_from_previous() to set a threshold for what is considered "different".

E.g. n = "auto" for c(10, 10, 7, 8, 8, 9) would start groups at the first 10, 7, 8 and 9, and give c(1, 1, 2, 3, 3, 4).
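The auto-detection idea (start a group whenever a value differs from the previous one) can be sketched in base R by comparing the vector with a shifted copy of itself. This sketch uses hypothetical names and does not handle NAs, which the description above says the package treats specially.

```r
# Base R sketch of starting a group at every change in value (illustration only)
x <- c(10, 10, 7, 8, 8, 9)

# The first element always starts a group; later elements start one
# when they differ from their predecessor
is_start <- c(TRUE, x[-1] != x[-length(x)])

# A running count of starts gives the group id of each element
auto_ids <- cumsum(is_start)
print(auto_ids)
```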

every

Combines every `n`th data point into a group (e.g. 12, 12, 11, 11, 11 with n = 5).

`n` is the number of data points between group members ("every n").
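Assigning every `n`th point to the same group amounts to cycling through the group ids, which the modulo operator expresses directly. This is a base R sketch with hypothetical variable names, not the package's implementation.

```r
# Base R sketch of the "every n" assignment (illustration only)
n_points <- 57L
n <- 5L  # every 5th data point joins the same group

# Ids cycle 1, 2, ..., n, 1, 2, ... along the data
every_ids <- (seq_len(n_points) - 1L) %% n + 1L

every_sizes <- as.integer(table(every_ids))
print(head(every_ids, 6))
print(every_sizes)
```

Because 57 is not a multiple of 5, the first two groups receive one more member than the rest.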

staircase

Uses step size to divide up the data. Group size increases by one step for each group until there is no more data (e.g. 5, 10, 15, 20, 7).

`n` is step size.
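The growing group sizes can be generated with a short loop. The helper `staircase_sizes()` below is a hypothetical base R function written for this illustration; it is not part of groupdata2.

```r
# Base R sketch of staircase group sizes (illustration only)
# `staircase_sizes` is a hypothetical helper, not a groupdata2 function
staircase_sizes <- function(n_points, step) {
  sizes <- numeric(0)
  remaining <- n_points
  k <- 1
  while (remaining > 0) {
    # Group k wants k * step points but cannot take more than what is left
    size <- min(k * step, remaining)
    sizes <- c(sizes, size)
    remaining <- remaining - size
    k <- k + 1
  }
  sizes
}

stair_sizes <- staircase_sizes(57, 5)
print(stair_sizes)
```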

primes

Uses prime numbers as group sizes. Group size increases to the next prime number until there is no more data (e.g. 5, 7, 11, 13, 17, 4).

`n` is the prime number to start at.
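The sequence of prime group sizes can be sketched in base R with a simple trial-division primality check. Both `is_prime()` and `prime_sizes()` are hypothetical helpers written for this illustration, and the sketch assumes the starting value is itself prime.

```r
# Base R sketch of prime-number group sizes (illustration only)
# `is_prime` and `prime_sizes` are hypothetical helpers, not groupdata2 functions
is_prime <- function(k) {
  if (k < 2) return(FALSE)
  if (k < 4) return(TRUE)  # 2 and 3
  all(k %% 2:floor(sqrt(k)) != 0)
}

prime_sizes <- function(n_points, start) {
  sizes <- numeric(0)
  remaining <- n_points
  p <- start  # assumed to be a prime number
  while (remaining > 0) {
    # Take p points, or whatever is left for the final group
    size <- min(p, remaining)
    sizes <- c(sizes, size)
    remaining <- remaining - size
    # Advance to the next prime
    p <- p + 1
    while (!is_prime(p)) p <- p + 1
  }
  sizes
}

p_sizes <- prime_sizes(57, 5)
print(p_sizes)
```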

starts_col

Name of column with values to match in method "l_starts" when `data` is a data.frame. Pass 'index' to use row names. (Character)

force_equal

Create equal groups by discarding excess data points. Implementation varies between methods. (Logical)

allow_zero

Whether `n` can be passed as 0. Can be useful when programmatically finding n. (Logical)

return_factor

Only return the grouping factor. (Logical)

descending

Change the direction of the method. (Not fully implemented) (Logical)

randomize

Randomize the grouping factor. (Logical)

col_name

Name of the added grouping factor.

remove_missing_starts

Recursively remove elements from the list of starts that are not found. For method "l_starts" only. (Logical)

Author

Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk

Examples

# Attach packages
library(groupdata2)
library(dplyr)

# Create data frame
df <- data.frame(
  "x" = c(1:12),
  "species" = factor(rep(c("cat", "pig", "human"), 4)),
  "age" = sample(c(1:100), 12)
)

# Using group()
df_grouped <- group(df, n = 5, method = "n_dist")

# Using group() in pipeline to get mean age
df_means <- df %>%
  group(n = 5, method = "n_dist") %>%
  dplyr::summarise(mean_age = mean(age))

# Using group() with `l_sizes`
df_grouped <- group(
  data = df,
  n = list(0.2, 0.3),
  method = "l_sizes"
)

# Using group() with `l_starts`
# `c('pig', 2)` skips to the second appearance of
# 'pig' after the first appearance of 'cat'
df_grouped <- group(
  data = df,
  n = list("cat", c("pig", 2), "human"),
  method = "l_starts",
  starts_col = "species"
)
