groupdata2 (version 2.0.5)

group: Create groups from your data

Description

Lifecycle: stable

Divides data into groups by a wide range of methods. Creates a grouping factor with 1s for group 1, 2s for group 2, etc. Returns a data.frame grouped by the grouping factor for easy use in magrittr `%>%` pipelines.

By default, the data points in a group are connected sequentially (e.g. c(1, 1, 2, 2, 3, 3)) and splitting is done from top to bottom. The "every" method is the exception.

There are five types of grouping methods:

The "n_*" methods split the data into a given number of groups. They differ in how they handle excess data points.

The "greedy" method uses a group size to split the data into groups, greedily grabbing `n` data points from the top. The last group may thus differ in size (e.g. c(1, 1, 2, 2, 3)).

The "l_*" methods use a list of either starting points ("l_starts") or group sizes ("l_sizes"). The "l_starts" method can also auto-detect group starts (when a value differs from the previous value).

The "every" method puts every `n`th data point into the same group (e.g. c(1, 2, 3, 1, 2, 3)).

The step methods "staircase" and "primes" increase the group size by a step for each group.

Note: To create groups balanced by a categorical and/or numerical variable, see the fold() and partition() functions.

Usage

group(
  data,
  n,
  method = "n_dist",
  starts_col = NULL,
  force_equal = FALSE,
  allow_zero = FALSE,
  return_factor = FALSE,
  descending = FALSE,
  randomize = FALSE,
  col_name = ".groups",
  remove_missing_starts = FALSE
)

Value

data.frame grouped by existing grouping variables and the new grouping factor.

Arguments

data

data.frame or vector. When a grouped data.frame, the function is applied group-wise.

n

Depends on `method`.

Number of groups (default), group size, list of group sizes, list of group starts, number of data points between group members, step size or prime number to start at. See `method`.

Passed as whole number(s) and/or percentage(s) (0 < n < 1) and/or character.

Method "l_starts" allows 'auto'.

method

"greedy", "n_dist", "n_fill", "n_last", "n_rand", "l_sizes", "l_starts", "every", "staircase", or "primes".

Note: examples are sizes of the generated groups based on a vector with 57 elements.

greedy

Divides up the data greedily given a specified group size (e.g. 10, 10, 10, 10, 10, 7).

`n` is group size.
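The greedy split can be illustrated with a small base R sketch. It is not the package's implementation, only an illustration that reproduces the sizes quoted above for a 57-element vector; the variable names are made up for this example.

```r
# Base R sketch of "greedy" grouping (illustration only, not groupdata2's code)
n_points <- 57     # length of the example vector used on this page
group_size <- 10   # plays the role of `n`

# Points 1-10 fall in group 1, points 11-20 in group 2, and so on;
# the last group takes whatever is left
greedy_groups <- factor(ceiling(seq_len(n_points) / group_size))

# Number of data points per group
greedy_sizes <- as.vector(table(greedy_groups))
print(greedy_sizes)  # 10 10 10 10 10 7
```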

n_dist (default)

Divides the data into a specified number of groups and distributes excess data points across groups (e.g. 11, 11, 12, 11, 12).

`n` is number of groups.
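One formula that reproduces the "n_dist" sizes quoted above is to scale each row index by the ratio of groups to data points and round up. This is a base R sketch under that assumption; whether groupdata2 computes it exactly this way is not stated on this page.

```r
# Base R sketch of distributing excess points across groups
# (assumed formula, chosen because it matches the documented sizes)
n_points <- 57
n_groups <- 5

# Row i goes to group ceiling(i * n_groups / n_points)
dist_groups <- ceiling(seq_len(n_points) * n_groups / n_points)

dist_sizes <- as.vector(table(dist_groups))
print(dist_sizes)  # 11 11 12 11 12
```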

n_fill

Divides the data into a specified number of groups and fills up groups with excess data points from the beginning (e.g. 12, 12, 11, 11, 11).

`n` is number of groups.
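The "fill from the beginning" rule can be sketched in base R as follows. The names are illustrative and the code is not taken from the package.

```r
# Base R sketch of "n_fill": every group gets the base size,
# and the first `excess` groups get one extra data point
n_points <- 57
n_groups <- 5

base_size <- n_points %/% n_groups   # 11
excess <- n_points %% n_groups       # 2

fill_sizes <- base_size + as.integer(seq_len(n_groups) <= excess)
print(fill_sizes)  # 12 12 11 11 11

# Sequential grouping vector built from those sizes
fill_groups <- rep(seq_len(n_groups), times = fill_sizes)
```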

n_last

Divides the data into a specified number of groups. It finds the most equal group sizes possible, using all data points. Only the last group is able to differ in size (e.g. 11, 11, 11, 11, 13).

`n` is number of groups.
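A base R sketch of the "n_last" sizes, assuming the common group size is the whole-number quotient of data points by groups and the last group absorbs the remainder. That assumption matches the example above but is not confirmed as the package's general rule.

```r
# Base R sketch of "n_last" (assumption: floor-sized groups, remainder goes last)
n_points <- 57
n_groups <- 5

base_size <- n_points %/% n_groups   # 11

last_sizes <- c(
  rep(base_size, n_groups - 1),             # first four groups
  n_points - base_size * (n_groups - 1)     # last group takes the rest
)
print(last_sizes)  # 11 11 11 11 13
```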

n_rand

Divides the data into a specified number of groups. Excess data points are placed randomly in groups (max. 1 per group) (e.g. 12, 11, 11, 11, 12).

`n` is number of groups.
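The random placement of excess points can be sketched in base R by drawing, without replacement, which groups receive one extra point. This is an illustration of the rule described above, not the package's code, and the seed is only there to make the sketch repeatable.

```r
# Base R sketch of "n_rand": excess points land in randomly chosen groups,
# at most one extra point per group
set.seed(1)
n_points <- 57
n_groups <- 5

base_size <- n_points %/% n_groups   # 11
excess <- n_points %% n_groups       # 2

# Groups that receive an extra data point (sampled without replacement)
lucky_groups <- sample(n_groups, excess)

rand_sizes <- base_size + as.integer(seq_len(n_groups) %in% lucky_groups)
print(rand_sizes)
```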

l_sizes

Divides up the data by a list of group sizes. Excess data points are placed in an extra group at the end.

E.g. n = list(0.2, 0.3) outputs groups with sizes (11, 17, 29).

`n` is a list of group sizes.
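The arithmetic behind that example can be checked with a base R sketch. It assumes percentages are multiplied by the number of data points and rounded down, which is consistent with the sizes quoted above but is an assumption about the rounding rule.

```r
# Base R sketch of "l_sizes" with percentage sizes
# (assumption: percentages are rounded down to whole data points)
n_points <- 57
size_spec <- list(0.2, 0.3)

# Convert percentages (0 < s < 1) to counts; keep whole numbers as they are
sizes <- vapply(
  size_spec,
  function(s) if (s < 1) floor(s * n_points) else s,
  numeric(1)
)

# Excess data points go into an extra group at the end
leftover <- n_points - sum(sizes)
if (leftover > 0) sizes <- c(sizes, leftover)
print(sizes)  # 11 17 29

l_groups <- rep(seq_along(sizes), times = sizes)
```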

l_starts

Starts new groups at specified values in the `starts_col` vector.

n is a list of starting positions. Skip values by c(value, skip_to_number) where skip_to_number is the nth appearance of the value in the vector after the previous group start. The first data point is automatically a starting position.

E.g. n = c(1, 3, 7, 25, 50) outputs groups with sizes (2, 4, 18, 25, 8).
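The sizes in that example follow from the gaps between consecutive start values. The base R sketch below assumes the values being matched are simply 1 to 57, so each start value is found exactly once. It does not implement the skip syntax, and it is not the package's code.

```r
# Base R sketch of "l_starts" with explicit start values
values <- 1:57                   # assumed content of the matched vector
starts <- c(1, 3, 7, 25, 50)

# Row position of each start value (first match only; no skip handling)
start_rows <- match(starts, values)

# Each row belongs to the most recent start at or before it
start_groups <- findInterval(seq_along(values), start_rows)

start_sizes <- as.vector(table(start_groups))
print(start_sizes)  # 2 4 18 25 8
```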

To skip: given the vector c("a", "e", "o", "a", "e", "o"), n = list("a", "e", c("o", 2)) outputs groups with sizes (1, 4, 1).

If passing n = "auto", the starting positions are automatically found such that a group is started whenever a value differs from the previous value (see find_starts()). Note that all NAs are first replaced by a single unique value, meaning that they will also cause group starts. See differs_from_previous() to set a threshold for what is considered "different".

E.g. n = "auto" for c(10, 10, 7, 8, 8, 9) would start groups at the first 10, 7, 8 and 9, and give c(1, 1, 2, 3, 3, 4).
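The "differs from the previous value" rule can be sketched in base R by comparing each element with its predecessor and accumulating the changes. The sketch ignores the NA replacement step described above and is only an illustration of the idea.

```r
# Base R sketch of auto-detected group starts (no NA handling)
x <- c(10, 10, 7, 8, 8, 9)

# The first element always starts a group; later elements start a group
# when they differ from the element before them
is_start <- c(TRUE, x[-1] != x[-length(x)])

# Counting the starts seen so far gives the group index
auto_groups <- cumsum(is_start)
print(auto_groups)  # 1 1 2 3 3 4
print(x[is_start])  # 10 7 8 9
```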

every

Combines every `n`th data point into a group (e.g. 12, 12, 11, 11, 11 with n = 5).

`n` is the number of data points between group members ("every n").
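In base R terms, the "every" pattern is a cycling index. The sketch below is an illustration rather than the package's code, and it reproduces the sizes quoted above for 57 data points.

```r
# Base R sketch of "every": cycle through group numbers 1..n
n_points <- 57
every_n <- 5   # plays the role of `n`

every_groups <- (seq_len(n_points) - 1) %% every_n + 1
print(head(every_groups, 7))  # 1 2 3 4 5 1 2

every_sizes <- as.vector(table(every_groups))
print(every_sizes)  # 12 12 11 11 11
```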

staircase

Uses step size to divide up the data. Group size increases by 1 step for every group, until there is no more data (e.g. 5, 10, 15, 20, 7).

`n` is step size.
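The staircase sizes can be derived with a short loop. This base R sketch uses illustrative names and is not the package's implementation. It simply keeps adding one step to the group size until the data runs out.

```r
# Base R sketch of "staircase" group sizes
n_points <- 57
step_size <- 5   # plays the role of `n`

stair_sizes <- c()
remaining <- n_points
k <- 1
while (remaining > 0) {
  # Group k wants k steps' worth of points, but cannot exceed what is left
  size <- min(step_size * k, remaining)
  stair_sizes <- c(stair_sizes, size)
  remaining <- remaining - size
  k <- k + 1
}
print(stair_sizes)  # 5 10 15 20 7
```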

primes

Uses prime numbers as group sizes. Group size increases to the next prime number until there is no more data (e.g. 5, 7, 11, 13, 17, 4).

`n` is the prime number to start at.
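The same loop idea works for the "primes" sizes. The helpers `is_prime()` and `next_prime()` below are hypothetical functions written for this sketch. They are not part of groupdata2.

```r
# Base R sketch of "primes" group sizes (helpers are hypothetical)
is_prime <- function(k) {
  # Trial division by 2..floor(sqrt(k)); an empty divisor set means prime
  k > 1 && all(k %% seq_len(floor(sqrt(k)))[-1] != 0)
}

next_prime <- function(k) {
  k <- k + 1
  while (!is_prime(k)) k <- k + 1
  k
}

n_points <- 57
size <- 5   # plays the role of `n`, the prime to start at

prime_sizes <- c()
remaining <- n_points
while (remaining > 0) {
  take <- min(size, remaining)   # last group may be cut short
  prime_sizes <- c(prime_sizes, take)
  remaining <- remaining - take
  size <- next_prime(size)
}
print(prime_sizes)  # 5 7 11 13 17 4
```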

starts_col

Name of column with values to match in method "l_starts" when `data` is a data.frame. Pass 'index' to use row names. (Character)

force_equal

Create equal groups by discarding excess data points. Implementation varies between methods. (Logical)

allow_zero

Whether `n` can be passed as 0. Can be useful when programmatically finding n. (Logical)

return_factor

Only return the grouping factor. (Logical)

descending

Change the direction of the method. (Not fully implemented) (Logical)

randomize

Randomize the grouping factor. (Logical)
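A minimal base R sketch of what randomizing a grouping factor could look like, assuming it amounts to shuffling the factor's elements. Under that assumption the group sizes stay the same and only the assignment of rows to groups changes.

```r
# Base R sketch of a randomized grouping factor
# (assumption: the factor is shuffled, so group sizes are preserved)
set.seed(42)
ordered_groups <- factor(rep(1:3, each = 2))   # 1 1 2 2 3 3

shuffled_groups <- sample(ordered_groups)

# Same sizes before and after shuffling
sizes_before <- as.vector(table(ordered_groups))
sizes_after <- as.vector(table(shuffled_groups))
print(sizes_after)  # 2 2 2
```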

col_name

Name of the added grouping factor.

remove_missing_starts

Recursively remove elements from the list of starts that are not found. For method "l_starts" only. (Logical)

Author

Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk

See Also

Other grouping functions: all_groups_identical(), collapse_groups(), collapse_groups_by, fold(), group_factor(), partition(), splt()

Other staircase tools: %primes%(), %staircase%(), group_factor()

Other l_starts tools: differs_from_previous(), find_missing_starts(), find_starts(), group_factor()

Examples

# Attach packages
library(groupdata2)
library(dplyr)

# Create data frame
df <- data.frame(
  "x" = c(1:12),
  "species" = factor(rep(c("cat", "pig", "human"), 4)),
  "age" = sample(c(1:100), 12)
)

# Using group()
df_grouped <- group(df, n = 5, method = "n_dist")

# Using group() in pipeline to get mean age
df_means <- df %>%
  group(n = 5, method = "n_dist") %>%
  dplyr::summarise(mean_age = mean(age))

# Using group() with `l_sizes`
df_grouped <- group(
  data = df,
  n = list(0.2, 0.3),
  method = "l_sizes"
)

# Using group() with `l_starts`
# `c('pig', 2)` skips to the second appearance of
# 'pig' after the first appearance of 'cat'
df_grouped <- group(
  data = df,
  n = list("cat", c("pig", 2), "human"),
  method = "l_starts",
  starts_col = "species"
)
