group_var: Recode numeric variables into equal-ranged groups

Description

Recode numeric variables into equal-ranged, grouped factors, i.e. a variable is cut into a smaller number of groups, where each group has the same value range. group_labels() creates the related value labels. group_var_if() and group_labels_if() are scoped variants of group_var() and group_labels(), where grouping is applied only to those variables that match the logical condition of predicate.

Usage

group_var(x, ..., size = 5, as.num = TRUE, right.interval = FALSE,
  n = 30, append = TRUE, suffix = "_gr")
group_var_if(x, predicate, size = 5, as.num = TRUE,
  right.interval = FALSE, n = 30, append = TRUE, suffix = "_gr")
group_labels(x, ..., size = 5, right.interval = FALSE, n = 30)
group_labels_if(x, predicate, size = 5, right.interval = FALSE,
  n = 30)

Arguments

x

A vector or data frame.

...

Optional, unquoted names of variables that should be selected for further processing. Required if x is a data frame (and not a vector) and only selected variables from x should be processed. You may also use functions like : or tidyselect's select_helpers. See 'Examples' or the package vignette.

size

Numeric; group size, i.e. the range for grouping. By default, a new group is defined for every 5 categories of x, i.e. size = 5. Use size = "auto" to automatically resize a variable into a maximum of 30 groups (which is the ggplot default grouping when plotting histograms). Use n to set the number of groups.

as.num

Logical, if TRUE, return value will be numeric, not a factor.

right.interval

Logical; if TRUE, grouping starts with the lower bound of size. See 'Details'.

n

Sets the maximum number of groups that are defined when auto-grouping is on (size = "auto"). Default is 30. If size is not set to "auto", this argument will be ignored.

append

Logical, if TRUE (the default) and x is a data frame, x including the new variables as additional columns is returned; if FALSE, only the new variables are returned.

suffix

String value, will be appended to variable (column) names of x, if x is a data frame. If x is not a data frame, this argument will be ignored. The default value to suffix column names in a data frame depends on the function call:

recoded variables (rec()) will be suffixed with "_r"
recoded variables (recode_to()) will be suffixed with "_r0"
dichotomized variables (dicho()) will be suffixed with "_d"
grouped variables (split_var()) will be suffixed with "_g"
grouped variables (group_var()) will be suffixed with "_gr"
standardized variables (std()) will be suffixed with "_z"
centered variables (center()) will be suffixed with "_c"
de-meaned variables (de_mean()) will be suffixed with "_dm"
grouped-meaned variables (de_mean()) will be suffixed with "_gm"

If suffix = "" and append = TRUE, existing variables that have been recoded/transformed will be overwritten.
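The interaction of suffix and append can be pictured with plain data frame operations. The following is a minimal base R sketch, not the package's implementation; the data frame df and the integer division used as a stand-in for grouping are assumptions made only for illustration.

```r
# Illustration only: base R, not the group_var() implementation.
# 'df' and the integer division below are hypothetical stand-ins.
df <- data.frame(age = c(34, 58, 71))
grouped <- df$age %/% 10          # placeholder for a grouped variable

# Non-empty suffix: the new column is added next to the original one.
appended <- df
appended[[paste0("age", "_gr")]] <- grouped
names(appended)

# Empty suffix: the new column name equals the old one,
# so the original values are overwritten.
overwritten <- df
overwritten[[paste0("age", "")]] <- grouped
names(overwritten)
```

With a non-empty suffix both columns coexist; with an empty suffix only one column remains and it holds the transformed values.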

predicate

A predicate function to be applied to the columns. The variables for which predicate returns TRUE are selected.
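To make the idea of a predicate concrete, the sketch below applies is.numeric to every column of a small, made-up data frame and keeps the columns for which it returns TRUE. It uses base R only and is meant as an illustration of predicate-based selection in general, not of how the scoped variants are implemented.

```r
# Illustration only: predicate-based column selection in base R.
# 'df' is a hypothetical example data frame.
df <- data.frame(
  id = c("a", "b", "c"),
  age = c(34, 58, 71),
  hours = c(10, 40, 25),
  stringsAsFactors = FALSE
)

# Apply the predicate to each column; the result is one logical per column.
selected <- vapply(df, is.numeric, logical(1))

# Only the columns where the predicate returned TRUE would be processed.
names(df)[selected]
```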

Value

For group_var(), a grouped variable, either as numeric or as factor (see parameter as.num). If x is a data frame, the grouped variables are returned, either appended to x or on their own (see parameter append).
For group_labels(), a string vector or a list of string vectors containing labels based on the grouped categories of x, formatted as "from lower bound to upper bound", e.g. "10-19" "20-29" "30-39" etc. See 'Examples'.

Details

If size is set to a specific value, the variable is recoded into several groups, where each group has a maximum range of size. Hence, the number of groups differs depending on the range of x.

If size = "auto", the variable is recoded into a maximum of n groups. Hence, regardless of the range of x, the same number of groups is always created, so the range within each group differs (depending on x's range).
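The trade-off between the two modes can be sketched in base R. The helper names count_fixed() and auto_width() are hypothetical, and the width formula (range divided by n, rounded up) is an assumption chosen for illustration; the package may compute its breaks differently.

```r
# Illustration only: base R, not the group_var() implementation.
# count_fixed() and auto_width() are hypothetical helpers.

# Fixed group range: how many groups of width 'size' are needed to cover x?
count_fixed <- function(x, size) {
  lower <- floor(min(x) / size) * size
  upper <- (floor(max(x) / size) + 1) * size
  nlevels(cut(x, breaks = seq(lower, upper, by = size), right = FALSE))
}

# Fixed number of groups: how wide must each of the n groups be?
# Assumed formula: range of x divided by n, rounded up.
auto_width <- function(x, n) {
  ceiling((max(x) - min(x)) / n)
}

narrow <- c(50, 53, 58, 61, 67, 72, 80)
wide <- c(50, 75, 110)

count_fixed(narrow, size = 10)  # fewer groups for the narrow range
count_fixed(wide, size = 10)    # more groups for the wide range
auto_width(narrow, n = 5)       # narrower groups for the narrow range
auto_width(wide, n = 5)         # wider groups for the wide range
```

With a fixed size the group count grows with the range of the data, while with a fixed group count the width per group grows instead.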

right.interval determines which boundary values to include when grouping is done. If TRUE, grouping starts with the lower bound of size. For example, for a variable ranging from 50 to 80, groups cover the ranges 50-54, 55-59, 60-64 etc. If FALSE (default), grouping starts with the upper bound of size. In this case, groups cover the ranges 46-50, 51-55, 56-60, 61-65 etc. Note: This will cover a range from 46-50 as first group, even if values from 46 to 49 are not present. See 'Examples'.
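The two boundary conventions from the paragraph above can be reproduced with base R's cut(). This sketch only shows the two sets of ranges for a group size of 5; it makes no claim about how right.interval is passed on internally, and the variable names are chosen for illustration.

```r
# Illustration only: the two boundary conventions, built with base R.
size <- 5
bounds <- seq(50, 80, by = size)

# Convention 1: each group starts at a bound and includes it.
labels_lower <- paste0(bounds, "-", bounds + size - 1)
# Convention 2: each group ends at a bound and includes it.
labels_upper <- paste0(bounds - size + 1, "-", bounds)

# The value 55 sits on a bound, so it lands in a different group
# depending on which side of the interval is closed.
in_lower <- as.character(
  cut(55, breaks = seq(50, 85, by = size), right = FALSE, labels = labels_lower)
)
in_upper <- as.character(
  cut(55, breaks = seq(45, 80, by = size), right = TRUE, labels = labels_upper)
)

labels_lower[1:3]
labels_upper[1:4]
c(in_lower, in_upper)
```

Only values that fall exactly on a bound change groups between the two conventions; all other values keep their relative position.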

If you want to split a variable into a certain number of equal-sized groups (instead of groups that all cover the same value range), use the split_var() function.
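The difference between equal-ranged and equal-sized groups shows up clearly on skewed data. The sketch below uses base R's cut() and quantile() on a made-up vector; it illustrates the two concepts and is not the implementation of group_var() or split_var().

```r
# Illustration only: equal value ranges vs. equal group sizes in base R.
# 'x' is a hypothetical, right-skewed example vector.
x <- c(1, 2, 3, 4, 5, 6, 50, 60, 100)

# Equal-ranged groups: every group spans 40 units,
# so the number of observations per group varies.
equal_range <- cut(x, breaks = seq(0, 120, by = 40))
range_counts <- as.integer(table(equal_range))

# Equal-sized groups: breaks are placed at the tertiles,
# so every group holds the same number of observations.
tertiles <- quantile(x, probs = c(0, 1 / 3, 2 / 3, 1))
equal_size <- cut(x, breaks = tertiles, include.lowest = TRUE)
size_counts <- as.integer(table(equal_size))

range_counts
size_counts
```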

group_var() also works on grouped data frames (see group_by). In this case, grouping is applied to the subsets of variables in x. See 'Examples'.

Examples

# NOT RUN {
library(sjmisc)
age <- abs(round(rnorm(100, 65, 20)))
age.grp <- group_var(age, size = 10)
hist(age)
hist(age.grp)

age.grpvar <- group_labels(age, size = 10)
table(age.grp)
print(age.grpvar)

# histogram with EUROFAMCARE sample dataset
# variable not grouped
library(sjlabelled)
data(efc)
hist(efc$e17age, main = get_label(efc$e17age))

# bar plot with EUROFAMCARE sample dataset
# grouped variable
ageGrp <- group_var(efc$e17age)
ageGrpLab <- group_labels(efc$e17age)
barplot(table(ageGrp), main = get_label(efc$e17age), names.arg = ageGrpLab)

# within a pipe-chain
library(dplyr)
efc %>%
  select(e17age, c12hour, c160age) %>%
  group_var(size = 20)

# create vector with values from 50 to 80
dummy <- round(runif(200, 50, 80))
# labels with grouping starting at lower bound
group_labels(dummy)
# labels with grouping starting at upper bound
group_labels(dummy, right.interval = TRUE)

# also works with grouped data frames
mtcars %>%
  group_var(disp, size = 4, append = FALSE) %>%
  table()

mtcars %>%
  group_by(cyl) %>%
  group_var(disp, size = 4, append = FALSE) %>%
  table()

# }
