DataExplorer (version 0.5.0)

group_category: Group categories for discrete features

Description

Sometimes discrete features have sparse categories. This function will group the sparse categories for a discrete feature based on a given threshold.

Usage

group_category(data, feature, threshold, measure, update = FALSE,
  category_name = "OTHER", exclude = NULL)

Arguments

data

input data, in either data.frame or data.table format.

feature

name of the discrete feature to be collapsed.

threshold

the proportion of bottom categories to be grouped, given as a number between 0 and 1, e.g., if set to 0.2, categories with cumulative frequency in the bottom 20% will be grouped.

measure

name of the feature to be used as an alternative measure in place of category frequency.

update

logical, indicating if the data should be modified. Setting to TRUE will modify the input data directly, and will only work with data.table. The default is FALSE.

category_name

name of the new category if update is set to TRUE. The default is "OTHER".

exclude

categories to be excluded from grouping when update is set to TRUE.
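
The way threshold, measure, category_name and exclude interact can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: sketch_group() and its weights argument are hypothetical names, and the rule assumed here is that categories are ranked from largest to smallest and kept while their cumulative share stays within 1 - threshold. The package's handling of ties and boundary cases may differ.

```r
# Hypothetical helper, not part of DataExplorer: a base R sketch of
# threshold-based grouping under the assumptions stated above.
sketch_group <- function(x, threshold, weights = NULL,
                         category_name = "OTHER", exclude = NULL) {
  x <- as.character(x)
  # Without an alternative measure, every row counts once
  if (is.null(weights)) weights <- rep(1, length(x))
  # Total per category: row counts, or the sum of the alternative measure
  totals <- vapply(split(weights, x), sum, numeric(1))
  totals <- sort(totals, decreasing = TRUE)
  # Cumulative share, starting from the largest category
  cum_share <- cumsum(totals) / sum(totals)
  # Assumed rule: keep categories within the top (1 - threshold) share
  keep <- names(cum_share)[cum_share <= 1 - threshold]
  # Excluded categories are never collapsed
  keep <- union(keep, exclude)
  x[!(x %in% keep)] <- category_name
  x
}

x <- c(rep("c1", 25), rep("c2", 10), "c3", "c4")
table(sketch_group(x, 0.2))
table(sketch_group(x, 0.8, exclude = c("c3", "c4")))
```

In the second call, no category fits within the top 20% share, so everything except the excluded "c3" and "c4" is collapsed into "OTHER".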

Value

If update is set to FALSE, returns categories with cumulative frequency less than the input threshold. The output class will match the class of input data.

Details

If a continuous feature is passed to the argument feature, it will be coerced to character class.
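
A minimal base R sketch of what character coercion means for a numeric feature. The vector x is made-up illustration data, and the snippet does not call the package itself.

```r
# A numeric feature treated as discrete via character coercion
x <- c(1.5, 2, 2, 10)
x_chr <- as.character(x)

class(x_chr)  # "character"
table(x_chr)  # each distinct value becomes its own category
```

After coercion the values behave as labels rather than numbers, so they order lexically ("10" sorts before "2") and arithmetic on them no longer applies.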

Examples

# NOT RUN {
# load packages
library(data.table)
library(DataExplorer)  # provides group_category() and plot_bar()

# generate data
data <- data.table("a" = as.factor(round(rnorm(500, 10, 5))), "b" = rexp(500, 1:500))

# view cumulative frequency without collapsing categories
group_category(data, "a", 0.2)

# view cumulative frequency based on another measure
group_category(data, "a", 0.2, measure = "b")

# group bottom 20% categories based on cumulative frequency
group_category(data, "a", 0.2, update = TRUE)
plot_bar(data)

# exclude categories from being grouped
dt <- data.table("a" = c(rep("c1", 25), rep("c2", 10), "c3", "c4"))
group_category(dt, "a", 0.8, update = TRUE, exclude = c("c3", "c4"))
plot_bar(dt)
# }
