tidybins (version 0.1.0)

bin_cols: Bin Cols

Description

Make bins in a tidy fashion. Adds a column to your data frame containing the integer codes of the bins computed for a specified column. Specifying multiple columns is only intended for supervised binning, so that multiple columns can be binned simultaneously and optimally with respect to a target variable.

Usage

bin_cols(
  .data,
  col,
  n_bins = 10,
  bin_type = "frequency",
  ...,
  target = NULL,
  pretty_labels = FALSE,
  seed = 1,
  method = "mdlp"
)

Arguments

.data

a data frame

col

a column, a vector of columns, or a tidyselect expression

n_bins

number of bins

bin_type

method to make bins

...

params to be passed to selected binning method

target

unquoted column for supervised binning

pretty_labels

logical. If TRUE, returns interval labels rather than integer ranks

seed

seed for stochastic binning (xgboost)

method

method to use when bin_type is "mdlp"

Value

a data frame

Details

Available values for bin_type (abbreviation in parentheses):

  • frequency (fr) creates bins of equal content via quantiles. Wraps bin with method "content". Similar to ntile.

  • width (wi) creates bins of equal numeric width. Wraps bin with method "length".

  • kmeans (km) creates bins using 1-dimensional kmeans. Wraps bin with method "clusters".

  • value (va) creates bins so that each bin has an equal sum of values.

  • xgboost (xg) bins the column by how well it predicts a target column, using step_discretize_xgb.

  • cart (ca) bins the column using step_discretize_cart. If the column does not have enough distinct values, xgboost will fail and automatically revert to this method.

  • woe (wo) bins the column by weight of evidence. Requires a binary target.

  • logreg (lr) bins the column by logistic regression. Requires a binary target.

  • mdlp uses the discretizeDF.supervised algorithm with a variety of methods.
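To make the unsupervised bin types more concrete, here is a minimal base-R sketch of the ideas behind frequency, width, kmeans, and value binning. It is an illustration under stated assumptions, not the code that bin_cols runs: the variable names (freq_bin, width_bin, kmeans_bin, value_bin) are hypothetical, and the package's own breakpoints and tie handling may differ.

```r
# Base-R illustration only; this is NOT the tidybins implementation.
x <- iris$Sepal.Width
n_bins <- 5

# "frequency": equal-content bins, with breakpoints taken from quantiles.
# unique() guards against duplicated breakpoints when x has many ties.
freq_breaks <- unique(quantile(x, probs = seq(0, 1, length.out = n_bins + 1)))
freq_bin <- cut(x, breaks = freq_breaks, include.lowest = TRUE, labels = FALSE)

# "width": equal-width bins spanning the range of x.
width_breaks <- seq(min(x), max(x), length.out = n_bins + 1)
width_bin <- cut(x, breaks = width_breaks, include.lowest = TRUE, labels = FALSE)

# "kmeans": 1-dimensional k-means; clusters are relabelled so that
# bin 1 has the smallest centre and bin n_bins the largest.
set.seed(1)
km <- kmeans(x, centers = n_bins)
kmeans_bin <- as.integer(unname(rank(km$centers[, 1])[km$cluster]))

# "value": sort x, then split the running total into n_bins equal shares,
# so each bin holds roughly the same sum of values.
ord <- order(x)
share <- cumsum(x[ord]) / sum(x)
value_bin <- integer(length(x))
value_bin[ord] <- pmin(ceiling(share * n_bins), n_bins)

# Compare how many observations each scheme puts in each bin.
table(freq_bin)
table(width_bin)
table(kmeans_bin)
table(value_bin)
```

Frequency bins hold similar row counts, width bins cover similar numeric ranges, and value bins hold similar totals, which is why the same column can yield quite different codes depending on bin_type.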

Examples

# NOT RUN {
library(tidybins)
library(dplyr) # provides the %>% pipe

# Chain unsupervised and supervised binning, then store the result.
iris %>%
  bin_cols(Sepal.Width, n_bins = 5, pretty_labels = TRUE) %>%
  bin_cols(Petal.Width, n_bins = 3, bin_type = c("width", "kmeans")) %>%
  bin_cols(Sepal.Width, bin_type = "xgboost", target = Species, seed = 1) -> iris1

# Binned columns are named: original name + method abbreviation + number of bins created.
# The actual number of bins can be less than n_bins if the column lacks enough variance.
iris1 %>%
  print(width = Inf)

# Summarise the bins that were created.
iris1 %>%
  bin_summary() %>%
  print(width = Inf)
# }