
woebin generates optimal binning for numerical, factor and categorical variables using methods including tree-like segmentation or chi-square merge. woebin can also apply custom breakpoints if breaks_list is provided. The default woe is defined as ln(Bad_i/Good_i). If you prefer ln(Good_i/Bad_i), set the argument positive to the value of the negative class, such as '0' or 'good'. If a class has zero frequency when calculating woe, the zero is replaced by 0.99 so that the woe can still be calculated.
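The woe definition above can be illustrated with a minimal base R sketch. It is not the package's implementation: the counts are made up, and it assumes that Bad_i and Good_i denote each bin's share of all bad and all good observations, and that the 0.99 replacement is applied to the raw counts.

```r
# Hypothetical per-bin counts of good and bad observations (illustrative only)
good <- c(120, 80, 0)
bad  <- c(30, 50, 20)

# Assumed handling of a zero-frequency class: replace the zero count with 0.99
good_adj <- ifelse(good == 0, 0.99, good)
bad_adj  <- ifelse(bad == 0, 0.99, bad)

# Assumption: Bad_i and Good_i are the bin's share of all bad / all good cases
bad_distr  <- bad_adj / sum(bad_adj)
good_distr <- good_adj / sum(good_adj)

# Default orientation, ln(Bad_i/Good_i)
woe <- log(bad_distr / good_distr)

# Opposite orientation, ln(Good_i/Bad_i); it is simply the negated value
woe_flipped <- log(good_distr / bad_distr)

print(data.frame(good, bad, woe, woe_flipped))
```

Without the 0.99 replacement, the third bin would yield an infinite woe because its good count is zero.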
woebin(dt, y, x = NULL, var_skip = NULL, breaks_list = NULL,
special_values = NULL, stop_limit = 0.1, count_distr_limit = 0.05,
bin_num_limit = 8, positive = "bad|1", no_cores = NULL,
print_step = 0L, method = "tree", save_breaks_list = NULL,
ignore_const_cols = TRUE, ignore_datetime_cols = TRUE,
check_cate_num = TRUE, replace_blank_na = TRUE, ...)
dt: A data frame with both x (predictor/feature) and y (response/label) variables.
y: Name of the y variable.
x: Names of the x variables. Default is NULL. If x is NULL, all columns except y and var_skip are treated as x variables.
var_skip: Names of variables to skip for binning. Default is NULL.
breaks_list: List of break points. Default is NULL. If it is not NULL, variables are binned based on the provided breaks.
special_values: The values specified in special_values are placed in separate bins. Default is NULL.
stop_limit: Stop binning segmentation when the information value gain ratio is less than stop_limit if using the tree method; or stop bin merging when the minimum chi-square is larger than 'qchisq(1-stop_limit, 1)' if using the chimerge method. Accepted range: 0-0.5; default is 0.1.
count_distr_limit: The minimum count distribution percentage. Accepted range: 0.01-0.2; default is 0.05.
bin_num_limit: Integer. The maximum number of bins. Default is 8.
positive: Value of the positive class. Default is "bad|1".
no_cores: Number of CPU cores for parallel computation. Default is NULL. If no_cores is NULL, it is set to 1 when there are fewer than 10 x variables, and to the number of all CPU cores when there are 10 or more x variables.
print_step: A non-negative integer. Default is 0L. If print_step>0, variable names are printed at every print_step-th iteration. If print_step=0 or no_cores>1, no message is printed.
method: Optimal binning method, either "tree" or "chimerge". Default is "tree". The examples also pass "freq" and "width" for equal-frequency and equal-width binning of numerical variables.
save_breaks_list: A string. The file name used to save breaks_list. Default is NULL.
ignore_const_cols: Logical. Ignore constant columns. Default is TRUE.
ignore_datetime_cols: Logical. Ignore datetime columns. Default is TRUE.
check_cate_num: Logical. Check whether the number of unique values in categorical columns is larger than 50. Too many unique categories might make the binning process slow. Default is TRUE.
replace_blank_na: Logical. Replace blank values with NA. Default is TRUE.
...: Additional parameters.
A list of data frames containing the binning information for each x variable.
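The chimerge stopping rule tied to stop_limit can be sketched in base R as follows. This is only an illustration of the threshold 'qchisq(1-stop_limit, 1)' applied to one pair of adjacent bins; the counts are hypothetical, and the use of chisq.test without continuity correction is an assumption, not necessarily how the package computes the statistic.

```r
# Default stop_limit and the corresponding chi-square threshold (1 degree of freedom)
stop_limit <- 0.1
chisq_threshold <- qchisq(1 - stop_limit, df = 1)

# Hypothetical good/bad counts for two adjacent bins (illustrative only)
adjacent <- matrix(c(40, 10,
                     35, 15),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(bin = c("bin_a", "bin_b"),
                                   class = c("good", "bad")))

# Assumption: plain Pearson chi-square without continuity correction
chisq_stat <- unname(chisq.test(adjacent, correct = FALSE)$statistic)

# Merging continues while the chi-square of the pair stays below the threshold
merge_bins <- chisq_stat < chisq_threshold

cat("threshold:", chisq_threshold, "\n")
cat("statistic:", chisq_stat, "\n")
cat("merge these bins:", merge_bins, "\n")
```

A larger stop_limit lowers the threshold, so merging stops earlier and more bins remain.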
# NOT RUN {
# load the scorecard package and the germancredit data
library(scorecard)
data(germancredit)
# Example I
# binning of two variables in germancredit dataset
# using tree method
bins2_tree = woebin(germancredit, y="creditability",
x=c("credit.amount","housing"), method="tree")
bins2_tree
# }
# NOT RUN {
# using chimerge method
bins2_chi = woebin(germancredit, y="creditability",
x=c("credit.amount","housing"), method="chimerge")
# save breaks_list as an R file
bins2 = woebin(germancredit, y="creditability",
x=c("credit.amount","housing"), save_breaks_list='breaks_list')
# binning by equal frequency or equal width (numerical variables only)
numeric_cols = c("duration.in.month", "credit.amount",
"installment.rate.in.percentage.of.disposable.income", "present.residence.since",
"age.in.years", "number.of.existing.credits.at.this.bank",
"number.of.people.being.liable.to.provide.maintenance.for")
bins_freq = woebin(germancredit, y="creditability", x=numeric_cols, method="freq")
bins_width = woebin(germancredit, y="creditability", x=numeric_cols, method="width")
# Example II
# binning of the germancredit dataset
bins_germ = woebin(germancredit, y = "creditability")
# converting bins_germ into a data frame
# bins_germ_df = data.table::rbindlist(bins_germ)
# Example III
# customizing the breakpoints of binning
library(data.table)
dat = rbind(
germancredit,
data.table(creditability=sample(c("good","bad"),10,replace=TRUE)),
fill=TRUE)
breaks_list = list(
age.in.years = c(26, 35, 37, "Inf%,%missing"),
housing = c("own", "for free%,%rent")
)
special_values = list(
credit.amount = c(2600, 9960, "6850%,%missing"),
purpose = c("education", "others%,%missing")
)
bins_cus_brk = woebin(dat, y="creditability",
x=c("age.in.years","credit.amount","housing","purpose"),
breaks_list=breaks_list, special_values=special_values)
# }
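In Example III, several categories are combined into one bin by joining them with "%,%" in a single string. The base R sketch below shows one way to read such a specification; the names housing_breaks, lookup and sample_values are hypothetical, and this is not the package's internal parsing.

```r
# Category breaks in the style of Example III: "%,%" joins categories that share a bin
housing_breaks <- c("own", "for free%,%rent")

# Split each bin label into its member categories
members <- strsplit(housing_breaks, "%,%", fixed = TRUE)

# Build a lookup from each category to the label of the bin that contains it
lookup <- setNames(rep(housing_breaks, lengths(members)), unlist(members))

# Assign a few sample values to their bins
sample_values <- c("rent", "own", "for free")
assigned <- unname(lookup[sample_values])
print(data.frame(value = sample_values, bin = assigned))
```

Here "rent" and "for free" map to the same bin, while "own" keeps a bin of its own.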