scorecard (version 0.2.2)

woebin: WOE Binning

Description

woebin generates optimal binning for numerical, factor and categorical variables using methods including tree-like segmentation or chi-square merge. woebin can also use custom breakpoints if breaks_list is provided. The default woe is defined as ln(Bad_i/Good_i). If you prefer ln(Good_i/Bad_i), set the argument positive to the value of the negative class, such as '0' or 'good'. If there is a zero-frequency class when calculating woe, the zero is replaced by 0.99 so that the woe can be calculated.
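The woe definition above can be illustrated with a small base R sketch. It is not the package's internal code: the counts are made up, and it assumes that Bad_i and Good_i denote the share of all bad (respectively good) observations falling in bin i, which is the usual convention. Computing the totals from the zero-adjusted counts is also an assumption.

```r
# Toy counts per bin (made-up numbers, for illustration only)
good <- c(80, 60, 0)
bad  <- c(20, 40, 10)

# Replace zero counts with 0.99 so that the logarithm is defined
good_adj <- ifelse(good == 0, 0.99, good)
bad_adj  <- ifelse(bad == 0, 0.99, bad)

# Assumption: Good_i and Bad_i are the shares of all goods / all bads in bin i
good_distr <- good_adj / sum(good_adj)
bad_distr  <- bad_adj / sum(bad_adj)

# Default orientation: ln(Bad_i / Good_i)
woe_default <- log(bad_distr / good_distr)

# Alternative orientation: ln(Good_i / Bad_i), which flips the sign
woe_flipped <- log(good_distr / bad_distr)

print(data.frame(good, bad, woe_default, woe_flipped))
```

With the default orientation, bins with a relatively high share of bad observations get a positive woe; switching the positive class only changes the sign.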

Usage

woebin(dt, y, x = NULL, breaks_list = NULL, special_values = NULL,
  stop_limit = 0.1, count_distr_limit = 0.05, bin_num_limit = 8,
  positive = "bad|1", no_cores = NULL, print_step = 0L,
  method = "tree", save_breaks_list = NULL, ignore_const_cols = TRUE,
  ignore_datetime_cols = TRUE, check_cate_num = TRUE,
  replace_blank_na = TRUE, ...)

Arguments

dt

A data frame with both x (predictor/feature) and y (response/label) variables.

y

Name of y variable.

x

Name of x variables. Default is NULL. If x is NULL, then all columns except y are counted as x variables.

breaks_list

List of break points. Default is NULL. If it is not NULL, variable binning will be based on the provided breaks.

special_values

The values specified in special_values will be placed in separate bins. Default is NULL.

stop_limit

Stop binning segmentation when the information value gain ratio is less than stop_limit if using the tree method, or stop bin merging when the minimum chi-square is less than 'qchisq(1-stop_limit, 1)' if using the chimerge method. Accepted range: 0-0.5; default is 0.1.
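For the chimerge method, the chi-square threshold implied by a given stop_limit can be checked directly with base R's qchisq. The snippet below only evaluates the expression quoted above; it does not call woebin.

```r
# Chi-square threshold (1 degree of freedom) implied by the default stop_limit
stop_limit <- 0.1
chisq_threshold <- qchisq(1 - stop_limit, df = 1)
chisq_threshold  # roughly 2.71

# A smaller stop_limit gives a larger threshold
thresholds <- sapply(c(0.01, 0.05, 0.1, 0.5),
                     function(s) qchisq(1 - s, df = 1))
thresholds
```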

count_distr_limit

The minimum count distribution percentage. Accepted range: 0.01-0.2; default is 0.05.

bin_num_limit

Integer. The maximum number of bins. Default is 8.

positive

Value of positive class, default "bad|1".

no_cores

Number of CPU cores for parallel computation. Default is NULL. If no_cores is NULL, it is set to 1 when there are fewer than 10 x variables, and to the number of all CPU cores when there are 10 or more x variables.
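The default rule for no_cores can be sketched as follows. default_no_cores is a hypothetical helper written for this illustration, not a function of the scorecard package, and using parallel::detectCores() to obtain the number of CPU cores is an assumption.

```r
# Hypothetical helper (not part of scorecard) mirroring the rule described above
default_no_cores <- function(x_names, all_cores = parallel::detectCores()) {
  # Fewer than 10 x variables: run on a single core
  if (length(x_names) < 10) {
    return(1L)
  }
  # 10 or more x variables: use all available cores
  as.integer(all_cores)
}

few  <- default_no_cores(c("credit.amount", "housing"), all_cores = 8L)
many <- default_no_cores(paste0("x", 1:12), all_cores = 8L)
```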

print_step

A non-negative integer. Default is 0, as shown in Usage. If print_step>0, variable names are printed at every print_step-th iteration. If print_step=0 or no_cores>1, no message is printed.

method

Optimal binning method; it should be "tree" or "chimerge". Default is "tree".

save_breaks_list

A string. The file name to save breaks_list. Default is NULL.

ignore_const_cols

Logical. Ignore constant columns. Default is TRUE.

ignore_datetime_cols

Logical. Ignore datetime columns. Default is TRUE.

check_cate_num

Logical. Check whether categorical columns have more than 50 unique values. Default is TRUE.

replace_blank_na

Logical. Replace blank values with NA. Default is TRUE.

...

Additional parameters.

Value

A list of data frames containing the binning information for each x variable.
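The shape of the return value can be mimicked with a mock list, which also shows one way to stack the per-variable tables into a single data frame using base R. The list below is a stand-in: the column names and numbers are placeholders rather than real woebin output, and it assumes the list elements are named after the x variables.

```r
# Mock of the return structure: one data frame per x variable.
# Column names and values are placeholders, not real woebin output.
bins_mock <- list(
  credit.amount = data.frame(variable = "credit.amount",
                             bin = c("[-Inf,4000)", "[4000, Inf)"),
                             woe = c(-0.2, 0.4),
                             stringsAsFactors = FALSE),
  housing = data.frame(variable = "housing",
                       bin = c("own", "for free%,%rent"),
                       woe = c(-0.19, 0.42),
                       stringsAsFactors = FALSE)
)

# Binning table of a single variable
bins_mock[["housing"]]

# Stack all tables into one data frame
bins_df <- do.call(rbind, bins_mock)
rownames(bins_df) <- NULL
bins_df
```

The commented data.table::rbindlist call in Example II below does the same stacking on the real output.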

See Also

woebin_ply, woebin_plot, woebin_adj

Examples

# NOT RUN {
# load germancredit data
data(germancredit)

# Example I
# binning of two variables in germancredit dataset
# using tree method
bins2_tree = woebin(germancredit, y="creditability",
   x=c("credit.amount","housing"), method="tree")
bins2_tree

# using chimerge method
bins2_chi = woebin(germancredit, y="creditability",
   x=c("credit.amount","housing"), method="chimerge")

# save breaks_list as an R file
bins2 = woebin(germancredit, y="creditability",
   x=c("credit.amount","housing"), save_breaks_list='breaks_list')


# Example II
# binning of the germancredit dataset
bins_germ = woebin(germancredit, y = "creditability")
# converting bins_germ into a dataframe
# bins_germ_df = data.table::rbindlist(bins_germ)

# Example III
# customizing the breakpoints of binning
library(data.table)
dat = rbind(
  germancredit,
  data.table(creditability=sample(c("good","bad"),10,replace=TRUE)),
  fill=TRUE)

breaks_list = list(
  age.in.years = c(26, 35, 37, "Inf%,%missing"),
  housing = c("own", "for free%,%rent")
)

special_values = list(
  credit.amount = c(2600, 9960, "6850%,%missing"),
  purpose = c("education", "others%,%missing")
)

bins_cus_brk = woebin(dat, y="creditability",
  x=c("age.in.years","credit.amount","housing","purpose"),
  breaks_list=breaks_list, special_values=special_values)

# }