Performs supervised discretization of continuous numerical variables using the theoretical framework proposed by Zeng (2013). This method creates bins that maximize a specified divergence measure (e.g., Kullback-Leibler, Hellinger) between the distributions of positive and negative cases, effectively maximizing the Information Value (IV) or other discriminatory statistics.
ob_numerical_dmiv(
feature,
target,
min_bins = 3,
max_bins = 5,
bin_cutoff = 0.05,
max_n_prebins = 20,
is_monotonic = TRUE,
convergence_threshold = 1e-06,
max_iterations = 1000,
bin_method = c("woe1", "woe"),
divergence_method = c("l2", "he", "kl", "tr", "klj", "sc", "js", "l1", "ln")
)A list containing the binning results:
id: Integer vector of bin identifiers.
bin: Character vector of bin labels in interval notation.
woe: Numeric vector of Weight of Evidence for each bin.
divergence: Numeric vector of the chosen divergence contribution per bin.
count: Integer vector of total observations per bin.
count_pos: Integer vector of positive cases.
count_neg: Integer vector of negative cases.
cutpoints: Numeric vector of upper boundaries (excluding Inf).
total_divergence: The sum of the divergence measure across all bins.
bin_method: The WoE calculation method used.
divergence_method: The divergence measure used.
A numeric vector representing the continuous predictor variable. Missing values (NA) are excluded during the pre-binning phase.
An integer vector of binary outcomes (0/1) corresponding to
each observation in feature. Must have the same length as feature.
Integer. The minimum number of bins to produce. Must be \(\ge\) 2. Defaults to 3.
Integer. The maximum number of bins to produce. Must be \(\ge\)
min_bins. Defaults to 5.
Numeric. The minimum fraction of total observations required
for a bin to be considered valid. Bins with frequency < bin_cutoff
will be merged. Value must be in (0, 1). Defaults to 0.05.
Integer. The number of initial quantiles to generate during the pre-binning phase. Defaults to 20.
Logical. If TRUE, the algorithm enforces a strict
monotonic relationship (increasing or decreasing) between the bin indices
and their WoE values. Defaults to TRUE.
Numeric. The threshold for the change in total divergence to determine convergence during the iterative merging process. Defaults to 1e-6.
Integer. Safety limit for the maximum number of merging iterations. Defaults to 1000.
Character string specifying the formula for Weight of Evidence calculation:
"woe": Standard definition \(\ln((p_i/P) / (n_i/N))\).
"woe1": Zeng's definition \(\ln(p_i / n_i)\) (direct log odds).
Defaults to "woe1".
Character string specifying the divergence measure to maximize. Available options:
"iv": Information Value (conceptually similar to KL).
"he": Hellinger Distance.
"kl": Kullback-Leibler Divergence.
"tr": Triangular Discrimination.
"klj": Jeffrey's Divergence (Symmetric KL).
"sc": Symmetric Chi-Square Divergence.
"js": Jensen-Shannon Divergence.
"l1": Manhattan Distance (L1 Norm).
"l2": Euclidean Distance (L2 Norm).
"ln": Chebyshev Distance (L-infinity Norm).
Defaults to "l2".
This algorithm implements the "Metric Divergence Measures" framework. Unlike standard ChiMerge which uses statistical significance, this method uses a branch-and-bound approach to minimize the loss of a specific divergence metric when merging bins.
The Process:
Pre-binning: Generates granular bins based on quantiles.
Rare Merging: Merges bins smaller than bin_cutoff.
Monotonicity: If is_monotonic = TRUE, forces the WoE trend
to be monotonic by merging "violating" bins in the direction that
maximizes the total divergence.
Optimization: Iteratively merges the pair of adjacent bins that
results in the smallest loss of total divergence, until max_bins
is reached.
Zeng, G. (2013). Metric Divergence Measures and Information Value in Credit Scoring. Journal of the Operational Research Society, 64(5), 712-731.
# Example using the "he" (Hellinger) distance
set.seed(123)
feature <- rnorm(1000)
target <- rbinom(1000, 1, plogis(feature))
result <- ob_numerical_dmiv(feature, target,
min_bins = 3,
max_bins = 5,
divergence_method = "he",
bin_method = "woe"
)
print(result$bin)
print(result$divergence)
print(paste("Total Hellinger Distance:", round(result$total_divergence, 4)))
Run the code above in your browser using DataLab