ob_categorical_dmiv: Optimal Binning for Categorical Variables using Divergence Measures

Description

Performs supervised discretization of categorical variables using a divergence-based hierarchical merging algorithm. This implementation supports multiple information-theoretic and metric divergence measures as described by Zeng (2013), enabling flexible optimization of binning structures for credit scoring and binary classification tasks.

Usage

ob_categorical_dmiv(
  feature,
  target,
  min_bins = 3,
  max_bins = 5,
  bin_cutoff = 0.05,
  max_n_prebins = 20,
  bin_separator = "%;%",
  convergence_threshold = 1e-06,
  max_iterations = 1000,
  bin_method = "woe1",
  divergence_method = "l2"
)

Value

A list containing the binning results with the following components:

id: Integer vector of bin identifiers (1-indexed)
bin: Character vector of bin labels (merged category names)
woe: Numeric vector of Weight of Evidence values per bin
divergence: Numeric vector of divergence contribution per bin
count: Integer vector of total observations per bin
count_pos: Integer vector of positive cases (target=1) per bin
count_neg: Integer vector of negative cases (target=0) per bin
converged: Logical indicating algorithm convergence
iterations: Integer count of merge operations performed
total_divergence: Numeric total divergence of the binning solution
bin_method: Character string of WoE method used
divergence_method: Character string of divergence measure used

Arguments

feature

A character vector or factor representing the categorical predictor variable to be binned. Missing values are automatically converted to the category "NA".

target

An integer vector of binary outcomes (0/1) corresponding to each observation in feature. Missing values are not permitted.

min_bins

Integer. Minimum number of bins to produce. Must be >= 2. If the final number of bins after merging falls below this threshold, the algorithm will attempt to split bins. Defaults to 3.

max_bins

Integer. Maximum number of bins to produce. Must be >= min_bins. The algorithm performs hierarchical merging until this constraint is satisfied. Defaults to 5.

bin_cutoff

Numeric. Frequency threshold for rare category handling. Categories with relative frequency below this value are candidates for pre-binning. Must be in (0, 1). Defaults to 0.05.

max_n_prebins

Integer. Maximum number of initial bins before the main merging phase. When unique categories exceed this limit, rare categories are pre-merged into an "other" bin. Must be >= 2. Defaults to 20.

bin_separator

Character string used to concatenate category names when multiple categories are merged into a single bin. Defaults to "%;%".

convergence_threshold

Numeric. Convergence tolerance for the iterative merging process. Merging stops when the change in minimum divergence between iterations falls below this threshold. Must be > 0. Defaults to 1e-6.

max_iterations

Integer. Maximum number of merge operations allowed. Prevents infinite loops in edge cases. Must be > 0. Defaults to 1000.

bin_method

Character string specifying the Weight of Evidence calculation method. Must be one of:

"woe": Traditional WoE: \(\ln\left(\frac{p_i/P}{n_i/N}\right)\)

"woe1"

Smoothed WoE (Zeng): \(\ln\left(\frac{g_i + 0.5}{b_i + 0.5}\right)\)

The smoothed variant provides numerical stability for sparse bins. Defaults to "woe1".

divergence_method

Character string specifying the divergence measure used for determining bin similarity. Must be one of:

"he": Hellinger Distance: \(\sum(\sqrt{p_i} - \sqrt{n_i})^2\)

"kl"

Symmetrized Kullback-Leibler Divergence

"klj"

Jeffreys J-Divergence: \((p-n)\ln(p/n)\)

"tr"

Triangular Discrimination: \((p-n)^2/(p+n)\)

"sc"

Symmetric Chi-Square: \((p-n)^2(p+n)/(pn)\)

"js"

Jensen-Shannon Divergence

"l1"

L1 Metric (Manhattan Distance): \(|p-n|\)

"l2"

L2 Metric (Euclidean Distance): \(\sqrt{\sum(p-n)^2}\)

"ln"

L-infinity Metric (Chebyshev Distance): \(\max|p-n|\)

Defaults to "l2".

Details

The algorithm implements a hierarchical agglomerative approach where bins are iteratively merged based on minimum pairwise divergence until the max_bins constraint is satisfied or convergence is achieved.

Algorithm Workflow:

Input validation and frequency computation
Pre-binning of rare categories (if unique categories > max_n_prebins)
Initialization of pairwise divergence matrix
Iterative merging of most similar bin pairs
Splitting of heterogeneous bins (if bins < min_bins)
Final metric computation and WoE-based sorting

Divergence Measure Selection: The choice of divergence measure affects the binning structure:

Information-theoretic measures ("kl", "js", "klj"): Emphasize distributional differences; sensitive to rare events
Metric measures ("l1", "l2", "ln"): Provide geometric interpretation; robust to outliers
Chi-square family ("sc", "tr"): Balance between information content and robustness
Hellinger distance ("he"): Bounded measure; suitable for probability distributions

Pre-binning Strategy: When the number of unique categories exceeds max_n_prebins, categories with fewer than 5 observations are aggregated into a special "PREBIN_OTHER" bin to control computational complexity.

References

Zeng, G. (2013). Metric Divergence Measures and Information Value in Credit Scoring. Journal of Mathematics, 2013, Article ID 848271. tools:::Rd_expr_doi("10.1155/2013/848271")

Kullback, S., & Leibler, R. A. (1951). On Information and Sufficiency. The Annals of Mathematical Statistics, 22(1), 79-86.

Lin, J. (1991). Divergence Measures Based on the Shannon Entropy. IEEE Transactions on Information Theory, 37(1), 145-151.

Examples

Run this code

# \donttest{
# Example 1: Basic usage with synthetic credit data
set.seed(42)
n <- 1000

# Simulate occupation categories with varying default rates
occupations <- c(
  "Engineer", "Doctor", "Teacher", "Sales",
  "Manager", "Clerk", "Other"
)
default_probs <- c(0.05, 0.03, 0.08, 0.15, 0.07, 0.12, 0.20)

feature <- sample(occupations, n,
  replace = TRUE,
  prob = c(0.15, 0.10, 0.20, 0.18, 0.12, 0.15, 0.10)
)
target <- sapply(feature, function(x) {
  rbinom(1, 1, default_probs[which(occupations == x)])
})

# Apply optimal binning with L2 divergence
result <- ob_categorical_dmiv(feature, target,
  min_bins = 2,
  max_bins = 4,
  divergence_method = "l2"
)

# Examine binning results
print(data.frame(
  bin = result$bin,
  woe = round(result$woe, 3),
  count = result$count,
  event_rate = round(result$count_pos / result$count, 3)
))

# Example 2: Comparing divergence methods
result_js <- ob_categorical_dmiv(feature, target,
  divergence_method = "js",
  max_bins = 4
)
result_kl <- ob_categorical_dmiv(feature, target,
  divergence_method = "kl",
  max_bins = 4
)

cat("Jensen-Shannon bins:", length(result_js$bin), "\n")
cat("Kullback-Leibler bins:", length(result_kl$bin), "\n")

# Example 3: High cardinality feature with pre-binning
set.seed(123)
postal_codes <- paste0("ZIP_", sprintf("%03d", 1:50))
feature_high_card <- sample(postal_codes, 2000, replace = TRUE)
target_high_card <- rbinom(2000, 1, 0.1)

result_prebin <- ob_categorical_dmiv(
  feature_high_card,
  target_high_card,
  max_n_prebins = 15,
  max_bins = 5
)

cat("Final bins after pre-binning:", length(result_prebin$bin), "\n")
cat("Algorithm converged:", result_prebin$converged, "\n")
# }

Run the code above in your browser using DataLab