Learn R Programming

OptimalBinningWoE (version 1.0.8)

ob_categorical_cm: Optimal Binning for Categorical Variables using Enhanced ChiMerge Algorithm

Description

Performs supervised discretization of categorical variables using an enhanced implementation of the ChiMerge algorithm (Kerber, 1992) with optional Chi2 extension (Liu & Setiono, 1995). This method optimally groups categorical levels based on their relationship with a binary target variable to maximize predictive power while maintaining statistical significance.

Usage

ob_categorical_cm(
  feature,
  target,
  min_bins = 3,
  max_bins = 5,
  bin_cutoff = 0.05,
  max_n_prebins = 20,
  bin_separator = "%;%",
  convergence_threshold = 1e-06,
  max_iterations = 1000,
  chi_merge_threshold = 0.05,
  use_chi2_algorithm = FALSE
)

Value

A list containing binning results with the following components:

  • id: Integer vector of bin identifiers (1:n_bins)

  • bin: Character vector of bin labels (merged category names)

  • woe: Numeric vector of Weight of Evidence for each bin

  • iv: Numeric vector of Information Value contribution per bin

  • count: Integer vector of total observations per bin

  • count_pos: Integer vector of positive cases per bin

  • count_neg: Integer vector of negative cases per bin

  • converged: Logical indicating if algorithm converged

  • iterations: Integer count of algorithm iterations performed

  • algorithm: Character string identifying algorithm used

  • warnings: Character vector of any warnings encountered

  • metadata: List with additional diagnostic information:

    • total_iv: Total Information Value of the binned variable

    • n_bins: Final number of bins produced

    • unique_categories: Number of unique input categories

    • total_obs: Total number of observations processed

    • execution_time_ms: Processing time in milliseconds

    • monotonic: Direction of WoE monotonicity ("increasing"/"decreasing")

Arguments

feature

A character vector or factor representing the categorical predictor variable to be binned.

target

An integer vector of binary outcomes (0/1) corresponding to each observation in feature.

min_bins

Integer. Minimum number of bins to produce. Must be >= 2. Defaults to 3.

max_bins

Integer. Maximum number of bins to produce. Must be >= min_bins. Defaults to 5.

bin_cutoff

Numeric. Threshold for treating categories as rare. Categories with frequency < bin_cutoff will be merged with their most similar neighbors. Value must be in (0, 1). Defaults to 0.05.

max_n_prebins

Integer. Maximum number of initial pre-bins before merging. Controls computational complexity. Must be >= 2. Defaults to 20.

bin_separator

String. Separator used when combining multiple categories into a single bin label. Defaults to "%;%".

convergence_threshold

Numeric. Convergence tolerance for iterative merging process. Smaller values require stricter convergence. Must be > 0. Defaults to 1e-6.

max_iterations

Integer. Maximum iterations for the merging algorithm. Prevents infinite loops. Must be > 0. Defaults to 1000.

chi_merge_threshold

Numeric. Statistical significance level (p-value) for chi-square tests during merging. Higher values create fewer bins. Value must be in (0, 1). Defaults to 0.05.

use_chi2_algorithm

Logical. If TRUE, uses the Chi2 variant which performs multi-pass merging with decreasing significance thresholds. Defaults to FALSE.

Author

Developed as part of the OptimalBinningWoE package

Details

The algorithm implements two main approaches:

1. Standard ChiMerge: Iteratively merges adjacent bins with lowest chi-square statistics until all remaining pairs are statistically distinguishable at the specified significance level.

2. Chi2 Algorithm (when use_chi2_algorithm = TRUE): Performs multiple passes with decreasing significance thresholds (0.5 → 0.001), creating more robust binning structures particularly for noisy data.

Key features include:

  • Rare category handling through pre-merging

  • Monotonicity enforcement of Weight of Evidence

  • Numerical stability with underflow protection

  • Efficient chi-square caching for performance

  • Comprehensive input validation and error handling

Information Value interpretation:

  • < 0.02: Predictive power not useful

  • 0.02-0.1: Weak predictive power

  • 0.1-0.3: Medium predictive power

  • 0.3-0.5: Strong predictive power

  • > 0.5: Suspiciously high (potential overfitting)

References

Kerber, R. (1992). ChiMerge: Discretization of numeric attributes. In Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 123-128).

Liu, B., & Setiono, R. (1995). Chi2: Feature selection and discretization of numeric attributes. In Proceedings of the Seventh IEEE International Conference on Tools with Artificial Intelligence (pp. 372-377).

Examples

Run this code
# Example 1: Basic usage with synthetic data
set.seed(123)
n <- 1000
categories <- c("A", "B", "C", "D", "E", "F", "G", "H")
feature <- sample(categories, n, replace = TRUE, prob = c(
  0.2, 0.15, 0.15,
  0.1, 0.1, 0.1,
  0.1, 0.1
))
# Create target with some association to categories
probs <- c(0.3, 0.4, 0.5, 0.6, 0.7, 0.75, 0.8, 0.85) # increasing probability
target <- sapply(seq_along(feature), function(i) {
  cat_idx <- which(categories == feature[i])
  rbinom(1, 1, probs[cat_idx])
})

result <- ob_categorical_cm(feature, target)
print(result[c("bin", "woe", "iv", "count")])

# View metadata
print(paste("Total IV:", round(result$metadata$total_iv, 3)))
print(paste("Algorithm converged:", result$converged))

# Example 2: Using Chi2 algorithm for more conservative binning
result_chi2 <- ob_categorical_cm(feature, target,
  use_chi2_algorithm = TRUE,
  max_bins = 6
)

# Compare number of bins
cat("Standard ChiMerge bins:", result$metadata$n_bins, "\n")
cat("Chi2 algorithm bins:", result_chi2$metadata$n_bins, "\n")

Run the code above in your browser using DataLab