Learn R Programming

OptimalBinningWoE (version 1.0.8)

ob_numerical_udt: Optimal Binning for Numerical Variables using Entropy-Based Partitioning

Description

Implements a supervised binning algorithm that uses Information Gain (Entropy) to identify the most informative initial split points, followed by a bottom-up merging process to satisfy constraints (minimum frequency, monotonicity, max bins).

Although historically referred to as "Unsupervised Decision Trees" in some contexts, this method is strictly **supervised** (uses target variable) and operates **bottom-up** after an initial entropy-based selection of cutpoints. It is particularly effective when the relationship between feature and target is non-linear but highly informative in specific regions.

Usage

ob_numerical_udt(
  feature,
  target,
  min_bins = 3,
  max_bins = 5,
  bin_cutoff = 0.05,
  max_n_prebins = 20,
  laplace_smoothing = 0.5,
  monotonicity_direction = "none",
  convergence_threshold = 1e-06,
  max_iterations = 1000
)

Value

A list containing:

id

Integer bin identifiers (1-based).

bin

Character bin intervals "(lower;upper]".

woe

Numeric WoE values.

iv

Numeric IV contributions.

event_rate

Numeric event rates.

count

Integer total observations.

count_pos

Integer positive class counts.

count_neg

Integer negative class counts.

cutpoints

Numeric bin boundaries.

total_iv

Total Information Value.

gini

Gini index (2*AUC - 1) calculated on the binned data.

ks

Kolmogorov-Smirnov statistic calculated on the binned data.

converged

Logical convergence flag.

iterations

Integer iteration count.

Arguments

feature

Numeric vector of feature values. Missing values (NA) are handled by placing them in a separate bin. Infinite values are treated as valid numeric extremes or placed in the missing bin if they represent errors.

target

Integer vector of binary target values (must contain only 0 and 1). Must have the same length as feature.

min_bins

Minimum number of bins (default: 3). Must be at least 2.

max_bins

Maximum number of bins (default: 5). Must be greater than or equal to min_bins.

bin_cutoff

Minimum fraction of total observations per bin (default: 0.05). Bins below this threshold are merged based on Event Rate similarity.

max_n_prebins

Maximum number of pre-bins (default: 20). The algorithm will select the top max_n_prebins cutpoints with highest Information Gain. Performance Note: High values (>50) may significantly slow down processing for large datasets due to the O(N^2) nature of cutpoint selection.

laplace_smoothing

Laplace smoothing parameter (default: 0.5) for WoE calculation.

monotonicity_direction

String specifying monotonicity constraint:

  • "none" (default): No monotonicity enforcement.

  • "increasing": WoE must be non-decreasing.

  • "decreasing": WoE must be non-increasing.

  • "auto": Automatically determined by Pearson correlation.

convergence_threshold

Convergence threshold for IV optimization (default: 1e-6).

max_iterations

Maximum iterations for optimization loop (default: 1000).

Author

Lopes, J. E.

Details

Algorithm Overview

The UDT algorithm executes in four phases:

Phase 1: Entropy-Based Pre-binning

The algorithm evaluates every possible cutpoint \(c\) (midpoints between sorted unique values) using Information Gain (IG): $$IG(S, c) = H(S) - \left( \frac{|S_L|}{|S|} H(S_L) + \frac{|S_R|}{|S|} H(S_R) \right)$$

The top max_n_prebins cutpoints with the highest IG are selected to form the initial bins. This ensures that the starting bins capture the most discriminative regions of the feature space.

Phase 2: Rare Bin Merging

Bins with frequency \(< \text{bin\_cutoff}\) are merged. The merge partner is chosen to minimize the difference in Event Rates: $$\text{merge\_idx} = \arg\min_{j \in \{i-1, i+1\}} |ER_i - ER_j|$$ This differs from IV-based methods and aims to preserve local risk probability smoothness.

Phase 3: Monotonicity Enforcement

If requested, monotonicity is enforced by iteratively merging bins that violate the specified direction ("increasing", "decreasing", or "auto"). Auto-direction is determined by the sign of the Pearson correlation between feature and target.

Phase 4: Constraint Satisfaction

If \(k > \text{max\_bins}\), bins are merged minimizing IV loss until the constraint is met.

Warning on Complexity

The pre-binning phase evaluates Information Gain for all unique values. For continuous features with many unique values (e.g., \(N > 10,000\)), this step can be computationally intensive (\(O(N^2)\)). Consider rounding or using ob_numerical_sketch for very large datasets.

References

  • Quinlan, J. R. (1986). "Induction of Decision Trees". Machine Learning, 1(1), 81-106.

  • Fayyad, U. M., & Irani, K. B. (1992). "On the Handling of Continuous-Valued Attributes in Decision Tree Generation". Machine Learning, 8, 87-102.

  • Liu, H., et al. (2002). "Discretization: An Enabling Technique". Data Mining and Knowledge Discovery, 6(4), 393-423.

See Also

ob_numerical_mdlp for a pure MDL-based approach, ob_numerical_sketch for fast approximation on large data.