Learn R Programming

OptimalBinningWoE (version 1.0.8)

ob_numerical_sketch: Optimal Binning for Numerical Variables using Sketch-based Algorithm

Description

Implements optimal binning using the **KLL Sketch** (Karnin, Lang, Liberty, 2016), a probabilistic data structure for quantile approximation in data streams. This is the only method in the package that uses a fundamentally different algorithmic approach (streaming algorithms) compared to batch processing methods (MOB, MDLP, etc.).

The sketch-based approach enables:

  • Sublinear space complexity: O(k log N) vs O(N) for batch methods

  • Single-pass processing: Suitable for streaming data

  • Provable approximation guarantees: Quantile error \(\epsilon \approx O(1/k)\)

The method combines KLL Sketch for candidate generation with either Dynamic Programming (for small N <= 50) or greedy IV-based selection (for larger datasets), followed by monotonicity enforcement via the Pool Adjacent Violators Algorithm (PAVA).

Usage

ob_numerical_sketch(
  feature,
  target,
  min_bins = 3,
  max_bins = 5,
  bin_cutoff = 0.05,
  max_n_prebins = 20,
  monotonic = TRUE,
  convergence_threshold = 1e-06,
  max_iterations = 1000,
  sketch_k = 200
)

Value

A list of class c("OptimalBinningSketch", "OptimalBinning") containing:

id

Numeric vector of bin identifiers (1-based indexing).

bin_lower

Numeric vector of lower bin boundaries (inclusive).

bin_upper

Numeric vector of upper bin boundaries (inclusive for last bin, exclusive for others).

woe

Numeric vector of Weight of Evidence values. Monotonic if monotonic = TRUE.

iv

Numeric vector of Information Value contributions per bin.

count

Integer vector of total observations per bin.

count_pos

Integer vector of positive class (target = 1) counts per bin.

count_neg

Integer vector of negative class (target = 0) counts per bin.

cutpoints

Numeric vector of bin split points (length = number of bins - 1). These are the internal boundaries between bins.

converged

Logical flag indicating whether optimization converged.

iterations

Integer number of optimization iterations performed.

Arguments

feature

Numeric vector of feature values. Missing values (NA) are not permitted and will trigger an error. Infinite values (Inf, -Inf) and NaN are also not allowed.

target

Integer vector of binary target values (must contain only 0 and 1). Must have the same length as feature. Missing values are not permitted.

min_bins

Minimum number of bins (default: 3). Must be at least 2.

max_bins

Maximum number of bins (default: 5). Must be >= min_bins.

bin_cutoff

Minimum fraction of total observations per bin (default: 0.05). Must be in (0, 1). Bins with fewer observations will be merged with neighbors.

max_n_prebins

Maximum number of pre-bins to generate from quantiles (default: 20). This parameter controls the initial granularity of binning candidates. Higher values provide more flexibility but increase computational cost.

monotonic

Logical flag to enforce WoE monotonicity (default: TRUE). Uses PAVA (Pool Adjacent Violators Algorithm) for enforcement. Direction (increasing/ decreasing) is automatically detected from the data.

convergence_threshold

Convergence threshold for IV change (default: 1e-6). Optimization stops when the change in total IV between iterations falls below this value.

max_iterations

Maximum iterations for bin optimization (default: 1000). Prevents infinite loops in the optimization process.

sketch_k

Integer parameter controlling sketch accuracy (default: 200). Larger values improve quantile precision but increase memory usage. Approximation error: \(\epsilon \approx 1/k\) (200 → 0.5% error). Valid range: [10, 1000]. Typical values: 50 (fast), 200 (balanced), 500 (precise).

Author

Lopes, J. E.

Details

Algorithm Overview

The sketch-based binning algorithm executes in four phases:

Phase 1: KLL Sketch Construction

The KLL Sketch maintains a compressed, multi-level representation of the data distribution:

$$\text{Sketch} = \{\text{Compactor}_0, \text{Compactor}_1, \ldots, \text{Compactor}_L\}$$

where each \(\text{Compactor}_\ell\) stores items with weight \(2^\ell\). When a compactor exceeds capacity \(k\) (controlled by sketch_k), it is compacted.

Theoretical Guarantees (Karnin et al., 2016):

For a quantile \(q\) with estimated value \(\hat{q}\):

$$|\text{rank}(\hat{q}) - q \cdot N| \le \epsilon \cdot N$$

where \(\epsilon \approx O(1/k)\) and space complexity is \(O(k \log(N/k))\).

Phase 2: Candidate Extraction

Approximately 40 quantiles are extracted from the sketch using a non-uniform grid with higher resolution in distribution tails.

Phase 3: Optimal Cutpoint Selection

For small datasets (N <= 50), Dynamic Programming maximizes total IV. For larger datasets, a greedy IV-based selection is used.

Phase 4: Bin Refinement

Bins are refined through frequency constraint enforcement, monotonicity enforcement (if requested), and bin count optimization to minimize IV loss.

Computational Complexity

  • Time: \(O(N \log k + N \times C + k^2 \times I)\)

  • Space: \(O(k \log N)\) for large N

When to Use Sketch-based Binning

  • Use: Large datasets (N > 10^6) with memory constraints or streaming data

  • Avoid: Small datasets (N < 1000) where approximation error may dominate

References

  • Karnin, Z., Lang, K., & Liberty, E. (2016). "Optimal Quantile Approximation in Streams". Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 71-78. tools:::Rd_expr_doi("10.1109/FOCS.2016.20")

  • Greenwald, M., & Khanna, S. (2001). "Space-efficient online computation of quantile summaries". ACM SIGMOD Record, 30(2), 58-66. tools:::Rd_expr_doi("10.1145/376284.375670")

  • Barlow, R. E., Bartholomew, D. J., Bremner, J. M., & Brunk, H. D. (1972). Statistical Inference Under Order Restrictions. Wiley.

  • Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. Wiley. tools:::Rd_expr_doi("10.1002/9781119201731")

See Also

ob_numerical_mdlp, ob_numerical_mblp

Examples

Run this code
# \donttest{
# Example 1: Basic usage with simulated data
set.seed(123)
feature <- rnorm(500, mean = 100, sd = 20)
target <- rbinom(500, 1, prob = plogis((feature - 100) / 20))

result <- ob_numerical_sketch(
  feature = feature,
  target = target,
  min_bins = 3,
  max_bins = 5
)

# Display results
print(data.frame(
  Bin = result$id,
  Count = result$count,
  WoE = round(result$woe, 4),
  IV = round(result$iv, 4)
))

# Example 2: Comparing different sketch_k values
set.seed(456)
x <- rnorm(1000, 50, 15)
y <- rbinom(1000, 1, prob = 0.3)

result_k50 <- ob_numerical_sketch(x, y, sketch_k = 50)
result_k200 <- ob_numerical_sketch(x, y, sketch_k = 200)

cat("K=50 IV:", sum(result_k50$iv), "\n")
cat("K=200 IV:", sum(result_k200$iv), "\n")
# }

Run the code above in your browser using DataLab