obwoe_gains: Gains Table Statistics for Credit Risk Scorecard Evaluation

Description

Computes a comprehensive gains table (also known as a lift table or decile analysis) for evaluating the discriminatory power of credit scoring models and optimal binning transformations. The gains table is a fundamental tool in credit risk management for model validation, cutoff selection, and regulatory reporting (Basel II/III, IFRS 9).

This function accepts three types of input:

An "obwoe" object from obwoe (uses stored binning)
A data.frame from obwoe_apply (uses bin/WoE columns)
Any data.frame with a grouping variable (e.g., score deciles)

Usage

obwoe_gains(
  obj,
  target = NULL,
  feature = NULL,
  use_column = c("auto", "bin", "woe", "direct"),
  sort_by = c("id", "woe", "event_rate", "bin"),
  n_groups = NULL
)

Value

An S3 object of class "obwoe_gains" containing:

table

Data frame with 18 statistics per bin (see Details)

metrics

Named list of global performance metrics:

ks

Kolmogorov-Smirnov statistic (%)

gini

Gini coefficient (%)

auc

Area Under ROC Curve

total_iv

Total Information Value

ks_bin

Bin where maximum KS occurs

feature

Feature/variable name analyzed

n_bins

Number of bins/groups

n_obs

Total observations

event_rate

Overall event rate

Arguments

obj

Input object: an "obwoe" object, a data.frame from obwoe_apply, or any data.frame containing a grouping variable and target values.

target

Integer vector of binary target values (0/1) or the name of the target column in obj. Required for data.frame inputs. For "obwoe" objects, the target is extracted automatically.

feature

Character string specifying the feature/variable to analyze. For "obwoe" objects: defaults to the feature with highest IV. For data.frame objects: can be any column name representing groups (e.g., "age_bin", "age_woe", "score_decile").

use_column

Character string specifying which column type to use when obj is a data.frame from obwoe_apply:

"bin"

Use the <feature>_bin column

"woe"

Use the <feature>_woe column (groups by WoE values)

"auto"

Automatically detect: use _bin if available (default)

"direct"

Use the feature column name directly (for any variable)

sort_by

Character string specifying sort order for bins:

"woe"

Descending WoE (highest risk first)

"event_rate"

Descending event rate

"bin"

Alphabetical/natural order

n_groups

Integer. For continuous variables (e.g., scores), the number of quantile groups to create. Default is NULL (use existing groups). Set to 10 for standard decile analysis.

Details

Gains Table Construction

The gains table is constructed by:

Sorting observations by risk score or WoE (highest risk first)
Grouping into bins (pre-defined or created via quantiles)
Computing bin-level and cumulative statistics

The table enables assessment of a model's rank-ordering ability: a model that discriminates well should show event rates that rise monotonically as the risk score increases.
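The three construction steps can be sketched in base R. This is only an illustration on simulated data, not the package's internal implementation; the variable names (score, target, tab) and the choice of five quantile bins are assumptions made for the example.

```r
# Simulated data (illustrative only): a risk score and a binary target.
set.seed(1)
score  <- runif(200)
target <- rbinom(200, 1, score)   # higher score -> higher event probability

# Group observations into 5 quantile bins of the score.
breaks <- quantile(score, probs = seq(0, 1, 0.2))
bin    <- cut(score, breaks = breaks, include.lowest = TRUE)

# Bin-level counts, in the natural (ascending) order of the bins.
tab <- data.frame(
  bin       = levels(bin),
  count     = as.vector(table(bin)),
  pos_count = as.vector(tapply(target, bin, sum))
)

# Put the highest-risk bin first, then add rates and cumulative shares.
tab <- tab[rev(seq_len(nrow(tab))), ]
rownames(tab) <- NULL
tab$neg_count   <- tab$count - tab$pos_count
tab$pos_rate    <- tab$pos_count / tab$count
tab$cum_pos_pct <- cumsum(tab$pos_count) / sum(tab$pos_count)
tab$cum_neg_pct <- cumsum(tab$neg_count) / sum(tab$neg_count)

print(tab)
```

With a score that genuinely drives the target, pos_rate should fall from the first row to the last, which is the rank-ordering pattern described above.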

Bin-Level Statistics (18 metrics)

Column	Formula	Description
`bin`	-	Bin label or interval
`count`	$n_i$	Total observations in bin
`count_pct`	$n_i / N$	Proportion of total population
`pos_count`	$n_{i,1}$	Event count (Bad, target=1)
`neg_count`	$n_{i,0}$	Non-event count (Good, target=0)
`pos_rate`	$n_{i,1} / n_i$	Event rate (Bad rate) in bin
`neg_rate`	$n_{i,0} / n_i$	Non-event rate (Good rate)
`pos_pct`	$n_{i,1} / N_1$	Distribution of events
`neg_pct`	$n_{i,0} / N_0$	Distribution of non-events
`odds`	$n_{i,1} / n_{i,0}$	Odds of event
`log_odds`	$\ln(\text{odds})$	Log-odds (logit)
`woe`	$\ln(p_i / q_i)$	Weight of Evidence, with $p_i$ = `pos_pct` and $q_i$ = `neg_pct`
`iv`	$(p_i - q_i) \cdot WoE_i$	Information Value contribution
`cum_pos_pct`	$\sum_{j \le i} p_j$	Cumulative events captured
`cum_neg_pct`	$\sum_{j \le i} q_j$	Cumulative non-events
`ks`	$|F_1(i) - F_0(i)|$	KS statistic at bin
`lift`	$\text{pos\_rate} / \bar{p}$	Lift over random
`capture_rate`	$cum\_pos\_pct$	Cumulative capture rate
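As a worked illustration of the WoE, IV, and lift formulas in the table, the sketch below applies them to three bins. The counts are hypothetical and chosen only to keep the arithmetic easy to follow; the code mirrors the formulas rather than the package's own code.

```r
# Hypothetical counts for three bins (not taken from any real data set).
pos <- c(30, 15, 5)      # events (target = 1) per bin
neg <- c(70, 185, 195)   # non-events (target = 0) per bin

p <- pos / sum(pos)      # pos_pct: share of all events falling in each bin
q <- neg / sum(neg)      # neg_pct: share of all non-events falling in each bin

woe <- log(p / q)        # Weight of Evidence per bin
iv  <- (p - q) * woe     # Information Value contribution per bin

pos_rate     <- pos / (pos + neg)
overall_rate <- sum(pos) / sum(pos + neg)
lift         <- pos_rate / overall_rate

print(data.frame(pos_pct = p, neg_pct = q, woe = woe, iv = iv, lift = lift))
total_iv <- sum(iv)
print(total_iv)
```

Each IV contribution is non-negative because (p - q) and ln(p / q) always share the same sign. A bin with zero events or zero non-events makes the plain formula infinite; how such bins are treated is not covered by this sketch.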

Global Performance Metrics

Kolmogorov-Smirnov (KS) Statistic: Maximum absolute difference between cumulative distributions of events and non-events. Measures the model's ability to separate populations.

$$KS = \max_i |F_1(i) - F_0(i)|$$

KS Range	Interpretation
< 20%	Poor discrimination
20-40%	Acceptable
40-60%	Good
60-75%	Very good
> 75%	Excellent (verify for data leakage)
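The KS formula can be traced on a small example. The sketch below uses hypothetical bin counts that are assumed to be sorted from highest to lowest risk already; it follows the definition above and is not the package's implementation.

```r
# Hypothetical bin counts, already sorted from highest to lowest risk.
pos <- c(30, 15, 5)      # events per bin
neg <- c(70, 185, 195)   # non-events per bin

F1 <- cumsum(pos) / sum(pos)   # cumulative share of events (cum_pos_pct)
F0 <- cumsum(neg) / sum(neg)   # cumulative share of non-events (cum_neg_pct)

ks_by_bin <- abs(F1 - F0)          # per-bin ks column
ks        <- max(ks_by_bin)        # global KS, as a proportion
ks_bin    <- which.max(ks_by_bin)  # index of the bin where the maximum occurs

cat(sprintf("KS = %.1f%% at bin %d\n", 100 * ks, ks_bin))
```

Here the first bin already captures 60% of events but only about 16% of non-events, so the maximum gap occurs at bin 1.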

Gini Coefficient: Measure of inequality between event and non-event distributions. Equivalent to 2*AUC - 1, which is twice the area between the Lorenz curve (cumulative events plotted against cumulative non-events) and the line of equality.

$$Gini = 2 \times AUC - 1$$

Area Under ROC Curve (AUC): Probability that a randomly chosen event is ranked higher than a randomly chosen non-event. Computed via the trapezoidal rule.
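A minimal sketch of the trapezoidal computation on binned data follows, assuming the curve is traced through the points (cum_neg_pct, cum_pos_pct) with the origin prepended. The counts are hypothetical, and the pairwise check at the end (ties within a bin counted as one half) is included only to show that the trapezoid area agrees with the probabilistic reading of AUC.

```r
# Hypothetical bin counts, sorted from highest to lowest risk.
pos <- c(30, 15, 5)
neg <- c(70, 185, 195)

# Curve points: x = cumulative non-events, y = cumulative events.
x <- c(0, cumsum(neg) / sum(neg))
y <- c(0, cumsum(pos) / sum(pos))

# Trapezoidal rule: width of each step times the average of its two heights.
auc  <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
gini <- 2 * auc - 1

# Cross-check: P(event ranked above non-event), ties within a bin count 1/2.
k <- length(pos)
neg_below  <- rev(cumsum(rev(neg))) - neg   # non-events in lower-risk bins
concordant <- sum(pos * neg_below)
tied       <- sum(pos * neg)
auc_pairs  <- (concordant + 0.5 * tied) / (sum(pos) * sum(neg))

cat(sprintf("AUC = %.4f, Gini = %.4f\n", auc, gini))
stopifnot(isTRUE(all.equal(auc, auc_pairs)))
```

Both quantities are shown here as proportions; multiply by 100 to compare with metrics reported in percent.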

Total Information Value (IV): Sum of IV contributions across all bins. See obwoe for interpretation guidelines.

Use Cases

Model Validation: Verify rank-ordering (monotonic event rates) and acceptable KS/Gini.

Cutoff Selection: Identify the bin where the model provides optimal separation for business rules (e.g., auto-approve above score X).

Population Stability: Compare gains tables over time to detect model drift.

Regulatory Reporting: Generate metrics required by Basel II/III and IFRS 9 frameworks.
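For the model validation use case, a monotonicity check on bin-level event rates might look like the sketch below. The pos_rate vector is hypothetical; in practice it would come from the pos_rate column of the gains table, and the helper name is_monotonic is an assumption made for this example.

```r
# Hypothetical event rates per bin, ordered from highest to lowest risk.
pos_rate <- c(0.30, 0.12, 0.075, 0.025)

# TRUE when the rates never change direction across consecutive bins.
is_monotonic <- function(rates) {
  d <- diff(rates)
  all(d <= 0) || all(d >= 0)
}

is_monotonic(pos_rate)                 # rates fall steadily
is_monotonic(c(0.30, 0.10, 0.15))      # a reversal between bins 2 and 3
```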

References

Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons. doi:10.1002/9781119201731

Thomas, L. C., Edelman, D. B., & Crook, J. N. (2002). Credit Scoring and Its Applications. SIAM Monographs on Mathematical Modeling and Computation. doi:10.1137/1.9780898718317

Anderson, R. (2007). The Credit Scoring Toolkit: Theory and Practice for Retail Credit Risk Management. Oxford University Press.

Hand, D. J., & Henley, W. E. (1997). Statistical Classification Methods in Consumer Credit Scoring: A Review. Journal of the Royal Statistical Society: Series A, 160(3), 523-541. doi:10.1111/j.1467-985X.1997.00078.x

Examples

# \donttest{
# =============================================================================
# Example 1: From obwoe Object (Standard Usage)
# =============================================================================
set.seed(42)
n <- 1000
df <- data.frame(
  age = rnorm(n, 40, 15),
  income = exp(rnorm(n, 10, 0.8)),
  score = rnorm(n, 600, 100),
  target = rbinom(n, 1, 0.15)
)

model <- obwoe(df, target = "target")
gains <- obwoe_gains(model, feature = "age")
print(gains)

# Access metrics
cat("KS:", gains$metrics$ks, "%\n")
cat("Gini:", gains$metrics$gini, "%\n")

# =============================================================================
# Example 2: From obwoe_apply Output - Using Bin Column
# =============================================================================
scored <- obwoe_apply(df, model)

# Group on the age_bin column (requested explicitly via use_column = "bin")
gains_bin <- obwoe_gains(scored,
  target = df$target, feature = "age",
  use_column = "bin"
)

# =============================================================================
# Example 3: From obwoe_apply Output - Using WoE Column
# =============================================================================
# Group by WoE values (continuous analysis)
gains_woe <- obwoe_gains(scored,
  target = df$target, feature = "age",
  use_column = "woe", n_groups = 5
)

# =============================================================================
# Example 4: Any Variable - Score Decile Analysis
# =============================================================================
# Create score deciles manually
df$score_decile <- cut(df$score,
  breaks = quantile(df$score, probs = seq(0, 1, 0.1)),
  include.lowest = TRUE, labels = 1:10
)

# Analyze score deciles directly
gains_score <- obwoe_gains(df,
  target = "target", feature = "score_decile",
  use_column = "direct"
)
print(gains_score)

# =============================================================================
# Example 5: Automatic Decile Creation
# =============================================================================
# Use n_groups to automatically create quantile groups
gains_auto <- obwoe_gains(df,
  target = "target", feature = "score",
  use_column = "direct", n_groups = 10
)
# }