Woe: Weight of evidence for each level of a factor.

Description

Computes the weight of evidence (WOE) for each level of a factor with respect to a binary dependent variable.

Usage

## S3 method for class 'factor':
Woe(iv, dv, maxOdds=10000, civ=NULL, ...)

Arguments

iv

A factor, the independent variable. Missing values, if present, are replaced using CleanNaFromFactor.

dv

The dependent variable, which may have only two unique values. Missing values are not allowed.

maxOdds

Odds greater than maxOdds are replaced with maxOdds, and odds less than 1/maxOdds are replaced with 1/maxOdds.

civ

If iv is a discretized version of a continuous variable, then the original continuous variable can be provided in this argument so that linearity can be calculated. See the Value section below for more information.

...

Extra unused arguments.
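
The clamping described for maxOdds can be pictured with a short base R sketch. The helper name clamp_odds is hypothetical and is not part of the package; it only illustrates limiting odds to the interval [1/maxOdds, maxOdds], which also keeps the log odds finite when a level contains a single class.

```r
# Hypothetical helper, not part of the package:
# limit odds to the interval [1/maxOdds, maxOdds].
clamp_odds <- function(odds, maxOdds = 10000) {
  pmin(pmax(odds, 1 / maxOdds), maxOdds)
}

# Odds of 0 and Inf arise when a level contains only one class.
odds    <- c(0, 0.5, 2, Inf)
clamped <- clamp_odds(odds)

clamped       # values: 1/10000, 0.5, 2, 10000
log(clamped)  # every log odds value is now finite
```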

Value

A list with the following elements:
woe.levels

A vector of WOE values corresponding to each level of the factor iv. The values are ordered to match the input factor iv.

woe

A vector of WOE values with the same length as iv. Essentially each factor value is replaced with the associated log odds.

odds

A vector of odds values corresponding to each level of the factor iv. The values are ordered to match the input factor iv.

bin.count

A count of data points in each level of the factor iv.

true.count

A count of "true" dependent variable values in each level of the factor iv. The number of "false" values is bin.count - true.count.

log.density.ratio

A vector of log density ratio values corresponding to each level of the factor iv. The values are ordered to match the input factor iv.

information.value

A vector of information values corresponding to each level of the factor iv. The values are ordered to match the input factor iv.

linearity

A measure of correlation between the log-odds of the dependent variable and the binned values of the continuous independent variable civ. This is calculated if the civ argument was provided, otherwise it's NA.

Details

This function computes the log odds (also known as weight of evidence) for each level of a factor as follows: $$woe = \log \frac{nPositive}{nNegative}$$ where nPositive is the number of "positive" values of the dependent variable within that level, and nNegative is the number of "negative" values within that level.
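
The formula can be reproduced by hand in base R. The following is a minimal sketch on toy data, not the package's implementation; the variable names (n_positive, n_negative, woe_levels) are chosen here for illustration, and a 0/1 dependent variable with 1 as the "positive" class is assumed.

```r
# Toy data: an independent factor and a binary (0/1) dependent variable.
iv <- factor(c("a", "a", "a", "b", "b", "b"))
dv <- c(1, 1, 0, 0, 0, 1)

# Count positives and negatives within each level of the factor.
n_positive <- tapply(dv == 1, iv, sum)
n_negative <- tapply(dv == 0, iv, sum)

# Odds and log odds per level (a: 2/1, b: 1/2).
odds       <- n_positive / n_negative
woe_levels <- log(odds)

# Replace each observation's level with the log odds of that level.
woe <- unname(woe_levels[as.character(iv)])
```

Here woe_levels plays the role of the per-level result and woe the per-observation result, each observation carrying the log odds of its level.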

By default the second level of dv is used as the "positive" class in these calculations. This can be controlled by setting the order of the levels in a factor supplied as dv.
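
Level order in a factor is set with the levels argument of base R's factor(). The sketch below only shows how to inspect and change which label sits in the second position; it does not call Woe, and the "yes"/"no" labels are made up for illustration.

```r
# Without an explicit order, levels are sorted alphabetically.
dv <- factor(c("yes", "no", "no", "yes"))
levels(dv)          # "no" "yes": "yes" is the second level

# Reorder the levels so that "no" becomes the second level instead.
dv_flipped <- factor(dv, levels = c("yes", "no"))
levels(dv_flipped)  # "yes" "no"

# The underlying observations are unchanged, only the level order differs.
identical(as.character(dv), as.character(dv_flipped))
```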

Other metrics returned include the information value and the log density ratio.
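
This page does not state the formulas for these two metrics. As a rough guide, the sketch below uses the definitions conventional in credit scoring, which is an assumption and may differ from what the function actually computes: the log density ratio compares each level's share of positives with its share of negatives, and the information value weights that ratio by the difference of the two shares.

```r
# Per-level counts from a toy example (assumed, for illustration only).
n_positive <- c(a = 2, b = 1)
n_negative <- c(a = 1, b = 2)

# Share of all positives / all negatives that fall in each level.
p_pos <- n_positive / sum(n_positive)
p_neg <- n_negative / sum(n_negative)

# Conventional definitions (assumed, not confirmed by this page).
log_density_ratio <- log(p_pos / p_neg)
information_value <- (p_pos - p_neg) * log_density_ratio

# The per-level contributions are often summed into one score.
total_iv <- sum(information_value)
```

Under these definitions each level's contribution is non-negative, because the difference of shares and the log ratio always have the same sign.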

Examples

library(stringr)

# create a factor with three levels (a, b, and NA)
# - odds of 1 for a:  2:1 = 2.0
# - odds of 1 for b:  1:2 = 0.5
# - odds of 1 for NA: 1:1 = 1.0
# split on a single space so that only the letters become factor values
f1  <- factor(c(str_split("a a a b b b", " ")[[1]], NA,NA))
dv1 <- c(                  1,1,0,0,0,1,              1, 0 )
fw1 <- Woe(f1,dv1)
fw1$odds

# discretize a continuous variable into a factor with 10 levels and compute WOE
data(df.causata)
dv <- df.causata$has.responded.mobile.logoff_next.hour_466
f2 <- BinaryCut(df.causata$online.average.authentications.per.month_all.past_406, dv)
fw2 <- Woe(f2, dv, civ=df.causata$online.average.authentications.per.month_all.past_406)
fw2$odds
fw2$linearity