PredictivePower: Predictive power for a single variable.

Description

This function computes predictive power for a single independent variable and a binary dependent variable.

Usage

PredictivePower(iv, dv, warn.levels=30, cv=NULL, debug=FALSE, ...)
PredictivePowerCv(iv, dv, warn.levels=30, debug=FALSE, folds=10, ...)

Arguments

iv

The independent variable.

dv

The dependent variable, which may have only two unique values.

warn.levels

If the number of levels in iv exceeds this value then a warning will be issued.

debug

If set to TRUE then debugging information is printed to the screen.

cv

If NULL then all data are used to compute the predictive power. If a logical (boolean) index is provided then it is used to separate the data into two parts for cross validation. See the Details below for more information.

...

Additional arguments are passed to BinaryCut.

folds

This argument specifies the folds used for cross validation. If a number between 2 and 10 is provided then data will be assigned to that number of folds at random. If a vector of values is provided then it will be used as an index to assign data to folds. The number of unique values must be between 2 and 10, and the vector length must match the length of iv.

Value

The PredictivePower functions return a numeric value between 0 and 1 representing the predictive power. PredictivePowerCv returns a list.

Details

Predictive power is defined as the area under the gains chart for the provided independent variable divided by the area under the gains chart for a perfect predictor. A random predictor would have a predictive power value of 0, and a perfect predictor would have a value of 1.
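The definition above can be illustrated with a minimal base R sketch. It is not the package's implementation: gains_power_sketch is a hypothetical helper, dv is assumed to be a 0/1 numeric vector with 1 as the "positive" class, and both areas are assumed to be measured relative to the diagonal of the gains chart, which is what gives a random predictor a value of 0.

```r
# Hypothetical helper (not part of the package): predictive power from a
# discretized gains chart for a factor iv and a 0/1 numeric dv.
gains_power_sketch <- function(iv, dv) {
  keep <- !is.na(iv)                         # rows with missing iv are dropped
  iv   <- droplevels(factor(iv[keep]))
  dv   <- dv[keep]

  n   <- as.numeric(table(iv))               # bin sizes, one bin per level
  pos <- as.numeric(tapply(dv, iv, sum))     # positives per bin
  ord <- order(pos / n, decreasing = TRUE)   # bins with the most positives first

  # Gains chart: cumulative share of cases (x) vs. cumulative share of positives (y)
  x <- c(0, cumsum(n[ord]) / sum(n))
  y <- c(0, cumsum(pos[ord]) / sum(pos))
  area <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)   # trapezoid rule

  # A perfect predictor reaches y = 1 at x = p, where p is the positive rate
  p <- sum(pos) / sum(n)
  perfect_area <- 1 - p / 2

  # Assumption: areas are taken relative to the diagonal (area 0.5)
  (area - 0.5) / (perfect_area - 0.5)
}

iv <- factor(c("a", "a", "a", "b", "b", "b", NA, NA))
dv <- c(1, 1, 0, 0, 0, 1, 1, 1)
gains_power_sketch(iv, dv)   # about 0.333
```

Under these assumptions the sketch gives 1/3 for the data used in the first entry of the Examples section.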

The power calculation is derived from a discretized gains chart. As such it only works with categorical variables. Numeric variables are discretized before power is computed. The PredictivePower.numeric function discretizes continuous data using the BinaryCut function. Note that the predictive power will depend, in part, on the discretization method.

By default the second level of dv is used as the "positive" class during power calculations. This can be controlled by ordering the levels in a factor supplied as dv.
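As a short base R illustration of how level order determines the second level, the following sketch uses made-up "yes"/"no" data. Only factor() and levels() are used; nothing here calls the package.

```r
# Made-up dependent variable for illustration
dv_raw <- c("yes", "no", "no", "yes")

# By default factor() sorts the levels, giving "no", "yes",
# so "yes" is the second level.
dv_default <- factor(dv_raw)
levels(dv_default)[2]

# Listing the levels explicitly makes "no" the second level instead.
dv_flipped <- factor(dv_raw, levels = c("yes", "no"))
levels(dv_flipped)[2]
```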

Missing values in iv are allowed in PredictivePower.factor -- they are ignored during the calculations, as are the corresponding dependent variable values. The missing values can be used in the power calculations if the missing values are mapped to a non-missing level in the factor. See CleanNaFromFactor. Missing values are not allowed in dv.
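One way to map missing values to a non-missing level is sketched below in base R. This is a stand-in for illustration only: it does not use CleanNaFromFactor, whose interface is not described here, and the label "missing" is an arbitrary choice.

```r
iv <- factor(c("a", "b", NA, "a", NA))

# Replace NA with an explicit label, then rebuild the factor.
iv_chr <- as.character(iv)
iv_chr[is.na(iv_chr)] <- "missing"   # arbitrary label chosen for this sketch
iv_clean <- factor(iv_chr)

table(iv_clean)   # the two former NA values now count as level "missing"
```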

Cross validation is executed using the PredictivePowerCv function as a wrapper for the PredictivePower functions. When constructing the gains chart the bins are ordered by the odds of a "positive" outcome within each bin. During cross validation the ordering is derived from one set of data, and the area under the curve is calculated with the other set.
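The split between ordering and area calculation can be sketched as follows. This is an assumption-laden illustration, not the code behind PredictivePowerCv: cv_power_sketch is a hypothetical helper, dv is assumed to be a 0/1 numeric vector, train is a logical index marking the part used for ordering, and areas are again assumed to be measured relative to the diagonal.

```r
# Hypothetical helper (not part of the package).
cv_power_sketch <- function(iv, dv, train) {
  iv   <- factor(iv)
  test <- !train

  # Bin ordering comes from the training part only (highest positive rate first);
  # levels absent from the training part sort last.
  train_rate <- tapply(dv[train], iv[train], mean)
  ord <- order(train_rate, decreasing = TRUE)

  # Bin counts come from the held-out part, arranged in the training order.
  n   <- as.numeric(table(iv[test]))[ord]
  pos <- as.numeric(tapply(dv[test], iv[test], sum))[ord]
  pos[is.na(pos)] <- 0                       # levels absent from the held-out part

  x <- c(0, cumsum(n) / sum(n))
  y <- c(0, cumsum(pos) / sum(pos))
  area <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)   # trapezoid rule

  p <- sum(pos) / sum(n)                     # positive rate in the held-out part
  (area - 0.5) / ((1 - p / 2) - 0.5)
}

iv    <- rep(c("a", "b"), each = 4)
train <- c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)

# Ordering learned on the training part also holds on the held-out part
cv_power_sketch(iv, c(1, 1, 1, 1, 0, 0, 0, 0), train)   # 1
```

In this sketch the result can drop below 0 when the ordering learned from the training part is reversed in the held-out part.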

References

Inspired by Miller, H. (2009) Predicting customer behaviour: The University of Melbourne's KDD Cup report.

Examples

library(stringr)

# Power is 1/3 where levels differ by 1/3; missing values in iv are ignored.
PredictivePower(factor(c(str_split("a a a b b b", " ")[[1]], NA,NA)),
              c(                    1,1,0,0,0,1,              1, 1 ) )

# Power is 1.0 for perfect predictor
PredictivePower(factor(c(str_split("a a a a a b b b b b", " "))[[1]]),
                factor(c(str_split("1 1 1 1 1 0 0 0 0 0", " "))[[1]]) )

# Power is 0 for random predictor
PredictivePower(factor(c(str_split("a a a a b b b b", " "))[[1]]),
                factor(c(str_split("1 1 0 0 1 1 0 0", " "))[[1]]) )

# Compute power for random data; power and robustness should be low
set.seed(1234)
fl <- as.factor(sample(letters, size=1e5, replace=TRUE))
dv <- sample(c(0,1), size=1e5, replace=TRUE)
PredictivePowerCv(fl,dv)

# Compute power for numeric data; the nbins argument is passed on to BinaryCut
ivn <- rnorm(1e5)
dvn <- rep(0, 1e5)
dvn[(ivn + rnorm(1e5, sd=0.5))>0] <- 1
PredictivePower(ivn,dvn, nbins=10)
