C5imp: Variable Importance Measures for C5.0 Models
Description
This function calculates the variable importance (also known as attribute usage) for C5.0 models.
Usage
C5imp(object, metric = "usage", pct = TRUE, ...)
Arguments
object
an object of class C5.0
metric
either 'usage' or 'splits' (see Details below)
pct
a logical: should the importance values be converted to be between 0 and 100?
...
other options (not currently used)
Value
a data frame with a single column, Overall, containing the predictor importance values. The row names identify the predictors.
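As a minimal sketch of a call and its return value, the following fits a single tree on the built-in iris data and requests both metrics. It assumes the C50 package is installed; the object names fit, imp_usage and imp_splits are chosen here for illustration only.

```r
# Assumes the C50 package is installed; iris ships with base R.
library(C50)

# Fit a single classification tree with the formula interface.
fit <- C5.0(Species ~ ., data = iris)

# Default metric: attribute usage, scaled to lie between 0 and 100.
imp_usage <- C5imp(fit)

# Alternative metric: percentage of splits associated with each predictor.
imp_splits <- C5imp(fit, metric = "splits")

# Both results are data frames with an Overall column;
# the row names are the predictor names.
print(imp_usage)
print(imp_splits)
```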
Details
By default (metric = "usage"), C5.0 measures predictor importance as the
percentage of training set samples that fall into all the terminal
nodes after the split. For example, the predictor in the first split
automatically has an importance measurement of 100 percent, because
every training set sample passes through that split. Other predictors
may be used frequently in splits, but if the terminal nodes after those
splits cover only a handful of training set samples, the importance
scores may be close to zero. The same strategy is applied to rule-based
models as well as the corresponding boosted versions of the model.
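The usage calculation can be illustrated on a toy example in base R. This is only a sketch of the idea, not the package's internal code: the data frame train, the hand-written two-split tree, and the variable names are all hypothetical.

```r
# Hypothetical toy data: 10 training samples, two predictors.
train <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  x2 = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
)

# Hand-written tree (an assumption for illustration, not a fitted C5.0 model):
#   first split:  x1 <= 7 ?  no  -> terminal node
#                            yes -> second split on x2 -> two terminal nodes
reaches_first  <- rep(TRUE, nrow(train))  # every sample passes the first split
reaches_second <- train$x1 <= 7           # only these samples reach the x2 split

# Usage = percentage of training samples covered by each predictor's splits.
usage <- c(
  x1 = 100 * sum(reaches_first)  / nrow(train),
  x2 = 100 * sum(reaches_second) / nrow(train)
)
print(usage)
```

Here x1 scores 100 because it is used in the first split, while x2 scores 70 because only 7 of the 10 samples reach its split.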
The attribute usage numbers reported here differ from the nominal
command line output. Although the calculations are almost exactly the
same (we do not add 1/2 to everything), the C code does not report an
attribute as used if the percentage of training samples covered by the
corresponding splits is very low. Here, the threshold was lowered and
the fractional usage is shown.
When metric = "splits", the percentage of splits associated
with each predictor is calculated.
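The arithmetic behind the splits metric can be sketched in base R as follows. The vector split_vars is a hypothetical record of which predictor was used at each split; how the package extracts this from a fitted model is not shown here.

```r
# Hypothetical record of the predictor used at each split of a model.
split_vars <- c("x1", "x2", "x2", "x3")

# Count the splits per predictor, then convert to a percentage of all splits.
counts <- table(split_vars)
splits_pct <- 100 * counts / sum(counts)
print(splits_pct)
```

With four splits in total, x2 accounts for 50 percent of them and x1 and x3 for 25 percent each, regardless of how many training samples each split covers.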
References
Quinlan R (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.
http://www.rulequest.com/see5-unix.html