Calculation of variable importance for regression and classification models
A generic method for calculating variable importance for objects produced by
train, along with method-specific methods
## S3 method for class 'train': varImp(object, useModel = TRUE, nonpara = TRUE, scale = TRUE, ...)
## S3 method for class 'earth': varImp(object, value = "gcv", ...)
## S3 method for class 'fda': varImp(object, value = "gcv", ...)
## S3 method for class 'rpart': varImp(object, surrogates = FALSE, competes = TRUE, ...)
## S3 method for class 'randomForest': varImp(object, ...)
## S3 method for class 'gbm': varImp(object, numTrees, ...)
## S3 method for class 'classbagg': varImp(object, ...)
## S3 method for class 'regbagg': varImp(object, ...)
## S3 method for class 'pamrtrained': varImp(object, threshold, data, ...)
## S3 method for class 'lm': varImp(object, ...)
## S3 method for class 'mvr': varImp(object, estimate = NULL, ...)
## S3 method for class 'bagEarth': varImp(object, ...)
## S3 method for class 'RandomForest': varImp(object, ...)
## S3 method for class 'rfe': varImp(object, drop = FALSE, ...)
## S3 method for class 'dsa': varImp(object, cuts = NULL, ...)
## S3 method for class 'multinom': varImp(object, ...)
## S3 method for class 'gam': varImp(object, ...)
## S3 method for class 'cubist': varImp(object, weights = c(0.5, 0.5), ...)
- an object corresponding to a fitted model
- use a model-based technique for measuring variable importance? This is only used for some models (lm, pls, rf, rpart, gbm, pam and mars)
- should nonparametric methods be used to assess the relationship
between the features and response (only used with
useModel = FALSE and only passed to
- should the importance values be scaled to 0 and 100?
- parameters to pass to the specific
- the number of iterations (trees) to use in a boosted tree model
- the shrinkage threshold (
- the training set predictors (
- the statistic that will be used to calculate importance:
- should surrogate splits contribute to the importance calculation?
- should competing splits contribute to the importance calculation?
- which estimate of performance should be used? See
- a logical: should variables not included in the final set be calculated?
- the number of rule sets to use in the model (for
- a numeric vector of length two that weights the usage of variables in the rule conditions and the usage in the linear models (see details below).
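The scaling of importance values to the 0 to 100 range mentioned in the argument list can be illustrated with a small base R sketch. It assumes a simple min-max rescaling; the helper name scale_importance and the raw scores are hypothetical, and the package's internal computation may differ in detail.

```r
# Hypothetical raw importance scores (illustrative values only).
raw_imp <- c(x1 = 2.5, x2 = 0.5, x3 = 1.5)

# Min-max rescaling to the 0-100 range.
# Assumption: this is one plausible form of what scale = TRUE does.
scale_importance <- function(imp) {
  shifted <- imp - min(imp)
  # If every score is identical, return zeros instead of dividing by zero.
  if (max(shifted) == 0) return(shifted)
  shifted / max(shifted) * 100
}

scaled_imp <- scale_importance(raw_imp)
print(scaled_imp)
```

Under this assumption the least important variable maps to 0 and the most important to 100, so scaled values are only comparable within a single model.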
For models that do not have corresponding
varImp methods, see
Linear Models: the absolute value of the t-statistic for each model parameter is used.
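As a rough illustration of the linear model measure, the following base R sketch computes the absolute t-statistic for each term of an lm fit directly from the coefficient table. The choice of the mtcars data and the decision to leave out the intercept row are assumptions made for the example, not a statement of the package's exact code.

```r
# Fit an ordinary linear model on a built-in data set.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- coef(summary(fit))

# Importance = |t| for each term.
# Assumption: the intercept row is not treated as a predictor.
keep <- rownames(coef_table) != "(Intercept)"
lm_imp <- abs(coef_table[keep, "t value"])

print(sort(lm_imp, decreasing = TRUE))
```

Because the t-statistic depends on the standard error, this measure reflects how precisely each coefficient is estimated, not only the size of the coefficient.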
varImp.RandomForest are wrappers around the importance functions from the
rpart.control. This method does not currently provide
class-specific measures of importance when the response is a factor.
Bagged Trees: The same methodology as a single tree is applied to
all bootstrapped trees and the total importance is returned.
varImp.gbm is a wrapper around the function from that package (see the
varImp function tracks the changes in
model statistics, such as the GCV, for each predictor and
accumulates the reduction in the statistic when each
predictor's feature is added to the model. This total reduction
is used as the variable importance measure. If a predictor was
never used in any of the MARS basis functions in the final model
(after pruning), it has an importance
value of zero. Prior to June 2008, the package used an internal function
for these calculations. Currently,
varImp is a wrapper around the
evimp function in the
earth package. There are three statistics that can be used to
estimate variable importance in MARS models. Using
varImp(object, value = "gcv") tracks the reduction in the
generalized cross-validation statistic as terms are added.
However, there are some cases when terms are retained
in the model that result in an increase in GCV. Negative variable
importance values for MARS are set to zero.
varImp(object, value = "rss") monitors the change in the
residual sums of squares (RSS) as terms are added, which will
never be negative.
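The accumulation idea described above can be sketched in base R. This is a simplified illustration only, not the evimp algorithm: the trace of terms and the GCV values are made up, and the column names are hypothetical.

```r
# Hypothetical trace: the predictor used by each added term, and the model
# GCV before and after that term was added (made-up numbers).
trace <- data.frame(
  predictor  = c("x1", "x2", "x1", "x3"),
  gcv_before = c(10.0, 6.0, 5.0, 3.5),
  gcv_after  = c(6.0, 5.0, 3.5, 3.8),
  stringsAsFactors = FALSE
)

# Reduction in GCV attributed to each term (negative when GCV increased).
trace$reduction <- trace$gcv_before - trace$gcv_after

# Accumulate the reductions per predictor.
totals <- vapply(split(trace$reduction, trace$predictor), sum, numeric(1))

# Negative totals are set to zero, as described for the GCV statistic.
mars_imp <- pmax(totals, 0)
print(mars_imp)
```

In this made-up trace, x3 only ever increases the GCV, so its accumulated value is negative and is reported as zero.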
Also, the option
varImp(object, value = "nsubsets"), which
counts the number of subsets where the variable is used (in the final,
Nearest shrunken centroids: The difference between the class centroids and the overall centroid is used to measure the variable influence (see
pamr.predict). The larger the difference between the class centroid and the overall center of the data, the larger the separation between the classes. The training set predictors must be supplied when an object of class
pamrtrained is given to
Cubist: The Cubist output contains variable usage statistics. It gives the percentage of times where each variable was used in a condition and/or a linear model. Note that this output will probably be inconsistent with the rules shown in the output from
summary.cubist. At each split of the tree, Cubist saves a linear model (after feature selection) that is allowed to have terms for each variable used in the current split or any split above it. Quinlan (1992) discusses a smoothing algorithm where each model prediction is a linear combination of the parent and child models along the tree. As such, the final prediction is a function of all the linear models from the initial node to the terminal node. The percentages shown in the Cubist output reflect all the models involved in prediction (as opposed to the terminal models shown in the output). The variable importance used here is a linear combination of the usage in the rule conditions and the usage in the model.
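The linear combination for Cubist can be sketched as a weighted sum of the two usage percentages, using the default weights = c(0.5, 0.5) from the usage section. The usage table below is hypothetical, and the column names conditions and model are chosen for the example only.

```r
# Hypothetical variable usage percentages of the kind Cubist reports:
# how often each variable appears in rule conditions and in linear models.
usage <- data.frame(
  variable   = c("x1", "x2", "x3"),
  conditions = c(80, 20, 0),
  model      = c(60, 100, 40),
  stringsAsFactors = FALSE
)

# First weight applies to rule conditions, second to the linear models.
weights <- c(0.5, 0.5)

# Importance as a weighted sum of the two usage percentages.
cubist_imp <- weights[1] * usage$conditions + weights[2] * usage$model
names(cubist_imp) <- usage$variable
print(cubist_imp)
```

Shifting weight toward the first element favors variables that drive the splits, while shifting it toward the second favors variables that appear in the linear models.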
- A data frame with class
varImp.train or a matrix for other models.