varimp: Variable Importance

Description

Standard and conditional variable importance for 'cforest', following the permutation principle of the 'mean decrease in accuracy' importance in 'randomForest'.

Usage

varimp(object, mincriterion = 0, conditional = FALSE, 
       threshold = 0.2, nperm = 1, OOB = TRUE, pre1.0_0 = conditional)
varimpAUC(object, mincriterion = 0, conditional = FALSE, 
       threshold = 0.2, nperm = 1, OOB = TRUE, pre1.0_0 = conditional)

Arguments

object

an object as returned by cforest.

mincriterion

the value of the test statistic or 1 - p-value that must be exceeded in order to include a split in the computation of the importance. The default mincriterion = 0 guarant

conditional

a logical determining whether unconditional or conditional computation of the importance is performed.

threshold

the value of the test statistic or 1 - p-value of the association between the variable of interest and a covariate that must be exceeded in order to include the covariate in the conditioning sc

nperm

the number of permutations performed.

OOB

a logical determining whether the importance is computed from the out-of-bag sample or from the learning sample (the latter is not recommended).

pre1.0_0

Prior to party version 1.0-0, the actual data values were permuted according to the original permutation importance suggested by Breiman (2001). Now the assignments to child nodes of splits in th

Value

A vector of 'mean decrease in accuracy' importance scores.

Details

Function varimp can be used to compute variable importance measures similar to those computed by importance in randomForest. Besides the standard version, a conditional version is available, which adjusts for correlations between predictor variables. If conditional = TRUE, the importance of each variable is computed by permuting within a grid defined by the covariates that are associated with the variable of interest (with 1 - p-value greater than threshold). The resulting variable importance score is conditional in the sense of beta coefficients in regression models, but represents the effect of a variable in both main effects and interactions. See Strobl et al. (2008) for details.
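The permutation principle, and the effect of permuting within a grid of an associated covariate, can be illustrated with a small base R sketch. This is not the code used by varimp: it assumes a linear model instead of a forest, a held-out set in place of the out-of-bag sample, and quartile bins of a single covariate as the conditioning grid. The helper names perm_importance and perm_importance_within are hypothetical.

```r
# Illustrative sketch only (base R, not party's implementation).
set.seed(1)
n  <- 400
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # strongly correlated with x1
x3 <- rnorm(n)                  # independent noise variable
y  <- 2 * x1 + rnorm(n)         # only x1 drives the response
dat <- data.frame(y, x1, x2, x3)

train   <- dat[1:200, ]
holdout <- dat[201:400, ]       # stands in for the out-of-bag sample
fit     <- lm(y ~ x1 + x2 + x3, data = train)

mse <- function(model, data) {
  mean((data$y - predict(model, newdata = data))^2)
}

# Unconditional importance: permute one predictor over the whole sample
# and record the mean increase in prediction error.
perm_importance <- function(model, data, var, nperm = 10) {
  base <- mse(model, data)
  mean(replicate(nperm, {
    shuffled <- data
    shuffled[[var]] <- sample(shuffled[[var]])
    mse(model, shuffled) - base
  }))
}

# Conditional variant: permute only within bins of an associated covariate,
# so the dependence between 'var' and 'by' is roughly preserved.
perm_importance_within <- function(model, data, var, by, nperm = 10, bins = 4) {
  base   <- mse(model, data)
  breaks <- quantile(data[[by]], probs = 0:bins / bins)
  strata <- cut(data[[by]], breaks = breaks, include.lowest = TRUE)
  mean(replicate(nperm, {
    shuffled <- data
    shuffled[[var]] <- ave(shuffled[[var]], strata,
                           FUN = function(v) v[sample.int(length(v))])
    mse(model, shuffled) - base
  }))
}

imp_x1      <- perm_importance(fit, holdout, "x1")
imp_x3      <- perm_importance(fit, holdout, "x3")
imp_x1_cond <- perm_importance_within(fit, holdout, "x1", by = "x2")
print(c(x1 = imp_x1, x3 = imp_x3, x1_given_x2 = imp_x1_cond))
```

In this toy setting the conditional score for x1 is smaller than the unconditional one, because permuting within bins of x2 leaves most of the information that x1 shares with x2 intact.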

Note, however, that all random forest results are subject to random variation. Thus, before interpreting the importance ranking, check whether the same ranking is achieved with a different random seed, or otherwise increase the number of trees ntree in cforest_control.

In the presence of missing values in the predictor variables, the procedure described in Hapfelmeier et al. (2012) is performed.

Function varimpAUC implements AUC-based variable importances as described by Janitza et al. (2013). Here, the area under the curve instead of the accuracy is used to calculate the importance of each variable. This AUC-based variable importance measure is more robust towards class imbalance.

For right-censored responses, varimp uses the integrated Brier score as a risk measure for computing variable importances. This feature is extremely slow and experimental; use at your own risk.
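Why an AUC-based measure can be more robust towards class imbalance than an accuracy-based one can be seen in a small base R sketch. It is a simplified illustration, not the per-tree out-of-bag computation in varimpAUC: it assumes a logistic regression, a held-out set, a 0.5 classification cutoff, and a single permutation. The helper functions auc and accuracy are hypothetical names.

```r
# Illustrative sketch only (base R, not varimpAUC's implementation).
set.seed(2)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-3.5 + x))   # rare positive class
dat <- data.frame(x, y)

train   <- dat[1:1000, ]
holdout <- dat[1001:2000, ]
fit     <- glm(y ~ x, family = binomial, data = train)

# AUC from the Mann-Whitney rank-sum formula (midranks handle ties).
auc <- function(score, label) {
  pos   <- label == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  (sum(rank(score)[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Accuracy at a fixed 0.5 cutoff on the predicted probability.
accuracy <- function(score, label) mean((score > 0.5) == (label == 1))

# Permute the predictor once and compare predictions before and after.
permuted   <- holdout
permuted$x <- sample(permuted$x)
p_orig <- predict(fit, newdata = holdout,  type = "response")
p_perm <- predict(fit, newdata = permuted, type = "response")

acc_drop <- accuracy(p_orig, holdout$y) - accuracy(p_perm, holdout$y)
auc_drop <- auc(p_orig, holdout$y) - auc(p_perm, holdout$y)
print(c(accuracy_drop = acc_drop, auc_drop = auc_drop))
```

With a rare positive class, nearly every observation is classified as negative both before and after permutation, so the accuracy hardly changes, whereas the AUC still reflects the lost ranking information.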

References

Leo Breiman (2001). Random Forests. Machine Learning, 45(1), 5--32.

Alexander Hapfelmeier, Torsten Hothorn, Kurt Ulm, and Carolin Strobl (2012). A New Variable Importance Measure for Random Forests with Missing Data. Statistics and Computing, http://dx.doi.org/10.1007/s11222-012-9349-1

Torsten Hothorn, Kurt Hornik, and Achim Zeileis (2006b). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651--674. Preprint available from http://statmath.wu-wien.ac.at/~zeileis/papers/Hothorn+Hornik+Zeileis-2006.pdf

Silke Janitza, Carolin Strobl, and Anne-Laure Boulesteix (2013). An AUC-based Permutation Variable Importance Measure for Random Forests. BMC Bioinformatics, 14, 119. http://www.biomedcentral.com/1471-2105/14/119

Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis (2008). Conditional Variable Importance for Random Forests. BMC Bioinformatics, 9, 307. http://www.biomedcentral.com/1471-2105/9/307

Examples

library("party")     # cforest, cforest_unbiased, varimp, readingSkills
library("survival")  # Surv, coxph

set.seed(290875)
readingSkills.cf <- cforest(score ~ ., data = readingSkills,
    control = cforest_unbiased(mtry = 2, ntree = 50))

# standard importance
varimp(readingSkills.cf)
# the same modulo random variation
varimp(readingSkills.cf, pre1.0_0 = TRUE)

# conditional importance, may take a while...
varimp(readingSkills.cf, conditional = TRUE)

data("GBSG2", package = "TH.data")
### add a random covariate for sanity check
set.seed(29)
GBSG2$rand <- runif(nrow(GBSG2))
object <- cforest(Surv(time, cens) ~ ., data = GBSG2,
                  control = cforest_unbiased(ntree = 20))
vi <- varimp(object)
### compare variable importances and absolute z-statistics
layout(matrix(1:2))
barplot(vi)
barplot(abs(summary(coxph(Surv(time, cens) ~ ., data = GBSG2))$coefficients[, "z"]))
### looks more or less the same
