permimp: Permutation Importance for Random Forests

Description

Standard and partial/conditional permutation importance for random forest objects fit using the party or randomForest packages, following the permutation principle of the 'mean decrease in accuracy' importance in randomForest. The partial/conditional permutation importance is implemented differently from the original varimp function in party: the predictors to condition on are selected in each tree using Pearson Chi-squared tests applied to the predictors after categorizing them by the tree's split points. In general, the new implementation gives results similar to those of the original varimp function. With asParty = TRUE, the partial/conditional permutation importance is fully backward-compatible with, but faster than, the original varimp function in party.

Usage

permimp(object, ...)
# S3 method for randomForest
permimp(object, nperm = 1, OOB = TRUE, scaled = FALSE,
     conditional = FALSE, threshold = .95, whichxnames = NULL,   
     thresholdDiagnostics = FALSE, progressBar = interactive(), do_check = TRUE, 
     oldSeedSelection = FALSE, cl = NULL, ...)
# S3 method for RandomForest
permimp(object, nperm = 1, OOB = TRUE, scaled = FALSE,
     conditional = FALSE, threshold = .95, whichxnames = NULL,   
     thresholdDiagnostics = FALSE, progressBar = interactive(), 
     pre1.0_0 = conditional, AUC = FALSE, asParty = FALSE, mincriterion = 0, 
     oldSeedSelection = FALSE, cl = NULL, ...)

Value

An object of class varimp, with the mean decrease in accuracy as its $values.
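
The permutation principle behind these values can be sketched in base R. This is a minimal illustration, not the permimp internals: a linear model stands in for a single tree, a held-out half of the data plays the role of the out-of-bag sample, and all object names (dat, perm_importance, imp) are hypothetical. For a regression outcome, the "decrease in accuracy" is the increase in mean squared error after permuting a predictor.

```r
# Minimal sketch of the permutation principle (assumption: lm() stands in
# for one tree; permimp itself works on the trees of a fitted forest).
set.seed(1)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 + rnorm(n)          # x2 carries no signal

train <- dat[1:250, ]
test  <- dat[251:500, ]                 # plays the role of the out-of-bag sample
fit   <- lm(y ~ x1 + x2, data = train)

# Mean squared prediction error on new data
mse <- function(model, newdata) mean((newdata$y - predict(model, newdata))^2)

# Importance of one predictor: error after permuting it minus original error
perm_importance <- function(model, newdata, xname) {
  permuted <- newdata
  permuted[[xname]] <- sample(permuted[[xname]])   # break the link to y
  mse(model, permuted) - mse(model, newdata)
}

imp <- sapply(c("x1", "x2"), function(v) perm_importance(fit, test, v))
print(imp)
```

In a forest, this difference is computed per tree and then averaged over trees; nperm > 1 repeats the permutation.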

Arguments

object: an object as returned by cforest or randomForest.
mincriterion: the value of the test statistic or 1 - p-value that must be exceeded in order to include a split in the computation of the importance. The default mincriterion = 0 guarantees that all splits are included.
conditional: a logical that determines whether unconditional or conditional permutation is performed.
threshold: the threshold value for (1 - p-value) of the association between the predictor of interest and another predictor, which must be exceeded in order to include the other predictor in the conditioning scheme for the predictor of interest (only relevant if conditional = TRUE). A threshold value of zero includes all other predictors.
nperm: the number of permutations performed.
OOB: a logical that determines whether the importance is computed from the out-of-bag sample or the learning sample (the latter is not recommended).
pre1.0_0: Prior to party version 1.0-0, the actual data values were permuted according to the original permutation importance suggested by Breiman (2001). Now the assignments to child nodes of splits in the variable of interest are permuted as described by Hapfelmeier et al. (2012), which allows for missing values in the predictors and is more efficient with respect to memory consumption and computing time. This method does not apply to the conditional permutation importance, nor to random forests that were not fit using the party package.
scaled: a logical that determines whether the differences in prediction accuracy should be scaled by the total (null-model) error.
AUC: a logical that determines whether the Area Under the Curve (AUC) instead of the accuracy is used to compute the permutation importance (cf. Janitza et al., 2013). The AUC-based permutation importance is more robust towards class imbalance, but it is only applicable to binary classification.
asParty: a logical that determines whether or not exactly the same values as the original varimp function in party should be obtained.
whichxnames: a character vector containing the predictor variable names for which the permutation importance should be computed. Only use when aware of the implications, see section 'Details'.
thresholdDiagnostics: a logical that specifies whether diagnostics with respect to the threshold value should be issued as warnings.
progressBar: a logical that determines whether a progress bar should be displayed.
do_check: a logical that determines whether a check requiring user input should be included.
oldSeedSelection: a logical that determines whether the selection of random numbers should be the same as in version 1.1 of the package. The default is FALSE, so that seeds are generated for each tree and the results are reproducible, also when parallel processing is used.
cl: a cluster object created by makeCluster, or an integer indicating the number of child processes (integer values are ignored on Windows) for parallel evaluation (see Details on parallel computing). NULL (the default) means sequential evaluation.
...: additional arguments to be passed to the methods.

Details

The permimp function is highly comparable to varimp in party, but the partial/conditional variable importance has a different, more efficient implementation. Compared to the original varimp in party, permimp applies a different strategy to select the predictors to condition on (see Debeer and Strobl, 2020).
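
The selection idea (a Pearson Chi-squared test on predictors categorized by split points, compared against threshold) can be sketched in base R. This is an illustration under stated assumptions, not the package's code: the cut points, the simulated predictors, and the helper one_minus_p are all hypothetical, and a real tree would supply its own split points.

```r
# Hypothetical sketch: should predictor z be conditioned on when computing
# the importance of predictor x?
set.seed(2)
n <- 400
x <- rnorm(n)
z_assoc <- x + rnorm(n, sd = 0.5)   # strongly associated with x
z_indep <- rnorm(n)                 # generated independently of x

# Split points as they might occur in one tree (assumed values)
cut_x <- c(-0.5, 0.5)
cut_z <- 0

# Categorize both predictors at the split points, then test their association
one_minus_p <- function(x, z, cut_x, cut_z) {
  fx <- cut(x, breaks = c(-Inf, cut_x, Inf))
  fz <- cut(z, breaks = c(-Inf, cut_z, Inf))
  1 - chisq.test(table(fx, fz))$p.value
}

threshold <- 0.95
crit <- c(assoc = one_minus_p(x, z_assoc, cut_x, cut_z),
          indep = one_minus_p(x, z_indep, cut_x, cut_z))

# TRUE means the predictor would enter the conditioning scheme for x
print(crit > threshold)
```

Lowering threshold admits more predictors into the conditioning scheme; a threshold of zero admits all of them.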

With asParty = TRUE, permimp returns exactly the same values as varimp in party, but the computation is done more efficiently.

If conditional = TRUE, the importance of each variable is computed by permuting within a grid defined by the predictors that are associated with the variable of interest (with 1 - p-value greater than threshold). The threshold can be interpreted as a parameter that moves the permutation importance along a dimension from fully conditional (threshold = 0) to completely unconditional (threshold = 1); see Debeer and Strobl (2020).
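
Permuting within a grid can be illustrated with a small base R sketch. It assumes a single conditioning predictor z categorized at made-up cut points; the helper permute_within is hypothetical and only shows the principle, not how permimp builds its grid from the trees.

```r
# Hypothetical sketch of permuting x within a grid defined by z.
set.seed(3)
n <- 300
z <- rnorm(n)
x <- z + rnorm(n, sd = 0.5)          # x is associated with z

# Grid cells: z categorized at assumed cut points
grid <- cut(z, breaks = c(-Inf, -0.5, 0.5, Inf))

# Shuffle x separately inside each grid cell
permute_within <- function(x, grid) {
  out <- x
  for (cell in levels(grid)) {
    idx <- which(grid == cell)
    out[idx] <- x[idx][sample.int(length(idx))]
  }
  out
}

x_cond   <- permute_within(x, grid)  # conditional permutation
x_uncond <- sample(x)                # unconditional permutation

# Conditional permutation keeps much of the x-z association,
# unconditional permutation destroys it.
print(c(original      = cor(x, z),
        conditional   = cor(x_cond, z),
        unconditional = cor(x_uncond, z)))
```

Because the values of x only move between observations with similar z, the resulting importance reflects what x contributes beyond z, which is the motivation for the conditional variant.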

Using the whichxnames argument, the computation of the permutation importance can be limited to a smaller number of specified predictors. Note, however, that when conditional = TRUE, the (other) predictors to condition on are also limited to this selection of predictors. Only use when fully aware of the implications.

For parallel processing, the pbapply package, a wrapper around the parallel package, is used. Parallel processing is enabled through the cl argument: parLapply is called when cl is a 'cluster' object, and mclapply is called when cl is an integer.

When doing parallel processing, other objects might need to be pushed to the workers, and random numbers must be handled with care (see the Examples of the pbapply package).
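
Both points can be sketched with the parallel package alone. The example below is a generic illustration, not what permimp does internally: per_tree, scale_factor, and run_once are hypothetical names, and the task is a placeholder for per-tree work. It exports an object to the workers with clusterExport and sets reproducible random number streams with clusterSetRNGStream.

```r
library(parallel)

# Hypothetical object that every worker needs
scale_factor <- 10

# Hypothetical per-tree task that uses random numbers
per_tree <- function(i) scale_factor * runif(1)

run_once <- function(seed) {
  cl <- makeCluster(2)
  on.exit(stopCluster(cl))
  # Push scale_factor to the workers
  clusterExport(cl, "scale_factor", envir = environment())
  # Give each worker its own reproducible L'Ecuyer random number stream
  clusterSetRNGStream(cl, iseed = seed)
  unlist(parLapply(cl, 1:8, per_tree))
}

res1 <- run_once(42)
res2 <- run_once(42)
print(identical(res1, res2))
```

With this approach, results are reproducible only for the same seed and the same number of workers, because the streams are tied to the workers rather than to the individual tasks.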

When using parallel processing, showing the progress bar increases the communication overhead between the main process and the nodes / child processes compared to the parallel equivalents of the functions without the progress bar. The functions fall back to their original equivalents when progressBar = FALSE. This is the default when interactive() is FALSE (e.g. when called from a command-line R script).

For further details, please refer to the documentation of varimp.

References

Leo Breiman (2001). Random Forests. Machine Learning, 45(1), 5-32.

Alexander Hapfelmeier, Torsten Hothorn, Kurt Ulm, and Carolin Strobl (2012). A New Variable Importance Measure for Random Forests with Missing Data. Statistics and Computing, https://link.springer.com/article/10.1007/s11222-012-9349-1

Torsten Hothorn, Kurt Hornik, and Achim Zeileis (2006). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651-674. Preprint available from https://www.zeileis.org/papers/Hothorn+Hornik+Zeileis-2006.pdf

Silke Janitza, Carolin Strobl, and Anne-Laure Boulesteix (2013). An AUC-based Permutation Variable Importance Measure for Random Forests. BMC Bioinformatics, 14, 119. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-119

Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis (2008). Conditional Variable Importance for Random Forests. BMC Bioinformatics, 9, 307. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-307

Dries Debeer and Carolin Strobl (2020). Conditional Permutation Importance Revisited. BMC Bioinformatics, 21, 307. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03622-2

Examples

  
  library(permimp)
  
  ### for RandomForest-objects, fit by party::cforest()
  set.seed(290875)
  readingSkills.cf <- party::cforest(score ~ ., data = party::readingSkills, 
                              control = party::cforest_unbiased(mtry = 2, ntree = 25))
  
  ### conditional importance, may take a while...
  # party implementation:
  set.seed(290875)
  party::varimp(readingSkills.cf, conditional = TRUE)
  # faster implementation but same results
  set.seed(290875)
  permimp(readingSkills.cf, conditional = TRUE, asParty = TRUE)
  
  # different implementation with similar results
  set.seed(290875)
  permimp(readingSkills.cf, conditional = TRUE, asParty = FALSE)
  
  ### standard (unconditional) importance is unchanged
  set.seed(290875)
  party::varimp(readingSkills.cf)
  set.seed(290875)
  permimp(readingSkills.cf, oldSeedSelection = TRUE)
  
  
  ### for randomForest-objects, fit by randomForest::randomForest()
  set.seed(290875)
  readingSkills.rf <- randomForest::randomForest(score ~ ., data = party::readingSkills, 
                              mtry = 2, ntree = 25, importance = TRUE, 
                              keep.forest = TRUE, keep.inbag = TRUE)
                              
    
  ### (unconditional) Permutation Importance
  set.seed(290875)
  permimp(readingSkills.rf, do_check = FALSE)
  
  # very close to
  readingSkills.rf$importance[,1]
  
  ### Conditional Permutation Importance
  set.seed(290875)
  permimp(readingSkills.rf, conditional = TRUE, threshold = .8, do_check = FALSE)
                              
  if (FALSE) {
  ### Parallel processing - Windows
  # Only relevant for large forests; for small forests there may be no
  # speed-up at all, or even a slow-down
  
  # Make a larger forest
  set.seed(290875)
  readingSkills.cf <- party::cforest(score ~ ., data = party::readingSkills, 
                                     control = party::cforest_unbiased(mtry = 2, 
                                                                       ntree = 200))
  
  # sequential processing
  set.seed(290875)
  system.time(print(permimp(readingSkills.cf, conditional = TRUE, asParty = FALSE)))
  
  # parallel processing
  # note that the results are reproducible despite using multiple cores
  cluster <- parallel::makeCluster(2)

  set.seed(290875)
  system.time(print(permimp(readingSkills.cf, conditional = TRUE, 
                            asParty = FALSE, cl = cluster, progressBar = FALSE)))
  parallel::stopCluster(cluster)
  }
