VSURF: Variable Selection Using Random Forests

Description

Three-step variable selection procedure based on random forests for supervised classification and regression problems. The first step ("thresholding step") eliminates irrelevant variables from the dataset. The second step ("interpretation step") selects all variables related to the response, for interpretation purposes. The third step ("prediction step") refines the selection by eliminating redundancy in the set of variables selected by the second step, for prediction purposes.

Usage

VSURF(x, y, ntree=500,
      mtry=if (!is.factor(y)) max(floor(ncol(x)/3), 1)
           else floor(sqrt(ncol(x))),
      nfor.thres=50, nmin=1, nfor.interp=25, nsd=1, nfor.pred=25, nmj=1)

Arguments

x

A data frame or a matrix of predictors; the columns represent the variables.

y

A response vector (must be a factor for classification problems and numeric for regression ones).

ntree

Number of trees in each forest grown. Standard parameter of randomForest.

mtry

Number of variables randomly sampled as candidates at each split. Standard parameter of randomForest.

nfor.thres

Number of forests grown for "thresholding step" (first of the three steps).

nmin

Multiplier applied to the "minimum value" (min.thres) to set the threshold: only variables with a mean VI larger than nmin*min.thres are kept.

nfor.interp

Number of forests grown for "interpretation step" (second of the three steps).

nsd

Multiplier applied to sd.min, the standard deviation associated with the minimum value of err.interp.

nfor.pred

Number of forests grown for "prediction step" (last of the three steps).

nmj

Multiplier applied to the mean jump (mean.jump) in the "prediction step".
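
The default for mtry shown in Usage depends on the type of y. The following is a minimal sketch of how that default evaluates; default_mtry is a hypothetical helper (not part of VSURF) that simply copies the expression from the Usage section.

```r
# Hypothetical helper: reproduces the default mtry expression from Usage.
default_mtry <- function(x, y) {
  if (!is.factor(y)) max(floor(ncol(x) / 3), 1) else floor(sqrt(ncol(x)))
}

data(iris)

# Classification: factor response, 4 predictors -> floor(sqrt(4))
mtry_class <- default_mtry(iris[, 1:4], iris[, 5])

# Regression: numeric response, 3 predictors -> max(floor(3 / 3), 1)
mtry_reg <- default_mtry(iris[, 2:4], iris[, 1])

print(c(classification = mtry_class, regression = mtry_reg))
```

The max(..., 1) guard in the regression branch keeps mtry at 1 or more when there are fewer than three predictors.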

Value

An object of class VSURF, which is a list with the following components:

varselect.thres

A vector of indexes of variables selected after "thresholding step", sorted according to their mean VI, in decreasing order.

imp.varselect.thres

A vector of importances of the varselect.thres variables.

min.thres

The minimum predicted value of a pruned CART tree fitted to the curve of the standard deviations of VI.

num.varselect.thres

Number of variables selected by "thresholding step".

ord.imp

A list containing the order of all variables' mean importance. $x contains the mean importances sorted in decreasing order. $ix contains indexes of the variables.

ord.sd

A vector of standard deviations of all variables' importance. The order is given by ord.imp.

mean.perf

Mean OOB error rate, obtained by a random forest built on all variables.

pred.pruned.tree

Predictions of the CART tree fitted to the curve of the standard deviations of VI.

varselect.interp

A vector of indexes of variables selected after "interpretation step".

err.interp

A vector of the mean OOB error rates of the embedded random forest models built during the "interpretation step".

sd.min

The standard deviation of OOB error rates associated with the random forest model attaining the minimum mean OOB error rate during the "interpretation step".

num.varselect.interp

Number of variables selected by "interpretation step".

varselect.pred

A vector of indexes of variables selected after "prediction step".

err.pred

A vector of the mean OOB error rates of the random forest models built during the "prediction step".

mean.jump

The mean jump value computed during the "prediction step".

num.varselect.pred

Number of variables selected by "prediction step".

nmin

Number of times the "minimum value" is multiplied to set threshold value.

nsd

Number of times the standard deviation of the minimum value of err.interp is multiplied.

nmj

Number of times the mean jump is multiplied.

comput.time

Overall computation time.
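
The $x / $ix layout described for ord.imp is the same layout that base R's sort() returns when called with index.return=TRUE. The sketch below uses made-up importances to show how to read such a list; that VSURF builds ord.imp with this exact call is an assumption.

```r
# Hypothetical mean importances for five variables (made-up numbers,
# not VSURF output).
mean_vi <- c(0.02, 0.31, 0.00, 0.12, 0.07)

# sort() with index.return = TRUE returns a list with $x and $ix.
ord_imp <- sort(mean_vi, decreasing = TRUE, index.return = TRUE)

ord_imp$x   # importances in decreasing order
ord_imp$ix  # column positions of the corresponding variables
```

With this layout, ord_imp$ix[1] is the column index of the most important variable, and mean_vi[ord_imp$ix] reproduces ord_imp$x.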

Details

First step ("thresholding step"): first, nfor.thres random forests are computed using the function randomForest with argument importance=TRUE. Then variables are sorted according to their mean variable importance (VI), in decreasing order. This order is kept throughout the procedure. Next, a threshold is computed: min.thres, the minimum predicted value of a pruned CART tree fitted to the curve of the standard deviations of VI. Finally, the actual "thresholding step" is performed: only variables with a mean VI larger than nmin*min.thres are kept.
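
The final rule of the thresholding step can be sketched in a few lines of base R. This is an illustration only: the importances are made-up numbers, and min_thres is simply assumed here rather than obtained from a pruned CART fit.

```r
# Hypothetical inputs (made-up numbers, not VSURF output).
mean_vi   <- c(v1 = 0.02, v2 = 0.31, v3 = 0.001, v4 = 0.12, v5 = 0.07)
min_thres <- 0.01  # assumed value; VSURF derives it from a pruned CART tree
nmin      <- 1

# Sort variables by decreasing mean VI; this order is kept afterwards.
ord <- order(mean_vi, decreasing = TRUE)

# Keep only variables whose mean VI is larger than nmin * min_thres.
varselect_thres <- ord[mean_vi[ord] > nmin * min_thres]

names(mean_vi)[varselect_thres]
```

Raising nmin makes the threshold stricter, so fewer variables survive this step.
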
Second step ("interpretation step"): the variables selected by the first step are considered. nfor.interp embedded random forest models are grown, starting with the random forest built with only the most important variable and ending with all variables selected in the first step. Then, err.min, the minimum mean out-of-bag (OOB) error of these models, and its associated standard deviation sd.min are computed. Finally, the smallest model (and hence its corresponding variables) having a mean OOB error less than err.min + nsd*sd.min is selected.
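
The selection rule of the interpretation step can be sketched as follows. The error values and the standard deviation are made-up numbers standing in for the averaged OOB errors of the nested models; no forest is grown here.

```r
# Hypothetical inputs (made-up numbers, not VSURF output).
varselect_thres <- c(2L, 4L, 5L, 1L, 3L)            # variables, by decreasing VI
err_interp      <- c(0.30, 0.12, 0.08, 0.06, 0.065) # mean OOB error of model k
sd_min          <- 0.025  # assumed sd of the OOB error of the best model
nsd             <- 1

# Minimum mean OOB error over the nested models.
err_min <- min(err_interp)

# Smallest model whose mean OOB error is below err_min + nsd * sd_min.
k <- min(which(err_interp < err_min + nsd * sd_min))

# Its variables are the first k in the importance order.
varselect_interp <- varselect_thres[seq_len(k)]
varselect_interp
```

A larger nsd widens the tolerance around the minimum error and therefore tends to select a smaller model.
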
Third step ("prediction step"): the starting point is the same as in the second step. However, now the variables are added to the model in a stepwise manner. mean.jump, the mean jump value, is calculated using variables that have been left out by the second step, and is set as the mean absolute difference between mean OOB errors of one model and its first following model. Hence a variable is included in the model if the mean OOB error decrease is larger than nmj*mean.jump.
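
The stepwise rule of the prediction step can be sketched as follows. Everything here is illustrative: the error values are made up, stub_error is a hypothetical stand-in for "grow the forests and average their OOB error", and computing the mean jump from the k-th model onward is an assumption about how "variables left out by the second step" is applied.

```r
# Hypothetical inputs (made-up numbers, not VSURF output).
err_interp       <- c(0.30, 0.12, 0.08, 0.06, 0.065)
varselect_interp <- c(2L, 4L, 5L)  # kept by the interpretation step
k                <- length(varselect_interp)
nmj              <- 1

# Mean absolute jump between successive models from the k-th one onward.
mean_jump <- mean(abs(diff(err_interp[k:length(err_interp)])))

# Hypothetical stand-in for the mean OOB error of a forest built on `vars`.
stub_error <- function(vars) 0.40 - sum(c(0, 0.10, 0, 0.18, 0.005)[vars])

# Stepwise inclusion: keep a variable only if it lowers the error
# by more than nmj * mean_jump.
selected    <- varselect_interp[1]
current_err <- stub_error(selected)
for (v in varselect_interp[-1]) {
  candidate_err <- stub_error(c(selected, v))
  if (current_err - candidate_err > nmj * mean_jump) {
    selected    <- c(selected, v)
    current_err <- candidate_err
  }
}
selected
```

In this toy run the third candidate is rejected because its error decrease is smaller than the mean jump, which is how redundant variables drop out of the prediction set.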

References

Genuer, R. and Poggi, J.M. and Tuleau-Malot, C. (2010), Variable selection using random forests, Pattern Recognition Letters 31(14), 2225-2236

Examples

library(VSURF)

# Small run on iris: fewer trees and forests than the defaults, to keep it fast
data(iris)
iris.vsurf <- VSURF(x=iris[,1:4], y=iris[,5], ntree=100, nfor.thres=20,
                    nfor.interp=10, nfor.pred=10)
iris.vsurf

# A more interesting example with toys data (see ?toys)
# (less than 1 min to execute)
data(toys)
toys.vsurf <- VSURF(x=toys$x, y=toys$y)
toys.vsurf
