varSelImpSpecRF: Variable selection using the "importance spectrum"

Description

Perform variable selection based on a simple heuristic using the importance spectrum of the original data compared to the importance spectra from the same data with the class labels randomly permuted.

Usage

varSelImpSpecRF(forest, xdata = NULL, Class = NULL,
                randomImps = NULL,
                threshold = 0.1,
                numrandom = 20,
                whichImp = "impsUnscaled",
                usingCluster = TRUE,
                TheCluster = NULL, ...)

Arguments

forest

A previously fitted random forest (see randomForest).

xdata

A data frame or matrix, with subjects/cases in rows and variables in columns. NAs not allowed.

Class

The dependent variable; must be a factor.

randomImps

A list with the same structure as the object returned by randomVarImpsRF.

threshold

The threshold for the selection of variables. See Details.

numrandom

The number of random permutations of the class labels.

whichImp

One of impsUnscaled, impsScaled, impsGini, that correspond, respectively, to the (unscaled) mean decrease in accuracy, the scaled mean decrease in accuracy, and the Gini index. See below and randomForest, importance and the references for further explanations of the measures of variable importance.

usingCluster

If TRUE, use a cluster to parallelize the calculations.

TheCluster

The name of the cluster, if one is used.

…

Not used.

Value

A vector with the names of the selected variables, ordered by decreasing importance.

Details

You can either pass a valid randomImps object, obtained from a previous call to randomVarImpsRF, or pass a covariate data frame (xdata) and a dependent variable (Class), which will then be used to obtain the random importances. The former is preferred for normal use, because this function does not return the computed random variable importances, and this computation can be lengthy. If you pass randomImps together with xdata and Class, randomImps will be used.

To select variables, start by ordering the variable importances from largest (\(i = 1\)) to smallest (\(i = p\), where \(p\) is the number of variables), both for the original data and for each of the data sets with permuted class labels. (The ordering is done in each data set independently.) Compute \(q_i\), the \(1 - threshold\) quantile of the ordered variable importances from the permuted data at ordered position \(i\). Then, starting from \(i = 1\), let \(i_a\) be the first \(i\) for which the variable importance from the original data is smaller than \(q_i\). Select all variables from \(i = 1\) to \(i = i_a - 1\).
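The heuristic above can be sketched in base R. This is only an illustration of the steps as described, not the code used by varSelRF: the function name select_by_spectrum and its arguments orig_imp and perm_imp are hypothetical, and the sketch assumes the importances are already available as a named vector and a plain numeric matrix with one column per permuted data set.

```r
# Illustrative sketch only; not the varSelRF source code.
# select_by_spectrum(), orig_imp and perm_imp are hypothetical names.
select_by_spectrum <- function(orig_imp, perm_imp, threshold = 0.1) {
  # orig_imp: named numeric vector of importances from the original data.
  # perm_imp: numeric matrix with one row per variable and one column
  #           per data set with permuted class labels.
  p <- length(orig_imp)
  stopifnot(nrow(perm_imp) == p, !is.null(names(orig_imp)))

  # Order each importance spectrum independently, largest first.
  orig_sorted <- sort(orig_imp, decreasing = TRUE)
  perm_sorted <- apply(perm_imp, 2, sort, decreasing = TRUE)
  perm_sorted <- matrix(perm_sorted, nrow = p)  # keep matrix shape if p == 1

  # q_i: the (1 - threshold) quantile across permutations at each rank i.
  q <- apply(perm_sorted, 1, quantile, probs = 1 - threshold, names = FALSE)

  # i_a: first rank at which the original importance is smaller than q_i.
  below <- which(orig_sorted < q)
  i_a <- if (length(below) > 0) below[1] else p + 1

  # Select ranks 1 to i_a - 1 (possibly none).
  names(orig_sorted)[seq_len(i_a - 1)]
}

# Toy data: two strong variables and two weak ones, three permutations.
orig <- c(v1 = 5.0, v2 = 3.0, v3 = 0.2, v4 = 0.1)
perm <- matrix(c(1.0, 0.5, 0.4, 0.3,
                 0.9, 0.6, 0.3, 0.2,
                 1.1, 0.4, 0.5, 0.1), nrow = 4)
selected <- select_by_spectrum(orig, perm, threshold = 0.1)
print(selected)  # v1 and v2: v3 is the first rank that falls below q_i
```

Note that selection stops at the first rank that falls below its quantile, so a variable further down the ordering is never selected even if its importance exceeds the corresponding \(q_i\).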

References

Breiman, L. (2001) Random forests. Machine Learning, 45, 5--32.

Diaz-Uriarte, R., Alvarez de Andres, S. (2005) Variable selection from random forests: application to gene expression data. Tech. report. http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html

Friedman, J., Meulman, J. (2005) Clustering objects on subsets of attributes (with discussion). J. Royal Statistical Society, Series B, 66, 815--850.

Examples


# NOT RUN {
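library(randomForest)  # randomForest()
library(varSelRF)      # randomVarImpsRF(), varSelImpSpecRF()

## Simulated data: 45 cases, 30 variables; only the first two
## variables differ between classes A and B.
x <- matrix(rnorm(45 * 30), ncol = 30)
x[1:20, 1:2] <- x[1:20, 1:2] + 2
cl <- factor(c(rep("A", 20), rep("B", 25)))

## Fit the forest, then compute importances under permuted class labels
rf <- randomForest(x, cl, ntree = 200, importance = TRUE)
rf.rvi <- randomVarImpsRF(x, cl,
                          rf,
                          numrandom = 20,
                          usingCluster = FALSE)
varSelImpSpecRF(rf, randomImps = rf.rvi)
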
# }
# NOT RUN {
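## Identical, but using a cluster
library(parallel)  # makeCluster(), clusterSetRNGStream(), clusterEvalQ()

## Start two PSOCK workers, seed their RNG streams, load varSelRF on each
psockCL <- makeCluster(2, "PSOCK")
clusterSetRNGStream(psockCL, iseed = 456)
clusterEvalQ(psockCL, library(varSelRF))

rf.rvi <- randomVarImpsRF(x, cl,
                          rf,
                          numrandom = 20,
                          usingCluster = TRUE,
                          TheCluster = psockCL)
varSelImpSpecRF(rf, randomImps = rf.rvi)
stopCluster(psockCL)  # release the workers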

# }