The thresholding step roughly eliminates irrelevant variables from the
dataset. This is the first step of the VSURF function. For
refined variable selection, see the other VSURF steps: VSURF_interp
and VSURF_pred.
VSURF_thres(x, ...)

# S3 method for default
VSURF_thres(
  x,
  y,
  mtry = max(floor(ncol(x)/3), 1),
  ntree.thres = 500,
  nfor.thres = 20,
  nmin = 1,
  RFimplem = "randomForest",
  parallel = FALSE,
  clusterType = "PSOCK",
  ncores = parallel::detectCores() - 1,
  verbose = TRUE,
  ntree = NULL,
  ...
)

# S3 method for formula
VSURF_thres(formula, data, ..., na.action = na.fail)
An object of class VSURF_thres, which is a list with the
following components:
A vector of indices of selected variables, sorted according to their mean VI, in decreasing order.
A vector of importance of the varselect.thres variables.
The minimum predicted value of a pruned CART tree fitted to the curve of the standard deviations of VI.
The number of selected variables.
A vector of the variable importance means (over nfor.thres runs), in decreasing order.
The ordering index vector associated with the sorting of the variable importance means.
A vector of standard deviations of the importance of all variables. The order is given by imp.mean.dec.ind.
The mean OOB error rate, obtained by random forests built with all variables.
The predictions of the CART tree fitted to the curve of the standard deviations of VI.
Value of the parameter in the call.
Computation time.
The RF implementation used to run VSURF_thres.
The number of cores used to run VSURF_thres in parallel (NULL if VSURF_thres did not run in parallel).
The type of the cluster used to run VSURF_thres in parallel (NULL if VSURF_thres did not run in parallel).
The original call to VSURF.
Terms associated to the formula (only if formula-type call was used).
A data frame or a matrix of predictors, whose columns represent the variables, or a formula describing the model to be fitted.
Other parameters to be passed on to the randomForest function (see ?randomForest for further information).
A response vector (must be a factor for classification problems and numeric for regression ones).
Number of variables randomly sampled as candidates at each split. Standard parameter of randomForest.
Number of trees of each forest grown.
Number of forests grown.
Number of times the "minimum value" is multiplied to set threshold value. See details below.
Choice of the random forests implementation to use:
"randomForest" (default), "ranger" or "Rborist" (note that if "Rborist" is
chosen, "randomForest" will still be used for the first step
VSURF_thres). If a vector of length 3 is given, each coordinate is
passed to each intermediate function: VSURF_thres, VSURF_interp,
VSURF_pred, in this order.
A logical indicating whether VSURF should run in parallel on
multiple cores (defaults to FALSE). If a vector of length 3 is given,
each coordinate is passed to each intermediate function: VSURF_thres,
VSURF_interp, VSURF_pred, in this order.
Type of the multiple cores cluster used to run VSURF in
parallel. Must be chosen among "PSOCK" (default: SOCKET cluster available
locally on all OS), "FORK" (local too, only available for Linux and Mac
OS), "MPI" (can be used on a remote cluster, which needs the snow and
Rmpi packages installed), "ranger" and "Rborist" for internal
parallelizations of those packages (note that if "Rborist" is
chosen, "SOCKET" will still be used for the first step
VSURF_thres). If a vector of length 2 is given, each
coordinate is passed to each intermediate function: VSURF_thres,
VSURF_interp, in this order.
Number of cores to use. Default is set to the number of cores detected by R minus 1.
A logical indicating whether information about the method's progress (including progress bars for each step) must be printed (defaults to TRUE). Adds a small overhead.
(deprecated) Number of trees in each forest grown for "thresholding step".
A data frame containing the variables in the model.
A function to specify the action to be taken if NAs are
found. (NOTE: If given, this argument must be named, and as in
randomForest it is only used with the formula-type call.)
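As a rough illustration of the parallel-related arguments described above, the call below runs the thresholding step on the iris data with a local "PSOCK" cluster. It is only a sketch: it assumes the VSURF package and its parallel dependencies are installed and that at least two cores are available, and the reduced ntree.thres and nfor.thres values are arbitrary choices made to keep the run short.

```r
library(VSURF)

data(iris)

# Thresholding step on two cores with a local PSOCK cluster.
# ntree.thres and nfor.thres are lowered here only to shorten the run.
iris.par <- VSURF_thres(
  iris[, 1:4], iris[, 5],
  ntree.thres = 100,
  nfor.thres = 10,
  parallel = TRUE,
  clusterType = "PSOCK",
  ncores = 2,
  verbose = FALSE
)

# Indices of the variables kept by the thresholding step
iris.par$varselect.thres
```

On a small dataset such as iris, the cost of starting the cluster may outweigh any gain, so a sequential run can be just as fast.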
Robin Genuer, Jean-Michel Poggi and Christine Tuleau-Malot
First, nfor.thres random forests are computed using the function
randomForest with argument importance=TRUE, and our choice of
default values for ntree and mtry (which are higher than the defaults
in randomForest, to get a more stable variable importance
measure). Then variables are sorted according to their mean variable
importance (VI), in decreasing order. This order is kept throughout the
procedure. Next, a threshold is computed: min.thres, the minimum
predicted value of a pruned CART tree fitted to the curve of the standard
deviations of VI. Finally, the actual thresholding is performed: only
variables with a mean VI larger than nmin * min.thres are kept.
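The procedure above can be sketched in a few lines of R. This is an illustration under stated assumptions, not the package's source code: the importance matrix vi is simulated (it stands in for the nfor.thres random forest runs), the names true_vi, vi, curve and selected are hypothetical, and the tree is pruned with a fixed, arbitrarily chosen cp value rather than with whatever pruning rule VSURF_thres applies internally.

```r
library(rpart)

set.seed(1)
n_var <- 30   # number of candidate variables (hypothetical)
nfor  <- 20   # plays the role of nfor.thres
nmin  <- 1

# Simulated VI values: 5 informative variables, 25 noise variables.
# Rows are forests, columns are variables.
true_vi <- c(seq(5, 1, length.out = 5), rep(0, n_var - 5))
vi <- sapply(true_vi, function(m) rnorm(nfor, mean = m, sd = 0.05 + 0.1 * m))

# Mean and standard deviation of VI over the runs
imp_mean <- colMeans(vi)
imp_sd   <- apply(vi, 2, sd)

# Sort variables by mean VI, in decreasing order
ord          <- order(imp_mean, decreasing = TRUE)
imp_mean_dec <- imp_mean[ord]
imp_sd_dec   <- imp_sd[ord]

# CART tree fitted to the curve of the sorted standard deviations
curve  <- data.frame(rank = seq_along(imp_sd_dec), sd = imp_sd_dec)
tree   <- rpart(sd ~ rank, data = curve)
pruned <- prune(tree, cp = 0.05)   # fixed cp: an assumption of this sketch

# Threshold: minimum predicted value of the pruned tree
min_thres <- min(predict(pruned, curve))

# Keep variables whose mean VI exceeds nmin * min_thres
selected <- ord[imp_mean_dec > nmin * min_thres]
selected
```

Raising nmin makes the threshold stricter, so fewer variables survive this first step.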
Genuer, R. and Poggi, J.M. and Tuleau-Malot, C. (2010), Variable selection using random forests, Pattern Recognition Letters 31(14), 2225-2236
Genuer, R. and Poggi, J.M. and Tuleau-Malot, C. (2015), VSURF: An R Package for Variable Selection Using Random Forests, The R Journal 7(2):19-33
VSURF, tune
data(iris)
iris.thres <- VSURF_thres(iris[,1:4], iris[,5])
iris.thres
if (FALSE) {
# A more interesting example with toys data (see ?toys)
# (a few minutes to execute)
data(toys)
toys.thres <- VSURF_thres(toys$x, toys$y)
toys.thres
}
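The formula method listed in the usage can be exercised in the same way. The sketch below assumes the VSURF package is installed; the reduced ntree.thres and nfor.thres values are arbitrary and only serve to shorten the run, and the seed is set purely for reproducibility.

```r
library(VSURF)

data(iris)
set.seed(2024)

# Formula-type call: Species is the response, all other columns are predictors
iris.thres.f <- VSURF_thres(
  Species ~ ., data = iris,
  ntree.thres = 100,
  nfor.thres = 10,
  verbose = FALSE
)

# Indices of the variables kept by the thresholding step
iris.thres.f$varselect.thres
```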