Control Object for Selection By Filtering (SBF)
Controls the execution of models with simple filters for feature selection
sbfControl(functions = NULL, method = "boot", saveDetails = FALSE, number = ifelse(method %in% c("cv", "repeatedcv"), 10, 25), repeats = ifelse(method %in% c("cv", "repeatedcv"), 1, number), verbose = FALSE, returnResamp = "final", p = 0.75, index = NULL, indexOut = NULL, timingSamps = 0, seeds = NA, allowParallel = TRUE, multivariate = FALSE)
- functions: a list of functions for model fitting, prediction and variable filtering (see Details below)
- method: the external resampling method: boot, cv, LOOCV or LGOCV (for repeated training/test splits)
- number: either the number of folds or the number of resampling iterations
- repeats: for repeated k-fold cross-validation only: the number of complete sets of folds to compute
- saveDetails: a logical to save the predictions and variable importances from the selection process
- verbose: a logical to print a log for each external resampling iteration
- returnResamp: a character string indicating how much of the resampled summary metrics should be saved. Values can be "final" or "none"
- p: for leave-group-out cross-validation: the training percentage
- index: a list with elements for each external resampling iteration. Each list element is the sample rows used for training at that iteration.
- indexOut: a list (the same length as index) that dictates which samples are held out for each resample. If NULL, then the unique set of samples not contained in index is used.
- timingSamps: the number of training set samples that will be used to measure the time for predicting samples (zero indicates that the prediction time should not be estimated).
- seeds: an optional set of integers that will be used to set the seed at each resampling iteration. This is useful when the models are run in parallel. A value of NA will stop the seed from being set within the worker processes, while a value of NULL will set the seeds using a random set of integers. Alternatively, a vector of integers can be used. The vector should have B+1 elements, where B is the number of resamples. See the Examples section below.
- allowParallel: if a parallel backend is loaded and available, should the function use it?
- multivariate: a logical; should all the columns of x be exposed to the score function at once?
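As a rough illustration of the index, indexOut and seeds arguments, the sketch below builds them by hand in base R. The sizes n and B and all object names are hypothetical, chosen only for the example. The call to sbfControl is attempted only when caret is installed.

```r
set.seed(42)
n <- 50  # hypothetical number of training rows
B <- 5   # hypothetical number of bootstrap resamples

# One list element per resample: the row numbers used for training
index <- lapply(seq_len(B), function(i) sort(sample.int(n, n, replace = TRUE)))
names(index) <- paste0("Resample", seq_len(B))

# What indexOut = NULL amounts to: the rows absent from the matching index element
held_out <- lapply(index, function(rows) setdiff(seq_len(n), rows))

# B + 1 integer seeds, as described for the seeds argument
seeds <- sample.int(100000, B + 1)

# Pass the pieces to sbfControl when caret is available
if (requireNamespace("caret", quietly = TRUE)) {
  ctrl <- caret::sbfControl(functions = caret::rfSBF, method = "boot",
                            number = B, index = index, seeds = seeds)
}
```

Supplying index explicitly keeps the resamples identical across runs and across models, which can make comparisons easier to interpret.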
More details on this function can be found at http://topepo.github.io/caret/featureselection.html#filter.
Simple filter-based feature selection requires functions to be specified for some operations.
The fit function builds the model based on the current data set. The arguments for the function must be:
- x: the current training set of predictor data with the appropriate subset of variables (i.e. after filtering)
- y: the current outcome data (either a numeric or factor vector)
- ...: optional arguments to pass to the fit function in the call to sbf
The function should return a model object that can be used to generate predictions.
The pred function returns a vector of predictions (numeric or factors) from the current model. The arguments are:
- object: the model generated by the fit function
- x: the current set of predictors for the held-back samples
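The fit and pred contracts above can be sketched with a plain linear model. The names lmFit and lmPred and the simulated data are hypothetical. This is one possible pair of functions that satisfies the argument lists described, not the definitions shipped with the package.

```r
# Hypothetical fit function: receives the filtered predictors and the outcome
lmFit <- function(x, y, ...) {
  dat <- as.data.frame(x)
  dat$.outcome <- y
  lm(.outcome ~ ., data = dat, ...)
}

# Hypothetical pred function: returns a plain numeric vector of predictions
lmPred <- function(object, x) {
  unname(predict(object, newdata = as.data.frame(x)))
}

# Small simulated check of both functions
set.seed(1)
train_x <- data.frame(a = rnorm(20), b = rnorm(20))
train_y <- 2 * train_x$a + rnorm(20, sd = 0.1)
test_x  <- data.frame(a = rnorm(5), b = rnorm(5))

model <- lmFit(train_x, train_y)
preds <- lmPred(model, test_x)
```

The two functions would then be placed in the fit and pred elements of the list given to the functions argument.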
The score function is used to return scores with names for each predictor (such as a p-value). Inputs are:
- x: the predictors for the training samples. If multivariate is TRUE in sbfControl, this will be the full predictor matrix. Otherwise it is a vector for a specific predictor.
- y: the current training outcomes
When multivariate is TRUE, the score function should return a named vector where length(scores) == ncol(x). Otherwise, the function's output should be a single value. Univariate examples are given by anovaScores for classification and gamScores for regression, and by the example below.
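To make the two return shapes concrete, here is a minimal sketch of a univariate and a multivariate score function for a numeric outcome. The names corScore and corScoreAll are hypothetical, and the correlation-test p-value is an arbitrary choice of score for illustration.

```r
# Univariate case (multivariate = FALSE): x is a single predictor vector
# and the function returns one value (here a correlation-test p-value)
corScore <- function(x, y) {
  cor.test(x, y)$p.value
}

# Multivariate case (multivariate = TRUE): x is the full predictor matrix
# and the function returns one named score per column
corScoreAll <- function(x, y) {
  scores <- apply(as.matrix(x), 2, corScore, y = y)
  names(scores) <- colnames(x)
  scores
}

# Simulated data with three predictors; only p1 is related to y
set.seed(2)
x <- matrix(rnorm(60), ncol = 3, dimnames = list(NULL, c("p1", "p2", "p3")))
y <- x[, "p1"] + rnorm(20, sd = 0.5)

single <- corScore(x[, "p1"], y)  # a single value
scores <- corScoreAll(x, y)       # named vector, length(scores) == ncol(x)
```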
The filter function is used to return a logical vector with names for each predictor (TRUE indicates that the predictor should be retained). Inputs are:
- score: the output of the score function
- x: the predictors for the training samples
- y: the current training outcomes
The function should return a named logical vector.
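A filter function can be as small as a threshold on the scores. The sketch below assumes the scores are p-values and keeps predictors at or below 0.05. The name pFilter, the cutoff and the example scores are all hypothetical.

```r
# Hypothetical filter: keep predictors whose score (a p-value) is at most 0.05
pFilter <- function(score, x, y) {
  keep <- score <= 0.05
  names(keep) <- names(score)  # make sure the result stays named
  keep
}

# Made-up scores for three predictors
example_scores <- c(p1 = 0.001, p2 = 0.20, p3 = 0.04)
kept <- pFilter(example_scores, x = NULL, y = NULL)
# kept is TRUE for p1 and p3, FALSE for p2
```

The x and y arguments are unused here but are kept in the signature because the filter is called with all three inputs.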
The web page http://topepo.github.io/caret/ has more details and examples related to this function.
A list that echoes the specified arguments.
## Not run:
library(caret)
data(BloodBrain)

## Use a GAM as the filter, then fit a random forest model
set.seed(1)
RFwithGAM <- sbf(bbbDescr, logBBB,
                 sbfControl = sbfControl(functions = rfSBF,
                                         verbose = FALSE,
                                         seeds = sample.int(100000, 11),
                                         method = "cv"))
RFwithGAM

## A simple example for multivariate scoring
rfSBF2 <- rfSBF
rfSBF2$score <- function(x, y) apply(x, 2, rfSBF$score, y = y)

set.seed(1)
RFwithGAM2 <- sbf(bbbDescr, logBBB,
                  sbfControl = sbfControl(functions = rfSBF2,
                                          verbose = FALSE,
                                          seeds = sample.int(100000, 11),
                                          method = "cv",
                                          multivariate = TRUE))
RFwithGAM2
## End(Not run)