sbfControl: Control Object for Selection By Filtering (SBF)

Description

Controls the execution of models with simple filters for feature selection.

Usage

sbfControl(functions = NULL, method = "boot", saveDetails = FALSE,
  number = ifelse(method %in% c("cv", "repeatedcv"), 10, 25),
  repeats = ifelse(method %in% c("cv", "repeatedcv"), 1, number),
  verbose = FALSE, returnResamp = "final", p = 0.75, index = NULL,
  indexOut = NULL, timingSamps = 0, seeds = NA, allowParallel = TRUE,
  multivariate = FALSE)

Arguments

functions

a list of functions for model fitting, prediction and variable filtering (see Details below)

method

The external resampling method: boot, cv, LOOCV or LGOCV (for repeated training/test splits)

saveDetails

a logical to save the predictions and variable importances from the selection process

number

Either the number of folds or number of resampling iterations

repeats

For repeated k-fold cross-validation only: the number of complete sets of folds to compute

verbose

a logical to print a log for each external resampling iteration

returnResamp

A character string indicating how much of the resampled summary metrics should be saved. Values can be "final" or "none"

p

For leave-group-out cross-validation: the training percentage

index

a list with elements for each external resampling iteration. Each list element is the sample rows used for training at that iteration.

indexOut

a list (the same length as index) that dictates which samples are held out for each resample. If NULL, then the unique set of samples not contained in index is used.

timingSamps

the number of training set samples that will be used to measure the time for predicting samples (zero indicates that the prediction time should not be estimated).

seeds

an optional set of integers that will be used to set the seed at each resampling iteration. This is useful when the models are run in parallel. A value of NA will stop the seed from being set within the worker processes while a value of NULL will set the seeds using a random set of integers. Alternatively, a vector of integers can be used. The vector should have B+1 elements where B is the number of resamples. See the Examples section below.

allowParallel

if a parallel backend is loaded and available, should the function use it?

multivariate

a logical; should all the columns of x be exposed to the score function at once?
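
To illustrate the formats expected by index, indexOut and seeds, the following base-R sketch builds them by hand for a small hold-out scheme. The sample size, the number of resamples and the object names (n, B, fold_id) are assumptions made for this example only.

```r
set.seed(42)
n <- 20                                              # hypothetical number of samples
B <- 5                                               # hypothetical number of resamples
fold_id <- sample(rep(seq_len(B), length.out = n))   # random fold labels

# index: rows used for training in each resample
index <- lapply(seq_len(B), function(k) which(fold_id != k))
names(index) <- paste0("Fold", seq_len(B))

# indexOut: rows held out in each resample (same length as index)
indexOut <- lapply(index, function(train_rows) setdiff(seq_len(n), train_rows))

# seeds: B + 1 integers, as described for the seeds argument
seeds <- sample.int(100000, B + 1)

# These objects could then be passed on, e.g.
# sbfControl(index = index, indexOut = indexOut, seeds = seeds)
```

Leaving indexOut as NULL should give the same hold-out sets here, since each one is simply the complement of the corresponding training rows.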

Value

a list that echoes the specified arguments

Details

More details on this function can be found at http://topepo.github.io/caret/feature-selection-using-univariate-filters.html.

Simple filter-based feature selection requires functions to be specified for some operations.

The fit function builds the model based on the current data set. The arguments for the function must be:

x: the current training set of predictor data with the appropriate subset of variables (i.e. after filtering)
y: the current outcome data (either a numeric or factor vector)
...: optional arguments to pass to the fit function in the call to sbf

The function should return a model object that can be used to generate predictions.
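
A minimal sketch of a fit function with the signature described above, using a linear model from base R. The name lm_fit and the toy data are hypothetical; this is not the fit function shipped in lmSBF, and it does not handle the case where filtering leaves no predictors.

```r
# Hypothetical fit function with the (x, y, ...) signature
lm_fit <- function(x, y, ...) {
  dat <- as.data.frame(x)
  dat$.outcome <- y
  # Regress the outcome on every retained predictor
  lm(.outcome ~ ., data = dat, ...)
}

# Toy data: the outcome depends on predictor a only
set.seed(1)
x <- data.frame(a = rnorm(30), b = rnorm(30))
y <- 2 * x$a + rnorm(30, sd = 0.1)

model <- lm_fit(x, y)
coef(model)
```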

The pred function returns a vector of predictions (numeric or factors) from the current model. The arguments are:

object: the model generated by the fit function
x: the current set of predictors for the held-back samples
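
A sketch of a matching pred function for a linear model, assuming the model was fit with lm. The name lm_pred and the toy data are hypothetical.

```r
# Hypothetical pred function: (object, x) in, plain vector of predictions out
lm_pred <- function(object, x) {
  unname(predict(object, newdata = as.data.frame(x)))
}

# Toy training data and model
set.seed(1)
train <- data.frame(a = rnorm(30), b = rnorm(30))
train$y <- 2 * train$a + rnorm(30, sd = 0.1)
model <- lm(y ~ a + b, data = train)

# Predict three held-back samples
held_back <- data.frame(a = c(-1, 0, 1), b = c(0, 0, 0))
preds <- lm_pred(model, held_back)
preds
```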

The score function is used to return scores with names for each predictor (such as a p-value). Inputs are:

x: the predictors for the training samples. If sbfControl()$multivariate is TRUE, this will be the full predictor matrix. Otherwise it is a vector for a specific predictor.
y: the current training outcomes

When sbfControl()$multivariate is TRUE, the score function should return a named vector where length(scores) == ncol(x). Otherwise, the function's output should be a single value. Univariate examples are given by anovaScores for classification and gamScores for regression and the example below.
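
The two return shapes can be sketched with a simple correlation-test p-value as the score for a numeric outcome. The names cor_score and cor_score_mv are hypothetical and the choice of cor.test is an assumption for illustration; it is not what anovaScores or gamScores compute.

```r
# Hypothetical univariate score: one predictor in, a single p-value out
cor_score <- function(x, y) {
  cor.test(x, y)$p.value
}

# Hypothetical multivariate score: full predictor set in,
# one named score per column out
cor_score_mv <- function(x, y) {
  apply(x, 2, cor_score, y = y)
}

# Toy data: only the first predictor is related to the outcome
set.seed(1)
x <- data.frame(signal = rnorm(50), noise = rnorm(50))
y <- x$signal + rnorm(50, sd = 0.5)

single <- cor_score(x$signal, y)   # a single value
scores <- cor_score_mv(x, y)       # named vector, length(scores) == ncol(x)
scores
```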

The filter function is used to return a logical vector with names for each predictor (TRUE indicates that the predictor should be retained). Inputs are:

score: the output of the score function
x: the predictors for the training samples
y: the current training outcomes

The function should return a named logical vector.
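
As a sketch, a filter that retains predictors whose score (assumed here to be a p-value) does not exceed a cutoff could look like this. The name pvalue_filter, the 0.05 cutoff and the score values are hypothetical.

```r
# Hypothetical filter: keep predictors whose p-value is at most 0.05
pvalue_filter <- function(score, x, y) {
  keep <- score <= 0.05
  names(keep) <- names(score)   # make sure the result stays named
  keep
}

# Made-up scores for three predictors; x and y are not needed by this filter
scores <- c(signal = 0.001, noise = 0.40, weak = 0.05)
keep <- pvalue_filter(scores, x = NULL, y = NULL)
keep
```

The x and y arguments are unused in this sketch but remain in the signature, so a filter can also take the data into account (for example, to cap the number of retained predictors relative to the sample size).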

Examples of these functions are included in the package: caretSBF, lmSBF, rfSBF, treebagSBF, ldaSBF and nbSBF.

The web page http://topepo.github.io/caret/ has more details and examples related to this function.

Examples

# NOT RUN {
library(caret)
data(BloodBrain)

## Use a GAM as the filter, then fit a random forest model
set.seed(1)
RFwithGAM <- sbf(bbbDescr, logBBB,
                 sbfControl = sbfControl(functions = rfSBF,
                                         verbose = FALSE,
                                         seeds = sample.int(100000, 11),
                                         method = "cv"))
RFwithGAM


## A simple example for multivariate scoring
rfSBF2 <- rfSBF
rfSBF2$score <- function(x, y) apply(x, 2, rfSBF$score, y = y)

set.seed(1)
RFwithGAM2 <- sbf(bbbDescr, logBBB,
                  sbfControl = sbfControl(functions = rfSBF2,
                                          verbose = FALSE,
                                          seeds = sample.int(100000, 11),
                                          method = "cv",
                                          multivariate = TRUE))
RFwithGAM2


# }
