Selection By Filtering (SBF)

Model fitting after applying univariate filters

sbf(x, ...)

## S3 method for class 'default'
sbf(x, y, sbfControl = sbfControl(), ...)

## S3 method for class 'formula'
sbf(form, data, ..., subset, na.action, contrasts = NULL)

## S3 method for class 'sbf'
predict(object, newdata = NULL, ...)


This function can be used to get resampling estimates for models when simple, filter-based feature selection is applied to the training data.

For each iteration of resampling, the predictor variables are univariately filtered prior to modeling. Performance of this approach is estimated using resampling. The same filter and model are then applied to the entire training set and the final model (and final features) are saved.
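
The procedure described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the simulated data, the p-value filter, and the helper names filter_vars and fit_filtered are all hypothetical, chosen only to show the filter being re-run inside every resample before the model is fit.

```r
## Sketch only: simulated data and hypothetical helpers, base R throughout.
set.seed(1)
n <- 100
p <- 10
x <- as.data.frame(matrix(rnorm(n * p), n, p))  # columns are named V1..V10
y <- 2 * x$V1 - x$V2 + rnorm(n)                 # only V1 and V2 carry signal

## Hypothetical univariate filter: keep predictors whose
## single-variable lm slope has a p-value below alpha
filter_vars <- function(x, y, alpha = 0.05) {
  pvals <- vapply(x,
                  function(col) summary(lm(y ~ col))$coefficients[2, 4],
                  numeric(1))
  names(pvals)[pvals < alpha]
}

## Filter first, then fit a linear model on the surviving predictors
fit_filtered <- function(x, y) {
  keep <- filter_vars(x, y)
  dat <- data.frame(x[, keep, drop = FALSE], outcome = y)
  list(vars = keep, fit = lm(outcome ~ ., data = dat))
}

## 5-fold cross-validation: the filter is repeated within each fold,
## so the hold-out error reflects the selection step as well
k <- 5
folds <- sample(rep(seq_len(k), length.out = n))
rmse <- numeric(k)
kept <- vector("list", k)
for (i in seq_len(k)) {
  train <- folds != i
  m <- fit_filtered(x[train, , drop = FALSE], y[train])
  pred <- predict(m$fit, newdata = x[!train, , drop = FALSE])
  rmse[i] <- sqrt(mean((y[!train] - pred)^2))
  kept[[i]] <- m$vars
}

cv_rmse <- mean(rmse)        # resampled performance estimate
final <- fit_filtered(x, y)  # same filter and model on the full training set
```

Here kept plays the role of the per-resample variable lists, cv_rmse the aggregated result, and final the saved model with its final set of features.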

sbf can be used with "explicit parallelism", where different resamples (e.g. cross-validation groups) are split up and run on multiple machines or processors. By default, sbf will use a single processor on the host machine. As of version 4.99 of this package, parallel processing is handled by the foreach package. To run the resamples in parallel, the code for sbf does not change; instead, a parallel backend is registered with foreach prior to the call to sbf (see the examples below).

The modeling and filtering techniques are specified in sbfControl. Example functions are given in lmSBF.


  • for sbf, an object of class sbf with elements:
  • pred: if sbfControl$saveDetails is TRUE, this is a list of predictions for the hold-out samples at each resampling iteration. Otherwise it is NULL
  • variables: a list of variable names that survived the filter at each resampling iteration
  • results: a data frame of results aggregated over the resamples
  • fit: the final model fit with only the filtered variables
  • optVariables: the names of the variables that survived the filter using the training set
  • call: the function call
  • control: the control object
  • resample: if sbfControl$returnResamp is "all", a data frame of the resampled performance measures. Otherwise, NULL
  • metrics: a character vector of names of the performance measures
  • dots: a list of optional arguments that were passed in
  • For predict.sbf, a vector of predictions.




library(caret)
data(BloodBrain)

## Use a GAM as the filter, then fit a random forest model
RFwithGAM <- sbf(bbbDescr, logBBB,
                 sbfControl = sbfControl(functions = rfSBF,
                                         verbose = FALSE, 
                                         method = "cv"))

predict(RFwithGAM, bbbDescr[1:10,])

## classification example with parallel processing
library(doMC)
data(mdrr)

## Note: if the underlying model also uses foreach, the
## number of cores used will double (along with
## the memory requirements)
registerDoMC(cores = 2)

mdrrDescr <- mdrrDescr[,-nearZeroVar(mdrrDescr)]
mdrrDescr <- mdrrDescr[, -findCorrelation(cor(mdrrDescr), .8)]

filteredNB <- sbf(mdrrDescr, mdrrClass,
                 sbfControl = sbfControl(functions = nbSBF,
                                         verbose = FALSE, 
                                         method = "repeatedcv",
                                         repeats = 5))
Documentation reproduced from package caret, version 5.07-001, License: GPL-2
