Backwards Feature Selection Helper Functions
Ancillary functions for backwards selection
pickSizeTolerance(x, metric, tol = 1.5, maximize)
pickSizeBest(x, metric, maximize)
caretFuncs lmFuncs rfFuncs treebagFuncs ldaFuncs nbFuncs gamFuncs lrFuncs
- x: a matrix or data frame with the performance metric of interest
- metric: a character string with the name of the performance metric that should be used to choose the appropriate number of variables
- maximize: a logical; should the metric be maximized?
- tol: a scalar to denote the acceptable difference in optimal performance (see Details below)
- a list of data frames with variables
- an integer for the number of variables to retain
This page describes the functions that are used in backwards selection (aka recursive
feature elimination). The functions described here are passed to the algorithm via the
functions argument of rfeControl.
See rfeControl for details on how these functions should be defined.
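As a rough illustration of how a helper list reaches the algorithm, the sketch below copies rfFuncs, swaps in a different subset-size rule, and hands the result to rfeControl. It assumes that the list element holding the size rule is named selectSize and is called with x, metric and maximize; customFuncs, the 2 percent tolerance and the 5-fold resampling settings are hypothetical choices made only for this example. The code is guarded so that it does nothing if caret is not installed.

```r
# Sketch only: override the subset-size rule in a copy of rfFuncs.
if (requireNamespace("caret", quietly = TRUE)) {
  # Work on a copy so the list shipped with the package is left untouched.
  customFuncs <- caret::rfFuncs

  # Assumption: the size rule lives in the 'selectSize' element.
  # This wrapper accepts up to a 2 percent loss (an illustrative value).
  customFuncs$selectSize <- function(x, metric, maximize) {
    caret::pickSizeTolerance(x, metric = metric, tol = 2, maximize = maximize)
  }

  # Pass the modified list through the 'functions' argument.
  ctrl <- caret::rfeControl(functions = customFuncs, method = "cv", number = 5)
}
```

The resulting ctrl object would then be supplied to rfe through its rfeControl argument.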
The 'pick' functions are used to find the appropriate subset size.
pickSizeBest will find the position
associated with the numerically best value (see the
maximize argument to help define this).
pickSizeTolerance picks the lowest position (i.e. the smallest
subset size) that has no more than a tol percent loss in
performance. When maximizing, it calculates (O-X)/O*100, where X is
the set of performance values and O is max(X). This is the percent
loss. When X is to be minimized, O is min(X) and it uses (X-O)/O*100
(so that values greater than O have a positive "loss"). The function
finds the smallest subset size that has a percent loss less than tol.
Both of the 'pick' functions assume that the data are sorted from smallest subset size to largest.
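The two selection rules described above can be sketched in base R as follows. This is not the package's implementation: percent_loss, sketch_pick_best and sketch_pick_tolerance are hypothetical names, the input is a plain numeric vector rather than a data frame, and a loss exactly equal to the tolerance is treated as acceptable, which is an assumption about the boundary case.

```r
# Hypothetical helpers illustrating the rules; not caret's own code.
# 'perf' is a numeric vector ordered from the smallest subset size to the largest.

# Percent loss relative to the best value O (max when maximizing, min otherwise).
percent_loss <- function(perf, maximize) {
  if (maximize) {
    best <- max(perf)
    (best - perf) / best * 100
  } else {
    best <- min(perf)
    (perf - best) / best * 100
  }
}

# Position of the numerically best value (first one if there are ties).
sketch_pick_best <- function(perf, maximize) {
  if (maximize) which.max(perf) else which.min(perf)
}

# Smallest position whose percent loss is within the tolerance.
# Assumption: a loss exactly equal to 'tol' counts as acceptable.
sketch_pick_tolerance <- function(perf, tol = 1.5, maximize) {
  loss <- percent_loss(perf, maximize)
  min(which(loss <= tol))
}

rmse <- c(1.2, 1.1, 1.05, 1.01, 1.01, 1.03, 1.00)
sketch_pick_best(rmse, maximize = FALSE)       # 7
sketch_pick_tolerance(rmse, maximize = FALSE)  # 4

rsq <- c(0.4, 0.6, 0.94, 0.95, 0.95, 0.95, 0.95)
sketch_pick_best(rsq, maximize = TRUE)         # 4
sketch_pick_tolerance(rsq, maximize = TRUE)    # 3
```

Because both helpers return the first qualifying position, they rely on the same ordering assumption as the 'pick' functions: the values must run from the smallest subset size to the largest.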
## caret provides the pick functions; xyplot comes from lattice
library(caret)
library(lattice)

## For picking subset sizes:
## Minimize the RMSE
example <- data.frame(RMSE = c(1.2, 1.1, 1.05, 1.01, 1.01, 1.03, 1.00),
                      Variables = 1:7)

## Percent Loss in performance (positive)
example$PctLoss <- (example$RMSE - min(example$RMSE))/min(example$RMSE)*100

xyplot(RMSE ~ Variables, data = example)
xyplot(PctLoss ~ Variables, data = example)

absoluteBest <- pickSizeBest(example, metric = "RMSE", maximize = FALSE)
## Uses the default tolerance, tol = 1.5
withinTol <- pickSizeTolerance(example, metric = "RMSE", maximize = FALSE)

cat("numerically optimal:",
    example$RMSE[absoluteBest],
    "RMSE in position",
    absoluteBest, "\n")
cat("Accepting a 1.5 pct loss:",
    example$RMSE[withinTol],
    "RMSE in position",
    withinTol, "\n")

## Example where we would like to maximize
example2 <- data.frame(Rsquared = c(0.4, 0.6, 0.94, 0.95, 0.95, 0.95, 0.95),
                       Variables = 1:7)

## Percent Loss in performance (positive)
example2$PctLoss <- (max(example2$Rsquared) - example2$Rsquared)/max(example2$Rsquared)*100

xyplot(Rsquared ~ Variables, data = example2)
xyplot(PctLoss ~ Variables, data = example2)

absoluteBest2 <- pickSizeBest(example2, metric = "Rsquared", maximize = TRUE)
## Uses the default tolerance, tol = 1.5
withinTol2 <- pickSizeTolerance(example2, metric = "Rsquared", maximize = TRUE)

cat("numerically optimal:",
    example2$Rsquared[absoluteBest2],
    "R^2 in position",
    absoluteBest2, "\n")
cat("Accepting a 1.5 pct loss:",
    example2$Rsquared[withinTol2],
    "R^2 in position",
    withinTol2, "\n")