SuperLearner (version 2.0-9)

CV.SuperLearner: Function to get V-fold cross-validated risk estimate for super learner

Description

Function to get V-fold cross-validated risk estimate for super learner. This function simply splits the data into V folds and then calls SuperLearner. Most of the arguments are passed directly to SuperLearner.

Usage

CV.SuperLearner(Y, X, V = 20, family = gaussian(), SL.library, 
  method = "method.NNLS", id = NULL, verbose = FALSE, 
  control = list(saveFitLibrary = FALSE), cvControl = list(), 
  obsWeights = NULL, saveAll = TRUE, parallel = "seq")

Arguments

Y
The outcome.
X
The covariates.
V
The number of folds for CV.SuperLearner. This is not the number of folds for SuperLearner. The number of folds for SuperLearner is controlled with cvControl.
family
Currently allows gaussian or binomial to describe the error distribution. Link function information will be ignored and should be contained in the method argument below.
SL.library
Either a character vector of prediction algorithms or a list containing character vectors. See details below for examples on the structure; a short sketch also follows the Arguments list. A list of functions included in the SuperLearner package can be found with listWrappers().
method
A list (or a function to create a list) containing details on estimating the coefficients for the super learner and the model to combine the individual algorithms in the library. See ?method.template for details. Currently, the built-in options include "method.NNLS" (the default, non-negative least squares), "method.NNLS2", and "method.NNloglik".
id
Optional cluster identification variable. For the cross-validation splits, id forces observations in the same cluster to be in the same validation fold. id is passed to the prediction and screening algorithms in SL.library, but be sure to check the individual wrappers, as many of them ignore (or cannot use) the information.
verbose
Logical; TRUE for printing progress during the computation (helpful for debugging).
control
A list of parameters to control the estimation process. Parameters include saveFitLibrary and trimLogit. See SuperLearner.control for details.
cvControl
A list of parameters to control the cross-validation process. Parameters include V, stratifyCV, shuffle and validRows. See SuperLearner.CV.control for details (the sketch following the Arguments list shows one example).
obsWeights
Optional observation weights variable. As with id above, obsWeights is passed to the prediction and screening algorithms, but many of the built-in wrappers ignore (or can't use) the information. If you are using observation weights, make sure the library you specify uses the information.
saveAll
Logical; Should the entire SuperLearner object be saved for each fold?
parallel
Options for parallel computation of the V-fold step. Use "seq" (the default) for sequential computation. Use parallel = 'multicore' to run the V-fold step with mclapply (note that the inner SuperLearner() calls will still be sequential); set options(mc.cores = ...) to control the number of cores. Alternatively, parallel can be a snow cluster object, in which case the V-fold step uses parLapply.
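
A minimal sketch of how the SL.library and cvControl arguments can be structured, using toy data; the wrappers named here (SL.glm, SL.mean, screen.corP) ship with the package, and listWrappers() shows the full set:

library(SuperLearner)
set.seed(1)
n <- 200
X <- data.frame(matrix(rnorm(n * 5), nrow = n))
Y <- X[, 1] + rnorm(n)

# character-vector form: each algorithm is fit on all columns of X
lib_vec <- c("SL.glm", "SL.mean")

# list form: pair a prediction algorithm with one or more screening algorithms
lib_list <- list("SL.glm",
                 c("SL.glm", "screen.corP"),   # SL.glm on the variables kept by screen.corP
                 "SL.mean")

# V sets the CV.SuperLearner folds; cvControl is passed on to the inner SuperLearner
fit <- CV.SuperLearner(Y = Y, X = X, V = 5, SL.library = lib_list,
                       cvControl = list(V = 10, shuffle = TRUE))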

Value

An object of class CV.SuperLearner (a list) with components:

  • call: The matched call.
  • AllSL: If saveAll = TRUE, a list with the output from each call to SuperLearner; otherwise NULL.
  • SL.predict: The predicted values from the super learner when each particular row was part of the validation fold.
  • discreteSL.predict: The traditional cross-validated selector. Picks the algorithm with the smallest cross-validated risk (in super learner terms, gives that algorithm coefficient 1 and all others 0).
  • whichDiscreteSL: A list of length V. The elements in the list are the algorithm that had the smallest cross-validated risk estimate for that fold.
  • library.predict: A matrix with the predicted values from each algorithm in SL.library. The columns are the algorithms in SL.library and the rows represent the predicted values when that particular row was in the validation fold (i.e. not used to fit that estimator).
  • coef: A matrix with the coefficients for the super learner on each fold. The columns are the algorithms in SL.library; the rows are the folds.
  • folds: A list containing the row numbers for each validation fold.
  • V: Number of folds for CV.SuperLearner.
  • libraryNames: A character vector with the names of the algorithms in the library. The format is 'predictionAlgorithm_screeningAlgorithm', with '_All' used to denote a prediction algorithm run on all variables in X.
  • SL.library: Returns SL.library in the same format as the argument with the same name above.
  • method: A list with the method functions.
  • Y: The outcome.
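
A short sketch of how a few of these components can be inspected; 'fit' below stands for an object returned by CV.SuperLearner:

fit$coef                 # V x (number of algorithms) matrix of super learner weights per fold
fit$whichDiscreteSL      # the single best algorithm (by CV risk) selected in each fold
lapply(fit$folds, head)  # first few validation-row indices in each fold
fit$libraryNames         # 'predictionAlgorithm_screeningAlgorithm' labels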

Details

The SuperLearner function builds an estimator, but does not provide an estimate of that estimator's performance. Various methods exist for evaluating estimator performance. If you are familiar with the super learner algorithm, it should come as no surprise that we recommend using cross-validation to obtain an honest evaluation of the super learner estimator. The function CV.SuperLearner computes the usual V-fold cross-validated risk estimate for the super learner (and for all algorithms in SL.library, for comparison).
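
For a gaussian outcome, this V-fold cross-validated risk is the mean squared error of the cross-validated predictions, and it can be recomputed by hand from the returned components (a sketch; 'fit' again stands for a CV.SuperLearner object):

# CV risk of the super learner itself
mean((fit$Y - fit$SL.predict)^2)
# CV risk of the discrete selector and of each algorithm in the library
mean((fit$Y - fit$discreteSL.predict)^2)
colMeans((fit$Y - fit$library.predict)^2)
# compare with summary(fit)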

See Also

SuperLearner

Examples

set.seed(23432)
## training set
n <- 500
p <- 50
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
colnames(X) <- paste("X", 1:p, sep="")
X <- data.frame(X)
Y <- X[, 1] + sqrt(abs(X[, 2] * X[, 3])) + X[, 2] - X[, 3] + rnorm(n)
  
# build Library and run Super Learner
SL.library <- c("SL.glm", "SL.randomForest", "SL.gam", "SL.polymars", "SL.mean")
test <- CV.SuperLearner(Y = Y, X = X, V = 10, SL.library = SL.library,
  verbose = TRUE, method = "method.NNLS")
test
summary(test)
# Look at the coefficients across folds
coef(test)
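
# The V-fold step can also be parallelized via the parallel argument; a sketch
# (not part of the package example) assuming a Unix-alike machine with at least two cores:
options(mc.cores = 2L)
testMC <- CV.SuperLearner(Y = Y, X = X, V = 10, SL.library = SL.library,
  method = "method.NNLS", parallel = "multicore")
summary(testMC)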
