Usage

CV.SuperLearner(Y, X, V = 20, family = gaussian(), SL.library,
  method = "method.NNLS", id = NULL, verbose = FALSE,
  control = list(saveFitLibrary = FALSE), cvControl = list(),
  obsWeights = NULL, saveAll = TRUE, parallel = "seq")

Arguments

Y: The outcome.

X: The covariates.

V: The number of folds for CV.SuperLearner. This is not the number of folds for SuperLearner. The number of folds for SuperLearner is controlled with cvControl.
family: Currently allows gaussian or binomial to describe the error distribution. Link function information will be ignored and should be contained in the method argument below.
SL.library: Either a character vector of prediction algorithms or a list containing character vectors. A list of the wrappers included in the SuperLearner package can be found with listWrappers().
method: A list (or a function to create a list) containing details on estimating the coefficients for the super learner and the model used to combine the individual algorithms in the library. See ?method.template for details. Currently, the built-in options are "method.NNLS" (the default), "method.NNLS2", "method.NNloglik", "method.CC_LS", "method.CC_nloglik", and "method.AUC". NNLS and NNLS2 are non-negative least squares based on the Lawson-Hanson algorithm and the dual method of Goldfarb and Idnani, respectively; both work for gaussian and binomial outcomes. NNloglik is a non-negative binomial likelihood maximization using the BFGS quasi-Newton optimization method. The NN* methods are normalized so the weights sum to one. CC_LS uses Goldfarb and Idnani's quadratic programming algorithm to calculate the best convex combination of weights that minimizes the squared error loss. CC_nloglik calculates the convex combination of weights that minimizes the negative binomial log likelihood on the logistic scale using a sequential quadratic programming algorithm. AUC, which only works for binary outcomes, uses the Nelder-Mead method via the optim function to minimize rank loss (equivalent to maximizing AUC).
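As a brief sketch of choosing a non-default method (Y_bin and X_df are hypothetical placeholders, not objects defined on this page), a binary outcome might pair family = binomial() with the likelihood-based combination:

# Sketch: likelihood-based ensemble weights for a binary outcome.
# Y_bin (a 0/1 vector) and X_df (a data.frame) are assumed placeholders.
fit <- CV.SuperLearner(Y = Y_bin, X = X_df, family = binomial(),
                       SL.library = c("SL.glm", "SL.mean"),
                       method = "method.NNloglik")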
id: Optional cluster identification variable. For the cross-validation splits, id forces observations in the same cluster to be in the same validation fold. id is passed to the prediction and screening algorithms in SL.library, but be sure to check the individual wrappers, as many of them ignore the information.
verbose: Logical; TRUE for printing progress during the computation (helpful for debugging).

control: A list of parameters to control the estimation process. Parameters include saveFitLibrary and trimLogit. See SuperLearner.control for details.
cvControl: A list of parameters to control the cross-validation process. Parameters include V, stratifyCV, shuffle and validRows. See SuperLearner.CV.control for details.
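For example, a minimal sketch (fold counts arbitrary; Y, X and SL.library as constructed in the Examples below), with the outer folds set by V and, per the V entry above, the folds of the SuperLearner calls set through cvControl:

# Sketch: 20 outer CV folds via V; each SuperLearner call uses 10 folds
# internally via cvControl. Objects are those built in the Examples below.
fit <- CV.SuperLearner(Y = Y, X = X, V = 20, SL.library = SL.library,
                       cvControl = list(V = 10, shuffle = TRUE))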
obsWeights: Optional observation weights variable. As with id above, obsWeights is passed to the prediction and screening algorithms, but many of the built-in wrappers ignore (or cannot use) the information. If you are using observation weights, make sure the library you specify uses the information.
saveAll: Logical; should the entire SuperLearner object be saved for each fold?
parallel: Options for parallel computation of the V-fold step. Use "seq" (the default) for sequential computation. Set parallel = "multicore" to use mclapply for the V-fold step (note that each SuperLearner() call will still run sequentially). mclapply checks the mc.cores option and defaults to 2 cores if it is not set; be sure to set options(mc.cores = ...) to the desired number of cores if you do not want the default. Alternatively, parallel can be a snow cluster, in which case parLapply is used for the V-fold step. For both multicore and snow, the inner SuperLearner calls are sequential.
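Both options can be sketched as follows (core counts are arbitrary; Y, X and SL.library as in the Examples below):

# Multicore (forked) computation of the V-fold step; not available on Windows.
options(mc.cores = 4)
fit <- CV.SuperLearner(Y = Y, X = X, SL.library = SL.library,
                       parallel = "multicore")

# Or pass a snow cluster object:
cl <- parallel::makeCluster(2, type = "PSOCK")
parallel::clusterSetRNGStream(cl, iseed = 2343)
fit <- CV.SuperLearner(Y = Y, X = X, SL.library = SL.library, parallel = cl)
parallel::stopCluster(cl)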
Value

An object of class CV.SuperLearner (a list) with components:

AllSL: If saveAll = TRUE, a list with the output from each call to SuperLearner; otherwise NULL.
whichDiscreteSL: A list of length V. The elements of the list are the algorithm that had the smallest cross-validated risk estimate for that fold.
library.predict: A matrix of the predicted values from each algorithm in SL.library. The columns correspond to the algorithms in SL.library; each row holds the predictions made when that observation was in the validation fold (i.e., not used to fit the estimator).
coef: A matrix with the coefficients for the super learner on each fold. The columns are the algorithms in SL.library; the rows are the folds (see the risk-computation sketch after this list).
V: The number of folds for CV.SuperLearner.
SL.library: Returns SL.library in the same format as the argument with the same name above.
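These components make it straightforward to compute honest risk estimates by hand. A minimal sketch, assuming a gaussian outcome Y and a fitted object test as created in the Examples below:

# Sketch: V-fold cross-validated MSE for each algorithm in the library,
# using predictions made while each row was in the validation fold.
cv_risk <- apply(test$library.predict, 2, function(pred) mean((Y - pred)^2))
cv_risk
# Super learner weights per algorithm, averaged across the V folds:
colMeans(coef(test))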
Details

The SuperLearner function builds an estimator, but does not itself provide an estimate of that estimator's performance. Various methods exist for evaluating estimator performance. If you are familiar with the super learner algorithm, it should be no surprise that we recommend cross-validation to obtain an honest estimate of the super learner's performance. The function CV.SuperLearner computes the usual V-fold cross-validated risk estimate for the super learner (and for all algorithms in SL.library, for comparison).
See Also

SuperLearner
Examples

set.seed(23432)
## training set
n <- 500
p <- 50
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
colnames(X) <- paste("X", 1:p, sep="")
X <- data.frame(X)
Y <- X[, 1] + sqrt(abs(X[, 2] * X[, 3])) + X[, 2] - X[, 3] + rnorm(n)
# build Library and run Super Learner
SL.library <- c("SL.glm", "SL.randomForest", "SL.gam", "SL.polymars", "SL.mean")
## Not run:
test <- CV.SuperLearner(Y = Y, X = X, V = 10, SL.library = SL.library,
                        verbose = TRUE, method = "method.NNLS")
test
summary(test)
# Look at the coefficients across folds
coef(test)
## End(Not run)