cvrisk: Cross-Validation

Description

Cross-validated estimation of the empirical risk for hyper-parameter selection.

Usage

cvrisk(object, folds = cv(model.weights(object)), 
       grid = 1:mstop(object),
       papply = if (require("multicore")) mclapply else lapply, 
       fun = NULL, ...)
cv(weights, type = c("bootstrap", "kfold", "subsampling"),
   B = ifelse(type == "kfold", 10, 25), prob = 0.5, strata = NULL)

Arguments

object

an object of class mboost.

folds

a weight matrix with number of rows equal to the number of observations. The number of columns corresponds to the number of cross-validation runs. Can be computed using function cv and defaults to 25 bootstrap samples.

grid

a vector of stopping parameters the empirical risk is to be evaluated for.

papply

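(parallel) apply function. In the absence of package multicore, sequential computations via lapply are performed. Alternatively, parallel computing via a user-supplied apply function, for example one built on clusterApplyLB from package snow, can be used; see the examples.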

fun

if fun is NULL, the out-of-sample risk is returned. fun, as a function of object, may extract any other characteristic of the cross-validated models. These are returned as is.

weights

a numeric vector of weights for the model to be cross-validated.

type

character argument for specifying the cross-validation method. Currently (stratified) bootstrap, k-fold cross-validation and subsampling are implemented.

B

number of folds, by default 25 for bootstrap and subsampling and 10 for kfold.

prob

proportion of observations (a value between 0 and 1, by default 0.5) to be included in the learning samples for subsampling.

strata

a factor of the same length as weights for stratification.

...

additional arguments passed on to papply (mclapply if package multicore is available).

Value

An object of class cvrisk (if fun was not specified), basically a matrix containing estimates of the empirical risk for a varying number of boosting iterations. plot and print methods are available, as well as an mstop method.
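The following base R sketch illustrates how a stopping value can be read off such a risk matrix. It does not call mboost; the matrix values are invented for illustration, and the layout (one row per fold, one column per candidate stopping value) is an assumption made for this example rather than a statement about the internals of the cvrisk class.

```r
# Hypothetical out-of-sample risks: 3 folds (rows) x 5 candidate
# stopping values (columns). The numbers are made up for illustration.
risk <- matrix(c( 9, 6, 4, 5, 7,
                  8, 5, 3, 4, 6,
                 10, 7, 5, 5, 8),
               nrow = 3, byrow = TRUE,
               dimnames = list(NULL, as.character(1:5)))

# Average the risk over folds, then pick the stopping value with the
# smallest average out-of-sample risk.
mean_risk <- colMeans(risk)
best <- as.integer(names(which.min(mean_risk)))
best
```

For a real cvrisk object, the mstop method mentioned above is the supported way to extract the selected number of iterations.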

Details

The number of boosting iterations is a hyper-parameter of the boosting algorithms implemented in this package. This function computes honest, i.e., cross-validated, estimates of the empirical risk for different stopping parameters mstop. These estimates can be used to choose an appropriate number of boosting iterations.

Different forms of cross-validation can be applied, for example 10-fold cross-validation or bootstrapping. The weights (zero weights correspond to test cases) are defined via the folds matrix.
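To make the structure of the folds matrix concrete, here is a minimal base R sketch that builds a k-fold weight matrix by hand. It is not the code used by cv; the names n, k, w and fold_id are chosen for this illustration only. Each column is one cross-validation run, and a zero weight marks an observation held out as a test case in that run.

```r
set.seed(1)
n <- 12                     # number of observations (illustrative)
k <- 3                      # number of folds (illustrative)
w <- rep(1, n)              # original model weights, here all equal to 1

# Randomly assign each observation to one of k folds of equal size.
fold_id <- sample(rep(seq_len(k), length.out = n))

# Column j keeps the original weight for learning cases and sets the
# weight of observations in fold j to zero (test cases).
folds <- sapply(seq_len(k), function(j) w * (fold_id != j))

dim(folds)                  # n rows, k columns
colSums(folds == 0)         # number of test cases per run
```

A matrix of this shape could be passed as the folds argument; in practice cv(model.weights(object), type = "kfold") produces it directly.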

If package multicore is available, cvrisk can easily be run in parallel on the available cores/processors by specifying papply = mclapply. The scheduling can be changed via the corresponding arguments of mclapply (passed through the dot arguments).
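The papply argument only needs to behave like lapply, that is, accept a list or vector X and a function FUN and return a list. As a sketch, the wrapper below uses mclapply from the base package parallel, which provides the same function in current R versions. The name my_papply and the choice of two cores are assumptions for this illustration; the wrapper falls back to a single core on non-Unix systems, where forking is not available.

```r
library(parallel)

# Hypothetical lapply-like wrapper: forks two workers on Unix-alikes
# and runs sequentially elsewhere.
my_papply <- function(X, FUN, ...) {
  cores <- if (.Platform$OS.type == "unix") 2L else 1L
  mclapply(X, FUN, ..., mc.cores = cores)
}

# Behaves like lapply on a toy input.
res <- my_papply(1:4, function(i) i^2)
unlist(res)
```

Under these assumptions, a call such as cvrisk(model, papply = my_papply) would distribute the cross-validation runs over the workers.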

The function cv can be used to build an appropriate weight matrix to be used with cvrisk. If strata is defined, sampling is performed in each stratum separately, thus preserving the distribution of the strata variable in each fold.
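The idea of stratified sampling can be sketched in base R as follows. This is an illustration of the principle, not the implementation of cv; the strata factor, the number of runs B and the loop structure are assumptions for this example. Bootstrap draws are taken within each stratum, so every column of the weight matrix gives each stratum the same total weight as in the original data.

```r
set.seed(2)
strata <- factor(rep(c("a", "b"), times = c(6, 4)))  # illustrative strata
n <- length(strata)
B <- 5                                               # number of bootstrap runs

folds <- matrix(0, nrow = n, ncol = B)
for (b in seq_len(B)) {
  for (s in levels(strata)) {
    idx <- which(strata == s)
    # Resample with replacement within the stratum only.
    draws <- sample(idx, length(idx), replace = TRUE)
    # Weight = how often each observation was drawn; 0 marks a test case.
    folds[idx, b] <- tabulate(match(draws, idx), nbins = length(idx))
  }
}

# Total weight per stratum and run: always 6 for "a" and 4 for "b".
rowsum(folds, strata)
```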

References

Torsten Hothorn, Friedrich Leisch, Achim Zeileis and Kurt Hornik (2005), The design and analysis of benchmark experiments. Journal of Computational and Graphical Statistics, 14(3), 675--699.

Examples


data("bodyfat", package = "mboost")

  ### fit linear model to data
  model <- glmboost(DEXfat ~ ., data = bodyfat, center = TRUE)

  ### AIC-based selection of number of boosting iterations
  maic <- AIC(model)
  maic

  ### inspect coefficient path and AIC-based stopping criterion
  par(mai = par("mai") * c(1, 1, 1, 1.8))
  plot(model)
  abline(v = mstop(maic), col = "lightgray")

  ### 10-fold cross-validation
  cv10f <- cv(model.weights(model), type = "kfold")
  cvm <- cvrisk(model, folds = cv10f, papply = lapply)
  print(cvm)
  mstop(cvm)
  plot(cvm)

  ### 25 bootstrap iterations (manually)
  set.seed(290875)
  n <- nrow(bodyfat)
  bs25 <- rmultinom(25, n, rep(1, n)/n)
  cvm <- cvrisk(model, folds = bs25, papply = lapply)
  print(cvm)
  mstop(cvm)
  plot(cvm)

  ### same by default
  set.seed(290875)
  cvrisk(model, papply = lapply)

  ### 25 bootstrap iterations (using cv)
  set.seed(290875)
  bs25_2 <- cv(model.weights(model), type="bootstrap")
  all(bs25 == bs25_2)

  ### trees
  blackbox <- blackboost(DEXfat ~ ., data = bodyfat)
  cvtree <- cvrisk(blackbox, papply = lapply)
  plot(cvtree)


  ### cvrisk in parallel modes:

  ## multicore only runs properly on unix systems
    library("multicore")
    cvrisk(model)

  ## infrastructure needs to be set up in advance
    library("snow")
    cl <- makePVMcluster(25) # e.g. to run cvrisk on 25 nodes via PVM
    myApply <- function(X, FUN, cl, ...) {
      clusterEvalQ(cl, library("mboost")) # load mboost on nodes
      ## further set up steps as required
      clusterApplyLB(cl = cl, X, FUN, ...)
    }
    cvrisk(model, papply = myApply, cl = cl)
    stopCluster(cl)
