Cross-validated estimation of the empirical risk for hyper-parameter selection.
Usage

# S3 method for mboost
cvrisk(object, folds = cv(model.weights(object)),
grid = 0:mstop(object),
papply = mclapply,
fun = NULL, mc.preschedule = FALSE, ...)
cv(weights, type = c("bootstrap", "kfold", "subsampling"),
   B = ifelse(type == "kfold", 10, 25), prob = 0.5, strata = NULL)

## Plot cross-validation results
# S3 method for cvrisk
plot(x,
xlab = "Number of boosting iterations", ylab = attr(x, "risk"),
ylim = range(x), main = attr(x, "type"), ...)
Arguments

object: an object of class mboost.

folds: a weight matrix with number of rows equal to the number of observations. The number of columns corresponds to the number of cross-validation runs. Can be computed using function cv and defaults to 25 bootstrap samples.

grid: a vector of stopping parameters the empirical risk is to be evaluated for.

fun: if fun is NULL, the out-of-sample risk is returned. fun, as a function of object, may extract any other characteristic of the cross-validated models. These are returned as is.

weights: a numeric vector of weights for the model to be cross-validated.

type: character argument for specifying the cross-validation method. Currently (stratified) bootstrap, k-fold cross-validation and subsampling are implemented.

B: number of folds, per default 25 for "bootstrap" and "subsampling" and 10 for "kfold".

prob: percentage of observations to be included in the learning samples for subsampling.

strata: a factor of the same length as weights for stratification.

x: an object of class cvrisk.

xlab, ylab: axis labels.

ylim: limits of y-axis.

main: main title of graphic.
Value

An object of class cvrisk (when fun wasn't specified), basically a matrix containing estimates of the empirical risk for a varying number of bootstrap iterations. plot and print methods are available, as well as a mstop method.
Details

The number of boosting iterations is a hyper-parameter of the boosting algorithms implemented in this package. Honest, i.e., cross-validated, estimates of the empirical risk for different stopping parameters mstop are computed by this function, which can be utilized to choose an appropriate number of boosting iterations to be applied.

Different forms of cross-validation can be applied, for example 10-fold cross-validation or bootstrapping. The weights (zero weights correspond to test cases) are defined via the folds matrix.
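As a base-R sketch of the layout cvrisk expects (not mboost's own implementation; use cv() in practice), a 5-fold weight matrix for 20 observations could be built by hand like this:

```r
## Base-R sketch (hypothetical, not cv()'s code) of the weight-matrix
## layout cvrisk() expects: one column per cross-validation run,
## weight 1 marks learning observations, weight 0 marks test cases.
n <- 20
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n)) # random fold assignment
folds <- sapply(seq_len(k), function(j) as.numeric(fold_id != j))
dim(folds)          # 20 rows (observations), 5 columns (runs)
colSums(folds == 0) # each test fold holds 4 observations
```

Such a matrix can be passed directly as the folds argument of cvrisk.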
cvrisk runs in parallel on OSes where forking is possible (i.e., not on Windows) and multiple cores/processors are available. The scheduling can be changed by the corresponding arguments of mclapply (via the dot arguments).
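For illustration, papply only needs a function with lapply's interface, so plain lapply, mclapply, or a custom wrapper all qualify. In the toy sketch below, dummy_risk and toy_folds are made-up stand-ins, not mboost objects:

```r
## Toy sketch of the papply contract: any function with lapply()'s
## interface can stand in for mclapply (which merely adds mc.cores /
## mc.preschedule on Unix). dummy_risk and toy_folds are hypothetical.
dummy_risk <- function(fold) mean((fold - 0.5)^2)
toy_folds <- list(run1 = c(0, 1), run2 = c(1, 1))
unlist(lapply(toy_folds, dummy_risk)) # 0.25 for both runs
```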
The function cv can be used to build an appropriate weight matrix to be used with cvrisk. If strata is defined, sampling is performed in each stratum separately, thus preserving the distribution of the strata variable in each fold.
There exist various functions to display and work with cross-validation results. One can print and plot (see above) results and extract the optimal iteration via mstop.
References

Torsten Hothorn, Friedrich Leisch, Achim Zeileis and Kurt Hornik (2006), The design and analysis of benchmark experiments. Journal of Computational and Graphical Statistics, 14(3), 675--699.

Andreas Mayr, Benjamin Hofner, and Matthias Schmid (2012), The importance of knowing when to stop - a sequential stopping rule for component-wise gradient boosting. Methods of Information in Medicine, 51, 178--186. DOI: http://dx.doi.org/10.3414/ME11-02-0030
See Also

AIC.mboost for AIC-based selection of the stopping iteration. Use mstop to extract the optimal stopping iteration from cvrisk objects.
Examples
data("bodyfat", package = "TH.data")
### fit linear model to data
model <- glmboost(DEXfat ~ ., data = bodyfat, center = TRUE)
### AIC-based selection of number of boosting iterations
maic <- AIC(model)
maic
### inspect coefficient path and AIC-based stopping criterion
par(mai = par("mai") * c(1, 1, 1, 1.8))
plot(model)
abline(v = mstop(maic), col = "lightgray")
### 10-fold cross-validation
cv10f <- cv(model.weights(model), type = "kfold")
cvm <- cvrisk(model, folds = cv10f, papply = lapply)
print(cvm)
mstop(cvm)
plot(cvm)
### 25 bootstrap iterations (manually)
set.seed(290875)
n <- nrow(bodyfat)
bs25 <- rmultinom(25, n, rep(1, n)/n)
cvm <- cvrisk(model, folds = bs25, papply = lapply)
print(cvm)
mstop(cvm)
plot(cvm)
### same by default
set.seed(290875)
cvrisk(model, papply = lapply)
### 25 bootstrap iterations (using cv)
set.seed(290875)
bs25_2 <- cv(model.weights(model), type = "bootstrap")
all(bs25 == bs25_2)
############################################################
## Do not run this example automatically as it takes
## some time (~ 5 seconds depending on the system)
### trees
blackbox <- blackboost(DEXfat ~ ., data = bodyfat)
cvtree <- cvrisk(blackbox, papply = lapply)
plot(cvtree)
## End(Not run this automatically)
### cvrisk in parallel modes:
## (not run automatically)
## parallel::mclapply() which is used here for parallelization only runs
## on unix systems (here we use 2 cores)
cvrisk(model, mc.cores = 2)
## infrastructure needs to be set up in advance
library("parallel")
cl <- makeCluster(25) # e.g. to run cvrisk on 25 nodes via PVM
myApply <- function(X, FUN, ...) {
    myFun <- function(...) {
        library("mboost") # load mboost on nodes
        FUN(...)
    }
    ## further set up steps as required
    parLapply(cl = cl, X, myFun, ...)
}
cvrisk(model, papply = myApply)
stopCluster(cl)