The cross-validation function of xgboost
xgb.cv(params = list(), data, nrounds, nfold, label = NULL, missing = NA,
       prediction = FALSE, showsd = TRUE, metrics = list(), obj = NULL,
       feval = NULL, stratified = TRUE, folds = NULL, verbose = TRUE,
       print_every_n = 1L, early_stopping_rounds = NULL, maximize = NULL,
       callbacks = list(), ...)
- params: the list of parameters. Commonly used ones are:
  - objective: objective function, common ones are:
    - binary:logistic: logistic regression for classification
  - eta: step size of each boosting step
  - max_depth: maximum depth of the tree
  - nthread: number of threads used in training; if not set, all threads are used
  See xgb.train for further details. See also demo/ for a walkthrough example in R.
- data: takes an xgb.DMatrix, matrix, or dgCMatrix as the input.
- nrounds: the maximum number of iterations.
- nfold: the original dataset is randomly partitioned into nfold equal-size subsamples.
- label: vector of response values. Should be provided only when data is an R matrix.
- missing: only used when the input is a dense matrix. By default set to NA, which means that NA values are treated as 'missing' by the algorithm. Sometimes, 0 or another extreme value might be used to represent missing values.
- prediction: a logical value indicating whether to return the test fold predictions from each CV model. This parameter engages the cb.cv.predict callback.
- showsd: boolean, whether to show the standard deviation of cross-validation.
- metrics: list of evaluation metrics to be used in cross-validation; when not specified, the evaluation metric is chosen according to the objective function. Possible options are:
  - error: binary classification error rate
  - rmse: root mean square error
  - logloss: negative log-likelihood
  - auc: area under the curve
  - merror: exact matching error, used to evaluate multi-class classification
- obj: customized objective function. Returns the gradient and second-order gradient with given prediction and dtrain.
- feval: customized evaluation function. Returns list(metric='metric-name', value='metric-value') with given prediction and dtrain (a sketch follows this argument list).
- stratified: a boolean indicating whether the sampling of folds should be stratified by the values of the outcome labels.
- folds: a list providing the possibility to use pre-defined CV folds (each element must be a vector of the test fold's indices). When folds are supplied, the nfold and stratified parameters are ignored.
- verbose: boolean, print the statistics during the process.
- print_every_n: print every n-th iteration's evaluation messages when verbose > 0. Default is 1, which means all messages are printed. This parameter is passed to the cb.print.evaluation callback.
- early_stopping_rounds: if NULL, early stopping is not triggered. If set to an integer k, training with a validation set will stop if the performance doesn't improve for k rounds. Setting this parameter engages the cb.early.stop callback.
- maximize: if feval and early_stopping_rounds are set, then this parameter must be set as well. When it is TRUE, the larger the evaluation score the better. This parameter is passed to the cb.early.stop callback.
- callbacks: a list of callback functions to perform various tasks during boosting. See callbacks. Some of the callbacks are automatically created depending on the parameters' values. The user can provide either existing or custom callback methods in order to customize the training process.
- ...: other parameters to pass to params.
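To make the obj and feval conventions above concrete, here is a small sketch in the spirit of the demo/ examples referenced under params; the function names logregobj and evalerror and the parameter values are illustrative assumptions, not part of the xgboost API.

library(xgboost)

data(agaricus.train, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

# customized objective: returns the gradient and second-order gradient
# of the logistic loss for the given predictions (margins) and dtrain
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1 / (1 + exp(-preds))
  grad <- preds - labels
  hess <- preds * (1 - preds)
  list(grad = grad, hess = hess)
}

# customized evaluation function: must return list(metric = ..., value = ...)
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- mean(as.numeric(preds > 0) != labels)  # margin > 0 ~ positive class
  list(metric = "custom-error", value = err)
}

cv <- xgb.cv(params = list(max_depth = 2, eta = 1, nthread = 2),
             data = dtrain, nrounds = 3, nfold = 5,
             obj = logregobj, feval = evalerror)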
The original sample is randomly partitioned into nfold equal-size subsamples. Of the nfold subsamples, a single subsample is retained as the validation data for testing the model, and the remaining nfold - 1 subsamples are used as training data. The cross-validation process is then repeated nfold times, with each of the nfold subsamples used exactly once as the validation data. All observations are used for both training and validation.
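As a rough illustration of the partitioning described above, the sketch below builds a plain (non-stratified) 5-fold split by hand and passes it through the folds argument; the helper objects fold_id and my_folds are assumptions for this example, and nfold is omitted on the understanding, stated above, that it is ignored when folds are supplied.

library(xgboost)

data(agaricus.train, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

# partition the row indices into 5 roughly equal-size test folds
n <- nrow(agaricus.train$data)
set.seed(1)
fold_id <- sample(rep(1:5, length.out = n))
my_folds <- split(seq_len(n), fold_id)  # each element holds one fold's test indices

# when folds is supplied, the nfold and stratified parameters are ignored
cv <- xgb.cv(data = dtrain, folds = my_folds, nrounds = 3,
             max_depth = 3, eta = 1, objective = "binary:logistic")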
An object of class xgb.cv.synchronous with the following elements:
- call: a function call.
- params: parameters that were passed to the xgboost library. Note that it does not capture parameters changed by the cb.reset.parameters callback.
- callbacks: callback functions that were either automatically assigned or explicitly passed.
- evaluation_log: evaluation history stored as a data.table, with the first column corresponding to the iteration number and the rest corresponding to the CV-based evaluation means and standard deviations for the training and test CV-sets. It is created by the cb.evaluation.log callback.
- niter: number of boosting iterations.
- folds: the list of CV folds' indices - either those passed through the folds parameter or randomly generated.
- best_iteration: iteration number with the best evaluation metric value (only available with early stopping).
- ntreelimit: value corresponding to the best iteration, which could further be used in the predict method (only available with early stopping).
- pred: CV prediction values, available when prediction is set. It is either a vector or a matrix (see cb.cv.predict).
- models: a list of the CV folds' models. It is only available with the explicit setting of the cb.cv.predict(save_models = TRUE) callback (a sketch follows this list).
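Tying back to the pred and models elements above, a rough sketch of requesting them through the cb.cv.predict callback mentioned in this list; the parameter values are illustrative only.

library(xgboost)

data(agaricus.train, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

# prediction = TRUE stores the out-of-fold predictions in cv$pred;
# cb.cv.predict(save_models = TRUE) additionally keeps the per-fold models
cv <- xgb.cv(data = dtrain, nrounds = 3, nfold = 5,
             max_depth = 3, eta = 1, objective = "binary:logistic",
             prediction = TRUE,
             callbacks = list(cb.cv.predict(save_models = TRUE)))

head(cv$pred)      # CV prediction values, one per observation
length(cv$models)  # one fitted booster per fold
str(cv$folds)      # the test-fold index vectors used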
data(agaricus.train, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
cv <- xgb.cv(data = dtrain, nrounds = 3, nthread = 2, nfold = 5,
             metrics = list("rmse","auc"), max_depth = 3, eta = 1,
             objective = "binary:logistic")
print(cv)
print(cv, verbose = TRUE)
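A further variation on the example above, sketching early stopping and a look at the evaluation history; the nrounds and early_stopping_rounds values are arbitrary choices for illustration.

data(agaricus.train, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
cv <- xgb.cv(data = dtrain, nrounds = 20, nfold = 5, metrics = "auc",
             max_depth = 3, eta = 1, objective = "binary:logistic",
             early_stopping_rounds = 3)
cv$best_iteration   # iteration with the best mean test AUC
cv$evaluation_log   # per-iteration train/test means and standard deviations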