xgb.cv
Cross Validation
The cross-validation function of xgboost.
Usage
xgb.cv(params = list(), data, nrounds, nfold, label = NULL, missing = NA,
       prediction = FALSE, showsd = TRUE, metrics = list(), obj = NULL,
       feval = NULL, stratified = TRUE, folds = NULL, verbose = TRUE,
       print_every_n = 1L, early_stopping_rounds = NULL, maximize = NULL,
       callbacks = list(), ...)
Arguments
- params
the list of parameters. Commonly used ones are:
  - objective: objective function; common ones are reg:linear (linear regression) and binary:logistic (logistic regression for classification)
  - eta: step size of each boosting step
  - max_depth: maximum depth of the tree
  - nthread: number of threads used in training; if not set, all threads are used

See xgb.train for further details. See also demo/ for a walkthrough example in R.
- data
takes an xgb.DMatrix, matrix, or dgCMatrix as the input.
- nrounds
the max number of iterations
- nfold
the original dataset is randomly partitioned into nfold equal-size subsamples.
- label
vector of response values. Should be provided only when data is an R matrix.
- missing
only used when input is a dense matrix. By default it is set to NA, which means that NA values are treated as 'missing' by the algorithm. Sometimes, 0 or another extreme value might be used to represent missing values.
- prediction
a logical value indicating whether to return the test fold predictions from each CV model. This parameter engages the cb.cv.predict callback.
- showsd
boolean, whether to show the standard deviation of cross validation
- metrics
list of evaluation metrics to be used in cross validation; when not specified, the evaluation metric is chosen according to the objective function. Possible options are:
  - error: binary classification error rate
  - rmse: root mean square error
  - logloss: negative log-likelihood
  - auc: area under the curve
  - merror: exact matching error, used to evaluate multi-class classification
- obj
customized objective function. Returns the gradient and second-order gradient with the given prediction and dtrain.
- feval
customized evaluation function. Returns list(metric='metric-name', value='metric-value') with the given prediction and dtrain. (A combined sketch of obj, feval, and folds follows this list.)
- stratified
a boolean indicating whether sampling of folds should be stratified by the values of outcome labels.
- folds
list providing a possibility to use pre-defined CV folds (each element must be a vector of test fold's indices). When folds are supplied, the nfold and stratified parameters are ignored.
- verbose
boolean, print the statistics during the process
- print_every_n
Print every n-th iteration evaluation message when verbose > 0. Default is 1, which means all messages are printed. This parameter is passed to the cb.print.evaluation callback.
- early_stopping_rounds
If NULL, early stopping is not triggered. If set to an integer k, training with a validation set will stop if the performance doesn't improve for k rounds. Setting this parameter engages the cb.early.stop callback.
- maximize
If feval and early_stopping_rounds are set, then this parameter must be set as well. When it is TRUE, the larger the evaluation score the better. This parameter is passed to the cb.early.stop callback.
- callbacks
a list of callback functions to perform various tasks during boosting. See callbacks. Some of the callbacks are automatically created depending on the parameters' values. Users can provide either existing or their own callback methods in order to customize the training process.
- ...
other parameters to pass to params.
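For instance, a customized objective and evaluation function can be combined with pre-defined folds. The sketch below follows the pattern of the package's custom-objective demo; logregobj, evalerror, and my_folds are illustrative names, not part of the xgb.cv API:

data(agaricus.train, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

# Pre-defined CV folds: each element is a vector of test indices.
# When folds is supplied, nfold and stratified are ignored.
n <- nrow(agaricus.train$data)
my_folds <- split(sample(n), rep(1:5, length.out = n))

# Customized objective: returns gradient and second-order gradient.
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1 / (1 + exp(-preds))  # preds are margins with a custom objective
  list(grad = preds - labels, hess = preds * (1 - preds))
}

# Customized evaluation: returns list(metric = ..., value = ...).
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  list(metric = "error", value = mean((preds > 0) != labels))
}

cv <- xgb.cv(params = list(max_depth = 2, eta = 1), data = dtrain, nrounds = 3,
             folds = my_folds, obj = logregobj, feval = evalerror)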
Details
The original sample is randomly partitioned into nfold equal-size subsamples.
Of the nfold subsamples, a single subsample is retained as the validation data for testing the model, and the remaining nfold - 1 subsamples are used as training data.
The cross-validation process is then repeated nfold times, with each of the nfold subsamples used exactly once as the validation data.
All observations are used for both training and validation.
Adapted from http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29#k-fold_cross-validation
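To make the partitioning scheme concrete, here is a minimal sketch of k-fold splitting in plain R (an illustration of the idea, not xgboost's internal code):

set.seed(1)
n <- 20; nfold <- 4
fold_id <- sample(rep(1:nfold, length.out = n))  # random equal-size fold assignment
for (k in 1:nfold) {
  test_idx  <- which(fold_id == k)   # one subsample held out for validation
  train_idx <- which(fold_id != k)   # remaining nfold - 1 subsamples for training
  # ... fit on train_idx, evaluate on test_idx ...
}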
Value
An object of class xgb.cv.synchronous
with the following elements:
- call
a function call.
- params
parameters that were passed to the xgboost library. Note that it does not capture parameters changed by the cb.reset.parameters callback.
- callbacks
callback functions that were either automatically assigned or explicitly passed.
- evaluation_log
evaluation history stored as a data.table with the first column corresponding to the iteration number and the rest corresponding to the CV-based evaluation means and standard deviations for the training and test CV-sets. It is created by the cb.evaluation.log callback.
- niter
number of boosting iterations.
- folds
the list of CV folds' indices, either those passed through the folds parameter or randomly generated.
- best_iteration
iteration number with the best evaluation metric value (only available with early stopping).
- best_ntreelimit
the ntreelimit value corresponding to the best iteration, which could further be used in the predict method (only available with early stopping).
- pred
CV prediction values, available when prediction is set. It is either a vector or a matrix (see cb.cv.predict).
- models
a list of the CV folds' models. It is only available with the explicit setting of the cb.cv.predict(save_models = TRUE) callback.
Examples
data(agaricus.train, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
cv <- xgb.cv(data = dtrain, nrounds = 3, nthread = 2, nfold = 5, metrics = list("rmse","auc"),
             max_depth = 3, eta = 1, objective = "binary:logistic")
print(cv)
print(cv, verbose=TRUE)
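A further sketch (parameter values chosen for illustration, not defaults) requests early stopping and out-of-fold predictions, then inspects elements of the returned object:

cv2 <- xgb.cv(data = dtrain, nrounds = 50, nfold = 5, early_stopping_rounds = 5,
              prediction = TRUE, max_depth = 3, eta = 1,
              objective = "binary:logistic", verbose = FALSE)
cv2$best_iteration         # iteration with the best evaluation metric value
head(cv2$evaluation_log)   # per-iteration CV means and standard deviations
str(cv2$pred)              # out-of-fold predictions from the cb.cv.predict callback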