The cross-validation function of xgboost
xgb.cv(params = list(), data, nrounds, nfold, label = NULL, missing = NULL, prediction = FALSE, showsd = TRUE, metrics = list(), obj = NULL, feval = NULL, stratified = TRUE, folds = NULL, verbose = T, print.every.n = 1L, early.stop.round = NULL, maximize = NULL, ...)
params - the list of parameters. Commonly used ones are:
  objective - objective function; common ones are
    binary:logistic - logistic regression for classification
  eta - step size of each boosting step
  max.depth - maximum depth of the tree
  nthread - number of threads used in training; if not set, all threads are used
  See xgb.train for further details. See also demo/ for walkthrough examples in R.
data - takes an xgb.DMatrix or Matrix as the input.
nrounds - the max number of iterations
nfold - the original dataset is randomly partitioned into nfold equal-size subsamples.
label - option field, used when data is a Matrix rather than an xgb.DMatrix
missing - only used when the input is a dense matrix; pick a float value that represents missing values. Sometimes a dataset uses 0 or another extreme value to represent missing values.
prediction - a logical value indicating whether to return the prediction vector.
showsd - boolean, whether to show the standard deviation of cross-validation
metrics - list of evaluation metrics to be used in cross-validation;
  when it is not specified, the evaluation metric is chosen according to the objective function.
  Possible options are:
  error - binary classification error rate
  rmse - root mean square error
  logloss - negative log-likelihood
  auc - area under the curve
  merror - exact matching error, used to evaluate multi-class classification
obj - customized objective function. Returns the gradient and second-order gradient with the given prediction and dtrain.
feval - customized evaluation function. Returns list(metric='metric-name', value='metric-value') with the given prediction and dtrain.
stratified - boolean, whether sampling of folds should be stratified by the values of the labels in data
folds - list, provides a possibility of using a list of pre-defined CV folds (each element must be a vector of the fold's indices). If folds are supplied, the nfold and stratified parameters are ignored.
verbose - boolean, print the statistics during the process
print.every.n - print every N progress messages when verbose > 0. Default is 1, which means all messages are printed.
early.stop.round - if NULL, early stopping is not triggered. If set to an integer k, training with a validation set will stop if the performance keeps getting worse consecutively for k rounds.
maximize - if feval and early.stop.round are set, then maximize must be set as well; maximize = TRUE means the larger the evaluation score the better.
... - other parameters to pass to params
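For illustration, here is a sketch of a customized objective and evaluation function following the conventions above. The function bodies are an example, not part of xgboost itself: a logistic-loss objective and a classification-error metric, both using the (preds, dtrain) signature described for obj and feval.

```r
# Sketch of a customized objective: returns the gradient and
# second-order gradient (hessian) of the logistic loss at the
# current predictions.
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1 / (1 + exp(-preds))   # sigmoid of the raw margin
  grad <- preds - labels           # first-order gradient
  hess <- preds * (1 - preds)      # second-order gradient
  list(grad = grad, hess = hess)
}

# Sketch of a customized evaluation function: must return
# list(metric = 'metric-name', value = 'metric-value').
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- mean(as.numeric(preds > 0.5) != labels)
  list(metric = "custom-error", value = err)
}
```

These would be passed as obj = logregobj and feval = evalerror. Since a smaller error is better, maximize = FALSE would be the appropriate setting when early.stop.round is also used.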
The original sample is randomly partitioned into nfold equal-size subsamples. Of the nfold subsamples, a single subsample is retained as the validation data for testing the model, and the remaining nfold - 1 subsamples are used as training data. The cross-validation process is then repeated nrounds times, with each of the nfold subsamples used exactly once as the validation data. All observations are used for both training and validation.
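The partitioning scheme above can be sketched in base R. This is an illustration of the idea, not xgboost's internal implementation:

```r
# Randomly partition n observations into nfold equal-size subsamples;
# each list element holds the validation indices of one fold.
n <- 20
nfold <- 5
shuffled <- sample(seq_len(n))
folds <- split(shuffled, rep(seq_len(nfold), length.out = n))
# Every observation appears in exactly one validation fold, so all
# observations are used for both training and validation.
```

A list built this way is also the shape expected by the folds argument described above.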
If prediction = TRUE, a list with the following elements is returned:
  - a data.table with each mean and standard deviation stat for the training set and test set
  pred - an array or matrix (for multiclass classification) with predictions for each CV fold, for the model having been trained on the data in all other folds.
If prediction = FALSE, just the data.table with each mean and standard deviation stat for the training set and test set is returned.
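A brief usage sketch of prediction = TRUE (it assumes a dtrain xgb.DMatrix built from the agaricus data as in the example below; the out-of-fold predictions are returned in the pred element described above):

```r
# Usage sketch; requires the xgboost package and a dtrain xgb.DMatrix.
res <- xgb.cv(data = dtrain, nround = 3, nfold = 5,
              objective = "binary:logistic", prediction = TRUE)
str(res$pred)  # one out-of-fold prediction per training row
```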
data(agaricus.train, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
history <- xgb.cv(data = dtrain, nround = 3, nthread = 2, nfold = 5,
                  metrics = list("rmse", "auc"),
                  max.depth = 3, eta = 1, objective = "binary:logistic")
print(history)