The cross validation function of xgboost
xgb.cv(params = list(), data, nrounds, nfold, label = NULL, missing = NULL, prediction = FALSE, showsd = TRUE, metrics = list(), obj = NULL, feval = NULL, stratified = TRUE, folds = NULL, verbose = T, print.every.n = 1L, early.stop.round = NULL, maximize = NULL, ...)
params - the list of parameters. Commonly used ones are:
  objective - objective function; common ones are:
    binary:logistic - logistic regression for classification
data - takes an xgb.DMatrix or Matrix as the input.
nrounds - the max number of iterations
nfold - the original dataset is randomly partitioned into nfold equal-size subsamples.
label - option field, used when data is a Matrix.
missing - only used when the input is a dense matrix; pick a float value that represents a missing value. Sometimes a dataset uses 0 or another extreme value to represent missing values.
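As a sketch of the missing argument, with a made-up sentinel value and toy matrix (nothing here comes from the package's own examples):

```r
library(xgboost)

# Toy dense matrix that encodes missing entries as -999 instead of NA
# (the sentinel value and the data are hypothetical).
x <- matrix(c(1, -999, 3,
              4, -999, 6), nrow = 2, byrow = TRUE)
y <- c(0, 1)

# Telling xgboost that -999 means "missing" keeps it from being
# treated as a real feature value.
dtrain <- xgb.DMatrix(x, missing = -999, label = y)
```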
prediction - a logical value indicating whether to return the prediction vector.
showsd - boolean, whether to show the standard deviation of cross validation
metrics - list of evaluation metrics to be used in cross validation; when it is not specified, the evaluation metric is chosen according to the objective function. Possible options are:
  error - binary classification error rate
obj - customized objective function. Returns gradient and second order gradient with given prediction and dtrain.
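A minimal sketch of a customized objective, here log loss for binary classification; the function name is ours, and preds is assumed to hold raw (margin) scores:

```r
library(xgboost)

# Customized objective: log loss. Returns the first-order gradient and
# the second-order gradient for each data point.
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1 / (1 + exp(-preds))   # margin -> probability
  grad <- preds - labels           # first order gradient of log loss
  hess <- preds * (1 - preds)      # second order gradient
  list(grad = grad, hess = hess)
}

# Hypothetical usage, assuming dtrain is an xgb.DMatrix:
# history <- xgb.cv(data = dtrain, nrounds = 3, nfold = 5, obj = logregobj)
```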
feval - customized evaluation function. Returns list(metric='metric-name', value='metric-value') with given prediction and dtrain.
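A sketch of a customized evaluation function, here a classification error metric; the name evalerror is ours, and predictions are assumed to be raw scores (thresholded at 0):

```r
library(xgboost)

# Customized evaluation: classification error rate. Returns the
# required list(metric = ..., value = ...) shape.
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- mean(as.numeric(preds > 0) != labels)
  list(metric = "error", value = err)
}

# Hypothetical usage (error is better when smaller, so maximize = FALSE):
# xgb.cv(data = dtrain, nrounds = 3, nfold = 5,
#        feval = evalerror, maximize = FALSE)
```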
stratified - boolean, whether sampling of folds should be stratified by the values of labels in data.
folds - list that provides the possibility of using pre-defined CV folds (each element must be a vector of fold indices). If folds are supplied, the nfold and stratified parameters are ignored.
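Such a list of pre-defined folds can be built with base R; this sketch assumes a training set with 100 rows:

```r
# Build 5 pre-defined CV folds by hand; each list element is a vector
# of row indices held out as the validation fold.
n <- 100                # assumed number of rows in the training data
set.seed(42)
folds <- split(sample(n), rep(1:5, length.out = n))

# Hypothetical usage (nfold and stratified would be ignored):
# history <- xgb.cv(data = dtrain, nrounds = 3, folds = folds,
#                   objective = "binary:logistic")
```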
verbose - boolean, print the statistics during the process
print.every.n - print every N progress messages when verbose > 0. Default is 1, which means all messages are printed.
early.stop.round - if set to NULL, early stopping is not triggered. If set to an integer k, training with a validation set will stop if the performance keeps getting worse for k consecutive rounds.
maximize - if feval and early.stop.round are set, then maximize must be set as well; maximize = TRUE means the larger the evaluation score the better.
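A sketch of early stopping with the AUC metric; dtrain stands for any xgb.DMatrix, and the round counts are illustrative:

```r
library(xgboost)

# Stop once the test AUC has not improved for 3 consecutive rounds.
# AUC is better when larger, so maximize must be TRUE.
history <- xgb.cv(data = dtrain, nrounds = 50, nfold = 5,
                  metrics = list("auc"), objective = "binary:logistic",
                  early.stop.round = 3, maximize = TRUE)
```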
... - other parameters to pass to params.
The original sample is randomly partitioned into nfold equal-size subsamples. Of the nfold subsamples, a single subsample is retained as the validation data for testing the model, and the remaining nfold - 1 subsamples are used as training data. The cross-validation process is then repeated nrounds times, with each of the nfold subsamples used exactly once as the validation data. All observations are used for both training and validation.
When prediction = TRUE, a list with the following elements is returned:
  dt - a data.table with the mean and standard deviation of each statistic for the training set and test set
  pred - an array or matrix (for multiclass classification) with predictions for each CV fold, for the model having been trained on the data in all other folds.
When prediction = FALSE, just a data.table with the mean and standard deviation of each statistic for the training set and test set is returned.
data(agaricus.train, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
history <- xgb.cv(data = dtrain, nround = 3, nthread = 2, nfold = 5,
                  metrics = list("rmse", "auc"),
                  max.depth = 3, eta = 1, objective = "binary:logistic")
print(history)