xgb.train
eXtreme Gradient Boosting Training
xgb.train is an advanced interface for training an xgboost model. The xgboost function is a simpler wrapper for xgb.train.
Usage
xgb.train(
params = list(),
data,
nrounds,
watchlist = list(),
obj = NULL,
feval = NULL,
verbose = 1,
print_every_n = 1L,
early_stopping_rounds = NULL,
maximize = NULL,
save_period = NULL,
save_name = "xgboost.model",
xgb_model = NULL,
callbacks = list(),
...
)

xgboost(
data = NULL,
label = NULL,
missing = NA,
weight = NULL,
params = list(),
nrounds,
verbose = 1,
print_every_n = 1L,
early_stopping_rounds = NULL,
maximize = NULL,
save_period = NULL,
save_name = "xgboost.model",
xgb_model = NULL,
callbacks = list(),
...
)
Arguments
- params
the list of parameters. The complete list of parameters is available in the online documentation. Below is a shorter summary (a minimal example of assembling params is sketched right after this list of arguments):

1. General Parameters

booster: which booster to use, can be gbtree or gblinear. Default: gbtree.

2. Booster Parameters

2.1. Parameters for the Tree Booster

eta: controls the learning rate: scales the contribution of each tree by a factor of 0 < eta < 1 when it is added to the current approximation. Used to prevent overfitting by making the boosting process more conservative. A lower value for eta implies a larger value for nrounds: a low eta value means a model more robust to overfitting but slower to compute. Default: 0.3

gamma: minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.

max_depth: maximum depth of a tree. Default: 6

min_child_weight: minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger, the more conservative the algorithm will be. Default: 1

subsample: subsample ratio of the training instances. Setting it to 0.5 means that xgboost randomly collects half of the data instances to grow trees, which prevents overfitting. It also makes computation shorter (because there is less data to analyse). It is advised to use this parameter together with eta and to increase nrounds. Default: 1

colsample_bytree: subsample ratio of columns when constructing each tree. Default: 1

num_parallel_tree: experimental parameter. Number of trees to grow per round. Useful to test Random Forest through XGBoost (set colsample_bytree < 1, subsample < 1 and nrounds = 1 accordingly). Default: 1

monotone_constraints: a numerical vector consisting of 1, 0 and -1, with its length equal to the number of features in the training data. 1 is increasing, -1 is decreasing and 0 is no constraint.

interaction_constraints: a list of vectors specifying feature indices of permitted interactions. Each item of the list represents one permitted interaction where the specified features are allowed to interact with each other. Feature index values should start from 0 (0 references the first column). Leave the argument unspecified for no interaction constraints.

2.2. Parameters for the Linear Booster

lambda: L2 regularization term on weights. Default: 0

lambda_bias: L2 regularization term on bias. Default: 0

alpha: L1 regularization term on weights. (There is no L1 regularization on bias because it is not important.) Default: 0

3. Task Parameters

objective: specifies the learning task and the corresponding learning objective; users can pass a self-defined function to it. The default objective options are below:

reg:squarederror: regression with squared loss (default).
reg:squaredlogerror: regression with squared log loss 1/2 * (log(pred + 1) - log(label + 1))^2. All inputs are required to be greater than -1. Also, see the metric rmsle for a possible issue with this objective.
reg:logistic: logistic regression.
reg:pseudohubererror: regression with Pseudo Huber loss, a twice differentiable alternative to absolute loss.
binary:logistic: logistic regression for binary classification. Outputs probability.
binary:logitraw: logistic regression for binary classification, outputs the score before logistic transformation.
binary:hinge: hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.
count:poisson: Poisson regression for count data, outputs the mean of the Poisson distribution. max_delta_step is set to 0.7 by default in Poisson regression (used to safeguard optimization).
survival:cox: Cox regression for right-censored survival time data (negative values are considered right-censored). Note that predictions are returned on the hazard ratio scale (i.e., as HR = exp(marginal_prediction) in the proportional hazard function h(t) = h0(t) * HR).
survival:aft: accelerated failure time model for censored survival time data. See Survival Analysis with Accelerated Failure Time for details.
aft_loss_distribution: probability density function used by survival:aft and the aft-nloglik metric.
multi:softmax: set xgboost to do multiclass classification using the softmax objective. A class is represented by a number and should be from 0 to num_class - 1.
multi:softprob: same as softmax, but the prediction outputs a vector of ndata * nclass elements, which can be further reshaped to an ndata x nclass matrix. The result contains the predicted probabilities of each data point belonging to each class.
rank:pairwise: set xgboost to do a ranking task by minimizing the pairwise loss.
rank:ndcg: use LambdaMART to perform list-wise ranking where Normalized Discounted Cumulative Gain (NDCG) is maximized.
rank:map: use LambdaMART to perform list-wise ranking where Mean Average Precision (MAP) is maximized.
reg:gamma: gamma regression with log-link. Output is the mean of the gamma distribution. It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be gamma-distributed.
reg:tweedie: Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any outcome that might be Tweedie-distributed.

base_score: the initial prediction score of all instances, global bias. Default: 0.5

eval_metric: evaluation metrics for validation data. Users can pass a self-defined function to it. Default: the metric will be assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking). The full list is provided in the Details section.
- data
training dataset. xgb.train accepts only an xgb.DMatrix as the input. xgboost, in addition, also accepts matrix, dgCMatrix, or the name of a local data file.

- nrounds
max number of boosting iterations.

- watchlist
named list of xgb.DMatrix datasets to use for evaluating model performance. Metrics specified in either eval_metric or feval will be computed for each of these datasets during each boosting iteration, and stored in the end as a field named evaluation_log in the resulting object. When either verbose >= 1 or the cb.print.evaluation callback is engaged, the performance results are continuously printed out during the training. E.g., specifying watchlist = list(validation1 = mat1, validation2 = mat2) allows one to track the performance of each round's model on mat1 and mat2.

- obj
customized objective function. Returns the gradient and second order gradient with the given prediction and dtrain.

- feval
customized evaluation function. Returns list(metric = 'metric-name', value = 'metric-value') with the given prediction and dtrain.

- verbose
If 0, xgboost will stay silent. If 1, it will print information about performance. If 2, some additional information will be printed out. Note that setting verbose > 0 automatically engages the cb.print.evaluation(period = 1) callback function.

- print_every_n
Print every n-th iteration's evaluation messages when verbose > 0. Default is 1, which means all messages are printed. This parameter is passed to the cb.print.evaluation callback.

- early_stopping_rounds
If NULL, the early stopping function is not triggered. If set to an integer k, training with a validation set will stop if the performance doesn't improve for k rounds. Setting this parameter engages the cb.early.stop callback.

- maximize
If feval and early_stopping_rounds are set, then this parameter must be set as well. When it is TRUE, it means the larger the evaluation score the better. This parameter is passed to the cb.early.stop callback.

- save_period
when it is non-NULL, the model is saved to disk after every save_period rounds; 0 means save at the end. The saving is handled by the cb.save.model callback.

- save_name
the name or path for the periodically saved model file.

- xgb_model
a previously built model to continue the training from. Could be either an object of class xgb.Booster, or its raw data, or the name of a file with a previously saved model.

- callbacks
a list of callback functions to perform various tasks during boosting. See callbacks. Some of the callbacks are automatically created depending on the parameters' values. The user can provide either existing or their own callback methods in order to customize the training process.

- ...
other parameters to pass to params.

- label
vector of response values. Should not be provided when data is a local data file name or an xgb.DMatrix.

- missing
by default is set to NA, which means that NA values should be considered as 'missing' by the algorithm. Sometimes, 0 or another extreme value might be used to represent missing values. This parameter is only used when the input is a dense matrix.

- weight
a vector indicating the weight for each row of the input.
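A minimal sketch tying several of these arguments together is shown below. It uses the agaricus data shipped with the package (as in the Examples section further down); the parameter values are arbitrary illustrations, not recommendations.

library(xgboost)
data(agaricus.train, package = "xgboost")

# Build the xgb.DMatrix that xgb.train requires; 'label', 'missing' and per-row
# weights (here all set to 1, purely for illustration) correspond to the
# convenience arguments of the simpler xgboost() wrapper.
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label, missing = NA)
setinfo(dtrain, "weight", rep(1, nrow(agaricus.train$data)))

# A short params list combining general, tree booster and task parameters.
params <- list(booster = "gbtree",
               objective = "binary:logistic",
               eta = 0.3, max_depth = 6,
               subsample = 1, colsample_bytree = 1)

bst <- xgb.train(params, dtrain, nrounds = 10,
                 watchlist = list(train = dtrain))

# Continue boosting from the previously built model via 'xgb_model'.
bst_more <- xgb.train(params, dtrain, nrounds = 5, xgb_model = bst)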
Details
These are the training functions for xgboost.

The xgb.train interface supports advanced features such as watchlist and customized objective and evaluation metric functions, and is therefore more flexible than the xgboost interface.

Parallelization is automatically enabled if OpenMP is present. The number of threads can also be manually specified via the nthread parameter.
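For instance (a minimal sketch, assuming a dtrain object constructed as in the Examples section below; the thread count is arbitrary):

params <- list(objective = "binary:logistic", nthread = 2)  # request 2 threads
bst <- xgb.train(params, dtrain, nrounds = 2)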
The evaluation metric is chosen automatically by XGBoost (according to the objective) when the eval_metric parameter is not provided. The user may set one or several eval_metric parameters (a sketch of passing several metrics at once follows the list below). Note that when using a customized metric, only this single metric can be used. The following is the list of built-in metrics for which XGBoost provides an optimized implementation:
rmse: root mean square error. https://en.wikipedia.org/wiki/Root_mean_square_error

logloss: negative log-likelihood. https://en.wikipedia.org/wiki/Log-likelihood

mlogloss: multiclass logloss. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html

error: binary classification error rate. It is calculated as (# wrong cases) / (# all cases). By default, it uses the 0.5 threshold for predicted values to define negative and positive instances. A different threshold (e.g., 0.) could be specified as "error@0."

merror: multiclass classification error rate. It is calculated as (# wrong cases) / (# all cases).

mae: mean absolute error

mape: mean absolute percentage error

auc: area under the curve. https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_curve for ranking evaluation.

aucpr: area under the PR curve. https://en.wikipedia.org/wiki/Precision_and_recall for ranking evaluation.

ndcg: Normalized Discounted Cumulative Gain (for ranking task). https://en.wikipedia.org/wiki/NDCG
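As a sketch of setting several evaluation metrics at once (assuming dtrain and dtest constructed as in the Examples section below), repeated eval_metric entries in params are each evaluated on every watchlist dataset:

params <- list(objective = "binary:logistic",
               eval_metric = "logloss",   # repeated 'eval_metric' entries are allowed
               eval_metric = "auc")
bst <- xgb.train(params, dtrain, nrounds = 2,
                 watchlist = list(train = dtrain, eval = dtest))
bst$evaluation_log   # one column per metric and per watchlist dataset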
The following callbacks are automatically created when certain parameters are set:
cb.print.evaluation is turned on when verbose > 0, and the print_every_n parameter is passed to it.

cb.evaluation.log is on when watchlist is present.

cb.early.stop: when early_stopping_rounds is set.

cb.save.model: when save_period > 0 is set.
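Equivalent behaviour can also be requested explicitly through the callbacks argument instead of relying on the automatically created callbacks; a sketch (assuming dtrain and dtest as in the Examples section below, with arbitrary settings):

params <- list(objective = "binary:logistic", eval_metric = "auc")
bst <- xgb.train(params, dtrain, nrounds = 25,
                 watchlist = list(train = dtrain, eval = dtest),
                 verbose = 0,   # keep the automatic cb.print.evaluation(period = 1) off
                 callbacks = list(
                   cb.print.evaluation(period = 5),               # print every 5th round
                   cb.early.stop(stopping_rounds = 3, maximize = TRUE)
                 ))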
Value
An object of class xgb.Booster
with the following elements:
handle: a handle (pointer) to the xgboost model in memory.

raw: a cached memory dump of the xgboost model saved as R's raw type.

niter: number of boosting iterations.

evaluation_log: evaluation history stored as a data.table with the first column corresponding to the iteration number and the rest corresponding to evaluation metrics' values. It is created by the cb.evaluation.log callback.

call: a function call.

params: parameters that were passed to the xgboost library. Note that it does not capture parameters changed by the cb.reset.parameters callback.

callbacks: callback functions that were either automatically assigned or explicitly passed.

best_iteration: iteration number with the best evaluation metric value (only available with early stopping).

best_ntreelimit: the ntreelimit value corresponding to the best iteration, which could further be used in the predict method (only available with early stopping).

best_score: the best evaluation metric value during early stopping (only available with early stopping).

feature_names: names of the training dataset features (only when column names were defined in training data).

nfeatures: number of features in training data.
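A short sketch of inspecting these elements on a fitted model (assuming bst was trained with a watchlist and early stopping, as in the Examples section below):

bst$niter            # number of boosting iterations performed
bst$evaluation_log   # per-iteration metric values as a data.table
bst$params           # parameters passed to the xgboost library
bst$best_iteration   # present only when early stopping was used
bst$nfeatures        # number of features in the training data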
References
Tianqi Chen and Carlos Guestrin, "XGBoost: A Scalable Tree Boosting System", 22nd SIGKDD Conference on Knowledge Discovery and Data Mining, 2016, https://arxiv.org/abs/1603.02754
Examples
# NOT RUN {
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
watchlist <- list(train = dtrain, eval = dtest)
## A simple xgb.train example:
param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2,
objective = "binary:logistic", eval_metric = "auc")
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist)
## An xgb.train example where custom objective and evaluation metric are used:
logregobj <- function(preds, dtrain) {
labels <- getinfo(dtrain, "label")
preds <- 1/(1 + exp(-preds))
grad <- preds - labels
hess <- preds * (1 - preds)
return(list(grad = grad, hess = hess))
}
evalerror <- function(preds, dtrain) {
labels <- getinfo(dtrain, "label")
err <- as.numeric(sum(labels != (preds > 0)))/length(labels)
return(list(metric = "error", value = err))
}
# These functions could be used by passing them either:
# as 'objective' and 'eval_metric' parameters in the params list:
param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2,
objective = logregobj, eval_metric = evalerror)
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist)
# or through the ... arguments:
param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2)
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,
objective = logregobj, eval_metric = evalerror)
# or as dedicated 'obj' and 'feval' parameters of xgb.train:
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,
obj = logregobj, feval = evalerror)
## An xgb.train example of using variable learning rates at each iteration:
param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2,
objective = "binary:logistic", eval_metric = "auc")
my_etas <- list(eta = c(0.5, 0.1))
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,
callbacks = list(cb.reset.parameters(my_etas)))
## Early stopping:
bst <- xgb.train(param, dtrain, nrounds = 25, watchlist,
early_stopping_rounds = 3)
## An 'xgboost' interface example:
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
max_depth = 2, eta = 1, nthread = 2, nrounds = 2,
objective = "binary:logistic")
pred <- predict(bst, agaricus.test$data)
# }