Usage
gbts(x, y, w = rep(1, nrow(x)), nitr = 100, nlhs = floor(nitr/2),
  nprd = 1000, kfld = 10, nwrk = 2, srch = c("bayes", "random"),
  rpkg = c("gbm", "xgb"), pfmc = c("acc", "dev", "ks", "auc", "roc", "mse",
  "rsq", "mae"), cdfx = "fpr", cdfy = "tpr", cutoff = 0.5,
  max_depth_range = c(2, 10), leaf_size_range = c(10, 200),
  bagn_frac_range = c(0.1, 1), coln_frac_range = c(0.1, 1),
  shrinkage_range = c(0.01, 0.1), num_trees_range = c(50, 1000),
  scl_pos_w_range = c(1, 10), print_progress = TRUE)
Arguments
x
a data.frame of predictors. If rpkg (described below) is set to "gbm",
then x is allowed to have categorical predictors represented as factors.
Otherwise, all predictors in x must be numeric.
y
a vector of response values. For binary classification, y must contain
values of 0 and 1. There is no need to convert y to a factor variable. For
regression, y must contain at least two unique values.
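The requirements on y can be checked before calling gbts. The following is a minimal sketch in base R, assuming a hypothetical raw response stored as a two-level factor; the variable names and level labels are illustrative only and are not part of the package.

```r
# Hypothetical raw response stored as a factor with two levels.
y_raw <- factor(c("no", "yes", "yes", "no", "yes"), levels = c("no", "yes"))

# Map the positive level to 1 and everything else to 0 (plain numeric vector,
# not a factor).
y <- as.numeric(y_raw == "yes")

# Sanity checks mirroring the documented requirements.
stopifnot(all(y %in% c(0, 1)))      # binary classification: only 0 and 1
stopifnot(length(unique(y)) >= 2)   # at least two unique values
```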
w
an optional vector of observation weights.
nitr
an integer of the number of iterations for the optimization.
nlhs
an integer of the number of Latin Hypercube samples (each sample
is a combination of hyperparameter values) used to generate the initial
performance model. This is used for Bayesian optimization only. Random search
ignores this argument.
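To illustrate what a Latin Hypercube sample of hyperparameter values looks like, here is a minimal base R sketch. The helper lhs_sample and the column names are hypothetical, and the way gbts itself draws its samples is not documented here, so this is only an illustration of the general technique.

```r
# Draw n Latin Hypercube samples over a set of named ranges.
# Each range is split into n equal-width strata and every stratum is hit once.
lhs_sample <- function(n, ranges) {
  cols <- lapply(ranges, function(r) {
    u <- (sample(n) - runif(n)) / n   # one point per stratum, in (0, 1)
    r[1] + u * (r[2] - r[1])          # rescale to [min, max]
  })
  as.data.frame(cols)
}

set.seed(1)
ranges <- list(
  max_depth = c(2, 10),
  shrinkage = c(0.01, 0.1),
  bagn_frac = c(0.1, 1)
)

# Five initial combinations of hyperparameter values, one per row.
init <- lhs_sample(5, ranges)
```

Integer-valued hyperparameters such as tree depth would still need rounding before use; the sketch leaves them continuous.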
nprd
an integer of the number of samples (each sample is a combination
of hyperparameter values) at which performance prediction is made, and the
best is selected to run the next iteration of GBT.
kfld
an integer of the number of folds for cross-validation used at
each iteration.
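As an illustration of k-fold cross-validation, rows can be assigned to folds as in the base R sketch below. The number of observations is hypothetical, and the exact fold assignment used inside gbts is an assumption not confirmed by this page.

```r
set.seed(42)
n    <- 103   # hypothetical number of observations
kfld <- 10

# Shuffle fold labels 1..kfld so that fold sizes differ by at most one.
fold_id <- sample(rep(seq_len(kfld), length.out = n))

# Row indices held out in each fold; the remaining rows form the training set.
held_out <- split(seq_len(n), fold_id)
```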
nwrk
an integer of the number of computing workers (CPU cores) to be used. If
nwrk is greater than the number of available cores on the machine, it uses
all available cores.
srch
a character indicating the search method such that srch="bayes" uses
Bayesian optimization (default), and srch="random" uses random search.
rpkg
a character indicating which package of GBT to use. Setting rpkg="gbm"
uses the gbm R package (default). Setting rpkg="xgb" uses the xgboost R
package. Note that with gbm, predictors can be categorical represented as
factors, as opposed to xgboost which requires all predictors to be numeric.
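When rpkg="xgb" is chosen, factor predictors have to be turned into numeric columns first. One common way, sketched below with base R's model.matrix on a hypothetical data frame, is dummy-variable expansion; this is an assumption about how a user might prepare x, not something gbts does automatically.

```r
# Hypothetical predictors: one numeric column and one factor column.
x <- data.frame(
  age   = c(23, 35, 47, 51),
  color = factor(c("red", "blue", "red", "green"))
)

# Dummy-variable expansion of the factor; "- 1" drops the intercept column.
x_num <- as.data.frame(model.matrix(~ . - 1, data = x))

# Every column is now numeric, as required when rpkg = "xgb".
stopifnot(all(vapply(x_num, is.numeric, logical(1))))
```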
pfmc
a character of the performance metric to optimize.
For binary classification, pfmc accepts:
- "acc": accuracy.
- "dev": deviance.
- "ks": Kolmogorov-Smirnov (KS) statistic.
- "auc": area under the ROC curve. This is used in conjunction with the
  cdfx and cdfy arguments (described below) which specify the cumulative
  distributions for the x-axis and y-axis of the ROC curve, respectively.
  The default ROC curve is given by true positive rate (on the y-axis) vs.
  false positive rate (on the x-axis).
- "roc": this is used when a point on the ROC curve is used as the
  performance metric, such as the true positive rate at a fixed false
  positive rate. This is used in conjunction with the cdfx, cdfy, and cutoff
  arguments which specify the cumulative distributions for the x-axis and
  y-axis of the ROC curve, and the cutoff (value on the x-axis) at which
  evaluation of the ROC curve is obtained as a performance metric. For
  example, if the desired performance metric is the true positive rate at
  the 5% false positive rate, specify pfmc="roc", cdfx="fpr", cdfy="tpr",
  and cutoff=0.05.
For regression, pfmc accepts:
- "mse": mean squared error.
- "mae": mean absolute error.
- "rsq": r-squared (coefficient of determination).
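For reference, several of these metrics can be computed by hand in base R. The sketch below uses hypothetical data and textbook definitions (the helper names ks_stat and auc_stat are illustrative); whether gbts computes them in exactly this way is an assumption.

```r
# Hypothetical labels (0/1) and predicted probabilities.
y <- c(0, 0, 1, 0, 1, 1, 0, 1)
p <- c(0.10, 0.35, 0.40, 0.20, 0.80, 0.65, 0.55, 0.90)

# KS statistic: largest gap between the score ECDFs of the two classes.
ks_stat <- function(y, p) {
  grid <- sort(unique(p))
  max(abs(ecdf(p[y == 1])(grid) - ecdf(p[y == 0])(grid)))
}

# AUC via the rank-sum (Mann-Whitney) identity; ties get average ranks.
auc_stat <- function(y, p) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

ks  <- ks_stat(y, p)
auc <- auc_stat(y, p)

# Regression metrics on hypothetical observed and fitted values.
obs <- c(3.0, 5.0, 2.5, 7.0)
fit <- c(2.5, 5.0, 3.0, 8.0)
mse <- mean((obs - fit)^2)
mae <- mean(abs(obs - fit))
rsq <- 1 - sum((obs - fit)^2) / sum((obs - mean(obs))^2)
```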
cdfx
a character of the cumulative distribution for the x-axis.
Supported values are
- "fpr": false positive rate.
- "fnr": false negative rate.
- "rpp": rate of positive prediction.
cdfy
a character of the cumulative distribution for the y-axis.
Supported values are
- "tpr": true positive rate.
- "tnr": true negative rate.
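The rates accepted by cdfx and cdfy follow from the confusion counts at a given threshold. A minimal base R sketch on hypothetical labels and hard predictions, using the standard definitions:

```r
# Hypothetical labels (0/1) and hard predictions at some threshold.
y    <- c(1, 1, 1, 0, 0, 0, 0, 1)
pred <- c(1, 1, 0, 0, 1, 0, 0, 1)

tp <- sum(pred == 1 & y == 1)   # true positives
fp <- sum(pred == 1 & y == 0)   # false positives
tn <- sum(pred == 0 & y == 0)   # true negatives
fn <- sum(pred == 0 & y == 1)   # false negatives

rates <- c(
  fpr = fp / (fp + tn),          # false positive rate
  fnr = fn / (fn + tp),          # false negative rate
  rpp = (tp + fp) / length(y),   # rate of positive prediction
  tpr = tp / (tp + fn),          # true positive rate
  tnr = tn / (tn + fp)           # true negative rate
)
```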
cutoff
a value in [0, 1] used for binary classification. If pfmc="acc", instances
with probabilities <= cutoff are predicted as negative, and those with
probabilities > cutoff are predicted as positive. If pfmc="roc", cutoff can
be used in conjunction with the cdfx and cdfy arguments (described above)
to specify the operating point. For example, if the desired performance
metric is the true positive rate at the 5% false positive rate, specify
pfmc="roc", cdfx="fpr", cdfy="tpr", and cutoff=0.05.
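The two roles of cutoff can be illustrated in base R. The sketch below uses hypothetical data; the helper tpr_at_fpr is illustrative and reads the ROC curve as a step function over observed score thresholds, which is an assumption, since this page does not state how gbts evaluates the curve between points.

```r
# Hypothetical labels (0/1) and predicted probabilities.
y <- c(0, 0, 1, 0, 1, 1, 0, 1)
p <- c(0.10, 0.35, 0.40, 0.20, 0.80, 0.65, 0.55, 0.90)

# pfmc = "acc": probabilities > cutoff are positive, <= cutoff are negative.
cutoff <- 0.5
pred   <- as.numeric(p > cutoff)
acc    <- mean(pred == y)

# pfmc = "roc" with cdfx = "fpr" and cdfy = "tpr": highest TPR among score
# thresholds whose FPR does not exceed the cutoff on the x-axis.
tpr_at_fpr <- function(y, p, cutoff) {
  thr <- sort(unique(p), decreasing = TRUE)
  fpr <- vapply(thr, function(th) mean(p[y == 0] >= th), numeric(1))
  tpr <- vapply(thr, function(th) mean(p[y == 1] >= th), numeric(1))
  ok  <- fpr <= cutoff
  if (any(ok)) max(tpr[ok]) else 0
}

# True positive rate at the 5% false positive rate.
tpr_005 <- tpr_at_fpr(y, p, cutoff = 0.05)
```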
max_depth_range
a vector of the minimum and maximum values for the maximum tree depth.
leaf_size_range
a vector of the minimum and maximum values for the leaf node size.
bagn_frac_range
a vector of the minimum and maximum values for the bag fraction.
coln_frac_range
a vector of the minimum and maximum values for the fraction of predictors
to try for each split.
shrinkage_range
a vector of the minimum and maximum values for the shrinkage.
num_trees_range
a vector of the minimum and maximum values for the number of trees.
scl_pos_w_range
a vector of the minimum and maximum values for the scale of weights for
positive cases.
print_progress
a logical of whether optimization progress should be
printed to the console.