run_gb is a wrapper function that applies the gradient boosting classifier to data provided by the user, evaluates prediction performance, and chooses the best-performing model.
Usage:

run_gb(
  y,
  L1.x,
  L2.x,
  L2.eval.unit,
  L2.unit,
  L2.reg,
  loss.unit,
  loss.fun,
  interaction.depth,
  shrinkage,
  n.trees.init,
  n.trees.increase,
  n.trees.max,
  cores = cores,
  n.minobsinnode,
  data,
  verbose
)
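A minimal sketch of a call, under assumed inputs: the fold list `cv_folds`, the column names, and the survey itself are hypothetical placeholders, not part of this documentation.

```r
# Hypothetical example: tune GB over 5 cross-validation folds.
# `cv_folds` (a list of 5 data.frames) and the column names below
# are illustrative; substitute those from your own survey data.
best_gb <- run_gb(
  y = "YES",                          # outcome column
  L1.x = c("L1x1", "L1x2"),           # individual-level covariates
  L2.x = c("L2x1", "L2x2"),           # context-level covariates
  L2.eval.unit = "state",
  L2.unit = "state",
  L2.reg = "region",
  loss.unit = "individuals",
  loss.fun = "MSE",
  interaction.depth = c(1, 2, 3),
  shrinkage = c(0.04, 0.01),
  n.trees.init = 50,
  n.trees.increase = 50,
  n.trees.max = 1000,
  cores = 1,
  n.minobsinnode = 5,
  data = cv_folds,
  verbose = TRUE
)
# best_gb is a list with elements interaction_depth, shrinkage, n_trees
```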
Value:

The tuned gradient boosting parameters. A list with three elements: interaction_depth contains the interaction depth parameter, shrinkage contains the learning rate, and n_trees contains the number of trees to be grown.
Arguments:

y: Outcome variable. A character scalar containing the column name of the outcome variable in survey.
L1.x: Individual-level covariates. A character vector containing the column names of the individual-level variables in survey and census used to predict outcome y. Note that the geographic unit is specified in argument L2.unit.
L2.x: Context-level covariates. A character vector containing the column names of the context-level variables in survey and census used to predict outcome y. To exclude context-level variables, set L2.x = NULL.
L2.eval.unit: Geographic unit for the loss function. A character scalar containing the column name of the geographic unit in survey and census.
L2.unit: Geographic unit. A character scalar containing the column name of the geographic unit in survey and census at which outcomes should be aggregated.
L2.reg: Geographic region. A character scalar containing the column name of the geographic region in survey and census by which geographic units are grouped (L2.unit must be nested within L2.reg). Default is NULL.
loss.unit: Loss function unit. A character-valued scalar indicating whether performance loss should be evaluated at the level of individual respondents (individuals) or geographic units (L2 units). Default is individuals.
loss.fun: Loss function. A character-valued scalar indicating whether prediction loss should be measured by the mean squared error (MSE) or the mean absolute error (MAE). Default is MSE.
interaction.depth: GB interaction depth. An integer-valued vector whose values specify the interaction depth of GB. The interaction depth defines the maximum depth of each tree grown (i.e., the maximum level of variable interactions). Default is c(1, 2, 3).
shrinkage: GB learning rate. A numeric vector whose values specify the learning rate or step-size reduction of GB. Values between 0.001 and 0.1 usually work, but a smaller learning rate typically requires more trees. Default is c(0.04, 0.01, 0.008, 0.005, 0.001).
n.trees.init: GB initial total number of trees. An integer-valued scalar specifying the initial number of total trees to fit by GB. Default is 50.
n.trees.increase: GB increase in total number of trees. An integer-valued scalar specifying by how many trees the total number of trees to fit should be increased (until n.trees.max is reached), or an integer-valued vector of length length(shrinkage), each of whose values is associated with a learning rate in shrinkage. Default is 50.
n.trees.max: GB maximum number of trees. An integer-valued scalar specifying the maximum number of trees to fit by GB, or an integer-valued vector of length length(shrinkage) with each of its values being associated with a learning rate and an increase in the total number of trees. Default is 1000.
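The scalar and vector forms of the tree schedule can be sketched as follows; the values are illustrative, not recommendations.

```r
# Scalar form: every learning rate shares one tree schedule.
shrinkage        <- c(0.04, 0.01, 0.001)
n.trees.increase <- 50
n.trees.max      <- 1000

# Vector form: one schedule entry per learning rate, so slower
# learning rates can be paired with larger tree budgets.
n.trees.increase <- c(50, 100, 200)
n.trees.max      <- c(500, 1000, 2000)

# In the vector form, lengths must match the shrinkage grid.
stopifnot(length(n.trees.increase) == length(shrinkage))
stopifnot(length(n.trees.max) == length(shrinkage))
```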
cores: The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1.
n.minobsinnode: GB minimum number of observations in the terminal nodes. An integer-valued scalar specifying the minimum number of observations that each terminal node of the trees must contain. Default is 5.
data: Data for cross-validation. A list of k data.frames, one for each fold to be used in k-fold cross-validation.
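One way to build such a fold list is a random split of the survey data; this is a sketch under the assumption that the survey is held in a data.frame called `survey` (a hypothetical name).

```r
# Split a survey data.frame into k = 5 randomly assigned folds,
# yielding a list of data.frames suitable for the `data` argument.
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = nrow(survey)))
cv_folds <- lapply(seq_len(k), function(i) survey[fold_id == i, ])
```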
verbose: Verbose output. A logical argument indicating whether or not verbose output should be printed. Default is TRUE.