boost_family
objects provide a convenient way to specify loss functions
and corresponding risk functions to be optimized by one of the boosting
algorithms implemented in this package.

Usage:

Family(ngradient, loss = NULL, risk = NULL,
offset = function(y, w)
optimize(risk, interval = range(y),
y = y, w = w)$minimum,
check_y = function(y) y,
weights = c("any", "none", "zeroone", "case"),
nuisance = function() return(NA),
name = "user-specified", fW = NULL,
response = function(f) NA,
rclass = function(f) NA)
AdaExp()
AUC()
Binomial(link = c("logit", "probit"), ...)
GaussClass()
GaussReg()
Gaussian()
Huber(d = NULL)
Laplace()
Poisson()
GammaReg(nuirange = c(0, 100))
CoxPH()
QuantReg(tau = 0.5, qoffset = 0.5)
ExpectReg(tau = 0.5)
NBinomial(nuirange = c(0, 100))
PropOdds(nuirange = c(-0.5, -1), offrange = c(-5, 5))
Weibull(nuirange = c(0, 100))
Loglog(nuirange = c(0, 100))
Lognormal(nuirange = c(0, 100))
Arguments:

ngradient: a function with arguments y, f and w implementing the
negative gradient of the loss function (which is to be minimized).

loss: an optional loss function with arguments y and f.

risk: an optional risk function with arguments y, f and w to be
minimized (!), the weighted mean of the loss function by default.

offset: a function with arguments y and w (weights) for computing a
scalar offset.

link: link function, specified via the name of the corresponding
distribution function (e.g., link = "norm"), parameters of which may
be specified via the ... argument.

nuirange: a vector containing the end-points of the interval to be
searched for the minimum risk with respect to the nuisance parameter;
for PropOdds, the starting values for the nuisance parameters.

Value:

An object of class boost_family.

Warning:

The coefficients resulting from boosting with family Binomial
are $1/2$ of the coefficients of a logit model
obtained via glm. This is due to the internal recoding
of the response to $-1$ and $+1$ (see below).
For AUC(), variables should be centered and scaled, and observations
with weight > 0 must not contain missing values. The estimated
coefficients for AUC() have no probabilistic interpretation.

Details:

mboost
minimizes the
(weighted) empirical risk function risk(y, f, w)
with respect to f
.
By default, the risk function is the weighted sum of the loss function loss(y, f)
but can be chosen arbitrarily. The ngradient(y, f)
function is the negative
gradient of loss(y, f)
with respect to f
.
Pre-fabricated functions for the most commonly used loss functions are
available as well. Buehlmann and Hothorn (2007) give a detailed
overview of the available loss functions. The offset
function
returns the population minimizers evaluated at the response, i.e.,
$1/2 \log(p / (1 - p))$ for Binomial()
or AdaExp()
and $(\sum w_i)^{-1} \sum w_i y_i$ for Gaussian()
and the
median for Huber()
and Laplace()
. A short summary of the
available families is given in the following paragraphs:
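As an aside, the default offset mechanism from the Usage section can be sketched in base R. This is an illustration with made-up data, not mboost internals; the squared-error risk below reproduces the weighted-mean population minimizer stated above:

```r
## Sketch of the default offset (cf. the Usage section), not mboost
## internals: the offset minimizes the empirical risk over a constant fit.
risk <- function(y, f, w) sum(w * (y - f)^2)   # squared-error risk
offset <- function(y, w)
  optimize(risk, interval = range(y), y = y, w = w)$minimum

## made-up data for illustration
y <- c(1, 2, 4, 8)
w <- c(1, 1, 1, 2)
offset(y, w)            # approximately weighted.mean(y, w) = 4.6
```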
AdaExp()
, Binomial()
and AUC()
implement
families for binary classification. AdaExp()
uses the
exponential loss, which essentially leads to the AdaBoost algorithm
of Freund and Schapire (1996). Binomial()
implements the
negative binomial log-likelihood of a logistic regression model
as loss function. Thus, using the Binomial
family closely corresponds
to fitting a logistic model. Alternative link functions
can be specified via the name of the corresponding distribution
function; for example, link = "cauchy"
leads to pcauchy
being used as the link function. This feature is still experimental and
not well tested.
However, the coefficients resulting from boosting with family
Binomial(link = "logit")
are $1/2$ of the coefficients of a logit model
obtained via glm
. This is due to the internal recoding
of the response to $-1$ and $+1$ (see below).
Nevertheless, Buehlmann and Hothorn (2007) argue that the
family Binomial
is the preferred choice for binary
classification. For binary classification problems the response
y
has to be a factor
. Internally y
is re-coded
to $-1$ and $+1$ (Buehlmann and Hothorn 2007).
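The factor of $1/2$ mentioned in the Warning can be verified numerically. The code below is a sketch using the logistic loss on the $\pm 1$ coding, not mboost's internal implementation:

```r
## With y coded as -1/+1, the logistic loss log(1 + exp(-2 * y * f)) is
## minimized over a constant f by f = 1/2 * log(p / (1 - p)),
## i.e. half the usual logit -- hence the halved coefficients.
y <- c(rep(1, 7), rep(-1, 3))                  # empirical p = 0.7
nll <- function(f) sum(log(1 + exp(-2 * y * f)))
fhat <- optimize(nll, interval = c(-5, 5))$minimum
fhat                    # approximately 0.5 * log(0.7 / 0.3) = 0.4236
```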
AUC()
uses $1-AUC(y, f)$ as the loss function.
The area under the ROC curve (AUC) is defined as
$AUC = (n_{-1} n_1)^{-1} \sum_{i: y_i = 1} \sum_{j: y_j = -1} I(f_i > f_j)$.
Since this is not differentiable in f
, we approximate the jump function
$I((f_i - f_j) > 0)$ by the distribution function of the triangular
distribution on $[-1, 1]$ with mean $0$, similar to the logistic
distribution approximation used in Ma and Huang (2005).
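The empirical AUC and the triangular-distribution surrogate can be sketched as follows (illustrative code, not the package implementation):

```r
## Empirical AUC: fraction of (positive, negative) pairs ranked correctly.
auc <- function(y, f) {
  pos <- f[y == 1]
  neg <- f[y == -1]
  mean(outer(pos, neg, ">"))
}

## CDF of the triangular distribution on [-1, 1] with mean 0, used as a
## differentiable stand-in for the jump function I(f_i - f_j > 0).
tri_cdf <- function(x)
  ifelse(x <= -1, 0,
  ifelse(x >=  1, 1,
  ifelse(x <   0, (x + 1)^2 / 2, 1 - (1 - x)^2 / 2)))

## made-up scores: 3 of the 4 (pos, neg) pairs are ranked correctly
y <- c(1, 1, -1, -1)
f <- c(2, 0.5, 1, -1)
auc(y, f)               # 0.75
```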
Gaussian()
is the default family in mboost
. It
implements $L_2$Boosting for continuous response. Note
that families GaussReg()
and GaussClass()
(for regression
and classification) are deprecated now.
Huber()
implements a robust version for boosting with
continuous response, where the Huber-loss is used. Laplace()
implements another strategy for continuous outcomes and uses the
$L_1$-loss instead of the $L_2$-loss as used by
Gaussian()
.
Poisson()
implements a family for fitting count data with
boosting methods. The implemented loss function is the negative
Poisson log-likelihood. Note that the natural link function
$\log(\mu) = \eta$ is assumed. The default step-size nu = 0.1
is probably too large for this family (leading to
infinite residuals) and smaller values are more appropriate.
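The Poisson loss and its negative gradient under the log link follow directly from the log-likelihood. The sketch below drops the $\log(y!)$ constant and is not mboost's internal code:

```r
## Negative Poisson log-likelihood with log link eta = f, up to the
## y-dependent constant log(y!):
loss <- function(y, f) exp(f) - y * f
## Negative gradient of the loss with respect to f:
ngradient <- function(y, f) y - exp(f)

## quick check against a central-difference numerical derivative
y <- 3; f <- 0.4
num <- (loss(y, f + 1e-6) - loss(y, f - 1e-6)) / 2e-6
all.equal(-num, ngradient(y, f), tolerance = 1e-4)   # TRUE
```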
GammaReg()
implements a family for fitting nonnegative response
variables. The implemented loss function is the negative Gamma
log-likelihood with logarithmic link function (instead of the natural
link).
CoxPH()
implements the negative partial log-likelihood for Cox
models. Hence, survival models can be boosted using this family.
QuantReg()
implements boosting for quantile regression, introduced in
Fenske et al. (2009). ExpectReg()
works analogously
for expectiles, which were introduced to regression by Newey and Powell (1987).
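The loss underlying quantile regression is the check (pinball) loss; minimizing it over a constant fit recovers the tau-quantile. A base-R illustration (not mboost code):

```r
## Check (pinball) loss for quantile level tau; its empirical risk is
## minimized over a constant by the tau-quantile of y.
check_loss <- function(y, f, tau = 0.5) {
  u <- y - f
  ifelse(u >= 0, tau * u, (tau - 1) * u)
}

## minimizing the empirical risk at tau = 0.5 recovers the median
y <- c(1, 2, 3, 10, 20)
risk <- function(f) sum(check_loss(y, f, tau = 0.5))
fhat <- optimize(risk, interval = range(y))$minimum
fhat                    # close to median(y) = 3
```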
Families with an additional scale parameter can be used for fitting
models as well: PropOdds()
leads to proportional odds models
for ordinal outcome variables. When using this family, an ordered set of
threshold parameters is re-estimated in each boosting iteration.
NBinomial()
leads to regression models with a negative binomial
conditional distribution of the response. Weibull()
, Loglog()
,
and Lognormal()
implement the negative log-likelihood functions
of accelerated failure time models with Weibull, log-logistic, and
lognormal distributed outcomes, respectively. Hence, parametric survival
models can be boosted using these families. For details see Schmid and
Hothorn (2008) and Schmid et al. (2010).

See Also:

mboost for the usage of Family objects; boost_family-class
for objects resulting from a call to Family.

Examples:

Laplace()
MyGaussian <- function() {
  ## squared-error loss and its negative gradient
  Family(ngradient = function(y, f, w = 1) y - f,
         loss = function(y, f) (y - f)^2,
         name = "My Gauss Variant")
}