# Family

##### Gradient Boosting Families

`boost_family`

objects provide a convenient way to specify loss functions
and corresponding risk functions to be optimized by one of the boosting
algorithms implemented in this package.

- Keywords
- models

##### Usage

```
Family(ngradient, loss = NULL, risk = NULL,
offset = function(y, w)
optimize(risk, interval = range(y),
y = y, w = w)$minimum,
check_y = function(y) y,
weights = c("any", "none", "zeroone", "case"),
nuisance = function() return(NA),
name = "user-specified", fW = NULL,
response = function(f) NA,
rclass = function(f) NA)
AdaExp()
AUC()
Binomial(type = c("adaboost", "glm"),
link = c("logit", "probit", "cloglog", "cauchit", "log"), ...)
GaussClass()
GaussReg()
Gaussian()
Huber(d = NULL)
Laplace()
Poisson()
GammaReg(nuirange = c(0, 100))
CoxPH()
QuantReg(tau = 0.5, qoffset = 0.5)
ExpectReg(tau = 0.5)
NBinomial(nuirange = c(0, 100))
PropOdds(nuirange = c(-0.5, -1), offrange = c(-5, 5))
Weibull(nuirange = c(0, 100))
Loglog(nuirange = c(0, 100))
Lognormal(nuirange = c(0, 100))
Gehan()
Hurdle(nuirange = c(0, 100))
Multinomial()
Cindex(sigma = 0.1, ipcw = 1)
RCG(nuirange = c(0, 1), offrange = c(-5, 5))
```

##### Arguments

- ngradient
a function with arguments

`y`

,`f`

and`w`

implementing the*negative*gradient of the`loss`

function (which is to be minimized).- loss
an optional loss function with arguments

`y`

and`f`

.- risk
an optional risk function with arguments

`y`

,`f`

and`w`

to be minimized (!), the weighted mean of the loss function by default.- offset
a function with argument

`y`

and`w`

(weights) for computing a*scalar*offset.- fW
transformation of the fit for the diagonal weights matrix for an approximation of the boosting hat matrix for loss functions other than squared error.

- response
inverse link function of a GLM or any other transformation on the scale of the response.

- rclass
function to derive class predictions from conditional class probabilities (for models with factor response variable).

- check_y
a function for checking and transforming the class / mode of a response variable.

- nuisance
a function for extracting nuisance parameters from the family.

- weights
a character indicating what type of weights are allowed. These can be either arbitrary (non-negative) weights code"any", only zero and one weights

`"zeroone"`

, (non-negative) interger weights`"case"`

, or no weights are allowed`"none"`

.- name
a character giving the name of the loss function for pretty printing.

- type
which parameterization of

`Binomial`

shoule be used?- link
link function. For possible values see Usage section.

- d
delta parameter for Huber loss function. If omitted, it is chosen adaptively.

- tau
the quantile or expectile to be estimated, a number strictly between 0 and 1.

- qoffset
quantile of response distribution to be used as offset, i.e., starting values for the intercept. Per default the median of the response is used, which is in general a good choice (see Fenske et al. 2011, for details).

- nuirange
a vector containing the end-points of the interval to be searched for the minimum risk w.r.t. the nuisance parameter. In case of

`PropOdds`

, the starting values for the nuisance parameters.- offrange
interval to search in for offset.

- sigma
smoothness parameter for sigmoid functions inside

`Cindex`

.- ipcw
vector containing inverse probability of censoring weights for all observations. If omitted, it is estimated inside

`Cindex`

family.- …
additional arguments to link functions.

##### Details

The boosting algorithm implemented in `mboost`

minimizes the
(weighted) empirical risk function `risk(y, f, w)`

with respect to `f`

.
By default, the risk function is the weighted sum of the loss function `loss(y, f)`

but can be chosen arbitrarily. The `ngradient(y, f)`

function is the negative
gradient of `loss(y, f)`

with respect to `f`

.

Pre-fabricated functions for the most commonly used loss functions are available as well. Buehlmann and Hothorn (2007) give a detailed overview of the available loss functions. An updated overview can be found in Hofner et al (2014).

The `offset`

function returns the population minimizers evaluated at the
response, i.e., \(1/2 \log(p / (1 - p))\) for `Binomial()`

or `AdaExp()`

and \((\sum w_i)^{-1} \sum w_i y_i\) for `Gaussian()`

and the
median for `Huber()`

and `Laplace()`

. The offset is used as starting
value for the boosting algorithm.

Note that all families are functions and thus need to be specified either with
empty brackets (e.g., `family = Gaussian()`

for Gaussian regression) or
with additional arguments if these are supported by the respective family
(e.g., `family = QuantReg(tau = 0.2)`

for quantile regression for the
20% quantile).

**A short summary of the available families is given in the following paragraphs:**

`AdaExp()`

, `Binomial()`

and `AUC()`

implement
families for binary classification. `AdaExp()`

uses the
exponential loss, which essentially leads to the AdaBoost algorithm
of Freund and Schapire (1996). `Binomial()`

implements the
negative binomial log-likelihood of a logistic regression model
as loss function. Thus, using `Binomial`

family closely corresponds
to fitting a logistic model. Alternative link functions
can be specified.

However, the coefficients resulting from boosting with family
`Binomial(link = "logit")`

are \(1/2\) of the coefficients of a logit model
obtained via `glm`

. Buehlmann and Hothorn (2007) argue that the
family `Binomial`

is the preferred choice for binary
classification. For binary classification problems the response
`y`

has to be a `factor`

. Internally `y`

is re-coded
to \(-1\) and \(+1\) (Buehlmann and Hothorn 2007).

`Binomial(type = "glm")`

is an alternative to `Binomial()`

leading to
coefficients of the same size as coefficients from a classical logit
model via `glm`

. Additionally, it works not only with a
two-level factor but also with a two-column matrix containing the number
of successes and number of failures (again, similar to `glm`

).

`AUC()`

uses \(1-AUC(y, f)\) as the loss function.
The area under the ROC curve (AUC) is defined as
\(AUC = (n_{-1} n_1)^{-1} \sum_{i: y_i = 1} \sum_{j: y_j = -1} I(f_i > f_j)\).
Since this is not differentiable in `f`

, we approximate the jump function
\(I((f_i - f_j) > 0)\) by the distribution function of the triangular
distribution on \([-1, 1]\) with mean \(0\), similar to the logistic
distribution approximation used in Ma and Huang (2005).

`Gaussian()`

is the default family in `mboost`

. It
implements \(L_2\)Boosting for continuous response. Note
that families `GaussReg()`

and `GaussClass()`

(for regression
and classification) are deprecated now.
`Huber()`

implements a robust version for boosting with
continuous response, where the Huber-loss is used. `Laplace()`

implements another strategy for continuous outcomes and uses the
\(L_1\)-loss instead of the \(L_2\)-loss as used by
`Gaussian()`

.

`Poisson()`

implements a family for fitting count data with
boosting methods. The implemented loss function is the negative
Poisson log-likelihood. Note that the natural link function
\(\log(\mu) = \eta\) is assumed. The default step-site `nu = 0.1`

is probably too large for this family (leading to
infinite residuals) and smaller values are more appropriate.

`GammaReg()`

implements a family for fitting nonnegative response
variables. The implemented loss function is the negative Gamma
log-likelihood with logarithmic link function (instead of the natural
link).

`CoxPH()`

implements the negative partial log-likelihood for Cox
models. Hence, survival models can be boosted using this family.

`QuantReg()`

implements boosting for quantile regression, which is
introduced in Fenske et al. (2009). `ExpectReg`

works in analogy,
only for expectiles, which were introduced to regression by Newey and Powell (1987).

Families with an additional scale parameter can be used for fitting
models as well: `PropOdds()`

leads to proportional odds models
for ordinal outcome variables (Schmid et al., 2011). When using this
family, an ordered set of threshold parameters is re-estimated in each
boosting iteration. An example is given below which also shows how to
obtain the thresholds. `NBinomial()`

leads to regression models with
a negative binomial conditional distribution of the response.
`Weibull()`

, `Loglog()`

, and `Lognormal()`

implement
the negative log-likelihood functions of accelerated failure time
models with Weibull, log-logistic, and lognormal distributed outcomes,
respectively. Hence, parametric survival models can be boosted using
these families. For details see Schmid and Hothorn (2008) and Schmid
et al. (2010).

`Gehan()`

implements rank-based estimation of survival data in an
accelerated failure time model. The loss function is defined as the sum
of the pairwise absolute differences of residuals. The response needs to
be defined as `Surv(y, delta)`

, where `y`

is the observed survial
time (subject to censoring) and `delta`

is the non-censoring indicator
(see `Surv`

for details). For details on `Gehan()`

see
Johnson and Long (2011).

`Cindex()`

optimizes the concordance-index for survival data (often denoted
as Harrell's C or C-index). The concordance index evaluates the rank-based
concordance probability between the model and the outcome. The C-index measures
whether large values of the model are associated with short survival times and
vice versa. The interpretation is similar to the AUC: A C-index of 1 represents a
perfect discrimination while a C-index of 0.5 will be achieved by a completely
non-informative marker. The `Cindex()`

family is based on an estimator by
Uno et al. (2011), which incorporates inverse probability of censoring weighting
`ipcw`

. To make the estimator differentiable, sigmoid functions are applied;
the corresponding smoothness can be controlled via `sigma`

. For details on
`Cindex()`

see Mayr and Schmid (2014).

Hurdle models for zero-inflated count data can be fitted by using a combination
of the `Binomial()`

and `Hurdle()`

families. While the `Binomial()`

family allows for fitting the zero-generating process of the Hurdle model,
`Hurdle()`

fits a negative binomial regression model to the non-zero
counts. Note that the specification of the Hurdle model allows for using
`Binomial()`

and `Hurdle()`

independently of each other.

Linear or additive multinomial logit models can be fitted using
`Multinomial()`

; although is family requires some extra effort for
model specification (see example). More specifically, the predictor must
be in the form of a linear array model (see `%O%`

). Note
that this family does not work with tree-based base-learners at the
moment. The class corresponding to the last level of the factor coding
of the response is used as reference class.

`RCG()`

implements the ratio of correlated gammas (RCG) model proposed
by Weinhold et al. (2016).

##### Value

An object of class `boost_family`

.

##### Warning

The coefficients resulting from boosting with family
`Binomial(link = "logit")`

are \(1/2\) of the coefficients of a logit model
obtained via `glm`

(see above).

For `AUC()`

, variables should be centered and scaled and observations with weight > 0 must not contain missing values.
The estimated coefficients for `AUC()`

have no probabilistic interpretation.

##### References

Peter Buehlmann and Torsten Hothorn (2007),
Boosting algorithms: regularization, prediction and model fitting.
*Statistical Science*, **22**(4), 477--505.

Nora Fenske, Thomas Kneib, and Torsten Hothorn (2011),
Identifying risk factors for severe childhood malnutrition by
boosting additive quantile regression.
*Journal of the American Statistical Association*, **106**:494-510.

Yoav Freund and Robert E. Schapire (1996),
Experiments with a new boosting algorithm.
In *Machine Learning: Proc. Thirteenth International Conference*,
148--156.

Shuangge Ma and Jian Huang (2005), Regularized ROC method for
disease classification and biomarker selection with microarray
data. *Bioinformatics*, **21**(24), 4356--4362.

Andreas Mayr and Matthias Schmid (2014).
Boosting the concordance index for survival data -- a unified
framework to derive and evaluate biomarker combination.
*PloS ONE*, **9**(1):84483.

Whitney K. Newey and James L. Powell (1987),
Asymmetric least squares estimation and testing.
*Econometrika*, **55**, 819--847.

Matthias Schmid and Torsten Hothorn (2008),
Flexible boosting of accelerated failure time models.
*BMC Bioinformatics*, **9**(269).

Matthias Schmid, Sergej Potapov, Annette Pfahlberg,
and Torsten Hothorn (2010). Estimation and regularization techniques for
regression models with multidimensional prediction functions.
*Statistics and Computing*, **20**, 139--150.

Schmid, M., T. Hothorn, K. O. Maloney, D. E. Weller and S. Potapov
(2011): Geoadditive regression modeling of stream biological
condition. *Environmental and Ecological Statistics*,
**18**(4), 709--733.

Uno H, Cai T, Pencina MJ, D Agostino RB and Wei LJ (2011).
On the C-statistics for evaluating overall adequacy of risk prediction
procedures with censored survival data.
*Statistics in Medicine*, **30**(10), 1105--17.

Benjamin Hofner, Andreas Mayr, Nikolay Robinzonov and Matthias Schmid
(2014). Model-based Boosting in R: A Hands-on Tutorial Using the R
Package mboost. *Computational Statistics*, **29**, 3--35.
http://dx.doi.org/10.1007/s00180-012-0382-5

Available as vignette via: `vignette(package = "mboost", "mboost_tutorial")`

Brent A. Johnson and Qi Long (2011) Survival ensembles by the sum of pairwise
differences with application to lung cancer microarray studies.
*Annals of Applied Statistics*, **5**, 1081--1101.

Weinhold, L., S. Pechlivanis, S. Wahl, P. Hoffmann and M. Schmid (2016) A Statistical Model for the
Analysis of Bounded Response Variables in DNA Methylation Studies.
*BMC Bioinformatics*. 2016; 17: 480. http://dx.doi.org/10.1186/s12859-016-1347-4

##### See Also

`mboost`

for the usage of `Family`

s. See
`boost_family-class`

for objects resulting from a call to `Family`

.

##### Examples

```
# NOT RUN {
### Define a new family
MyGaussian <- function(){
Family(ngradient = function(y, f, w = 1) y - f,
loss = function(y, f) (y - f)^2,
name = "My Gauss Variant")
}
# Now use the new family
data(bodyfat, package = "TH.data")
mod <- mboost(DEXfat ~ ., data = bodyfat, family = MyGaussian())
# N.B. that the family needs to be called with empty brackets
# }
# NOT RUN {
############################################################
## Do not run and check these examples automatically as
## they take some time
### Proportional odds model
data(iris)
iris$Species <- factor(iris$Species, ordered = TRUE)
if (require("MASS")) {
(mod.polr <- polr(Species ~ Sepal.Length, data = iris))
}
mod.PropOdds <- glmboost(Species ~ Sepal.Length, data = iris,
family = PropOdds(nuirange = c(-0.5, 3)))
mstop(mod.PropOdds) <- 1000
## thresholds are treated as nuisance parameters, to extract these use
nuisance(mod.PropOdds)
## effect estimate
coef(mod.PropOdds)["Sepal.Length"]
## make thresholds comparable to a model without intercept
nuisance(mod.PropOdds) - coef(mod.PropOdds)["(Intercept)"] -
attr(coef(mod.PropOdds), "offset")
## End(Not run and test)
# }
# NOT RUN {
### Multinomial logit model via a linear array model
## One needs to convert the data to a list
myiris <- as.list(iris)
## ... and define a dummy vector with one factor level less
## than the outcome, which is used as reference category.
myiris$class <- factor(levels(iris$Species)[-nlevels(iris$Species)])
## Now fit the linear array model
mlm <- mboost(Species ~ bols(Sepal.Length, df = 2) %O%
bols(class, df = 2, contrasts.arg = "contr.dummy"),
data = myiris,
family = Multinomial())
coef(mlm) ## one should use more boosting iterations.
head(round(pred <- predict(mlm, type = "response"), 2))
## Prediction with new data:
newdata <- as.list(iris[1,])
## One always needs to keep the dummy vector class as above!
newdata$class <- factor(levels(iris$Species)[-nlevels(iris$Species)])
pred2 <- predict(mlm, type = "response", newdata = newdata)
## check results
pred[1, ]
pred2
# }
# NOT RUN {
############################################################
## Do not run and check these examples automatically as
## they take some time
## Compare results with nnet::multinom
if (require("nnet")) {
mlmn <- multinom(Species ~ Sepal.Length, data = iris)
max(abs(fitted(mlm[1000], type = "response") -
fitted(mlmn, type = "prob")))
}
## End(Not run and test)
# }
# NOT RUN {
# }
# NOT RUN {
############################################################
## Do not run and check these examples automatically as
## they take some time
### Example for RCG model
## generate covariate values
set.seed(12345)
x1 <- rnorm(500)
x2 <- rnorm(500)
## generate linear predictors
zetaM <- 0.1 + 0.3 * x1 - 0.5 * x2
zetaU <- 0.1 - 0.1 * x1 + 0.2 * x2
## generate beta values
M <- rgamma(500, shape = 2, rate = exp(zetaM))
U <- rgamma(500, shape = 2, rate = exp(zetaU))
y <- M / (M + U)
## fit RCG model
data <- data.frame(y, x1, x2)
RCGmodel <- glmboost(y ~ x1 + x2, data = data, family = RCG(),
control = boost_control(mstop = 2000,
trace = TRUE, nu = 0.01))
## true coefficients: gamma = (0.0, 0.4, -0.7),
## alpha (= shape) = 2,
## rho = 0
## compare to coefficient estimates
coef(RCGmodel)
nuisance(RCGmodel)
## compute downstream tests
## (only suitable without early stopping, i.e., if likelihood based model converged)
downstream.test(RCGmodel)
## compute conditional expectations
predictions <- predict(RCGmodel, type = "response")
plot(predictions, y)
abline(0,1)
## End(Not run and test)
# }
# NOT RUN {
# }
```

*Documentation reproduced from package mboost, version 2.9-1, License: GPL-2*