Base-learners for Gradient Boosting
Base-learners for fitting base-models in the generic implementation of
component-wise gradient boosting in function
bols(..., by = NULL, index = NULL, intercept = TRUE, df = NULL, lambda = 0, contrasts.arg = "contr.treatment") bbs(..., by = NULL, index = NULL, knots = 20, boundary.knots = NULL, degree = 3, differences = 2, df = 4, lambda = NULL, center = FALSE, cyclic = FALSE) bspatial(..., df = 6) brad(..., by = NULL, index = NULL, knots = 100, df = 4, lambda = NULL, covFun = stationary.cov, args = list(Covariance="Matern", smoothness = 1.5, theta=NULL)) brandom(..., df = 4) btree(..., tree_controls = ctree_control(stump = TRUE, mincriterion = 0, savesplitstats = FALSE)) bmono(..., constraint = c("increasing", "decreasing", "convex", "concave"), by = NULL, index = NULL, knots = 20, boundary.knots = NULL, degree = 3, differences = 2, df = 4, lambda = NULL, lambda2 = 1e6, niter=10, intercept = TRUE, contrasts.arg = "contr.treatment") bmrf(..., by = NULL, index = NULL, bnd = NULL, df = 4, lambda = NULL, center = FALSE) buser(X, K = NULL, by = NULL, index = NULL, df = 4, lambda = NULL) bl1 %+% bl2 bl1 %X% bl2 bl1 %O% bl2
- one or more predictor variables or one data frame of predictor variables.
- an optional variable defining varying coefficients,
either a factor or numeric variable.
byis a factor, the coding is determined by the global
options("contrasts")or as specified "locally" for
- a vector of integers for expanding the variables in
.... For example,
bols(x, index = index)is equal to
indexis an integer of length greater or equ
- trace of the hat matrix for the base-learner defining the base-learner
complexity. Low values of
dfcorrespond to a large amount of smoothing and thus to "weaker" base-learners. Certain restrictions have to be kept f
- smoothing penalty, computed from
- either the number of knots or a vector of the positions
of the interior knots (for more details see below). For multiple
knotsmay be a named list where the names in the list are the variable names.
- boundary points at which to anchor the B-spline basis
(default the range of the data). A vector (of length 2)
for the lower and the upper boundary knot can be specified.This is
only advised for
bbs(..., cyclic = TRUE), where the b
- degree of the regression spline.
- a non-negative integer, typically 1, 2 or 3. If
differences= k, k-th-order differences are used as a penalty (0-th order differences specify a ridge penalty).
intercept = TRUEan intercept is added to the design matrix of a linear base-learner. If
intercept = FALSE, continuous covariates should be (mean-) centered.
center = TRUEthe corresponding effect is re-parameterized such that the unpenalized part of the fit is subtracted and only the deviation effect is fitted. The unpenalized, parametric part has then
cyclic = TRUEthe fitted values coincide at the boundaries (useful for cyclic covariates such as day time etc.).
- the covariance function (i.e. radial basis)
needed to compute the basis functions. Per
- a named list of arguments to be passed to
cov.function. Thus strongly dependent on the specified
- a named list of characters suitable for input to the
contrastsreplacement function, see
- an object of class
"TreeControl", which can be obtained using
ctree_control. Defines hyper-parameters for the trees which are used as base-learner
- type of constraint to be used. The constraint can be either monotonic "increasing" (default), "decreasing" or "convex" or "concave".
- penalty parameter for the (monotonicity) constraint.
- maximum number of iterations used to compute constraint estimates. Increase this number if a warning is displayed.
- Object of class
bnd, in which the boundaries of a map are defined and from which neighborhood relations can be construed. See
read.bnd. If a boundary object is not available, the
- design matrix as it should be used in the penalized least
squares estimation. Effect modifiers do not need to be included here
bycan be used for convenience).
- penalty matrix as it should be used in the penalized least
squares estimation. If
NULL(default), unpenalized estimation is used.
- a linear base-learner or a list of linear base-learners.
- a linear base-learner or a list of linear base-learners.
bols refers to linear base-learners (potentially estimated with a ridge penalty), while
bbs provide penalized regression splines.
fits bivariate surfaces and
brandom defines random effects base-learners.
In combination with option
by, these base-learners can be turned into varying
coefficient terms. The linear base-learners are fitted using Ridge Regression
where the penalty parameter
lambda is either computed from
brandom) or specified directly
lambda = 0 means no penalization as default for
x may be a numeric vector or factor.
x can be a data frame containing numeric or
factor variables. In this case, or when multiple predictor variables
are specified, e.g., using
bols(x1, x2), the model is
lm(y ~ ., data = x) or
lm(y ~ x1 + x2),
respectively. By default, an intercept term is added to the
corresponding design matrix (which can be omitted using
intercept = FALSE). It is strongly advised to (mean-)
center continuous covariates, if no intercept is used in
bols (see Hofner
et al., 2011). When
df is given, a ridge estimator with
df degrees of freedom (trace of hat matrix) is used as
base-learner. Note that all variables are treated as a group, i.e.,
they enter the model together if the corresponding base-learner is
selected. For ordinal variables, a ridge penalty for the differences
of the adjacent categories (Gertheiss and Tutz 2009, Hofner et al.
2011) is applied.
bbs, the P-spline approach of Eilers and Marx (1996) is
used. P-splines use a squared k-th-order difference penalty
which can be interpreted as an approximation of the integrated squared
k-th derivative of the spline. In
bbs the argument
knots specifies either the number of (equidistant) interior
knots to be used for the regression spline fit or a vector including
the positions of the interior knots. Additionally,
boundary.knots can be specified. However, this is only advised
if one uses cyclic constraints, where the
specify the points where the function is joined (e.g.,
boundary.knots = c(0, 2 * pi) for angles as in a sine function
boundary.knots = c(0, 24) for hours during the day).
bspatial implements bivariate tensor product P-splines for the
estimation of either spatial effects or interaction surfaces. Note
bspatial(x, y) is equivalent to
bbs(x, y, df = 6).
For possible arguments and defaults see there. The penalty term is
constructed based on bivariate extensions of the univariate penalties
y directions, see Kneib, Hothorn and Tutz
(2009) for details. Note that the dimensions of the penalty matrix
increase (quickly) with the number of knots with strong impact on
computational time. Thus, both should not be chosen to large.
Different knots for
y can be specified by a named
brandom(x) specifies a random effects base-learner based on a
x that defines the grouping structure of the
data set. For each level of
x, a separate random intercept is
fitted, where the random effects variance is governed by the
specification of the degrees of freedom
df. Note that
brandom(...) is essentially a wrapper to
bols(..., df =
4), i.e., a wrapper that utilizes ridge-penalized categorical
effects. For possible arguments and defaults see
For all linear base-learners the amount of smoothing is determined by
the trace of the hat matrix, as indicated by
by is specified as an additional argument, a varying
coefficients term is estimated, where
by is the interaction
variable and the effect modifier is given by either
y (specified via
used, this corresponds to the classical situation of varying
coefficients, where the effect of
by varies over the co-domain
x. In case of
bspatial as base-learner, the effect of
by varies with respect to both
y, i.e. an
interaction surface between
y is specified as
effect modifier. For
brandom specification of
to the estimation of random slopes for covariate
grouping structure defined by factor
x instead of a simple
random intercept. In
the computation of the smoothing parameter
lambda for given
df, or vice versa, might become (numerically) instable if the
values of the interaction variable
by become too large. In this
case, we recommend to rescale the interaction covariate e.g. by
specified with an factor variable
by with more than two
factors, the degrees of freedom are shared for the complete
base-learner (i.e., spread over all factor levels). Note that the null
space (see next paragraph) increases, as a separate null space for
each factor level is present. Thus, the minimum degrees of freedom
increase with increasing number of levels of
center = FALSE).
center requests that
the fitted effect is centered around its parametric, unpenalized part
(the so called null space). For example, with second order difference
penalty, a linear effect of
x remains unpenalized by
and therefore the degrees of freedom for the base-learner have to be
larger than two. To avoid this restriction, option
TRUE subtracts the unpenalized linear effect from the fit, allowing
to specify any positive number as
df. Note that in this case
the linear effect
x should generally be specified as an
bspatial and, for
example, second order differences, a linear effect of
bols(x)), a linear effect of
their interaction (
bols(x*y)) are subtracted from the effect
and have to be added separately to the model equation. More details on
centering can be found in Kneib, Hothorn and Tutz (2009) and Fahrmeir,
Kneib and Lang (2004).
brad(x) specifies penalized radial basis functions as used in
knots is used to specify the number of knots, the
cover.design is used to specify the
location of the knots such that they minimize a geometric
space-filling criterion. Furthermore, knots can be specified directly
via a matrix. The
cov.function allows to specify the
radial basis functions. Per default, the flexible Matern correlation
function is used. This is specified using
Covariance = "Matern" specified via
args. If an effective range
theta is applicable for the
correlation function (e.g., the Matern family) the user can specify
this value. Per default (if
theta = NULL) the effective range is
chosen as $\theta = max(||x_i - x_j||)/c$ such that the correlation function
$$\rho(c; \theta = 1) = \varepsilon,$$
where $\varepsilon = 0.001$.
bmrf builds a base of a Markov random field consisting of several regions
with a neighborhood structure. The input variable is the observed region.
The penalty matrix is either construed from a boundary object or must be
given directly via the option
bnd. In that case the
the matrix have to be the region names, on the diagonal the number of neighbors
have to be given for each region, and for each neighborhood relation the value
in the matrix has to be -1, else 0. With a boundary object at hand, the fitted
or predicted values can be directly plotted into the map using
buser(X, K) specifies a base-learner with user-specified design
X and penalty matrix
K are used to minimize a (penalized) least squares
criterion with quadratic penalty. This can be used to easily specify
base-learners that are not implemented (yet). See examples
below for details how
buser can be used to mimic existing
base-learners. Note that for predictions you need to set up the
design matrix for the new data manually.
For a categorical covariate with non-observed categories
brandom(x) both assign a zero effect
these categories. However, the non-observed categories must be
levels(x). Thus, predictions are possible
for new observations if they correspond to this category.
By default, all linear base-learners include an intercept term (which can
be removed using
intercept = FALSE for
center = TRUE for
bbs). In this case, an explicit global
intercept term should be added to
example below). With
bols(x, intercept = FALSE) with categorical
x a separate effect for each group (mean effect) is
estimated (see examples for resulting design matrices).
Three global options affect the base-learners:
TRUE indicates that the base-learner may use
sparse matrix techniques for its computations. This reduces the memory
consumption but might (for smaller sample sizes) require more computing
option("mboost_indexmin") is an integer for the sample
size required to optimize model fitting by taking ties into account.
option("mboost_dftraceS"), which is also
TRUE by default,
indicates that the trace of the smoother matrix is used as degrees
of freedom. If
FALSE, an alternative is used (see
Hofner et al., 2011).
Smooth estimates with constraints can be computed using the base-learner
bmono() which specifies P-spline base-learners with an
additional asymmetric penalty enforcing monotonicity or
convexity/concavity (see and Eilers, 2005). For more details in the boosting
context and monotonic effects of ordinal factors see Hofner, Mueller and Hothorn (2011).
Two or more linear base-learners can be joined using
tensor product of two or more linear base-learners is returned by
%X%. When the design matrix can be written as the
Kronecker product of two matrices
X = kronecker(X2, X1),
bl1 %X% bl2 with design matrices X1 and X2, respectively,
can be used to efficiently compute Ridge-estimates
following Currie, Durban, Eilers (2006). In cases the overall degrees of freedom of the
combined base-learner increase (additive or multiplicative,
respectively). These three features are experimental and for expert use
btree fits a stump to one or more variables. Note that
blackboost is more efficient for boosting stumps.
Note that the base-learners
bss are deprecated
(and no longer available). Please use
bbs instead, which
results in qualitatively the same models but is computationally much
- An object of class
blg(base-learner generator) with a
The call of
dppreturns an object of class
bl(base-learner) with a
fitfunction. The call to
fitfinally returns an object of class
Iain D. Currie, Maria Durban, and Paul H. C. Eilers (2006), Generalized linear array models with applications to multidimensional smoothing. Journal of the Royal Statistical Society, Series B--Statistical Methodology, 68(2), 259--280.
Paul H. C. Eilers (2005), Unimodal smoothing. Journal of Chemometrics, 19, 317--328.
Paul H. C. Eilers and Brian D. Marx (1996), Flexible smoothing with B-splines and penalties. Statistical Science, 11(2), 89-121.
Ludwig Fahrmeir, Thomas Kneib and Stefan Lang (2004), Penalized structured additive regression for space-time data: a Bayesian perspective. Statistica Sinica, 14, 731-761.
Jan Gertheiss and Gerhard Tutz (2009), Penalized regression with ordinal predictors, International Statistical Review, 77(3), 345--365.
Benjamin Hofner, Torsten Hothorn, Thomas Kneib, and Matthias Schmid (2011), A framework for unbiased model selection based on boosting. Journal of Computational and Graphical Statistics, in press. doi:10.1198/jcgs.2011.09220
Preliminary version: Department of Statistics, Technical Report Nr.
72, LMU Muenchen.
B. Hofner, J. Mueller, and T. Hothorn (2011), Monotonicity-Constrained Species Distribution Models, Ecology, 92:1895-1901.
Thomas Kneib, Torsten Hothorn and Gerhard Tutz (2009), Variable selection and model choice in geoadditive regression models, Biometrics, 65(2), 626--634.
set.seed(290875) n <- 100 x1 <- rnorm(n) x2 <- rnorm(n) + 0.25 * x1 x3 <- as.factor(sample(0:1, 100, replace = TRUE)) x4 <- gl(4, 25) y <- 3 * sin(x1) + x2^2 + rnorm(n) weights <- drop(rmultinom(1, n, rep.int(1, n) / n)) ### set up base-learners spline1 <- bbs(x1, knots = 20, df = 4) attributes(spline1) knots.x2 <- quantile(x2, c(0.25, 0.5, 0.75)) spline2 <- bbs(x2, knots = knots.x2, df = 5) attributes(spline2) attributes(ols3 <- bols(x3)) attributes(ols4 <- bols(x4)) ### compute base-models drop(ols3$dpp(weights)$fit(y)$model) ## same as: coef(lm(y ~ x3, weights = weights)) drop(ols4$dpp(weights)$fit(y)$model) ## same as: coef(lm(y ~ x4, weights = weights)) ### fit model, component-wise mod1 <- mboost_fit(list(spline1, spline2, ols3, ols4), y, weights) ### more convenient formula interface mod2 <- mboost(y ~ bbs(x1, knots = 20, df = 4) + bbs(x2, knots = knots.x2, df = 5) + bols(x3) + bols(x4), weights = weights) all.equal(coef(mod1), coef(mod2)) ### grouped linear effects # center x1 and x2 first x1 <- scale(x1, center = TRUE, scale = FALSE) x2 <- scale(x2, center = TRUE, scale = FALSE) model <- gamboost(y ~ bols(x1, x2, intercept = FALSE) + bols(x1, intercept = FALSE) + bols(x2, intercept = FALSE), control = boost_control(mstop = 400)) coef(model, which = 1) # one base-learner for x1 and x2 coef(model, which = 2:3) # two separate base-learners for x1 and x2 ### example for bspatial x1 <- runif(250,-pi,pi) x2 <- runif(250,-pi,pi) y <- sin(x1) * sin(x2) + rnorm(250, sd = 0.4) spline3 <- bspatial(x1, x2, knots = 12) attributes(spline3) ## specify number of knots separately form2 <- y ~ bspatial(x1, x2, knots = list(x1 = 12, x2 = 12)) ## decompose spatial effect into parametric part and ## deviation with one df form2 <- y ~ bols(x1) + bols(x2) + bols(x1*x2) + bspatial(x1, x2, knots = 12, center = TRUE, df = 1) ### random intercept id <- factor(rep(1:10, each = 5)) raneff <- brandom(id) attributes(raneff) ## random intercept with non-observed category set.seed(1907) y <- rnorm(50, mean = rep(rnorm(10), each = 5), sd = 0.1) plot(y ~ id) # category 10 not observed obs <- c(rep(1, 45), rep(0, 5)) model <- gamboost(y ~ brandom(id), weights = obs) coef(model) fitted(model)[46:50] # just the grand mean as usual for # random effects models ### random slope z <- runif(50) raneff <- brandom(id, by = z) attributes(raneff) ### specify simple interaction model (with main effect) n <- 210 x <- rnorm(n) X <- model.matrix(~ x) z <- gl(3, n/3) Z <- model.matrix(~z) beta <- list(c(0,1), c(-3,4), c(2, -4)) y <- rnorm(length(x), mean = (X * Z[,1]) %*% beta[] + (X * Z[,2]) %*% beta[] + (X * Z[,3]) %*% beta[]) plot(y ~ x, col = z) ## specify main effect and interaction mod_glm <- gamboost(y ~ bols(x) + bols(x, by = z), control = boost_control(mstop = 1000)) nd <- data.frame(x, z) nd <- nd[order(x),] nd$pred_glm <- predict(mod_glm, newdata = nd) for (i in seq(along = levels(z))) with(nd[nd$z == i,], lines(x, pred_glm, col = z)) mod_gam <- gamboost(y ~ bbs(x) + bbs(x, by = z), control = boost_control(mstop = 1000)) nd$pred_gam <- predict(mod_gam, newdata = nd) for (i in seq(along = levels(z))) with(nd[nd$z == i,], lines(x, pred_gam, col = z, lty = "dashed")) ### convenience function for plotting par(mfrow = c(1,3)) plot(mod_gam) ### remove intercept from base-learner ### and add explicit intercept to the model tmpdata <- data.frame(x = 1:100, y = rnorm(1:100), int = rep(1, 100)) mod <- gamboost(y ~ bols(int, intercept = FALSE) + bols(x, intercept = FALSE), data = tmpdata, control = boost_control(mstop = 2500)) cf <- unlist(coef(mod)) cf <- cf + mod$offset cf coef(lm(y ~ x, data = tmpdata)) ### quicker and better with (mean-) centering tmpdata$x_center <- tmpdata$x - mean(tmpdata$x) mod_center <- gamboost(y ~ bols(int, intercept = FALSE) + bols(x_center, intercept = FALSE), data = tmpdata, control = boost_control(mstop = 500)) cf_center <- unlist(coef(mod_center, which=1:2)) ## due to the shift in x direction we need to subtract ## beta_1 * mean(x) to get the correct intercept cf_center <- cf_center + mod_center$offset - cf_center * mean(tmpdata$x) cf_center coef(lm(y ~ x, data = tmpdata)) ### large data set with ties nunique <- 100 xindex <- sample(1:nunique, 1000000, replace = TRUE) x <- runif(nunique) y <- rnorm(length(xindex)) w <- rep.int(1, length(xindex)) ### brute force computations op <- options() options(mboost_indexmin = Inf, mboost_useMatrix = FALSE) ## data pre-processing b1 <- bbs(x[xindex])$dpp(w) ## model fitting c1 <- b1$fit(y)$model options(op) ### automatic search for ties, faster b2 <- bbs(x[xindex])$dpp(w) c2 <- b2$fit(y)$model ### manual specification of ties, even faster b3 <- bbs(x, index = xindex)$dpp(w) c3 <- b3$fit(y)$model all.equal(c1, c2) all.equal(c1, c3) ### cyclic P-splines set.seed(781) x <- runif(200, 0,(2*pi)) y <- rnorm(200, mean=sin(x), sd=0.2) newX <- seq(0,2*pi, length=100) ### model without cyclic constraints mod <- gamboost(y ~ bbs(x, knots = 20)) ### model with cyclic constraints mod_cyclic <- gamboost(y ~ bbs(x, cyclic=TRUE, knots = 20, boundary.knots=c(0, 2*pi))) par(mfrow = c(1,2)) plot(x,y, main="bbs (non-cyclic)", cex=0.5) lines(newX, sin(newX), lty="dotted") lines(newX + 2 * pi, sin(newX), lty="dashed") lines(newX, predict(mod, data.frame(x = newX)), col="red", lwd = 1.5) lines(newX + 2 * pi, predict(mod, data.frame(x = newX)), col="blue", lwd=1.5) plot(x,y, main="bbs (cyclic)", cex=0.5) lines(newX, sin(newX), lty="dotted") lines(newX + 2 * pi, sin(newX), lty="dashed") lines(newX, predict(mod_cyclic, data.frame(x = newX)), col="red", lwd = 1.5) lines(newX + 2 * pi, predict(mod_cyclic, data.frame(x = newX)), col="blue", lwd = 1.5) ### use buser() to mimic p-spline base-learner: set.seed(1907) x <- rnorm(100) y <- rnorm(100, mean = x^2, sd = 0.1) mod1 <- gamboost(y ~ bbs(x)) ## now extract design and penalty matrix X <- extract(bbs(x), "design") K <- extract(bbs(x), "penalty") ## use X and K in buser() mod2 <- gamboost(y ~ buser(X, K)) max(abs(predict(mod1) - predict(mod2))) # same results ### use buser() to mimic penalized ordinal base-learner: z <- as.ordered(sample(1:3, 100, replace=TRUE)) y <- rnorm(100, mean = as.numeric(z), sd = 0.1) X <- extract(bols(z)) K <- extract(bols(z), "penalty") index <- extract(bols(z), "index") mod1 <- gamboost(y ~ buser(X, K, df = 1, index = index)) mod2 <- gamboost(y ~ bols(z, df = 1)) max(abs(predict(mod1) - predict(mod2))) # same results ### kronecker product for matrix-valued responses data("volcano", package = "datasets") layout(matrix(1:2, ncol = 2)) ## estimate mean of image treating image as matrix image(volcano, main = "data") x1 <- 1:nrow(volcano) x2 <- 1:ncol(volcano) vol <- as.vector(volcano) mod <- mboost(vol ~ bbs(x1, df = 3, knots = 10)%O% bbs(x2, df = 3, knots = 10), control = boost_control(nu = 0.25)) mod volf <- matrix(fitted(mod), nrow = nrow(volcano)) image(volf, main = "fitted") ## the old-fashioned way, a waste of space and time x <- expand.grid(x1, x2) modx <- mboost(vol ~ bbs(Var2, df = 3, knots = 10)%X% bbs(Var1, df = 3, knots = 10), data = x, control = boost_control(nu = 0.25)) modx max(abs(fitted(mod) - fitted(modx)))