baselearners: Base-learners for Gradient Boosting

Description

Base-learners for fitting base-models in the generic implementation of component-wise gradient boosting in function mboost.

Usage

bols(..., by = NULL, index = NULL, intercept = TRUE, df = NULL,
     lambda = 0, contrasts.arg = "contr.treatment")
bbs(..., by = NULL, index = NULL, knots = 20, boundary.knots = NULL,
    degree = 3, differences = 2, df = 4, lambda = NULL, center = FALSE,
    cyclic = FALSE)
bspatial(..., df = 6)
brad(..., by = NULL, index = NULL, knots = 100, df = 4, lambda = NULL,
     covFun = stationary.cov,
     args = list(Covariance="Matern", smoothness = 1.5, theta=NULL))
brandom(..., df = 4)
btree(..., tree_controls = ctree_control(stump = TRUE,
                                         mincriterion = 0,
                                         savesplitstats = FALSE))
bmono(..., constraint = c("increasing", "decreasing", "convex", "concave"),
      by = NULL, index = NULL, knots = 20, boundary.knots = NULL,
      degree = 3, differences = 2, df = 4, lambda = NULL,
      lambda2 = 1e6, niter=10, intercept = TRUE,
      contrasts.arg = "contr.treatment")
bmrf(..., by = NULL, index = NULL, bnd = NULL, df = 4, lambda = NULL,
    center = FALSE)
buser(X, K = NULL, by = NULL, index = NULL, df = 4, lambda = NULL)
bl1 %+% bl2
bl1 %X% bl2
bl1 %O% bl2

Arguments

...

one or more predictor variables or one data frame of predictor variables.

an optional variable defining varying coefficients, either a factor or numeric variable. If by is a factor, the coding is determined by the global options("contrasts") or as specified "locally" for

index

a vector of integers for expanding the variables in .... For example, bols(x, index = index) is equal to bols(x[index]), where index is an integer of length greater or equ

trace of the hat matrix for the base-learner defining the base-learner complexity. Low values of df correspond to a large amount of smoothing and thus to "weaker" base-learners. Certain restrictions have to be kept f

lambda

smoothing penalty, computed from df when df is specified.

knots

either the number of knots or a vector of the positions of the interior knots (for more details see below). For multiple predictor variables, knots may be a named list where the names in the list are the variable names.

boundary.knots

boundary points at which to anchor the B-spline basis (default the range of the data). A vector (of length 2) for the lower and the upper boundary knot can be specified.This is only advised for bbs(..., cyclic = TRUE), where the b

degree

degree of the regression spline.

differences

a non-negative integer, typically 1, 2 or 3. If differences = k, k-th-order differences are used as a penalty (0-th order differences specify a ridge penalty).

intercept

if intercept = TRUE an intercept is added to the design matrix of a linear base-learner. If

intercept =
		      FALSE

, continuous covariates should be (mean-) centered.

center

if center = TRUE the corresponding effect is re-parameterized such that the unpenalized part of the fit is subtracted and only the deviation effect is fitted. The unpenalized, parametric part has then

cyclic

if cyclic = TRUE the fitted values coincide at the boundaries (useful for cyclic covariates such as day time etc.).

covFun

the covariance function (i.e. radial basis) needed to compute the basis functions. Per default the stationary.cov function (package fiel

args

a named list of arguments to be passed to cov.function. Thus strongly dependent on the specified cov.function.

contrasts.arg

a named list of characters suitable for input to the contrasts replacement function, see model.matrix,

tree_controls

an object of class "TreeControl", which can be obtained using ctree_control. Defines hyper-parameters for the trees which are used as base-learner

constraint

type of constraint to be used. The constraint can be either monotonic "increasing" (default), "decreasing" or "convex" or "concave".

lambda2

penalty parameter for the (monotonicity) constraint.

niter

maximum number of iterations used to compute constraint estimates. Increase this number if a warning is displayed.

bnd

Object of class bnd, in which the boundaries of a map are defined and from which neighborhood relations can be construed. See read.bnd. If a boundary object is not available, the

design matrix as it should be used in the penalized least squares estimation. Effect modifiers do not need to be included here (by can be used for convenience).

penalty matrix as it should be used in the penalized least squares estimation. If NULL (default), unpenalized estimation is used.

bl1

a linear base-learner or a list of linear base-learners.

bl2

a linear base-learner or a list of linear base-learners.

Value

An object of class blg (base-learner generator) with a dpp function.
The call of dpp returns an object of class bl (base-learner) with a fit function. The call to fit finally returns an object of class bm (base-model).

Details

bols refers to linear base-learners (potentially estimated with a ridge penalty), while bbs provide penalized regression splines. bspatial fits bivariate surfaces and brandom defines random effects base-learners. In combination with option by, these base-learners can be turned into varying coefficient terms. The linear base-learners are fitted using Ridge Regression where the penalty parameter lambda is either computed from df (default for bbs, bspatial, and brandom) or specified directly (lambda = 0 means no penalization as default for bols).

In bols(x), x may be a numeric vector or factor. Alternatively, x can be a data frame containing numeric or factor variables. In this case, or when multiple predictor variables are specified, e.g., using bols(x1, x2), the model is equivalent to lm(y ~ ., data = x) or lm(y ~ x1 + x2), respectively. By default, an intercept term is added to the corresponding design matrix (which can be omitted using intercept = FALSE). It is strongly advised to (mean-) center continuous covariates, if no intercept is used in bols (see Hofner et al., 2011). When df is given, a ridge estimator with df degrees of freedom (trace of hat matrix) is used as base-learner. Note that all variables are treated as a group, i.e., they enter the model together if the corresponding base-learner is selected. For ordinal variables, a ridge penalty for the differences of the adjacent categories (Gertheiss and Tutz 2009, Hofner et al. 2011) is applied.

With bbs, the P-spline approach of Eilers and Marx (1996) is used. P-splines use a squared k-th-order difference penalty which can be interpreted as an approximation of the integrated squared k-th derivative of the spline. In bbs the argument knots specifies either the number of (equidistant) interior knots to be used for the regression spline fit or a vector including the positions of the interior knots. Additionally, boundary.knots can be specified. However, this is only advised if one uses cyclic constraints, where the boundary.knots specify the points where the function is joined (e.g., boundary.knots = c(0, 2 * pi) for angles as in a sine function or boundary.knots = c(0, 24) for hours during the day).

bspatial implements bivariate tensor product P-splines for the estimation of either spatial effects or interaction surfaces. Note that bspatial(x, y) is equivalent to bbs(x, y, df = 6). For possible arguments and defaults see there. The penalty term is constructed based on bivariate extensions of the univariate penalties in x and y directions, see Kneib, Hothorn and Tutz (2009) for details. Note that the dimensions of the penalty matrix increase (quickly) with the number of knots with strong impact on computational time. Thus, both should not be chosen to large. Different knots for x and y can be specified by a named list.

brandom(x) specifies a random effects base-learner based on a factor variable x that defines the grouping structure of the data set. For each level of x, a separate random intercept is fitted, where the random effects variance is governed by the specification of the degrees of freedom df. Note that brandom(...) is essentially a wrapper to bols(..., df = 4), i.e., a wrapper that utilizes ridge-penalized categorical effects. For possible arguments and defaults see bols.

For all linear base-learners the amount of smoothing is determined by the trace of the hat matrix, as indicated by df.

If by is specified as an additional argument, a varying coefficients term is estimated, where by is the interaction variable and the effect modifier is given by either x or x and y (specified via ...). If bbs is used, this corresponds to the classical situation of varying coefficients, where the effect of by varies over the co-domain of x. In case of bspatial as base-learner, the effect of by varies with respect to both x and y, i.e. an interaction surface between x and y is specified as effect modifier. For brandom specification of by leads to the estimation of random slopes for covariate by with grouping structure defined by factor x instead of a simple random intercept. In bbs, bspatial and brandom the computation of the smoothing parameter lambda for given df, or vice versa, might become (numerically) instable if the values of the interaction variable by become too large. In this case, we recommend to rescale the interaction covariate e.g. by dividing by max(abs(by)). If bbs or bspatial is specified with an factor variable by with more than two factors, the degrees of freedom are shared for the complete base-learner (i.e., spread over all factor levels). Note that the null space (see next paragraph) increases, as a separate null space for each factor level is present. Thus, the minimum degrees of freedom increase with increasing number of levels of by (if center = FALSE).

For bbs and bspatial, option center requests that the fitted effect is centered around its parametric, unpenalized part (the so called null space). For example, with second order difference penalty, a linear effect of x remains unpenalized by bbs and therefore the degrees of freedom for the base-learner have to be larger than two. To avoid this restriction, option center = TRUE subtracts the unpenalized linear effect from the fit, allowing to specify any positive number as df. Note that in this case the linear effect x should generally be specified as an additional base-learner bols(x). For bspatial and, for example, second order differences, a linear effect of x (bols(x)), a linear effect of y (bols(y)), and their interaction (bols(x*y)) are subtracted from the effect and have to be added separately to the model equation. More details on centering can be found in Kneib, Hothorn and Tutz (2009) and Fahrmeir, Kneib and Lang (2004).

brad(x) specifies penalized radial basis functions as used in Kriging. If knots is used to specify the number of knots, the function cover.design is used to specify the location of the knots such that they minimize a geometric space-filling criterion. Furthermore, knots can be specified directly via a matrix. The cov.function allows to specify the radial basis functions. Per default, the flexible Matern correlation function is used. This is specified using cov.function = stationary.cov with Covariance = "Matern" specified via args. If an effective range theta is applicable for the correlation function (e.g., the Matern family) the user can specify this value. Per default (if theta = NULL) the effective range is chosen as $\theta = max(||x_i - x_j||)/c$ such that the correlation function $$\rho(c; \theta = 1) = \varepsilon,$$ where $\varepsilon = 0.001$.

bmrf builds a base of a Markov random field consisting of several regions with a neighborhood structure. The input variable is the observed region. The penalty matrix is either construed from a boundary object or must be given directly via the option bnd. In that case the dimnames of the matrix have to be the region names, on the diagonal the number of neighbors have to be given for each region, and for each neighborhood relation the value in the matrix has to be -1, else 0. With a boundary object at hand, the fitted or predicted values can be directly plotted into the map using drawmap.

buser(X, K) specifies a base-learner with user-specified design matrix X and penalty matrix K, where X and K are used to minimize a (penalized) least squares criterion with quadratic penalty. This can be used to easily specify base-learners that are not implemented (yet). See examples below for details how buser can be used to mimic existing base-learners. Note that for predictions you need to set up the design matrix for the new data manually.

For a categorical covariate with non-observed categories bols(x) and brandom(x) both assign a zero effect these categories. However, the non-observed categories must be listed in levels(x). Thus, predictions are possible for new observations if they correspond to this category.

By default, all linear base-learners include an intercept term (which can be removed using intercept = FALSE for bols or center = TRUE for bbs). In this case, an explicit global intercept term should be added to gamboost via bols (see example below). With bols(x, intercept = FALSE) with categorical covariate x a separate effect for each group (mean effect) is estimated (see examples for resulting design matrices).

Three global options affect the base-learners: option("mboost_useMatrix") defaulting to TRUE indicates that the base-learner may use sparse matrix techniques for its computations. This reduces the memory consumption but might (for smaller sample sizes) require more computing time. option("mboost_indexmin") is an integer for the sample size required to optimize model fitting by taking ties into account. option("mboost_dftraceS"), which is also TRUE by default, indicates that the trace of the smoother matrix is used as degrees of freedom. If FALSE, an alternative is used (see Hofner et al., 2011).

Smooth estimates with constraints can be computed using the base-learner bmono() which specifies P-spline base-learners with an additional asymmetric penalty enforcing monotonicity or convexity/concavity (see and Eilers, 2005). For more details in the boosting context and monotonic effects of ordinal factors see Hofner, Mueller and Hothorn (2011).

Two or more linear base-learners can be joined using %+%. A tensor product of two or more linear base-learners is returned by %X%. When the design matrix can be written as the Kronecker product of two matrices X = kronecker(X2, X1), then bl1 %X% bl2 with design matrices X1 and X2, respectively, can be used to efficiently compute Ridge-estimates following Currie, Durban, Eilers (2006). In cases the overall degrees of freedom of the combined base-learner increase (additive or multiplicative, respectively). These three features are experimental and for expert use only.

btree fits a stump to one or more variables. Note that blackboost is more efficient for boosting stumps.

Note that the base-learners bns and bss are deprecated (and no longer available). Please use bbs instead, which results in qualitatively the same models but is computationally much more attractive.

References

Iain D. Currie, Maria Durban, and Paul H. C. Eilers (2006), Generalized linear array models with applications to multidimensional smoothing. Journal of the Royal Statistical Society, Series B--Statistical Methodology, 68(2), 259--280.

Paul H. C. Eilers (2005), Unimodal smoothing. Journal of Chemometrics, 19, 317--328.

Paul H. C. Eilers and Brian D. Marx (1996), Flexible smoothing with B-splines and penalties. Statistical Science, 11(2), 89-121.

Ludwig Fahrmeir, Thomas Kneib and Stefan Lang (2004), Penalized structured additive regression for space-time data: a Bayesian perspective. Statistica Sinica, 14, 731-761.

Jan Gertheiss and Gerhard Tutz (2009), Penalized regression with ordinal predictors, International Statistical Review, 77(3), 345--365.

Benjamin Hofner, Torsten Hothorn, Thomas Kneib, and Matthias Schmid (2011), A framework for unbiased model selection based on boosting. Journal of Computational and Graphical Statistics, in press. doi:10.1198/jcgs.2011.09220

Preliminary version: Department of Statistics, Technical Report Nr. 72, LMU Muenchen. http://epub.ub.uni-muenchen.de/11243/

B. Hofner, J. Mueller, and T. Hothorn (2011), Monotonicity-Constrained Species Distribution Models, Ecology, 92:1895-1901.

Thomas Kneib, Torsten Hothorn and Gerhard Tutz (2009), Variable selection and model choice in geoadditive regression models, Biometrics, 65(2), 626--634.

Examples

Run this code

set.seed(290875)

  n <- 100
  x1 <- rnorm(n)
  x2 <- rnorm(n) + 0.25 * x1
  x3 <- as.factor(sample(0:1, 100, replace = TRUE))
  x4 <- gl(4, 25)
  y <- 3 * sin(x1) + x2^2 + rnorm(n)
  weights <- drop(rmultinom(1, n, rep.int(1, n) / n))

  ### set up base-learners
  spline1 <- bbs(x1, knots = 20, df = 4)
  attributes(spline1)

  knots.x2 <- quantile(x2, c(0.25, 0.5, 0.75))
  spline2 <- bbs(x2, knots = knots.x2, df = 5)
  attributes(spline2)

  attributes(ols3 <- bols(x3))
  attributes(ols4 <- bols(x4))

  ### compute base-models
  drop(ols3$dpp(weights)$fit(y)$model) ## same as:
  coef(lm(y ~ x3, weights = weights))

  drop(ols4$dpp(weights)$fit(y)$model) ## same as:
  coef(lm(y ~ x4, weights = weights))

  ### fit model, component-wise
  mod1 <- mboost_fit(list(spline1, spline2, ols3, ols4), y, weights)

  ### more convenient formula interface
  mod2 <- mboost(y ~ bbs(x1, knots = 20, df = 4) +
                     bbs(x2, knots = knots.x2, df = 5) +
                     bols(x3) + bols(x4), weights = weights)
  all.equal(coef(mod1), coef(mod2))


  ### grouped linear effects
  # center x1 and x2 first
  x1 <- scale(x1, center = TRUE, scale = FALSE)
  x2 <- scale(x2, center = TRUE, scale = FALSE)
  model <- gamboost(y ~ bols(x1, x2, intercept = FALSE) +
                        bols(x1, intercept = FALSE) +
                        bols(x2, intercept = FALSE),
                        control = boost_control(mstop = 400))
  coef(model, which = 1)   # one base-learner for x1 and x2
  coef(model, which = 2:3) # two separate base-learners for x1 and x2

  ### example for bspatial
  x1 <- runif(250,-pi,pi)
  x2 <- runif(250,-pi,pi)

  y <- sin(x1) * sin(x2) + rnorm(250, sd = 0.4)

  spline3 <- bspatial(x1, x2, knots = 12)
  attributes(spline3)

  ## specify number of knots separately
  form2 <- y ~ bspatial(x1, x2, knots = list(x1 = 12, x2 = 12))

  ## decompose spatial effect into parametric part and
  ## deviation with one df
  form2 <- y ~ bols(x1) + bols(x2) + bols(x1*x2) +
               bspatial(x1, x2, knots = 12, center = TRUE, df = 1)


  ### random intercept
  id <- factor(rep(1:10, each = 5))
  raneff <- brandom(id)
  attributes(raneff)

  ## random intercept with non-observed category
  set.seed(1907)
  y <- rnorm(50, mean = rep(rnorm(10), each = 5), sd = 0.1)
  plot(y ~ id)
  # category 10 not observed
  obs <- c(rep(1, 45), rep(0, 5))
  model <- gamboost(y ~ brandom(id), weights = obs)
  coef(model)
  fitted(model)[46:50] # just the grand mean as usual for
                       # random effects models


  ### random slope
  z <- runif(50)
  raneff <- brandom(id, by = z)
  attributes(raneff)

  ### specify simple interaction model (with main effect)
  n <- 210
  x <- rnorm(n)
  X <- model.matrix(~ x)
  z <- gl(3, n/3)
  Z <- model.matrix(~z)
  beta <- list(c(0,1), c(-3,4), c(2, -4))
  y <- rnorm(length(x), mean = (X * Z[,1]) %*% beta[[1]] +
                               (X * Z[,2]) %*% beta[[2]] +
                               (X * Z[,3]) %*% beta[[3]])
  plot(y ~ x, col = z)
  ## specify main effect and interaction
  mod_glm <- gamboost(y ~ bols(x) + bols(x, by = z),
                  control = boost_control(mstop = 1000))
  nd <- data.frame(x, z)
  nd <- nd[order(x),]
  nd$pred_glm <- predict(mod_glm, newdata = nd)
  for (i in seq(along = levels(z)))
      with(nd[nd$z == i,], lines(x, pred_glm, col = z))
  mod_gam <- gamboost(y ~ bbs(x) + bbs(x, by = z),
                      control = boost_control(mstop = 1000))
  nd$pred_gam <- predict(mod_gam, newdata = nd)
  for (i in seq(along = levels(z)))
      with(nd[nd$z == i,], lines(x, pred_gam, col = z, lty = "dashed"))
  ### convenience function for plotting
  par(mfrow = c(1,3))
  plot(mod_gam)


  ### remove intercept from base-learner
  ### and add explicit intercept to the model
  tmpdata <- data.frame(x = 1:100, y = rnorm(1:100), int = rep(1, 100))
  mod <- gamboost(y ~ bols(int, intercept = FALSE) +
                      bols(x, intercept = FALSE),
                  data = tmpdata,
                  control = boost_control(mstop = 2500))
  cf <- unlist(coef(mod))
  cf[1] <- cf[1] + mod$offset
  cf
  coef(lm(y ~ x, data = tmpdata))

  ### quicker and better with (mean-) centering
  tmpdata$x_center <- tmpdata$x - mean(tmpdata$x)
  mod_center <- gamboost(y ~ bols(int, intercept = FALSE) +
                             bols(x_center, intercept = FALSE),
                         data = tmpdata,
                         control = boost_control(mstop = 500))
  cf_center <- unlist(coef(mod_center, which=1:2))
  ## due to the shift in x direction we need to subtract
  ## beta_1 * mean(x) to get the correct intercept
  cf_center[1] <- cf_center[1] + mod_center$offset -
                  cf_center[2] * mean(tmpdata$x)
  cf_center
  coef(lm(y ~ x, data = tmpdata))

  ### large data set with ties
  nunique <- 100
  xindex <- sample(1:nunique, 1000000, replace = TRUE)
  x <- runif(nunique)
  y <- rnorm(length(xindex))
  w <- rep.int(1, length(xindex))

  ### brute force computations
  op <- options()
  options(mboost_indexmin = Inf, mboost_useMatrix = FALSE)
  ## data pre-processing
  b1 <- bbs(x[xindex])$dpp(w)
  ## model fitting
  c1 <- b1$fit(y)$model
  options(op)

  ### automatic search for ties, faster
  b2 <- bbs(x[xindex])$dpp(w)
  c2 <- b2$fit(y)$model

  ### manual specification of ties, even faster
  b3 <- bbs(x, index = xindex)$dpp(w)
  c3 <- b3$fit(y)$model

  all.equal(c1, c2)
  all.equal(c1, c3)

  ### cyclic P-splines
  set.seed(781)
  x <- runif(200, 0,(2*pi))
  y <- rnorm(200, mean=sin(x), sd=0.2)
  newX <- seq(0,2*pi, length=100)
  ### model without cyclic constraints
  mod <- gamboost(y ~ bbs(x, knots = 20))
  ### model with cyclic constraints
  mod_cyclic <- gamboost(y ~ bbs(x, cyclic=TRUE, knots = 20,
                                 boundary.knots=c(0, 2*pi)))
  par(mfrow = c(1,2))
  plot(x,y, main="bbs (non-cyclic)", cex=0.5)
  lines(newX, sin(newX), lty="dotted")
  lines(newX + 2 * pi, sin(newX), lty="dashed")
  lines(newX, predict(mod, data.frame(x = newX)),
        col="red", lwd = 1.5)
  lines(newX + 2 * pi, predict(mod, data.frame(x = newX)),
        col="blue", lwd=1.5)
  plot(x,y, main="bbs (cyclic)", cex=0.5)
  lines(newX, sin(newX), lty="dotted")
  lines(newX + 2 * pi, sin(newX), lty="dashed")
  lines(newX, predict(mod_cyclic, data.frame(x = newX)),
        col="red", lwd = 1.5)
  lines(newX + 2 * pi, predict(mod_cyclic, data.frame(x = newX)),
        col="blue", lwd = 1.5)

  ### use buser() to mimic p-spline base-learner:
  set.seed(1907)
  x <- rnorm(100)
  y <- rnorm(100, mean = x^2, sd = 0.1)
  mod1 <- gamboost(y ~ bbs(x))
  ## now extract design and penalty matrix
  X <- extract(bbs(x), "design")
  K <- extract(bbs(x), "penalty")
  ## use X and K in buser()
  mod2 <- gamboost(y ~ buser(X, K))
  max(abs(predict(mod1) - predict(mod2)))  # same results

  ### use buser() to mimic penalized ordinal base-learner:
  z <- as.ordered(sample(1:3, 100, replace=TRUE))
  y <- rnorm(100, mean = as.numeric(z), sd = 0.1)
  X <- extract(bols(z))
  K <- extract(bols(z), "penalty")
  index <- extract(bols(z), "index")
  mod1 <- gamboost(y ~  buser(X, K, df = 1, index = index))
  mod2 <- gamboost(y ~  bols(z, df = 1))
  max(abs(predict(mod1) - predict(mod2)))  # same results

  ### kronecker product for matrix-valued responses
  data("volcano", package = "datasets")
  layout(matrix(1:2, ncol = 2))

  ## estimate mean of image treating image as matrix
  image(volcano, main = "data")
  x1 <- 1:nrow(volcano)
  x2 <- 1:ncol(volcano)

  vol <- as.vector(volcano)
  mod <- mboost(vol ~ bbs(x1, df = 3, knots = 10)%O%
                      bbs(x2, df = 3, knots = 10), 
                      control = boost_control(nu = 0.25))
  mod[250]

  volf <- matrix(fitted(mod), nrow = nrow(volcano))
  image(volf, main = "fitted")

  ## the old-fashioned way, a waste of space and time
  x <- expand.grid(x1, x2)
  modx <- mboost(vol ~ bbs(Var2, df = 3, knots = 10)%X%
                       bbs(Var1, df = 3, knots = 10), data = x,
                       control = boost_control(nu = 0.25))
  modx[250]

  max(abs(fitted(mod) - fitted(modx)))

Run the code above in your browser using DataLab