Base-learners for Gradient Boosting
Base-learners for fitting base-models in the generic implementation of
component-wise gradient boosting in function mboost_fit.
bols(..., by = NULL, index = NULL, intercept = TRUE, df = NULL,
     lambda = 0, contrasts.arg = "contr.treatment")
bbs(..., by = NULL, index = NULL, knots = 20, degree = 3,
    differences = 2, df = 4, lambda = NULL, center = FALSE)
bspatial(...)
brandom(..., df = 4)
btree(..., tree_controls = ctree_control(stump = TRUE, mincriterion = 0))
bl1 %+% bl2
bl1 %X% bl2
- ...: one or more predictor variables or one data frame of predictor variables.
- by: an optional variable defining varying coefficients, either a binary or numeric variable.
- index: a vector of integers for expanding the variables in .... For example, bols(x, index = index) is equal to bols(x[index]), where index is an integer of length greater or equal to length(x).
- df: trace of the hat matrix for the base-learner defining the base-learner complexity. Low values of df correspond to a large amount of smoothing and thus to "weaker" base-learners. Certain restrictions have to be kept for specific base-learners (see 'Details').
- lambda: smoothing penalty, computed from df when df is specified.
- knots: either the number of (equidistant) interior knots to be used for the regression spline fit or a vector including the positions of the interior knots. For multiple predictor variables, knots may be a named list of knots for each variable.
- degree: degree of the regression spline.
- differences: 1, 2, or 3. If differences = k, k-th-order differences are used as a penalty.
- intercept: if intercept = TRUE, an intercept is added to the design matrix of a linear base-learner.
- center: if center = TRUE, the corresponding effect is re-parameterized such that the unpenalized part of the fit is subtracted and only the deviation effect is fitted. The unpenalized, parametric part then has to be included in separate base-learners using bols (see 'Details').
- contrasts.arg: a character suitable for input to the contrasts replacement function.
- tree_controls: an object of class "TreeControl", which can be obtained using ctree_control. Defines hyper-parameters for the trees which are used as base-learners.
- bl1: a linear base-learner or a list of linear base-learners.
- bl2: a linear base-learner or a list of linear base-learners.
bols refers to linear base-learners (potentially estimated with a ridge
penalty), while bbs provides penalized regression splines. bspatial fits
bivariate surfaces and brandom defines random effects base-learners.
In combination with option by, these base-learners can be turned into varying
coefficient terms. The linear base-learners are fitted using Ridge Regression,
where the penalty parameter lambda is either computed from df (default for
bbs, bspatial and brandom) or specified directly
(lambda = 0 means no penalization and is the default for bols).
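
For illustration, a minimal sketch of the two ways to control the penalty
(the variable name x1 is purely illustrative):

## penalty chosen indirectly via degrees of freedom (lambda computed from df)
bbs(x1, df = 4)
## penalty specified directly; lambda = 0 means an unpenalized linear fit
bols(x1, lambda = 0)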
For bols(x), x may be a numeric vector or a factor. Alternatively,
x can be a data frame containing numeric or factor variables.
In this case, or when multiple predictor variables are specified, e.g.,
bols(x1, x2), the model is equivalent to
lm(y ~ ., data = x) or
lm(y ~ x1 + x2), respectively.
By default, an intercept term is added to the corresponding design matrix
(which can be omitted using intercept = FALSE). When df (or lambda) is
given, a ridge estimator with df degrees of freedom (trace of the hat matrix)
is used as base-learner. Note that all variables are treated as a group,
i.e., they enter the model together if the corresponding base-learner is selected.
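
A minimal sketch of such a grouped linear effect (variable names are
illustrative; see also the examples below):

## x1 and x2 form one base-learner and are selected (or not) together
bols(x1, x2)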
With bbs, the P-spline approach of Eilers and Marx (1996) is
used. P-splines use a squared k-th-order difference penalty,
which can be interpreted as an approximation of the integrated squared
k-th derivative of the spline.
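
For example, the defaults of bbs spelled out explicitly (a sketch with an
illustrative variable name):

## cubic regression spline with 20 interior knots and a
## second-order difference penalty, 4 degrees of freedom
bbs(x1, knots = 20, degree = 3, differences = 2, df = 4)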
bspatial implements bivariate tensor product P-splines for the
estimation of either spatial effects or interaction surfaces. Note that
bspatial(x, y) is equivalent to bbs(x, y); for
possible arguments and defaults see there.
The penalty term is constructed based on bivariate extensions of the
univariate penalties in x and y directions, see Kneib,
Hothorn and Tutz (2009) for details. Note that the dimensions of the
penalty matrix increase (quickly) with the number of knots, with a strong
impact on computational time. Thus, both should not be chosen too
large. Different knots for x and y can be specified
by a named list.
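
A sketch of direction-specific knots (variable names are illustrative):

## 12 interior knots for x1, 6 for x2
bspatial(x1, x2, knots = list(x1 = 12, x2 = 6))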
brandom(x) specifies a random effects base-learner based on a
factor variable x that defines the grouping structure of the
data set. For each level of x, a separate random intercept is
fitted, where the random effects variance is governed by the
specification of the degrees of freedom df.
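
A minimal sketch (the grouping factor id is illustrative; see also the
examples below):

## one random intercept per level of id, variance controlled via df
brandom(id, df = 4)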
For all linear base-learners the amount of smoothing is determined by the
trace of the hat matrix, as indicated by df. If df is specified in
bols, a ridge penalty with the according degrees of
freedom is used. For ordinal variables, a ridge penalty for the
differences of the adjacent categories (Gertheiss and Tutz 2009) is applied.
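
A minimal sketch of the ordinal case (the ordered factor dose is purely
illustrative):

## ordered factor: with df given, bols penalizes differences of
## adjacent category effects (Gertheiss and Tutz 2009)
dose <- ordered(sample(c("low", "mid", "high"), 100, replace = TRUE),
                levels = c("low", "mid", "high"))
bols(dose, df = 1)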
If by is specified as an additional argument, a
varying coefficients term is estimated, where by is the
interaction variable and the effect modifier is given by either
x or by x and y (specified via ...). If bbs is used, this corresponds to the
classical situation of varying coefficients, where the effect of
by varies over the co-domain of x. In case of bspatial as
base-learner, the effect of by varies with respect to both x and
y, i.e. an interaction surface between x and
y is specified as effect modifier. For
brandom, specification of by
leads to the estimation of random slopes for covariate
by with grouping structure
defined by factor x instead of a simple random intercept.
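
A few sketches of varying coefficient terms (all variable names are
illustrative):

## effect of z varies smoothly over the range of x1
bbs(x1, by = z)
## effect of z varies over the interaction surface of x1 and x2
bspatial(x1, x2, by = z)
## random slopes for z within the groups defined by factor id
brandom(id, by = z)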
For bbs and bspatial, option center requests that the
fitted effect is centered around its parametric, unpenalized part. For
example, with a second-order difference penalty, a linear effect of x
remains unpenalized by bbs and therefore the degrees of freedom for the base-learner
have to be larger than two. To avoid this restriction, option center = TRUE
subtracts the unpenalized linear effect from the fit, allowing to specify any
positive number as df. Note that in this case the linear effect of
x should generally be specified as an additional base-learner
bols(x). For bspatial and, for example, second-order
differences, a linear effect of x (bols(x)), a linear effect of y
(bols(y)), and their interaction (bols(x*y)) are
subtracted from the effect and have to be added separately to the model
equation (see example below). More details on centering can be found in Kneib, Hothorn and Tutz
(2009) and Fahrmeir, Kneib and Lang (2004).
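
A minimal univariate sketch of this decomposition (names illustrative; a
bivariate version appears in the examples below):

## unpenalized linear part as bols(x1), penalized deviation as a
## centered spline with one degree of freedom
y ~ bols(x1) + bbs(x1, center = TRUE, df = 1)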
For a categorical covariate with non-observed categories, bols(x) and
brandom(x) both assign a zero effect to
these categories. However, the non-observed categories must be
contained in levels(x). Thus, predictions are possible
for new observations if they correspond to such a category.
By default, all linear base-learners include an intercept term (which can
be removed using intercept = FALSE for bols, or
center = TRUE for bbs). In this case, an explicit global
intercept term should be added to the model via bols (see example below).
Three global options affect the base-learners:
option("mboost_useMatrix"), defaulting to TRUE, indicates that the base-learner may use
sparse matrix techniques for its computations. This reduces the memory
consumption but might (for smaller sample sizes) require more computing
time. option("mboost_indexmin") is an integer giving the sample
size required before model fitting is optimized by taking ties into account.
option("mboost_dftraceS"), which is also TRUE by default,
indicates that the trace of the smoother matrix is used as degrees
of freedom. If FALSE, an alternative definition is used (see
Hofner et al., 2009).
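
As a sketch, these options can be set as follows (the values are purely
illustrative, not recommendations):

options(mboost_useMatrix = TRUE,   # allow sparse matrix computations
        mboost_indexmin = 10000,   # exploit ties only for n >= 10000
        mboost_dftraceS = TRUE)    # df = trace of the smoother matrix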
Two or more linear base-learners can be joined using %+%. A tensor product
of two or more linear base-learners is returned by %X%.
These two features are experimental and for expert use only.
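
Minimal sketches (variable names illustrative):

## x1 and x2 joined into one base-learner, selected together
bols(x1) %+% bols(x2)
## tensor product of two univariate spline base-learners
bbs(x1) %X% bbs(x2)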
btree fits a stump to one or more variables. Note that
blackboost is more efficient for boosting stumps.
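
A minimal sketch (assuming the party package, which provides ctree_control,
is available; the variable name is illustrative):

library("party")
## single-split tree (stump) on x1 as base-learner
btree(x1, tree_controls = ctree_control(stump = TRUE, mincriterion = 0))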
An object of class bl (base-learner) with a dpp function is returned.
The call of dpp returns an object of class bm (base-model).
Paul H. C. Eilers and Brian D. Marx (1996), Flexible smoothing with B-splines
and penalties. Statistical Science, 11(2), 89-121.
Ludwig Fahrmeir, Thomas Kneib and Stefan Lang (2004), Penalized structured
additive regression for space-time data: a Bayesian perspective.
Statistica Sinica, 14, 731-761.
Jan Gertheiss and Gerhard Tutz (2009), Penalized regression with ordinal
predictors. International Statistical Review, 77(3), 345-365.
Benjamin Hofner, Torsten Hothorn, Thomas Kneib, and Matthias Schmid (2009),
A framework for unbiased model selection based on boosting.
Technical Report Nr. 72, Institut fuer Statistik, LMU Muenchen.
Thomas Kneib, Torsten Hothorn and Gerhard Tutz (2009), Variable selection
and model choice in geoadditive regression models.
Biometrics, 65(2), 626-634.
set.seed(290875)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n) + 0.25 * x1
x3 <- as.factor(sample(0:1, 100, replace = TRUE))
x4 <- gl(4, 25)
y <- 3 * sin(x1) + x2^2 + rnorm(n)
weights <- drop(rmultinom(1, n, rep.int(1, n) / n))

### set up base-learners
spline1 <- bbs(x1, knots = 20, df = 4)
attributes(spline1)

knots.x2 <- quantile(x2, c(0.25, 0.5, 0.75))
spline2 <- bbs(x2, knots = knots.x2, df = 5)
attributes(spline2)

attributes(ols3 <- bols(x3))
attributes(ols4 <- bols(x4))

### compute base-models
drop(ols3$dpp(weights)$fit(y)$model)  ## same as:
coef(lm(y ~ x3, weights = weights))

drop(ols4$dpp(weights)$fit(y)$model)  ## same as:
coef(lm(y ~ x4, weights = weights))

### fit model, component-wise
mod1 <- mboost_fit(list(spline1, spline2, ols3, ols4), y, weights)

### more convenient formula interface
mod2 <- mboost(y ~ bbs(x1, knots = 20, df = 4) +
                   bbs(x2, knots = knots.x2, df = 5) +
                   bols(x3) + bols(x4))
all.equal(coef(mod1), coef(mod2))

### grouped linear effects
model <- gamboost(y ~ bols(x1, x2, intercept = FALSE) +
                      bols(x1, intercept = FALSE) +
                      bols(x2, intercept = FALSE),
                  control = boost_control(mstop = 400))
coef(model, which = 1)    # one base-learner for x1 and x2
coef(model, which = 2:3)  # two separate base-learners for x1 and x2

### example for bspatial
x1 <- runif(250, -pi, pi)
x2 <- runif(250, -pi, pi)
y <- sin(x1) * sin(x2) + rnorm(250, sd = 0.4)

spline3 <- bspatial(x1, x2, knots = 12)
attributes(spline3)

## specify number of knots separately
form2 <- y ~ bspatial(x1, x2, knots = list(x1 = 12, x2 = 12))

## decompose spatial effect into parametric part and
## deviation with one df
form2 <- y ~ bols(x1) + bols(x2) + bols(x1*x2) +
             bspatial(x1, x2, knots = 12, center = TRUE, df = 1)

### random intercept
id <- factor(rep(1:10, each = 5))
raneff <- brandom(id)
attributes(raneff)

## random intercept with non-observed category
set.seed(1907)
y <- rnorm(50, mean = rep(rnorm(10), each = 5), sd = 0.1)
plot(y ~ id)  # category 10 not observed
obs <- c(rep(1, 45), rep(0, 5))
model <- gamboost(y ~ brandom(id), weights = obs)
coef(model)
fitted(model)[46:50]  # just the grand mean as usual for
                      # random effects models

### random slope
z <- runif(50)
raneff <- brandom(id, by = z)
attributes(raneff)

### remove intercept from base-learner
### and add explicit intercept to the model
tmpdata <- data.frame(x = 1:100, y = rnorm(1:100), int = rep(1, 100))
mod <- gamboost(y ~ bols(int, intercept = FALSE) +
                    bols(x, intercept = FALSE),
                data = tmpdata,
                control = boost_control(mstop = 2500))
cf <- unlist(coef(mod))
cf <- cf + mod$offset
cf
coef(lm(y ~ x, data = tmpdata))

### large data set with ties
nunique <- 100
xindex <- sample(1:nunique, 1000000, replace = TRUE)
x <- runif(nunique)
y <- rnorm(length(xindex))
w <- rep.int(1, length(xindex))

### brute force computations
op <- options()
options(mboost_indexmin = Inf, mboost_useMatrix = FALSE)
## data pre-processing
b1 <- bbs(x[xindex])$dpp(w)
## model fitting
c1 <- b1$fit(y)$model
options(op)

### automatic search for ties, faster
b2 <- bbs(x[xindex])$dpp(w)
c2 <- b2$fit(y)$model

### manual specification of ties, even faster
b3 <- bbs(x, index = xindex)$dpp(w)
c3 <- b3$fit(y)$model

all.equal(c1, c2)
all.equal(c1, c3)