
vcrpart (version 0.2-2)

tvcglm: Coefficient-wise tree-based varying coefficient regression based on generalized linear models

Description

The tvcglm function implements the tree-based varying coefficient regression algorithm for generalized linear models introduced by Buergin and Ritschard (2014b). The algorithm approximates varying coefficients by piecewise constant functions using recursive partitioning, i.e., it estimates the coefficients of the model separately for strata of the value space of the partitioning variables. The special feature of the algorithm is that it assigns each varying coefficient its own partition, which enhances the possibilities for model specification and allows moderator variables to be selected individually for each coefficient.

Usage

tvcglm(formula, data, family, 
       weights, subset, offset, na.action, 
       control = tvcglm_control(), ...)

tvcglm_control(minsize = 30, mindev = 2.0, maxnomsplit = 5,
               maxordsplit = 9, maxnumsplit = 9, cv = TRUE,
               folds = folds_control("kfold", 5), prune = cv,
               center = TRUE, ...)

Arguments

formula
a symbolic description of the model to fit, e.g.,

y ~ vc(z1, ..., zL, by = x1 + ... + xP), where the vc terms specify the varying fixed coefficients. Multiple vc terms may be specified, one partition per varying coefficient. For details, see vc.

family
the model family. An object of class family.
data
a data frame containing the variables in the model.
weights
an optional numeric vector of weights to be used in the fitting process.
subset
an optional logical or integer vector specifying a subset of 'data' to be used in the fitting process.
offset
this can be used to specify an a priori known component to be included in the linear predictor during fitting.
na.action
a function that indicates what should happen if data contain NAs. See na.action.
control
a list with control parameters as returned by tvcglm_control.
minsize
numeric (vector). The minimum sum of weights in terminal nodes.
mindev
numeric scalar. The minimum permitted training error reduction a split must exhibit to be considered. The main role of this parameter is to save computing time by early stopping. It may be set lower for applications with very few partitioning variables.
maxnomsplit, maxordsplit, maxnumsplit
integer scalars for split candidate reduction. See tvcm_control.
cv
logical scalar. Whether or not the cp parameter should be cross-validated. If TRUE, cvloss is called.
folds
a list of parameters for creating the cross-validation folds, as produced by folds_control.
prune
logical scalar. Whether or not the initial tree should be pruned by the estimated cp parameter from cross-validation. Cannot be TRUE if cv = FALSE.
center
logical scalar. Whether the predictor variables of the update models should be centered during the grid search. Note that TRUE will not modify the predictors of the final fitted model.
...
additional arguments passed to the fitting function fit or to tvcm_control.
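To sketch how the stopping parameters fit together, the hypothetical call below tightens them so that the partitioning stage halts earlier, giving coarser partitions at lower computational cost. The chosen values (minsize = 50, mindev = 5) are illustrative, not recommendations, and the block guards on vcrpart being installed:

```r
## A minimal sketch, assuming only that the vcrpart package is installed:
## tighten the stopping rules of the partitioning stage. The values
## minsize = 50 and mindev = 5 are illustrative, not recommendations.
has_vcrpart <- requireNamespace("vcrpart", quietly = TRUE)
if (has_vcrpart) {
  ctrl <- vcrpart::tvcglm_control(minsize = 50, mindev = 5,
                                  cv = FALSE, prune = FALSE)
  ## 'ctrl' can then be passed as the 'control' argument of tvcglm()
}
```

Note that prune = FALSE must accompany cv = FALSE, since pruning relies on the cross-validated cp parameter.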

Value

  • An object of class tvcm

Details

The TVCGLM algorithm proceeds in two stages. The first stage (partitioning) grows overly fine partitions and the second stage (pruning) selects the best-sized partitions by collapsing inner nodes. For the second stage, which is processed automatically, we refer to tvcm-assessment. The partitioning stage iterates the following steps:

  1. Fit the current generalized linear model y ~ NodeA:x1 + ... + NodeK:xK with glm, where NodeK is a categorical variable holding the terminal node labels 1, ... for the K-th varying coefficient.
  2. Search the globally optimal split among the candidate splits by an exhaustive -2 likelihood training error grid search, cycling through all partitions, nodes and moderator variables.
  3. If the -2 likelihood training error reduction of the best split is smaller than mindev, or no candidate split satisfies the minimum node size minsize, stop the algorithm.
  4. Otherwise, incorporate the best split and repeat the procedure.

The partitioning stage selects, in each iteration, the split that maximizes the -2 likelihood training error reduction, compared to the current model. The default stopping parameters are minsize = 30 (a minimum node size of 30) and mindev = 2 (the training error reduction of the best split must be larger than two to continue).
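The split evaluation in steps 2 and 3 can be mimicked with base R alone: the -2 likelihood training error reduction of a single candidate binary split equals the deviance difference between the model with a constant coefficient and the model whose coefficient varies over the two child nodes. The data and variable names below are simulated and hypothetical, not part of vcrpart:

```r
## Base-R sketch of one candidate split's -2 log-likelihood training
## error reduction; simulated data, hypothetical variable names.
set.seed(1)
n <- 200
z <- factor(sample(c("A", "B"), n, replace = TRUE))  # candidate partitioning variable
x <- rnorm(n)                                        # predictor with a varying coefficient
eta <- ifelse(z == "A", 0.5, 1.5) * x                # true effect of x differs by stratum
y <- rbinom(n, 1, plogis(eta))

fit0 <- glm(y ~ x, family = binomial())    # current model: constant coefficient
fit1 <- glm(y ~ x:z, family = binomial())  # after the split: one coefficient per node
dev_red <- deviance(fit0) - deviance(fit1) # -2 logLik training error reduction
dev_red > 2  # with the default mindev = 2, accept the split only if this is TRUE
```

Since fit0 is nested in fit1, the reduction is always nonnegative; the algorithm computes such reductions for every candidate split and keeps the largest.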

The algorithm can be seen as an extension of CART (Breiman et al., 1984) and PartReg (Wang and Hastie, 2014), with the new feature that partitioning can be processed coefficient-wise.

References

Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth.

Wang, J. C. and Hastie, T. (2014). Boosted Varying-Coefficient Regression Models for Product Demand Prediction. Journal of Computational and Graphical Statistics, 23, 361--382.

Buergin, R. and Ritschard, G. (2014b). Coefficient-wise tree-based varying coefficient regression with vcrpart. Article in progress.

See Also

tvcm_control, tvcm-methods, tvcm-plot, tvcm-assessment, glm

Examples

## ------------------------------------------------------------------- #  
## Example 1: Moderated effect of education on poverty
##
## The algorithm is used to find out whether the effect of high
## education 'EduHigh' on poverty 'Poor' is moderated by the civil
## status 'CivStat'. We specify two 'vc' terms in the logistic
## regression model for 'Poor': a first that accounts for the direct
## effect of 'CivStat' and a second that accounts for the moderation of
## 'CivStat' on the relation between 'EduHigh' and 'Poor'. We use here
## the 2-stage procedure with a partitioning- and a pruning stage as
## described in Buergin and Ritschard (2014b). 
## ------------------------------------------------------------------- #

data(poverty)
poverty$EduHigh <- 1 * (poverty$Edu == "high")

## fit the model
model.Pov <-
  tvcglm(Poor ~ -1 +  vc(CivStat) + vc(CivStat, by = EduHigh) + NChild, 
         family = binomial(), data = poverty, subset = 1:200,
         control = tvcglm_control(verbose = TRUE, papply = lapply,
           folds = folds_control(K = 1, type = "subsampling", seed = 7)))

## diagnosis
plot(model.Pov, "cv")
plot(model.Pov, "coef")
summary(model.Pov)
splitpath(model.Pov, steps = 1:3)
prunepath(model.Pov, steps = 1)
