
vcrpart (version 0.2-1)

tvcm-assessment: Model assessment and model selection for tvcm objects.

Description

Out-of-bag loss, cross-validation and pruning for tvcm objects.

Usage

folds_control(type = c("kfold", "subsampling", "bootstrap"),
      K = ifelse(type == "kfold", 5, 30),
      prob = 0.5, weights = c("case", "freq"),
      seed = NULL)

## S3 method for class 'tvcm'
cvloss(object, folds = folds_control(), fun = NULL, dfpar = NULL,
       direction = c("backward", "forward"), papply = mclapply,
       verbose = FALSE, ...)

## S3 method for class 'cvloss.tvcm'
print(x, ...)

## S3 method for class 'cvloss.tvcm'
plot(x, legend = TRUE, details = TRUE, ...)

## S3 method for class 'tvcm'
oobloss(object, newdata = NULL, weights = NULL, fun = NULL, ...)

## S3 method for class 'tvcm'
prune(tree, dfsplit = NULL, dfpar = NULL,
      direction = c("backward", "forward"), alpha = NULL,
      maxstep = NULL, terminal = NULL, papply = mclapply,
      keeploss = FALSE, original, ...)

Arguments

object, tree
an object of class tvcm.
x
an object of class cvloss.tvcm as produced by cvloss.
type
character string. The type of sampling scheme used to divide the data of the input model into a learning set and a validation set.
K
integer scalar. The number of folds.
prob
numeric between 0 and 1. The probability for the "subsampling" cross-validation scheme.
weights
for folds_control, a character string that defines whether the weights of object are case weights or frequencies of cases; for oobloss, a numeric vector of weights corresponding to the rows of newdata.
seed
a numeric scalar that defines the seed.
folds
a list with control arguments as produced by folds_control.
fun
the loss function for the validation sets. By default, the (possibly weighted) mean of the deviance residuals as defined by the family of the fitted object is applied.
dfpar
a numeric scalar larger than zero. The per-parameter penalty to be applied. If the -2 log-likelihood prediction error is used, this value is typically set to 2. If NULL, the value of dfpar from the partitioning stage is used.
direction
either "backward" (the default) or "forward". Indicates the pruning algorithm to be used. "backward" applies backward pruning where in each iteration the inner node that produces the smallest per-node
papply
(parallel) apply function, defaults to mclapply. To run cvloss sequentially (i.e. not in parallel), use lapply. Special arguments may be passed to the papply function via the ... argument.
newdata
a data.frame of out-of-bag data (including the response variable). See also predict.tvcm.
verbose
logical scalar. If TRUE verbose output is generated during the validation.
legend
logical scalar. Whether a legend should be added.
details
logical scalar. Whether the foldwise validation errors and the in-sample prediction error should be shown.
dfsplit
numeric scalar. The per-split cost dfsplit with which the partitions are to be cross-validated. If no dfsplit is specified (default), the parameter is ignored for pruning.
alpha
numeric significance level. Represents the stopping parameter for tvcm objects grown with sctest = TRUE, see tvcm_control.
maxstep
integer. The maximum number of steps of the algorithm.
terminal
a list of integer vectors with the ids of the nodes whose subnodes should be merged.
keeploss
logical scalar or numeric. Whether, and for how many iterations, the statistics should be reused in following iterations. Specifically, the option activates approximating the AIC reduction per split based on the AIC reduction of the last iteration, adjusted by the AIC reduction attained by the collapsed node.
original
logical scalar. Whether pruning should be based on the trees from partitioning rather than on the current trees.
...
other arguments to be passed.

Value

folds_control returns a list of parameters for building a cross-validation scheme. cvloss returns a cvloss.tvcm object with the following essential components:

grid
a list with two matrices, dfsplit and nsplit. Specifies the grid of values at which the cross-validated loss was evaluated.
loss
a list with two matrices, dfsplit and nsplit. The cross-validated loss of each fold corresponding to the values in grid.
dfsplit.min
numeric scalar. The tuning parameter which minimizes the cross-validated loss.
folds
the folds used to extract the learning and the validation sets.

Further, oobloss returns the loss on the newdata validation set, and prune returns a (possibly modified) tvcm object.
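For instance, the essential components can be accessed by standard list extraction (a minimal sketch, assuming cv holds the return value of cvloss):

cv$grid         # grid of evaluated tuning parameter values
cv$dfsplit.min  # loss-minimizing penalty, e.g., to be passed to prune()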

Details

As described in the help of tvcm, TVCM (combined with sctest = FALSE) is a two-stage procedure that first grows overly fine partitions and then selects the best-sized partitions by pruning. Both stages can be carried out with a single call to tvcm, and several parameters can be specified via tvcm_control. The functions presented here may be of interest to advanced users who want to carry out the two stages with separate calls.

The prune method collapses inner nodes of the overly large tree fitted with tvcm according to the tuning parameter dfsplit, so as to minimize the estimated in-sample prediction error. The in-sample prediction error is, in what follows, defined as the mean of the in-sample loss plus dfpar times the number of coefficients plus dfsplit times the number of splits. In the common likelihood setting, the loss equals -2 times the maximized log-likelihood and dfpar = 2. The per-split penalty dfsplit is generally unknown and is estimated by cross-validation.
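As a concrete illustration, the criterion that prune minimizes can be written as the following function (a minimal sketch, not the internal vcrpart implementation; the inputs loss, npar and nsplit are assumed to be the mean in-sample loss, the number of coefficients and the number of splits):

## hypothetical helper illustrating the penalized criterion
inSamplePredError <- function(loss, npar, nsplit, dfpar = 2, dfsplit = 0) {
  loss + dfpar * npar + dfsplit * nsplit
}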

folds_control and cvloss allow for estimating dfsplit by cross-validation. The function folds_control is used to specify the cross-validation scheme; a random 5-fold cross-validation scheme is the default. Alternatives are type = "subsampling" (random draws without replacement) and type = "bootstrap" (random draws with replacement). For 2-stage models (with random effects) fitted by olmm, the subsets are based on subject-wise (i.e., first-stage) sampling. For models where the weights represent frequencies of observation units with exactly the same values in all variables (e.g., data from contingency tables), the option weights = "freq" should be used.
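For example, the three sampling schemes could be specified as follows (illustrative calls, using only the arguments documented above):

folds_control(type = "kfold", K = 5)                         # 5-fold cross-validation (default)
folds_control(type = "subsampling", K = 30, prob = 0.5)      # random draws without replacement
folds_control(type = "bootstrap", K = 30, weights = "freq")  # draws with replacement, frequency weights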

cvloss repeatedly fits tvcm objects based on the internally created folds and evaluates the mean out-of-bag loss of the model at different levels of the tuning parameter dfsplit. Out-of-bag loss refers here to the prediction error based on a loss function, which is typically the -2 log-likelihood error (see the details for oobloss below). Commonly, dfsplit is used for backward pruning (direction = "backward"), but it is also possible to cross-validate dfsplit for premature stopping (direction = "forward", see the argument dfsplit in tvcm_control). cvloss returns an object for which print and plot generics are provided. The proposed estimate for dfsplit is the one that minimizes the validated loss; it can be extracted from the component dfsplit.min.

oobloss can be used for estimating the out-of-bag prediction error on out-of-bag data (the newdata argument). By default, the loss is defined as the sum of the deviance residuals; see the return value dev.resids of family resp. family.olmm. Alternatively, the loss function can be defined manually via the argument fun; see the examples below. In general, the sum of deviance residuals equals the -2 log-likelihood. A special case is the gaussian family, where the deviance residuals are computed as $\sum_{i=1}^N w_i (y_i - \mu_i)^2$, that is, the deviance residuals ignore the term $\log(2\pi\sigma^2)$. Therefore, the sum of deviance residuals for the gaussian model (and possibly others) is not exactly the -2 log-likelihood prediction error, but shifted by a constant. Models with random effects are another special case: for models based on olmm, the deviance residuals are based on the marginal predictions (where the random effects are integrated out).
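The constant shift in the gaussian case can be checked numerically (a minimal sketch with unit weights and a known sigma, independent of vcrpart):

set.seed(1)
y <- rnorm(10); mu <- rep(0, 10); sigma <- 1
devres <- sum((y - mu)^2)                                  # sum of deviance residuals
m2ll <- sum((y - mu)^2 / sigma^2 + log(2 * pi * sigma^2))  # -2 log-likelihood
m2ll - devres                                              # length(y) * log(2 * pi * sigma^2)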

The prune function is used to select a nested model of the current model, i.e., a model which collapses leaves into their inner nodes, based on the estimated prediction error. The estimated prediction error is defined as the AIC of the model plus dfsplit times the number of splits. Pruning with direction = "backward" works as follows: in each iteration, all nested models of the current model are evaluated, i.e., models which collapse one of the inner nodes of the current model. The inner node that yields the smallest increase in the estimated prediction error is collapsed, and the resulting model substitutes the current model. The algorithm stops as soon as all nested models have a higher estimated prediction error than the current model, which is then returned.
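The iteration can be sketched as follows (schematic pseudocode only; innerNodes, collapseNode and predError are hypothetical helpers that stand in for the internal vcrpart machinery):

backwardPrune <- function(tree, dfsplit) {
  repeat {
    ## evaluate all nested models that collapse one inner node
    cand <- lapply(innerNodes(tree), function(id) collapseNode(tree, id))
    errs <- vapply(cand, predError, numeric(1), dfsplit = dfsplit)
    ## stop when no nested model improves the estimated prediction error
    if (length(cand) == 0L || min(errs) >= predError(tree, dfsplit = dfsplit))
      break
    tree <- cand[[which.min(errs)]]
  }
  tree
}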

References

Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth.

Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning. Springer.

See Also

tvcm

Examples

## --------------------------------------------------------- #
## Dummy Example 1:
##
## Model selection for the 'vcrpart_2' data. The example is
## merely a syntax template.
## --------------------------------------------------------- #

## load the data
data(vcrpart_2)

## fit the model
control <- tvcm_control(maxstep = 2L, minsize = 5L, cv = FALSE)
model <- tvcglm(y ~ vc(z1, z2, by = x1) + vc(z1, by = x2),
                data = vcrpart_2, family = gaussian(),
                control = control, subset = 1:75)

## cross-validate 'dfsplit'
cv <- cvloss(model, folds = folds_control(type = "kfold", K = 2, seed = 1))
cv
plot(cv)

## out-of-bag error
oobloss(model, newdata = vcrpart_2[76:100,])

## use an alternative loss function (sum of absolute errors; the 'wt'
## argument is part of the expected signature but unused here)
rfun <- function(y, mu, wt) sum(abs(y - mu))
oobloss(model, newdata = vcrpart_2[76:100,], fun = rfun)
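
## prune the model at the cross-validated penalty (illustrative;
## 'dfsplit.min' is the component described in the Value section)
pruned <- prune(model, dfsplit = cv$dfsplit.min)
pruned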
