## S3 method for class 'tvcm':
prune(tree, cp = NULL, alpha = NULL, maxstep = NULL,
      terminal = NULL, original = FALSE, ...)

folds_control(type = c("kfold", "subsampling", "bootstrap"),
              K = ifelse(type == "kfold", 5, 30),
              prob = 0.5, weights = c("case", "freq"),
              seed = NULL)
## S3 method for class 'tvcm':
cvloss(object, folds = folds_control(), ...)
## S3 method for class 'cvloss.tvcm':
print(x, ...)
## S3 method for class 'cvloss.tvcm':
plot(x, legend = TRUE, details = TRUE, ...)
## S3 method for class 'tvcm':
oobloss(object, newdata = NULL, weights = NULL,
fun = NULL, ...)
Arguments:
tree, object - a fitted model of class tvcm.
x - an object of class cvloss.tvcm as produced by cvloss.
prob - the probability for the "subsampling" cross-validation scheme.
weights - for folds_control, whether the weights of object are case
weights or frequencies of cases; for oobloss, an optional numeric
vector of weights corresponding to the rows of newdata.
alpha - a significance level, used for models fitted with
sctest = TRUE, see tvcm_control.

Value: cvloss returns an object of class cvloss.tvcm with at least a
component grid, the cp grid for each fold.

Details: prune collapses inner nodes of the tree fitted by tvcm
according to the tuning parameter cp. The
aim of pruning by cp is to collapse inner nodes to minimize the
cost-complexity criterion
$$error(cp) = error(tree) + cp * complexity(tree)$$
where the training error $error(tree)$ is defined by lossfun
and $complexity(tree)$ is defined as the total number of coefficients times
dfpar plus the total number of splits times dfsplit. The function
lossfun and the parameters dfpar and dfsplit are defined
by the control argument of tvcm, see also tvcm_control. The
minimization of $error(cp)$ is implemented by the following iterative
backward algorithm:

1. Fit all subtree models that collapse one inner node of the current
tree model.
2. Compute the per-complexity increase in the training error,
$$dev = \frac{error(subtree) - error(tree)}{complexity(tree) - complexity(subtree)}$$
for all fitted subtree models.
3. If any $dev < cp$, set the tree model to the subtree that minimizes
$dev$ and repeat steps 1 to 3; otherwise stop.

The penalty cp is generally unknown and is estimated adaptively from
the data. The function cvloss estimates cp by cross-validation. For
each fold, it fits a new model with tvcm on the training data and
prunes it for increasing values of cp, computing for each cp the
average validation error.
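The backward pruning step described above can be sketched in base R. The error and complexity values below are made up for illustration; this is not the vcrpart implementation.

```r
## One backward pruning step (toy values; not the vcrpart internals).
## Current tree: error 100, complexity 4. Collapsing each of its three
## inner nodes yields the following candidate subtree fits:
sub_error      <- c(101, 108, 103)   # error(subtree)
sub_complexity <- c(3, 3, 2)         # complexity(subtree)
tree_error      <- 100
tree_complexity <- 4
cp <- 2.5

## per-complexity increase in the training error for each collapse
dev <- (sub_error - tree_error) / (tree_complexity - sub_complexity)
dev  # 1.0 8.0 1.5

## collapse the candidate with the smallest dev if it falls below cp
if (min(dev) < cp) {
  best <- which.min(dev)  # here: candidate 1
}
```

With these numbers the first collapse costs only 1 unit of error per unit of complexity saved, so it is performed and the loop would continue with the reduced tree.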
Doing so yields for each fold a sequence of values for cp and a
sequence of average validation errors. The obtained sequences for cp
are combined into a fine grid, and the validation errors are averaged
correspondingly. From these two sequences we choose the cp that
minimizes the validation error. Notice that the average validation error
is computed as the total prediction error of the validation set divided
by the sum of the validation set weights. See also the argument
ooblossfun in tvcm_control and the function oobloss.
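The grid combination and the choice of cp can be illustrated with a base-R toy sketch; the fold sequences below are made-up numbers, and the step-function interpolation via approx is only one plausible way to place each fold's errors on the common grid, not the vcrpart internals.

```r
## Two folds, each with its own cp sequence and validation errors
cp1 <- c(0.1, 0.3, 0.6); err1 <- c(10, 8, 9)
cp2 <- c(0.2, 0.4, 0.6); err2 <- c(11, 7, 8)

## combine the cp sequences into one fine grid
grid <- sort(unique(c(cp1, cp2)))

## carry each fold's errors onto the grid as a left-continuous step
## function, extending the boundary values (rule = 2)
f1 <- approx(cp1, err1, xout = grid, method = "constant", rule = 2)$y
f2 <- approx(cp2, err2, xout = grid, method = "constant", rule = 2)$y

## average across folds and pick the cp minimizing the validation error
avg <- (f1 + f2) / 2
grid[which.min(avg)]  # 0.4
```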
The function folds_control is used to specify the cross-validation
scheme; a random 5-fold scheme is used by default. The alternatives are
type = "subsampling" (random draws without replacement) and
type = "bootstrap" (random draws with replacement). For 2-stage models
(with random effects) fitted by olmm, the subsets are sampled
subject-wise, i.e., at the first stage. For models where the weights
represent frequencies of cases, the option weights = "freq" should be
considered. cvloss returns an object of class cvloss.tvcm, for which a
print and a plot generic are provided.
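The two resampling types differ only in whether indices are drawn with replacement; the following base-R sketch illustrates the idea and is not the folds_control internals.

```r
n <- 100; prob <- 0.5
set.seed(1)

## type = "subsampling": draws without replacement, about n * prob cases
sub <- sample(n, size = floor(n * prob), replace = FALSE)

## type = "bootstrap": n draws with replacement
boot <- sample(n, size = n, replace = TRUE)

anyDuplicated(sub) == 0   # TRUE: no case appears twice
length(unique(boot)) < n  # TRUE with high probability: repeats occur
```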
The function oobloss can be used to estimate the total prediction error
for validation data (the newdata argument). By default, the loss is
defined as the sum of deviance residuals, see the return value
dev.resids of family resp. family.olmm. Alternatively, the loss
function can be defined manually via the argument fun, see the
examples below. In general, the sum of deviance residuals is equal to
the sum of
the -2 log-likelihood errors. A special case is the gaussian family, where
the deviance residuals are computed as $\sum_{i=1}^N w_i (y_i-\mu_i)^2$,
that is, the deviance residuals ignore the term $\log(2\pi\sigma^2)$.
Therefore, the sum of deviance residuals for the gaussian model (and
possibly others) is not exactly the sum of -2 log-likelihood prediction
errors (but shifted by a constant). Another special case concerns
models with random effects: for models based on olmm, the deviance
residuals are based on the marginal predictions, where the random
effects are integrated out.
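For the gaussian family, the identity above can be checked directly with the family's dev.resids function, using only base R (the y, mu, and wt values are arbitrary):

```r
y  <- c(1.0, 2.0, 3.5)
mu <- c(1.2, 1.8, 3.0)
wt <- c(1, 1, 2)

## gaussian()$dev.resids computes wt * (y - mu)^2 per observation
dr <- gaussian()$dev.resids(y, mu, wt)
all.equal(sum(dr), sum(wt * (y - mu)^2))  # TRUE
```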
T. Hastie, R. Tibshirani and J. Friedman (2001). The Elements of
Statistical Learning. New York: Springer.
## --------------------------------------------------------- #
## Dummy Example 1:
##
## Model selection for the 'vcrpart_2' data. The example is
## merely a syntax template.
## --------------------------------------------------------- #
## load the data
data(vcrpart_2)
## fit the model
control <- tvcm_control(maxstep = 2L, minsize = 5L, cv = FALSE)
model <- tvcglm(y ~ vc(z1, z2, by = x1) + vc(z1, by = x2),
data = vcrpart_2, family = gaussian(),
control = control, subset = 1:75)
## cross-validate 'dfsplit'
cv <- cvloss(model, folds = folds_control(type = "kfold", K = 2, seed = 1))
cv
plot(cv)
## out-of-bag error
oobloss(model, newdata = vcrpart_2[76:100,])
## use an alternative loss function
rfun <- function(y, mu, wt) sum(abs(y - mu))
oobloss(model, newdata = vcrpart_2[76:100,], fun = rfun)