validate.rpart: Dxy and Mean Squared Error by Cross-validating a Tree Sequence

Description

Uses xval-fold cross-validation of a sequence of trees to derive estimates of the mean squared error and Somers' Dxy rank correlation between predicted and observed responses. In the case of a binary response variable, the mean squared error is the Brier accuracy score. For survival trees, Dxy is negated so that larger is better. There are print and plot methods for objects created by validate.rpart.
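The two accuracy measures named above can be sketched in base R. This is an illustration only, not the code validate.rpart runs internally; the helper names mse and somers_dxy and the four data points are made up for the example.

```r
# Hypothetical helpers for illustration; they are not part of rms or rpart.

# Mean squared error of prediction. With a 0/1 response and predicted
# probabilities this is the Brier score.
mse <- function(pred, obs) mean((pred - obs)^2)

# Somers' Dxy: (concordant - discordant) pairs divided by the number of
# pairs whose observed responses differ. Pairs tied on the prediction
# contribute zero to the numerator.
somers_dxy <- function(pred, obs) {
  dp <- sign(outer(pred, pred, "-"))   # pairwise ordering of predictions
  dy <- sign(outer(obs,  obs,  "-"))   # pairwise ordering of observations
  sum(dp * dy) / sum(dy != 0)
}

obs  <- c(0, 0, 1, 1)             # made-up binary responses
pred <- c(0.1, 0.4, 0.35, 0.8)    # made-up predicted probabilities

mse(pred, obs)          # 0.158125
somers_dxy(pred, obs)   # 0.5 (3 concordant, 1 discordant, 4 usable pairs)
```

Both sums run over ordered pairs, so each unordered pair is counted twice in the numerator and in the denominator and the ratio is unaffected.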

Usage

# f <- rpart(formula=y ~ x1 + x2 + …) # or rpart
# S3 method for rpart
validate(fit, method, B, bw, rule, type, sls, aics,
    force, estimates, pr=TRUE,
    k, rand, xval=10, FUN, …)
# S3 method for validate.rpart
print(x, …)
# S3 method for validate.rpart
plot(x, what=c("mse","dxy"), legendloc=locator, …)

Arguments

fit

an object created by rpart. You must have specified the model=TRUE argument to rpart.

method,B,bw,rule,type,sls,aics,force,estimates

are there only for consistency with the generic validate function; these are ignored

x

the result of validate.rpart

k

a sequence of cost/complexity values. By default these are obtained from calling FUN with no optional arguments or from the rpart cptable object in the original fit object. You may also specify a scalar or vector.

rand

a random sample (usually omitted)

xval

number of cross-validation folds (splits of the data)

FUN

the name of a function which produces a sequence of trees, such as prune.

…

additional arguments to FUN (ignored by print and plot).

pr

set to FALSE to prevent intermediate results for each k from being printed

what

a character vector naming the quantities to plot: "mse", "dxy", or both. By default, two plots are drawn, one for mean squared error and one for Dxy.

legendloc

a function that is evaluated with a single argument equal to 1 to generate a list with components x, y specifying coordinates of the upper left corner of a legend, or a 2-vector. For the latter, legendloc specifies the relative fraction of the plot at which to center the legend.
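A minimal sketch of passing k, xval, and pr explicitly, assuming the rms package supplies the validate generic and its rpart method. The simulated data and the three cost/complexity values are chosen only for illustration.

```r
library(rpart)
library(rms)   # assumed here to provide validate() and its rpart method

set.seed(1)
n  <- 100
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 * (x1 + x2 + rnorm(n) > 1)

# model=TRUE is required so that validate can recover the model frame
f <- rpart(y ~ x1 + x2, model = TRUE)

# Cost/complexity values stored in the fit's cptable
f$cptable[, "CP"]

# Illustrative explicit values for k, 5 folds, no intermediate printing
v <- validate(f, k = c(0.01, 0.05, 0.1), xval = 5, pr = FALSE)
v
```

Fold assignment is random, so calling set.seed before validate makes the validated estimates reproducible.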

Value

a list of class "validate.rpart" with components named k, size, dxy.app, dxy.val, mse.app, mse.val, binary, and xval. size is the number of nodes, dxy refers to Somers' Dxy, and mse refers to mean squared error of prediction. The suffix app denotes apparent accuracy on the training samples, and val denotes validated accuracy on the test samples. binary is a logical value indicating whether the response variable was binary (a logical or 0/1 variable is binary). size is not present if the user specifies k.
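One possible use of the returned components is to pick the cost/complexity value with the smallest validated mean squared error and prune the original tree to it. This is a sketch under the assumption that rms provides validate; the selection rule and the names best_k and f_pruned are illustrative, not something the package prescribes.

```r
library(rpart)
library(rms)   # assumed here to provide validate() and its rpart method

set.seed(1)
n  <- 100
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 * (x1 + x2 + rnorm(n) > 1)
f  <- rpart(y ~ x1 + x2, model = TRUE)
v  <- validate(f, pr = FALSE)

# Apparent versus validated accuracy at each cost/complexity value
data.frame(k = v$k,
           mse.app = v$mse.app, mse.val = v$mse.val,
           dxy.app = v$dxy.app, dxy.val = v$dxy.val)

# Illustrative rule: keep the k with the lowest validated MSE
best_k   <- v$k[which.min(v$mse.val)]
f_pruned <- prune(f, cp = best_k)
```

A large gap between the app and val columns suggests that the apparent accuracy of the unpruned tree is optimistic.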

Side Effects

prints intermediate results for each k if pr=TRUE

Examples

# NOT RUN {
n <- 100
set.seed(1)
x1 <- runif(n)
x2 <- runif(n)
x3 <- runif(n)
y  <- 1*(x1+x2+rnorm(n) > 1)
table(y)
library(rpart)
library(rms)   # provides validate() and its rpart method
f <- rpart(y ~ x1 + x2 + x3, model=TRUE)
v <- validate(f)
v    # note the poor validation
par(mfrow=c(1,2))
plot(v, legendloc=c(.2,.5))
par(mfrow=c(1,1))
# }
