errorest: Estimators of Prediction Error

Description

Resampling-based estimates of prediction error: misclassification error for classification, root mean squared error for regression, or Brier score for survival data.

Usage

errorest(formula, data, subset, na.action=na.omit, model=NULL, predict=NULL, estimator=c("cv", "boot", "632plus"), est.para=control.errorest(), ...)

Arguments

formula

a formula of the form lhs ~ rhs. It describes either the model of explanatory and response variables in the usual way (see lm) or the model between explanatory and intermediate variables in the framework of indirect classification (see inclass).

data

a data frame containing the variables in the model formula and additionally the class membership variable if model = inclass. data is required for indirect classification, otherwise formula is evaluated in the calling environment.

subset

optional vector, specifying a subset of observations to be used.

na.action

function indicating what should happen when the data contain NAs; defaults to na.omit.

model

function. Modelling technique whose error rate is to be estimated. The function model can return either an object representing a fitted model or a function with argument newdata which returns predicted values. In the latter case, the predict argument to errorest is ignored.

predict

function. Prediction method to be used. The vector of predicted values must have the same length as the number of observations to be predicted. Predictions corresponding to missing data must be replaced by NA. Additionally, predict has to return predicted values comparable to the responses (that is, factors for classification problems). See the examples for how to ensure this for any predictor.

estimator

estimator of the prediction error: "cv" (cross-validation), "boot" (bootstrap) or "632plus" (bias-corrected bootstrap, classification only).

est.para

a list of additional parameters that control the calculation of the estimator, see control.errorest for details.

...

additional parameters to model.

Value

The class of the object returned depends on the class of the response variable and the estimator used. In each case, it is a list with an element error and additional information. print methods are available for the inspection of the results.
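As a minimal sketch of inspecting the result, the following assumes the ipred and MASS packages are installed and uses the built-in iris data. Only the element error is relied upon here, since it is the one element this page guarantees; the helper mypredict.lda and the seed are illustrative choices.

```r
library("ipred")
library("MASS")

# wrapper so that predict returns class labels only
mypredict.lda <- function(object, newdata)
  predict(object, newdata = newdata)$class

set.seed(290875)  # arbitrary seed, only to make the random folds repeatable
res <- errorest(Species ~ ., data = iris, model = lda,
                predict = mypredict.lda)

class(res)   # depends on the response type and the estimator
res$error    # the estimated misclassification error
```

Printing res invokes the corresponding print method; res$error gives the bare number for further computation.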

Details

The prediction error for classification and regression models as well as predictive models for censored data using cross-validation or the bootstrap can be computed by errorest. For classification problems, the estimated misclassification error is returned. The root mean squared error is computed for regression problems and the Brier score for censored data (Graf et al., 1999) is reported if the response is censored.
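To make the first two measures concrete, here is a small base R sketch on made-up toy vectors (the data are invented for illustration and are not taken from any package). It shows the definitions only, not how errorest computes them internally; the Brier score for censored data is more involved and is not sketched here.

```r
# misclassification error: share of predictions that differ from the truth
truth <- factor(c("a", "b", "a", "b", "a"))
pred  <- factor(c("a", "b", "b", "b", "a"), levels = levels(truth))
misclass <- mean(pred != truth)      # one of five predictions is wrong

# root mean squared error for a numeric response
y    <- c(1, 2, 3, 4)
yhat <- c(1, 2, 3, 6)
rmse <- sqrt(mean((y - yhat)^2))     # squared errors are 0, 0, 0, 4

misclass
rmse
```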

Any model can be specified as long as it is a function with arguments model(formula, data, subset, na.action, ...). If a method predict.model(object, newdata, ...) is available, predict does not need to be specified. In either case, the predictions must be comparable to the response: one predicted value per observation in newdata, in the same order. See the examples below.
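The interface can be illustrated with a deliberately trivial classifier. The functions majority and mypredict.majority below are hypothetical helpers written for this sketch, not part of ipred; the sketch assumes ipred is installed and passes predict explicitly rather than relying on method dispatch.

```r
library("ipred")

# hypothetical "model": remembers the most frequent class in the training data
majority <- function(formula, data, ...) {
  y <- model.response(model.frame(formula, data = data))
  list(label = names(which.max(table(y))), levels = levels(y))
}

# matching predict function: returns a factor with one value per row of
# newdata, using the same levels as the response
mypredict.majority <- function(object, newdata)
  factor(rep(object$label, nrow(newdata)), levels = object$levels)

set.seed(290875)  # arbitrary seed for repeatable folds
res <- errorest(Species ~ ., data = iris, model = majority,
                predict = mypredict.majority)
res$error
```

Because the three iris classes are balanced, this baseline is expected to misclassify most observations; it is useful only as a template for wrapping a real fitting routine.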

k-fold cross-validation and the usual bootstrap estimator with est.para$nboot bootstrap replications can be computed for all kinds of problems. The bias-corrected .632+ bootstrap by Efron and Tibshirani (1997) is available for classification problems only. Use control.errorest to specify additional arguments.
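For readers unfamiliar with the idea, k-fold cross-validation can be sketched by hand in base R. This is a generic illustration on the built-in cars data, pooling squared errors over all held-out observations; it is not ipred's implementation, which may assign folds and aggregate errors differently.

```r
set.seed(1)                      # arbitrary seed for the random fold assignment
k <- 5
n <- nrow(cars)
fold <- sample(rep(seq_len(k), length.out = n))  # fold label per observation

pred <- numeric(n)
for (i in seq_len(k)) {
  test <- fold == i
  fit <- lm(dist ~ speed, data = cars[!test, ])          # fit without fold i
  pred[test] <- predict(fit, newdata = cars[test, ])     # predict fold i only
}

# root mean squared error over all held-out predictions
cv_rmse <- sqrt(mean((cars$dist - pred)^2))
cv_rmse
```

Every observation is predicted exactly once, by a model that did not see it during fitting.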

errorest is a formula-based interface to the generic functions cv and bootest, which implement methods for classification, regression and survival problems.
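In practice, switching between the two resampling schemes is a matter of the estimator and est.para arguments. The sketch below assumes ipred is installed and uses the built-in cars data with lm; the values k = 5 and nboot = 50 are arbitrary choices for illustration.

```r
library("ipred")

set.seed(290875)  # arbitrary seed for repeatable resampling

# 5-fold cross-validation of a linear model
cv5 <- errorest(dist ~ speed, data = cars, model = lm,
                estimator = "cv",
                est.para = control.errorest(k = 5))

# bootstrap estimate based on 50 bootstrap replications
boot50 <- errorest(dist ~ speed, data = cars, model = lm,
                   estimator = "boot",
                   est.para = control.errorest(nboot = 50))

c(cv = cv5$error, boot = boot50$error)
```

Both calls estimate the root mean squared error because the response is numeric; the two numbers will generally differ since the resampling schemes differ.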

References

Brian D. Ripley (1996), Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press.

Bradley Efron and Robert Tibshirani (1997), Improvements on Cross-Validation: The .632+ Bootstrap Estimator. Journal of the American Statistical Association 92(438), 548--560.

Erika Graf, Claudia Schmoor, Willi Sauerbrei and Martin Schumacher (1999), Assessment and comparison of prognostic classification schemes for survival data. Statistics in Medicine 18(17-18), 2529--2545.

Rosa A. Schiavo and David J. Hand (2000), Ten More Years of Error Rate Research. International Statistical Review 68(3), 296--310.

David J. Hand, Hua Gui Li and Niall M. Adams (2001), Supervised Classification with Structured Class Definitions. Computational Statistics & Data Analysis 36, 209--225.

Examples



# Classification

data("iris")
library("MASS")

# force predict to return class labels only
mypredict.lda <- function(object, newdata)
  predict(object, newdata = newdata)$class

# 10-fold cv of LDA for Iris data
errorest(Species ~ ., data=iris, model=lda, 
         estimator = "cv", predict= mypredict.lda)

data("PimaIndiansDiabetes", package = "mlbench")
## Not run: 
# # 632+ bootstrap of LDA for Diabetes data
# errorest(diabetes ~ ., data=PimaIndiansDiabetes, model=lda,
#          estimator = "632plus", predict= mypredict.lda)
# ## End(Not run)

# cv on a fixed partition of the data
list.tindx <- list(1:100, 101:200, 201:300, 301:400, 401:500,
        501:600, 601:700, 701:768)

errorest(diabetes ~ ., data=PimaIndiansDiabetes, model=lda,
          estimator = "cv", predict = mypredict.lda,
          est.para = control.errorest(list.tindx = list.tindx))

## Not run: 
# #both bootstrap estimations based on fixed partitions
# 
# list.tindx <- vector(mode = "list", length = 25)
# for(i in 1:25) {
#   list.tindx[[i]] <- sample(1:768, 768, TRUE)
# }
# 
# errorest(diabetes ~ ., data=PimaIndiansDiabetes, model=lda,
#           estimator = c("boot", "632plus"), predict= mypredict.lda,
#           est.para = control.errorest(list.tindx = list.tindx))
# 
# ## End(Not run)
data("Glass", package = "mlbench")

# LDA has cross-validated misclassification error of
# 38% (Ripley, 1996, page 98)

# Pruned trees about 32% (Ripley, 1996, page 230)

# use stratified sampling here, i.e. preserve the class proportions
errorest(Type ~ ., data=Glass, model=lda, 
         predict=mypredict.lda, est.para=control.errorest(strat=TRUE))

# force predict to return class labels
mypredict.rpart <- function(object, newdata)
  predict(object, newdata = newdata,type="class")

library("rpart")
pruneit <- function(formula, ...)
  prune(rpart(formula, ...), cp =0.01)

errorest(Type ~ ., data=Glass, model=pruneit,
         predict=mypredict.rpart, est.para=control.errorest(strat=TRUE))

# compute sensitivity and specificity for stabilised LDA

data("GlaucomaM", package = "TH.data")

error <- errorest(Class ~ ., data=GlaucomaM, model=slda,
  predict=mypredict.lda, est.para=control.errorest(predictions=TRUE))

# sensitivity 

mean(error$predictions[GlaucomaM$Class == "glaucoma"] == "glaucoma")

# specificity

mean(error$predictions[GlaucomaM$Class == "normal"] == "normal")

# Indirect Classification: Smoking data

data(Smoking)
# Set three groups of variables:
# 1) explanatory variables are: TarY, NicY, COY, Sex, Age
# 2) intermediate variables are: TVPS, BPNL, COHB
# 3) response (resp) is defined by:

resp <- function(data){
  data <- data[, c("TVPS", "BPNL", "COHB")]
  res <- t(t(data) > c(4438, 232.5, 58))
  res <- as.factor(ifelse(apply(res, 1, sum) > 2, 1, 0))
  res
}

response <- resp(Smoking[ ,c("TVPS", "BPNL", "COHB")])
smoking <- cbind(Smoking, response)

formula <- response~TVPS+BPNL+COHB~TarY+NicY+COY+Sex+Age

# Estimation per leave-one-out estimate for the misclassification is 
# 36.36% (Hand et al., 2001), using indirect classification with 
# linear models
## Not run: 
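# # note: mypredict.lm is not defined in these examples; a prediction
# # function for lm objects has to be supplied before running this call
# errorest(formula, data = smoking, model = inclass,estimator = "cv", 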
#          pFUN = list(list(model=lm, predict = mypredict.lm)), cFUN = resp,  
#          est.para=control.errorest(k=nrow(smoking)))
# ## End(Not run)

# Regression

data("BostonHousing", package = "mlbench")

# 10-fold cv of lm for Boston Housing data
errorest(medv ~ ., data=BostonHousing, model=lm,
         est.para=control.errorest(random=FALSE))

# the same, with "model" returning a function for prediction
# instead of an object of class "lm"

mylm <- function(formula, data) {
  mod <- lm(formula, data)
  function(newdata) predict(mod, newdata)
}

errorest(medv ~ ., data=BostonHousing, model=mylm,
         est.para=control.errorest(random=FALSE))


# Survival data

data("GBSG2", package = "TH.data")
library("survival")

# prediction is fitted Kaplan-Meier
predict.survfit <- function(object, newdata) object

# 5-fold cv of Kaplan-Meier for GBSG2 study
errorest(Surv(time, cens) ~ 1, data=GBSG2, model=survfit,
         predict=predict.survfit, est.para=control.errorest(k=5))
