rf.crossValidation: Random Forest Classification Model Cross-validation

Description

Implements a permutation-test cross-validation for Random Forests classification models.

Usage

rf.crossValidation(x, xdata, p = 0.1, n = 99, seed = NULL, ...)

Arguments

x

random forest object

xdata

The x (predictor) data used in the model

p

Proportion of the data to withhold in each cross-validation (default 0.1, i.e. 10 percent)

n

Number of cross-validations (default 99)

seed

Sets random seed in R global environment

...

Additional arguments passed to Random Forests
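To make the roles of p and n concrete, the following base R sketch draws n random hold-out sets, each containing a proportion p of the observations. It is only an illustration under simple assumptions (plain random sampling without stratification); the variable names `n.obs`, `n.withheld`, and `withheld` are hypothetical, and the package's actual sampling scheme may differ.

```r
# Illustrative only: simple random hold-out sets, not the package's internal code
set.seed(1234)                      # make the draws reproducible

n.obs <- nrow(iris)                 # 150 observations in the iris data
p <- 0.10                           # proportion of data withheld per run
n <- 99                             # number of cross-validation runs

n.withheld <- round(p * n.obs)      # observations withheld in each run

# One vector of withheld row indices per cross-validation run
withheld <- replicate(n, sample(n.obs, n.withheld), simplify = FALSE)

length(withheld)                    # number of runs
range(lengths(withheld))            # size of each hold-out set
```

In each run, a model would be fit to the rows not listed in the hold-out set and evaluated on the withheld rows.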

Value

A "rf.cv" class object with the following components:

cross.validation$cv.users.accuracy       Class-level user's accuracy for the subset cross-validation data
cross.validation$cv.producers.accuracy   Class-level producer's accuracy for the subset cross-validation data
cross.validation$cv.oob                  Global and class-level OOB error for the subset cross-validation data
model$model.users.accuracy               Class-level user's accuracy for the model
model$model.producers.accuracy           Class-level producer's accuracy for the model
model$model.oob                          Global and class-level OOB error for the model

Details

The cross-validation statistics are based on the prediction error on the withheld data:

Total observed accuracy represents the percent correctly classified (AKA, PCC) and is considered a naive measure of agreement. The diagonal of the confusion matrix represents correctly classified observations, whereas the off-diagonals represent cross-classification error. The primary issue with this evaluation is that it does not reveal whether the error was evenly distributed between classes.

To represent the balance of error, one can use omission and commission statistics such as estimates of user's and producer's accuracy. User's accuracy corresponds to error of commission (inclusion): observations being erroneously included in a given class. The commission errors are represented by row sums of the matrix. Producer's accuracy corresponds to error of omission (exclusion): observations being erroneously excluded from a given class. The omission errors are represented by column sums of the matrix.
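As a small worked example of these two statistics, the base R sketch below computes user's and producer's accuracy from a made-up confusion matrix. It assumes rows hold the predicted classes and columns hold the observed classes, which matches the row/column description above; the counts and the object names (`cm`, `users.accuracy`, `producers.accuracy`) are hypothetical and are not output from rf.crossValidation.

```r
# Hypothetical confusion matrix: rows = predicted class, columns = observed class
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(50,  0,  0,
                0, 47,  4,
                0,  3, 46),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = classes, observed = classes))

correct <- diag(cm)                          # correctly classified, per class

# User's accuracy: correct / (correct + commission errors) = correct / row total
users.accuracy <- correct / rowSums(cm)

# Producer's accuracy: correct / (correct + omission errors) = correct / column total
producers.accuracy <- correct / colSums(cm)

round(100 * users.accuracy, 1)
round(100 * producers.accuracy, 1)
```

Here versicolor has 4 commission errors (row total 51) and 3 omission errors (column total 50), so its user's accuracy (47/51) is lower than its producer's accuracy (47/50).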

None of the previous statistics accounts for random agreement influencing the accuracy measure. The kappa statistic is a chance-corrected metric that reflects the difference between observed agreement and agreement expected by random chance. A kappa of k=0.85 would indicate that there is 85% better agreement than by chance alone.

pcc = [Number of correct observations / total number of observations]
users accuracy = [Number of correct / total number of correct and commission errors]
producers accuracy = [Number of correct / total number of correct and omission errors]
k = (observed accuracy - chance agreement) / (1 - chance agreement), where chance agreement = sum[product of row and column totals for each class] / (total number of observations)^2
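The PCC and kappa formulas can be checked by hand with a short base R sketch. The confusion matrix is made up for illustration (rows assumed to be predicted classes, columns observed classes), and chance agreement is computed as the sum of row-total by column-total products divided by the squared number of observations; the names `cm`, `pcc`, `chance`, and `k` are hypothetical.

```r
# Hypothetical confusion matrix: rows = predicted class, columns = observed class
cm <- matrix(c(50,  0,  0,
                0, 47,  4,
                0,  3, 46),
             nrow = 3, byrow = TRUE)

n.total <- sum(cm)                           # total number of observations

# Percent correctly classified (observed accuracy), as a proportion
pcc <- sum(diag(cm)) / n.total

# Chance agreement: products of matching row and column totals, scaled by N^2
chance <- sum(rowSums(cm) * colSums(cm)) / n.total^2

# Kappa: agreement beyond chance, relative to the maximum possible beyond chance
k <- (pcc - chance) / (1 - chance)

round(c(pcc = pcc, chance = chance, kappa = k), 3)
```

With these counts, 143 of 150 observations are correct and chance agreement is 1/3, so kappa works out to 0.93.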

References

Evans, J.S. and S.A. Cushman (2009) Gradient Modeling of Conifer Species Using Random Forest. Landscape Ecology 5:673-683.

Murphy, M.A., J.S. Evans, and A.S. Storfer (2010) Quantifying Bufo boreas connectivity in Yellowstone National Park with landscape genetics. Ecology 91:252-261.

Evans, J.S., M.A. Murphy, Z.A. Holden, and S.A. Cushman (2011) Modeling species distribution and change using Random Forests. Ch. 8 in Predictive Modeling in Landscape Ecology, eds. C.A. Drew, F. Huettmann, and Y. Wiersma. Springer.

Examples

library(randomForest)
library(rfUtilities)   # provides rf.crossValidation

data(iris)
iris$Species <- as.factor(iris$Species)
set.seed(1234)

# Fit the model, then cross-validate it (10% of the data withheld, 99 runs)
( rf.mdl <- randomForest(iris[,1:4], iris[,"Species"], ntree=501) )
( rf.cv <- rf.crossValidation(rf.mdl, iris[,1:4], p=0.10, n=99, ntree=501) )

# Plot cross-validation versus model producer's accuracy
par(mfrow=c(1,2))
plot(rf.cv, type = "cv", main = "CV producers accuracy")
plot(rf.cv, type = "model", main = "Model producers accuracy")

# Plot cross-validation versus model OOB error
par(mfrow=c(1,2))
plot(rf.cv, type = "cv", stat = "oob", main = "CV oob error")
plot(rf.cv, type = "model", stat = "oob", main = "Model oob error")
