postResample: Calculates performance across resamples

Description

Given two numeric vectors of data, the root mean squared error and R-squared are calculated. For two factors, the overall agreement rate and Kappa are determined.
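For the factor case, the two statistics can be sketched in base R as follows. This is an illustration only, assuming the unweighted form of Cohen's Kappa; `agreement_stats` is a hypothetical helper and not part of the package.

```r
# Hypothetical helper (not a caret function): overall agreement and
# unweighted Cohen's Kappa for two factors.
agreement_stats <- function(pred, obs) {
  # Use a common set of levels so the confusion table is square
  lev <- union(levels(factor(obs)), levels(factor(pred)))
  tab <- table(factor(pred, levels = lev), factor(obs, levels = lev))
  n <- sum(tab)
  p_obs <- sum(diag(tab)) / n                       # observed agreement
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
  c(Accuracy = p_obs, Kappa = (p_obs - p_exp) / (1 - p_exp))
}

obs  <- factor(c("a", "a", "b", "b"))
pred <- factor(c("a", "b", "b", "b"))
res <- agreement_stats(pred, obs)
print(res)
```

Kappa rescales the agreement rate by the agreement expected from the marginal class frequencies alone, so a value of zero corresponds to chance-level agreement.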

Usage

postResample(pred, obs)
defaultSummary(data, lev = NULL, model = NULL)
twoClassSummary(data, lev = NULL, model = NULL)
mnLogLoss(data, lev = NULL, model = NULL)
R2(pred, obs, formula = "corr", na.rm = FALSE)
RMSE(pred, obs, na.rm = FALSE)
getTrainPerf(x)

Arguments

pred

A vector of predicted values: numeric data, or a factor for classification

obs

A vector of observed values: numeric data, or a factor for classification

data

a data frame or matrix with columns obs and pred for the observed and predicted outcomes. For twoClassSummary, columns should also include predicted probabilities for each class. See the classProbs argument of trainControl.

lev

a character vector of factor levels for the response. In regression cases, this would be NULL.

model

a character string for the model name (as taken from the method argument of train).

formula

which $R^2$ formula should be used? Either "corr" or "traditional". See Kvalseth (1985) for a summary of the different equations.

na.rm

a logical value indicating whether NA values should be stripped before the computation proceeds.

x

an object of class train

Value

A vector of performance estimates.

Details

postResample is meant to be used with apply across a matrix. For numeric data the code checks to see if the standard deviation of either vector is zero. If so, the correlation between those samples is assigned a value of zero. NA values are ignored everywhere.

Note that many models have more predictors (or parameters) than data points, so the typical mean squared error denominator (n - p) does not apply. Root mean squared error is calculated using sqrt(mean((pred - obs)^2)). Also, $R^2$ is calculated either as the square of the correlation between the observed and predicted outcomes when formula = "corr", or, when formula = "traditional", as $$R^2 = 1-\frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$
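The calculations described above can be sketched in base R. This is an illustration under stated assumptions rather than the package's own code: `rmse_sketch` and `r2_sketch` are hypothetical names, and missing-value handling is left out.

```r
# Hypothetical helpers (not caret functions) mirroring the formulas above.
rmse_sketch <- function(pred, obs) {
  sqrt(mean((pred - obs)^2))
}

r2_sketch <- function(pred, obs, formula = c("corr", "traditional")) {
  formula <- match.arg(formula)
  if (formula == "corr") {
    # Zero-variance guard: the correlation is taken to be zero
    if (sd(pred) == 0 || sd(obs) == 0) return(0)
    cor(obs, pred)^2
  } else {
    1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
  }
}

obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 1.5, 3.5, 3.5)

rmse_val <- rmse_sketch(pred, obs)
r2_corr  <- r2_sketch(pred, obs, formula = "corr")
r2_trad  <- r2_sketch(pred, obs, formula = "traditional")
r2_const <- r2_sketch(rep(2, 4), obs, formula = "corr")
print(c(RMSE = rmse_val, corr = r2_corr, traditional = r2_trad, constant = r2_const))
```

The two $R^2$ definitions happen to agree on this small example, but in general they differ: the "traditional" form can be negative when the predictions fit worse than the mean of the observed values, while the squared correlation cannot.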

defaultSummary is the default function to compute performance metrics in train. It is a wrapper around postResample.

twoClassSummary computes sensitivity, specificity and the area under the ROC curve. mnLogLoss computes the minus log-likelihood of the multinomial distribution (without the constant term): $$logLoss = \frac{-1}{n}\sum_{i=1}^n \sum_{j=1}^C y_{ij} \log(p_{ij})$$ where the $y_{ij}$ values are binary indicators for the classes and the $p_{ij}$ are the predicted class probabilities.
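The double sum in the log-loss formula can be written out in base R. The sketch below is illustrative only: `log_loss_sketch` is a hypothetical helper, it assumes the probability columns are named after the factor levels, and it does not guard against predicted probabilities of exactly zero.

```r
# Hypothetical helper (not a caret function): multinomial log loss.
# obs:   factor of observed classes
# probs: data frame or matrix with one probability column per class,
#        with column names matching levels(obs)
log_loss_sketch <- function(obs, probs) {
  lev <- levels(obs)
  p <- as.matrix(probs[, lev, drop = FALSE])
  # y[i, j] is TRUE when observation i belongs to class j
  y <- outer(as.integer(obs), seq_along(lev), "==")
  -sum(y * log(p)) / length(obs)
}

obs <- factor(c("class1", "class2"), levels = c("class1", "class2"))
probs <- data.frame(class1 = c(0.8, 0.4), class2 = c(0.2, 0.6))
loss <- log_loss_sketch(obs, probs)
print(loss)
```

Only the probability assigned to the observed class contributes to each term, so the result equals the negative mean of the log probabilities of the true classes.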

To use twoClassSummary and/or mnLogLoss, the classProbs argument of trainControl should be TRUE.

Other functions can be used via the summaryFunction argument of trainControl. Custom functions must have the same arguments as defaultSummary.
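As an illustration of the required signature, a custom summary function might look like the sketch below. `maeSummary` is a hypothetical name and mean absolute error is chosen only as an example metric; the function takes the same three arguments as defaultSummary and returns a named numeric vector.

```r
# Hypothetical custom summary function: same arguments as defaultSummary,
# returning a named vector of performance estimates.
maeSummary <- function(data, lev = NULL, model = NULL) {
  # data is expected to have numeric columns `obs` and `pred`
  c(MAE = mean(abs(data$pred - data$obs)))
}

dat <- data.frame(obs = c(1, 2, 3), pred = c(1.5, 2, 2))
mae <- maeSummary(dat)
print(mae)

# It would then be supplied through trainControl, for example:
# ctrl <- trainControl(summaryFunction = maeSummary)
```

The lev and model arguments are unused here but are kept so that the function can be called in the same way as defaultSummary.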

The function getTrainPerf returns a one-row data frame with the resampling results for the chosen model. The statistics will have the prefix "Train" (e.g. "TrainROC"). There is also a column called "method" that echoes the argument of the call to trainControl of the same name.

References

Kvalseth. Cautionary note about $R^2$. American Statistician (1985) vol. 39 (4) pp. 279-285

Examples


library(caret)
# Each column of `predicted` is one set of predictions for `observed`
predicted <- matrix(rnorm(50), ncol = 5)
observed <- rnorm(10)
apply(predicted, 2, postResample, obs = observed)

classes <- c("class1", "class2")
set.seed(1)
dat <- data.frame(obs =  factor(sample(classes, 50, replace = TRUE)),
                  pred = factor(sample(classes, 50, replace = TRUE)),
                  class1 = runif(50), class2 = runif(50))

defaultSummary(dat, lev = classes)
twoClassSummary(dat, lev = classes)
mnLogLoss(dat, lev = classes)
