confusion_matrix: Confusion Matrices (Contingency Tables)

Description

Construct confusion matrices and report accuracy, sensitivity, specificity, and related statistics with confidence intervals (Wilson's method and, optionally, bootstrapping).

Usage

confusion_matrix(
  ...,
  thresholds = NULL,
  confint_method = "logit",
  alpha = getOption("qwraps2_alpha", 0.05)
)
# S3 method for default
confusion_matrix(
  truth,
  predicted,
  ...,
  thresholds = NULL,
  confint_method = "logit",
  alpha = getOption("qwraps2_alpha", 0.05)
)
# S3 method for formula
confusion_matrix(
  formula,
  data = parent.frame(),
  ...,
  thresholds = NULL,
  confint_method = "logit",
  alpha = getOption("qwraps2_alpha", 0.05)
)
# S3 method for glm
confusion_matrix(
  x,
  ...,
  thresholds = NULL,
  confint_method = "logit",
  alpha = getOption("qwraps2_alpha", 0.05)
)
# S3 method for qwraps2_confusion_matrix
print(x, ...)

Value

confusion_matrix returns a data.frame with columns

Arguments

...: arguments passed through to other methods
thresholds: a numeric vector of thresholds to be used to define the confusion matrix (one threshold) or matrices (two or more thresholds). If NULL the unique values of predicted will be used.
confint_method: character string selecting the method used to derive confidence intervals: logit (default), binomial, or Wilson score
alpha: alpha level for 100 * (1 - alpha)% confidence intervals
truth: an integer vector with the values 0 and 1, or a logical vector. A value of 0 or FALSE indicates condition negative; 1 or TRUE indicates condition positive.
predicted: a numeric vector. See Details.
formula: column (known) ~ row (test) for building the confusion matrix
data: environment containing the variables listed in the formula
x: a glm object
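
To make the roles of confint_method and alpha concrete, here is a minimal base R sketch of the textbook Wilson score interval for a single proportion. The helper name wilson_ci is hypothetical and this is not taken from the package source; the package's own intervals may differ in detail.

```r
# Hypothetical helper: Wilson score interval for x successes in n trials.
# alpha plays the same role as the alpha argument documented above:
# the interval has nominal coverage 100 * (1 - alpha)%.
wilson_ci <- function(x, n, alpha = 0.05) {
  z      <- qnorm(1 - alpha / 2)      # standard normal quantile
  p      <- x / n                     # observed proportion
  denom  <- 1 + z^2 / n
  center <- (p + z^2 / (2 * n)) / denom
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
  c(estimate = p, lower = center - half, upper = center + half)
}

# Sensitivity for Example 1 at threshold = 1:
# 6 true positives out of 8 condition-positive observations.
ci <- wilson_ci(6, 8)
ci
```

The same interval can be cross-checked against stats::prop.test(6, 8, correct = FALSE), which reports a Wilson score interval when the continuity correction is turned off.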

Details

The confusion matrix:

                           True Condition
                           +       -
Predicted Condition   +    TP      FP
Predicted Condition   -    FN      TN

where

FN (False Negative): truth = 1 & prediction < threshold,
FP (False Positive): truth = 0 & prediction >= threshold,
TN (True Negative): truth = 0 & prediction < threshold, and
TP (True Positive): truth = 1 & prediction >= threshold.
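
The four cell definitions can be reproduced directly in base R. The sketch below is an illustration only, not the package's internal code; the helper name count_cells is hypothetical. It uses the data from Example 1.

```r
# Hypothetical helper: tabulate the four cells for one threshold,
# treating an observation as predicted positive when prediction >= threshold.
count_cells <- function(truth, predicted, threshold) {
  positive <- predicted >= threshold
  c(
    TP = sum(truth == 1 &  positive),
    FP = sum(truth == 0 &  positive),
    FN = sum(truth == 1 & !positive),
    TN = sum(truth == 0 & !positive)
  )
}

truth <- c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0)
pred  <- c(1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0)

cells <- count_cells(truth, pred, threshold = 1)
cells

# One column of counts per threshold. The thresholds here are the unique
# predicted values, mirroring the documented behaviour of thresholds = NULL.
by_threshold <- sapply(sort(unique(pred)),
                       function(th) count_cells(truth, pred, th))
by_threshold
```

With a threshold of 0 every observation is predicted positive, so FN and TN are both 0 in that column.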

The statistics returned in the stats element are:

accuracy = (TP + TN) / (TP + TN + FP + FN)
sensitivity, aka true positive rate = TP / (TP + FN)
specificity, aka true negative rate = TN / (TN + FP)
positive predictive value (PPV), aka precision = TP / (TP + FP)
negative predictive value (NPV) = TN / (TN + FN)
false negative rate (FNR) = 1 - sensitivity
false positive rate (FPR) = 1 - specificity
false discovery rate (FDR) = 1 - PPV
false omission rate (FOR) = 1 - NPV
F1 score = 2 * (PPV * sensitivity) / (PPV + sensitivity)
Matthews Correlation Coefficient (MCC) = ((TP * TN) - (FP * FN)) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
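
As a worked check of these formulas, the following base R sketch evaluates them by hand for the cell counts of Example 1 at threshold = 1 (TP = 6, FP = 1, FN = 2, TN = 3). It is not the package's implementation, and the F1 line assumes the standard definition as the harmonic mean of PPV and sensitivity.

```r
# Cell counts for Example 1 at threshold = 1, worked out from the definitions.
TP <- 6; FP <- 1; FN <- 2; TN <- 3

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
ppv         <- TP / (TP + FP)
npv         <- TN / (TN + FN)

# F1 as the harmonic mean of PPV and sensitivity (standard definition).
f1  <- 2 * ppv * sensitivity / (ppv + sensitivity)

# Matthews Correlation Coefficient.
mcc <- (TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

stats <- c(accuracy = accuracy, sensitivity = sensitivity,
           specificity = specificity, ppv = ppv, npv = npv,
           fnr = 1 - sensitivity, fpr = 1 - specificity,
           fdr = 1 - ppv, fomr = 1 - npv,  # "fomr": `for` is reserved in R
           f1 = f1, mcc = mcc)
round(stats, 4)
```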

Synonyms for the statistics:

Sensitivity: true positive rate (TPR), recall, hit rate
Specificity: true negative rate (TNR), selectivity
PPV: precision
FNR: miss rate

Sensitivity and PPV can be indeterminate in some cases because of division by zero. To address this, the following rule, based on the DICE group's convention (https://github.com/dice-group/gerbil/wiki/Precision,-Recall-and-F1-measure), is used: if TP, FP, and FN are all 0, then PPV, sensitivity, and F1 are defined to be 1. If TP is 0 and FP + FN > 0, then PPV, sensitivity, and F1 are all defined to be 0.
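
The rule can be written as a small guard around the usual formulas. This is a sketch under the rule as stated above, not the package's code; the helper name safe_ppv_sens_f1 is hypothetical.

```r
# Hypothetical helper applying the zero-count rule to PPV, sensitivity, and F1.
safe_ppv_sens_f1 <- function(TP, FP, FN) {
  if (TP == 0 && FP == 0 && FN == 0) {
    # Nothing to find and nothing predicted positive: defined as 1.
    return(c(ppv = 1, sensitivity = 1, f1 = 1))
  }
  if (TP == 0) {
    # FP + FN > 0 is guaranteed here by the previous check: defined as 0.
    return(c(ppv = 0, sensitivity = 0, f1 = 0))
  }
  ppv  <- TP / (TP + FP)
  sens <- TP / (TP + FN)
  c(ppv = ppv, sensitivity = sens, f1 = 2 * ppv * sens / (ppv + sens))
}

all_zero <- safe_ppv_sens_f1(TP = 0, FP = 0, FN = 0)
no_hits  <- safe_ppv_sens_f1(TP = 0, FP = 3, FN = 0)
regular  <- safe_ppv_sens_f1(TP = 6, FP = 1, FN = 2)
```

Without the guard, the second call would evaluate 0 / 0 for sensitivity and return NaN.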

Examples

# confusion_matrix() and the spambase data set are provided by qwraps2
library(qwraps2)

# Example 1: known truth and prediction status
df <-
  data.frame(
      truth = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0)
    , pred  = c(1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0)
  )

confusion_matrix(df$truth, df$pred, thresholds = 1)

# Example 2: Use with a logistic regression model
mod <- glm(
  formula = spam ~ word_freq_our + word_freq_over + capital_run_length_total
, data = spambase
, family = binomial()
)

confusion_matrix(mod)
confusion_matrix(mod, thresholds = 0.5)
