sa_diff: Statistical tests for the differences between standardized accuracies (staccuracies)

Description

Because the distribution of staccuracies is uncertain (and indeed, different staccuracies likely have different distributions), bootstrapping is used to empirically estimate the distributions and calculate the p-values. See the return value description for details on what the function provides.

Usage

sa_diff(
  actual,
  preds,
  ...,
  na.rm = FALSE,
  sa = NULL,
  pct = c(0.01, 0.02, 0.03, 0.04, 0.05),
  boot_alpha = 0.05,
  boot_it = 1000,
  seed = 0
)

Value

tibble with staccuracy difference results:

staccuracy: name of staccuracy measure
pred: Each named element (model name) in the input preds. The row values give the staccuracy for that prediction. When pred is NA, the row represents the difference between prediction staccuracies (diff) instead of staccuracies themselves.
diff: When diff takes the form 'model1-model2', then the row values give the difference in staccuracies between two named elements (model names) in the input preds. When diff is NA, the row instead represents the staccuracy of a specific model prediction (pred).
lo, mean, hi: The lower bound, mean, and upper bound of the bootstrapped staccuracy. The lower and upper bounds are the limits of the confidence interval specified by the input boot_alpha.
p__: p-values for the test that the difference in staccuracies is at least the specified percentage amount. E.g., for the default input pct = c(0.01, 0.02, 0.03, 0.04, 0.05), these columns would be p01, p02, p03, p04, and p05. As they apply only to differences between staccuracies, they are provided only for diff rows and are NA for pred rows. As an example of their meaning, if the mean difference for 'model1-model2' is 0.0832 with p01 of 0.012 and p02 of 0.035, then 1.2% of bootstrapped samples had a model1 - model2 staccuracy difference of less than 0.01 and 3.5% had a difference of less than 0.02. (That is, 98.8% of differences were at least 0.01 and 96.5% were at least 0.02.)
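The reading of the p__ columns described above can be illustrated with a minimal base R sketch. The vector boot_diffs is hypothetical simulated data standing in for bootstrapped 'model1-model2' differences; it is not output from sa_diff(), and this is not necessarily how the function computes its p-values internally.

```r
# Hypothetical bootstrapped 'model1-model2' staccuracy differences
# (simulated for illustration only; not produced by sa_diff()).
set.seed(0)
boot_diffs <- rnorm(1000, mean = 0.08, sd = 0.03)

pct <- c(0.01, 0.02, 0.03, 0.04, 0.05)

# For each threshold, the share of bootstrapped differences that fall below it
p_vals <- vapply(pct, function(p) mean(boot_diffs < p), numeric(1))

# Name the results in the p01, p02, ... style described above
names(p_vals) <- sprintf("p%02d", as.integer(round(pct * 100)))
p_vals
```

Because the thresholds increase, the resulting shares can only stay the same or grow from p01 to p05.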

Arguments

actual: numeric vector. The actual (true) labels.
preds: named list of at least two numeric vectors. Each element is a vector of the same length as actual with predictions for each row corresponding to each element of actual. The names of the list elements should be the names of the models that produced each respective prediction; these names will be used to distinguish the results.
...: not used. Forces explicit naming of subsequent arguments.
na.rm: See documentation for staccuracy()
sa: list of functions. Each element is the unquoted name of a valid staccuracy function (see staccuracy() for the required function signature). If an element is named, the name will be displayed as the value of the sa column of the result. Otherwise, the function name will be displayed. If NULL (default), staccuracy functions will be automatically selected based on the datatypes of actual and preds.
pct: numeric with values from (0, 1). The percentage values on which the difference in staccuracies will be tested.
boot_alpha: numeric(1) from 0 to 1. Alpha for the percentile-based confidence interval of the bootstrapped means; the lower and upper bounds of the interval are the boot_alpha / 2 and 1 - boot_alpha / 2 percentiles. For example, if boot_alpha = 0.05 (default), the bounds will be at the 2.5 and 97.5 percentiles.
boot_it: positive integer(1). The number of bootstrap iterations.
seed: integer(1). Random seed for the bootstrap sampling. Supply the same value across runs to ensure identical results.
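As a rough illustration of how boot_alpha, boot_it, and seed interact, the following base R sketch computes a percentile bootstrap interval. It assumes mean absolute error as a stand-in statistic; sa_diff() itself works with staccuracy measures, and its internal resampling may differ from what is shown here.

```r
set.seed(0)  # fixed seed so that reruns draw the same resamples

actual <- attitude$rating
pred   <- predict(lm(rating ~ ., data = attitude))

boot_alpha <- 0.05
boot_it    <- 1000

# Stand-in statistic (mean absolute error), recomputed on each resample
# of row indices drawn with replacement
boot_stat <- replicate(boot_it, {
  i <- sample.int(length(actual), replace = TRUE)
  mean(abs(actual[i] - pred[i]))
})

# Percentile interval at boot_alpha / 2 and 1 - boot_alpha / 2
ci <- quantile(boot_stat, probs = c(boot_alpha / 2, 1 - boot_alpha / 2))
c(lo = ci[[1]], mean = mean(boot_stat), hi = ci[[2]])
```

With boot_alpha = 0.05 the two quantiles are the 2.5 and 97.5 percentiles of the bootstrapped statistic, matching the default described above.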

Examples

lm_attitude_all <- lm(rating ~ ., data = attitude)
lm_attitude__a <- lm(rating ~ . - advance, data = attitude)
lm_attitude__c <- lm(rating ~ . - complaints, data = attitude)

sdf <- sa_diff(
  attitude$rating,
  list(
    all = predict(lm_attitude_all),
    madv = predict(lm_attitude__a),
    mcmp = predict(lm_attitude__c)
  ),
  boot_it = 10
)
sdf
