arete (version 0.1)

performance_report: Evaluate the performance of an LLM

Description

Produce a detailed report on the discrepancies between LLM-extracted data and human-annotated data for the same collection of files.

Usage

performance_report(
  human_data,
  model_data,
  full_locations = "coordinates",
  string_distance = "levenshtein",
  verbose = TRUE,
  rmds = TRUE,
  path = NULL
)

Value

list. A confusion matrix is returned for every species per document, plus one for the entire process.

Arguments

human_data

matrix. Ground truth dataset against which the data extracted by an LLM is compared.

model_data

matrix. Dataset of location data extracted by the LLM, structured in the same way as human_data.

full_locations

character. Defines the dataset structure. If "locations", the columns are Species, Location, File. If "coordinates", the columns are Species, Long, Lat, File. If "both", the columns are Species, Location, Long, Lat, File.
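As an illustration of the three layouts, the sketch below builds a small table in the "both" layout and derives the other two from it. All species names, place names, coordinates, and file names are invented for the example, and a data frame is used for convenience; whether performance_report() requires a matrix or accepts a data frame should be checked against the package itself.

```r
# Hypothetical records; every value below is invented for illustration.
both <- data.frame(
  Species  = c("Species A", "Species A", "Species B"),
  Location = c("Site 1", "Site 2", "Site 3"),
  Long     = c(-8.61, -9.14, -7.91),
  Lat      = c(41.15, 38.72, 37.02),
  File     = c("paper_01.pdf", "paper_01.pdf", "paper_02.pdf"),
  stringsAsFactors = FALSE
)

# Layout for full_locations = "locations": Species, Location, File
locations <- both[, c("Species", "Location", "File")]

# Layout for full_locations = "coordinates": Species, Long, Lat, File
coordinates <- both[, c("Species", "Long", "Lat", "File")]

colnames(locations)
colnames(coordinates)
```

The column order matters more than the names here, since the argument description defines each layout by position.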

string_distance

character. Selects the method through which the proximity between two strings is calculated, from those available under utils::adist().

verbose

logical. Determines if output should be printed.

rmds

logical. Determines if more extensive R Markdown files should be created at path.

path

character. Directory to which the output of the function is saved.

Details

Four main metrics are calculated to report on the performance of the model for coordinates. These are

  • Accuracy, \(\frac{TP}{TP + FP + FN}\), defined this way because the system has no True Negatives.

  • Recall, \(\frac{TP}{TP + FN}\), Kent et al. (1955)

  • Precision, \(\frac{TP}{TP + FP}\), Kent et al. (1955)

  • F1 score, \(\frac{2}{\frac{1}{Precision} + \frac{1}{Recall}}\), van Rijsbergen (1979).
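The four formulas above can be sketched directly from raw counts. The helper name confusion_metrics() and the counts used are hypothetical and not part of arete; the package's own implementation may differ, for instance in how it handles zero counts.

```r
# Hypothetical helper, not part of arete: the four metrics from raw counts.
confusion_metrics <- function(tp, fp, fn) {
  accuracy  <- tp / (tp + fp + fn)               # no true negatives in this setting
  recall    <- tp / (tp + fn)
  precision <- tp / (tp + fp)
  f1        <- 2 / (1 / precision + 1 / recall)  # harmonic mean of the two
  c(accuracy = accuracy, recall = recall, precision = precision, f1 = f1)
}

# Made-up counts: 8 correct records, 2 hallucinated, 4 missed.
metrics <- confusion_metrics(tp = 8, fp = 2, fn = 4)
round(metrics, 3)
```

Note that this sketch returns NaN when tp is 0, so a real implementation would need to decide how to report that case.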

Additional metrics are calculated, including a distance-weighted confusion matrix in which each type of error (False Negatives and False Positives) is summed using weights that are inverse to the mean Euclidean distance of that data point to all others. This way, errors that are close to existing data for that species count less than those farther away, e.g. a hallucinated data point that is close to existing data, or a missed data point that is already represented in the data. This adjusted confusion matrix is presented along with versions of the four main metrics calculated from its values.

To report on the performance of locations, by default the minimum Levenshtein distance (Levenshtein, 1966) between a term and all other terms is calculated. It is defined as:

$$
\operatorname{lev}(a,b) = \begin{cases}
|a| & \text{if } |b| = 0, \\
|b| & \text{if } |a| = 0, \\
\operatorname{lev}(\operatorname{tail}(a), \operatorname{tail}(b)) & \text{if } \operatorname{head}(a) = \operatorname{head}(b), \\
1 + \min \begin{cases}
\operatorname{lev}(\operatorname{tail}(a), b) \\
\operatorname{lev}(a, \operatorname{tail}(b)) \\
\operatorname{lev}(\operatorname{tail}(a), \operatorname{tail}(b))
\end{cases} & \text{otherwise.}
\end{cases}
$$

In short, it is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn string a into string b.
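A minimal sketch of the minimum Levenshtein distance, using utils::adist() from base R. The terms are made up, and the sketch assumes that each model term is compared against every ground-truth term; the package may pair terms differently.

```r
# Made-up location terms: ground truth and model output.
truth <- c("Lisbon", "Porto", "Coimbra")
model <- c("Lisboa", "Porto", "Faro")

# utils::adist() returns a matrix of generalized Levenshtein distances,
# with one row per element of `model` and one column per element of `truth`.
d <- utils::adist(model, truth)
dimnames(d) <- list(model, truth)

# Minimum distance from each model term to any ground-truth term.
min_dist <- apply(d, 1, min)
min_dist
```

A minimum of 0 marks an exact match, while small values such as the single substitution separating "Lisboa" from "Lisbon" point to spelling variants rather than genuinely different places.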

References

  • Kent, A. et al. (1955). "Machine literature searching VIII. Operational criteria for designing information retrieval systems", American Documentation, 6(2), pp. 93–101. doi:10.1002/asi.5090060209.

  • van Rijsbergen, C.J. (1979). "Information Retrieval", Architectural Press. ISBN 978-0408709293.

  • Levenshtein, V.I. (1966). "Binary codes capable of correcting deletions, insertions, and reversals", Soviet Physics-Doklady, 10(8), pp. 707–710 [Translated from Russian].

Examples

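# Load the example dataset
trial_data = arete::arete_data("holzapfelae-extract")
# Convert the coordinate strings in column 3, swap the order of the two
# outputs, and place them between columns 1:2 and 4:5
trial_data = cbind(trial_data[,1:2], arete::string_to_coords(trial_data[,3])[2:1], trial_data[,4:5])

# Split into ground truth and model output, keeping only the first five columns
trial_data = list(
  GT = trial_data[trial_data$Type == "Ground truth", 1:5],
  MD = trial_data[trial_data$Type == "Model", 1:5]
)

# make sure you run arete_setup() beforehand!
arete::performance_report(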
  trial_data$GT,
  trial_data$MD,
  full_locations = "both",
  verbose = FALSE,
  rmds = FALSE
)