evaluate_external_ratings: Evaluate how new typicality ratings predict human ratings and compare performance to LLM baselines
Description
This function compares external typicality ratings (e.g., generated by a new LLM)
against the validation dataset included in 'baserater'. The validation set contains
average typicality ratings collected from 50 Prolific participants on a subset of
100 group–adjective pairs, as described in the accompanying paper.
The input ratings are merged with this reference set. The function then:
Computes a correlation (via cor.test) between the external ratings and the human averages;
Compares it to one or more built-in model baselines (by default, 'GPT-4' and 'LLaMA 3.3');
Prints a summary of all correlation coefficients and flags whether the external model outperforms each baseline.
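The sketch below illustrates the comparison logic described above, not the package's actual implementation. The toy data frame, its column names, and the rating values are placeholders invented for the example; only cor.test() and the general workflow (correlate each model's ratings with the human averages, then compare coefficients) come from the description.

    # Illustrative sketch, assuming the external ratings have already been
    # merged with the human validation averages and a built-in baseline.
    # Column names and values are placeholders, not the package's schema.
    merged <- data.frame(
      human_average   = c(72, 35, 88, 51, 64),  # human validation averages
      external_rating = c(70, 40, 85, 55, 60),  # ratings from the new model
      gpt4_rating     = c(68, 30, 90, 48, 66)   # a built-in baseline's ratings
    )

    # Correlate each model's ratings with the human averages
    r_external <- unname(cor.test(merged$external_rating, merged$human_average)$estimate)
    r_gpt4     <- unname(cor.test(merged$gpt4_rating,     merged$human_average)$estimate)

    # Flag whether the external model outperforms the baseline
    cat(sprintf("external r = %.2f, GPT-4 r = %.2f, outperforms: %s\n",
                r_external, r_gpt4, r_external > r_gpt4))

In the package itself, this comparison is handled by evaluate_external_ratings(), which repeats the baseline comparison for each requested built-in model and prints the resulting summary.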