ura
ura provides a set of tools for calculating inter-rater reliability
(IRR) statistics by rater, allowing for real-time monitoring of rater
reliability. While not the first package to provide users access to IRR
diagnostics (e.g., irr), ura
aims to provide a simple set of tools for quickly monitoring rater
progress and precision. For instance, you can use ura to calculate the
percentage agreement or Krippendorf’s Alpha across all of the subjects
coded by your raters. Another helpful use is to calculate percentage
agreement values by rater, providing an efficient way to monitor the
relative reliability of your raters.
This package complements a paper published in PS: Political Science &
Politics, entitled “Improving Content Analysis: Tools for Working with
Undergraduate Research Assistants.” Please refer to this paper for a
more general discussion
about training and monitoring student raters. Also, be sure to check out
the paper for more information about how to use the tools in ura to
monitor progress without compromising reproducibility.
Installation Instructions
ura is available on CRAN and can be installed using:
install.packages("ura")

You can install the most recent development version of ura using the
devtools package. First, you have to install devtools using the
following code. Note that you only have to do this once:
if(!require(devtools)) install.packages("devtools")

Then, load devtools and use the function install_github() to install
ura:
library(devtools)
install_github("bengoehring/ura", dependencies = TRUE)

Usage Examples
IRR statistics
ura can be used to calculate key IRR statistics, such as percentage
agreement and Krippendorf’s Alpha, via the irr_stats() function. This
function largely serves as a wrapper around irr::agree() and
irr::kripp.alpha(), but it aims to simplify users’ lives by requiring
only a dataframe and the names of a few key columns.
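For a sense of what that simplification looks like, here is a rough
sketch of the equivalent workflow using irr directly on a small
hypothetical long-format dataset (the toy_codings object and its values
are my own illustration, not part of ura). Before the irr functions can
be called, the data must first be reshaped so that each row is a
subject and each column is a rater:

# Hypothetical long-format codings: one row per rater-subject pair.
# The toy_codings name and values are illustrative, not from ura.
toy_codings <- tibble::tribble(
  ~rater, ~subject, ~coding,
  1, 1, 1,
  1, 2, 0,
  2, 1, 1,
  2, 2, 1
)

# irr expects a wide, subject-by-rater layout, so reshape first...
toy_wide <- tidyr::pivot_wider(toy_codings,
                               id_cols = subject,
                               names_from = rater,
                               values_from = coding)

# ...then drop the subject column and call the irr functions directly.
irr::agree(toy_wide[, -1])
irr::kripp.alpha(t(as.matrix(toy_wide[, -1])), method = "nominal")

irr_stats() takes care of this bookkeeping and, as the example below
shows, returns both statistics in a single tibble.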
For instance, below I calculate the percentage agreement and
Krippendorf’s Alpha of the diagnoses dataset, which records the
psychiatric evaluations of 30 patients by 6 raters. The diagnoses
dataset is included with the ura package and is simply a reshaped
version of the dataset with the same name in the irr package.
library(ura)
irr_stats(diagnoses,
          rater_column = 'rater_id',
          subject_column = 'patient_id',
          coding_column = 'diagnosis')
#> # A tibble: 2 × 3
#>   statistic            value n_subjects
#>   <chr>                <dbl>      <int>
#> 1 Percentage agreement 16.7          30
#> 2 Krippendorf's Alpha   0.43         30

A few things to note here. First, the unit of analysis in diagnoses is
rater-subject — that is, each row provides the coding decision of rater
i for subject j. All data inputted into a ura function should be long
by rater-subject. Second, you will see that the dataframe returned by
irr_stats() notes the number of subjects used to calculate the given
IRR statistic. In the case of diagnoses, this value is equal to the
number of unique subjects in the dataframe:
length(unique(diagnoses$patient_id))
#> [1] 30

This is not always the case. If your dataframe includes both subjects
coded by more than one rater and subjects coded by a single rater (a
common approach for balancing efficiency with the need for IRR
statistics), ura will automatically use only the subjects coded by more
than one rater. The resulting number of subjects will then appear in
the n_subjects column.
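If you want to check this yourself, you can tally the number of raters
per subject directly. The dplyr sketch below is my own illustration
rather than a ura function; the count it produces should match the
n_subjects value reported by irr_stats():

library(dplyr)

# Count the subjects coded by more than one rater -- these are the
# subjects that contribute to the IRR statistics above.
diagnoses %>%
  distinct(patient_id, rater_id) %>%
  count(patient_id, name = "n_raters") %>%
  filter(n_raters > 1) %>%
  nrow()

For diagnoses, this returns 30, since every patient is evaluated by all
six raters; in a design with single-coded subjects, the count (and
n_subjects) will be smaller.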
Percentage Agreement by Rater
The rater_agreement() function is the key method for monitoring rater
reliability. While irr_stats() provides pooled IRR statistics across
all raters, rater_agreement() provides the percentage of a given
rater’s codings that agree with other raters’ codings. In other words,
it offers supervisors a method for checking the relative precision of
each rater in real time. Since interventions in coding procedures should
be used sparingly, I suggest taking a look at the paper linked above for
more information about when and why to intervene based on information
gleaned from rater_agreement().
In the snippet below, all raters have the same percent agreement: 17%. That is because, as implied by the n_multi_coded column, every rater codes every subject in the diagnoses dataset.
rater_agreement(diagnoses,
                rater_column = 'rater_id',
                subject_column = 'patient_id',
                coding_column = 'diagnosis')
#> # A tibble: 6 × 3
#>   rater percent_agree n_multi_coded
#>   <dbl>         <dbl>         <int>
#> 1     1            17            30
#> 2     2            17            30
#> 3     3            17            30
#> 4     4            17            30
#> 5     5            17            30
#> 6     6            17            30

A more helpful use case is when you only have your raters multi-code a
subset of subjects. Take this hypothetical dataset, for instance:
example_data <- tibble::tribble(
  ~rater, ~subject, ~coding,
  1, 1, 1,
  1, 2, 0,
  1, 3, 1,
  1, 4, 0,
  2, 3, 1,
  2, 9, 0,
  2, 10, 1,
  2, 4, 1,
  2, 5, 1,
  2, 6, 1,
  3, 5, 1,
  3, 6, 1,
  3, 7, 1,
  3, 8, 1,
)

Here, some subjects are coded by multiple raters while others are coded
by a single rater. As a result:
rater_agreement(example_data,
                rater_column = 'rater',
                subject_column = 'subject',
                coding_column = 'coding')
#> # A tibble: 3 × 3
#>   rater percent_agree n_multi_coded
#>   <dbl>         <dbl>         <int>
#> 1     3           100             2
#> 2     2            75             4
#> 3     1            50             2

In terms of interpretation, row 3 shows that of the 2 subjects coded by
rater 1 that were also coded by another rater, rater 1 agrees with the
other rater(s) 50% of the time. Looking back at example_data, it
appears that rater 1 agreed with rater 2 on the coding of subject 3 but
not on subject 4.
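You can confirm this by pulling out the subjects that rater 1 shares
with another rater; the dplyr sketch below is my own check rather than
a package function:

library(dplyr)

# Subjects 3 and 4 are the only subjects coded by both rater 1 and
# another rater, so they are the ones behind rater 1's percent_agree.
example_data %>%
  filter(subject %in% c(3, 4)) %>%
  arrange(subject, rater)

Both raters code subject 3 as 1 but split on subject 4, which is where
the 50% figure comes from. In practice, a supervisor might re-run
rater_agreement() as coding progresses and follow up with any rater
whose percent_agree falls well below that of their peers.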