qlm_compare: Compare coded results for inter-rater reliability

Description

Compares two or more coded objects to assess inter-rater reliability or agreement. For predefined-unit data (data frames or qlm_coded objects), computes standard reliability statistics. For segmented corpora from qlm_segment(), computes Krippendorff's alpha for unitizing (see Details).

Usage

qlm_compare(
  ...,
  by,
  level = NULL,
  tolerance = 0,
  ci = c("none", "analytic", "bootstrap"),
  bootstrap_n = 1000
)

Value

A qlm_comparison object (a tibble/data frame) with the following columns:

variable: Name of the compared variable
level: Measurement level used
measure: Name of the reliability metric
value: Computed value of the metric
docid: Source document identifier, or an indicator marking the overall result across documents. Present for unitizing comparisons only; absent for predefined-unit comparisons.
rater1, rater2, ...: Names of the compared objects (one column per rater)
ci_lower: Lower bound of confidence interval (only if ci != "none")
ci_upper: Upper bound of confidence interval (only if ci != "none")

The object has class c("qlm_comparison", "tbl_df", "tbl", "data.frame") and attributes containing metadata (raters, n, call).

Metrics by measurement level (predefined-unit comparisons):

Nominal: alpha_nominal, kappa (Cohen's/Fleiss'), percent_agreement
Ordinal: alpha_ordinal, kappa_weighted (2 raters only), w (Kendall's W), rho (Spearman's), percent_agreement
Interval/Ratio: alpha_interval/alpha_ratio, icc, r (Pearson's), percent_agreement

For unitizing measures (segmented corpora), see Details.

Confidence intervals:

ci = "analytic": Provides analytic CIs for ICC and Pearson's r only
ci = "bootstrap": Provides bootstrap CIs for all metrics via resampling

Arguments

...

Two or more data frames, qlm_coded, or as_qlm_coded objects to compare. These represent different "raters" (e.g., different LLM runs, different models, human coders, or human vs. LLM coding). Each object must have a .id column and the variable specified in by. Objects should have the same units (matching .id values). Plain data frames are automatically converted to as_qlm_coded objects. Alternatively, all inputs may be segmented corpora from qlm_segment() or as_qlm_coded() with qlm_segment = TRUE (see Details).

by

Optional. Name of the variable(s) to compare across raters; both quoted and unquoted names are supported. If NULL (default), all coded variables are compared. Otherwise, pass a single variable (by = sentiment) or a character vector (by = c("sentiment", "rating")).

level

Optional. Measurement level(s) for the variable(s). Can be:

NULL (default): Auto-detect from codebook
Character scalar: Use same level for all variables
Named list: Specify level for each variable

Valid levels are "nominal", "ordinal", "interval", or "ratio".

tolerance

Numeric. Tolerance for agreement with numeric data. Default is 0 (exact agreement required). Used for percent agreement calculation.
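
As a rough illustration of how a tolerance can enter a percent agreement calculation, the base R sketch below counts a unit as agreed when the gap between the highest and lowest rating is no larger than the tolerance. The helper percent_agreement_tol() is hypothetical and not part of the package, and the package's exact rule may differ.

```r
# Hypothetical helper (not part of quallmer): percent agreement with a
# numeric tolerance. A unit counts as agreed when the spread between the
# highest and lowest rating is at most `tolerance`.
percent_agreement_tol <- function(ratings, tolerance = 0) {
  ratings <- as.matrix(ratings)                  # rows = units, cols = raters
  spread <- apply(ratings, 1, function(x) max(x) - min(x))
  mean(spread <= tolerance)
}

ratings <- data.frame(
  rater1 = c(1, 2, 3, 4, 5),
  rater2 = c(1, 3, 3, 2, 5)
)

exact   <- percent_agreement_tol(ratings, tolerance = 0)  # exact matches only
lenient <- percent_agreement_tol(ratings, tolerance = 1)  # gaps of 1 still agree
c(exact = exact, lenient = lenient)
```

Raising the tolerance can only keep or increase the share of agreeing units, so it is worth reporting the value used alongside the result.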

ci

Confidence interval method:

"none": No confidence intervals (default)
"analytic": Analytic CIs where available (ICC, Pearson's r)
"bootstrap": Bootstrap CIs for all metrics via resampling

bootstrap_n

Number of bootstrap resamples when ci = "bootstrap". Default is 1000. Ignored when ci is "none" or "analytic".
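
To make the role of bootstrap_n concrete, here is a minimal percentile bootstrap for percent agreement in base R. It assumes that units are resampled with replacement and that the interval is taken from the 2.5% and 97.5% quantiles; the resampling scheme the function actually uses is not documented here, so treat this as a sketch rather than a description of the implementation.

```r
# Illustrative percentile bootstrap for percent agreement between two raters.
# Assumption: units are resampled with replacement; the package's actual
# resampling scheme may differ.
set.seed(42)

rater1 <- c("pos", "neg", "neg", "pos", "neu", "pos", "neg", "neu", "pos", "neg")
rater2 <- c("pos", "neg", "pos", "pos", "neu", "neg", "neg", "neu", "pos", "neg")

percent_agreement <- function(a, b) mean(a == b)

bootstrap_n <- 1000
n_units <- length(rater1)

boot_values <- replicate(bootstrap_n, {
  idx <- sample.int(n_units, replace = TRUE)   # resample units, keep pairs intact
  percent_agreement(rater1[idx], rater2[idx])
})

estimate  <- percent_agreement(rater1, rater2)
ci_bounds <- quantile(boot_values, probs = c(0.025, 0.975), names = FALSE)

c(value = estimate, ci_lower = ci_bounds[1], ci_upper = ci_bounds[2])
```

More resamples give more stable interval bounds at the cost of run time, which is the trade-off bootstrap_n controls.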

Details

The function merges the coded objects by their .id column and keeps only units that are present in all objects. A unit with a missing value for any rater is excluded from the analysis.
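
The unit alignment described above can be pictured with a base R inner join followed by a complete-case filter. This is only an illustration with made-up data, not the package's internal code.

```r
# Two raters coded overlapping, but not identical, sets of units.
rater_a <- data.frame(.id = c("d1", "d2", "d3", "d4"),
                      sentiment = c("pos", "neg", NA, "neu"),
                      stringsAsFactors = FALSE)
rater_b <- data.frame(.id = c("d2", "d3", "d4", "d5"),
                      sentiment = c("neg", "pos", "neu", "pos"),
                      stringsAsFactors = FALSE)

# Inner join on .id keeps only units present in both objects.
merged <- merge(rater_a, rater_b, by = ".id", suffixes = c("_a", "_b"))

# Drop units where any rater has a missing value.
usable <- merged[complete.cases(merged), ]
usable$.id
```

Here d1 and d5 drop out because only one rater coded them, and d3 drops out because rater_a left it uncoded.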

Measurement levels and statistics:

Nominal: For unordered categories. Computes Krippendorff's alpha, Cohen's/Fleiss' kappa, and percent agreement.
Ordinal: For ordered categories. Computes Krippendorff's alpha (ordinal), weighted kappa (2 raters only), Kendall's W, Spearman's rho, and percent agreement.
Interval: For continuous data with meaningful intervals. Computes Krippendorff's alpha (interval), ICC, Pearson's r, and percent agreement.
Ratio: For continuous data with a true zero point. Computes the same measures as interval level, but Krippendorff's alpha uses the ratio-level formula which accounts for proportional differences.
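
To show why a chance-corrected statistic is reported next to percent agreement at the nominal level, the sketch below computes both for two raters in base R. The function cohen_kappa() is a hypothetical textbook implementation written for this example; it is not the routine the package calls.

```r
# Minimal Cohen's kappa for two raters on nominal codes (base R only).
cohen_kappa <- function(a, b) {
  categories <- sort(unique(c(a, b)))
  tab <- table(factor(a, levels = categories), factor(b, levels = categories))
  p <- tab / sum(tab)
  p_observed <- sum(diag(p))                    # observed agreement
  p_expected <- sum(rowSums(p) * colSums(p))    # agreement expected by chance
  (p_observed - p_expected) / (1 - p_expected)
}

rater1 <- c("pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg")
rater2 <- c("pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg")

agreement <- mean(rater1 == rater2)        # percent agreement
kappa     <- cohen_kappa(rater1, rater2)   # chance-corrected agreement
c(percent_agreement = agreement, kappa = kappa)
```

With two equally frequent codes, half of the agreement is expected by chance, so kappa comes out lower than the raw percent agreement.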

Kendall's W, ICC, and percent agreement are computed using all raters simultaneously. For 3 or more raters, Spearman's rho and Pearson's r are computed as the mean of all pairwise correlations between raters.
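
The averaging of pairwise correlations for three or more raters can be sketched in base R as follows. The helper mean_pairwise_cor() and the ratings are made up for illustration.

```r
# Three raters scoring the same six units.
ratings <- cbind(
  rater1 = c(1, 2, 3, 4, 5, 6),
  rater2 = c(2, 1, 4, 3, 6, 5),
  rater3 = c(1, 3, 2, 5, 4, 6)
)

# Hypothetical helper: mean of the correlations over all unique rater pairs.
mean_pairwise_cor <- function(x, method = "pearson") {
  cors <- cor(x, method = method)       # rater-by-rater correlation matrix
  mean(cors[upper.tri(cors)])           # upper triangle = unique pairs
}

mean_r   <- mean_pairwise_cor(ratings, method = "pearson")
mean_rho <- mean_pairwise_cor(ratings, method = "spearman")
c(r = mean_r, rho = mean_rho)
```

Because each rater's scores here are already a permutation of 1 to 6, Pearson's r and Spearman's rho coincide; with tied or unevenly spaced scores they would generally differ.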

Unitizing (segmentation) reliability

When all inputs are segmented corpora — created by qlm_segment() or as_qlm_coded() with qlm_segment = TRUE — agreement is measured at the character level using Krippendorff's alpha for unitizing continua (Krippendorff, 2019, section 12.6). This accounts for segments of unequal length and partial overlaps between coders' unitizations. The observed and expected coincidence matrices are constructed from the lengths of pairwise segment intersections across all observer pairs. The output includes a docid column with per-document and overall results. Segmented corpora must reference the same source text.

Four members of the unitizing alpha family are supported:

alpha_u_binary ($|_{u}\alpha$): Computed when by is omitted. Measures agreement on which character spans are identified as segments versus gaps (irrelevant matter). Collapses all segment values to a binary distinction. Use this for pure boundary agreement when segments carry no codes (section 12.6.4, eq. 35).
alpha_u_nominal (${}_{u}\alpha_{\text{nominal}}$): Computed when by names a docvar. Measures agreement on both boundary placement and the value (code) assigned to each segment. This is the most comprehensive measure: low values can reflect boundary disagreement, coding disagreement, or both (section 12.6.3, eq. 34).
alpha_cu_nominal (${}_{cu}\alpha_{\text{nominal}}$): Computed alongside alpha_u_nominal when by is specified. Measures coding agreement conditional on unitization, restricting the coincidence matrix to intersections of non-gap segments only. This isolates "do the coders agree on the codes?" from "do they agree on the boundaries?" (section 12.6.5, eqs. 36-37).
alpha_u_per_value[k] (${}_{(k)u}\alpha_{\text{nominal}}$): Computed alongside alpha_u_nominal when by is specified. Reports the reliability of each individual value k, showing which codes are applied reliably and which are not. Coverage (the percentage of all k-valued matter found in valued intersections) is reported in the docid column (section 12.6.6, eq. 38).
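
The unitizing measures are built from the lengths of pairwise segment intersections. The base R sketch below shows only that building block for two coders, using a hypothetical representation of segments as 1-based, inclusive character offsets. It does not construct the full observed and expected coincidence matrices or compute alpha, and the package's internal data layout may look different.

```r
# Hypothetical segmentations of one 30-character document by two coders.
# Offsets are 1-based and inclusive; spans not covered by a segment are gaps.
coder_a <- data.frame(start = c(1, 11, 21), end = c(10, 20, 30),
                      code = c("claim", "evidence", "claim"),
                      stringsAsFactors = FALSE)
coder_b <- data.frame(start = c(1, 9, 25), end = c(8, 22, 30),
                      code = c("claim", "evidence", "claim"),
                      stringsAsFactors = FALSE)

# Overlap length between every segment of A (rows) and every segment of B (columns).
intersection_lengths <- function(a, b) {
  overlap_start <- outer(a$start, b$start, pmax)
  overlap_end   <- outer(a$end,   b$end,   pmin)
  pmax(overlap_end - overlap_start + 1, 0)      # 0 where segments are disjoint
}

overlaps <- intersection_lengths(coder_a, coder_b)

# Characters both coders placed inside a segment, and the share with matching codes.
same_code <- outer(coder_a$code, coder_b$code, "==")
total_overlap   <- sum(overlaps)
matched_overlap <- sum(overlaps[same_code])
c(total = total_overlap, matched = matched_overlap)
```

Working with overlap lengths rather than segment counts is what lets partial overlaps and segments of unequal length contribute in proportion to their size.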

References

Krippendorff, K. (2019). Content Analysis: An Introduction to Its Methodology (4th ed.). Sage. doi:10.4135/9781071878781

Examples

# Load example coded objects
examples <- readRDS(system.file("extdata", "example_objects.rds", package = "quallmer"))

# Compare two coding runs
comparison <- qlm_compare(
  examples$example_coded_sentiment,
  examples$example_coded_mini,
  by = "sentiment",
  level = "nominal"
)
print(comparison)

# Same comparison, letting the measurement level be auto-detected
qlm_compare(
  examples$example_coded_sentiment,
  examples$example_coded_mini,
  by = "sentiment"
)
