Learn R Programming

joinspy

Diagnostic Tools for Data Frame Joins in R

The joinspy package helps you understand and debug join operations by analyzing key columns before and after joins, detecting common issues, and explaining unexpected row count changes. Catch problems early instead of discovering them when downstream analysis breaks.

Quick Start

library(joinspy)

# Pre-join diagnostics
report <- join_spy(orders, customers, by = "customer_id")
summary(report)

# Quick pass/fail check
key_check(orders, customers, by = "customer_id")

# Safe join with cardinality enforcement
result <- join_strict(orders, customers, by = "customer_id", expect = "1:1")

# Auto-repair common issues
orders_fixed <- join_repair(orders, by = "customer_id")

Statement of Need

Joins silently produce unexpected results when:

  • Duplicate keys cause row multiplication
  • Trailing whitespace breaks matches invisibly
  • Case mismatches ("ABC" vs "abc") prevent joins
  • Encoding issues make identical-looking strings different
  • NA values in keys cause unexpected behavior
  • Type mismatches (character "1" vs numeric 1) fail silently

These problems are discovered only when downstream analysis breaks. joinspy catches them upfront by analyzing keys before you join, explaining why joins misbehave, and showing where the problems are.

Features

Pre-Join Diagnostics

  • join_spy(): Comprehensive pre-flight diagnostic report
  • key_check(): Quick pass/fail key quality assessment
  • key_duplicates(): Find and locate duplicate keys

Post-Join Analysis

  • join_explain(): Explain row count changes after a join
  • join_diff(): Compare before/after states

Safe Join Wrappers

  • join_strict(): Join with cardinality enforcement (1:1, 1:m, m:1, m:m)
  • left_join_spy(), right_join_spy(), inner_join_spy(), full_join_spy(): Joins with automatic diagnostics
  • last_report(): Retrieve diagnostics after silent (.quiet = TRUE) joins

Auto-Repair

  • join_repair(): Fix whitespace, case, encoding, empty strings automatically
  • suggest_repairs(): Generate R code snippets to fix detected issues

Advanced Analysis

  • detect_cardinality(): Determine actual relationship (1:1, 1:m, m:1, m:m)
  • check_cartesian(): Warn about Cartesian product explosions
  • analyze_join_chain(): Analyze multi-table join sequences

Visualization & Logging

  • plot(): Venn diagram of key overlap (with optional file param to save)
  • summary(): Compact metrics table (with optional format param for text/markdown)
  • log_report(): Write reports to file (text/JSON/RDS)
  • set_log_file(): Enable automatic logging

Installation

# Install from CRAN (when available)
install.packages("joinspy")

# Or install development version from GitHub
# install.packages("pak")
pak::pak("gcol33/joinspy")

Usage Examples

Pre-Join Diagnostics

library(joinspy)

orders <- data.frame(
  customer_id = c("A", "B", "B", "C", "D "),
  amount = c(100, 200, 150, 300, 50),
  stringsAsFactors = FALSE
)

customers <- data.frame(
  customer_id = c("A", "B", "C", "D", "E"),
  name = c("Alice", "Bob", "Carol", "David", "Eve"),
  stringsAsFactors = FALSE
)

# Full diagnostic report
report <- join_spy(orders, customers, by = "customer_id")

# Compact summary
summary(report)
#>              metric value
#> 1         left_rows     5
#> 2        right_rows     5
#> 3   left_unique_keys    4
#> 4  right_unique_keys    5
#> ...

Cardinality Enforcement

# Succeeds - 1:1 relationship
products <- data.frame(id = 1:3, name = c("Widget", "Gadget", "Gizmo"))
prices <- data.frame(id = 1:3, price = c(10, 20, 30))

join_strict(products, prices, by = "id", expect = "1:1")

# Fails - duplicates violate 1:1
prices_dup <- data.frame(id = c(1, 1, 2, 3), price = c(10, 15, 20, 30))
join_strict(products, prices_dup, by = "id", expect = "1:1")
#> Error: Cardinality violation: expected '1:1' but found '1:m'

Auto-Repair

messy <- data.frame(
  id = c(" A", "B ", "  C  "),
  value = 1:3,
  stringsAsFactors = FALSE
)

# Preview what would be fixed
join_repair(messy, by = "id", dry_run = TRUE)

# Apply fixes
fixed <- join_repair(messy, by = "id")
fixed$id
#> [1] "A" "B" "C"

Silent Pipeline Mode

# Silent join for pipelines
result <- left_join_spy(orders, customers, by = "customer_id", .quiet = TRUE)

# Access diagnostics afterward
last_report()$match_analysis$match_rate
#> [1] 0.8

Visualization

report <- join_spy(orders, customers, by = "customer_id")

# Venn diagram
plot(report)

# Save to file
plot(report, file = "overlap.png")

Documentation

Related Work

  • dplyr - The relationship argument provides basic cardinality checks
  • tidylog - Logs row count changes (but doesn't diagnose causes)

joinspy fills the gap: it tells you why joins misbehave and where the problems are.

Support

"Software is like sex: it's better when it's free." — Linus Torvalds

I'm a PhD student who builds R packages in my free time because I believe good tools should be free and open. I started these projects for my own work and figured others might find them useful too.

If this package saved you some time, buying me a coffee is a nice way to say thanks.

License

MIT (see the LICENSE.md file)

Citation

@software{joinspy,
  author = {Colling, Gilles},
  title = {joinspy: Diagnostic Tools for Data Frame Joins},
  year = {2025},
  url = {https://github.com/gcol33/joinspy}
}

Copy Link

Version

Install

install.packages('joinspy')

Version

0.7.3

License

MIT + file LICENSE

Issues

Pull Requests

Stars

Forks

Maintainer

Gilles Colling

Last Published

January 23rd, 2026

Functions in joinspy (0.7.3)

.detect_near_matches

Detect Near-Matches Between Keys
.levenshtein

Calculate 'Levenshtein' Distance
.join_spy_impl

Internal join wrapper helper
.format_bytes

Format Bytes as Human-Readable String
.report_to_text

Convert JoinReport to text
.detect_type_mismatch

Detect Type Mismatches Between Key Columns
.predict_row_counts

Predict Row Counts for Different Join Types
.report_to_list

Convert JoinReport to list for serialization
.detect_numeric_precision

Detect Numeric Precision Issues
join_spy

Comprehensive Pre-Join Diagnostic Report
join_repair

Repair Common Key Issues
join_explain

Explain Row Count Changes After a Join
.to_json

Simple JSON serialization (no dependencies)
full_join_spy

Full Join with Diagnostics
get_log_file

Get Current Log File
inner_join_spy

Inner Join with Diagnostics
is_join_report

Check if Object is a JoinReport
join_diff

Compare Data Frame Before and After Join
log_report

Log Join Report to File
plot.JoinReport

Plot Method for JoinReport
.detect_whitespace

Detect Whitespace Issues in Keys
.summarize_keys

Summarize Key Column
print.JoinReport

Print Method for JoinReport
left_join_spy

Left Join with Diagnostics
suggest_repairs

Suggest Repair Code
new_join_report

Create a JoinReport Object
summary.JoinReport

Summary Method for JoinReport
last_report

Get the Last Join Report
key_check

Quick Key Quality Check
key_duplicates

Find Duplicate Keys
right_join_spy

Right Join with Diagnostics
set_log_file

Configure Automatic Logging
join_strict

Strict Join with Cardinality Enforcement
joinspy-package

joinspy: Diagnostic Tools for Data Frame Joins
.detect_factor_mismatch

Detect Factor Level Mismatches
analyze_join_chain

Analyze Multi-Table Join Chain
detect_cardinality

Detect Join Relationship Type
check_cartesian

Detect Potential Cartesian Product
.detect_case_mismatch

Detect Case Mismatches Between Keys
.analyze_multicolumn_keys

Analyze Multi-Column Key Breakdown
.detect_encoding_issues

Detect Encoding Issues in Keys
.analyze_match

Analyze Match Between Two Key Sets
.detect_empty_strings

Detect Empty Strings in Keys
addin_join_inspector

'RStudio' Addin: Join Inspector