
keyed

Explicit Key Assumptions for Flat-File Data

The keyed package brings database-style primary key protections to R data frames. Declare which columns must be unique, and keyed enforces that constraint through filters, joins, and mutations — erroring immediately when assumptions break instead of failing silently downstream.

Quick Start

library(keyed)

# Declare a primary key — errors if not unique
customers <- read.csv("customers.csv") |> key(customer_id)

# Key persists through transformations
active <- customers |> dplyr::filter(status == "active")
has_key(active)
#> [1] TRUE

# Watch for automatic drift detection
customers <- customers |> watch()
modified  <- customers |> dplyr::mutate(score = score + 10)
check_drift(modified)
#> Drift detected
#> Modified: 3 row(s)
#>   score: 3 change(s)

Statement of Need

In databases, you declare customer_id as a primary key and the engine enforces uniqueness. With CSV and Excel files, you get no such guarantees — duplicates slip in silently, joins produce unexpected row counts, and data assumptions are implicit.
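The unexpected-row-count problem is easy to reproduce in base R with toy data. Here a duplicated key slips into a lookup table unnoticed, and the join quietly grows:

```r
# Toy data: customer 2 appears twice in the lookup table by mistake
customers <- data.frame(customer_id = c(1, 2, 2), segment = c("A", "B", "C"))
orders    <- data.frame(customer_id = c(1, 2),    amount  = c(10, 20))

# Nothing warns about the duplicate key...
merged <- merge(orders, customers, by = "customer_id")

# ...but the join now has 3 rows instead of the 2 you expected
nrow(merged)
#> [1] 3
```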

Existing validation packages (pointblank, validate) offer comprehensive rule engines but require upfront schema definitions. For analysts working interactively with flat files, this overhead is often too high. The result: assumptions go unchecked, and errors surface far from their source.

keyed addresses this gap with four lightweight mechanisms:

| Feature | What it does |
|---|---|
| Keys | Declare unique columns, enforced through transformations |
| Locks | Assert conditions (no NAs, row counts, coverage) at pipeline checkpoints |
| UUIDs | Track row identity through filters, joins, and reshaping |
| Watch & Diff | Auto-snapshot before each transformation, cell-level drift reports |

These features are designed for CSV-first workflows without database infrastructure or version control — where SQLite is overkill but silent corruption is unacceptable.

Features

Keys

Declare which columns must be unique. Keys persist through base R and dplyr operations, and block any transformation that would break uniqueness.

# Single or composite keys
customers <- key(customers, customer_id)
sales     <- key(sales, region, year)

# Keys survive filtering
active <- customers[customers$status == "active", ]
has_key(active)
#> [1] TRUE

# Uniqueness-breaking operations are blocked
customers |> dplyr::mutate(customer_id = 1)
#> Error: Key is no longer unique after transformation.
#> i Use `unkey()` first if you intend to break uniqueness.

Join Diagnostics

Preview join cardinality before executing:

diagnose_join(customers, orders, by = "customer_id")
#> Cardinality: one-to-many
#> customers: 1000 rows (unique)
#> orders:    5432 rows (4432 duplicates)
#> Left join will produce ~5432 rows
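This kind of check boils down to testing key uniqueness on each side before joining. A rough base R sketch of the idea (toy data and a hypothetical helper name, not the package's implementation):

```r
# Sketch: classify join cardinality from key uniqueness on each side
join_cardinality <- function(x, y, by) {
  left  <- if (anyDuplicated(x[[by]])) "many" else "one"
  right <- if (anyDuplicated(y[[by]])) "many" else "one"
  paste(left, right, sep = "-to-")
}

customers <- data.frame(customer_id = 1:3)
orders    <- data.frame(customer_id = c(1, 1, 2, 3))
join_cardinality(customers, orders, by = "customer_id")
#> [1] "one-to-many"
```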

Locks

Assert conditions at pipeline checkpoints. Locks error immediately — no silent continuation.

customers |>
  lock_unique(customer_id) |>
  lock_no_na(email) |>
  lock_nrow(min = 100)

Available locks:

| Function | Checks |
|---|---|
| lock_unique(df, col) | No duplicate values |
| lock_no_na(df, col) | No missing values |
| lock_complete(df) | No NAs in any column |
| lock_coverage(df, threshold, col) | % non-NA above threshold |
| lock_nrow(df, min, max) | Row count in range |
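Conceptually, each lock is an assertion that errors on failure and otherwise returns its input unchanged, which is what makes locks pipeable. A base R sketch of the lock_no_na idea (an illustration, not the package's actual implementation):

```r
# Sketch only: error if the column has NAs, otherwise pass the data through
lock_no_na_sketch <- function(df, col) {
  if (anyNA(df[[col]])) {
    stop("Column '", col, "' contains missing values.", call. = FALSE)
  }
  df  # returning the input is what lets the next pipe step continue
}

df <- data.frame(email = c("a@x.com", "b@x.com"))
identical(lock_no_na_sketch(df, "email"), df)
#> [1] TRUE
```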

UUIDs

Generate stable row identifiers when your data has no natural key. UUIDs survive all transformations and enable row-level tracking.

customers <- add_id(customers)

# Track which rows were added or removed
filtered <- customers |> dplyr::filter(name != "Bob")
compare_ids(customers, filtered)
#> Lost: 1 row (7b1e4a9c2f8d3601)
#> Kept: 2 rows
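Under the hood this is a set operation on the ID column. In base R terms (using a hand-made .id column as a stand-in for keyed's UUIDs; the column name is illustrative, not necessarily the package's internal one):

```r
# Toy data with a hand-made id column standing in for generated UUIDs
customers <- data.frame(.id  = c("a1", "b2", "c3"),
                        name = c("Alice", "Bob", "Carol"))
filtered  <- customers[customers$name != "Bob", ]

setdiff(customers$.id, filtered$.id)    # rows lost by the filter
#> [1] "b2"
intersect(customers$.id, filtered$.id)  # rows kept
#> [1] "a1" "c3"
```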

Watch & Diff

watch() turns drift detection from a manual ceremony into an automatic safety net. Watched data frames auto-snapshot before each dplyr verb, so check_drift() always gives you a cell-level report of what the last transformation changed.

# Watch a keyed data frame — stamps a baseline automatically
customers <- key(df, customer_id) |> watch()

# Every dplyr verb auto-snapshots before executing
filtered <- customers |> dplyr::filter(status == "active")
check_drift(filtered)
#> Drift detected
#> Removed: 153 row(s)
#> Unchanged: 847 row(s)

# Cell-level detail through a pipe chain
result <- filtered |> dplyr::mutate(score = score + 10)
check_drift(result)
#> Drift detected
#> Modified: 847 row(s)
#>   score: 847 change(s)

For manual one-off comparisons, stamp() and diff() still work directly:

# Manual stamp + check
customers <- customers |> stamp()
customers$score[1] <- 999
check_drift(customers)

# Cell-level diff between any two keyed data frames
diff(old_version, new_version)
#> Key: customer_id
#> Removed: 2 row(s)
#> Added: 5 row(s)
#> Modified: 3 row(s)
#>   email: 2 change(s)
#>   segment: 1 change(s)
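A cell-level report like this amounts to aligning rows by key and comparing columns. A minimal base R sketch of that idea (toy data, not the package's implementation):

```r
old <- data.frame(customer_id = 1:3, score = c(10, 20, 30))
new <- data.frame(customer_id = 1:3, score = c(10, 25, 30))

# Align rows by key, then flag the cells that changed
aligned <- merge(old, new, by = "customer_id", suffixes = c(".old", ".new"))
changed <- aligned[aligned$score.old != aligned$score.new, "customer_id"]
changed
#> [1] 2
```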

Use unwatch() to stop automatic stamping, or clear_all_snapshots() to free memory.

Installation

# Install from CRAN
install.packages("keyed")

# Or install development version from GitHub
# install.packages("pak")
pak::pak("gcol33/keyed")

When to Use Something Else

| Need | Better Tool |
|---|---|
| Enforced schema | SQLite, DuckDB |
| Full data validation | pointblank, validate |
| Production pipelines | targets |

Support

"Software is like sex: it's better when it's free." — Linus Torvalds

I'm a PhD student who builds R packages in my free time because I believe good tools should be free and open. I started these projects for my own work and figured others might find them useful too.

If this package saved you some time, buying me a coffee is a nice way to say thanks. It helps with my coffee addiction.

License

MIT (see the LICENSE.md file)

Citation

@software{keyed,
  author = {Colling, Gilles},
  title = {keyed: Explicit Key Assumptions for Flat-File Data},
  year = {2025},
  url = {https://CRAN.R-project.org/package=keyed},
  doi = {10.32614/CRAN.package.keyed}
}

Version: 0.2.0

License: MIT + file LICENSE

Maintainer: Gilles Colling

Last Published: February 25th, 2026

Functions in keyed (0.2.0)

| Function | Description |
|---|---|
| lock_no_na | Assert that columns have no missing values |
| lock_nrow | Assert row count within expected range |
| watch | Watch a keyed data frame for automatic drift detection |
| get_key_cols | Get key column names |
| has_id | Check if data frame has IDs |
| lock_coverage | Assert minimum coverage of values |
| lock_complete | Assert that data is complete (no missing values anywhere) |
| unwatch | Stop watching a keyed data frame |
| remove_id | Remove ID column |
| list_snapshots | List all snapshots in cache |
| stamp | Stamp a data frame as reference |
| key | Define a key for a data frame |
| has_key | Check if data frame has a key |
| keyed-package | keyed: Explicit Key Assumptions for Flat-File Data |
| extend_id | Extend IDs to new rows |
| key_status | Get key status summary |
| key_is_valid | Check if the key is still valid |
| compare_ids | Compare IDs between data frames |
| check_id | Check ID integrity |
| add_id | Add identity column |
| bind_keyed | Bind rows of keyed data frames |
| clear_all_snapshots | Clear all snapshots from cache |
| clear_snapshot | Clear snapshot for a data frame |
| compare_keys | Compare key values between two data frames |
| check_drift | Check for drift from committed snapshot |
| diff.keyed_df | Diff two keyed data frames |
| bind_id | Bind data frames with ID handling |
| check_id_disjoint | Check IDs are disjoint across datasets |
| find_duplicates | Find duplicate keys |
| lock_unique | Assert that columns are unique |
| get_id | Get ID column |
| make_id | Create ID from columns |
| compare_structure | Compare structure of two data frames |
| diagnose_join | Diagnose a join before executing |
| unkey | Remove key from a data frame |
| summary.keyed_df | Summary method for keyed data frames |