keyed

Primary keys for data frames.

In databases, you declare customer_id as a primary key and the database enforces uniqueness. With CSV and Excel files, you get no such guarantees - duplicates slip in silently.

keyed brings database-style protections to R data frames through four features:

Feature   What it does
Keys      Declare unique columns, enforced through transformations
Locks     Assert conditions (no NAs, row counts, coverage)
UUIDs     Track row identity through your pipeline
Commits   Snapshot data to detect drift

Installation

# install.packages("pak")
pak::pak("gcol33/keyed")

1. Keys

Declare which columns must be unique - like a primary key in a database.

library(keyed)

# Declare the key (errors if not unique)
customers <- read.csv("customers.csv") |>
  key(customer_id)

# Composite keys work too
sales <- key(sales, region, year)
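
If key() errors because a column is not unique, find_duplicates() can show you the offending rows before you decide how to deduplicate. A sketch; the call signature and output are assumptions based on the function reference:

# Inspect rows that share a customer_id before keying
raw <- read.csv("customers.csv")
find_duplicates(raw, customer_id)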

Keys follow your data through transformations:

# Base R
active <- customers[customers$status == "active", ]
has_key(active)
#> [1] TRUE

# dplyr
active <- customers |> filter(status == "active")
has_key(active)
#> [1] TRUE

Keys block operations that would break uniqueness:

customers |> mutate(customer_id = 1)
#> Error: Key is no longer unique after transformation.
#> i Use `unkey()` first if you intend to break uniqueness.

# To proceed, explicitly remove the key first
customers |> unkey() |> mutate(customer_id = 1)

Preview joins before running them:

diagnose_join(customers, orders, by = "customer_id")
#> Cardinality: one-to-many
#> customers: 1000 rows (unique)
#> orders:    5432 rows (4432 duplicates)
#> Left join will produce ~5432 rows

2. Locks

Assert conditions at checkpoints in your pipeline.

customers |>
  lock_unique(customer_id) |>    # Must be unique
  lock_no_na(email) |>           # No missing emails
  lock_nrow(min = 100)           # At least 100 rows

Locks error immediately if the condition fails - no silent continuation.
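
A minimal sketch of that failure mode, using a toy data frame with a missing email:

bad <- data.frame(customer_id = 1:2, email = c("a@example.com", NA))

# Errors immediately (exact message depends on the package),
# so nothing downstream runs on data with missing emails
bad |> lock_no_na(email)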

Available locks:

Function                           Checks
lock_unique(df, col)               No duplicate values
lock_no_na(df, col)                No missing values
lock_complete(df)                  No NAs in any column
lock_coverage(df, threshold, col)  % non-NA above threshold
lock_nrow(df, min, max)            Row count in range
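
A sketch of the checks not shown above, following the argument order in the table (the file, the column name, and the threshold being a proportion are assumptions):

survey <- read.csv("survey.csv")

survey |>
  lock_coverage(0.9, income) |>      # at least 90% of income values present
  lock_nrow(min = 500, max = 10000)  # row count within the expected range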

3. UUIDs

When your data has no natural key, generate stable row identifiers.

# Add a UUID to each row
customers <- add_id(customers)
customers
#>                .id  name
#> 1 a3f2c8e1b9d04567 Alice
#> 2 7b1e4a9c2f8d3601   Bob
#> 3 e9c7b2a1d4f80235 Carol

UUIDs survive all transformations:

filtered <- customers |> filter(name != "Bob")
get_id(filtered)
#> [1] "a3f2c8e1b9d04567" "e9c7b2a1d4f80235"

Track which rows were added or removed:

compare_ids(customers, filtered)
#> Lost: 1 row (7b1e4a9c2f8d3601)
#> Kept: 2 rows

UUIDs let you trace rows through joins, filters, and reshaping - essential for debugging data pipelines.
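
For example (a sketch; the orders table from above and dplyr's semi_join() are assumed), you can ask which customers survive a filtering join:

library(dplyr)

# Keep only customers that have at least one order
with_orders <- customers |> semi_join(orders, by = "customer_id")

# Which customers were dropped, by ID?
compare_ids(customers, with_orders)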


4. Commits

Snapshot your data to detect unexpected changes later.

# Save a snapshot (stored in memory for this session)
customers <- customers |> commit_keyed()

# Work with your data...
customers <- customers |>
  filter(status == "active") |>
  mutate(score = score + 10)

# Check what changed since the commit
check_drift(customers)
#> Drift detected!
#> - Row count: 1000 -> 847 (-153)
#> - Column 'score' modified

How it works:

  • Each data frame can have one snapshot attached
  • Snapshots persist for your R session (lost on restart)
  • check_drift() compares current state to the snapshot
  • clear_snapshot() removes it, list_snapshots() shows all

Useful for catching unexpected changes during interactive analysis.
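
A short sketch of the snapshot helpers named above (argument forms are assumptions based on the function reference):

# List snapshots held in this session
list_snapshots()

# Remove the snapshot attached to `customers`
clear_snapshot(customers)

# Or wipe every snapshot in the cache
clear_all_snapshots()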


When to Use Something Else

Need                  Better Tool
Enforced schema       SQLite, DuckDB
Full data validation  pointblank, validate
Production pipelines  targets

keyed gives you database-style protections without database infrastructure. It is aimed at exploratory workflows where SQLite is overkill but silent corruption is unacceptable.

Install

install.packages('keyed')

Version

0.1.3

License

MIT + file LICENSE

Maintainer

Gilles Colling

Last Published

February 6th, 2026

Functions in keyed (0.1.3)

lock_complete        Assert that data is complete (no missing values anywhere)
remove_id            Remove ID column
lock_no_na           Assert that columns have no missing values
unkey                Remove key from a data frame
compare_ids          Compare IDs between data frames
clear_snapshot       Clear snapshot for a data frame
clear_all_snapshots  Clear all snapshots from cache
check_id             Check ID integrity
check_id_disjoint    Check IDs are disjoint across datasets
bind_keyed           Bind rows of keyed data frames
add_id               Add identity column
commit_keyed         Commit a keyed data frame as reference
check_drift          Check for drift from committed snapshot
bind_id              Bind data frames with ID handling
get_id               Get ID column
find_duplicates      Find duplicate keys
compare_keys         Compare key values between two data frames
get_key_cols         Get key column names
extend_id            Extend IDs to new rows
compare_structure    Compare structure of two data frames
diagnose_join        Diagnose a join before executing
has_id               Check if data frame has IDs
lock_unique          Assert that columns are unique
key_is_valid         Check if the key is still valid
list_snapshots       List all snapshots in cache
lock_nrow            Assert row count within expected range
keyed-package        keyed: Explicit Key Assumptions for Flat-File Data
lock_coverage        Assert minimum coverage of values
make_id              Create ID from columns
has_key              Check if data frame has a key
summary.keyed_df     Summary method for keyed data frames
key                  Define a key for a data frame
key_status           Get key status summary