leakR

Welcome to leakR, an R package designed to help researchers, data scientists, and machine learning practitioners rigorously detect and diagnose data leakage in their workflows.

Data leakage is a pervasive yet often overlooked problem: when unintended information passes between the training and testing phases, the integrity and reproducibility of predictive models suffer. leakR provides a modular, extensible toolkit for detecting the most common and impactful forms of leakage. It starts with tabular data contamination, target leakage, and temporal misalignments, and lays the foundation for a universal leakage detection framework across diverse data domains.

Installation

From CRAN (Recommended)

install.packages("leakr")

From GitHub (Development Version)

For the latest features and bug fixes:

# Install devtools if you don't have it
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Install leakR from GitHub
devtools::install_github("cherylisabella/leakR")

Quick Start

library(leakr)

# Basic audit of your dataset
report <- leakr_audit(iris, target = "Species")

# View summary of issues found
leakr_summarise(report)

# Generate diagnostic visualizations
leakr_plot(report)

# Access detailed results
print(report)
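
To see what an audit is looking for, it can help to run it on data with a known problem. The sketch below builds a toy data frame in base R with a feature derived from the target and a few duplicated rows. The column names (`age`, `outcome`, `discharge_code`) are hypothetical, and the `leakr_audit()` call uses only the signature shown above; it is skipped when the package is not installed. What the report flags for this data is not asserted here.

```r
# Toy data with deliberately planted problems (all names are illustrative).
set.seed(42)
n <- 100
toy <- data.frame(
  age     = round(rnorm(n, mean = 50, sd = 10)),
  outcome = factor(sample(c("no", "yes"), n, replace = TRUE))
)

# "discharge_code" is derived directly from the outcome, so it would not be
# available at prediction time: a textbook case of target leakage.
toy$discharge_code <- ifelse(toy$outcome == "yes", 1L, 0L)

# Append copies of the first five rows to simulate duplicate records.
toy <- rbind(toy, toy[1:5, ])

# Run the audit only if leakr is installed. The call mirrors the Quick Start
# signature and assumes nothing else about the API.
if (requireNamespace("leakr", quietly = TRUE)) {
  report <- leakr::leakr_audit(toy, target = "outcome")
  print(report)
}
```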

Main Functions

| Function | Purpose |
|---|---|
| `leakr_audit()` | Main auditing function - detects leakage across your dataset |
| `leakr_summarise()` | Generate human-readable summaries of detected issues |
| `leakr_plot()` | Create diagnostic visualizations highlighting problems |
| `leakr_from_caret()` | Import and audit caret workflow objects |
| `leakr_from_tidymodels()` | Import and audit tidymodels workflow objects |
| `leakr_from_mlr3()` | Import and audit mlr3 workflow objects |

Learn More

Get started with the comprehensive vignettes:

# Getting started guide
vignette("getting-started", package = "leakr")

# Advanced detection techniques
vignette("advanced-detection", package = "leakr") 

# Framework integration examples
vignette("framework-integration", package = "leakr")

Why leakR?

Automates leakage detection, filling a key methodological gap
Designed for clarity, reproducibility, and transparent ML research
Modular architecture supports gradual expansion (time series, NLP, images)
Useful for both academic and industry workflows

What leakR Detects

Train/test contamination - Overlapping records between training and test sets
Target leakage - Features that encode information about the target variable which would not be available at prediction time
Duplicate rows/records - Exact and near-duplicate observations that can inflate performance metrics
Temporal misalignments - Time-based leaks, such as training on observations dated after the period being predicted in time series analysis
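
The leakage types listed above can be illustrated with a few lines of base R. This is a minimal sketch of the underlying ideas on a made-up data frame, not a description of how leakR implements its detectors; the data and variable names are assumptions chosen for illustration.

```r
# Illustrative base R checks; these are not leakR internals.
records <- data.frame(
  id    = 1:10,
  date  = as.Date("2024-01-01") + 0:9,
  value = c(5, 3, 8, 1, 9, 2, 7, 4, 6, 10)
)

# A flawed split: rows 7 and 8 land in both sets.
train <- records[1:8, ]
test  <- records[7:10, ]

# Train/test contamination: rows present in both sets.
# merge() with no `by` joins on all shared columns, i.e. exact row matches.
overlap   <- merge(train, test)
n_overlap <- nrow(overlap)   # 2

# Exact duplicates within a single set (row 1 appended a second time).
train_dup <- rbind(train, train[1, ])
n_dup     <- sum(duplicated(train_dup))   # 1

# Temporal misalignment: training rows dated on or after the earliest test row.
n_future <- sum(train$date >= min(test$date))   # 2
```

Any non-zero count here signals that the split should be revisited before performance metrics are trusted. Near-duplicate detection and target leakage need more than exact matching and are not covered by this sketch.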

Key Features

Visual summaries of suspicious patterns and leakage hotspots
Detailed leakage reports suitable for audits, peer review, or publications
Clean APIs for seamless integration into existing ML workflows
Example vignettes demonstrating real leakage phenomena with code illustrations
Framework integration with caret, tidymodels, and mlr3

Development Roadmap

Phase 1: Core tabular leakage detectors ✓
Phase 2: Time series leakage detection (in progress)
Phase 3: Domain-specific extensions (NLP, image pipelines)
Phase 4: Pipeline integration and multi-language support

Citation

If you use leakR in your research, please cite:

@Manual{leakr2025,
  title = {leakR: Data Leakage Detection Tools for Machine Learning},
  author = {Cheryl Isabella Lim},
  year = {2025},
  note = {R package version 0.1.0},
  url = {https://github.com/cherylisabella/leakR},
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

leakR is currently under development. Feedback and contributions from the community are welcome!
