leakR
Welcome to leakR, an R package designed to help researchers, data scientists, and machine learning practitioners rigorously detect and diagnose data leakage in their workflows.
Data leakage is a pervasive yet often overlooked issue that undermines the integrity and reproducibility of predictive models by allowing unintended information to "leak" between training and testing phases. leakR provides a modular, extensible toolkit for detecting the most common and impactful forms of leakage, starting with tabular data contamination, target leakage, and temporal misalignments, while laying the foundation for a universal leakage detection framework across diverse data domains.
Installation
From CRAN (Recommended)
install.packages("leakr")
From GitHub (Development Version)
For the latest features and bug fixes:
# Install devtools if you don't have it
install.packages("devtools")
# Install leakR from GitHub
devtools::install_github("cherylisabella/leakR")
Quick Start
library(leakr)
# Basic audit of your dataset
report <- leakr_audit(iris, target = "Species")
# View summary of issues found
leakr_summarise(report)
# Generate diagnostic visualizations
leakr_plot(report)
# Access detailed results
print(report)
Main Functions
| Function | Purpose |
|---|---|
| leakr_audit() | Main auditing function - detects leakage across your dataset |
| leakr_summarise() | Generate human-readable summaries of detected issues |
| leakr_plot() | Create diagnostic visualizations highlighting problems |
| leakr_from_caret() | Import and audit caret workflow objects |
| leakr_from_tidymodels() | Import and audit tidymodels workflow objects |
| leakr_from_mlr3() | Import and audit mlr3 workflow objects |
Learn More
Get started with the comprehensive vignettes:
# Getting started guide
vignette("getting-started", package = "leakr")
# Advanced detection techniques
vignette("advanced-detection", package = "leakr")
# Framework integration examples
vignette("framework-integration", package = "leakr")
Why leakR?
- Automates leakage detection, filling a key methodological gap
- Designed for clarity, reproducibility, and transparent ML research
- Modular architecture supports gradual expansion (time series, NLP, images)
- Useful for both academic and industry workflows
What leakR Detects
- Train/test contamination - Overlapping records between training and test sets
- Target leakage - Features that contain information about the target variable that wouldn't be available at prediction time
- Duplicate rows/records - Exact and near-duplicate observations that can inflate performance metrics
- Temporal misalignments - Time-based leaks in time series analysis, such as future observations informing the training data
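To make the four categories above concrete, here is a minimal base R sketch of what each check looks for. It is an illustration only and does not show how leakR implements its detectors: the toy data, the column names (`x`, `leaky`, `y`, `date`), the `row_key()` helper, and the 0.95 correlation threshold are all assumptions made for this example.

```r
# Toy data with hypothetical column names; none of this comes from leakR itself.
train <- data.frame(
  x     = c(1, 2, 3, 2, 5),
  leaky = c(0.1, 0.2, 1.1, 0.2, 1.3),
  y     = c(0, 0, 1, 0, 1),
  date  = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03",
                    "2024-01-04", "2024-01-10"))
)
test <- data.frame(
  x     = c(3, 6, 7),
  leaky = c(1.1, 0.1, 1.2),
  y     = c(1, 0, 1),
  date  = as.Date(c("2024-01-05", "2024-01-06", "2024-01-07"))
)
feature_cols <- c("x", "leaky")

# Hypothetical helper: one string key per row, so rows can be compared
# across two data frames.
row_key <- function(df, cols) {
  do.call(paste, c(as.list(df[cols]), sep = "\r"))
}

# 1. Train/test contamination: test rows whose features also occur in train.
contaminated <- row_key(test, feature_cols) %in% row_key(train, feature_cols)

# 2. Duplicate rows: exact repeats within the training features.
dup_rows <- duplicated(train[feature_cols])

# 3. Target leakage (crude heuristic): features almost perfectly correlated
#    with the target. The 0.95 cut-off is an arbitrary assumption.
target_cor <- sapply(train[feature_cols], function(col) abs(cor(col, train$y)))
suspicious <- names(target_cor)[target_cor > 0.95]

# 4. Temporal misalignment: training rows dated on or after the first test row.
late_train <- train$date >= min(test$date)

print(which(contaminated))
print(which(dup_rows))
print(suspicious)
print(which(late_train))
```

Exact matching and a single correlation threshold are deliberately simplistic. Near-duplicates, non-linear target leakage, and grouped time series need more careful checks than this sketch performs.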
Key Features
- Visual summaries of suspicious patterns and leakage hotspots
- Detailed leakage reports suitable for audits, peer review, or publications
- Clean APIs for seamless integration into existing ML workflows
- Example vignettes demonstrating real leakage phenomena with code illustrations
- Framework integration with caret, tidymodels, and mlr3
Development Roadmap
- Phase 1: Core tabular leakage detectors ✓
- Phase 2: Time series leakage detection (in progress)
- Phase 3: Domain-specific extensions (NLP, image pipelines)
- Phase 4: Pipeline integration and multi-language support
Citation
If you use leakR in your research, please cite:
@Manual{leakr2025,
title = {leakR: Data Leakage Detection Tools for Machine Learning},
author = {Cheryl Isabella Lim},
year = {2025},
note = {R package version 0.1.0},
url = {https://github.com/cherylisabella/leakR},
}
License
This project is licensed under the MIT License - see the LICENSE file for details.
leakR is currently under development. Feedback and contributions from the community are welcome!