leakR

Welcome to leakR, an R package designed to help researchers, data scientists, and machine learning practitioners rigorously detect and diagnose data leakage in their workflows.

Data leakage is a pervasive yet often overlooked issue that undermines the integrity and reproducibility of predictive models by allowing unintended information to "leak" between training and testing phases. leakR provides a modular, extensible toolkit for detecting the most common and impactful forms of leakage. It starts with tabular data contamination, target leakage, and temporal misalignments, and lays the foundation for a universal leakage detection framework across diverse data domains.

Installation

From CRAN (Recommended)

install.packages("leakr")

From GitHub (Development Version)

For the latest features and bug fixes:

# Install devtools if you don't have it
install.packages("devtools")

# Install leakR from GitHub
devtools::install_github("cherylisabella/leakR")

Quick Start

library(leakr)

# Basic audit of your dataset
report <- leakr_audit(iris, target = "Species")

# View summary of issues found
leakr_summarise(report)

# Generate diagnostic visualizations
leakr_plot(report)

# Access detailed results
print(report)

Main Functions

Function                   Purpose
leakr_audit()              Main auditing function - detects leakage across your dataset
leakr_summarise()          Generate human-readable summaries of detected issues
leakr_plot()               Create diagnostic visualizations highlighting problems
leakr_from_caret()         Import and audit caret workflow objects
leakr_from_tidymodels()    Import and audit tidymodels workflow objects
leakr_from_mlr3()          Import and audit mlr3 workflow objects

Learn More

Get started with the comprehensive vignettes:

# Getting started guide
vignette("getting-started", package = "leakr")

# Advanced detection techniques
vignette("advanced-detection", package = "leakr")

# Framework integration examples
vignette("framework-integration", package = "leakr")

Why leakR?

  • Automates leakage detection, filling a key methodological gap
  • Designed for clarity, reproducibility, and transparent ML research
  • Modular architecture supports gradual expansion (time series, NLP, images)
  • Useful for both academic and industry workflows

What leakR Detects

  • Train/test contamination - Overlapping records between training and test sets
  • Target leakage - Features that contain information about the target variable that wouldn't be available at prediction time
  • Duplicate rows/records - Exact and near-duplicate observations that can inflate performance metrics
  • Temporal misalignments - Time-based data leaks in time series analysis
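
The four categories above can be illustrated with a minimal base R sketch. This is not leakR's implementation and does not call its API; the toy data, the helper `row_key()`, the column names, and the 0.95 correlation threshold are all assumptions chosen for illustration.

```r
# Toy data (hypothetical): test row 1 repeats train row 3, and the first
# test observation is dated inside the training period.
train <- data.frame(
  id  = 1:6,
  x   = c(2, 9, 4, 7, 1, 6),
  y   = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  day = as.Date("2024-01-01") + 0:5
)
test <- data.frame(
  id  = c(3, 7, 8),
  x   = c(4, 3, 8),
  y   = c(2.9, 7.2, 7.9),
  day = as.Date("2024-01-01") + c(2, 6, 7)
)

# 1. Train/test contamination: build one string key per row and look for
#    test rows whose key also occurs in the training set.
row_key <- function(df) {
  do.call(paste, c(unname(as.list(df)), list(sep = "|")))
}
overlap <- test[row_key(test) %in% row_key(train), ]

# 2. Duplicate records: count exact duplicates in the pooled data.
combined <- rbind(train, test)
n_duplicates <- sum(duplicated(combined))

# 3. Target leakage: flag features that track the target almost perfectly.
#    `leaky` is a rescaled copy of y plus tiny noise (assumed for the demo).
features <- data.frame(
  x     = train$x,
  leaky = train$y * 10 + c(0.1, -0.1, 0.1, -0.1, 0.1, -0.1)
)
target_cor <- vapply(features, function(col) abs(cor(col, train$y)), numeric(1))
suspicious <- names(target_cor)[target_cor > 0.95]

# 4. Temporal misalignment: test observations should come strictly after
#    the last training observation.
temporal_overlap <- test$day <= max(train$day)

print(overlap)
print(n_duplicates)
print(suspicious)
print(test[temporal_overlap, ])
```

A correlation threshold is only a crude heuristic: a legitimately strong predictor can exceed it, and a leaky feature with a nonlinear relationship to the target can fall below it. Near-duplicate detection would also need a distance or fuzzy-matching step that this sketch omits.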

Key Features

  • Visual summaries of suspicious patterns and leakage hotspots
  • Detailed leakage reports suitable for audits, peer review, or publications
  • Clean APIs for seamless integration into existing ML workflows
  • Example vignettes demonstrating real leakage phenomena with code illustrations
  • Framework integration with caret, tidymodels, and mlr3

Development Roadmap

  • Phase 1: Core tabular leakage detectors ✓
  • Phase 2: Time series leakage detection (in progress)
  • Phase 3: Domain-specific extensions (NLP, image pipelines)
  • Phase 4: Pipeline integration and multi-language support

Citation

If you use leakR in your research, please cite:

@Manual{leakr2025,
  title = {leakR: Data Leakage Detection Tools for Machine Learning},
  author = {Cheryl Isabella Lim},
  year = {2025},
  note = {R package version 0.1.0},
  url = {https://github.com/cherylisabella/leakR},
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

leakR is currently under development. Feedback and contributions from the community are welcome!

Monthly Downloads: 223

Version: 0.1.0

License: MIT + file LICENSE

Maintainer: Cheryl Isabella

Last Published: October 26th, 2025

Functions in leakr (0.1.0)

  • generate_recommendations - Generate actionable recommendations based on report findings
  • generate_recommendations_section - Format recommendations for output
  • leakr_summarise - Enhanced summarise with better formatting
  • list_registered_detectors - List registered detectors
  • generate_executive_summary_text - Report generator
  • generate_issues_section - Generate detailed issues section with output formatting and truncation
  • leakr_from_tidymodels - Convert tidymodels workflow to standard format
  • plot.detector_result - Plot a detector_result object
  • import_parquet - Import Parquet files
  • import_json - Import JSON files with better structure handling
  • leakr_import - Import data from various sources for leakage analysis
  • leakr_list_snapshots - List available snapshots with enhanced information
  • register_detector - Register a new detector
  • print.leakr_report - Print method for leakr_report
  • import_rds - Import RDS files with validation
  • leakr_load_snapshot - Load data snapshot with enhanced validation
  • plot.udld_report - Plot a udld_report object
  • leakr_create_snapshot - Create data snapshots with improved metadata handling
  • leakr_export_data - Export data in various formats
  • import_excel - Import Excel files with enhanced sheet support
  • import_tsv - Import TSV files with robust parsing
  • grapes-or-or-grapes (the %||% operator) - Null-coalescing operator for clean default value handling
  • leakr_plot - Plot leakage detection results
  • leakr_from_caret - Convert caret training objects to standard format
  • validate_and_preprocess_data - Robust data validation and preprocessing
  • stratified_sample - Stratified sampling helper
  • leakr_quick_import - Fast import with default preprocessing
  • leakr_audit - Audit dataset for data leakage
  • preprocess_imported_data - Enhanced preprocessing with better performance and robustness
  • leakr - leakr: Data Leakage Detection for Machine Learning in R
  • prepare_audit_data - Enhanced data preparation with robust preprocessing
  • new_train_test_detector - Create a new train-test detector
  • new_temporal_detector - Create a new temporal detector
  • leakr_from_mlr3 - Convert mlr3 Task objects to standard format
  • run_detector - Run a detector on data
  • run_detectors - Run multiple detectors on audit data
  • validate_imported_data - Enhanced data validation with better error messages
  • .onLoad - Initialise built-in detectors
  • compile_report - Enhanced report compilation with numeric severity scores
  • format_detector_name - Format detector names for display
  • export_data_internal - Export data with consistent messaging
  • get_detector_info - Get detector information
  • detect_file_format - Detect file format from extension and content
  • empty_snapshot_info - Helper function to return an empty snapshot info dataframe
  • detect_and_convert_dates_enhanced - Enhanced date detection handling multiple formats and data types
  • clean_column_names - Enhanced column name cleaning with better robustness
  • detector_registry - Registry-based detector system
  • determine_risk_level - Determine risk level and CSS class from severity counts
  • import_csv - Import CSV files with robust parsing
  • generate_diagnostic_plots - Generate diagnostic plots for a leakr_report
  • generate_evidence_section - Generate evidence section with format-specific handling and DRY logic
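
The entry listed above as grapes-or-or-grapes is the documentation alias for an operator spelled `%||%`. As a rough illustration of what a null-coalescing operator does, here is a minimal sketch using the conventional definition; it is an assumption that leakr's version behaves the same way, and the `opts` list and its fields are hypothetical.

```r
# Conventional definition (assumed, not copied from leakr): return the
# left-hand side unless it is NULL, in which case return the right-hand side.
# Base R (>= 4.4.0) and rlang ship equivalent operators.
`%||%` <- function(x, y) if (is.null(x)) y else x

# Hypothetical options list with one field set and one missing.
opts <- list(threshold = 0.9)

threshold <- opts$threshold %||% 0.8   # field present, so its value is kept
max_rows  <- opts$max_rows %||% 1000   # field absent (NULL), so the default applies

print(threshold)
print(max_rows)
```

The operator only tests for NULL, so values such as NA, 0, or an empty string on the left-hand side are returned unchanged.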