
data.checker

data.checker is a package for helping with boilerplate data checks. It enables you to automate fundamental data checks which, while simple, can be time-consuming to implement.

data.checker:

  • Checks data against a user-supplied schema that defines the expected columns and data types

  • Lets users add custom data checks based on multiple columns

  • Creates exports of the results for QA

Getting Started

Installation

Software requirements

To use this package, you’ll need the following software on your computer:

  1. RStudio 2024.04.2 or later and R 4.5.0 or later
  2. Git 2.35.3 or later
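A quick way to confirm the R and Git requirements from the R console is sketched below. It uses only base R; the variable names are illustrative, and the RStudio version is not checked here.

```r
# Minimum R version taken from the requirements list above.
required_r <- "4.5.0"
r_ok <- getRversion() >= required_r
message(
  "R ", as.character(getRversion()),
  if (r_ok) " meets" else " does not meet",
  " the ", required_r, " requirement"
)

# Sys.which() returns an empty string when the executable is not on the PATH.
git_path <- Sys.which("git")
if (nzchar(git_path)) {
  # Prints the installed Git version so it can be compared with 2.35.3 by eye.
  message(system2("git", "--version", stdout = TRUE))
} else {
  message("git was not found on the PATH")
}
```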

To install this R package, first clone the repository to your local machine by running:

git clone https://github.com/ONSdigital/data.checker.git

Open the project in RStudio and in the console run:

devtools::install()

The package will be installed in your R library.

Setup and Usage

data.checker requires an input data frame and a data schema to validate against. The checks it performs, and how to add custom checks, are described under Pre-Defined and Adding Custom Checks. The schema can be defined within the R script itself or saved to a JSON or YAML file for the data checker to load. We recommend saving schemas as JSON or YAML, which simplifies adding further checks and column information. Once the schema is defined, pass the data frame and schema to the check_and_export function, together with an output file path and format for the report and the hard_check option.

library(data.checker)

df <- data.frame(
  age = c(10, 11, 13, 15, 22, 34, 80),
  sex = c("M", "F", "M", "F", "M", "F", "M")
)

my_schema <- list(
  check_duplicates = TRUE,
  check_completeness = FALSE,
  columns = list(
    age = list(type = "integer", optional = FALSE),
    sex = list(type = "character", optional = FALSE)
  )
)

check_and_export(
  data = df,
  schema = my_schema,
  file = "report.csv",
  format = "csv",
  hard_check = TRUE
)

This produces a report.csv file containing the status of each validation check. With hard_check set to TRUE, the code stops running if any validation check fails. The report is still written before the stop, so you can view it and investigate the cause of the failure.
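The setup text recommends keeping the schema in a JSON or YAML file. The sketch below round-trips the same schema through JSON. It assumes the jsonlite package is installed (it is not mentioned in the original text), and the file path is hypothetical. How data.checker itself loads a schema file is not shown here, so this only illustrates the file layout.

```r
library(jsonlite)  # assumption: jsonlite is installed

my_schema <- list(
  check_duplicates = TRUE,
  check_completeness = FALSE,
  columns = list(
    age = list(type = "integer", optional = FALSE),
    sex = list(type = "character", optional = FALSE)
  )
)

# Hypothetical file location, used for illustration only.
schema_path <- file.path(tempdir(), "schema.json")

# auto_unbox = TRUE writes length-one values as JSON scalars rather than arrays.
write_json(my_schema, schema_path, auto_unbox = TRUE, pretty = TRUE)

# simplifyVector = FALSE keeps the nested list structure when reading back.
schema_from_file <- read_json(schema_path, simplifyVector = FALSE)
str(schema_from_file)
```

For YAML, the yaml package offers write_yaml() and read_yaml(), which can be used in the same way.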

Pre-Defined and Adding Custom Checks

Pre-Defined Checks

These checks can be included in the lists for individual columns in your schema, depending on the data type.

| Data Type | Check Name | Parameter | Check Definition |
|---|---|---|---|
| integer / double | Minimum value | min_val | Checks that all values are above or equal to the minimum value |
| integer / double | Maximum value | max_val | Checks that all values are below or equal to the maximum value |
| integer / double | Interquartile range (IQR) outlier check | iqr_check | Checks that all values fall between $Q1 - (\text{IQR}\cdot\text{multiplier})$ and $Q3 + (\text{IQR}\cdot\text{multiplier})$, where the multiplier is given by iqr_check |
| integer / double | Maximum absolute z score | max_z_score | Checks that the absolute values of all z scores are below or equal to the maximum z score |
| character | Minimum length | min_length | Checks that all strings have a length above or equal to the minimum length |
| character | Maximum length | max_length | Checks that all strings have a length below or equal to the maximum length |
| date / datetime | Minimum Date | min_date | Checks that all dates are after the minimum date, using the format "YYYY-MM-DD" |
| date / datetime | Maximum Date | max_date | Checks that all dates are before the maximum date, using the format "YYYY-MM-DD" |
| date / datetime | Minimum Datetime | min_datetime | Checks that all dates are after the minimum datetime. Accepted formats: Y, YM, YMD, YMDH, YMDHM and YMDHMS |
| date / datetime | Maximum Datetime | max_datetime | Checks that all dates are before the maximum datetime. Accepted formats: Y, YM, YMD, YMDH, YMDHM and YMDHMS |
| any | Allowed values | allowed_values | Validates that entries match a set of permitted values; a list or a regex can be used. (Optional; forbidden values can be used instead.) |
| any | Forbidden values | forbidden_values | Validates that entries do not contain any of a set of forbidden values; a list or a regex can be used. (Optional; allowed values can be used instead.) |
| any | Missing values check | allow_na | Checks for missing or NA values in the column. |
| any | Class | class | Checks that the class of the column data matches the specified type |
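To make the IQR and z score definitions concrete, here is a base R sketch applied to the age values from the Setup and Usage example. It is an illustration of the formulas only: the quantile method, the multiplier of 1.5 and the use of the sample standard deviation are assumptions, and the package's own implementation may differ.

```r
age <- c(10, 11, 13, 15, 22, 34, 80)

# IQR outlier bounds with a multiplier of 1.5 (the value iqr_check would carry).
multiplier <- 1.5
q1 <- unname(quantile(age, 0.25))  # 12 with R's default quantile method
q3 <- unname(quantile(age, 0.75))  # 28 with R's default quantile method
iqr <- q3 - q1
lower <- q1 - iqr * multiplier
upper <- q3 + iqr * multiplier
iqr_outliers <- age[age < lower | age > upper]

# Absolute z scores, using the sample mean and standard deviation.
z <- abs((age - mean(age)) / sd(age))

print(c(lower = lower, upper = upper))  # lower = -12, upper = 52
print(iqr_outliers)                     # 80 lies above the upper bound
print(max(z) <= 3)                      # TRUE: no absolute z score exceeds 3
```

With these values, 80 would fail an IQR check with a multiplier of 1.5 but pass a maximum absolute z score of 3.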

Adding Custom Checks

Additionally, you can write your own checks and add them to the validator object using the add_check function. This is particularly useful for checks involving more than one column, which cannot be configured using the standard template. Checks are evaluated in the context of the original data, so you can reference columns as if they were variables in the environment (similar to tidy evaluation). This approach is recommended because it guarantees that the checks run on the correct data only. Alternatively, you can use standard evaluation (see the example below).

The example below demonstrates how to incorporate both pre-defined and custom checks into your validation.

df <- data.frame(
  id = 1:10,
  age = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
  sex = c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F")
)

schema <- list(
  check_duplicates = TRUE,
  check_completeness = FALSE,
  columns = list(
    id = list(type = "double", optional = FALSE),
    age = list(type = "double", optional = FALSE, min_val = 0),
    sex = list(type = "character", optional = FALSE, allowed_values = c("M", "F"))
  )
)

data_check_results <- data.checker::new_validator(df, schema) |>
  data.checker::check() |>
  data.checker::add_check(
    description = "There are no males over 90 (tidy evaluation)",
    condition = !(sex == "M" & age > 90)
  ) |>
  data.checker::add_check(
    description = "There are no males over 90 (standard evaluation)",
    condition = !(df$sex == "M" & df$age > 90)
  )

print(data_check_results)
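The difference between the two styles of condition can be illustrated in base R, without the package. This sketch shows the general idea of evaluating an expression with the data frame's columns in scope; how add_check captures and evaluates its condition internally is not shown in this document, so treat it as an analogy rather than the package's mechanism.

```r
df <- data.frame(
  age = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
  sex = c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F")
)

# Capture the condition without evaluating it.
condition <- quote(!(sex == "M" & age > 90))

# Evaluate it with the data frame's columns in scope (data masking).
masked <- eval(condition, envir = df, enclos = parent.frame())

# Standard evaluation spells out the data frame each time.
standard <- !(df$sex == "M" & df$age > 90)

identical(masked, standard)  # TRUE
all(masked)                  # TRUE: the only age over 90 is 100, in an "F" row
```

The masked form cannot accidentally refer to a different data frame of the same shape, which is the benefit the text above describes.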

Contributing

We always welcome contributions and suggestions to improve the functionality of our products. Feel free to open an issue using the Issues tab. If you wish to contribute directly, please fork the repository, make your changes and raise a pull request so that we can review and merge them.

Install

install.packages('data.checker')

Version

2.0.0

License

MIT + file LICENSE

Maintainer

Analysis Standards and Pipelines Team (ONS)

Last Published

June 8th, 2026

Functions in data.checker (2.0.0)

check_backseries
Check backseries consistency

add_check_custom
Add a custom check to the validator

check_colnames
Check column names against schema

check_completeness
Check dataset for missing columns

check_and_export
Validate data against a schema and output results

check
Validate a Validator object

add_check
Add a custom check to the validator

add_qa_entry
Add a QA entry to the validator's QA log

check_duplicates
Check for duplicate rows. Uses a subset of columns if duplicates_cols is specified in the schema; otherwise, all columns are used.

check_column_contents
Check column contents against schema and checks

iqr_bounds
Flag outliers based on the interquartile range (IQR). Outliers are flagged if they are below Q1 - (multiplier * IQR) or above Q3 + (multiplier * IQR).

export
Generic export function

export.Validator
Export Validator log

check_schema_contents_against_df
Check schema contents against the data frame provided

is_column_contents_valid
Check column contents valid

check_types
Check column types and classes

hard_checks_status
Check the status of errors and warnings in the validator log

is_valid_column_values
Check that max values are not less than min values in column schema

data.checker-package
data.checker: Data Checker for Validating Data Frames Against Defined Schema

is_type_valid
Check type of column in schema is valid

z_score
Check z score of numeric columns

print.Validator
Print Validator log

validate_and_convert_date_formats
Validate date formats in the schema. This function checks that any date formats specified in the schema are valid and can be parsed correctly.

types_to_classes
Convert complex types to the correct types and classes

new_validator
Validator constructor

log_html
Generate HTML representation of a log

run_checks
Run column checks

is_valid_schema
Check if the schema is valid

log_to_table
Convert Validator log to table

log_pointblank_outcomes
Log pointblank validation outcomes to a validator log

%>%
Pipe operator