scores: Scoring Functions for Data Quality Checks

Description

A set of functions to assess various aspects of data quality, including a comprehensive dataset score as well as individual scores for specific data quality dimensions such as date consistency, duplicates, recency, frequency, time, coding, comments, sources, missing values, and variables.

According to the literature (see References), data quality can be assessed by checking the consistency, completeness, accuracy, timeliness, and uniqueness of the data. Consistency means that the data is logically coherent, completeness means that all required data is present, accuracy means that the data is correct and reliable, timeliness means that the data is up to date, and uniqueness means that there are no duplicate records.

Usage

score_dataset(df)
score_obs_no(df)
score_var_no(df)
score_completeness(df)
score_date_consistency(df)
score_date_scope(df)
score_obs_info(df, id_col = "ID")
score_coding(df)
score_comments(df)
score_var_info(df)

Arguments

df: A data frame to be scored.
id_col: The name of the column containing IDs. Default is "ID".

Details

These functions help assess the quality of the data in a data frame. Each function checks one aspect of the data and returns a score or a message describing the quality of that aspect. The functions include:

score_date_consistency: Proportion of invalid date pairs (End <= Begin).
score_duplicates: Proportion of duplicate IDs.
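
The two proportions described above can be illustrated with a minimal sketch in base R. This is not the package's implementation: the helper names prop_invalid_dates and prop_duplicate_ids, the column names "Begin" and "End", and the toy data frame are assumptions made purely for illustration.

```r
# Hypothetical helpers (not part of the package) illustrating the two metrics.
# Assumption: "Begin" and "End" are Date columns and "ID" holds identifiers.

prop_invalid_dates <- function(df, begin_col = "Begin", end_col = "End") {
  begin <- df[[begin_col]]
  end <- df[[end_col]]
  ok <- !is.na(begin) & !is.na(end)   # compare only rows where both dates are known
  if (!any(ok)) return(NA_real_)      # nothing to compare
  mean(end[ok] <= begin[ok])          # share of pairs with End <= Begin
}

prop_duplicate_ids <- function(df, id_col = "ID") {
  ids <- df[[id_col]]
  if (length(ids) == 0) return(NA_real_)
  mean(duplicated(ids))               # share of rows repeating an earlier ID
}

# Toy data, invented for this example
toy <- data.frame(
  ID    = c("a", "b", "b", "c"),
  Begin = as.Date(c("2000-01-01", "2001-01-01", "2002-01-01", "2003-01-01")),
  End   = as.Date(c("2000-12-31", "2000-06-01", "2002-01-01", NA))
)

prop_invalid_dates(toy)   # 2 of the 3 complete pairs have End <= Begin
prop_duplicate_ids(toy)   # 1 of the 4 rows repeats an earlier ID
```

In this sketch, rows with a missing date are excluded from the date comparison rather than counted as invalid; whether the package treats missing dates the same way is not stated here.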

References

Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4), 5-34.

Examples

score_dataset(emperors)
score_obs_no(emperors)
score_var_no(emperors)
score_completeness(emperors)
score_date_consistency(emperors)
score_date_scope(emperors)
score_obs_info(emperors)
score_var_info(emperors)