DQA (version 0.1.0)

check_missing_record: Check Missing Data by Record (Unit Check)

Description

Provides a high-level summary of data completeness across the entire dataset by classifying each row (or "record") as complete, incomplete, or missing.

Usage

check_missing_record(
  S_data,
  M_data,
  Show_Plot = FALSE,
  start_var = 1,
  skip_vars = NULL
)

Value

A single-row `data.frame` with summary counts and percentages: `Total_Rows`, `Complete_Count`, `Incomplete_Count`, `Missing_Count`, `Percent_Complete`, `Percent_Incomplete`, and `Percent_Missing`.
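
To make the shape of the return value concrete, the following is a minimal sketch that builds a single-row data frame with the same column names from a hypothetical vector of per-row statuses. The `status` vector and the 0-100 percentage scale are assumptions made for illustration; the package may compute or round these values differently.

```r
# Hypothetical per-row statuses (not produced by the package here)
status <- c("Complete", "Incomplete", "Incomplete", "Fully Missing")

total        <- length(status)
n_complete   <- sum(status == "Complete")
n_incomplete <- sum(status == "Incomplete")
n_missing    <- sum(status == "Fully Missing")

# Column names follow the Value section; the percentage scale (0-100) is assumed
summary_row <- data.frame(
  Total_Rows         = total,
  Complete_Count     = n_complete,
  Incomplete_Count   = n_incomplete,
  Missing_Count      = n_missing,
  Percent_Complete   = 100 * n_complete / total,
  Percent_Incomplete = 100 * n_incomplete / total,
  Percent_Missing    = 100 * n_missing / total
)
print(summary_row)
```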

Arguments

S_data

A data frame containing the source data to be checked.

M_data

A metadata data frame containing the validation rules.

Show_Plot

A logical value. If `TRUE`, a pie chart visualizing the proportions of complete, incomplete, and missing rows is displayed.

start_var

A numeric value indicating the starting variable index (from `M_data`) to include in the analysis. Defaults to 1.

skip_vars

A character or numeric vector of variables to exclude from the analysis. Can be variable names or column indices.

Details

For each row, this function evaluates all specified variables and determines the row's overall status using the same error logic as `check_missing_segments`. A row is:

  • **Complete:** All variables in the record have non-missing values for that row.

  • **Incomplete:** At least one variable in the record, but not all of them, is `NA` for that row.

  • **Fully Missing:** All variables in the record are `NA` for that row.
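
The three categories above can be illustrated with a simplified base R sketch. It classifies rows purely by counting `NA` values and deliberately ignores the dependency rules in `M_data`, so it is not the package's implementation; `classify_rows` and `toy` are hypothetical names used only for this example.

```r
# Simplified sketch: classify each row by how many of its values are NA.
# Dependency rules from the metadata are intentionally not applied here.
classify_rows <- function(df) {
  n_vars    <- ncol(df)
  n_missing <- rowSums(is.na(df))
  unname(ifelse(
    n_missing == 0, "Complete",
    ifelse(n_missing == n_vars, "Fully Missing", "Incomplete")
  ))
}

# Hypothetical toy data: one row of each kind
toy <- data.frame(
  Gender = c("Male", NA, "Female"),
  Age    = c(25, NA, NA),
  stringsAsFactors = FALSE
)

status <- classify_rows(toy)
print(status)
```

Row 1 has no `NA` values, row 2 has only `NA` values, and row 3 has a mix, so the sketch assigns one row to each category.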

The metadata (`M_data`) must contain the following columns to define the rules:

  • **VARIABLE:** The name of the variable in the source data (`S_data`) to be checked for missingness.

  • **VARIABLE_Code:** A unique numeric or character code assigned to each variable for identification and dependency mapping.

  • **Dependency:** Specifies the dependency of the variable on another variable. A value of `0` indicates no dependency, while other values indicate the `VARIABLE_Code` of the parent variable.

  • **Dep_Value:** The specific value or condition of the parent variable (as referenced in `Dependency`) that must be met for the current variable to be applicable. Use `"ANY"` if the value of the parent variable can be any non-missing value.
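
As an illustration of how the `Dependency` and `Dep_Value` columns could be read, the sketch below decides, for one variable, in which rows it is applicable given the value of its parent. `is_applicable`, `M`, and `S` are hypothetical names; the sketch only checks the direct parent (not chains of dependencies) and is an assumption about the rule semantics, not the package's internal code.

```r
# Hypothetical helper: for each row, is `child` applicable under its metadata rule?
is_applicable <- function(S_data, M_data, child) {
  rule <- M_data[M_data$VARIABLE == child, ]
  if (rule$Dependency == 0) {
    return(rep(TRUE, nrow(S_data)))  # no parent variable: always applicable
  }
  # Look up the parent variable by its VARIABLE_Code
  parent     <- M_data$VARIABLE[M_data$VARIABLE_Code == rule$Dependency]
  parent_val <- S_data[[parent]]
  if (rule$Dep_Value == "ANY") {
    !is.na(parent_val)               # any non-missing parent value qualifies
  } else {
    !is.na(parent_val) & as.character(parent_val) == rule$Dep_Value
  }
}

# Toy metadata and source data (assumed for illustration)
M <- data.frame(
  VARIABLE      = c("Has_Job", "Job_Title", "Job_Satisfaction"),
  VARIABLE_Code = 1:3,
  Dependency    = c(0, 1, 2),
  Dep_Value     = c("0", "Yes", "ANY"),
  stringsAsFactors = FALSE
)
S <- data.frame(
  Has_Job          = c("Yes", "No", NA),
  Job_Title        = c("Analyst", NA, NA),
  Job_Satisfaction = c(8, NA, NA),
  stringsAsFactors = FALSE
)

job_title_applicable <- is_applicable(S, M, "Job_Title")
job_sat_applicable   <- is_applicable(S, M, "Job_Satisfaction")
print(job_title_applicable)
print(job_sat_applicable)
```

Under this reading, a missing `Job_Title` in a row where `Has_Job` is `"No"` would not count as an error, because the variable is not applicable there.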

The function returns a single-row data frame summarizing the counts and percentages for the entire dataset.

See Also

Other missing data checks: check_missing_itemwise(), check_missing_segments()

Examples

library(DQA)

# 1. Define comprehensive sample data and metadata
Meta_data <- data.frame(
  stringsAsFactors = FALSE,
  VARIABLE = c(
    "ID", "Gender", "Age", "Has_Job", "Job_Title",
    "Job_Satisfaction", "Last_Promotion_Year", "Has_Insurance",
    "Insurance_Provider", "Annual_Checkup"
  ),
  VARIABLE_Code = 1:10,
  Var_order = 1:10,
  Segment_Names = c(
    "Demographic", "Demographic", "Demographic", "Employment", "Employment",
    "Employment", "Employment", "Health", "Health", "Health"
  ),
  Dependency = c(0, 0, 0, 0, 4, 5, 5, 0, 8, 8),
  Dep_Value = c(
    "0", "0", "0", "0", "Yes", "ANY", "ANY", "0", "Yes", "Yes"
  )
)

Source_data <- data.frame(
  ID = 1:10,
  Gender = c("Male", NA, "Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  Age = c(25, NA, 31, 55, 29, 38, 45, 22, 60, 33),
  Has_Job = c("Yes", NA, "No", "Yes", "Yes", "No", "Yes", "Yes", "No", "Yes"),
  Job_Title = c(NA, NA, NA, "Analyst", NA, "Student", "Director", "Engineer", NA, "Designer"),
  Job_Satisfaction = c(5, NA, NA, 8, 7, NA, 10, 9, NA, 6),
  Last_Promotion_Year = c(2020, NA, 2021, NA, NA, NA, 2024, 2022, NA, 2023),
  Has_Insurance = c("Yes", NA, "Yes", "Yes", "No", "Yes", "Yes", "No", "No", "Yes"),
  Insurance_Provider = c(
    "Provider A", NA, "Provider B", "Provider C", "Provider D", NA, "Provider E",
    NA, NA, "Provider F"
  ),
  Annual_Checkup = c("Yes", NA, "No", "Yes", NA, "Yes", "Yes", "No", NA, "Yes")
)

# 2. Run the row-wise check with the pie chart enabled, excluding the ID column
row_report <- check_missing_record(
  S_data = Source_data, M_data = Meta_data, skip_vars = "ID", Show_Plot = TRUE
)
print(row_report)