DQA (version 0.1.0)

check_missing_itemwise: Check Missing Data Item-wise with Dependency Logic

Description

Analyzes missing data (`NA` values) for each variable (item-wise) by considering dependencies between variables. This function goes beyond simple NA counts by classifying missingness into different categories based on rules defined in metadata.

Usage

check_missing_itemwise(
  S_data,
  M_data,
  var_select = 1:nrow(M_data),
  Show_Plot = FALSE
)

Value

A `data.table` summarizing the missing data analysis for each variable, with columns such as `VARIABLE`, `Missing_Count`, `Jump_Count`, `Unexpected_Count`, `Total_Applicable` (the number of rows in which the variable's value was expected to be completed, based on the metadata rules), `Percent_Complete`, and `Percent_Missing`.

Arguments

S_data

A data frame containing the source data to be checked.

M_data

A metadata data frame containing the validation rules.

var_select

A numeric or character vector specifying which variables to process. Can be indices or names from the `VARIABLE` column of `M_data`. Defaults to all variables.

Show_Plot

A logical value. If `TRUE`, a ggplot bar chart showing the missingness percentage for each variable is displayed.

Details

This function classifies each row for a given variable into one of four states:

  • **Completed:** The value is present where it is expected.

  • **Missing:** The value is `NA` where it was expected (based on a parent condition).

  • **Jump:** The value is `NA` because the parent condition was not met (i.e., the question was correctly skipped).

  • **Unexpected:** The value is present where it was *not* expected (a data quality issue).
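The four states above can be sketched in base R for a single child/parent pair. This is only an illustration of the classification logic, not the package's implementation: `classify_item()` is a hypothetical helper, and treating a missing parent value as "condition not met" is an assumption.

```r
# Hypothetical helper (not part of DQA): classify each row of a child
# variable given its parent variable and the parent value that makes the
# child applicable.
classify_item <- function(child, parent, dep_value) {
  # A row is applicable when the parent condition is met; "ANY" means any
  # non-missing parent value. Assumption: an NA parent never meets the
  # condition.
  applicable <- if (identical(dep_value, "ANY")) {
    !is.na(parent)
  } else {
    !is.na(parent) & as.character(parent) == dep_value
  }
  present <- !is.na(child)

  ifelse(applicable & present, "Completed",
    ifelse(applicable & !present, "Missing",
      ifelse(!applicable & !present, "Jump", "Unexpected")))
}

has_job   <- c("Yes", "Yes", "No", "No")
job_title <- c("Manager", NA, NA, "Student")

states <- classify_item(job_title, has_job, dep_value = "Yes")
print(states)
```

Each of the four rows in this toy input lands in a different state, in the same order as the list above.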

The metadata (`M_data`) must contain the following columns to define the rules:

  • **VARIABLE:** The name of the variable in the source data (`S_data`) to be checked for missingness.

  • **VARIABLE_Code:** A unique numeric or character code assigned to each variable for identification and dependency mapping.

  • **Dependency:** Specifies the dependency of the variable on another variable. A value of `0` indicates no dependency, while other values indicate the `VARIABLE_Code` of the parent variable.

  • **Dep_Value:** The specific value or condition of the parent variable (as referenced in `Dependency`) that must be met for the current variable to be applicable. Use `"ANY"` if the value of the parent variable can be any non-missing value.
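To make the role of these columns concrete, the sketch below resolves a variable's parent from a small metadata table. `find_parent()` is a hypothetical helper written for this illustration; how the package performs the lookup internally is not documented here.

```r
# Small metadata table using the four required columns.
meta <- data.frame(
  VARIABLE      = c("Has_Job", "Job_Title", "Job_Satisfaction"),
  VARIABLE_Code = c(4, 5, 6),
  Dependency    = c(0, 4, 5),
  Dep_Value     = c("0", "Yes", "ANY"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of DQA): return the parent's name and the
# required parent value for one variable, or NULL when Dependency is 0.
find_parent <- function(meta, variable) {
  row <- meta[meta$VARIABLE == variable, ]
  if (row$Dependency == 0) {
    return(NULL)
  }
  # Dependency holds the VARIABLE_Code of the parent variable.
  parent <- meta$VARIABLE[meta$VARIABLE_Code == row$Dependency]
  list(parent = parent, dep_value = row$Dep_Value)
}

rule <- find_parent(meta, "Job_Title")
print(rule)
```

Here `Job_Title` resolves to the parent `Has_Job` with required value `"Yes"`, while `Has_Job` itself has no dependency and yields `NULL`.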

See Also

Other missing data checks: check_missing_record(), check_missing_segments()

Examples

library(DQA)

# 1. Define comprehensive sample data and metadata
Meta_data <- data.frame(
  stringsAsFactors = FALSE,
  VARIABLE = c(
    "ID", "Gender", "Age", "Has_Job", "Job_Title",
    "Job_Satisfaction", "Last_Promotion_Year", "Has_Insurance",
    "Insurance_Provider", "Annual_Checkup"
  ),
  VARIABLE_Code = 1:10,
  Var_order = 1:10,
  Segment_Names = c(
    "Demographic", "Demographic", "Demographic", "Employment", "Employment",
    "Employment", "Employment", "Health", "Health", "Health"
  ),
  Dependency = c(0, 0, 0, 0, 4, 5, 5, 0, 8, 8),
  Dep_Value = c(
    "0", "0", "0", "0", "Yes", "ANY", "ANY", "0", "Yes", "Yes"
  )
)

Source_data <- data.frame(
  ID = 1:10,
  Gender = c("Male", "Female", "Male", "Female", "Male",
             "Female", "Male", "Female", "Male", "Female"),
  Age = c(25, 42, 31, 55, 29, 38, 45, 22, 60, 33),
  Has_Job = c("Yes", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "No", "Yes"),
  Job_Title = c(NA, "Manager", NA, "Analyst", NA, "Student",
                "Director", "Engineer", NA, "Designer"),
  Job_Satisfaction = c(5, 9, NA, 8, 7, NA, 10, 9, NA, 6),
  Last_Promotion_Year = c(2020, 2021, NA, NA, NA, NA, 2024, 2022, NA, 2023),
  Has_Insurance = c("Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "No", "No", "Yes"),
  Insurance_Provider = c("Provider A", NA, "Provider B", "Provider C",
                         "Provider D", NA, "Provider E", NA, NA, "Provider F"),
  Annual_Checkup = c("Yes", NA, "No", "Yes", NA, "Yes", "Yes", "No", NA, "Yes")
)

# 2. Run the item-wise check with plot
item_report <- check_missing_itemwise(
  S_data = Source_data, M_data = Meta_data, Show_Plot = TRUE
)
print(item_report)
