DQA (version 0.1.0)

check_missing_segments: Check Missing Data by Segments

Description

Analyzes data completeness at the segment level. A segment is a group of variables defined in the `Segment_Names` column of the metadata.

Usage

check_missing_segments(S_data, M_data, Show_Plot = FALSE)

Value

A `data.frame` summarizing the analysis for each segment, with columns: `SEGMENT`, `Total_Rows`, `Complete_Count`, `Incomplete_Count`, `Missing_Count`, `Percent_Complete`, `Percent_Incomplete`, and `Percent_Missing`.

Arguments

S_data

A data frame containing the source data to be checked.

M_data

A metadata data frame containing the validation rules, including a `Segment_Names` column.

Show_Plot

A logical value. If `TRUE`, a stacked bar chart visualizing the proportions for each segment is displayed.

Details

For each segment, this function evaluates every row of the source data (`S_data`) and classifies it into one of three categories:

  • **Complete:** The row has a non-missing value for every variable within the segment.

  • **Incomplete:** The row has at least one `NA` value, but not all `NA` values, for variables in the segment.

  • **Fully Missing:** All variables belonging to the segment are `NA` for that row.
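The three categories can be illustrated with a minimal base R sketch. The helper `classify_segment_rows()` is hypothetical and not part of DQA; it deliberately ignores the `Dependency` and `Dep_Value` rules, so its results may differ from what `check_missing_segments()` reports for variables that have a parent.

```r
# Hypothetical helper (not part of DQA): classify each row for one segment
# by counting NA values across the segment's variables.
# Assumption: dependency rules are ignored in this sketch.
classify_segment_rows <- function(src, vars) {
  na_count <- rowSums(is.na(src[, vars, drop = FALSE]))
  status <- ifelse(
    na_count == 0, "Complete",
    ifelse(na_count == length(vars), "Fully Missing", "Incomplete")
  )
  unname(status)
}

# Toy data: one complete row, one fully missing row, one incomplete row
toy <- data.frame(
  Has_Insurance = c("Yes", NA, "No"),
  Insurance_Provider = c("Provider A", NA, NA),
  stringsAsFactors = FALSE
)

status <- classify_segment_rows(toy, c("Has_Insurance", "Insurance_Provider"))
print(status)
```

Counting `NA` values per row keeps the three categories mutually exclusive: zero missing values, all values missing, or anything in between.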

The metadata (`M_data`) must contain the following columns to define the rules:

  • **VARIABLE:** The name of the variable in the source data (`S_data`) to be checked for missingness.

  • **VARIABLE_Code:** A unique numeric or character code assigned to each variable for identification and dependency mapping.

  • **Dependency:** Specifies the dependency of the variable on another variable. A value of `0` indicates no dependency, while other values indicate the `VARIABLE_Code` of the parent variable.

  • **Dep_Value:** The specific value or condition of the parent variable (as referenced in `Dependency`) that must be met for the current variable to be applicable. Use `"ANY"` if the value of the parent variable can be any non-missing value.
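One way to read these four columns is as a per-row applicability rule: a variable with a parent is only expected to be filled in when the parent holds the required value. The sketch below shows that reading in base R. The helper `is_applicable()` is hypothetical, and whether DQA resolves dependencies in exactly this way is an assumption.

```r
# Hypothetical helper (not part of DQA): for each row, decide whether
# `variable` is applicable given its Dependency / Dep_Value rule.
# Assumption: one metadata row per variable, one level of dependency.
is_applicable <- function(src, meta, variable) {
  rule <- meta[meta$VARIABLE == variable, ]
  if (rule$Dependency == 0) {
    # No parent: the variable applies to every row
    return(rep(TRUE, nrow(src)))
  }
  parent <- meta$VARIABLE[meta$VARIABLE_Code == rule$Dependency]
  parent_values <- src[[parent]]
  if (rule$Dep_Value == "ANY") {
    # Any non-missing parent value makes the child applicable
    !is.na(parent_values)
  } else {
    !is.na(parent_values) & parent_values == rule$Dep_Value
  }
}

toy_meta <- data.frame(
  VARIABLE = c("Has_Job", "Job_Title", "Job_Satisfaction"),
  VARIABLE_Code = c(4, 5, 6),
  Dependency = c(0, 4, 5),
  Dep_Value = c("0", "Yes", "ANY"),
  stringsAsFactors = FALSE
)
toy_src <- data.frame(
  Has_Job = c("Yes", "No", NA),
  Job_Title = c("Analyst", NA, NA),
  Job_Satisfaction = c(8, NA, NA),
  stringsAsFactors = FALSE
)

job_title_ok <- is_applicable(toy_src, toy_meta, "Job_Title")
job_sat_ok <- is_applicable(toy_src, toy_meta, "Job_Satisfaction")
print(job_title_ok)
print(job_sat_ok)
```

Under this reading, a missing `Job_Title` for a row where `Has_Job` is `"No"` would not count against completeness, because the variable does not apply to that row.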

The function returns a summary table with counts and percentages for each category per segment.
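As a rough illustration of how such a summary row relates to the per-row categories, the sketch below tabulates a vector of row labels into the documented column names. The helper `summarise_segment()` is hypothetical, and expressing percentages on a 0 to 100 scale without rounding is an assumption; the package may format them differently.

```r
# Hypothetical helper (not part of DQA): build one summary row from a
# character vector of per-row labels for a single segment.
# Assumption: percentages are on a 0-100 scale and left unrounded.
summarise_segment <- function(segment, status) {
  total <- length(status)
  n_complete <- sum(status == "Complete")
  n_incomplete <- sum(status == "Incomplete")
  n_missing <- sum(status == "Fully Missing")
  data.frame(
    SEGMENT = segment,
    Total_Rows = total,
    Complete_Count = n_complete,
    Incomplete_Count = n_incomplete,
    Missing_Count = n_missing,
    Percent_Complete = 100 * n_complete / total,
    Percent_Incomplete = 100 * n_incomplete / total,
    Percent_Missing = 100 * n_missing / total,
    stringsAsFactors = FALSE
  )
}

summary_row <- summarise_segment(
  "Health",
  c("Complete", "Complete", "Incomplete", "Fully Missing")
)
print(summary_row)
```

Because every row falls into exactly one category, the three counts sum to `Total_Rows` and the three percentages sum to 100.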

See Also

Other missing data checks: check_missing_itemwise(), check_missing_record()

Examples

# 1. Define the sample metadata
Meta_data <- data.frame(
  stringsAsFactors = FALSE,
  VARIABLE = c(
    "ID", "Gender", "Age", "Has_Job", "Job_Title",
    "Job_Satisfaction", "Last_Promotion_Year", "Has_Insurance",
    "Insurance_Provider", "Annual_Checkup"
  ),
  VARIABLE_Code = 1:10,
  Var_order = 1:10,
  Segment_Names = c(
    "Demographic", "Demographic", "Demographic", "Employment", "Employment",
    "Employment", "Employment", "Health", "Health", "Health"
  ),
  Dependency = c(0, 0, 0, 0, 4, 5, 5, 0, 8, 8),
  Dep_Value = c(
    "0", "0", "0", "0", "Yes", "ANY", "ANY", "0", "Yes", "Yes"
  )
)

# 2. Define the sample source data
Source_data <- data.frame(
  ID = 1:10,
  Gender = c(
    "Male", NA, "Male", "Female", "Male",
    "Female", "Male", "Female", "Male", "Female"
  ),
  Age = c(25, NA, 31, 55, 29, 38, 45, 22, 60, 33),
  Has_Job = c("Yes", NA, "No", "Yes", "Yes", "No", "Yes", "Yes", "No", "Yes"),
  Job_Title = c(
    NA, NA, NA, "Analyst", NA,
    "Student", "Director", "Engineer", NA, "Designer"
  ),
  Job_Satisfaction = c(5, NA, NA, 8, 7, NA, 10, 9, NA, 6),
  Last_Promotion_Year = c(2020, NA, 2021, NA, NA, NA, 2024, 2022, NA, 2023),
  Has_Insurance = c("Yes", NA, "Yes", "Yes", "No", "Yes", "Yes", "No", "No", "Yes"),
  Insurance_Provider = c(
    "Provider A", NA, "Provider B", "Provider C", "Provider D",
    NA, "Provider E", NA, NA, "Provider F"
  ),
  Annual_Checkup = c("Yes", NA, "No", "Yes", NA, "Yes", "Yes", "No", NA, "Yes")
)

# 3. Run the segment check with plot
segment_report <- check_missing_segments(
  S_data = Source_data, M_data = Meta_data, Show_Plot = TRUE
)
print(segment_report)