inspect_imb: Summary and comparison of the most common levels in categorical columns

Description

For a single dataframe, summarise the most common level in each categorical column. If two dataframes are supplied, compare the most common levels of categorical features appearing in both dataframes. For grouped dataframes, summarise the levels of categorical columns in the dataframe split by group.

Usage

inspect_imb(df1, df2 = NULL, include_na = FALSE)

Value

A tibble summarising and comparing the imbalance for each categorical column in one or a pair of dataframes.

Arguments

df1: A dataframe.
df2: An optional second dataframe for comparing columnwise imbalance. Defaults to NULL.
include_na: Logical flag indicating whether to count missing values as a distinct level. Defaults to FALSE, in which case NA values are ignored.
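
The effect of treating NA as a level can be illustrated with base R alone. This is only a sketch on a made-up vector `x`; it uses `table()` and does not show how inspect_imb handles missing values internally.

```r
# Hypothetical categorical vector with more missing values than any real level
x <- c("a", "a", NA, NA, NA, "b")

# NA ignored (analogous to include_na = FALSE): "a" is the most common level
counts_without_na <- table(x)
top_without_na <- names(counts_without_na)[which.max(counts_without_na)]

# NA counted as its own level (analogous to include_na = TRUE): NA becomes the most common
counts_with_na <- table(x, useNA = "ifany")
top_with_na <- names(counts_with_na)[which.max(counts_with_na)]

top_without_na
top_with_na
```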

Author

Alastair Rushworth

Details

For a single dataframe, the tibble returned contains the columns:

col_name, a character vector containing column names of df1.
value, a character vector containing the most common categorical level in each column of df1.
pcnt, the relative frequency of each column's most common categorical level expressed as a percentage.
cnt, the number of occurrences of the most common categorical level in each column of df1.
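
As a rough illustration of how value, cnt and pcnt relate to one another, the following base R sketch computes them by hand for a small made-up dataframe. The helper `most_common()` is hypothetical and is not part of inspectdf. The sketch assumes the percentage is taken over non-missing values, and the package may break ties differently.

```r
# Toy dataframe with two categorical columns (hypothetical data)
df <- data.frame(
  colour = c("red", "red", "blue", "red", "green"),
  size   = c("S", "M", "M", "M", "L"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most common level of one column, with count and percentage
most_common <- function(x) {
  counts <- table(x)                           # NA values are dropped by default
  top    <- names(counts)[which.max(counts)]   # first level with the highest count
  cnt    <- as.integer(max(counts))
  data.frame(value = top, cnt = cnt, pcnt = 100 * cnt / sum(counts),
             stringsAsFactors = FALSE)
}

# One row per column, mirroring the col_name / value / pcnt / cnt layout
imb <- do.call(rbind, lapply(df, most_common))
imb <- cbind(col_name = names(df), imb, stringsAsFactors = FALSE)
rownames(imb) <- NULL
imb
```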

For a pair of dataframes, the tibble returned contains the columns:

col_name, a character vector containing names of the unique columns in df1 and df2.
value, a character vector containing the most common categorical level in each column of df1.
pcnt_1, pcnt_2, the percentage occurrence of value in the column col_name for each of df1 and df2, respectively.
cnt_1, cnt_2, the number of occurrences of value in the column col_name for each of df1 and df2, respectively.
p_value, p-value associated with the null hypothesis that the true rate of occurrence is the same for both dataframes. Small values indicate stronger evidence of a difference in the rate of occurrence.
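
One standard way to test a null hypothesis of this kind is a two-sample test of equal proportions, available in base R as `prop.test()`. The sketch below uses made-up counts. Whether inspect_imb uses this exact test is an assumption, so the result need not match the p_value column.

```r
# Hypothetical counts: occurrences of `value` and total rows in each dataframe
cnt_1 <- 30; n_1 <- 100
cnt_2 <- 45; n_2 <- 100

# Two-sample test of equal proportions (null hypothesis: same rate of occurrence)
test <- prop.test(x = c(cnt_1, cnt_2), n = c(n_1, n_2))

# A small p-value suggests the rates of occurrence differ between the two dataframes
test$p.value
```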

For a grouped dataframe, the tibble returned is as for a single dataframe, except that the first k columns are the grouping columns. There will be as many rows in the result as there are unique combinations of the grouping variables.
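
The grouped layout can be sketched in base R by splitting a made-up dataframe on one grouping column and summarising one categorical column within each piece. This illustrates only the shape of the output, with the grouping column first, and is not how the package computes it.

```r
# Hypothetical data: one grouping column and one categorical column
df <- data.frame(
  group  = c("a", "a", "a", "b", "b"),
  colour = c("red", "red", "blue", "green", "green"),
  stringsAsFactors = FALSE
)

# Split by group, then find the most common level of `colour` in each piece
pieces <- split(df, df$group)
res <- do.call(rbind, lapply(names(pieces), function(g) {
  counts <- table(pieces[[g]]$colour)
  data.frame(
    group    = g,                                   # grouping column comes first
    col_name = "colour",
    value    = names(counts)[which.max(counts)],
    pcnt     = 100 * max(counts) / sum(counts),
    cnt      = as.integer(max(counts)),
    stringsAsFactors = FALSE
  )
}))
res
```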

Examples

# Load inspectdf for inspect_imb()
library(inspectdf)
# Load dplyr for starwars data & pipe
library(dplyr)

# Single dataframe summary
inspect_imb(starwars)

# Paired dataframe comparison
inspect_imb(starwars, starwars[1:20, ])

# Grouped dataframe summary
starwars %>% group_by(gender) %>% inspect_imb()
