For a single dataframe, the tibble returned contains the columns:
col_name, a character vector containing the column names of df1.
cnt, an integer column containing the count of unique levels found in each column, including NA.
common, a character column containing the name of the most common level.
common_pcnt, the percentage of each column occupied by the most common level shown in common.
levels, a named list containing relative frequency tibbles for each feature.
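To make the meaning of these columns concrete, the following base R sketch computes comparable quantities by hand for a small toy data frame. It is an illustration under stated assumptions, not the package's implementation: the data, the helper rel_freq and the object names levels_list and summary_tbl are all hypothetical, and a plain data.frame stands in for the tibble.

```r
# Toy data frame (hypothetical data, for illustration only)
df1 <- data.frame(
  colour = c("red", "red", "blue", NA),
  size   = c("S", "M", "M", "M"),
  stringsAsFactors = FALSE
)

# Relative frequency table for one column, counting NA as its own level
rel_freq <- function(x) {
  tab <- table(x, useNA = "ifany")
  tab / sum(tab)
}

# One relative frequency table per column (analogous to the levels column)
levels_list <- lapply(df1, rel_freq)

# Assemble the per-column summary described above
summary_tbl <- data.frame(
  col_name    = names(df1),
  cnt         = vapply(levels_list, length, integer(1)),
  common      = vapply(levels_list, function(p) names(p)[which.max(p)], character(1)),
  common_pcnt = vapply(levels_list, function(p) 100 * max(p), numeric(1)),
  row.names   = NULL,
  stringsAsFactors = FALSE
)

print(summary_tbl)
```

In this sketch the colour column has three levels because NA is counted, which matches the description of cnt above.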
For a pair of dataframes, the tibble returned contains the columns:
col_name, a character vector containing the names of columns appearing in both df1 and df2.
jsd, a numeric column containing the Jensen-Shannon divergence. This measures the difference in relative frequencies of levels in a pair of categorical features. Values near 0 indicate that the distributions agree, while values near 1 indicate disagreement.
pval, the p-value corresponding to a null hypothesis test (NHT) that the true frequencies of the categories are equal. A small p-value indicates evidence that the two sets of relative frequencies are actually different. The test is based on a modified Chi-squared statistic.
lvls_1, lvls_2, the relative frequencies of levels in each of df1 and df2.
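As a rough illustration of the divergence measure, the sketch below computes a Jensen-Shannon divergence between two relative frequency vectors in base R. It assumes base-2 logarithms, which is consistent with the 0 to 1 range described above, and it assumes both vectors cover the same levels in the same order. The function jsd_sketch and the example frequencies are hypothetical, and the package's exact formula may differ. The modified Chi-squared test behind pval is not reproduced here.

```r
# Jensen-Shannon divergence between two relative frequency vectors
# defined over the same levels (base-2 logs, so the result lies in [0, 1])
jsd_sketch <- function(p, q) {
  m <- (p + q) / 2
  # Kullback-Leibler divergence, treating 0 * log(0) as 0
  kl <- function(a, b) sum(ifelse(a > 0, a * log2(a / b), 0))
  (kl(p, m) + kl(q, m)) / 2
}

# Hypothetical relative frequencies of one column in df1 and df2
same_1 <- c(a = 0.5, b = 0.5)
same_2 <- c(a = 0.5, b = 0.5)
jsd_same <- jsd_sketch(same_1, same_2)                        # identical distributions

jsd_disjoint <- jsd_sketch(c(a = 1, b = 0), c(a = 0, b = 1))  # no shared levels
```

Identical distributions give a divergence of 0, and distributions that share no levels give 1, which matches the interpretation of the jsd column.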
For a grouped dataframe, the tibble returned is as for a single dataframe, except that the first k columns are the grouping columns. There will be as many rows in the result as there are unique combinations of the grouping variables.
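The number of unique combinations of the grouping variables can be checked directly. The base R sketch below uses a hypothetical data frame with two grouping columns, g1 and g2; the names df, group_cols and n_groups are assumptions made for illustration only.

```r
# Hypothetical data with two grouping columns and one categorical feature
df <- data.frame(
  g1    = c("a", "a", "b", "b", "b"),
  g2    = c("x", "y", "x", "x", "y"),
  value = c("u", "v", "u", "u", "w"),
  stringsAsFactors = FALSE
)

group_cols <- c("g1", "g2")

# Count the distinct combinations of the grouping variables
n_groups <- nrow(unique(df[group_cols]))
print(n_groups)
```

Here the combination (b, x) occurs twice, so the five rows reduce to four distinct groups.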