For a single dataframe, the tibble returned contains the columns:
col_name, a character vector containing the column names in df1.
min, q1, median, mean, q3, max and sd, the minimum, lower quartile, median, mean, upper quartile, maximum and standard deviation for each numeric column.
pcnt_na, the percentage of each numeric feature that is missing.
hist, a named list of tibbles containing the relative frequency of values falling in bins determined by breaks.
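As a rough illustration of what these columns hold, the sketch below computes the same kinds of summaries for one numeric column in plain Python. It is not the package's implementation: the function name summarise_numeric, the use of None for missing values, and the bin-edge handling are assumptions made for this example.

```python
import statistics
from bisect import bisect_right


def summarise_numeric(values, breaks):
    """Summarise one numeric column; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    # Quartiles by linear interpolation between order statistics.
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")

    # Count values per bin [breaks[i], breaks[i + 1]); the last bin also
    # includes its right edge. Assumes breaks are sorted and span the data.
    counts = [0] * (len(breaks) - 1)
    for v in present:
        i = min(bisect_right(breaks, v) - 1, len(counts) - 1)
        counts[i] += 1

    return {
        "min": min(present),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(present),
        "q3": q3,
        "max": max(present),
        "sd": statistics.stdev(present),  # sample standard deviation
        "pcnt_na": 100 * (len(values) - len(present)) / len(values),
        "hist": [c / len(present) for c in counts],  # relative frequencies
    }


summary = summarise_numeric([1.0, 2.0, None, 4.0, 5.0], breaks=[0, 2, 4, 6])
print(summary["pcnt_na"], summary["hist"])
```

One of the five values is missing, so pcnt_na is 20, and the relative frequencies in hist are computed over the four non-missing values only.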
For a pair of dataframes, the tibble returned contains the columns:
col_name, a character vector containing the column names in df1 and df2.
hist_1, hist_2, list columns containing the histograms for df1 and df2, respectively. Where a column appears in both dataframes, the bins used for df1 are reused to calculate histograms for df2.
jsd, a numeric column containing the Jensen-Shannon divergence. This measures the
difference in distribution of a pair of binned numeric features. Values near 0 indicate
agreement of the distributions, while values near 1 indicate disagreement.
pval, the p-value corresponding to a null hypothesis test that the true frequencies of the histogram bins are equal. A small p-value indicates evidence that the two sets of relative frequencies are actually different. The test is based on a modified Chi-squared statistic.
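To make the jsd column concrete, here is a minimal sketch of the Jensen-Shannon divergence between two histograms that share the same bins. The function name jensen_shannon is hypothetical, and the base-2 logarithm is an assumption chosen because it bounds the result between 0 and 1, matching the description above; the package's own calculation may differ in detail.

```python
import math


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (log base 2, an assumption) between two
    relative-frequency vectors defined over the same bins."""
    # Mixture distribution: the bin-wise average of the two histograms.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; empty bins in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


hist_1 = [0.25, 0.25, 0.50]  # relative frequencies for a column in df1
hist_2 = [0.10, 0.30, 0.60]  # same bins, same column in df2
print(jensen_shannon(hist_1, hist_2))
```

Identical histograms give a divergence of 0, and histograms with no occupied bins in common give 1. The comparison is only meaningful because both histograms use the same bins.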
For a grouped dataframe, the tibble returned is as for a single dataframe, but where
the first k columns are the grouping columns. There will be as many rows in the result
as there are unique combinations of the grouping variables.
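The shape of the grouped result can be sketched as follows: one output row per unique combination of the grouping variables, with the grouping columns placed first. The function grouped_summary, the column names site, year and x, and the choice to show only the mean are all assumptions for illustration.

```python
import statistics
from collections import defaultdict


def grouped_summary(rows, group_cols, value_col):
    """Return one summary row per unique combination of group_cols."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in group_cols)
        groups[key].append(row[value_col])

    # Grouping columns come first, followed by the summary columns.
    return [
        dict(zip(group_cols, key), col_name=value_col, mean=statistics.fmean(vals))
        for key, vals in sorted(groups.items())
    ]


# Hypothetical data: two grouping variables and one numeric column.
rows = [
    {"site": "a", "year": 2020, "x": 1.0},
    {"site": "a", "year": 2020, "x": 3.0},
    {"site": "a", "year": 2021, "x": 5.0},
    {"site": "b", "year": 2020, "x": 7.0},
]
for out in grouped_summary(rows, ["site", "year"], "x"):
    print(out)
```

The four input rows contain three unique (site, year) combinations, so the result has three rows.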