inspect_num: Summarise and compare the numeric variables within one or two dataframes

Description

Summarise and compare the numeric variables within one or two dataframes

Usage

inspect_num(df1, df2 = NULL, breaks = 20, breakseq = NULL,
  show_plot = FALSE)

Arguments

df1

A data frame

df2

An optional second data frame for comparing categorical levels. Defaults to NULL.

breaks

Optional argument determining how breaks are constructed for histograms when comparing numeric data frame features. This is passed to

breakseq

For internal use only. Argument that accepts a pre-specified set of break points, default is NULL.

show_plot

(Deprecated) Logical flag indicating whether a plot should be shown. Superseded by the function show_plot() and will be dropped in a future version. hist(..., breaks). See ?hist for more details.

Value

A tibble containing statistical summaries of the numeric columns of df1, or comparing the histograms of df1 and df2.

Details

If only df1 is specified, inspect_num returns a tibble with columns

col_name character vector containing the column names in df1 and df2
min, q1, median, mean, q3, max and sd: the minimum, lower quartile, median, mean, upper quartile, maximum and standard deviation for each numeric column.
pcnt_na the percentage of each numeric feature that is missing
hist a list of tibbles containing the relative frequency of values in a set of discrete bins for each column.

If both df1 and df2 are specified, the tibble has columns

col_name character vector containing the column names in df1 and df2
hist_1, hist_2 list column for histograms of each of df1 and df2. Where a column appears in both dataframe, the bins used for df1 are reused to calculate histograms for df2.
jsd numeric column containing the Jensen-Shannon divergence. This measures the difference in distribution of a pair of binned numeric features. Values near to 0 indicate agreement of the distributions, while 1 indicates disagreement.
fisher_p p-value corresponding to Fisher's exact test. A small p indicates evidence that the two histograms are actually different.

Examples

Run this code

# NOT RUN {
data("starwars", package = "dplyr")
# show summary statistics for starwars
inspect_num(starwars)
# with a visualisation too - try to limit number of bins
inspect_num(starwars, breaks = 10)
# compare two data frames
inspect_num(starwars, starwars[-c(1:10), ], breaks = 10)
# }

Run the code above in your browser using DataLab