gghistostats
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
The function ggstatsplot::gghistostats
can be used for data exploration
and to provide an easy way to make publicationready histograms with
appropriate and selected statistical details embedded in the plot itself. In
this vignette we will explore several examples of how to use it.
Some instances where you would want to use gghistostats

 to inspect distribution of a continuous variable
 to test if the mean of a sample variable is different from a specified value (population parameter)
Distribution of a sample with gghistostats
Let's begin with a very simple example from the psych
package
(psych::sat.act
), a sample of 700 selfreported scores on the SAT Verbal, SAT
Quantitative and ACT tests. ACT composite scores may range from 1  36. National
norms have a mean of 20.
# loading needed libraries
library(ggstatsplot)
library(psych)
library(dplyr)
# looking at the structure of the data using glimpse
dplyr::glimpse(x = psych::sat.act)
To get a simple histogram with no statistics and no special information.
gghistostats
will by default choose a binwidth max(x)  min(x) / sqrt(N)
.
You should always check this value and explore multiple widths to find the best
to illustrate the stories in your data since histograms are sensitive to
binwidth.
ggstatsplot::gghistostats(
data = psych::sat.act, # data from which variable is to be taken
x = ACT, # numeric variable
results.subtitle = FALSE, # don't run statistical tests
messages = FALSE, # turn off messages
xlab = "ACT Score", # xaxis label
title = "Distribution of ACT Scores", # title for the plot
subtitle = "N = 700", # subtitle for the plot
caption = "Data courtesy of: SAPA project (https://sapaproject.org)", # caption for the plot
centrality.k = 1 # show 1 decimal places for centrality label
)
Statistical analysis with gghistostats
The authors note that "the score means are higher than national norms suggesting
both self selection for people taking on line personality and ability tests and
a self reporting bias in scores." Let's display the national norms (labeled as
"Test") and test (using results.subtitle = TRUE
) the hypothesis that our
sample mean is the same as our national population mean of 20 using a parametric
one sample ttest (type = "p"
). To demonstrate some of the options available
we'll also:
 Change the overall theme with
ggtheme = ggthemes::theme_tufte()
 Make the histogram bars a different color with
bar.fill = "#D55E00"
 Plot proportions on the y axes with
bar.measure = "proportion"
 Turn messages on to receive additional diagnostic information with
messages = TRUE
 Compute information about the Bayes Factor (
bf.message = TRUE
, see below)
ggstatsplot::gghistostats(
data = psych::sat.act, # data from which variable is to be taken
x = ACT, # numeric variable
results.subtitle = TRUE, # run statistical tests
messages = TRUE, # turn on messages
bar.measure = "proportion", # proportions
xlab = "ACT Score", # xaxis label
title = "Distribution of ACT Scores", # title for the plot
type = "p", # one sample ttest
bf.message = TRUE, # display Bayes method results
ggtheme = ggthemes::theme_tufte(), # changing default theme
bar.fill = "#D55E00", # change fill color
normal.curve = TRUE, # disply a normal distribution curve
normal.curve.color = "black",
test.value = 20, # test value
test.value.line = TRUE, # show a vertical line at `test.value`
caption = "Data courtesy of: SAPA project (https://sapaproject.org)", # caption for the plot
centrality.k = 1 # show 1 decimal places for centrality label
)
gghistostats
computed Bayes Factors to quantify the likelihood of the research (BF10) and
the null hypothesis (BF01). In our current example, the Bayes Factor value provides
very strong evidence (Kass and Rafferty, 1995)
in favor of the research hypothesis: these ACT scores are much higher than the
national average. The log(Bayes factor) of 492.5 means the odds are 7.54e+213:1
that this sample is different.
Grouped analysis with grouped_gghistostats
What if we want to do the same analysis separately for each gender? In
that case, we will have to either write a for
loop or use purrr
, none of
which seem like an exciting prospect.
ggstatsplot
provides a special helper function for such instances:
grouped_gghistostats
. This is merely a wrapper function around
ggstatsplot::combine_plots
. It applies gghistostats
across all levels of
a specified grouping variable and then combines the individual plots into a
single plot. Note that the grouping variable can be anything: conditions in a
given study, groups in a study sample, different studies, etc.
Let's see how we can use this function to apply gghistostats
to accomplish our
task.
ggstatsplot::grouped_gghistostats(
# arguments relevant for ggstatsplot::gghistostats
data = psych::sat.act, # same dataset
x = ACT, # same outcome variable
xlab = "ACT Score",
grouping.var = gender, # grouping variable males = 1, females = 2
title.prefix = "Gender", # prefix for the fixed title
k = 1, # number of decimal places in results
type = "r", # robust test: onesample percentile bootstrap
robust.estimator = "mom", # changing the robust estimator used
test.value = 20, # test value against which sample mean is to be compared
test.value.line = TRUE, # show a vertical line at `test.value`
bar.measure = "density", # density
centrality.para = "median", # plotting centrality parameter
centrality.color = "#D55E00", # color for centrality line and label
test.value.color = "#009E73", # color for test line and label
messages = FALSE, # turn off messages
ggtheme = ggthemes::theme_stata(), # changing default theme
ggstatsplot.layer = FALSE, # turn off ggstatsplot theme layer
# arguments relevant for ggstatsplot::combine_plots
title.text = "Distribution of ACT scores across genders",
caption.text = "Data courtesy of: SAPA project (https://sapaproject.org)",
nrow = 2,
ncol = 1,
labels = c("Male","Female")
)
As can be seen from these plots, the mean value is much higher than the national norm. Additionally, we see the benefits of plotting this data separately for each gender. We can see the differences in distributions.
Grouped analysis with purrr
Although this is a quick and dirty way to explore a large amount of data with
minimal effort, it does come with an important limitation: reduced flexibility.
For example, if we wanted to add, let's say, a separate test.value
argument
for each gender, this is not possible with grouped_gghistostats
. For cases
like these, or to run separate kinds of tests (robust for some, parametric for
other, while Bayesian for some other levels of the group) it would be better to
use purrr
.
See the associated vignette here: https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/purrr_examples.html
Suggestions
If you find any bugs or have any suggestions/remarks, please file an issue on GitHub: https://github.com/IndrajeetPatil/ggstatsplot/issues
Session Information
For details, see https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/session_info.html