gghistostats
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
The function ggstatsplot::gghistostats
is meant to provide a publicationready
histogram with all statistical details included in the plot itself. We will see
examples of how to use this function in this vignette with the gapminder
dataset.
To begin with, here are some instances where you would want to use
gghistostats

 to check distribution of a continuous variable
 to check if the mean of variable is different from a specified value
gapminder dataset
This dataset (available in eponymous package on CRAN) provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007, for each of 142 countries and was collected by the Gapminder Foundation. Let's have a look at the data
library(gapminder)
library(dplyr)
library(magrittr)
gapminder::gapminder %>%
dplyr::glimpse(x = .)
Distribution with gghistostats
Suppose the first thing we want to check is the distribution of population
worldwide in 2007. In this case, we are not interested in any statistics and,
therefore, can set the results.subtitle
argument to FALSE
.
library(ggstatsplot)
gapminder::gapminder %>%
dplyr::filter(.data = ., year == 2007) %>% # select data only from the year 2007
ggstatsplot::gghistostats(
data = ., # data from which variable is to be taken
x = pop, # numeric variable
results.subtitle = FALSE, # don't run statistical tests
messages = FALSE, # turn off messages
xlab = "Population", # xaxis label
title = "Distribution of population worldwide", # title for the plot
subtitle = "Year: 2007", # subtitle for the plot
caption = "Data courtesy of: Gapminder Foundation" # caption for the plot
)
Although this plot is useful, it is still not satisfactory as most of the mass seems to be concentrated at 0 due to the large range of numbers. We can remedy this by converting population to logarithmic scale. We can additionally adjust binwidth so that we have bins for every increase in order of magnitude.
gapminder::gapminder %>%
dplyr::filter(.data = ., year == 2007) %>% # select data only from the year 2007
dplyr::mutate(.data = ., pop_log = log10(pop)) %>% # creating new population variable
ggstatsplot::gghistostats(
data = ., # data from which variable is to be taken
x = pop_log, # numeric variable
results.subtitle = FALSE, # don't run statistical tests
messages = FALSE, # turn off messages
xlab = "Population (logarithmic)", # xaxis label
title = "Distribution of population worldwide", # title for the plot
subtitle = "Year: 2007", # subtitle for the plot
caption = "Data courtesy of: Gapminder Foundation", # caption for the plot
binwidth.adjust = TRUE, # adjust binwidth
binwidth = 1 # new binwidth
)
This shows the utility of gghistostats
in case of data exploration.
Statistical analysis with gghistostats
Let's say we are now interested in investigating whether the mean life
expectancy in 2007 across the world has improved during the 20thCentury.
In 1950, it was 48, so this is the
test.value
we are going to use.
gapminder::gapminder %>%
dplyr::filter(.data = ., year == 2007) %>% # select data only from the year 2007
ggstatsplot::gghistostats(
data = ., # data from which variable is to be taken
x = lifeExp, # numeric variable
messages = FALSE, # turn off messages
test.value = 48, # test value against which sample mean is to be compared
xlab = "Life expectancy", # xaxis label
title = "Life expectancy worldwide", # title for the plot
subtitle = "Year: 2007", # subtitle for the plot
caption = "Data courtesy of: Gapminder Foundation", # caption for the plot
centrality.para = "mean" # plotting centrality parameter (mean)
)
Although there are still some countries where the life expectancy is low, on average, the life expectancy worldwide has improved compared to what it was in 1950.
gghistostats
also provides the opportunity to compute Bayes Factors to
quantify evidence in favor of the alternative (BF10) or the null
hypothesis (BF01). In practice, you need to compute only one and the other will
just be the inverse. In the current example, let's say we want to quantify
evidence in favor of the alternative hypothesis (H1) that the life expectancy
in 2007 has improved significantly worldwide since 1957. The
null, in this case, will of course be that there is no improvement.
gapminder::gapminder %>%
dplyr::filter(.data = ., year == 2007) %>%
ggstatsplot::gghistostats(
data = ., # data from which variable is to be taken
x = lifeExp, # numeric variable
messages = FALSE, # turn off messages
type = "bf", # bayesian one sample ttest
test.value = 48, # test value
xlab = "Life expectancy", # xaxis label
title = "Life expectancy worldwide", # title for the plot
subtitle = "Year: 2007", # subtitle for the plot
caption = "Note: black line  1950; blue line  2007", # caption for the plot
test.value.line = TRUE, # show a vertical line at `test.value`
centrality.para = "mean" # plotting centrality parameter (mean)
)
As this analysis shows, Bayes Factor value provides conclusive evidence in favor of the alternative hypothesis: Life expectancy worldwide has improved significantly since 1957.
Grouped analysis with gghistostats
What if we want to do the same analysis separately for all five continents? In
that case, we will have to either write a for
loop or use purrr
, none of
which seem like an exciting prospect.
ggstatsplot
provides a special helper function for such instances:
grouped_gghistostats
. This is merely a wrapper function around
ggstatsplot::combine_plots
. It applies gghistostats
across all levels of
a specified grouping variable and then combines list of individual plots
into a single plot. Note that the grouping variable can be anything: conditions
in a given study, groups in a study sample, different studies, etc.
Let's see how we can use this function to apply gghistostats
for all five
continents. We will be running parametric tests (one sample ttest, i.e.). If
you set type = "np"
, results from nonparametric test will be displayed.
gapminder::gapminder %>%
dplyr::filter(.data = ., year == 2007) %>%
ggstatsplot::grouped_gghistostats(
# arguments relevant for ggstatsplot::gghistostats
data = .,
x = lifeExp,
xlab = "Life expectancy",
type = "p", # parametric test
test.value = 48, # test value against which sample mean is to be compared
test.value.line = TRUE, # show a vertical line at `test.value`
messages = FALSE, # turn off messages
centrality.para = "mean", # plotting centrality parameter (mean)
grouping.var = continent, # grouping variable with multiple levels
# arguments relevant for ggstatsplot::combine_plots
title.text = "Life expectancy change in different continents since 1950",
caption.text = "Note: black line  1950; blue line  2007",
nrow = 3,
ncol = 2,
labels = c("(a)","(b)","(c)","(d)","(e)")
)
As can be seen from these plots, life expectancy has improved in all continents in 2007 as compared to the global average of 1950. Additionally, we see the benefits of plotting this data separately for each continent. If we look at the standardized effect sizes (Cohen's d), it is apparent that the biggest improvements in life expectancy outcomes were seen on the continents of Europe, Americas, and Oceania (just one data point is available here), while Asia and Africa exhibit the lowest improvements.
Although this is a quick and dirty way to explore large amount of data with
minimal effort, it does come with an important limitation: reduced flexibility.
For example, if we wanted to add, let's say, a separate test.value
argument
for each continent, this is not possible with grouped_gghistostats
. For cases
like these, it would be better to use a function like
purrr::pmap
.
Suggestions
If you find any bugs or have any suggestions/remarks, please file an issue on GitHub: https://github.com/IndrajeetPatil/ggstatsplot/issues