Learn R Programming

⚠️There's a newer version (1.7.0) of this package.Take me there.

{statsExpressions}: Tidy dataframes and expressions with statistical details

StatusUsageMiscellaneous

Introduction

The {statsExpressions} package has two key aims:

  • to provide a consistent syntax to do statistical analysis with tidy data (in pipe-friendly manner),
  • to provide statistical expressions (pre-formatted in-text statistical results) for plotting functions.

Statistical packages exhibit substantial diversity in terms of their syntax and expected input type. This can make it difficult to switch from one statistical approach to another. For example, some functions expect vectors as inputs, while others expect dataframes. Depending on whether it is a repeated measures design or not, different functions might expect data to be in wide or long format. Some functions can internally omit missing values, while other functions error in their presence. Furthermore, if someone wishes to utilize the objects returned by these packages downstream in their workflow, this is not straightforward either because even functions from the same package can return a list, a matrix, an array, a dataframe, etc., depending on the function.

This is where {statsExpressions} comes in: It can be thought of as a unified portal through which most of the functionality in these underlying packages can be accessed, with a simpler interface and no requirement to change data format.

This package forms the statistical processing backend for ggstatsplot package.

For more documentation, see the dedicated website.

Installation

TypeSourceCommand
Releaseinstall.packages("statsExpressions")
Developmentremotes::install_github("IndrajeetPatil/statsExpressions")

Citation

The package can be cited as:

citation("statsExpressions")

  Patil, I., (2021). statsExpressions: R Package for Tidy Dataframes
  and Expressions with Statistical Details. Journal of Open Source
  Software, 6(61), 3236, https://doi.org/10.21105/joss.03236

A BibTeX entry for LaTeX users is

  @Article{,
    doi = {10.21105/joss.03236},
    url = {https://doi.org/10.21105/joss.03236},
    year = {2021},
    publisher = {{The Open Journal}},
    volume = {6},
    number = {61},
    pages = {3236},
    author = {Indrajeet Patil},
    title = {{statsExpressions: {R} Package for Tidy Dataframes and Expressions with Statistical Details}},
    journal = {{Journal of Open Source Software}},
  }

General Workflow

Summary of functionality

Summary of available analyses

TestFunctionLifecycle
one-sample t-testone_sample_test
two-sample t-testtwo_sample_test
one-way ANOVAoneway_anova
correlation analysiscorr_test
contingency table analysiscontingency_table
meta-analysismeta_analysis

Summary of details available for analyses

AnalysisHypothesis testingEffect size estimation
(one/two-sample) t-test
one-way ANOVA
correlation
(one/two-way) contingency table
random-effects meta-analysis

Summary of supported statistical approaches

DescriptionParametricNon-parametricRobustBayesian
Between group/condition comparisons
Within group/condition comparisons
Distribution of a numeric variable
Correlation between two variables
Association between categorical variables
Equal proportions for categorical variable levels
Random-effects meta-analysis

Tidy dataframes from statistical analysis

To illustrate the simplicity of this syntax, let’s say we want to run a one-way ANOVA. If we first run a non-parametric ANOVA and then decide to run a robust ANOVA instead, the syntax remains the same and the statistical approach can be modified by changing a single argument:

library(statsExpressions)

mtcars %>% oneway_anova(cyl, wt, type = "nonparametric")
#> # A tibble: 1 x 15
#>   parameter1 parameter2 statistic df.error   p.value
#>   <chr>      <chr>          <dbl>    <int>     <dbl>
#> 1 wt         cyl             22.8        2 0.0000112
#>   method                       effectsize      estimate conf.level conf.low
#>   <chr>                        <chr>              <dbl>      <dbl>    <dbl>
#> 1 Kruskal-Wallis rank sum test Epsilon2 (rank)    0.736       0.95    0.624
#>   conf.high conf.method          conf.iterations n.obs expression  
#>       <dbl> <chr>                          <int> <int> <list>      
#> 1         1 percentile bootstrap             100    32 <expression>

mtcars %>% oneway_anova(cyl, wt, type = "robust")
#> # A tibble: 1 x 12
#>   statistic    df df.error p.value
#>       <dbl> <dbl>    <dbl>   <dbl>
#> 1      12.7     2     12.2 0.00102
#>   method                                           
#>   <chr>                                            
#> 1 A heteroscedastic one-way ANOVA for trimmed means
#>   effectsize                         estimate conf.level conf.low conf.high
#>   <chr>                                 <dbl>      <dbl>    <dbl>     <dbl>
#> 1 Explanatory measure of effect size     1.05       0.95    0.843      1.50
#>   n.obs expression  
#>   <int> <list>      
#> 1    32 <expression>

All possible output dataframes from functions are tabulated here: https://indrajeetpatil.github.io/statsExpressions/articles/web_only/dataframe_outputs.html

Needless to say this will also work with the kable function to generate a table:

# setup
library(statsExpressions)
set.seed(123)

# one-sample robust t-test
# we will leave `expression` column out; it's not needed for using only the dataframe
mtcars %>%
  one_sample_test(wt, test.value = 3, type = "robust") %>%
  dplyr::select(-expression) %>%
  knitr::kable()
statisticp.valuen.obsmethodeffectsizeestimateconf.levelconf.lowconf.high
1.1791810.27532Bootstrap-t method for one-sample testTrimmed mean3.1970.952.8542463.539754

These functions are also compatible with other popular data manipulation packages.

For example, let’s say we want to run a one-sample t-test for all levels of a certain grouping variable. We can use dplyr to do so:

# for reproducibility
set.seed(123)
library(dplyr)

# grouped operation
# running one-sample test for all levels of grouping variable `cyl`
mtcars %>%
  group_by(cyl) %>%
  group_modify(~ one_sample_test(.x, wt, test.value = 3), .keep = TRUE) %>%
  ungroup()
#> # A tibble: 3 x 16
#>     cyl    mu statistic df.error  p.value method            alternative
#>   <dbl> <dbl>     <dbl>    <dbl>    <dbl> <chr>             <chr>      
#> 1     4     3    -4.16        10 0.00195  One Sample t-test two.sided  
#> 2     6     3     0.870        6 0.418    One Sample t-test two.sided  
#> 3     8     3     4.92        13 0.000278 One Sample t-test two.sided  
#>   effectsize estimate conf.level conf.low conf.high conf.method
#>   <chr>         <dbl>      <dbl>    <dbl>     <dbl> <chr>      
#> 1 Hedges' g    -1.16        0.95   -1.97     -0.422 ncp        
#> 2 Hedges' g     0.286       0.95   -0.419     1.01  ncp        
#> 3 Hedges' g     1.24        0.95    0.565     1.98  ncp        
#>   conf.distribution n.obs expression  
#>   <chr>             <int> <list>      
#> 1 t                    11 <expression>
#> 2 t                     7 <expression>
#> 3 t                    14 <expression>

Using expressions in custom plots

Note that expression here means a pre-formatted in-text statistical result. In addition to other details contained in the dataframe, there is also a column titled expression, which contains expression with statistical details and can be displayed in a plot.

For all statistical test expressions, the default template attempt to follow the gold standard for statistical reporting.

For example, here are results from Welch’s t-test:

Expressions for centrality measure

library(ggplot2)

# displaying mean for each level of `cyl`
centrality_description(mtcars, cyl, wt) |>
  ggplot(aes(cyl, wt)) +
  geom_point() +
  geom_label(aes(label = expression), parse = TRUE)

Here are a few examples for supported analyses.

Expressions for one-way ANOVAs

Between-subjects design

Let’s say we want to check differences in weight of the vehicle based on number of cylinders in the engine and wish to carry out robust trimmed-means ANOVA:

# setup
set.seed(123)
library(ggplot2)
library(statsExpressions)
library(ggridges)

# create a ridgeplot
ggplot(iris, aes(x = Sepal.Length, y = Species)) +
  geom_density_ridges(
    jittered_points = TRUE, quantile_lines = TRUE,
    scale = 0.9, vline_size = 1, vline_color = "red",
    position = position_raincloud(adjust_vlines = TRUE)
  ) + # use the expression in the dataframe to display results in the subtitle
  labs(
    title = "A heteroscedastic one-way ANOVA for trimmed means",
    subtitle = oneway_anova(iris, Species, Sepal.Length, type = "robust")$expression[[1]]
  )

Within-subjects design

Let’s now see an example of a repeated measures one-way ANOVA.

# setup
set.seed(123)
library(ggplot2)
library(WRS2)
library(ggbeeswarm)
library(statsExpressions)

ggplot2::ggplot(WineTasting, aes(Wine, Taste, color = Wine)) +
  geom_quasirandom() +
  labs(
    title = "Friedman's rank sum test",
    subtitle = oneway_anova(
      WineTasting,
      Wine,
      Taste,
      paired = TRUE,
      subject.id = Taster,
      type = "np"
    )$expression[[1]]
  )

Expressions for two-sample tests

Between-subjects design

# setup
set.seed(123)
library(ggplot2)
library(gghalves)
library(ggbeeswarm)
library(hrbrthemes)

# create a plot
ggplot(ToothGrowth, aes(supp, len)) +
  geom_half_boxplot() +
  geom_beeswarm() +
  theme_ipsum_rc() +
  # adding a subtitle with
  labs(
    title = "Two-Sample Welch's t-test",
    subtitle = two_sample_test(ToothGrowth, supp, len)$expression[[1]]
  )

Within-subjects design

We can also have a look at a repeated measures design and the related expressions.

# setup
set.seed(123)
library(ggplot2)
library(tidyr)
library(PairedData)
data(PrisonStress)

# get data in tidy format
df <- pivot_longer(PrisonStress, starts_with("PSS"), "PSS", values_to = "stress")

# plot
paired.plotProfiles(PrisonStress, "PSSbefore", "PSSafter", subjects = "Subject") +
  labs(
    title = "Two-sample Wilcoxon paired test",
    subtitle = two_sample_test(
      data = df,
      x = PSS,
      y = stress,
      paired = TRUE,
      subject.id = Subject,
      type = "np"
    )$expression[[1]]
  )

Expressions for one-sample tests

# setup
set.seed(123)
library(ggplot2)

# dataframe with results
df_results <- one_sample_test(mtcars, wt,
  test.value = 3, type = "bayes",
  top.text = "Bayesian one-sample t-test"
)

# creating a histogram plot
ggplot(mtcars, aes(wt)) +
  geom_histogram(alpha = 0.5) +
  geom_vline(xintercept = mean(mtcars$wt), color = "red") +
  labs(
    subtitle = df_results$expression[[1]]
  )

Expressions for correlation analysis

Let’s look at another example where we want to run correlation analysis:

# setup
set.seed(123)
library(ggplot2)

# create a scatter plot
ggplot(mtcars, aes(mpg, wt)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(
    title = "Spearman's rank correlation coefficient",
    subtitle = corr_test(mtcars, mpg, wt, type = "nonparametric")$expression[[1]]
  )

Expressions for contingency table analysis

For categorical/nominal data - one-sample:

# setup
set.seed(123)
library(ggplot2)

df_results <- contingency_table(as.data.frame(table(mpg$class)),
  Var1,
  counts = Freq,
  type = "bayes",
  top.text = "One-sample goodness-of-fit test"
)

# basic pie chart
ggplot(as.data.frame(table(mpg$class)), aes(x = "", y = Freq, fill = factor(Var1))) +
  geom_bar(width = 1, stat = "identity") +
  theme(axis.line = element_blank()) +
  # cleaning up the chart and adding results from one-sample proportion test
  coord_polar(theta = "y", start = 0) +
  labs(
    fill = "Class",
    x = NULL,
    y = NULL,
    title = "Pie Chart of class (type of car)",
    caption = df_results$expression[[1]]
  )

You can also use these function to get the expression in return without having to display them in plots:

# setup
set.seed(123)
library(ggplot2)

# Pearson's chi-squared test of independence
contingency_table(mtcars, am, cyl)$expression[[1]]
#> expression(list(chi["Pearson"]^2 * "(" * 2 * ")" == "8.74", italic(p) == 
#>     "0.01", widehat(italic("V"))["Cramer"] == "0.46", CI["95%"] ~ 
#>     "[" * "0.00", "1.00" * "]", italic("n")["obs"] == "32"))

Expressions for meta-analysis

# setup
set.seed(123)
library(metaviz)
library(ggplot2)
library(metaplus)

# meta-analysis forest plot with results random-effects meta-analysis
viz_forest(
  x = mozart[, c("d", "se")],
  study_labels = mozart[, "study_name"],
  xlab = "Cohen's d",
  variant = "thick",
  type = "cumulative"
) + # use `{statsExpressions}` to create expression containing results
  labs(
    title = "Meta-analysis of Pietschnig, Voracek, and Formann (2010) on the Mozart effect",
    subtitle = meta_analysis(dplyr::rename(mozart, estimate = d, std.error = se))$expression[[1]]
  ) +
  theme(text = element_text(size = 12))

Customizing details to your liking

Sometimes you may not wish include so many details in the subtitle. In that case, you can extract the expression and copy-paste only the part you wish to include. For example, here only statistic and p-values are included:

# setup
set.seed(123)
library(ggplot2)

# extracting detailed expression
(res_expr <- oneway_anova(iris, Species, Sepal.Length, var.equal = TRUE)$expression[[1]])
#> expression(list(italic("F")["Fisher"](2, 147) == "119.26", italic(p) == 
#>     "1.67e-31", widehat(omega["p"]^2) == "0.61", CI["95%"] ~ 
#>     "[" * "0.53", "1.00" * "]", italic("n")["obs"] == "150"))

# adapting the details to your liking
ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  labs(subtitle = ggplot2::expr(paste(
    NULL, italic("F"), "(", "2",
    ",", "147", ") = ", "119.26", ", ",
    italic("p"), " = ", "1.67e-31"
  )))

Summary of tests and effect sizes

Here a go-to summary about statistical test carried out and the returned effect size for each function is provided. This should be useful if one needs to find out more information about how an argument is resolved in the underlying package or if one wishes to browse the source code. So, for example, if you want to know more about how one-way (between-subjects) ANOVA, you can run ?stats::oneway.test in your R console.

centrality_description

TypeMeasureFunction used
Parametricmeanparameters::describe_distribution()
Non-parametricmedianparameters::describe_distribution()
Robusttrimmed meanparameters::describe_distribution()
BayesianMAP (maximum a posteriori probability) estimateparameters::describe_distribution()

oneway_anova

between-subjects

Hypothesis testing

TypeNo. of groupsTestFunction used
Parametric> 2Fisher’s or Welch’s one-way ANOVAstats::oneway.test()
Non-parametric> 2Kruskal-Wallis one-way ANOVAstats::kruskal.test()
Robust> 2Heteroscedastic one-way ANOVA for trimmed meansWRS2::t1way()
Bayes Factor> 2Fisher’s ANOVABayesFactor::anovaBF()

Effect size estimation

TypeNo. of groupsEffect sizeCI available?Function used
Parametric> 2partial eta-squared, partial omega-squaredYeseffectsize::omega_squared(), effectsize::eta_squared()
Non-parametric> 2rank epsilon squaredYeseffectsize::rank_epsilon_squared()
Robust> 2Explanatory measure of effect sizeYesWRS2::t1way()
Bayes Factor> 2Bayesian R-squaredYesperformance::r2_bayes()

within-subjects

Hypothesis testing

TypeNo. of groupsTestFunction used
Parametric> 2One-way repeated measures ANOVAafex::aov_ez()
Non-parametric> 2Friedman rank sum teststats::friedman.test()
Robust> 2Heteroscedastic one-way repeated measures ANOVA for trimmed meansWRS2::rmanova()
Bayes Factor> 2One-way repeated measures ANOVABayesFactor::anovaBF()

Effect size estimation

TypeNo. of groupsEffect sizeCI available?Function used
Parametric> 2partial eta-squared, partial omega-squaredYeseffectsize::omega_squared(), effectsize::eta_squared()
Non-parametric> 2Kendall’s coefficient of concordanceYeseffectsize::kendalls_w()
Robust> 2Algina-Keselman-Penfield robust standardized difference averageYesWRS2::wmcpAKP()
Bayes Factor> 2Bayesian R-squaredYesperformance::r2_bayes()

two_sample_test

between-subjects

Hypothesis testing

TypeNo. of groupsTestFunction used
Parametric2Student’s or Welch’s t-teststats::t.test()
Non-parametric2Mann-Whitney U teststats::wilcox.test()
Robust2Yuen’s test for trimmed meansWRS2::yuen()
Bayesian2Student’s t-testBayesFactor::ttestBF()

Effect size estimation

TypeNo. of groupsEffect sizeCI available?Function used
Parametric2Cohen’s d, Hedge’s gYeseffectsize::cohens_d(), effectsize::hedges_g()
Non-parametric2r (rank-biserial correlation)Yeseffectsize::rank_biserial()
Robust2Algina-Keselman-Penfield robust standardized differenceYesWRS2::akp.effect()
Bayesian2differenceYesbayestestR::describe_posterior()

within-subjects

Hypothesis testing

TypeNo. of groupsTestFunction used
Parametric2Student’s t-teststats::t.test()
Non-parametric2Wilcoxon signed-rank teststats::wilcox.test()
Robust2Yuen’s test on trimmed means for dependent samplesWRS2::yuend()
Bayesian2Student’s t-testBayesFactor::ttestBF()

Effect size estimation

TypeNo. of groupsEffect sizeCI available?Function used
Parametric2Cohen’s d, Hedge’s gYeseffectsize::cohens_d(), effectsize::hedges_g()
Non-parametric2r (rank-biserial correlation)Yeseffectsize::rank_biserial()
Robust2Algina-Keselman-Penfield robust standardized differenceYesWRS2::wmcpAKP()
Bayesian2differenceYesbayestestR::describe_posterior()

one_sample_test

Hypothesis testing

TypeTestFunction used
ParametricOne-sample Student’s t-teststats::t.test
Non-parametricOne-sample Wilcoxon teststats::wilcox.test
RobustBootstrap-t method for one-sample testWRS2::trimcibt
BayesianOne-sample Student’s t-testBayesFactor::ttestBF

Effect size estimation

TypeEffect sizeCI available?Function used
ParametricCohen’s d, Hedge’s gYeseffectsize::cohens_d(), effectsize::hedges_g()
Non-parametricr (rank-biserial correlation)Yeseffectsize::rank_biserial()
Robusttrimmed meanYesWRS2::trimcibt()
Bayes FactordifferenceYesbayestestR::describe_posterior()

corr_test

Hypothesis testing and Effect size estimation

TypeTestCI available?Function used
ParametricPearson’s correlation coefficientYescorrelation::correlation()
Non-parametricSpearman’s rank correlation coefficientYescorrelation::correlation()
RobustWinsorized Pearson correlation coefficientYescorrelation::correlation()
BayesianBayesian Pearson’s correlation coefficientYescorrelation::correlation()

contingency_table

two-way table

Hypothesis testing

TypeDesignTestFunction used
Parametric/Non-parametricUnpairedPearson’s chi-squared teststats::chisq.test()
BayesianUnpairedBayesian Pearson’s chi-squared testBayesFactor::contingencyTableBF()
Parametric/Non-parametricPairedMcNemar’s chi-squared teststats::mcnemar.test()
BayesianPairedNoNo

Effect size estimation

TypeDesignEffect sizeCI available?Function used
Parametric/Non-parametricUnpairedCramer’s VYeseffectsize::cramers_v()
BayesianUnpairedCramer’s VYeseffectsize::cramers_v()
Parametric/Non-parametricPairedCohen’s gYeseffectsize::cohens_g()
BayesianPairedNoNoNo

one-way table

Hypothesis testing

TypeTestFunction used
Parametric/Non-parametricGoodness of fit chi-squared teststats::chisq.test()
BayesianBayesian Goodness of fit chi-squared test(custom)

Effect size estimation

TypeEffect sizeCI available?Function used
Parametric/Non-parametricPearson’s CYeseffectsize::pearsons_c()
BayesianNoNoNo

meta_analysis

Hypothesis testing and Effect size estimation

TypeTestEffect sizeCI available?Function used
ParametricMeta-analysis via random-effects modelsbetaYesmetafor::metafor()
RobustMeta-analysis via robust random-effects modelsbetaYesmetaplus::metaplus()
BayesMeta-analysis via Bayesian random-effects modelsbetaYesmetaBMA::meta_random()

Usage in ggstatsplot

Note that these functions were initially written to display results from statistical tests on ready-made {ggplot2} plots implemented in {ggstatsplot}.

For detailed documentation, see the package website: https://indrajeetpatil.github.io/ggstatsplot/

Here is an example from {ggstatsplot} of what the plots look like when the expressions are displayed in the subtitle-

Acknowledgments

The hexsticker and the schematic illustration of general workflow were generously designed by Sarah Otterstetter (Max Planck Institute for Human Development, Berlin).

Contributing

I’m happy to receive bug reports, suggestions, questions, and (most of all) contributions to fix problems and add features. I personally prefer using the GitHub issues system over trying to reach out to me in other ways (personal e-mail, Twitter, etc.). Pull Requests for contributions are encouraged.

Here are some simple ways in which you can contribute (in the increasing order of commitment):

  • Read and correct any inconsistencies in the documentation

  • Raise issues about bugs or wanted features

  • Review code

  • Add new functionality (in the form of new plotting functions or helpers for preparing subtitles)

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Copy Link

Version

Install

install.packages('statsExpressions')

Monthly Downloads

7,405

Version

1.3.1

License

GPL-3 | file LICENSE

Issues

Pull Requests

Stars

Forks

Maintainer

Indrajeet Patil

Last Published

March 29th, 2022

Functions in statsExpressions (1.3.1)

movies_long

Movie information and user ratings from IMDB.com (long format).
add_expression_col

Template for expressions with statistical details
long_to_wide_converter

Converts dataframe from long/tidy to wide format with NAs removed
centrality_description

Dataframe and expression for distribution properties
corr_test

Correlation analyses
meta_analysis

Random-effects meta-analyses
contingency_table

Contingency table analyses
iris_long

Edgar Anderson's Iris Data in long format.
bugs_long

Tidy version of the "Bugs" dataset.
movies_wide

Movie information and user ratings from IMDB.com (wide format).
one_sample_test

One-sample tests
statsExpressions-package

statsExpressions: Tidy Dataframes and Expressions with Statistical Details
reexports

Objects exported from other packages
stats_type_switch

Switch the type of statistics.
tidy_model_expressions

Expressions with statistics for tidy regression dataframes
oneway_anova

One-way analysis of variance (ANOVA)
two_sample_test

Two-sample tests
tidy_model_parameters

Convert parameters package output to tidyverse conventions