
groupedstats: Grouped statistical analysis in a tidy way

Overview

The groupedstats package provides a collection of functions to run statistical operations on multiple variables across multiple grouping variables in a dataframe. This is a common situation, as illustrated by a few example cases:

  1. If you have combined multiple studies in a single dataframe and want to run a common operation (e.g., linear regression) on each study. In this case, the column corresponding to study will be the grouping variable.
  2. If you have multiple groups in your dataframe (e.g., clinical disorder groups and a control group) and you want to carry out the same operation for each group (e.g., a planned t-test to check for differences in reaction time between condition 1 and condition 2 for both groups). In this case, group will be the grouping variable.
  3. If you have multiple conditions in a given study (e.g., six types of videos participants saw) and want to run a common operation between different measures of interest for each condition (e.g., correlation between subjective rating of emotional intensity and reaction time).
  4. A combination of all of the above.

Retirement

This package is no longer under active development, and no new functionality will be added. The package will continue to be available on CRAN, and all future releases will be focused primarily on maintenance and bug fixes. This is for two reasons:

  1. There are more general versions of these functions introduced in the broomExtra package: grouped_tidy, grouped_augment, grouped_glance.

For more, see: https://indrajeetpatil.github.io/broomExtra/reference/index.html#section-grouped-variants-of-generics

  2. dplyr 0.8.1 introduced the group_map(), group_modify(), and group_walk() functions, which can be used to iterate over grouped dataframes. So if you want to do grouped_ operations, I would highly recommend using these functions over the groupedstats functions, since the former are much more general, efficient, and faster than the latter (see the sketch below).

For more, see: https://dplyr.tidyverse.org/reference/group_map.html
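
To illustrate, here is a minimal sketch of that recommended workflow (assuming dplyr >= 0.8.1 and broom are installed; this example is not from the groupedstats documentation), fitting a linear regression of mpg on wt separately for each value of cyl:

# for reproducibility
set.seed(123)
library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  # `group_modify()` applies a dataframe-returning function to each group;
  # `.x` holds the data for the current group
  group_modify(.f = ~ broom::tidy(stats::lm(mpg ~ wt, data = .x))) %>%
  ungroup()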

Installation

To get the latest, stable CRAN release (0.0.8):

utils::install.packages(pkgs = "groupedstats")

You can get the development version of the package from GitHub (0.0.8.9000). To see what new changes (and bug fixes) have been made to the package since the last release on CRAN, you can check the detailed log of changes here: https://indrajeetpatil.github.io/groupedstats/news/index.html

If you are in a hurry and want to reduce installation time, prefer:

# needed package to download from GitHub repo
utils::install.packages(pkgs = "remotes")

remotes::install_github(
  repo = "IndrajeetPatil/groupedstats", # package path on GitHub
  quick = TRUE
) # skips docs, demos, and vignettes

If time is not a constraint:

remotes::install_github(
  repo = "IndrajeetPatil/groupedstats", # package path on GitHub
  dependencies = TRUE, # installs packages which groupedstats depends on
  upgrade_dependencies = TRUE # updates any out of date dependencies
)

Citation

If you want to cite this package in a scientific journal or in any other context, run the following code in your R console:

utils::citation(package = "groupedstats")

Help

There is a dedicated website for groupedstats, which is updated after every new commit: https://indrajeetpatil.github.io/groupedstats/.

To see the arguments for any of the functions, use args(). For example:

args(groupedstats::grouped_ttest)
#> Registered S3 method overwritten by 'broom.mixed':
#>   method      from 
#>   tidy.gamlss broom
#> function (data, dep.vars, indep.vars, grouping.vars, paired = FALSE, 
#>     var.equal = FALSE) 
#> NULL

If you want to look at the function body for any of the functions, just type the name of the function without parentheses:

groupedstats::grouped_summary

If you are not familiar with what the namespace operator :: does or how to use the pipe operator %>%, both of which this package and its documentation rely on heavily, it is worth reading up on them first.
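
For a quick orientation, here is a minimal illustration of both:

# `::` calls a function from a package's namespace without attaching the package
stats::median(x = mtcars$mpg)

# `%>%` pipes the object on its left into the first argument of the call on its
# right, so this line is equivalent to the one above
library(magrittr)
mtcars$mpg %>% stats::median()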

Usage

Below is a short introduction to the currently available functions in this package:

grouped_ versions of broom generic functions

These functions are re-exported from the broomExtra package and provide the most general versions of the grouped_ functions.

Here is an example:

# for reproducibility
set.seed(123)

# running glm across all combinations of two grouping variables
groupedstats::grouped_tidy(
  data = groupedstats::Titanic_full, # dataframe
  grouping.vars = c(Class, Age),     # grouping variables
  ..f = stats::glm,                  # function to execute
  # additional arguments passed to `..f`
  formula = Survived ~ Sex,
  family = stats::binomial(link = "logit")
)
#> # A tibble: 14 x 7
#>    Class Age   term         estimate  std.error statistic   p.value
#>    <fct> <fct> <chr>           <dbl>      <dbl>     <dbl>     <dbl>
#>  1 1st   Adult (Intercept)  3.56e+ 0      0.507  7.01e+ 0  2.36e-12
#>  2 1st   Adult SexMale     -4.28e+ 0      0.532 -8.05e+ 0  8.36e-16
#>  3 1st   Child (Intercept) -2.46e+ 1 131011.    -1.88e- 4 10.00e- 1
#>  4 1st   Child SexMale     -1.74e-15 143515.    -1.21e-20  1.00e+ 0
#>  5 2nd   Adult (Intercept)  1.82e+ 0      0.299  6.08e+ 0  1.23e- 9
#>  6 2nd   Adult SexMale     -4.21e+ 0      0.409 -1.03e+ 1  6.79e-25
#>  7 2nd   Child (Intercept) -2.56e+ 1  59908.    -4.27e- 4 10.00e- 1
#>  8 2nd   Child SexMale     -7.14e-15  88489.    -8.07e-20  1.00e+ 0
#>  9 3rd   Adult (Intercept) -1.58e- 1      0.156 -1.01e+ 0  3.12e- 1
#> 10 3rd   Adult SexMale     -1.48e+ 0      0.201 -7.39e+ 0  1.51e-13
#> 11 3rd   Child (Intercept) -1.94e- 1      0.361 -5.38e- 1  5.91e- 1
#> 12 3rd   Child SexMale     -7.96e- 1      0.486 -1.64e+ 0  1.01e- 1
#> 13 Crew  Adult (Intercept)  1.90e+ 0      0.619  3.06e+ 0  2.18e- 3
#> 14 Crew  Adult SexMale     -3.15e+ 0      0.625 -5.04e+ 0  4.68e- 7

For more examples, see: https://indrajeetpatil.github.io/broomExtra/reference/index.html#section-grouped-variants-of-generics
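
The other grouped_ generics work analogously. For instance, here is a sketch of the glance analogue of the above call (assuming grouped_glance shares the grouped_tidy interface, as the broomExtra reference suggests):

# for reproducibility
set.seed(123)

# one row of model-fit summaries per combination of the grouping variables,
# instead of one row per model term
groupedstats::grouped_glance(
  data = groupedstats::Titanic_full,
  grouping.vars = c(Class, Age),
  ..f = stats::glm,
  formula = Survived ~ Sex,
  family = stats::binomial(link = "logit")
)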

grouped_summary

Getting summaries for multiple variables across multiple grouping variables. This function is a wrapper around skimr::skim_to_wide() and is meant to be a handy summarizing tool when you have multiple grouping variables and multiple variables for which summary statistics are to be computed:

# for reproducibility
set.seed(123)
library(datasets)
options(tibble.width = Inf) # show me all columns

groupedstats::grouped_summary(
  data = iris,
  grouping.vars = Species,
  measures = Sepal.Length:Petal.Width,
  measures.type = "numeric"
)
#> # A tibble: 12 x 16
#>    Species    type    variable     missing complete     n  mean    sd   min
#>    <fct>      <chr>   <chr>          <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 setosa     numeric Petal.Length       0       50    50  1.46  0.17   1  
#>  2 setosa     numeric Petal.Width        0       50    50  0.25  0.11   0.1
#>  3 setosa     numeric Sepal.Length       0       50    50  5.01  0.35   4.3
#>  4 setosa     numeric Sepal.Width        0       50    50  3.43  0.38   2.3
#>  5 versicolor numeric Petal.Length       0       50    50  4.26  0.47   3  
#>  6 versicolor numeric Petal.Width        0       50    50  1.33  0.2    1  
#>  7 versicolor numeric Sepal.Length       0       50    50  5.94  0.52   4.9
#>  8 versicolor numeric Sepal.Width        0       50    50  2.77  0.31   2  
#>  9 virginica  numeric Petal.Length       0       50    50  5.55  0.55   4.5
#> 10 virginica  numeric Petal.Width        0       50    50  2.03  0.27   1.4
#> 11 virginica  numeric Sepal.Length       0       50    50  6.59  0.64   4.9
#> 12 virginica  numeric Sepal.Width        0       50    50  2.97  0.32   2.2
#>      p25 median   p75   max std.error mean.low.conf mean.high.conf
#>    <dbl>  <dbl> <dbl> <dbl>     <dbl>         <dbl>          <dbl>
#>  1  1.4    1.5   1.58   1.9    0.0240         1.41           1.51 
#>  2  0.2    0.2   0.3    0.6    0.0156         0.219          0.281
#>  3  4.8    5     5.2    5.8    0.0495         4.91           5.11 
#>  4  3.2    3.4   3.68   4.4    0.0537         3.32           3.54 
#>  5  4      4.35  4.6    5.1    0.0665         4.13           4.39 
#>  6  1.2    1.3   1.5    1.8    0.0283         1.27           1.39 
#>  7  5.6    5.9   6.3    7      0.0735         5.79           6.09 
#>  8  2.52   2.8   3      3.4    0.0438         2.68           2.86 
#>  9  5.1    5.55  5.88   6.9    0.0778         5.39           5.71 
#> 10  1.8    2     2.3    2.5    0.0382         1.95           2.11 
#> 11  6.23   6.5   6.9    7.9    0.0905         6.41           6.77 
#> 12  2.8    3     3.18   3.8    0.0453         2.88           3.06

This function can be used to get a summary of either numeric or factor variables, but not both. This is by design. If no measures are specified, the function will compute summaries for all variables of the specified type (numeric or factor).

If you want a summary of factor variables:

# for reproducibility
set.seed(123)
library(ggplot2)
options(tibble.width = Inf) # show me all columns

groupedstats::grouped_summary(
  data = ggplot2::diamonds,
  grouping.vars = c(cut, clarity),
  measures = color,
  measures.type = "factor"
)
#> # A tibble: 40 x 10
#>    cut   clarity type   variable missing complete n     n_unique
#>    <ord> <ord>   <chr>  <chr>    <chr>   <chr>    <chr> <chr>   
#>  1 Fair  I1      factor color    0       210      210   7       
#>  2 Fair  SI2     factor color    0       466      466   7       
#>  3 Fair  SI1     factor color    0       408      408   7       
#>  4 Fair  VS2     factor color    0       261      261   7       
#>  5 Fair  VS1     factor color    0       170      170   7       
#>  6 Fair  VVS2    factor color    0       69       69    7       
#>  7 Fair  VVS1    factor color    0       17       17    7       
#>  8 Fair  IF      factor color    0       9        9     3       
#>  9 Good  I1      factor color    0       96       96    7       
#> 10 Good  SI2     factor color    0       1081     1081  7       
#>    top_counts                     ordered
#>    <chr>                          <chr>  
#>  1 G: 53, H: 52, F: 35, I: 34     TRUE   
#>  2 H: 91, F: 89, G: 80, E: 78     TRUE   
#>  3 F: 83, H: 75, G: 69, E: 65     TRUE   
#>  4 F: 53, G: 45, E: 42, H: 41     TRUE   
#>  5 G: 45, F: 33, H: 32, I: 25     TRUE   
#>  6 G: 17, E: 13, H: 11, F: 10     TRUE   
#>  7 F: 5, D: 3, E: 3, G: 3         TRUE   
#>  8 F: 4, D: 3, G: 2, E: 0         TRUE   
#>  9 E: 23, F: 19, G: 19, H: 14     TRUE   
#> 10 D: 223, E: 202, F: 201, G: 163 TRUE   
#> # ... with 30 more rows

Note that there is a column corresponding to top_counts, which is really useful if you, say, want to plot these counts. But this column is of character type and in wide format. The solution is to use an additional argument provided for this function:

# for reproducibility
set.seed(123)
library(ggplot2)
library(magrittr)
library(ggstatsplot)
#> Registered S3 methods overwritten by 'car':
#>   method                          from
#>   influence.merMod                lme4
#>   cooks.distance.influence.merMod lme4
#>   dfbeta.influence.merMod         lme4
#>   dfbetas.influence.merMod        lme4

options(tibble.width = Inf) # show me all columns

groupedstats::grouped_summary(
  data = ggplot2::diamonds,
  grouping.vars = cut, # for simplicity, let's just use one grouping variable
  measures = color,
  measures.type = "factor",
  topcount.long = TRUE
) %>%
  ggplot2::ggplot(
    data = ., # placeholder for summary dataframe we just created
    mapping = ggplot2::aes(
      x = forcats::fct_inorder(f = factor.level),
      y = count,
      fill = factor.level
    )
  ) +
  ggplot2::geom_bar(stat = "identity") +
  ggplot2::labs(x = "color", y = "count") +
  ggplot2::facet_grid(facets = ~cut) + # one panel for each level of cut
  ggstatsplot::theme_ggstatsplot() +
  ggplot2::theme(legend.position = "none")

This produces a long-format table with two new columns, factor.level and its corresponding count, which can then be fed directly into other pipelines, e.g., preparing a plot of mean and sd values in ggplot2.

As noted above, if no measures are specified, summary statistics are computed for all numeric variables:

# for reproducibility
set.seed(123)
options(tibble.width = Inf) # show me all columns

groupedstats::grouped_summary(
  data = ggplot2::diamonds,
  grouping.vars = c(cut, clarity)
)
#> # A tibble: 280 x 17
#>    cut   clarity type    variable missing complete     n    mean      sd
#>    <ord> <ord>   <chr>   <chr>      <dbl>    <dbl> <dbl>   <dbl>   <dbl>
#>  1 Fair  I1      integer price          0      210   210 3704.   3099.  
#>  2 Fair  I1      numeric carat          0      210   210    1.36    0.75
#>  3 Fair  I1      numeric depth          0      210   210   65.7     3.1 
#>  4 Fair  I1      numeric table          0      210   210   58.1     2.87
#>  5 Fair  I1      numeric x              0      210   210    6.72    1.11
#>  6 Fair  I1      numeric y              0      210   210    6.62    1.13
#>  7 Fair  I1      numeric z              0      210   210    4.39    0.76
#>  8 Fair  SI2     integer price          0      466   466 5174.   3928.  
#>  9 Fair  SI2     numeric carat          0      466   466    1.2     0.5 
#> 10 Fair  SI2     numeric depth          0      466   466   64.4     3.16
#>       min     p25  median     p75      max std.error mean.low.conf
#>     <dbl>   <dbl>   <dbl>   <dbl>    <dbl>     <dbl>         <dbl>
#>  1 584    1387.   2397    5614.   18531     214.           3282.  
#>  2   0.34    0.85    1.06    1.82     5.01    0.0518          1.26
#>  3  55.6    64.7    66.0    67.3     78.2     0.214          65.3 
#>  4  52      56      58      59       67       0.198          57.7 
#>  5   4.72    5.96    6.55    7.46    10.7     0.0766          6.57
#>  6   4.6     5.82    6.42    7.38    10.5     0.0780          6.47
#>  7   2.6     3.77    4.22    4.86     6.98    0.0524          4.29
#>  8 536    2763    3681    6266.   18308     182.           4816.  
#>  9   0.25    0.9     1.01    1.5      3.01    0.0232          1.15
#> 10  53.1    64.5    65.1    65.9     72.2     0.146          64.1 
#>    mean.high.conf
#>             <dbl>
#>  1        4125.  
#>  2           1.46
#>  3          66.1 
#>  4          58.5 
#>  5           6.87
#>  6           6.77
#>  7           4.49
#>  8        5531.  
#>  9           1.25
#> 10          64.7 
#> # ... with 270 more rows

grouped_slr

This function can be used to run simple linear regression (slr) between different pairs of variables across multiple levels of grouping variable(s). For example, we can use the gapminder dataset to study two relationships of interest for each country across years:

  1. life expectancy and GDP (per capita)
  2. population and GDP (per capita)

Thus, in this case we have two regression models and one grouping variable with 142 levels (countries):

# for reproducibility
set.seed(123)
library(gapminder)
options(tibble.width = Inf) # show me all columns

groupedstats::grouped_slr(
  data = gapminder::gapminder,
  dep.vars = c(lifeExp, pop),
  indep.vars = c(gdpPercap, gdpPercap),
  grouping.vars = country
)
#> # A tibble: 284 x 9
#>    country     formula             t.value estimate conf.low conf.high
#>    <fct>       <chr>                 <dbl>    <dbl>    <dbl>     <dbl>
#>  1 Afghanistan lifeExp ~ gdpPercap  -0.151  -0.0475   -0.751     0.656
#>  2 Albania     lifeExp ~ gdpPercap   4.84    0.837     0.452     1.22 
#>  3 Algeria     lifeExp ~ gdpPercap   6.71    0.904     0.604     1.21 
#>  4 Angola      lifeExp ~ gdpPercap  -0.998  -0.301    -0.973     0.371
#>  5 Argentina   lifeExp ~ gdpPercap   4.74    0.832     0.440     1.22 
#>  6 Australia   lifeExp ~ gdpPercap  19.0     0.986     0.871     1.10 
#>  7 Austria     lifeExp ~ gdpPercap  26.5     0.993     0.910     1.08 
#>  8 Bahrain     lifeExp ~ gdpPercap   6.45    0.898     0.587     1.21 
#>  9 Bangladesh  lifeExp ~ gdpPercap   5.05    0.847     0.473     1.22 
#> 10 Belgium     lifeExp ~ gdpPercap  26.1     0.993     0.908     1.08 
#>    std.error  p.value significance
#>        <dbl>    <dbl> <chr>       
#>  1    0.316  8.83e- 1 ns          
#>  2    0.173  6.82e- 4 ***         
#>  3    0.135  5.33e- 5 ***         
#>  4    0.302  3.42e- 1 ns          
#>  5    0.176  7.97e- 4 ***         
#>  6    0.0519 3.52e- 9 ***         
#>  7    0.0374 1.34e-10 ***         
#>  8    0.139  7.38e- 5 ***         
#>  9    0.168  5.03e- 4 ***         
#> 10    0.0380 1.56e-10 ***         
#> # ... with 274 more rows

Notice the order in which the dependent and independent variables are entered; there are two separate regression models being run here: lifeExp ~ gdpPercap and pop ~ gdpPercap. If this order is incorrect, the result will also be incorrect, so it is always a good idea to check the formula column to see if you have run the correct linear models. Also note that the estimates are already standardized, i.e., they are standardized regression coefficients (betas).
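
As a quick sanity check (a sketch, not from the original documentation), one of these standardized estimates can be reproduced by rescaling the variables before fitting the model:

# for reproducibility
set.seed(123)

# refit the Afghanistan model on standardized variables; the slope should
# match the estimate reported above for that country (about -0.0475)
afg <- dplyr::filter(.data = gapminder::gapminder, country == "Afghanistan")
stats::coef(stats::lm(formula = scale(lifeExp) ~ scale(gdpPercap), data = afg))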

The prior example used just one grouping variable; this can be done with multiple grouping variables as well. For example, with the diamonds dataset from the ggplot2 library, let’s assess the relationship between the carat and price of a diamond for each type of clarity and cut:

# for reproducibility
set.seed(123)
library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
options(tibble.width = Inf) # show me all columns

groupedstats::grouped_slr(
  data = ggplot2::diamonds,
  dep.vars = price,
  indep.vars = carat,
  grouping.vars = c(cut, clarity)
) %>%
  dplyr::arrange(.data = ., cut)
#> # A tibble: 40 x 10
#>    cut   clarity formula       t.value estimate conf.low conf.high
#>    <ord> <ord>   <chr>           <dbl>    <dbl>    <dbl>     <dbl>
#>  1 Fair  VS2     price ~ carat   42.0     0.934    0.890     0.978
#>  2 Fair  SI2     price ~ carat   69.1     0.955    0.928     0.982
#>  3 Fair  SI1     price ~ carat   58.9     0.946    0.915     0.978
#>  4 Fair  I1      price ~ carat   58.7     0.971    0.939     1.00 
#>  5 Fair  VVS1    price ~ carat    8.58    0.911    0.685     1.14 
#>  6 Fair  VS1     price ~ carat   36.6     0.943    0.892     0.994
#>  7 Fair  IF      price ~ carat    8.22    0.952    0.678     1.23 
#>  8 Fair  VVS2    price ~ carat   13.6     0.857    0.732     0.983
#>  9 Good  VS1     price ~ carat   74.0     0.946    0.921     0.971
#> 10 Good  SI2     price ~ carat  103.      0.953    0.934     0.971
#>    std.error   p.value significance
#>        <dbl>     <dbl> <chr>       
#>  1   0.0222  1.55e-117 ***         
#>  2   0.0138  1.96e-246 ***         
#>  3   0.0161  4.58e-201 ***         
#>  4   0.0165  1.86e-131 ***         
#>  5   0.106   3.59e-  7 ***         
#>  6   0.0257  5.40e- 82 ***         
#>  7   0.116   7.67e-  5 ***         
#>  8   0.0629  5.54e- 21 ***         
#>  9   0.0128  9.62e-318 ***         
#> 10   0.00926 0.        ***         
#> # ... with 30 more rows

A more general version of this function (grouped_lm), which utilizes the formula interface of stats::lm, is described below.

grouped_robustslr

There is also a robust variant of simple linear regression (as implemented in robust::lmRob):

# for reproducibility
set.seed(123)
library(gapminder)
library(dplyr)
options(tibble.width = Inf) # show me all columns

groupedstats::grouped_robustslr(
  data = gapminder::gapminder,
  dep.vars = c(lifeExp, pop),
  indep.vars = c(gdpPercap, gdpPercap),
  grouping.vars = c(continent, country)
) %>%
  dplyr::arrange(.data = ., continent, country)
#> # A tibble: 284 x 8
#>    continent country      formula             t.value estimate std.error
#>    <fct>     <fct>        <chr>                 <dbl>    <dbl>     <dbl>
#>  1 Africa    Algeria      lifeExp ~ gdpPercap   5.82     0.904    0.155 
#>  2 Africa    Algeria      pop ~ gdpPercap       2.49     0.869    0.349 
#>  3 Africa    Angola       lifeExp ~ gdpPercap  -0.734   -0.413    0.563 
#>  4 Africa    Angola       pop ~ gdpPercap      -2.45    -0.541    0.221 
#>  5 Africa    Benin        lifeExp ~ gdpPercap   2.46     0.773    0.315 
#>  6 Africa    Benin        pop ~ gdpPercap       7.18     0.929    0.129 
#>  7 Africa    Botswana     lifeExp ~ gdpPercap   4.42     1.47     0.332 
#>  8 Africa    Botswana     pop ~ gdpPercap      15.3      1.08     0.0706
#>  9 Africa    Burkina Faso lifeExp ~ gdpPercap   2.14     0.882    0.413 
#> 10 Africa    Burkina Faso pop ~ gdpPercap       7.17     0.920    0.128 
#>         p.value significance
#>           <dbl> <chr>       
#>  1 0.000168     ***         
#>  2 0.0319       *           
#>  3 0.480        ns          
#>  4 0.0344       *           
#>  5 0.0338       *           
#>  6 0.0000299    ***         
#>  7 0.00130      **          
#>  8 0.0000000294 ***         
#>  9 0.0583       ns          
#> 10 0.0000304    ***         
#> # ... with 274 more rows

A more general version of this function (grouped_robustlm), which will utilize the formula interface of robust::lmRob, will be implemented in the future.

grouped_lm

A more general version of simple linear regression is stats::lm, implemented in grouped_lm:

# for reproducibility
set.seed(123)
library(groupedstats)
#> 
#> Attaching package: 'groupedstats'
#> The following objects are masked from 'package:ggstatsplot':
#> 
#>     movies_long, movies_wide

groupedstats::grouped_lm(
  data = mtcars,
  grouping.vars = cyl, # grouping variable (just one in this case)
  formula = mpg ~ am * wt, # note that this function takes a formula
  output = "tidy" # tidy dataframe containing results
)
#> # A tibble: 12 x 9
#>      cyl term        estimate std.error statistic   p.value conf.low
#>    <dbl> <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>
#>  1     4 (Intercept)   13.9      16.1      0.865  0.416       -24.1 
#>  2     4 am            30.3      17.2      1.77   0.121       -10.3 
#>  3     4 wt             3.07      5.44     0.564  0.590        -9.79
#>  4     4 am:wt        -11.0       6.16    -1.78   0.118       -25.5 
#>  5     6 (Intercept)   63.6      14.1      4.51   0.0204       18.7 
#>  6     6 am           -41.4      19.0     -2.18   0.117      -102.  
#>  7     6 wt           -13.1       4.16    -3.16   0.0511      -26.4 
#>  8     6 am:wt         12.5       6.22     2.02   0.137        -7.26
#>  9     8 (Intercept)   25.1       3.51     7.14   0.0000315    17.2 
#> 10     8 am            -2.92     25.9     -0.113  0.912       -60.5 
#> 11     8 wt            -2.44      0.842   -2.90   0.0159       -4.32
#> 12     8 am:wt          0.439     7.63     0.0575 0.955       -16.6 
#>    conf.high significance
#>        <dbl> <chr>       
#>  1    51.9   ns          
#>  2    70.9   ns          
#>  3    15.9   ns          
#>  4     3.61  ns          
#>  5   109.    *           
#>  6    19.1   ns          
#>  7     0.113 ns          
#>  8    32.4   ns          
#>  9    32.9   ***         
#> 10    54.7   ns          
#> 11    -0.563 *           
#> 12    17.4   ns

The same function can also be used to get model summaries instead of a tidy dataframe containing results:

# for reproducibility
set.seed(123)
library(groupedstats)

groupedstats::grouped_lm(
  data = ggplot2::diamonds,
  grouping.vars = c(cut, color), # grouping variables
  formula = price ~ carat * clarity, # formula
  output = "glance" # dataframe with model summaries
)
#> # A tibble: 35 x 14
#>    cut   color r.squared adj.r.squared sigma statistic   p.value    df
#>    <ord> <ord>     <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>
#>  1 Fair  D         0.915         0.906 1005.      106. 1.03e- 70    15
#>  2 Fair  E         0.917         0.912  883.      179. 8.70e-106    13
#>  3 Fair  F         0.917         0.912  954.      217. 9.37e-150    15
#>  4 Fair  G         0.932         0.929  964.      273. 5.03e-164    15
#>  5 Fair  H         0.932         0.929 1033.      332. 1.76e-161    12
#>  6 Fair  I         0.958         0.955  794.      307. 1.14e-104    12
#>  7 Fair  J         0.955         0.950  907.      204. 1.82e- 66    11
#>  8 Good  D         0.933         0.931  831.      600. 0.           15
#>  9 Good  E         0.927         0.926  905.      781. 0.           15
#> 10 Good  F         0.915         0.914  939.      644. 0.           15
#>    logLik    AIC    BIC   deviance df.residual  nobs
#>     <dbl>  <dbl>  <dbl>      <dbl>       <int> <int>
#>  1 -1350.  2733.  2786. 148463951.         147   163
#>  2 -1830.  3690.  3741. 163677866.         210   224
#>  3 -2575.  5184.  5248. 269389292.         296   312
#>  4 -2595.  5224.  5288. 277025077.         298   314
#>  5 -2526.  5081.  5133. 309753202.         290   303
#>  6 -1410.  2848.  2892. 102049583.         162   175
#>  7  -973.  1972.  2008.  88046170.         107   119
#>  8 -5382. 10797. 10874. 446239835.         646   662
#>  9 -7667. 15369. 15451. 750592238.         917   933
#> 10 -7504. 15042. 15123. 787519475.         893   909
#> # ... with 25 more rows

grouped_aov

A function related to stats::lm is stats::aov, which fits an analysis of variance model for each group. Contrast the output you get here with the previous output for the same model from the grouped_lm function. The estimate in this case will be an effect size (either partial eta-squared or partial omega-squared).

# for reproducibility
set.seed(123)
library(groupedstats)

groupedstats::grouped_aov(
  data = mtcars,
  grouping.vars = cyl, # grouping variable (just one in this case)
  formula = mpg ~ am * wt, # note that this function takes a formula
  output = "tidy" # tidy dataframe with results
)
#> # A tibble: 9 x 10
#>     cyl term  F.value   df1   df2 p.value partial.etasq conf.low conf.high
#>   <dbl> <chr>   <dbl> <dbl> <dbl>   <dbl>         <dbl>    <dbl>     <dbl>
#> 1     6 am    5.07        1     3  0.110       0.628      0         0.820 
#> 2     6 wt    5.91        1     3  0.0933      0.663      0         0.835 
#> 3     6 am:wt 4.06        1     3  0.137       0.575      0         0.796 
#> 4     4 am    5.95        1     7  0.0448      0.459      0         0.711 
#> 5     4 wt    4.59        1     7  0.0693      0.396      0         0.676 
#> 6     4 am:wt 3.17        1     7  0.118       0.311      0         0.627 
#> 7     8 am    0.0456      1    10  0.835       0.00454    0         0.242 
#> 8     8 wt    8.45        1    10  0.0156      0.458      0.0190    0.691 
#> 9     8 am:wt 0.00331     1    10  0.955       0.000330   0         0.0885
#>   significance
#>   <chr>       
#> 1 ns          
#> 2 ns          
#> 3 ns          
#> 4 *           
#> 5 ns          
#> 6 ns          
#> 7 ns          
#> 8 *           
#> 9 ns

The same function can also be used to compute Tukey’s Honest Significant Differences (HSD) test. For example, we can check for differences in life expectancy between different continents for all years for which the gapminder survey was conducted:

# for reproducibility
set.seed(123)
library(groupedstats)
library(gapminder)

groupedstats::grouped_aov(
  data = gapminder::gapminder,
  grouping.vars = year,
  formula = lifeExp ~ continent,
  output = "tukey"
)
#> Note: The p-value is adjusted for the number of tests conducted at each level of the grouping variable.
#> 
#> # A tibble: 120 x 8
#>     year term      comparison       estimate conf.low conf.high adj.p.value
#>    <int> <chr>     <chr>               <dbl>    <dbl>     <dbl>       <dbl>
#>  1  1952 continent Americas-Africa     14.1      9.21     19.1     7.35e-12
#>  2  1952 continent Asia-Africa          7.18     2.66     11.7     2.11e- 4
#>  3  1952 continent Europe-Africa       25.3     20.6      29.9     1.01e-14
#>  4  1952 continent Oceania-Africa      30.1     15.5      44.7     7.12e- 7
#>  5  1952 continent Asia-Americas       -6.97   -12.3      -1.59    4.28e- 3
#>  6  1952 continent Europe-Americas     11.1      5.64     16.6     1.12e- 6
#>  7  1952 continent Oceania-Americas    16.0      1.07     30.9     2.91e- 2
#>  8  1952 continent Europe-Asia         18.1     13.0      23.2     5.87e-14
#>  9  1952 continent Oceania-Asia        22.9      8.17     37.7     3.16e- 4
#> 10  1952 continent Oceania-Europe       4.85    -9.97     19.7     8.95e- 1
#>    significance
#>    <chr>       
#>  1 ***         
#>  2 ***         
#>  3 ***         
#>  4 ***         
#>  5 **          
#>  6 ***         
#>  7 *           
#>  8 ***         
#>  9 ***         
#> 10 ns          
#> # ... with 110 more rows

Note that the p-value is adjusted for the number of tests conducted at each level of the grouping variable, and not across all tests conducted.

grouped_glm

The option to run a generalized linear model (stats::glm) across different levels of the grouping variable is also implemented similarly:

# for reproducibility
set.seed(123)

groupedstats::grouped_glm(
  data = ggstatsplot::Titanic_full,
  formula = Survived ~ Sex,
  grouping.vars = Class,
  family = stats::binomial(link = "logit"),
  output = "tidy"
)
#> # A tibble: 8 x 9
#>   Class term        estimate std.error statistic  p.value conf.low
#>   <fct> <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>
#> 1 1st   (Intercept)    3.56      0.507      7.03 2.13e-12    2.70 
#> 2 1st   SexMale       -4.21      0.531     -7.92 2.29e-15   -5.42 
#> 3 2nd   (Intercept)    1.97      0.296      6.65 3.03e-11    1.43 
#> 4 2nd   SexMale       -3.79      0.366    -10.3  4.88e-25   -4.54 
#> 5 3rd   (Intercept)   -0.164     0.143     -1.14 2.54e- 1   -0.446
#> 6 3rd   SexMale       -1.40      0.185     -7.58 3.36e-14   -1.77 
#> 7 Crew  (Intercept)    1.90      0.619      3.06 2.18e- 3    0.827
#> 8 Crew  SexMale       -3.15      0.625     -5.04 4.68e- 7   -4.60 
#>   conf.high significance
#>       <dbl> <chr>       
#> 1     4.74  ***         
#> 2    -3.29  ***         
#> 3     2.60  ***         
#> 4    -3.10  ***         
#> 5     0.117 ns          
#> 6    -1.04  ***         
#> 7     3.34  **          
#> 8    -2.06  ***

Note that the statistic will be either a t value (e.g., for the gaussian family) or a z value (e.g., for the binomial family), depending on the family of models used.

You can also get model summaries across all models with broom::glance methods:

# for reproducibility
set.seed(123)

groupedstats::grouped_glm(
  data = ggstatsplot::Titanic_full,
  formula = Survived ~ Sex,
  grouping.vars = Class,
  family = stats::binomial(link = "logit"),
  output = "glance"
)
#> # A tibble: 4 x 9
#>   Class null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
#>   <fct>         <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
#> 1 1st            430.     324  -134.  272.  280.     268.         323   325
#> 2 2nd            387.     284  -112.  228.  235.     224.         283   285
#> 3 3rd            797.     705  -370.  744.  753.     740.         704   706
#> 4 Crew           974.     884  -466.  936.  946.     932.         883   885

grouped_lmer

Linear mixed-effects analyses (lme4::lmer) for all combinations of grouping variable levels can be carried out using grouped_lmer:

# for reproducibility
set.seed(123)
library(gapminder)

# getting tidy output of results
groupedstats::grouped_lmer(
  data = gapminder,
  formula = scale(lifeExp) ~ scale(gdpPercap) + (gdpPercap |
    continent),
  grouping.vars = year,
  REML = FALSE,
  output = "tidy"
)
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> # A tibble: 24 x 10
#>     year effect term             estimate std.error t.value conf.low
#>    <int> <chr>  <chr>               <dbl>     <dbl>   <dbl>    <dbl>
#>  1  1952 fixed  (Intercept)         0.215     0.293   0.736   -0.358
#>  2  1952 fixed  scale(gdpPercap)    0.930     0.302   3.07     0.337
#>  3  1957 fixed  (Intercept)         0.254     0.342   0.743   -0.416
#>  4  1957 fixed  scale(gdpPercap)    0.815     0.282   2.89     0.262
#>  5  1962 fixed  (Intercept)         0.255     0.333   0.766   -0.397
#>  6  1962 fixed  scale(gdpPercap)    0.591     0.210   2.82     0.180
#>  7  1967 fixed  (Intercept)         0.249     0.361   0.689   -0.459
#>  8  1967 fixed  scale(gdpPercap)    0.387     0.120   3.24     0.153
#>  9  1972 fixed  (Intercept)         0.276     0.366   0.753   -0.442
#> 10  1972 fixed  scale(gdpPercap)    0.431     0.150   2.88     0.137
#>    conf.high p.value significance
#>        <dbl>   <dbl> <chr>       
#>  1     0.789 0.462   ns          
#>  2     1.52  0.00211 **          
#>  3     0.923 0.457   ns          
#>  4     1.37  0.00389 **          
#>  5     0.908 0.444   ns          
#>  6     1.00  0.00479 **          
#>  7     0.956 0.491   ns          
#>  8     0.622 0.00120 **          
#>  9     0.994 0.451   ns          
#> 10     0.724 0.00402 **          
#> # ... with 14 more rows

# getting model summaries instead of tidy output
groupedstats::grouped_lmer(
  data = gapminder,
  formula = scale(lifeExp) ~ scale(gdpPercap) + (gdpPercap |
    continent),
  grouping.vars = year,
  REML = FALSE,
  output = "glance"
)
#> # A tibble: 12 x 7
#>     year sigma logLik   AIC   BIC deviance df.residual
#>    <int> <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int>
#>  1  1952 0.531  -124.  259.  277.     247.         136
#>  2  1957 0.540  -127.  266.  284.     254.         136
#>  3  1962 0.546  -128.  268.  285.     256.         136
#>  4  1967 0.552  -128.  269.  287.     257.         136
#>  5  1972 0.565  -132.  276.  294.     264.         136
#>  6  1977 0.582  -132.  277.  294.     265.         136
#>  7  1982 0.544  -121.  253.  271.     241.         136
#>  8  1987 0.498  -110.  231.  249.     219.         136
#>  9  1992 0.518  -117.  246.  264.     234.         136
#> 10  1997 0.498  -109.  231.  248.     219.         136
#> 11  2002 0.505  -113.  238.  255.     226.         136
#> 12  2007 0.525  -116.  244.  262.     232.         136

grouped_glmer

A more generalized version of lmer is implemented in lme4::glmer, which can also handle categorical/nominal data. For example, let’s say we want to see if the sex of a person was predictive of whether they survived the Titanic tragedy.

# for reproducibility
set.seed(123)

# having a look at the data
dplyr::glimpse(groupedstats::Titanic_full)
#> Observations: 2,201
#> Variables: 5
#> $ id       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
#> $ Class    <fct> 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd...
#> $ Sex      <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male,...
#> $ Age      <fct> Child, Child, Child, Child, Child, Child, Child, Chil...
#> $ Survived <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N...

# running glmer model to get tidy output
groupedstats::grouped_glmer(
  formula = Survived ~ Age + (Age | Class),
  data = groupedstats::Titanic_full,
  family = stats::binomial(link = "probit"), # choosing the appropriate GLM family
  control = lme4::glmerControl( # choosing appropriate control
    optimizer = "Nelder_Mead",
    boundary.tol = 1e-07,
    calc.derivs = FALSE,
    optCtrl = list(maxfun = 2e9)
  ),
  grouping.vars = Sex, # grouping variables (just one in this case)
  output = "tidy",
  tidy.args = list(conf.int = TRUE, effects = "fixed")
)
#> # A tibble: 4 x 10
#>   Sex    effect term        estimate std.error statistic      p.value
#>   <fct>  <chr>  <chr>          <dbl>     <dbl>     <dbl>        <dbl>
#> 1 Female fixed  (Intercept)    1.01      0.379      2.66 0.00789     
#> 2 Female fixed  AgeChild       1.47      0.595      2.48 0.0133      
#> 3 Male   fixed  (Intercept)   -0.888     0.162     -5.47 0.0000000453
#> 4 Male   fixed  AgeChild       4.90      4.03       1.22 0.224       
#>   conf.low conf.high significance
#>      <dbl>     <dbl> <chr>       
#> 1    0.264     1.75  **          
#> 2    0.307     2.64  *           
#> 3   -1.21     -0.570 ***         
#> 4   -3.00     12.8   ns

# getting glmer model summaries (let's use the default family and control values)
groupedstats::grouped_glmer(
  data = groupedstats::Titanic_full,
  grouping.vars = Sex,
  formula = Survived ~ Age + (Age | Class),
  family = stats::binomial(link = "probit"),
  output = "glance"
)
#> boundary (singular) fit: see ?isSingular
#> # A tibble: 2 x 7
#>   Sex    sigma logLik   AIC   BIC deviance df.residual
#>   <fct>  <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int>
#> 1 Female     1  -208.  426.  447.     400.         465
#> 2 Male       1  -860. 1730. 1757.    1698.        1726

Note that the statistic will be either a t value (e.g., for the gaussian family) or a z value (e.g., for the binomial family), depending on the family of models used.

grouped_proptest

This function helps carry out one-sample proportion tests (stats::chisq.test) on a single variable across multiple grouping variables:

# for reproducibility
set.seed(123)
options(tibble.width = Inf) # show me all columns

groupedstats::grouped_proptest(
  data = mtcars,
  grouping.vars = cyl,
  measure = am
)
#> # A tibble: 3 x 8
#>     cyl `0`    `1`    statistic p.value parameter
#>   <dbl> <chr>  <chr>      <dbl>   <dbl>     <dbl>
#> 1     6 57.14% 42.86%     0.143 0.705           1
#> 2     4 27.27% 72.73%     2.27  0.132           1
#> 3     8 85.71% 14.29%     7.14  0.00753         1
#>   method                                   significance
#>   <chr>                                    <chr>       
#> 1 Chi-squared test for given probabilities ns          
#> 2 Chi-squared test for given probabilities ns          
#> 3 Chi-squared test for given probabilities **

grouped_ttest

This function can help you carry out t-tests, paired or independent, on multiple variables across multiple groups. Demonstrating how to use this function first requires getting the iris dataset into long format. Let’s say we want to investigate whether the sepal part of the flower has a greater measurement (length or width) than the petal part for each Iris species.

# for reproducibility
set.seed(123)

# converting the iris dataset to long format
iris_long <- iris %>%
  dplyr::mutate(.data = ., id = dplyr::row_number(x = Species)) %>%
  tidyr::gather(
    data = .,
    key = "condition",
    value = "value",
    Sepal.Length:Petal.Width,
    convert = TRUE,
    factor_key = TRUE
  ) %>%
  tidyr::separate(
    col = "condition",
    into = c("part", "measure"),
    sep = "\\.",
    convert = TRUE
  ) %>%
  tibble::as_tibble(x = .)

# check the long format iris dataset
iris_long
#> # A tibble: 600 x 5
#>    Species    id part  measure value
#>    <fct>   <int> <chr> <chr>   <dbl>
#>  1 setosa      1 Sepal Length    5.1
#>  2 setosa      2 Sepal Length    4.9
#>  3 setosa      3 Sepal Length    4.7
#>  4 setosa      4 Sepal Length    4.6
#>  5 setosa      5 Sepal Length    5  
#>  6 setosa      6 Sepal Length    5.4
#>  7 setosa      7 Sepal Length    4.6
#>  8 setosa      8 Sepal Length    5  
#>  9 setosa      9 Sepal Length    4.4
#> 10 setosa     10 Sepal Length    4.9
#> # ... with 590 more rows

# checking if the Sepal part has different dimensions (value) than the Petal part
# for each Species and for each type of measurement (Length and Width)
options(tibble.width = Inf) # show me all columns

groupedstats::grouped_ttest(
  data = iris_long,
  dep.vars = value, # dependent variable
  indep.vars = part, # independent variable
  grouping.vars = c(Species, measure), # for each Species and for each measurement
  paired = TRUE # paired t-test
)
#> # A tibble: 6 x 11
#>   Species    measure formula      method        t.test estimate conf.low
#>   <fct>      <chr>   <chr>        <chr>          <dbl>    <dbl>    <dbl>
#> 1 setosa     Length  value ~ part Paired t-test  -71.8   -3.54     -3.64
#> 2 versicolor Length  value ~ part Paired t-test  -34.0   -1.68     -1.78
#> 3 virginica  Length  value ~ part Paired t-test  -22.9   -1.04     -1.13
#> 4 setosa     Width   value ~ part Paired t-test  -61.0   -3.18     -3.29
#> 5 versicolor Width   value ~ part Paired t-test  -43.5   -1.44     -1.51
#> 6 virginica  Width   value ~ part Paired t-test  -23.1   -0.948    -1.03
#>   conf.high parameter  p.value significance
#>       <dbl>     <dbl>    <dbl> <chr>       
#> 1    -3.44         49 2.54e-51 ***         
#> 2    -1.58         49 9.67e-36 ***         
#> 3    -0.945        49 7.99e-28 ***         
#> 4    -3.08         49 7.21e-48 ***         
#> 5    -1.38         49 8.42e-41 ***         
#> 6    -0.866        49 5.34e-28 ***

grouped_wilcox

This function is just a non-parametric variant of grouped_ttest:

# for reproducibility
set.seed(123)
options(tibble.width = Inf) # show me all columns

groupedstats::grouped_wilcox(
  data = iris_long,
  dep.vars = value, # dependent variable
  indep.vars = part, # independent variable
  grouping.vars = c(Species, measure), # for each Species and for each measurement
  paired = TRUE # paired Wilcoxon signed rank test with continuity correction
)
#> # A tibble: 6 x 10
#>   Species    measure formula     
#>   <fct>      <chr>   <chr>       
#> 1 setosa     Length  value ~ part
#> 2 versicolor Length  value ~ part
#> 3 virginica  Length  value ~ part
#> 4 setosa     Width   value ~ part
#> 5 versicolor Width   value ~ part
#> 6 virginica  Width   value ~ part
#>   method                                               statistic estimate
#>   <chr>                                                    <dbl>    <dbl>
#> 1 Wilcoxon signed rank test with continuity correction         0   -3.55 
#> 2 Wilcoxon signed rank test with continuity correction         0   -1.70 
#> 3 Wilcoxon signed rank test with continuity correction         0   -1.05 
#> 4 Wilcoxon signed rank test with continuity correction         0   -3.15 
#> 5 Wilcoxon signed rank test with continuity correction         0   -1.45 
#> 6 Wilcoxon signed rank test with continuity correction         0   -0.950
#>   conf.low conf.high  p.value significance
#>      <dbl>     <dbl>    <dbl> <chr>       
#> 1    -3.65    -3.45  7.62e-10 ***         
#> 2    -1.80    -1.60  7.45e-10 ***         
#> 3    -1.15    -0.950 7.48e-10 ***         
#> 4    -3.25    -3.10  7.40e-10 ***         
#> 5    -1.55    -1.40  7.58e-10 ***         
#> 6    -1.00    -0.850 7.67e-10 ***

While we are at it, let’s also check out examples for the t-test and Wilcoxon test in the case of between-subjects designs.

We will use the diamonds dataset from ggplot2 and see whether the price and depth of a diamond differ between two of our favorite colors (say, E and J) for each type of clarity.

# for reproducibility
set.seed(123)

# subset the dataframe with two colors of interest to us
diamonds_short <-
  dplyr::filter(.data = ggplot2::diamonds, color == "E" |
    color == "J")

options(tibble.width = Inf, tibble.print_max = Inf) # show me all rows and columns

# t-test
groupedstats::grouped_ttest(
  data = diamonds_short,
  dep.vars = c(carat, price, depth), # note that there three dependent variables
  indep.vars = color, # and just one independent variable
  grouping.vars = clarity, # one grouping variable
  paired = FALSE,
  var.equal = FALSE
)
#> # A tibble: 24 x 10
#>    clarity formula       method                   t.test   estimate
#>    <ord>   <chr>         <chr>                     <dbl>      <dbl>
#>  1 SI2     carat ~ color Welch Two Sample t-test -18.0      -0.503 
#>  2 SI1     carat ~ color Welch Two Sample t-test -22.1      -0.462 
#>  3 VS1     carat ~ color Welch Two Sample t-test -16.8      -0.444 
#>  4 VVS2    carat ~ color Welch Two Sample t-test -12.0      -0.553 
#>  5 VS2     carat ~ color Welch Two Sample t-test -24.6      -0.542 
#>  6 I1      carat ~ color Welch Two Sample t-test  -4.48     -0.644 
#>  7 VVS1    carat ~ color Welch Two Sample t-test  -6.53     -0.417 
#>  8 IF      carat ~ color Welch Two Sample t-test  -2.38     -0.198 
#>  9 SI2     price ~ color Welch Two Sample t-test -10.6   -2347.    
#> 10 SI1     price ~ color Welch Two Sample t-test -12.6   -2024.    
#> 11 VS1     price ~ color Welch Two Sample t-test  -9.26  -2028.    
#> 12 VVS2    price ~ color Welch Two Sample t-test  -6.83  -2643.    
#> 13 VS2     price ~ color Welch Two Sample t-test -14.3   -2560.    
#> 14 I1      price ~ color Welch Two Sample t-test  -2.76  -1766.    
#> 15 VVS1    price ~ color Welch Two Sample t-test  -3.35  -1814.    
#> 16 IF      price ~ color Welch Two Sample t-test   0.378   305.    
#> 17 SI2     depth ~ color Welch Two Sample t-test  -2.64     -0.231 
#> 18 SI1     depth ~ color Welch Two Sample t-test   0.333     0.0208
#> 19 VS1     depth ~ color Welch Two Sample t-test  -6.61     -0.417 
#> 20 VVS2    depth ~ color Welch Two Sample t-test  -2.20     -0.277 
#> 21 VS2     depth ~ color Welch Two Sample t-test  -2.28     -0.145 
#> 22 I1      depth ~ color Welch Two Sample t-test  -3.96     -2.10  
#> 23 VVS1    depth ~ color Welch Two Sample t-test  -2.53     -0.370 
#> 24 IF      depth ~ color Welch Two Sample t-test  -2.16     -0.386 
#>     conf.low  conf.high parameter   p.value significance
#>        <dbl>      <dbl>     <dbl>     <dbl> <chr>       
#>  1    -0.558    -0.448      634.  9.24e- 59 ***         
#>  2    -0.503    -0.420      948.  2.47e- 87 ***         
#>  3    -0.496    -0.392      663.  2.91e- 53 ***         
#>  4    -0.644    -0.461      140.  3.81e- 23 ***         
#>  5    -0.586    -0.499      856.  1.51e-101 ***         
#>  6    -0.932    -0.357       60.6 3.34e-  5 ***         
#>  7    -0.545    -0.290       76.1 6.68e-  9 ***         
#>  8    -0.364    -0.0318      60.1 2.03e-  2 *           
#>  9 -2782.    -1913.         676.  2.02e- 24 ***         
#> 10 -2339.    -1709.        1031.  5.75e- 34 ***         
#> 11 -2458.    -1598.         794.  1.86e- 19 ***         
#> 12 -3407.    -1878.         151.  1.94e- 10 ***         
#> 13 -2911.    -2209.         943.  3.07e- 42 ***         
#> 14 -3044.     -487.          62.0 7.58e-  3 **          
#> 15 -2893.     -735.          81.2 1.24e-  3 **          
#> 16 -1299.     1909.          81.0 7.07e-  1 ns          
#> 17    -0.402    -0.0588     772.  8.58e-  3 **          
#> 18    -0.102     0.143     1245.  7.39e-  1 ns          
#> 19    -0.541    -0.293     1040.  6.08e- 11 ***         
#> 20    -0.526    -0.0279     158.  2.95e-  2 *           
#> 21    -0.271    -0.0203    1084.  2.28e-  2 *           
#> 22    -3.16     -1.05        81.6 1.61e-  4 ***         
#> 23    -0.661    -0.0792      88.1 1.32e-  2 *           
#> 24    -0.739    -0.0321     111.  3.28e-  2 *

# wilcox test (aka Mann-Whitney U-test)
groupedstats::grouped_wilcox(
  data = diamonds_short,
  dep.vars = depth:price, # note that you can select variables in range with `:`
  indep.vars = color, # again, just one independent, multiple dependent variables case
  grouping.vars = clarity, # one grouping variable
  paired = FALSE
)
#> # A tibble: 16 x 9
#>    clarity formula       method                                           
#>    <ord>   <chr>         <chr>                                            
#>  1 SI2     depth ~ color Wilcoxon rank sum test with continuity correction
#>  2 SI1     depth ~ color Wilcoxon rank sum test with continuity correction
#>  3 VS1     depth ~ color Wilcoxon rank sum test with continuity correction
#>  4 VVS2    depth ~ color Wilcoxon rank sum test with continuity correction
#>  5 VS2     depth ~ color Wilcoxon rank sum test with continuity correction
#>  6 I1      depth ~ color Wilcoxon rank sum test with continuity correction
#>  7 VVS1    depth ~ color Wilcoxon rank sum test with continuity correction
#>  8 IF      depth ~ color Wilcoxon rank sum test with continuity correction
#>  9 SI2     price ~ color Wilcoxon rank sum test with continuity correction
#> 10 SI1     price ~ color Wilcoxon rank sum test with continuity correction
#> 11 VS1     price ~ color Wilcoxon rank sum test with continuity correction
#> 12 VVS2    price ~ color Wilcoxon rank sum test with continuity correction
#> 13 VS2     price ~ color Wilcoxon rank sum test with continuity correction
#> 14 I1      price ~ color Wilcoxon rank sum test with continuity correction
#> 15 VVS1    price ~ color Wilcoxon rank sum test with continuity correction
#> 16 IF      price ~ color Wilcoxon rank sum test with continuity correction
#>    statistic      estimate   conf.low     conf.high  p.value significance
#>        <dbl>         <dbl>      <dbl>         <dbl>    <dbl> <chr>       
#>  1   380600     -0.200        -0.300     -0.0000364 1.54e- 2 *           
#>  2   918892      0.0000393    -0.1000     0.100     6.77e- 1 ns          
#>  3   276743     -0.400        -0.500     -0.300     7.03e-12 ***         
#>  4    52618     -0.400        -0.600     -0.200     4.18e- 4 ***         
#>  5   816784.    -0.200        -0.300     -0.1000    8.86e- 5 ***         
#>  6     1570.    -2.00         -3.00      -1.000     1.21e- 4 ***         
#>  7    19455     -0.300        -0.500     -0.1000    5.07e- 3 **          
#>  8     3286     -0.300        -0.700     -0.0000251 4.79e- 2 *           
#>  9   269016  -1844.        -2182.     -1505.        8.82e-31 ***         
#> 10   608724. -1510.        -1789.     -1312.        8.20e-43 ***         
#> 11   277389  -1095.        -1414.      -745.        1.11e-11 ***         
#> 12    42795  -1764.        -2920.     -1205.        2.23e-10 ***         
#> 13   560930  -1888.        -2228.     -1539.        1.08e-54 ***         
#> 14     2017  -1150.        -1832.       -79.0       3.68e- 2 *           
#> 15    19086   -362.         -859.      -107.        2.57e- 3 **          
#> 16     5085    296.          125.       449.        4.94e- 3 **

We can focus on just two levels of clarity to elucidate another aspect of entering the arguments:

# for reproducibility
set.seed(123)

# subset the dataframe even further to just select two levels of clarity
diamonds_short2 <-
  dplyr::filter(.data = diamonds_short, clarity == "SI2" |
    clarity == "SI1")

# wilcox test (aka Mann-Whitney U-test)
groupedstats::grouped_wilcox(
  data = diamonds_short2,
  dep.vars = c(carat, price), # two dependent variables
  indep.vars = c(color, clarity), # two independent variables
  grouping.vars = cut, # one grouping variable
  paired = FALSE
)
#> # A tibble: 10 x 9
#>    cut       formula        
#>    <ord>     <chr>          
#>  1 Ideal     carat ~ color  
#>  2 Premium   carat ~ color  
#>  3 Good      carat ~ color  
#>  4 Very Good carat ~ color  
#>  5 Fair      carat ~ color  
#>  6 Ideal     price ~ clarity
#>  7 Premium   price ~ clarity
#>  8 Good      price ~ clarity
#>  9 Very Good price ~ clarity
#> 10 Fair      price ~ clarity
#>    method                                            statistic estimate
#>    <chr>                                                 <dbl>    <dbl>
#>  1 Wilcoxon rank sum test with continuity correction   103150.   -0.440
#>  2 Wilcoxon rank sum test with continuity correction    91748    -0.540
#>  3 Wilcoxon rank sum test with continuity correction    20752    -0.400
#>  4 Wilcoxon rank sum test with continuity correction    87590    -0.390
#>  5 Wilcoxon rank sum test with continuity correction     2356.   -0.230
#>  6 Wilcoxon rank sum test with continuity correction   344274.  774.   
#>  7 Wilcoxon rank sum test with continuity correction   329276.  876.   
#>  8 Wilcoxon rank sum test with continuity correction    64082.  516.   
#>  9 Wilcoxon rank sum test with continuity correction   272064   752.   
#> 10 Wilcoxon rank sum test with continuity correction     5113   170.   
#>    conf.low conf.high  p.value significance
#>       <dbl>     <dbl>    <dbl> <chr>       
#>  1   -0.500    -0.380 1.28e-51 ***         
#>  2   -0.600    -0.490 1.82e-59 ***         
#>  3   -0.500    -0.310 4.49e-18 ***         
#>  4   -0.460    -0.320 7.06e-37 ***         
#>  5   -0.370    -0.110 1.21e- 5 ***         
#>  6  536.     1075.    3.00e- 9 ***         
#>  7  538.     1278.    3.52e- 9 ***         
#>  8  141.      921.    3.05e- 3 **          
#>  9  486.     1098.    2.76e- 8 ***         
#> 10 -411.      802.    5.68e- 1 ns

In these examples, two things are worth noting that generalize to all functions in this package and stem from how tidy evaluation (https://adv-r.hadley.nz/evaluation.html) works:

  • If just one independent variable is provided for multiple dependent variables, it will be used as a common independent variable for all of them.
  • You need not use c() to select variables; a range of variables can also be selected with : (as in depth:price above).

Extending with purrr

groupedstats functions can be further extended with the purrr package. For example, let’s say we want to run the same linear regression across multiple grouping variables but with different formulas:

# for reproducibility
set.seed(123)
library(groupedstats)

results_df <-
  purrr::pmap_dfr(
    .l = list(
      data = list(groupedstats::movies_long),
      grouping.vars = alist(c(mpaa, genre)), # note it's `alist` and not `list`
      formula = list(
        rating ~ budget, # model 1
        rating ~ log(budget), # model 2
        log(rating) ~ budget, # model 3
        log(rating) ~ log(budget) # model 4
      ),
      output = list("glance") # return model diagnostics
    ),
    .f = groupedstats::grouped_lm, # regression model
    .id = "model"
  ) %>%
  # for each combination of mpaa rating and movie genre,
  # arrange the fits from best to worst
  dplyr::group_by(.data = ., mpaa, genre) %>%
  dplyr::arrange(.data = ., dplyr::desc(adj.r.squared))

head(results_df)
#> # A tibble: 6 x 15
#> # Groups:   mpaa, genre [3]
#>   model mpaa  genre       r.squared adj.r.squared  sigma statistic
#>   <chr> <fct> <fct>           <dbl>         <dbl>  <dbl>     <dbl>
#> 1 2     PG-13 Animation       0.474         0.369 0.824       4.51
#> 2 4     PG-13 Animation       0.447         0.337 0.138       4.05
#> 3 3     PG    Documentary     0.468         0.202 0.0532      1.76
#> 4 1     PG    Documentary     0.449         0.174 0.386       1.63
#> 5 4     R     Action          0.142         0.138 0.254      34.6 
#> 6 2     R     Action          0.129         0.125 1.31       30.9 
#>        p.value    df   logLik    AIC    BIC  deviance df.residual  nobs
#>          <dbl> <dbl>    <dbl>  <dbl>  <dbl>     <dbl>       <int> <int>
#> 1 0.0870           1   -7.40   20.8   20.6    3.39              5     7
#> 2 0.100            1    5.09   -4.18  -4.35   0.0957            5     7
#> 3 0.316            1    7.45   -8.90 -10.7    0.00565           2     4
#> 4 0.330            1   -0.479   6.96   5.12   0.298             2     4
#> 5 0.0000000162     1   -9.39   24.8   34.8   13.5             209   211
#> 6 0.0000000825     1 -356.    718.   728.   361.              209   211

Current code coverage

As the code stands right now, here is the code coverage for all primary functions involved: https://codecov.io/gh/IndrajeetPatil/groupedstats/tree/master/R

Contributing

I’m happy to receive bug reports, suggestions, questions, and (most of all) contributions to fix problems and add features. I personally prefer using the GitHub issues system over other channels (personal e-mail, Twitter, etc.). Pull requests for contributions are encouraged.

Here are some simple ways in which you can contribute (in increasing order of commitment):

  • Read and correct any inconsistencies in the documentation

  • Raise issues about bugs or wanted features

  • Review code

  • Add new functionality (in the form of new plotting functions or helpers for preparing subtitles)

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Suggestions

If you find any bugs or have any suggestions/remarks, please file an issue on GitHub: https://github.com/IndrajeetPatil/groupedstats/issues
