Learn R Programming

tabxplor

If R makes complex things simple, it can sometimes make simple things difficult. This is why tabxplor tries to make it easy to deal with multiple cross-tables: to create and manipulate them, but also to read them, using color helpers to highlight important informations (differences from totals, comparisons between lines or columns, contributions to variance, margins of error, etc.). It would love to enhance your data exploration experience with simple yet powerful tools. All functions are propelled by tidyverse, pipe-friendly, and render tibble data frames which can be easily manipulated with dplyr. In the same time, time-taking operations are done with data.table to go faster with big dataframes. Tables can be exported to Excel and in html with formats and colors.

Installation

You can install tabxplor from CRAN with:

install.packages("tabxplor")

Or from github with :

# install.packages(devtools)
devtools::install_github("BriceNocenti/tabxplor")

Base usage: cross-tables with color helpers

The main functions are made to be user-friendly and time-saving in data analysis workflows.

tab makes a simple cross-table:

library(tabxplor)
tab(forcats::gss_cat, marital, race)
#> # A tabxplor tab: 7 × 5
#>   marital       Other Black  White  Total
#>   <fct>           <n>   <n>    <n>    <n>
#> 1 No answer         2     2     13     17
#> 2 Never married   633 1 305  3 478  5 416
#> 3 Separated       110   196    437    743
#> 4 Divorced        212   495  2 676  3 383
#> 5 Widowed          70   262  1 475  1 807
#> 6 Married         932   869  8 316 10 117
#> 7 Total         1 959 3 129 16 395 21 483

When one of the row or column variables is numeric, tab calculates means by category of the other variable.

tab comes with options to weight the table, print percentages, manage totals, digits and missing values, add legends, gather rare categories in a “Others” level.

tab(forcats::gss_cat, marital, race, pct = "row", na = "drop", 
rare_to_other = TRUE, n_min = 1000, other_level = "Custom_other_level_name")

When a third variable is provided, tab makes a table with as many subtables as it has levels. With several tab_vars, it makes a subtable for each combination of their levels. The result is grouped: in dplyr, operations like sum() or all() are done within each subtable, and not for the whole dataframe.

Colors may be added to highlight over-represented and under-represented cells, and therefore help the user read the table. By default, with color = "diff", colors are based on the differences between a cell and it’s related total (which only works with means and row or col pct). When a percentage is superior to the average percentage of the line or column, it appears with shades of green (or blue). When it’s inferior, it appears with shades of red/orange. A color legend is added below the table. In RStudio colors are adapted to the theme, light or dark.

data <- forcats::gss_cat %>% 
dplyr::filter(year %in% c(2000, 2006, 2012), !marital %in% c("No answer", "Widowed"))
gss  <- "Source: General social survey 2000-2014"
gss2 <- "Source: General social survey 2000, 2006 and 2012"
tab(data, race, marital, year, subtext = gss2, pct = "row", color = "diff")

The sup_cols argument adds supplementary column variables to the table. With numeric variables, it calculates the mean for each category or the row variable. With text variables, only the first level is kept (you can choose which one to use by placing it first with forcats::fct_relevel). Use tab_many to keep all levels.

tab(dplyr::storms, category, status, sup_cols = c("pressure", "wind"))
#> # A tabxplor tab: 8 × 7
#>   category hurricane `tropical depressi…` `tropical storm`  Total pressure  wind
#>   <fct>          <n>                  <n>              <n>    <n>   <mean> <mea>
#> 1 -1               0                2 898                0  2 898    1 008    27
#> 2 0                0                    0            5 347  5 347      999    46
#> 3 1            1 933                    0                1  1 934      981    71
#> 4 2              749                    0                0    749      967    89
#> 5 3              434                    0                0    434      954   104
#> 6 4              411                    0                0    411      939   122
#> 7 5               86                    0                0     86      917   146
#> 8 Total        3 613                2 898            5 348 11 859      992    54

References and comparison levels for colors

By default, to calculate colors, each cell is compared to the subtable’s related total.

When a third variable or more are provided, it’s possible to compare with the general total line instead, by setting comp = "all". Here, only the last total row is highlighted (TOTAL ENSEMBLE appears in white but other total rows in grey).

tab(data, race, marital, year, subtext = gss2, pct = "row", color = "diff", comp = "all")

With diff = "first", each row (or column) is compared to the first row (or column), which is particularly helpful to highlight historical evolutions. The first rows then appears in white (while rows totals are themselves colored like normal lines).

data <- data %>% dplyr::mutate(year = as.factor(year))
tab(data, year, marital, race, pct = "row", color = "diff", diff = "first", tot = "col",
totaltab = "table")

When diff is a number, the nth row (or column) is used for comparison.

tab(data, year, marital, race, pct = "row", color = "diff", diff = 3)

Finally, when diff is a string, it it used as a regular expression, to match with the names of the rows (or columns).

tab(data, year, marital, race, pct = "col", tot = "row", color = "diff", diff = "Married")

Confidence intervals

It it possible to print confidence intervals for each cell:

tab(forcats::gss_cat, race, marital, pct = "row", ci = "cell")
#> # A tabxplor tab: 4 × 8
#>   race   `No answer` `Never married` Separated Divorced Widowed Married  Total
#>   <fct>       <row%>          <row%>    <row%>   <row%>  <row%>  <row%> <row%>
#> 1 Other       0%±0.3         32%±2.1    6%±1.1  11%±1.5  4%±0.9 48%±2.2   100%
#> 2 Black       0%±0.2         42%±1.7    6%±0.9  16%±1.3  8%±1.0 28%±1.6   100%
#> 3 White       0%±0.1         21%±0.6    3%±0.3  16%±0.6  9%±0.4 51%±0.8   100%
#> 4 Total       0%             25%        3%      16%      8%     47%       100%

It is also possible to use confidence intervals to enhance colors helpers. With color = "diff_ci", the cells are only colored if the confidence interval of the difference between them and their reference cell (in total or first row/col) is superior to the difference itself. Otherwise, it means the cell is not significantly different from it’s reference in the total (or first) row: it turns grey, and the reader is not anymore tempted to over-interpret the difference.

tab(forcats::gss_cat, race, marital, pct = "row", color = "diff_ci")

Finally, another calculation appears helpful: the difference between the cell and the total, minus the confidence interval of this difference (or in other word, what remains of the difference after having subtracted the confidence interval). ci = "after_ci" highligths all the cells whose value is significantly different from the relative total (or first cell). This is particularly useful when working on small samples : we can see at a glance which numbers we have right to read and interpret.

tab(forcats::gss_cat, race, marital, subtext = gss, pct = "row", color = "after_ci")

Chi2 stats and contributions of cells to variance

chi2 = TRUE add summary statistics made in the chi2 metric: degrees of freedom (df), unweighted count, pvalue and (sub)table’s variance. Chi2 pvalue is colored in green when inferior to 5%, and in red when superior or equal to 5%, meaning that the table is not significantly different from the independent hypothesis (the two variables may be independent).

tab(forcats::gss_cat, race, marital, chi2 = TRUE)
#> chi2 stats     marital
#> df                  12
#> variance        0.0464
#> pvalue              0%
#> count           21 483
#> 
#> # A tabxplor tab: 4 × 8
#>   race   `No answer` `Never married` Separated Divorced Widowed Married  Total
#>   <fct>          <n>             <n>       <n>      <n>     <n>     <n>    <n>
#> 1 Other            2             633       110      212      70     932  1 959
#> 2 Black            2           1 305       196      495     262     869  3 129
#> 3 White           13           3 478       437    2 676   1 475   8 316 16 395
#> 4 Total           17           5 416       743    3 383   1 807  10 117 21 483

Chi2 stats can also be used to color cells based on their contributions to the variance of the (sub)table, with color = "contrib". By default, only the cells whose contribution is superior to the mean contribution are colored. It highlights the cells which would stand out in a correspondence analysis (the two related categories would be located at the edges of the first axes ; here, being black is associated with never married and being separated).

tab(forcats::gss_cat, race, marital, color = "contrib")

Combine tabxplor and dplyr

The result of tab is a tibble::tibble data frame with class tab. It gets it’s own printing methods but, in the same time, can be transformed using most dplyr verbs, like a normal tibble.

library(dplyr)
tab(storms, category, status, sup_cols = c("pressure", "wind")) %>%
filter(category != "-1") %>%
dplyr::select(-`tropical depression`)
arrange(is_totrow(.), desc(category)) # use is_totrow to keep total rows order

With dplyr::arrange, don’t forget to keep the order of tab variables and total rows:

tab(data, race, marital, year, pct = "row") %>%
arrange(year, is_totrow(.), desc(Married))

Draw more complex tables with tab_many

tab is a wrapper around the more powerful function tab_many, which can be used to customize your tables.

It’s possible, for example, to make a summary table of as many columns variables as you want (showing all levels, or showing only one specific level like here):

first_lvs <- c("Married", "$25000 or more", "Strong republican", "Protestant")
data <- forcats::gss_cat %>% mutate(across(
where(is.factor),
~ forcats::fct_relevel(., first_lvs[first_lvs %in% levels(.)])
))
tab_many(data, race, c(marital, rincome, partyid, relig, age, tvhours),
levels = "first", pct = "row", chi2 = TRUE, color = "auto")

Using tab or tab_many with purrr::map and tibble::tribble, you can program several tables with different parameters all at once, in a readable way:

tabs <-
purrr::pmap(
tibble::tribble(
~row_var, ~col_vars       , ~pct , ~filter              , ~subtext               ,
"race"  , "marital"       , "no" , NULL                 , "Source: GSS 2000-2014",
"race"  , "marital"       , "row", NULL                 , "Source: GSS 2000-2014",
"race"  , "marital"       , "col", NULL                 , "Source: GSS 2000-2014",
"relig" , c("race", "age"), "row", "year %in% 2000:2010", "Source: GSS 2000-2010",
"relig" , c("race", "age"), "row", "year %in% 2010:2014", "Source: GSS 2010-2014",
NA_character_, "race"     , "no" , NULL                 , "Source: GSS 2000-2014",
),
.f = tab_many,
data = forcats::gss_cat, color = "auto", chi2 = TRUE)

Export to html or Excel

To export a table to html with colors, tabxplor uses knitr::kable and kableExtra. In this format differences from totals, confidence intervals, contribution to variance, and unweighted counts, are available in a tooltip at cells hover.

tabs <- tab(forcats::gss_cat, race, marital, subtext = "Source: GSS 2000-2014", 
pct = "row", color = "diff")
tabs %>% tab_kable()

To print an html table by default (for example, in RStudio viewer), use tabxplor options:

options(tabxplor.print = "kable") # default to options(tabxplor.print = "console")

tab_xl exports any table or list of tables to Excel, with all colors, chi2 stats and formatting. On Excel, it is still possible to do calculations on raw numbers (display is rounded but, below, decimals are kept).

tabs %>% tab_xl(replace = TRUE, sheets = "unique")

Programming with tabxplor

When not doing data analysis but writing functions, you can use the sub-functions of tab_many step by step to attain more flexibility or speed. That way, it’s possible to write new functions to customize your tables even more.

data <- dplyr::starwars %>%
tab_prepare(sex, hair_color, gender, rare_to_other = TRUE,
n_min = 5, na_drop_all = sex)

data %>%
tab_plain(sex, hair_color, gender, tot = c("row", "col"), pct = "row", comp = "all") %>%
tab_ci("diff", color = "after_ci")  %>%
tab_chi2(calc = "p")

The whole architecture of tabxplor is powered by a special vector class, named tabxplor_fmt for formatted numbers. As a vctrs::record, it stores behind the scenes all the data necessary to calculate printed results, formats and colors. A set of functions are available to access or transform this data. ?fmt to get more information.

The simple way to recover the underlying numbers as numeric vectors is get_num, which extract the currently displayed field whatever it is :

tabs <- tab(forcats::gss_cat, race, marital, pct = "row")
tabs %>% dplyr::mutate(across(where(is_fmt), get_num))
#> # A tabxplor tab: 4 × 8
#>   race   `No answer` `Never married` Separated Divorced Widowed Married Total
#>   <fct>        <dbl>           <dbl>     <dbl>    <dbl>   <dbl>   <dbl> <dbl>
#> 1 Other     0.00102            0.323    0.0562    0.108  0.0357   0.476     1
#> 2 Black     0.000639           0.417    0.0626    0.158  0.0837   0.278     1
#> 3 White     0.000793           0.212    0.0267    0.163  0.0900   0.507     1
#> 4 Total     0.000791           0.252    0.0346    0.157  0.0841   0.471     1

To render character vectors (without colors), use format:

tabs %>% mutate(across(where(is_fmt), format))

The following fields compose any fmt column (though many can be NA if not calculated) :

  • display : name of the field to display, customisable for each cell (character)

  • n : raw count (integer)

  • wn : weighted count

  • pct : percentages

  • diff : differences from totals or reference cells

  • digits : digits to display, customisable for each cell (integer)

  • ctr : contributions of cells to variance (with color = "contrib")

  • mean : means (for numeric column variables)

  • var : variance (for numeric column variables ; Chi2 variance with pct)

  • ci : confidence intervals

  • in_totrow : TRUE if the cell is part of a total row, FALSE otherwise (logical)

  • in_tottab : TRUE if the cell is part of a total table, FALSE otherwise (logical)

  • in_refrow : TRUE if the cell is part of a reference row, FALSE otherwise (logical)

vctrs::vec_data(tabs$Married)
#>       n display digits wn       pct mean         diff ctr var ci in_totrow
#> 1   932     pct      0 NA 0.4757529   NA  0.004822432  NA  NA NA     FALSE
#> 2   869     pct      0 NA 0.2777245   NA -0.193205991  NA  NA NA     FALSE
#> 3  8316     pct      0 NA 0.5072278   NA  0.036297310  NA  NA NA     FALSE
#> 4 10117     pct      0 NA 0.4709305   NA  0.000000000  NA  NA NA      TRUE
#>   in_tottab in_refrow
#> 1     FALSE     FALSE
#> 2     FALSE     FALSE
#> 3     FALSE     FALSE
#> 4     FALSE     FALSE

To get those underlying fields you can either use vctrs::fields or, more simply, $ :

tabs %>% mutate(across(where(is_fmt), ~ vctrs::field(., "pct") ))

tabs$Married$pct
tabs$Married$n
tabs %>% mutate(across(where(is_fmt), ~ .$n))

To modify a field, you can use vctrs field<-. For example, to change the displayed field :

tab(data, race, marital, year, pct = "row") %>%
mutate(across(where(is_fmt), ~ vctrs::`field<-`(., "display", rep("diff", length(.)))))

Faster to write and easier to read, you can also use dplyr::mutate() on an fmt vector. For example, to create a new column with standards deviations and display it with decimals :

tab_num(data, race, c(age, tvhours), marital, digits = 1L, comp = "all") |>
  dplyr::mutate(dplyr::across( #Mutate over the whole table.
    c(age, tvhours),
    ~ dplyr::mutate(., #Mutate over each fmt vector's underlying data.frame.
                    var     = sqrt(var), 
                    display = "var", 
                    digits  = 2L) |> 
      set_color("no"),
    .names = "{.col}_sd"
  ))

Some helper functions exists for total rows, total tables and reference rows (is_totrow() / as_totrow(), is_tottab() / as_tottab(), is_refrow() / as_refrow()) :

tab(data, race, marital, year, pct = "row") %>%
  dplyr::mutate(across( 
    where(is_fmt),
    ~ dplyr::if_else(is_totrow(.), 
                true  = mutate(., digits = 1L), 
                false = mutate(., digits = 2L))
  ))

Each fmt column have attributes, which you can access or modify with get_ and set_ functions :

  • type / get_type() / set_type() : the type of the fmt vector, among c("n", "mean", "row", "col", "all", "all_tabs") ; it determines which calculations are done within tab_ functions.

  • totcol / is_totcol() / as_totcol() : TRUE if the column is a total column, FALSE otherwise (logical)

  • refcol / is_refcol() / as_refcol() : TRUE if the column is a reference column for comparison, FALSE otherwise (logical)

  • color / get_color() / set_color() : the calculation to make to print colors ; among c("", "no", "diff", "diff_ci", "after_ci", "contrib")

  • col_var / get_col_var() / set_col_var() : the name of the column variable (there can be many in one single table)

  • comp_all / get_comp_all / set_comp_all() : when there are tab_vars, is the reference for comparison the subtable (FALSE), or the total table (TRUE) ?

  • diff_type / get_diff_type() / set_diff_type() : the type of difference calculated, either "no", "tot" for totals, an index, or a regular expression.

  • ci_type / get_ci_type() / set_ci_type() : the type of confidence interval, either "cell" or "diff"

For example, to print the number of observations of the total column :

tab(data, race, marital, year, pct = "row") %>%
  mutate(across(where(is_totcol), ~ mutate(., display = "n") ))

Note that, if tab_vars are provided, the table is grouped and all operations are made within groups. To remove grouping (for example when it gives errors), use dplyr::ungroup().

If you only need the simplest table, with only numeric counts (no fmt), or even a base data.frame (not a tibble) :

tab_plain(data, race, marital, num = TRUE) # counts as a numeric vector
tab_plain(data, race, marital, df = TRUE)  # same, with unique class = "data.frame"

Copy Link

Version

Install

install.packages('tabxplor')

Monthly Downloads

727

Version

1.1.0

License

GPL (>= 3)

Issues

Pull Requests

Stars

Forks

Maintainer

Brice Nocenti

Last Published

June 15th, 2022

Functions in tabxplor (1.1.0)

dplyr_reconstruct.tabxplor_grouped_tab

dplyr_reconstruct method for class tabxplor_grouped_tab
fmt

Create a vector of class formatted numbers
fct_to_na

Replace Factor Levels with NA
fct_rename

Recode Factor Levels using Multiple Patterns
get_col_var.data.frame

Get names of column variable of fmt columns
get_col_var.default

Get names of column variable of fmt columns
fct_replace

Recode Factor Levels using one Pattern
get_color.default

Get color
fct_recode_helper

fct_recode helper to recode multiple variables
get_diff_type.data.frame

Get differences type of fmt columns
get_color.tabxplor_fmt

Get color
get_diff_type.default

Get differences type of fmt columns
fct_levels_from_vector

Copy level of factors between dataframes
get_type.tabxplor_fmt

Get types of fmt columns
is_refrow.data.frame

Test function to detect cells in reference rows
is_refcol.tabxplor_fmt

Test function for reference columns
formats_SAS_to_R

INSEE SAS formats to R : translate code
get_ci_type.data.frame

Get confidence intervals type of fmt columns
get_color.data.frame

Get color
format.tabxplor_fmt

Print method for class tabxplor_fmt
fmt_get_color_code

Get HTML Color Code of a fmt vector
get_col_var.tabxplor_fmt

Get names of column variable of fmt columns
is_refcol.data.frame

Test function for reference columns
is_refcol.default

Test function for reference columns
prepare_fct_recode

Prepare fct_recode
is_tottab.default

Test function to detect cells in total tables
get_type.default

Get types of fmt columns
get_type.data.frame

Get types of fmt columns
is_tottab.data.frame

Test function to detect cells in total tables
is_totrow.default

Test function to detect cells in total rows
mutate.tabxplor_fmt

mutate method to access vctrs::fields of tabxplor_fmt vectors
pmap_if

A generalized map_if
is_totrow.tabxplor_fmt

Test function to detect cells in total rows
get_diff_type.tabxplor_fmt

Get differences type of fmt columns
is_refrow.default

Test function to detect cells in reference rows
rowwise.tabxplor_tab

rowwise method for class tabxplor_tab
get_display

get the "display" field of a fmt vector
new_fmt

fmt a constructor for class fmt.
tab

Single cross-table, with color helpers
summarise.tabxplor_grouped_tab

summarise method for class tabxplor_grouped_tab
is_totcol.tabxplor_fmt

Test function for total columns
get_ci_type.default

Get confidence intervals type of fmt columns
get_ci_type.tabxplor_fmt

Get confidence intervals type of fmt columns
is_totrow.data.frame

Test function to detect cells in total rows
select.tabxplor_grouped_tab

select method for class tabxplor_grouped_tab
tab_prepare

tab_spread

Spread a tab, passing a tab variable to column
tab_plain

Plain single cross-table
tabxplor-vctrs

Internal vctrs methods
tab_pct

tbl_format_body.tabxplor_tab

Table body for class tab
vec_proxy_compare.tabxplor_fmt

Compare with fmt vector
tab_chi2

vec_cast.tabxplor_fmt.tabxplor_fmt

Convert fmt into fmt
vec_math.tabxplor_fmt

Vec_math method for class fmt
vec_proxy_equal.tabxplor_fmt

Test equality with fmt vector
tab_cast

Coercion between two tab
is_refrow.tabxplor_fmt

Test function to detect cells in reference rows
tab_totaltab

vec_cast.tabxplor_fmt.double

Convert double into fmt
tab_tot

is_totcol.default

Test function for total columns
group_by.tabxplor_tab

group_by method for class tabxplor_tab
is_totcol.data.frame

Test function for total columns
%>%

Pipe operator
pillar_shaft.tabxplor_fmt

vec_ptype2.tabxplor_fmt.double

Find common ptype between fmt and double
vec_ptype2.tabxplor_fmt.integer

Find common ptype between fmt and integer
new_tab

A constructor for class tabxplor_tab
pillar_shaft.tab_chi2_fmt

Print Chi2 tables columns
vec_ptype2.tabxplor_fmt.tabxplor_fmt

Find common ptype between fmt and fmt
rename.tabxplor_grouped_tab

rename method for class tabxplor_grouped_tab
map2_if

A 2 arguments map_if
tbl_format_footer.tabxplor_tab

Table footer for class tab
relocate.tabxplor_grouped_tab

relocate method for class tabxplor_grouped_tab
is_tottab.tabxplor_fmt

Test function to detect cells in total tables
tbl_sum.tabxplor_grouped_tab

Table headers for class grouped tab
vec_ptype2.double.tabxplor_fmt

Find common ptype between double and fmt
vec_ptype2.integer.tabxplor_fmt

Find common ptype between integer and fmt
vec_cast.tabxplor_fmt.integer

Convert integer into fmt
rename_with.tabxplor_grouped_tab

rename_with method for class tabxplor_grouped_tab
vec_ptype_abbr.tabxplor_fmt

Abbreviated display name for class fmt in tibbles
tab_xl

Excel output for tabxplor tables, with formatting and colors
vec_cast.double.tabxplor_fmt

Convert fmt into double
tab_num

Means table
rowwise.tabxplor_grouped_tab

rowwise method for class tabxplor_grouped_tab
tabxplor-data.table

Internal data.table methods
tab_many

Many cross-tables as one, with color helpers
print.tabxplor_grouped_tab

Printing method for class tabxplor_grouped_tab
vec_cast.integer.tabxplor_fmt

Convert fmt into integer
print.tabxplor_tab

Printing method for class tabxplor_tab
tbl_sum.tabxplor_tab

Table headers for class tab
tab_kable

Print a tabxplor table in html
vec_cast.character.tabxplor_fmt

Convert fmt into character
ungroup.tabxplor_grouped_tab

ungroup method for class tabxplor_grouped_tab
tab_ci

vec_arith.tabxplor_fmt

Vec_arith method for fmt
vec_ptype_full.tabxplor_fmt

Printed type for class fmt
bind_datas_for_tab

Bind dataframes for tab / tab_many
dplyr_row_slice.tabxplor_grouped_tab

dplyr_row_slice method for class tabxplor_grouped_tab
dplyr_col_modify.tabxplor_grouped_tab

dplyr_col_modify method for class tabxplor_grouped_tab
$.tabxplor_fmt

$ method for class tabxplor_fmt
fct_case_when_recode

Recode Factor Levels with Multiple Patterns Detection
fct_detect_replace

Recode Factor Levels with Detected Pattern inside
cleannames_condition

A regex pattern to clean the names of factors.
fct_clean

Clean factor levels.
compare_levels

Compare levels of factors in many df