Note: a newer version (0.3.2) of this package is available.

easyalluvial

Alluvial plots are similar to Sankey diagrams and visualise categorical data over multiple dimensions as flows (Rosvall et al. 2010). Their graphical grammar, however, is a bit more complex than that of regular x/y plots. The ggalluvial package does a great job of translating that grammar into ggplot2 syntax and gives you many options to tweak the appearance of an alluvial plot. However, there still remains a multi-layered complexity that makes it difficult to use ‘ggalluvial’ for explorative data analysis. ‘easyalluvial’ provides a simple interface to this package that allows you to produce a decent alluvial plot from any dataframe in either long or wide format with a single line of code, while also handling continuous data. It is meant to allow a quick visualisation of entire dataframes, with a focus on different colouring options that can make alluvial plots a great tool for data exploration.

Features

  • plot an alluvial graph of a given dataframe with a single line of code
  • support for wide and long data formats (see Wikipedia: wide vs. long/narrow data)
  • automatically transforms numerical to categorical data
  • helper functions for variable selection
  • convenient parameters for coloring and ordering
  • marginal histograms
  • model agnostic partial dependence and model response alluvial plots with 4 dimensions

Installation

CRAN

install.packages('easyalluvial')

Development Version


# install.packages("devtools")
devtools::install_github("erblast/easyalluvial")

Tutorials

To learn about all the features and how they can be useful, check out the package tutorials.

Examples

suppressPackageStartupMessages( library(tidyverse) )
suppressPackageStartupMessages( library(easyalluvial) )

Alluvial from data in wide format

Sample Data


knitr::kable( head(mtcars2) )
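| mpg  | cyl | disp | hp  | drat | wt    | qsec  | vs | am        | gear | carb | ids               |
|------|-----|------|-----|------|-------|-------|----|-----------|------|------|-------------------|
| 21.0 | 6   | 160  | 110 | 3.90 | 2.620 | 16.46 | V  | manual    | 4    | 4    | Mazda RX4         |
| 21.0 | 6   | 160  | 110 | 3.90 | 2.875 | 17.02 | V  | manual    | 4    | 4    | Mazda RX4 Wag     |
| 22.8 | 4   | 108  | 93  | 3.85 | 2.320 | 18.61 | S  | manual    | 4    | 1    | Datsun 710        |
| 21.4 | 6   | 258  | 110 | 3.08 | 3.215 | 19.44 | S  | automatic | 3    | 1    | Hornet 4 Drive    |
| 18.7 | 8   | 360  | 175 | 3.15 | 3.440 | 17.02 | V  | automatic | 3    | 2    | Hornet Sportabout |
| 18.1 | 6   | 225  | 105 | 2.76 | 3.460 | 20.22 | S  | automatic | 3    | 1    | Valiant           |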

Plot

Continuous variables are automatically binned into the following categories:

  • High, High (HH)
  • Medium, High (MH)
  • Medium (M)
  • Medium, Low (ML)
  • Low, Low (LL)
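To make the five labels concrete, here is a minimal base R sketch that cuts a numeric column into five equal-width bins and labels them from LL to HH. It is a simplified stand-in and an assumption on my part, not the package's own procedure: the package's binning (see manip_bin_numerics in the function list) may preprocess the data differently, so the bin boundaries need not match.

```r
# Simplified illustration only: five equal-width bins over the observed range.
# ASSUMPTION: this is not easyalluvial's exact binning procedure.
bin_labels <- c("LL", "ML", "M", "MH", "HH")   # ordered from lowest to highest

x <- mtcars$mpg                                # any numeric vector works here

# cut() with a single number for `breaks` splits the range of x
# into that many intervals of equal width.
binned <- cut(x, breaks = 5, labels = bin_labels)

# Count how many observations fall into each bin.
print(table(binned))
```

The smallest value always lands in LL and the largest in HH, which mirrors the low-to-high reading of the labels listed above.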

alluvial_wide( data = mtcars2
                , max_variables = 5
                , fill_by = 'first_variable' )

Alluvial from data in long format

Sample Data

knitr::kable( head(quarterly_flights) )
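| tailnum           | carrier | origin | dest | qu | mean_arr_delay |
|-------------------|---------|--------|------|----|----------------|
| N0EGMQ LGA BNA MQ | MQ      | LGA    | BNA  | Q1 | on_time        |
| N0EGMQ LGA BNA MQ | MQ      | LGA    | BNA  | Q2 | on_time        |
| N0EGMQ LGA BNA MQ | MQ      | LGA    | BNA  | Q3 | on_time        |
| N0EGMQ LGA BNA MQ | MQ      | LGA    | BNA  | Q4 | on_time        |
| N11150 EWR MCI EV | EV      | EWR    | MCI  | Q1 | late           |
| N11150 EWR MCI EV | EV      | EWR    | MCI  | Q2 | late           |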

Plot


alluvial_long( quarterly_flights
               , key = qu
               , value = mean_arr_delay
               , id = tailnum
               , fill = carrier )
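The difference between the two formats can be seen on a toy example. The sketch below uses only base R; the data frame `long` and its columns `id`, `key` and `value` are hypothetical and merely mimic the key/value/id roles used in the call above. In long format each id occupies one row per key, while in wide format each id occupies a single row with one column per key.

```r
# Hypothetical toy data in long format: one row per (id, key) pair.
long <- data.frame(
  id    = rep(c("a", "b"), each = 4),
  key   = rep(c("Q1", "Q2", "Q3", "Q4"), times = 2),
  value = c("on_time", "on_time", "late", "on_time",
            "late", "late", "late", "on_time"),
  stringsAsFactors = FALSE
)

# Reshape to wide format: one row per id, one column per key.
wide <- reshape(long, idvar = "id", timevar = "key", direction = "wide")

print(long)
print(wide)
```

Here `wide` has two rows and the columns id, value.Q1, value.Q2, value.Q3 and value.Q4.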

Marginal Histograms

alluvial_wide( data = mtcars2
                , max_variables = 5
                , fill_by = 'first_variable' ) %>%
  add_marginal_histograms(mtcars2)

Partial Dependence Alluvial Plots

Alluvial plots can display higher-dimensional data on a plane and thus lend themselves to plotting the response of a statistical model to changes in the input data across multiple dimensions. The practical limit here is 4 dimensions, while conventional partial dependence plots are limited to 2 dimensions.

Briefly, the 4 variables with the highest feature importance for a given model are selected, and 5 values spread over the variable range are selected for each. Then a grid of all possible combinations is created. All non-plotted variables are set to the values found in the first row of the training data set. Using this artificial data space, model predictions are generated. This process is then repeated for each row in the training data set, and the overall model response is averaged at the end. Each of the possible combinations is plotted as a flow, which is coloured by the bin corresponding to the average model response generated by that particular combination.
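The averaging procedure described above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: `lm()` stands in for an arbitrary model, the plotted variables in `plot_vars` are picked by hand rather than by feature importance, and only two variables are plotted instead of four to keep the grid small.

```r
# Illustrative sketch only; none of this calls easyalluvial.
df <- mtcars[, c("disp", "wt", "hp", "cyl", "qsec")]
m  <- lm(disp ~ ., data = df)          # ASSUMPTION: lm() as a stand-in model

# ASSUMPTION: plotted variables chosen by hand, not by feature importance.
plot_vars  <- c("wt", "hp")
other_vars <- setdiff(names(df), c(plot_vars, "disp"))

# 5 values spread evenly over the range of each plotted variable,
# then a grid of all 5 x 5 = 25 combinations.
grid_values <- lapply(df[plot_vars], function(x) seq(min(x), max(x), length.out = 5))
grid <- expand.grid(grid_values)

# For each training row: fix the non-plotted variables at that row's values,
# predict over the whole grid, and collect the predictions column-wise.
pred_matrix <- sapply(seq_len(nrow(df)), function(i) {
  newdata <- grid
  for (v in other_vars) newdata[[v]] <- df[[v]][i]
  predict(m, newdata = newdata)
})

# Average the model response over all training rows for each combination.
grid$avg_response <- rowMeans(pred_matrix)

# Bin the averaged response; each grid row would become one coloured flow.
grid$response_bin <- cut(grid$avg_response, breaks = 5,
                         labels = c("LL", "ML", "M", "MH", "HH"))
print(head(grid))
```

For a purely additive linear model like this one, the averaged response equals the prediction at the mean of the non-plotted variables, which makes the sketch easy to check by hand.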


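# drop the id column and fit a random forest predicting disp
df = select(mtcars2, -ids)
m = randomForest::randomForest( disp ~ ., df)
imp = m$importance

# data space built from the 4 most important features
dspace = get_data_space(df, imp, degree = 4)

# model predictions for the partial dependence method
pred = get_pdp_predictions(df, imp
                           , m
                           , degree = 4
                           , bins = 5)

# model response alluvial plot
p = alluvial_model_response(pred, dspace, imp
                            , degree = 4, method = 'pdp'
                            , stratum_label_size = 2.75)

# add marginal histograms and a feature importance plot
p_grid = add_marginal_histograms(p, df, plot = FALSE) %>%
  add_imp_plot(p, df)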

Version: 0.2.2
License: CC0
Maintainer: Bjoern Koneswarakantha
Last Published: December 9th, 2019
Monthly Downloads: 410

Functions in easyalluvial (0.2.2)

  • plot_all_hists: plot marginal histograms of alluvial plot
  • mtcars2: mtcars dataset with cyl, vs, am, gear, carb as factor variables and car model names as id
  • titanic: titanic data set
  • tidy_imp: tidy up dataframe containing model feature importance
  • alluvial_model_response: create model response plot
  • palette_filter: color filters for any vector of hex color values
  • plot_imp: plot feature importance
  • use_e1071: calls e1071::skewness
  • plot_condensation: plot dataframe condensation potential
  • plot_hist: plot histogram of alluvial plot variable
  • palette_plot_rgp: plot rgb values of palette
  • palette_qualitative: compose palette from qualitative RColorBrewer palettes
  • palette_increase_length: increases length of palette by repeating colours
  • palette_plot_intensity: plot colour intensity of palette
  • quarterly_sunspots: quarterly mean relative sunspots number from 1749-1983
  • quarterly_flights: quarterly mean arrival delay times for a set of 402 flights
  • manip_bin_numerics: bin numerical columns
  • get_data_space: calculate data space
  • get_pdp_predictions: get predictions compatible with the partial dependence plotting method
  • alluvial_long: alluvial plot of data in long format
  • add_imp_plot: add bar plot of important features to model response alluvial plot
  • add_marginal_histograms: add marginal histograms to alluvial plot
  • alluvial_model_response_caret: create model response plot for caret models
  • manip_factor_2_numeric: converts factor to numeric, preserving numeric levels and order in character levels
  • alluvial_wide: alluvial plot of data in wide format