tidyinftheo
Overview
There already exists a great package for information theory measures (Cover and Thomas 2001), called "infotheo" (Meyer 2014). tidyinftheo wraps a few of the functions in the "infotheo" package for 'tidy-style' data manipulation in R. Some key differences are that this package:
- just calculates Shannon Entropy, Conditional Shannon Entropy, Mutual Information, and Normalized Mutual Information.
- just calculates the "empirical" versions of these measures, as opposed to estimates.
- prefers "bits" (base-2 logs) over "nats" (natural logs).
- includes a function for aggregating the pairwise comparison of mutual information across more than two variables, yielding a triangular matrix analogous to a correlation matrix for continuous variables.
- is fairly flexible about input types: factors, integers, and strings should all work, but doubles won't.
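To illustrate the base-2 convention, the Shannon entropy of a fair coin is exactly 1 bit. This can be checked directly in base R (a sketch, not using the package):

```r
# Shannon entropy in bits: H(X) = -sum(p * log2(p))
p <- c(0.5, 0.5)          # fair coin
-sum(p * log2(p))         # 1 bit

p <- rep(0.25, 4)         # uniform over 4 outcomes
-sum(p * log2(p))         # 2 bits
```

Under natural logs (nats), the fair-coin value would be log(2) ≈ 0.693 instead.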
Functions
shannon_entropy(.data, ..., na.rm=FALSE)
shannon_cond_entropy(.data, ..., na.rm=FALSE)
mutual_info(.data, ..., normalized=FALSE, na.rm=FALSE)
mutual_info_matrix(.data, ..., normalized=FALSE, na.rm=FALSE)
mutual_info_heatmap(mi_matrix, title=NULL, font_sizes=c(12,12))
Installation
You can install tidyinftheo from github with:
devtools::install_github("pohlio/tidyinftheo")
then load:
library(tidyinftheo)
Examples
Calculate (in bits) the Shannon entropy of the eye_color variable in the starwars dataset:
starwars %>% shannon_entropy(eye_color)
#> [1] 3.117176
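The other functions follow the same pattern of passing column names directly. A sketch using two starwars columns (the convention that the first column is conditioned on the second is an assumption here, and expected outputs are omitted):

```r
library(dplyr)
library(tidyinftheo)

# Entropy of eye_color given hair_color
starwars %>% shannon_cond_entropy(eye_color, hair_color)

# Mutual information between eye_color and hair_color, in bits
starwars %>% mutual_info(eye_color, hair_color)

# The same, normalized to [0, 1]
starwars %>% mutual_info(eye_color, hair_color, normalized=TRUE)
```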
With the classic mtcars dataset, choose some columns to make a matrix of pairwise mutual information comparisons. In particular, the cyl, vs, am, gear, and carb columns are all whole numbers, indicating they represent categories. The other columns are continuous and better suited to correlation comparisons, unless they're discretized. Here are the first few rows of mtcars:
mtcars %>% select(cyl, vs, am, gear, carb) %>% head()
|                   | cyl | vs | am | gear | carb |
|-------------------|-----|----|----|------|------|
| Mazda RX4         | 6   | 0  | 1  | 4    | 4    |
| Mazda RX4 Wag     | 6   | 0  | 1  | 4    | 4    |
| Datsun 710        | 4   | 1  | 1  | 4    | 1    |
| Hornet 4 Drive    | 6   | 1  | 0  | 3    | 1    |
| Hornet Sportabout | 8   | 0  | 0  | 3    | 2    |
| Valiant           | 6   | 1  | 0  | 3    | 1    |
And here is our comparison table. There should be 5-choose-2 = 10 different combinations. NMI stands for Normalized Mutual Information: the mutual information, normally given in bits, scaled to lie between 0 and 1:
mi_matr <- as_tibble(mtcars) %>%
    mutate_if(is.double, as.character) %>%
mutual_info_matrix(cyl, vs, am, gear, carb, normalized=TRUE)
mi_matr
| V1   | V2   | MI        |
|------|------|-----------|
| cyl  | vs   | 0.4937932 |
| cyl  | am   | 0.1672528 |
| cyl  | gear | 0.3504372 |
| cyl  | carb | 0.3983338 |
| vs   | am   | 0.0208314 |
| vs   | gear | 0.2397666 |
| vs   | carb | 0.2861119 |
| am   | gear | 0.5173527 |
| am   | carb | 0.1149038 |
| gear | carb | 0.1905054 |
The matrix is already in a convenient format to plot:
p <- mutual_info_heatmap(mi_matr)
print(p)
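If the returned heatmap is a standard ggplot object, as its title= and font_sizes= arguments suggest, it can be customized and saved with the usual ggplot2 tools. A sketch, assuming `p` is the plot created above (the title text and file name are illustrative):

```r
library(ggplot2)

# Add a title via the heatmap's own argument, or layer one on afterward
p2 <- p + ggtitle("Normalized MI, selected mtcars columns")

# Write the plot to disk
ggsave("mi_heatmap.png", plot = p2, width = 6, height = 5)
```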
NOTE: The above SVG may not render 100% correctly; sometimes the legend lacks the color swatch. This may be a problem with ggplot2 or with web browsers.
References
Cover, Thomas M., and Joy A. Thomas. 2001. Elements of Information Theory. 2nd ed. New York, NY: John Wiley & Sons.
Meyer, Patrick E. 2014. Infotheo: Information-Theoretic Measures. https://CRAN.R-project.org/package=infotheo.