
tidyinftheo

Overview

There already exists a great package for information-theoretic measures (Cover and Thomas 2001), called "infotheo" (Meyer 2014). tidyinftheo wraps a few of the functions in the "infotheo" package for 'tidy-style' data manipulation in R. Some key differences are that this package:

  • just calculates Shannon Entropy, Conditional Shannon Entropy, Mutual Information, and Normalized Mutual Information.
  • just calculates the "empirical" versions of these measures, as opposed to estimates.
  • prefers "bits" (base-2 logs) to "nats" (natural logs), as shown in the sketch after this list.
  • includes a function for aggregating the pairwise comparison of mutual information across more than two variables, yielding a triangular matrix analogous to a correlation matrix for continuous variables.
  • is fairly flexible about the type of the input: factors, integers, and strings should all work, but doubles won't.
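
For example, the base-2 convention means tidyinftheo's numbers should line up with "infotheo"'s only after converting nats to bits. A minimal check, assuming infotheo's empirical ("emp") estimator is the one wrapped here:

library(dplyr)        # provides the starwars dataset and %>%
library(tidyinftheo)

# entropy of eye_color in bits, via tidyinftheo
bits_tidy <- starwars %>% shannon_entropy(eye_color)

# the same quantity via infotheo, which reports nats by default
nats <- infotheo::entropy(as.factor(starwars$eye_color), method="emp")
bits_info <- infotheo::natstobits(nats)

all.equal(bits_tidy, bits_info)   # expected: TRUE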

Functions

  • shannon_entropy(.data, ..., na.rm=FALSE)
  • shannon_cond_entropy(.data, ..., na.rm=FALSE)
  • mutual_info(.data, ..., normalized=FALSE, na.rm=FALSE)
  • mutual_info_matrix(.data, ..., normalized=FALSE, na.rm=FALSE)
  • mutual_info_heatmap(mi_matrix, title=NULL, font_sizes=c(12,12))
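
The two entropy-decomposition functions don't appear in the examples below, so here is a minimal sketch of their use, assuming columns are passed tidyselect-style as in the other examples (na.rm=TRUE drops rows with missing values, per the signatures above):

library(dplyr)
library(tidyinftheo)

# H(eye_color | gender): uncertainty left in eye color once gender is known
starwars %>% shannon_cond_entropy(eye_color, gender, na.rm=TRUE)

# MI(eye_color; gender), scaled to [0,1] when normalized=TRUE
starwars %>% mutual_info(eye_color, gender, normalized=TRUE, na.rm=TRUE)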

Installation

You can install tidyinftheo from GitHub with:

devtools::install_github("pohlio/tidyinftheo")

then load it, along with dplyr, which supplies %>% and the datasets used below:

library(tidyinftheo)
library(dplyr)

Examples

Calculate (in bits) the Shannon Entropy of the eye color variable in the starwars dataset:

starwars %>% shannon_entropy(eye_color)
#> [1] 3.117176
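
Since this is the empirical ("plug-in") estimate, the same number should fall out of the definition H(X) = -sum over x of p(x) * log2 p(x), computed by hand in base R:

p_hat <- prop.table(table(starwars$eye_color))   # empirical category probabilities
-sum(p_hat * log2(p_hat))                        # should reproduce 3.117176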

With the classic mtcars dataset, choose some columns to make a matrix of pairwise mutual-information comparisons. In particular, the cyl, vs, am, gear, and carb columns are all whole numbers, indicating that they represent categories. The other columns are continuous and better suited to correlation comparisons, unless they're discretized (see the sketch after the table below). Here are the first few rows of mtcars:

mtcars %>% select(cyl, vs, am, gear, carb) %>% head()

                   cyl  vs  am  gear  carb
Mazda RX4            6   0   1     4     4
Mazda RX4 Wag        6   0   1     4     4
Datsun 710           4   1   1     4     1
Hornet 4 Drive       6   1   0     3     1
Hornet Sportabout    8   0   0     3     2
Valiant              6   1   0     3     1
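
If one of the continuous columns were wanted in the comparison, it would need binning first. A sketch using base R's cut(), with three equal-width bins as an arbitrary illustrative choice:

mtcars %>%
    mutate(mpg_bin = cut(mpg, breaks=3, labels=c("low", "mid", "high")),
           cyl = as.character(cyl)) %>%   # doubles aren't accepted directly
    mutual_info(mpg_bin, cyl)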

And here is our comparison table. There should be 5-choose-2 = 10 different combinations. NMI stands for Normalized Mutual Information, so the mutual information, normally given in bits, is scaled between 0 and 1:

mi_matr <- as_tibble(mtcars) %>% 
    mutate_if(is.double, as.character) %>%
    mutual_info_matrix(cyl, vs, am, gear, carb, normalized=TRUE)
mi_matr
V1     V2     MI
cyl    vs     0.4937932
cyl    am     0.1672528
cyl    gear   0.3504372
cyl    carb   0.3983338
vs     am     0.0208314
vs     gear   0.2397666
vs     carb   0.2861119
am     gear   0.5173527
am     carb   0.1149038
gear   carb   0.1905054
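
Each row should agree with a direct pairwise call, which makes for a quick consistency check (using the same column conversion as above):

as_tibble(mtcars) %>%
    mutate_if(is.double, as.character) %>%
    mutual_info(cyl, vs, normalized=TRUE)   # expected to match the cyl/vs row, ~0.494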

The matrix is already in a convenient format to plot:

p <- mutual_info_heatmap(mi_matr)
print(p)
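
Per the signature above, the heatmap also takes an optional title and a pair of font sizes; a sketch of a customized plot written to disk (the filename is illustrative), assuming ggplot2 is installed:

p2 <- mutual_info_heatmap(mi_matr,
                          title="Pairwise NMI of mtcars categorical columns",
                          font_sizes=c(14, 10))
ggplot2::ggsave("mi_heatmap.png", p2, width=6, height=5)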

NOTE: The SVG rendered from the plot above may not display 100% correctly; sometimes the legend lacks its color swatch. This may be a problem with ggplot2 or with web browsers.

References

Cover, Thomas M., and Joy A. Thomas. 2001. Elements of Information Theory. 2nd ed. New York, NY: John Wiley & Sons, Inc.

Meyer, Patrick E. 2014. Infotheo: Information-Theoretic Measures. https://CRAN.R-project.org/package=infotheo.


Functions in tidyinftheo (0.2.1)

  • shannon_entropy: Shannon entropy H(X)
  • shannon_cond_entropy: Conditional Shannon entropy H(X|Y), i.e. "H(X given Y)"
  • mutual_info: Mutual information MI(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
  • mutual_info_matrix: Mutual information matrix
  • mutual_info_heatmap: Plot a heatmap of mutual informations
  • tidyinftheo-package: tidy-style information-theoretic routines