philentropy

Similarity and Distance Quantification between Probability Functions

Describe and understand the world through data.

Data collection and data comparison are the foundations of scientific research. Mathematics provides the abstract framework to describe patterns we observe in nature, and statistics provides the framework to quantify the uncertainty of these patterns. In statistics, natural patterns are described in the form of probability distributions, which follow either a fixed pattern (parametric distributions) or more dynamic patterns (non-parametric distributions).

The philentropy package implements fundamental distance and similarity measures to quantify distances between probability density functions as well as traditional information theory measures. In this regard, it aims to provide a framework for comparing natural patterns in a statistical notation.

This project is born out of my passion for statistics and I hope that it will be useful to the people who share it with me.

Installation

# install the current release of philentropy from CRAN
install.packages("philentropy")

Citation

I am developing philentropy in my spare time and would be very grateful if you would consider citing the following paper if philentropy was useful for your own research. I plan to maintain and extend the functionality and usability of philentropy in the coming years and need citations to back up these efforts. Many thanks in advance :)

HG Drost (2018). Philentropy: Information Theory and Distance Quantification with R. Journal of Open Source Software, 3(26), 765. https://doi.org/10.21105/joss.00765

Examples

library(philentropy)
# retrieve available distance metrics
philentropy::getDistMethods()
 [1] "euclidean"         "manhattan"         "minkowski"        
 [4] "chebyshev"         "sorensen"          "gower"            
 [7] "soergel"           "kulczynski_d"      "canberra"         
[10] "lorentzian"        "intersection"      "non-intersection" 
[13] "wavehedges"        "czekanowski"       "motyka"           
[16] "kulczynski_s"      "tanimoto"          "ruzicka"          
[19] "inner_product"     "harmonic_mean"     "cosine"           
[22] "hassebrook"        "jaccard"           "dice"             
[25] "fidelity"          "bhattacharyya"     "hellinger"        
[28] "matusita"          "squared_chord"     "squared_euclidean"
[31] "pearson"           "neyman"            "squared_chi"      
[34] "prob_symm"         "divergence"        "clark"            
[37] "additive_symm"     "kullback-leibler"  "jeffreys"         
[40] "k_divergence"      "topsoe"            "jensen-shannon"   
[43] "jensen_difference" "taneja"            "kumar-johnson"    
[46] "avg"
# define a probability density function P
P <- 1:10/sum(1:10)
# define a probability density function Q
Q <- 20:29/sum(20:29)

# combine P and Q as matrix object
x <- rbind(P,Q)

# compute the jensen-shannon distance between
# probability density functions P and Q
philentropy::distance(x, method = "jensen-shannon")
jensen-shannon using unit 'log'.
jensen-shannon 
    0.02628933
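
To make the reported value easier to interpret, the following base-R sketch recomputes the Jensen-Shannon divergence of P and Q from the textbook definition, without calling philentropy. The helper name jsd_manual is hypothetical and not part of the package; the assumption is that distance() with method = "jensen-shannon" follows this standard definition, which the value printed above is consistent with.

```r
# Hypothetical helper (not part of philentropy): Jensen-Shannon divergence
# from the standard definition
#   JSD(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M),  M = (P + Q) / 2.
# Assumes strictly positive probability vectors of equal length.
jsd_manual <- function(p, q, log_fun = log) {
  m <- (p + q) / 2
  0.5 * sum(p * log_fun(p / m)) + 0.5 * sum(q * log_fun(q / m))
}

P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

jsd_nats <- jsd_manual(P, Q)                  # natural logarithm
jsd_bits <- jsd_manual(P, Q, log_fun = log2)  # same quantity in bits

print(jsd_nats)
print(jsd_bits)
```

Here jsd_nats should agree with the value printed above up to rounding. Passing log_fun = log2 rescales the result by 1 / log(2), which matches the jensen-shannon value that dist.diversity() reports with unit = "log2".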

Alternatively, users can also retrieve values from all available distance/similarity metrics using philentropy::dist.diversity():

philentropy::dist.diversity(x, p = 2, unit = "log2")
        euclidean         manhattan 
       0.12807130        0.35250464 
        minkowski         chebyshev 
       0.12807130        0.06345083 
         sorensen             gower 
       0.17625232        0.03525046 
          soergel      kulczynski_d 
       0.29968454        0.42792793 
         canberra        lorentzian 
       2.09927095        0.49712136 
     intersection  non-intersection 
       0.82374768        0.17625232 
       wavehedges       czekanowski 
       3.16657887        0.17625232 
           motyka      kulczynski_s 
       0.58812616        2.33684211 
         tanimoto           ruzicka 
       0.29968454        0.70031546 
    inner_product     harmonic_mean 
       0.10612245        0.94948528 
           cosine        hassebrook 
       0.93427641        0.86613103 
          jaccard              dice 
       0.13386897        0.07173611 
         fidelity     bhattacharyya 
       0.97312397        0.03930448 
        hellinger          matusita 
       0.32787819        0.23184489 
    squared_chord squared_euclidean 
       0.05375205        0.01640226 
          pearson            neyman 
       0.16814418        0.36742465 
      squared_chi         prob_symm 
       0.10102943        0.20205886 
       divergence             clark 
       1.49843905        0.86557468 
    additive_symm  kullback-leibler 
       0.53556883        0.13926288 
         jeffreys      k_divergence 
       0.31761069        0.04216273 
           topsoe    jensen-shannon 
       0.07585498        0.03792749 
jensen_difference            taneja 
       0.03792749        0.04147518 
    kumar-johnson               avg 
       0.62779644        0.20797774
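
As a sanity check on a few of the simpler entries above, the geometric distances can be recomputed by hand in base R. This is a sketch rather than the package's implementation: minkowski_manual is a hypothetical helper, and the formulas are the textbook definitions, which the printed values appear to follow.

```r
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

# Hypothetical helper (not part of philentropy): Minkowski distance of order p
minkowski_manual <- function(x, y, p) sum(abs(x - y)^p)^(1 / p)

manhattan_d <- sum(abs(P - Q))               # sum of absolute differences
euclidean_d <- sqrt(sum((P - Q)^2))          # square root of summed squares
chebyshev_d <- max(abs(P - Q))               # largest absolute difference
minkowski_d <- minkowski_manual(P, Q, p = 2) # order 2, as in the call above

print(c(manhattan = manhattan_d, euclidean = euclidean_d,
        chebyshev = chebyshev_d, minkowski = minkowski_d))
```

With p = 2 the Minkowski distance coincides with the Euclidean distance, which is why both entries show the same value in the output above.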

Install Developer Version

# install.packages("devtools")
# install the current version of philentropy on your system
library(devtools)
install_github("HajkD/philentropy", build_vignettes = TRUE, dependencies = TRUE)

NEWS

The current status of the package as well as a detailed history of the functionality of each version of philentropy can be found in the NEWS section.

Important Functions

Distance Measures

  • distance() : Implements 46 fundamental probability distance (or similarity) measures
  • getDistMethods() : Get available method names for 'distance'
  • dist.diversity() : Distance Diversity between Probability Density Functions
  • estimate.probability() : Estimate Probability Vectors From Count Vectors
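
The measures listed above compare probability vectors, so count data has to be normalized first. The following base-R sketch shows the simplest such normalization, relative frequencies. The helper counts_to_prob and the example counts are made up for illustration; whether estimate.probability() performs exactly this computation by default is an assumption and should be checked against the package documentation.

```r
# Hypothetical helper (not the package's implementation): convert a vector of
# non-negative counts into relative frequencies that sum to 1.
counts_to_prob <- function(counts) {
  stopifnot(is.numeric(counts), all(counts >= 0), sum(counts) > 0)
  counts / sum(counts)
}

counts <- c(12, 30, 8, 50)  # made-up example counts
prob <- counts_to_prob(counts)

print(prob)
print(sum(prob))
```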

Information Theory

  • H() : Shannon's Entropy H(X)
  • JE() : Joint-Entropy H(X,Y)
  • CE() : Conditional-Entropy H(X | Y)
  • MI() : Shannon's Mutual Information I(X,Y)
  • KL() : Kullback–Leibler Divergence
  • JSD() : Jensen-Shannon Divergence
  • gJSD() : Generalized Jensen-Shannon Divergence
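
For orientation, two of the quantities in this list can be written in a few lines of base R. This is a minimal sketch under stated assumptions, not the package's code: entropy_manual and kl_manual are hypothetical helpers, both use log2 (bits), and both assume strictly positive probability vectors that sum to 1.

```r
# Hypothetical helpers (not philentropy's implementations), in bits.
entropy_manual <- function(p) -sum(p * log2(p))    # Shannon entropy H(X)
kl_manual <- function(p, q) sum(p * log2(p / q))   # Kullback-Leibler KL(P || Q)

P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

h_uniform <- entropy_manual(rep(1 / 4, 4))  # uniform over 4 outcomes
kl_pq <- kl_manual(P, Q)
kl_qp <- kl_manual(Q, P)                    # differs from kl_pq: KL is asymmetric

print(h_uniform)
print(c(kl_pq = kl_pq, kl_qp = kl_qp))
```

For the vectors P and Q used in the examples, kl_pq should agree with the kullback-leibler entry that dist.diversity() reports with unit = "log2", and kl_pq + kl_qp with the jeffreys entry, up to rounding.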

Studies that successfully applied the philentropy package

  • An atlas of gene regulatory elements in adult mouse cerebrum YE Li, S Preissl, X Hou, Z Zhang, K Zhang et al. - Nature, 2021

  • Convergent somatic mutations in metabolism genes in chronic liver disease S Ng, F Rouhani, S Brunner, N Brzozowska et al. - Nature, 2021

  • Antigen dominance hierarchies shape TCF1+ progenitor CD8 T cell phenotypes in tumors ML Burger, AM Cruz, GE Crossland et al. - Cell, 2021

  • High-content single-cell combinatorial indexing R Mulqueen et al. - Nature Biotechnology, 2021

  • Extinction at the end-Cretaceous and the origin of modern Neotropical rainforests MR Carvalho, C Jaramillo et al. - Science, 2021

  • HERMES: a molecular-formula-oriented method to target the metabolome R Giné, J Capellades, JM Badia et al. - Nature Methods, 2021

  • The genetic architecture of temperature adaptation is shaped by population ancestry and not by selection regime KA Otte, V Nolte, F Mallard et al. - Genome Biology, 2021

  • The Tug1 lncRNA locus is essential for male fertility JP Lewandowski et al. - Genome Biology, 2020

  • Resolving the structure of phage–bacteria interactions in the context of natural diversity KM Kauffman, WK Chang, JM Brown et al. - Nature Communications, 2022

  • Gut microbiome-mediated metabolism effects on immunity in rural and urban African populations M Stražar, GS Temba, H Vlamakis et al. - Nature Communications, 2021

  • Aging, inflammation and DNA damage in the somatic testicular niche with idiopathic germ cell aplasia M Alfano, AS Tascini, F Pederzoli et al. - Nature Communications, 2021

  • Single cell census of human kidney organoids shows reproducibility and diminished off-target cells after transplantation A Subramanian et al. - Nature Communications, 2019

  • Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche C Coupé, YM Oh, D Dediu, F Pellegrino - Science Advances, 2019

  • Single-cell deletion analyses show control of pro–T cell developmental speed and pathways by Tcf7, Spi1, Gata3, Bcl11a, Erg, and Bcl11b W Zhou, F Gao, M Romero-Wolf, S Jo, EV Rothenberg - Science Immunology, 2022

  • Large-scale chromatin reorganization reactivates placenta-specific genes that drive cellular aging Z Liu, Q Ji, J Ren, P Yan, Z Wu, S Wang, L Sun, Z Wang et al - Developmental Cell, 2022

  • Direct epitranscriptomic regulation of mammalian translation initiation through N4-acetylcytidine D Arango, D Sturgill, R Yang, T Kanai, P Bauer et al. - Molecular Cell, 2022

  • Loss of adaptive capacity in asthmatic patients revealed by biomarker fluctuation dynamics after rhinovirus challenge A Sinha et al. - eLife, 2019

  • Sex and hatching order modulate the association between MHC‐II diversity and fitness in early‐life stages of a wild seabird M Pineaux et al. - Molecular Ecology, 2020

  • How the Choice of Distance Measure Influences the Detection of Prior-Data Conflict K Lek, R Van De Schoot - Entropy, 2019

  • Differential variation analysis enables detection of tumor heterogeneity using single-cell RNA-sequencing data EF Davis-Marcisak, TD Sherman et al. - Cancer Research, 2019

  • Multi-Omics Investigation of Innate Navitoclax Resistance in Triple-Negative Breast Cancer Cells M Marczyk et al. - Cancers, 2020

  • Impact of Gut Microbiome on Hypertensive Patients with Low-Salt Intake: Shika Study Results S Nagase et al. - Frontiers in Medicine, 2020

  • Combined TCR Repertoire Profiles and Blood Cell Phenotypes Predict Melanoma Patient Response to Personalized Neoantigen Therapy plus Anti-PD-1 A Poran et al. - Cell Reports Medicine, 2020

  • Identification of a glioma functional network from gene fitness data using machine learning C Xiang, X Liu, D Zhou, Y Zhou, X Wang, F Chen - Journal of Cellular and Molecular Medicine, 2022

  • Prediction of New Risk Genes and Potential Drugs for Rheumatoid Arthritis from Multiomics Data AM Birga, L Ren, H Luo, Y Zhang, J Huang - Computational and Mathematical Methods in Medicine, 2022

  • Phenotyping of acute and persistent COVID-19 features in the outpatient setting: exploratory analysis of an international cross-sectional online survey S Sahanic, P Tymoszuk, D Ausserhofer et al. - medRxiv, 2021

  • A two-part evaluation approach for measuring the usability and user experience of an Augmented Reality-based assistance system to support the temporal coordination of spatially dispersed teams L Thomaschewski, B Weyers, A Kluge - Cognitive Systems Research, 2021

  • SEDE-GPS: socio-economic data enrichment based on GPS information T Sperlea, S Füser, J Boenigk, D Heider - BMC Bioinformatics, 2018

  • Spatial and molecular anatomy of germ layers in the gastrulating primate embryo G Cui, S Feng, Y Yan, L Wang, X He, X Li, et al. - bioRxiv, 2022

  • Evacuees and Migrants Exhibit Different Migration Systems after the Great East Japan Earthquake and Tsunami M Hauer, S Holloway, T Oda - 2019

  • Robust comparison of similarity measures in analogy based software effort estimation P Phannachitta - 11th International Conference on Software, 2017

  • RUNIMC - An R-based package for imaging mass cytometry data analysis and pipeline validation L Dolcetti, PR Barber, G Weitsman, S Thavaraj et al. - bioRxiv, 2021

  • Expression variation analysis for tumor heterogeneity in single-cell RNA-sequencing data EF Davis-Marcisak, P Orugunta et al. - BioRxiv, 2018

  • Concept acquisition and improved in-database similarity analysis for medical data I Wiese, N Sarna, L Wiese, A Tashkandi, U Sax - Distributed and Parallel Databases, 2019

  • Dynamics of Vaginal and Rectal Microbiota over Several Menstrual Cycles in Female Cynomolgus Macaques MT Nugeyre, N Tchitchek, C Adapen et al. - Frontiers in Cellular and Infection Microbiology, 2019

  • Inferring the quasipotential landscape of microbial ecosystems with topological data analysis WK Chang, L Kelly - BioRxiv, 2019

  • Shifts in the nasal microbiota of swine in response to different dosing regimens of oxytetracycline administration KT Mou, HK Allen, DP Alt, J Trachsel et al. - Veterinary Microbiology, 2019

  • The Patchy Distribution of Restriction–Modification System Genes and the Conservation of Orphan Methyltransferases in Halobacteria MS Fullmer, M Ouellette, AS Louyakis et al. - Genes, 2019

  • Genetic differentiation and intrinsic genomic features explain variation in recombination hotspots among cocoa tree populations EJ Schwarzkopf, JC Motamayor, OE Cornejo - BioRxiv, 2019

  • Metastable regimes and tipping points of biochemical networks with potential applications in precision medicine SS Samal, J Krishnan, AH Esfahani et al. - Reasoning for Systems Biology and Medicine, 2019

  • Genome‐wide characterization and developmental expression profiling of long non‐coding RNAs in Sogatella furcifera ZX Chang, OE Ajayi, DY Guo, QF Wu - Insect Science, 2019

  • Development of a simulation system for modeling the stock market to study its characteristics P Mariya - 2018

  • The Tug1 Locus is Essential for Male Fertility JP Lewandowski, G Dumbović, AR Watson, T Hwang et al. - BioRxiv, 2019

  • Microbiotyping the sinonasal microbiome A Bassiouni, S Paramasivan, A Shiffer et al. - BioRxiv, 2019

  • Critical search: A procedure for guided reading in large-scale textual corpora J Guldi - Journal of Cultural Analytics, 2018

  • A Bibliography of Publications about the R, S, and S-Plus Statistics Programming Languages NHF Beebe - 2019

  • Improved state change estimation in dynamic functional connectivity using hidden semi-Markov models H Shappell, BS Caffo, JJ Pekar, MA Lindquist - NeuroImage, 2019

  • A Smart Recommender Based on Hybrid Learning Methods for Personal Well-Being Services RM Nouh, HH Lee, WJ Lee, JD Lee - Sensors, 2019

  • Cognitive Structural Accuracy V Frenz - 2019

  • Kidney organoid reproducibility across multiple human iPSC lines and diminished off target cells after transplantation revealed by single cell transcriptomics A Subramanian, EH Sidhom, M Emani et al. - BioRxiv, 2019

  • Multi-classifier majority voting analyses in provenance studies on iron artefacts G Żabiński et al. - Journal of Archaeological Science, 2020

  • Identifying inhibitors of epithelial–mesenchymal plasticity using a network topology-based approach K Hari et al. - NPJ Systems Biology and Applications, 2020

  • Genetic differentiation and intrinsic genomic features explain variation in recombination hotspots among cocoa tree populations EJ Schwarzkopf et al. - BMC Genomics, 2020

  • Enhancing Card Sorting Dendrograms through the Holistic Analysis of Distance Methods and Linkage Criteria. JA Macías - Journal of Usability Studies, 2021

  • Pattern-based identification and mapping of landscape types using multi-thematic data J Nowosad, TF Stepinski - International Journal of Geographical Information, 2021

  • Motif Analysis in k-mer Networks: An Approach towards Understanding SARS-CoV-2 Geographical Shifts S Biswas, S Saha, S Bandyopadhyay, M Bhattacharyya - bioRxiv, 2020

  • Motif: an open-source R tool for pattern-based spatial analysis J Nowosad - Landscape Ecology, 2021

  • New effective spectral matching measures for hyperspectral data analysis C Kumar, S Chatterjee, T Oommen, A Guha - International Journal of Remote Sensing, 2021

  • Innovative activity of Polish enterprises–a strategic aspect. The similarity of NACE divisions E Bielińska-Dusza, M Hamerska - Journal of Entrepreneurship, Management and innovation, 2021

  • Unraveling the record of a tropical continental Cretaceous-Paleogene boundary in northern Colombia, South America F de la Parra, C Jaramillo, P Kaskes et al. - Journal of South American Earth Sciences, 2022

  • A roadmap to reconstructing muscle architecture from CT data J Katzke, P Puchenkov, H Stark, EP Economo - Integrative Organismal Biology, 2022

  • Pandemonium: a clustering tool to partition parameter space—application to the B anomalies U Laa, G Valencia - The European Physical Journal Plus, 2022

  • Cross compatibility in intraspecific and interspecific hybridization in yam (Dioscorea spp.) JM Mondo, PA Agre, A Edemodu et al. - Scientific reports, 2022

  • A Modular and Expandable Ecosystem for Metabolomics Data Annotation in R J Rainer, A Vicini, L Salzer, J Stanstrup et al. - Metabolites, 2022

  • Single-Cell Transcriptome Integration Analysis Reveals the Correlation Between Mesenchymal Stromal Cells and Fibroblasts C Fan, M Liao, L Xie, L Huang, S Lv, S Cai et al. - Frontiers in genetics, 2022

  • Phenotypic regionalization of the vertebral column in the thorny skate Amblyraja radiata: Stability and variation F Berio, Y Bayle, C Riley, O Larouche, R Cloutier - Journal of Anatomy, 2022

  • Community assembly during vegetation succession after metal mining is driven by multiple processes with temporal variation T Li, H Yang, X Yang, Z Guo, D Fu, C Liu, S Li et al. - Ecology and evolution, 2022

  • Optimizing use of US Ex-PVP inbred lines for enhancing agronomic performance of tropical Striga resistant maize inbred lines ARS Maazou, M Gedil, VO Adetimirin et al. - BMC Plant Biology, 2022

Discussions and Bug Reports

I would be very happy to learn more about potential improvements of the concepts and functions provided in this package.

Furthermore, in case you find some bugs or need additional (more flexible) functionality of parts of this package, please let me know:

https://github.com/drostlab/philentropy/issues

or find me on twitter: HajkDrost

Version: 0.7.0

License: GPL-2

Maintainer: Hajk-Georg Drost

Last Published: November 5th, 2022

Monthly Downloads: 4,089

Functions in philentropy (0.7.0)

  • czekanowski : Czekanowski distance (low-level function)
  • chebyshev : Chebyshev distance (low-level function)
  • dist.diversity : Distance Diversity between Probability Density Functions
  • clark_sq : Clark squared distance (low-level function)
  • dice_dist : Dice distance (low-level function)
  • dist_one_many : Distances and Similarities between One and Many Probability Density Functions
  • estimate.probability : Estimate Probability Vectors From Count Vectors
  • getDistMethods : Get method names for distance
  • gower : Gower distance (low-level function)
  • dist_many_many : Distances and Similarities between Many Probability Density Functions
  • jeffreys : Jeffreys distance (low-level function)
  • jaccard : Jaccard distance (low-level function)
  • distance : Distances and Similarities between Probability Density Functions
  • divergence_sq : Divergence squared distance (low-level function)
  • inner_product : Inner product distance (low-level function)
  • minkowski : Minkowski distance (low-level function)
  • intersection_dist : Intersection distance (low-level function)
  • euclidean : Euclidean distance (low-level function)
  • matusita : Matusita distance (low-level function)
  • kullback_leibler_distance : Kullback-Leibler distance (low-level function)
  • lin.cor : Linear Correlation
  • gJSD : Generalized Jensen-Shannon Divergence
  • fidelity : Fidelity distance (low-level function)
  • kumar_hassebrook : Kumar-Hassebrook distance (low-level function)
  • kumar_johnson : Kumar-Johnson distance (low-level function)
  • prob_symm_chi_sq : Probability symmetric chi-squared distance (low-level function)
  • manhattan : Manhattan distance (low-level function)
  • pearson_chi_sq : Pearson chi-squared distance (low-level function)
  • lorentzian : Lorentzian distance (low-level function)
  • soergel : Soergel distance (low-level function)
  • harmonic_mean_dist : Harmonic mean distance (low-level function)
  • ruzicka : Ruzicka distance (low-level function)
  • hellinger : Hellinger distance (low-level function)
  • squared_euclidean : Squared Euclidean distance (low-level function)
  • squared_chord : Squared chord distance (low-level function)
  • sorensen : Sorensen distance (low-level function)
  • squared_chi_sq : Squared chi-squared distance (low-level function)
  • taneja : Taneja difference (low-level function)
  • tanimoto : Tanimoto distance (low-level function)
  • jensen_difference : Jensen difference (low-level function)
  • jensen_shannon : Jensen-Shannon distance (low-level function)
  • kulczynski_d : Kulczynski_d distance (low-level function)
  • neyman_chi_sq : Neyman chi-squared distance (low-level function)
  • k_divergence : K-Divergence (low-level function)
  • motyka : Motyka distance (low-level function)
  • topsoe : Topsoe distance (low-level function)
  • wave_hedges : Wave hedges distance (low-level function)
  • H : Shannon's Entropy H(X)
  • additive_symm_chi_sq : Additive symmetric chi-squared distance (low-level function)
  • JSD : Jensen-Shannon Divergence
  • bhattacharyya : Bhattacharyya distance (low-level function)
  • binned.kernel.est : Kernel Density Estimation
  • JE : Shannon's Joint-Entropy H(X,Y)
  • KL : Kullback-Leibler Divergence
  • MI : Shannon's Mutual Information I(X,Y)
  • CE : Shannon's Conditional-Entropy H(X | Y)
  • avg : AVG distance (low-level function)
  • cosine_dist : Cosine distance (low-level function)
  • canberra : Canberra distance (low-level function)
  • dist_one_one : Distances and Similarities between Two Probability Density Functions