Learn R Programming

GREED : Bayesian greedy clustering

Greed enables model-based clustering of networks, matrices of count data and much more with different types of generative models. Model-selection and clustering are performed in combination by optimizing the Integrated Classification Likelihood. Details of the algorithms and methods proposed by this package can be found in Côme, Jouvin, Latouche, and Bouveyron (2021) 10.1007/s11634-021-00440-z.

Dedicated to clustering and visualization, the package is very general and currently handles the following tasks:

  • Continuous data clustering with Gaussian Mixture Models. A GMM tutorial is available. See also the documentation for the Gmm and DiagGmm S4 classes.
  • Graph data clustering with the Stochastic Block Model or its degree corrected variants. A SBM tutorial is available . See also the documentation for the Sbm and dcSbm S4 classes.
  • Categorical data clustering with the Latent Class Analysis. An LCA tutorial is available. See also the documentation for the Lca S4 class.
  • Count data clustering with the Mixture of Multinomials model. A tutorial will soon be available. For now, we refer to the documentation for the Mom S4 class.
  • Mixed-typed data clustering, e.g. categorical and numerical but the package handles virtually any type of data combination by stacking models on top of each data types. For example graph data with continuous or categorical data attached to the nodes are handled. A CombinedModels tutorial is available. See also the documentation for the CombinedModels S4 class.
  • Mixture of regression for simultaneous clustering and fitting a regression model in each cluster. A MoR tutorial is available. See also the documentation for the MoR S4 class.
  • Co-clustering of binary and count-data via the Latent Block Model and its degree-corrected variant. A tutorial will soon be available. For now, we refer to the documentation for the DcLbm S4 class.

With the Integrated Classification Likelihood, the parameters of the models are integrated out with a natural regularization effect for complex models. This penalization allows to automatically find a suitable value for the number of clusters K. A user only needs to provide an initial guess for the number of clusters K, as well as values for the prior parameters (reasonable default values are used if no prior information is given). The default optimization is performed thanks to a combination of a greedy local search and a genetic algorithm described in Côme, Jouvin, Latouche, and Bouveyron (2021), but several other optimization algorithms are also available.

Eventually, a whole hierarchy of solutions from K to 1 cluster is extracted. This enables an ordering of the clusters, and the exploration of simpler clustering along the hierarchy. The package also provides some plotting functionality.

Installation

You can install the development version of greed from GitHub with:

#GitHub
install.packages("devtools")
devtools::install_github("comeetie/greed")

Or use the CRAN version:

#CRAN
install.packages("greed")

Usage: the greed function

The main entry point for using the package is simply thegreed function (see ?greed). The generative model will be chosen automatically to fit the type of the provided data, but you may specify another choice with the model argument.

We illustrate its use on a graph clustering example with the classical Books network ?Books.

More use cases and their specific plotting functionality are described in the vignettes.

library(greed)
data(Books)
sol <- greed(Books$X) 
#> 
#> ── Fitting a guess DCSBM model ──
#> 
#> ℹ Initializing a population of 20 solutions.
#> ℹ Generation 1 : best solution with an ICL of -1347 and 3 clusters.
#> ℹ Generation 2 : best solution with an ICL of -1346 and 4 clusters.
#> ℹ Generation 3 : best solution with an ICL of -1346 and 4 clusters.
#> ── Final clustering ──
#> 
#> ── Clustering with a DCSBM model 3 clusters and an ICL of -1345

You may specify the model you want to use and set the priors parameters with the (model argument), the optimization algorithm (alg argument) and the initial number of cluster K. Here Books$X is a square sparse matrix and a graph clustering ?`DcSbm-class` model will be used by default. By default, the Hybrid genetic algorithm is used.

The next example illustrates a usage without default values. A binary Sbm prior is used, along with a spectral clustering algorithm for graphs.

sol <- greed(Books$X,model=Sbm(),alg=Seed(),K=10)
#> 
#> ── Fitting a guess SBM model ──
#> 
#> ── Final clustering ──
#> 
#> ── Clustering with a SBM model 5 clusters and an ICL of -1255

Result analysis

The results of greed() is an S4 class which depends on the model argument (here, an SBM) which comes with readily implemented methods: clustering() to access the estimated partitions, K() the estimated number of clusters, and coef() the (conditional) maximum a posteriori of the model parameters.

table(Books$label,clustering(sol)) %>% knitr::kable()
12345
c016366
l830500
n02830
K(sol)
#> [1] 5
coef(sol)
#> $pi
#> [1] 0.07619048 0.31428571 0.18095238 0.37142857 0.05714286
#> 
#> $thetakl
#>             [,1]        [,2]        [,3]        [,4]       [,5]
#> [1,] 0.821428571 0.367424242 0.065789474 0.003205128 0.00000000
#> [2,] 0.367424242 0.106060606 0.006379585 0.003885004 0.00000000
#> [3,] 0.065789474 0.006379585 0.251461988 0.016194332 0.04385965
#> [4,] 0.003205128 0.003885004 0.016194332 0.099865047 0.42735043
#> [5,] 0.000000000 0.000000000 0.043859649 0.427350427 0.73333333

Inspecting the hierarchy

An important aspect of the greed package is its hierarchical clustering algorithm which extract a set of nested partitions from K=K(sol) to K=1. This hierarchy may be visualized thanks to a dendogram representing the fusion order and the level of regularization  − log (α) needed for each fusion.

plot(sol, type='tree') # try also: type="path"

Moreover, similar to standard hierarchical algorithm such as hclust, the cut() method allows you to extract a partition at any stage of the hierarchy. Its results is still an S4 object, and the S4 methods introduced earlier may again be used to investigate the results.

sol_K3 = cut(sol, K=3)
K(sol_K3)
#> [1] 3
table(Books$label,clustering(sol_K3)) %>% knitr::kable()
123
c1642
l3850
n283

Visualization

Finally, the greed package propose efficient and model-adapted visualization via the plot() methods. In this graph clustering example, the "blocks" and "nodelink" display the cluster-aggregated adjacency matrix and diagram of the graph respectively. Note that the ordering of the clusters is the same than the one computed for the dendrogram, greatly enhancing visualization of the hierarchical structure.

plot(sol,type='blocks')
plot(sol, type='nodelink')

Other models

As explained above, the greed package implements many standard models and the list may be displayed with

available_models()

Many plotting functions are available and, depending of the specified model, different type argument may be specified. For further information we refer to the vignettes linked above for each use case.

Using parallel computing

For large datasets, it is possible to use parallelism to speed-up the computations thanks to the future package. You only need to specify the type of back-end you want to use, before calling the ?greed function:

library(future)
plan(multisession, workers=2) # may be increased

Copy Link

Version

Install

install.packages('greed')

Monthly Downloads

246

Version

0.6.1

License

GPL

Issues

Pull Requests

Stars

Forks

Maintainer

Etienne C<c3><b4>me

Last Published

October 3rd, 2022

Functions in greed (0.6.1)

DcSbmFit-class

Degree Corrected Stochastic Block Model fit results class
DcSbm

Degree Corrected Stochastic Block Model Prior class
DcSbmPath-class

Degree Corrected Stochastic Block Model hierarchical fit results class
Fifa

Fifa data
DiagGmmPath-class

Diagonal Gaussian mixture model hierarchical fit results class
DiagGmmFit-class

Diagonal Gaussian mixture model fit results class
IclFit-class

Abstract class to represent a clustering result
ICL

Generic method to extract the ICL value from an IclFit-class object
Genetic-class

Genetic optimization algorithm
DlvmCoPrior-class

Abstract class to represent a generative model for co-clustering
DiagGmm

Diagonal Gaussian Mixture Model Prior description class
H

Compute the entropy of a discrete sample
Gmm

Gaussian Mixture Model Prior description class
MoRPath-class

Multivariate mixture of regression model hierarchical fit results class
MultSbm

Multinomial Stochastic Block Model Prior class
MI

Compute the mutual information of two discrete samples
Hybrid-class

Hybrid optimization algorithm
MoM

Mixture of Multinomial Model Prior description class
LcaFit-class

Latent Class Analysis fit results class
LcaPath-class

Latent Class Analysis hierarchical fit results class
IclPath-class

Abstract class to represent a hierarchical clustering result
Jazz

Jazz musicians network dataset
SevenGraders

SevenGraders data
SbmFit-class

Stochastic Block Model fit results class
Sbm

Stochastic Block Model Prior class
coef,DcSbmFit-method

Extract parameters from an DcSbmFit-class object
coef,DiagGmmFit-method

Extract mixture parameters from DiagGmmFit-class object
DlvmPrior-class

Abstract class to represent a generative model for clustering
extractSubModel

Extract a part of a CombinedModelsPath-class object
cut,IclPath-method

Generic method to cut a path solution to a desired number of cluster
Football

American College football network dataset
coef,MoRFit-method

Extract mixture parameters from MoRFit-class object using MAP estimation
Youngpeoplesurvey

Young People survey data
coef,MultSbmFit-method

Extract parameters from an MultSbmFit-class object
plot,DcSbmFit,missing-method

Plot a DcSbmFit-class object
plot,DiagGmmFit,missing-method

Plot a DiagGmmFit-class object
rlca

Generate data from lca model
rlbm

Generate a data matrix using a Latent Block Model
plot,GmmFit,missing-method

Plot a GmmFit-class object
plot,IclPath,missing-method

Plot an IclPath-class object
GmmFit-class

Gaussian mixture model fit results class
K

Generic method to get the number of clusters from an IclFit-class object
GmmPath-class

Gaussian mixture model hierarchical fit results class
rmm

Generate data using a Multinomial Mixture
MoR

Multivariate mixture of regression Prior model description class
Lca

Latent Class Analysis Model Prior class
NewGuinea

NewGuinea data
Ndrangheta

Ndrangheta mafia covert network dataset
rmreg

Generate data from a mixture of regression model
MoMFit-class

Mixture of Multinomial fit results class
available_algorithms

Display the list of every currently available optimization algorithm
MoMPath-class

Mixture of Multinomial hierarchical fit results class
MoRFit-class

Clustering with a multivariate mixture of regression model fit results class
MultSbmPath-class

Multinomial Stochastic Block Model hierarchical fit results class
MultSbmFit-class

Multinomial Stochastic Block Model fit results class
SbmPath-class

Stochastic Block Model hierarchical fit results class
Seed-class

Greedy algorithm with seeded initialization
coef,GmmFit-method

Extract mixture parameters from GmmFit-class object
coef,IclFit-method

Extract parameters from an IclFit-class object
to_multinomial

Convert a binary adjacency matrix with missing value to a cube
Multistarts-class

Greedy algorithm with multiple start class
fashion

Fashion mnist dataset
available_models

Display the list of every currently available DLVM
plot,DcLbmFit,missing-method

Plot a DcLbmFit-class
gmmpairs

Make a matrix of plots with a given data and gmm fitted parameters
plot,DcLbmPath,missing-method

Plot a DcLbmPath-class
coef,LcaFit-method

Extract parameters from an LcaFit-class object
coef,MoMFit-method

Extract parameters from an MoMFit-class object
NMI

Compute the normalized mutual information of two discrete samples
plot,MultSbmFit,missing-method

Plot a MultSbmFit-class object
plot,LcaFit,missing-method

Plot a LcaFit-class object
rmultsbm

Generate a graph adjacency matrix using a Stochastic Block Model
plot,SbmFit,missing-method

Plot a SbmFit-class object
plot,MoMFit,missing-method

Plot a MoMFit-class object
rsbm

Generate a graph adjacency matrix using a Stochastic Block Model
clustering

Method to extract the clustering results from an IclFit-class object
show,IclFit-method

Show an IclPath object
spectral

Regularized spectral clustering
coef,DcLbmFit-method

Extract parameters from an DcLbmFit-class object
cut,DcLbmPath-method

Method to cut a DcLbmPath solution to a desired number of cluster
coef,SbmFit-method

Extract parameters from an SbmFit-class object
greed

Model based hierarchical clustering
rdcsbm

Generates graph adjacency matrix using a degree corrected SBM
prior

Generic method to extract the prior used to fit IclFit-class object
mushroom

Mushroom data
Alg-class

Abstract optimization algorithm class
DcLbmFit-class

Degree corrected Latent Block Model fit results class
Books

Books about US politics network dataset
CombinedModelsPath-class

Combined Models hierarchical fit results class
DcLbm

Degree Corrected Latent Block Model for bipartite graph class
CombinedModels

Combined Models classes
CombinedModelsFit-class

Combined Models fit results class
DcLbmPath-class

Degree corrected Latent Block Model hierarchical fit results class