A reimplementation of Genie - a robust and outlier resistant
clustering algorithm by Gagolewski, Bartoszuk, and Cena (2016).
The Genie algorithm is based on the minimum spanning tree (MST) of the
pairwise distance graph of a given point set.
Just like the Single Linkage method, it consumes the edges
of the MST in an increasing order of weights. However, it prevents
the formation of clusters of highly imbalanced sizes; once the Gini index
(see gini_index()) of the cluster size distribution
raises above gini_threshold, merging a point group
of the smallest size is enforced.
A clustering can also be computed with respect to the
\(M\)-mutual reachability distance (based, e.g., on the Euclidean metric),
which is used in the definition of the HDBSCAN* algorithm
(see mst() for the definition).
For the smoothing factor \(M>0\), outliers are pulled away from
their neighbours. This way, the Genie algorithm gives an alternative
to the HDBSCAN* algorithm (Campello et al., 2013) that is able to detect
a predefined number of clusters and indicate outliers (via deadwood; see
Gagolewski, 2026) without depending on DBSCAN*'s eps or HDBSCAN*'s
min_cluster_size parameters. Also make sure to check out
the Lumbermark algorithm (package lumbermark) that is also based on MSTs.
gclust(d, ...)# S3 method for default
gclust(
d,
gini_threshold = 0.3,
M = 0L,
distance = c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"),
verbose = FALSE,
...
)
# S3 method for dist
gclust(d, gini_threshold = 0.3, M = 0L, verbose = FALSE, ...)
# S3 method for mst
gclust(d, gini_threshold = 0.3, verbose = FALSE, ...)
genie(d, ...)
# S3 method for default
genie(
d,
k,
gini_threshold = 0.3,
M = 0L,
distance = c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"),
verbose = FALSE,
...
)
# S3 method for dist
genie(d, k, gini_threshold = 0.3, M = 0L, verbose = FALSE, ...)
# S3 method for mst
genie(d, k, gini_threshold = 0.3, verbose = FALSE, ...)
gclust() computes the entire clustering hierarchy; it
returns a list of class hclust; see hclust.
Use cutree to obtain an arbitrary \(k\)-partition.
genie() returns an object of class mstclust, which defines
a \(k\)-partition, i.e., a vector whose \(i\)-th element denotes
the \(i\)-th input point's cluster label between 1 and \(k\).
In both cases, the mst attribute gives the computed minimum
spanning tree which can be reused in further calls to the functions
from genieclust, lumbermark, and deadwood.
For genie(), the cut_edges attribute gives the \(k-1\)
indexes of the MST edges whose omission leads to the requested
\(k\)-partition (connected components of the resulting spanning forest).
In gclust(), these are exactly the last \(k-1\) indexes in the
links attribute (but sorted).
a numeric matrix (or an object coercible to one,
e.g., a data frame with numeric-like columns) or an
object of class dist (see dist),
or an object of class mst (see mst)
further arguments passed to mst()
threshold for the Genie correction, i.e., the Gini index of the cluster size distribution; threshold of 1.0 leads to the single linkage algorithm; low thresholds highly penalise the formation of small clusters
smoothing factor; \(M \leq 1\) gives the selected distance;
otherwise, the mutual reachability distance is used
metric used to compute the linkage, one of:
"euclidean" (synonym: "l2"),
"manhattan" (a.k.a. "l1" and "cityblock"),
"cosine"
logical; whether to print diagnostic messages and progress information
the desired number of clusters to detect
Marek Gagolewski and other contributors
As with all distance-based methods (this includes k-means and DBSCAN as well), applying data preprocessing and feature engineering techniques (e.g., feature scaling, feature selection, dimensionality reduction) might lead to more meaningful results.
If d is a numeric matrix or an object of class dist,
mst() will be called to compute an MST, which generally
takes at most \(O(n^2)\) time. However, by default, a faster algorithm
based on K-d trees is selected automatically for low-dimensional Euclidean
spaces; see mst_euclid from
the quitefastmst package.
Once a minimum spanning tree is determined, the Genie algorithm runs in
\(O(n \sqrt{n})\) time. If you want to test different
gini_thresholds or \(k\)s, it is best to compute
the MST explicitly beforehand.
Due to Genie's original definition, the resulting partition tree (dendrogram)
might violate the ultrametricity property (merges might occur at levels that
are not increasing w.r.t. a between-cluster distance).
gclust() automatically corrects departures from
ultrametricity by applying height = rev(cummin(rev(height))).
M. Gagolewski, M. Bartoszuk, A. Cena, Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, 8-23, tools:::Rd_expr_doi("10.1016/j.ins.2016.05.003")
R.J.G.B. Campello, D. Moulavi, J. Sander, Density-based clustering based on hierarchical density estimates, Lecture Notes in Computer Science 7819, 2013, 160-172, tools:::Rd_expr_doi("10.1007/978-3-642-37456-2_14")
M. Gagolewski, A. Cena, M. Bartoszuk, Ł. Brzozowski, Clustering with minimum spanning trees: How good can it be?, Journal of Classification 42, 2025, 90-112, tools:::Rd_expr_doi("10.1007/s00357-024-09483-1")
M. Gagolewski, genieclust: Fast and robust hierarchical clustering, SoftwareX 15, 2021, 100722, tools:::Rd_expr_doi("10.1016/j.softx.2021.100722")
M. Gagolewski, deadwood, in preparation, 2026
M. Gagolewski, quitefastmst, in preparation, 2026
The official online manual of genieclust at https://genieclust.gagolewski.com/
Gagolewski, M., genieclust: Fast and robust hierarchical clustering, SoftwareX 15:100722, 2021, tools:::Rd_expr_doi("10.1016/j.softx.2021.100722")
mst() for the minimum spanning tree routines
normalized_clustering_accuracy() (amongst others) for external
cluster validity measures
library("datasets")
data("iris")
X <- jitter(as.matrix(iris[3:4]))
h <- gclust(X)
y_pred <- cutree(h, 3)
y_test <- as.integer(iris[,5])
plot(X, col=y_pred, pch=y_test, asp=1, las=1)
adjusted_rand_score(y_test, y_pred)
normalized_clustering_accuracy(y_test, y_pred)
# detect 3 clusters and find outliers with Deadwood
library("deadwood")
y_pred2 <- genie(X, k=3, M=5) # the 5-mutual reachability distance
plot(X, col=y_pred2, asp=1, las=1)
is_outlier <- deadwood(y_pred2)
points(X[!is_outlier, ], col=y_pred2[!is_outlier], pch=16)
Run the code above in your browser using DataLab