Determine a(*) minimum spanning tree (MST) of the complete undirected graph representing a set of \(n\) points whose weights correspond to the pairwise distances between the points.
mst(d, ...)# S3 method for default
mst(
d,
distance = c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"),
M = 1L,
verbose = FALSE,
...
)
# S3 method for dist
mst(d, M = 1L, verbose = FALSE, ...)
Returns a numeric matrix of class mst
with \(n-1\) rows and
three columns: from
, to
, and dist
sorted
nondecreasingly. Its i-th row specifies the i-th edge of the MST
which is incident to the vertices from[i]
and to[i]
with
from[i] < to[i]
(in 1,...,n)
and dist[i]
gives the corresponding weight, i.e., the
distance between the point pair.
The Size
attribute specifies the number of points, \(n\).
The Labels
attribute gives the labels of the input points,
if available.
The method
attribute provides the name of the distance function used.
If \(M>1\), the nn.index
attribute gives the indices
of the M
-1 nearest neighbours of each point
and nn.dist
provides the corresponding distances,
both in the form of an \(n\) by \(M-1\) matrix.
either a numeric matrix (or an object coercible to one,
e.g., a data frame with numeric-like columns) or an
object of class dist
; see dist
further arguments passed to or from other methods,
in particular, to mst_euclid
from the quitefastmst package
metric used in the case where d
is a matrix; one of:
"euclidean"
(synonym: "l2"
),
"manhattan"
(a.k.a. "l1"
and "cityblock"
),
"cosine"
smoothing factor; M
= 1 selects the requested
distance
; otherwise, the corresponding degree-M
mutual
reachability distance is used; M
should be rather small,
say, \(\leq 20\)
logical; whether to print diagnostic messages and progress information
Marek Gagolewski and other contributors
(*) Note that if the distances are non unique, there might be multiple minimum trees spanning a given graph.
If d
is a matrix and the use of Euclidean distance is requested
(the default), then mst_euclid
is called
to determine the MST. It is quite fast in spaces of low intrinsic
dimensionality, even for 10M points.
Otherwise, a much slower implementation of the Jarník (Prim/Dijkstra)-like
method, which requires \(O(n^2)\) time, is used. The algorithm is parallelised;
the number of threads is determined by the OMP_NUM_THREADS
environment
variable. As a rule of thumb, datasets up to 100k points
should be processed relatively quickly.
If \(M>1\), then the mutual reachability distance \(m(i,j)\)
with the smoothing factor \(M\) (see Campello et al. 2013)
is used instead of the chosen "raw" distance \(d(i,j)\).
It holds \(m(i, j)=\max\{d(i,j), c(i), c(j)\}\), where \(c(i)\) is
the core distance, i.e., the distance between the \(i\)-th point and
its (M
-1)-th nearest neighbour.
This makes "noise" and "boundary" points being "pulled away" from each other.
The Genie clustering algorithm (see gclust
) with respect to
the mutual reachability distance can mark some observations as noise points.
V. Jarník, O jistem problemu minimalnim, Prace Moravske Prirodovedecke Spolecnosti 6, 1930, 57-63.
C.F. Olson, Parallel algorithms for hierarchical clustering, Parallel Computing 21, 1995, 1313-1325.
R. Prim, Shortest connection networks and some generalisations, The Bell System Technical Journal 36(6), 1957, 1389-1401.
O. Borůvka, O jistém problému minimálním, Práce Moravské Přírodovědecké Společnosti 3, 1926, 37–58.
J.L. Bentley, Multidimensional binary search trees used for associative searching, Communications of the ACM 18(9), 509–517, 1975, tools:::Rd_expr_doi("10.1145/361002.361007"). W.B. March, R. Parikshit, A.G. Gray, Fast Euclidean minimum spanning tree: Algorithm, analysis, and applications, Proc. 16th ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining (KDD '10), 2010, 603–612.
R.J.G.B. Campello, D. Moulavi, J. Sander, Density-based clustering based on hierarchical density estimates, Lecture Notes in Computer Science 7819, 2013, 160-172, tools:::Rd_expr_doi("10.1007/978-3-642-37456-2_14").
The official online manual of genieclust at https://genieclust.gagolewski.com/
Gagolewski, M., genieclust: Fast and robust hierarchical clustering, SoftwareX 15:100722, 2021, tools:::Rd_expr_doi("10.1016/j.softx.2021.100722")
mst_euclid
library("datasets")
data("iris")
X <- jitter(as.matrix(iris[1:2])) # some data
T <- mst(X)
plot(X, asp=1, las=1)
segments(X[T[, 1], 1], X[T[, 1], 2],
X[T[, 2], 1], X[T[, 2], 2])
Run the code above in your browser using DataLab