
dbscan (version 1.1-9)

hdbscan: Hierarchical DBSCAN (HDBSCAN)

Description

Fast C++ implementation of HDBSCAN (Hierarchical DBSCAN) and its related algorithms.

Usage

hdbscan(x, minPts, gen_hdbscan_tree = FALSE, gen_simplified_tree = FALSE)

# S3 method for hdbscan
plot(x, scale = "suggest", gradient = c("yellow", "red"), show_flat = FALSE, ...)

Arguments

x

a data matrix (Euclidean distances are used) or a dist object calculated with an arbitrary distance metric.

minPts

integer; minimum size of clusters. See Details.

gen_hdbscan_tree

logical; should the robust single linkage tree be explicitly computed (see cluster tree in Chaudhuri et al., 2010)?

gen_simplified_tree

logical; should the simplified hierarchy be explicitly computed (see Campello et al., 2013)?

scale

integer; used to scale the condensed tree based on the graphics device. A lower scale results in a wider tree.

gradient

character vector; the colors to build the condensed tree coloring with.

show_flat

logical; whether to draw boxes indicating the most stable clusters.

...

additional arguments are passed on.

Value

An object of class hdbscan with the following components:

cluster

An integer vector with cluster assignments. Zero indicates noise points.

minPts

value of the minPts parameter.

cluster_scores

The sum of the stability scores for each salient (flat) cluster. Corresponds to the cluster IDs given in the cluster element.

membership_prob

The probability or individual stability of a point within its cluster. Between 0 and 1.

outlier_scores

The GLOSH outlier score of each point.

hc

An hclust object of the HDBSCAN hierarchy.
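
As a quick illustration, the returned components can be inspected directly. A short usage sketch (using the moons data set shipped with the package, as in the Examples below):

library(dbscan)
data(moons)
res <- hdbscan(moons, minPts = 5)

res$cluster                 ## flat cluster assignments (0 = noise)
res$cluster_scores          ## stability score of each extracted cluster
head(res$membership_prob)   ## per-point membership probability, in [0, 1]
head(res$outlier_scores)    ## per-point GLOSH outlier score
plot(res$hc, main = "HDBSCAN hierarchy")   ## the underlying hclust object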

Details

This fast implementation of HDBSCAN (Hahsler et al, 2019) computes the hierarchical cluster tree representing density estimates along with the stability-based flat cluster extraction proposed by Campello et al. (2013). HDBSCAN essentially computes the hierarchy of all DBSCAN* clusterings, and then uses a stability-based extraction method to find optimal cuts in the hierarchy, thus producing a flat solution.
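
To make the DBSCAN* connection concrete, the hierarchy stored in the hc component can be cut at a fixed height. The following is only an illustration (an assumed correspondence, not the stability-based extraction the package performs): a cut at height eps roughly corresponds to a DBSCAN* clustering with that eps, where singletons in the cut play the role of noise points.

library(dbscan)
data(moons)
res <- hdbscan(moons, minPts = 5)

eps <- 0.5                             ## illustrative value only
cut_labels <- cutree(res$hc, h = eps)  ## flat cut of the mutual-reachability hierarchy
table(cut_labels)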

Additional related algorithms are supported: the "Global-Local Outlier Score from Hierarchies" (GLOSH; see section 6 of Campello et al., 2015) is available in function glosh(), and clustering based on instance-level constraints (see section 5.3 of Campello et al., 2015) is also supported. These algorithms only need the parameter minPts.
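
For example, GLOSH scores can be computed directly with glosh(). A minimal sketch (assuming glosh() takes the data and a neighborhood size k analogous to minPts; the same scores are also returned by hdbscan() in the outlier_scores component):

library(dbscan)
data(moons)

os <- glosh(moons, k = 5)               ## GLOSH outlier score per point
plot(moons, pch = 19, cex = 0.5 + os)   ## more outlying points are drawn larger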

Note that minPts not only acts as a minimum cluster size to detect, but also as a "smoothing" factor of the density estimates implicitly computed from HDBSCAN.

predict() assigns each new data point to the cluster of the original data point that has the smallest mutual reachability distance to it.
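
A minimal sketch of the mutual reachability distance (not the package's internal C++ code), assuming the core distance of a point is its distance to its (minPts - 1)-th nearest neighbor:

library(dbscan)
data(moons)
minPts <- 5

core <- kNNdist(moons, k = minPts - 1)   ## core distance of each point
d    <- as.matrix(dist(moons))           ## pairwise Euclidean distances
## mutual reachability: d_mreach(a, b) = max(core(a), core(b), d(a, b))
mrd  <- pmax(d, outer(core, core, pmax))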

References

Hahsler M, Piekenbrock M, Doran D (2019). dbscan: Fast Density-Based Clustering with R. Journal of Statistical Software, 91(1), 1-30. doi:10.18637/jss.v091.i01

Campello RJGB, Moulavi D, Sander J (2013). Density-Based Clustering Based on Hierarchical Density Estimates. Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery in Databases, PAKDD 2013, Lecture Notes in Computer Science 7819, p. 160. doi:10.1007/978-3-642-37456-2_14

Campello RJGB, Moulavi D, Zimek A, Sander J (2015). Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 10(5):1-51. doi:10.1145/2733381

See Also

Other clustering functions: dbscan(), extractFOSC(), jpclust(), optics(), predict(), sNNclust()

Examples

## cluster the moons data set with HDBSCAN
data(moons)

res <- hdbscan(moons, minPts = 5)
res

plot(res)
plot(moons, col = res$cluster + 1L)

## cluster the moons data set with HDBSCAN using Manhattan distances
res <- hdbscan(dist(moons, method = "manhattan"), minPts = 5)
plot(res)
plot(moons, col = res$cluster + 1L)

## DS3 from Chameleon
data("DS3")

res <- hdbscan(DS3, minPts = 50)
res

## Plot the simplified tree, highlight the most stable clusters
plot(res, show_flat = TRUE)

## Plot the actual clusters (noise has cluster id 0 and is shown in black)
plot(DS3, col = res$cluster + 1L, cex = .5)

## Predict cluster for new data points.
newdata <- data.frame(
  X = runif(n = 10, min(DS3[, 1]), max(DS3[, 1])),
  Y = runif(n = 10, min(DS3[, 2]), max(DS3[, 2]))
  )

plot(DS3, col = res$cluster + 1L, pch = ".")
points(newdata, col = "red", pch = 4, lwd = 2)

pred_label <- predict(res, newdata, data = DS3)
pred_label

points(newdata, col = pred_label + 1L, pch = 1, lwd = 2, cex = 2)
