dbscan (version 0.9-2)

dbscan: DBSCAN

Description

Fast reimplementation of the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm using a kd-tree. The implementation is significantly faster and can handle larger data sets than dbscan in fpc.

Usage

dbscan(x, eps, minPts = 5, borderPoints = TRUE,
    search = "kdtree", bucketSize = 10,
    splitRule = "suggest", approx = 0)

Arguments

x: a data matrix.
eps: size of the eps-neighborhood.
minPts: minimum number of points required in the eps-neighborhood for a point to be a core point (default 5).
borderPoints: logical; should border points be assigned to clusters? If FALSE, all border points are treated as noise (DBSCAN*, see Details).
search: nearest neighbor search strategy, "kdtree" or "linear".
bucketSize: maximum size of the kd-tree leaves.
splitRule: rule used to split the kd-tree.
approx: relative error bound for approximate nearest neighbor searching (0 means exact search).

Value

An object of class 'dbscan' with the following components:

eps: value of the eps parameter.
minPts: value of the minPts parameter.
cluster: an integer vector with cluster assignments. Zero indicates noise points.
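
For example, the returned components can be inspected like this (a small sketch, reusing the iris setup from the Examples section):

library(dbscan)
res <- dbscan(as.matrix(iris[, 1:4]), eps = .4, minPts = 4)
res$eps; res$minPts    # the parameters used for the run
table(res$cluster)     # cluster sizes; cluster 0 collects the noise points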

Details

This implementation of DBSCAN follows the original algorithm as described by Ester et al. (1996). DBSCAN estimates the density around each data point by counting the number of points in a user-specified eps-neighborhood and applies a user-specified minPts threshold to identify core, border and noise points. In a second step, core points are joined into a cluster if they are density-reachable (i.e., there is a chain of core points where one falls inside the eps-neighborhood of the next). Finally, border points are assigned to clusters.
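
As an illustration of this classification (a minimal base-R sketch using a full distance matrix, not the package's kd-tree; counting a point as its own neighbor is an assumption here):

x <- as.matrix(iris[, 1:4])
eps <- 0.4; minPts <- 4
d <- as.matrix(dist(x))
## points per eps-neighborhood (here: including the point itself)
n_eps <- rowSums(d <= eps)
core <- n_eps >= minPts
## border points: non-core points with a core point in their eps-neighborhood
border <- !core & sapply(seq_len(nrow(x)), function(i) any(core[d[i, ] <= eps]))
noise <- !core & !border
table(core = core, border = border)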

Border points are arbitrarily assigned to clusters in the original algorithm. DBSCAN* (see Campello et al. 2013) treats all border points as noise points. This is implemented with borderPoints = FALSE.
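
For example, the effect of the two conventions can be compared like this (a sketch; the iris data and parameter values are taken from the Examples section):

library(dbscan)
x <- as.matrix(iris[, 1:4])
res <- dbscan(x, eps = .4, minPts = 4)                             # original DBSCAN
res_star <- dbscan(x, eps = .4, minPts = 4, borderPoints = FALSE)  # DBSCAN*
## DBSCAN* declares all border points noise, so its noise count is at least as large
c(noise = sum(res$cluster == 0), noise_star = sum(res_star$cluster == 0))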

kNNdistplot can be used to find suitable values for eps.
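
For example (a sketch; k = 4 and the eps guide line at .4 follow the Examples section and are not prescriptive):

library(dbscan)
x <- as.matrix(iris[, 1:4])
kNNdistplot(x, k = 4)    # sorted distances to each point's 4th nearest neighbor
abline(h = .4, lty = 2)  # a candidate eps at the knee of the curve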

References

Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).

Campello, R. J. G. B.; Moulavi, D.; Sander, J. (2013). Density-Based Clustering Based on Hierarchical Density Estimates. Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery in Databases, PAKDD 2013, Lecture Notes in Computer Science 7819, p. 160.

See Also

kNNdistplot, dbscan in fpc.

Examples

library(dbscan)

data(iris)
iris <- as.matrix(iris[, 1:4])

res <- dbscan(iris, eps = .4, minPts = 4)
res

pairs(iris, col = res$cluster + 1)

## compare with dbscan from package fpc (only if installed)
if (requireNamespace("fpc", quietly = TRUE)) {
  res2 <- fpc::dbscan(iris, .4, 4)
  pairs(iris, col = res2$cluster + 1)

  ## make sure both versions produce the same results
  all(res$cluster == res2$cluster)
}

## find a suitable eps parameter (look for the knee in the plot)
kNNdistplot(iris, k=4)


## example data from fpc: 10 Gaussian clusters whose centers are drawn
## uniformly from [0, 10] (the 10 runif() centers are recycled over the n points)
set.seed(665544)
n <- 600
x <- cbind(
  x = runif(10, 0, 10) + rnorm(n, sd = 0.2),
  y = runif(10, 0, 10) + rnorm(n, sd = 0.2)
)

res <- dbscan::dbscan(x, .2, 4)
res

plot(x, col=res$cluster+1)

## compare speed against fpc version (only if fpc and microbenchmark are installed)
if (requireNamespace("fpc", quietly = TRUE) &&
    requireNamespace("microbenchmark", quietly = TRUE)) {
  t_dbscan <- microbenchmark::microbenchmark(
    dbscan::dbscan(x, .2, 4), times = 10, unit = "ms")
  t_dbscan_linear <- microbenchmark::microbenchmark(
    dbscan::dbscan(x, .2, 4, search = "linear"), times = 10, unit = "ms")
  t_fpc <- microbenchmark::microbenchmark(
    fpc::dbscan(x, .2, 4), times = 10, unit = "ms")

  boxplot(rbind(t_fpc, t_dbscan_linear, t_dbscan),
    names=c("fpc (R)", "dbscan (linear)", "dbscan (kdtree)"),
    main = "Runtime comparison in ms")

  ## speedup of the kd-tree-based version compared to the fpc implementation
  median(t_fpc$time)/median(t_dbscan$time)
}
