dbscan

DBSCAN density reachability and connectivity clustering

Generates a density-based clustering of arbitrary shape, as introduced in Ester et al. (1996).

Keywords
multivariate, cluster
Usage
dbscan(data, eps, MinPts = 5, scale = FALSE,
    method = c("hybrid", "raw", "dist"), seeds = TRUE,
    showplot = FALSE, countmode = NULL)
# S3 method for dbscan
print(x, ...)
# S3 method for dbscan
plot(x, data, ...)
# S3 method for dbscan
predict(object, data, newdata = NULL, predict.max = 1000, ...)
Arguments
data

Data matrix, data.frame, dissimilarity matrix or dist object. Specify method="dist" if the data should be interpreted as a dissimilarity matrix or dist object. Otherwise Euclidean distances are used.

eps

Reachability distance, see Ester et al. (1996).

MinPts

Minimum number of points required for reachability, see Ester et al. (1996).

scale

Scale the data if TRUE.

method

"dist" treats data as a distance matrix (relatively fast but memory expensive). "raw" treats data as raw data and avoids calculating a distance matrix (saves memory but may be slow). "hybrid" also expects raw data, but calculates partial distance matrices (very fast with moderate memory requirements).

seeds

Set to FALSE to omit the isseed vector from the returned dbscan object.

showplot

0 = no plot, 1 = plot per iteration, 2 = plot per subiteration.

countmode

NULL or vector of point numbers at which to report progress.

x

object of class dbscan.

object

object of class dbscan.

newdata

matrix or data.frame with raw data to predict.

predict.max

Maximum batch size for predictions.

...

Further arguments transferred to plot methods.

Details

A cluster requires a minimum number of points (MinPts) within a maximum distance (eps) around one of its members (the seed). Any point within eps of a point that satisfies the seed condition is a cluster member; this rule is applied recursively. Points that belong to no cluster are noise.
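
The seed condition above can be sketched in a few lines of base R. This is an illustration only, not the code used by the package: classify_points is a hypothetical helper, and it assumes that a point counts itself among its eps-neighbours and that distances are Euclidean.

```r
# Illustrative sketch only (hypothetical helper, not part of fpc):
# label each point as "seed", "border" or "noise".
# Assumption: a point counts itself among its eps-neighbours.
classify_points <- function(x, eps, MinPts = 5) {
  d <- as.matrix(dist(x))                   # full Euclidean distance matrix
  within_eps <- d <= eps                    # neighbourhood indicator (includes self)
  is_seed <- rowSums(within_eps) >= MinPts  # seed condition
  # Border: not a seed itself, but within eps of at least one seed.
  near_seed <- rowSums(within_eps[, is_seed, drop = FALSE]) > 0
  unname(ifelse(is_seed, "seed", ifelse(near_seed, "border", "noise")))
}

# Five points on a line: three tightly packed, one nearby, one far away.
x <- cbind(c(0, 0.1, 0.2, 0.45, 5), 0)
res <- classify_points(x, eps = 0.3, MinPts = 3)
res
```

Here the first three points are seeds, the fourth is a border point (it lies within eps of a seed but has too few neighbours itself), and the last is noise.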

We have clustered a 100,000 x 2 dataset in 40 minutes on a Pentium M at 1600 MHz.

print.dbscan shows, for each cluster, how many of its points are seeds and how many are border points.

plot.dbscan distinguishes between seed and border points by plot symbol.

Value

predict.dbscan returns a vector of predicted clusters for the points in newdata.
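
The rule by which new points are assigned is not spelled out on this page. One plausible rule, shown here purely as an assumption, gives each new point the cluster of its nearest seed if that seed lies within eps, and 0 (noise) otherwise. predict_nearest_seed and its arguments are hypothetical names, not part of the package.

```r
# Hypothetical helper, not part of fpc: nearest-seed assignment.
# Assumption: a new point joins the cluster of its nearest seed
# if that seed is within eps, and is labelled 0 (noise) otherwise.
predict_nearest_seed <- function(seed_points, seed_cluster, newdata, eps) {
  apply(newdata, 1, function(p) {
    d <- sqrt(colSums((t(seed_points) - p)^2))  # distance from p to every seed
    nearest <- which.min(d)
    if (d[nearest] <= eps) seed_cluster[nearest] else 0L
  })
}

seed_points  <- rbind(c(0, 0), c(0.1, 0), c(5, 5))
seed_cluster <- c(1L, 1L, 2L)
newdata      <- rbind(c(0.05, 0.1), c(5.1, 5), c(2, 2))
pred <- predict_nearest_seed(seed_points, seed_cluster, newdata, eps = 0.3)
pred
```

The third new point is far from every seed, so under this rule it is labelled 0.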

dbscan returns an object of class 'dbscan', which is a list with components

cluster

integer vector coding cluster membership with noise observations (singletons) coded as 0

isseed

logical vector indicating whether a point is a seed (not border, not noise)

eps

parameter eps

MinPts

parameter MinPts
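
As a rough illustration of how these components fit together, the following base R sketch tabulates seed, border and noise points per cluster. The object is a hand-made mock with made-up values that only mimics the documented components; it is not the output of a real dbscan call.

```r
# Mock object with the documented components (all values are made up).
ds_mock <- list(
  cluster = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 0L),
  isseed  = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE),
  eps     = 0.3,
  MinPts  = 3
)

# A point is a seed if isseed is TRUE, noise if its cluster is 0,
# and a border point otherwise.
role <- ifelse(ds_mock$isseed, "seed",
               ifelse(ds_mock$cluster == 0L, "noise", "border"))
counts <- table(role = role, cluster = ds_mock$cluster)
counts
```

Note that this only works if the object was created with seeds = TRUE, since otherwise the isseed component is absent.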

Note

This is a simplified version of the original algorithm (no K-D-trees are used), so the running time is \(O(n^2)\) instead of \(O(n \log n)\).
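
To give a feeling for the quadratic growth, the following base R arithmetic counts the pairwise distances among n points and estimates the size of a full dist object, which stores one double (8 bytes) per pair. n_pairs is a hypothetical helper name, and the estimate ignores any object overhead.

```r
# Number of distinct pairs among n points (the length of a dist object).
n_pairs <- function(n) n * (n - 1) / 2

# Sanity check against base R for a small case.
length(dist(matrix(0, nrow = 10, ncol = 2))) == n_pairs(10)

n <- 1e5                 # e.g. a 100,000-row dataset
pairs <- n_pairs(n)      # 4,999,950,000 pairwise distances
bytes <- pairs * 8       # one 8-byte double per distance
bytes / 1e9              # roughly 40 GB
```

This is why a full distance matrix is impractical for large n, and why avoiding or partitioning it can be preferable.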

References

Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).

Aliases
  • dbscan
  • print.dbscan
  • plot.dbscan
  • predict.dbscan
Examples
# NOT RUN {
  library(fpc)  # provides dbscan and its print, plot and predict methods

  set.seed(665544)
  n <- 600
  # 10 random cluster centres, recycled across the n points, plus Gaussian noise
  x <- cbind(runif(10, 0, 10) + rnorm(n, sd = 0.2),
             runif(10, 0, 10) + rnorm(n, sd = 0.2))
  par(bg = "grey40")
  ds <- dbscan(x, 0.2)
  # run with showplot=1 to see how dbscan works.
  ds
  plot(ds, x)

  # predict cluster membership for four new points
  x2 <- rbind(c(5, 2), c(8, 3), c(4, 4), c(9, 9))
  predict(ds, x, x2)

  # three cluster centres at (1,1), (2,2), (3,3), recycled across the n points
  n <- 600
  x <- cbind((1:3) + rnorm(n, sd = 0.2), (1:3) + rnorm(n, sd = 0.2))

  # Not run, but results from my machine are 0.105 - 0.068 - 0.255:
  #  system.time(ds <- dbscan(x, 0.3, countmode = NULL, method = "raw"))[3]
  #  system.time(dsb <- dbscan(x, 0.3, countmode = NULL, method = "hybrid"))[3]
  #  system.time(dsc <- dbscan(dist(x), 0.3, countmode = NULL,
  #    method = "dist"))[3]
# }
Documentation reproduced from package fpc, version 2.2-3, License: GPL
