dbscan: DBSCAN density reachability and connectivity clustering

Description

Generates a density-based clustering of arbitrary shape, as introduced in Ester et al. (1996).

Usage

dbscan(data, eps, MinPts = 5, scale = FALSE, method = c("hybrid", "raw",
    "dist"), seeds = TRUE, showplot = FALSE, countmode = NULL)
## S3 method for class 'dbscan'
print(x, ...)
## S3 method for class 'dbscan'
plot(x, data, ...)
## S3 method for class 'dbscan'
predict(object, data, newdata = NULL, predict.max = 1000, ...)

Arguments

data

data matrix, data.frame, dissimilarity matrix or dist-object

eps

Reachability distance: the maximum distance at which a point counts as a neighbour of another point

MinPts

Reachability minimum number of points: how many points must lie within eps of a point for it to be a seed

scale

whether to scale the data before clustering

method

"dist" treats data as a distance matrix (relatively fast but memory expensive); "raw" treats data as raw data and avoids calculating a distance matrix (saves memory but may be slow); "hybrid" also expects raw data, but calculates partial distance matrices

seeds

set to FALSE to omit the isseed vector from the dbscan object

showplot

0 = no plot, 1 = plot per iteration, 2 = plot per subiteration

countmode

NULL or vector of point numbers at which to report progress

x

object of class dbscan.

object

object of class dbscan.

newdata

matrix or data.frame with raw data to predict

predict.max

maximum batch size for predictions

...

Further arguments transferred to plot methods.

Value

dbscan returns an object of class 'dbscan', which is a list with the following components:

cluster

integer vector coding cluster membership, with noise observations (singletons) coded as 0

isseed

logical vector indicating whether a point is a seed (not border, not noise)

eps

parameter eps

MinPts

parameter MinPts

predict.dbscan returns a vector of predicted clusters for the points in newdata.
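
The components of the returned object can be read with the usual list accessors. The following is a minimal sketch, not part of the original help page. It assumes that this dbscan() is the one provided by the fpc package and that the package is installed; the simulated data are made up for illustration.

```r
# Assumption: dbscan() documented here comes from the fpc package.
if (requireNamespace("fpc", quietly = TRUE)) {
  set.seed(1)
  # Two illustrative point clouds, centred at (0, 0) and (3, 3)
  x <- rbind(matrix(rnorm(200, mean = 0, sd = 0.2), ncol = 2),
             matrix(rnorm(200, mean = 3, sd = 0.2), ncol = 2))
  ds <- fpc::dbscan(x, eps = 0.3, MinPts = 5)

  print(table(ds$cluster))   # points per cluster; label 0 is noise
  print(sum(ds$isseed))      # number of seed points
  print(c(ds$eps, ds$MinPts)) # the parameters stored with the result
}
```

Calling dbscan() with seeds = FALSE would omit the isseed component, so code that reads ds$isseed should keep the default.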

Details

A cluster requires a minimum number of points (MinPts) within a maximum distance (eps) around one of its members (the seed). Any point within eps of a point that satisfies the seed condition is a cluster member (recursively). Some points may not belong to any cluster (noise). We have clustered a 100,000 x 2 dataset in 40 minutes on a Pentium M 1600 MHz.
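
The rule described above can be sketched in a few lines of base R. This is an illustration only, not the implementation used by dbscan(): the function name simple_dbscan is hypothetical, it always builds a full distance matrix, and it assumes that a point counts as its own neighbour when the MinPts condition is checked (as in Ester et al.).

```r
# Hypothetical helper: a naive illustration of the seed / border / noise rule.
simple_dbscan <- function(x, eps, MinPts = 5) {
  d <- as.matrix(dist(x))                 # full pairwise Euclidean distances
  n <- nrow(d)
  # eps-neighbourhood of every point (assumption: includes the point itself)
  neighbours <- lapply(seq_len(n), function(i) which(d[i, ] <= eps))
  isseed <- lengths(neighbours) >= MinPts # seed condition
  cluster <- integer(n)                   # 0 = noise or not yet assigned
  current <- 0L
  for (i in which(isseed)) {
    if (cluster[i] != 0L) next            # seed already absorbed by a cluster
    current <- current + 1L
    cluster[i] <- current
    queue <- i                            # the queue only ever holds seeds
    while (length(queue) > 0) {
      p <- queue[1]
      queue <- queue[-1]
      reach <- neighbours[[p]]
      new <- reach[cluster[reach] == 0L]  # unassigned points within eps of a seed
      cluster[new] <- current
      queue <- c(queue, new[isseed[new]]) # only seeds spread the cluster further
    }
  }
  list(cluster = cluster, isseed = isseed, eps = eps, MinPts = MinPts)
}

# Two 3 x 3 grids with spacing 0.1, plus one isolated point
g <- as.matrix(expand.grid(0:2, 0:2)) * 0.1
pts <- rbind(g, g + 5, c(2.5, 2.5))
res <- simple_dbscan(pts, eps = 0.15, MinPts = 5)
table(res$cluster, res$isseed)
```

In this toy data set the grid corners have only four points within eps, so they are border points: they join a cluster but are not seeds. The isolated point has no neighbours and stays noise (cluster 0).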

print.dbscan prints a summary of how many points in each cluster are seed points and how many are border points.

plot.dbscan distinguishes between seed and border points by plot symbol.

References

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).

Examples

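library(fpc)  # assumption: this dbscan() is the one provided by the fpc package

# 600 points scattered (sd = 0.2) around 10 random centres;
# runif(10, ...) is recycled over the n noise terms
set.seed(665544)
n <- 600
x <- cbind(runif(10, 0, 10) + rnorm(n, sd = 0.2),
           runif(10, 0, 10) + rnorm(n, sd = 0.2))
par(bg = "grey40")
ds <- dbscan(x, 0.2, showplot = TRUE)
ds           # print method: seed and border counts per cluster
plot(ds, x)  # seed and border points get different plot symbols

# Predict cluster membership for four new points
x2 <- matrix(0, nrow = 4, ncol = 2)
x2[1, ] <- c(5, 2)
x2[2, ] <- c(8, 3)
x2[3, ] <- c(4, 4)
x2[4, ] <- c(9, 9)
predict(ds, x, x2)

# Compare the elapsed time of the three methods on a smaller problem
n <- 600
x <- cbind((1:3) + rnorm(n, sd = 0.2), (1:3) + rnorm(n, sd = 0.2))

system.time(ds <- dbscan(x, 0.3, countmode = NULL, method = "raw"))[3]
system.time(dsb <- dbscan(x, 0.3, countmode = NULL, method = "hybrid"))[3]
system.time(dsc <- dbscan(dist(x), 0.3, countmode = NULL,
                          method = "dist"))[3]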