fpc (version 1.2-1)

dbscan: Clustering: DBSCAN density reachability and connectivity

Description

Generates a density based clustering of arbitrary shape as introduced in Ester et al. (1996).

Usage

dbscan(data, eps, MinPts = 5, scale = FALSE,
       method = c("hybrid", "raw", "dist"),
       seeds = TRUE, showplot = FALSE, countmode = NULL)

## S3 method for class 'dbscan':
print(x, ...)

## S3 method for class 'dbscan':
plot(x, data, ...)

## S3 method for class 'dbscan':
predict(object, data, newdata = NULL, predict.max = 1000, ...)

Arguments

data
data matrix, data.frame, dissimilarity matrix or dist-object
eps
Reachability Distance
MinPts
Reachability minimum no. of points
scale
scale the data
method
"dist" treats data as distance matrix (relatively fast but memory expensive), "raw" treats data as raw data and avoids calculating a distance matrix (saves memory but may be slow), "hybrid" expects also raw data, but calculates partial distanc
seeds
FALSE to not include the isseed vector in the dbscan object
showplot
0 = no plot, 1 = plot per iteration, 2 = plot per subiteration
countmode
NULL or vector of point numbers at which to report progress
x
object of class dbscan.
object
object of class dbscan.
newdata
matrix or data.frame with raw data to predict
predict.max
max. batch size for predictions
...
Further arguments transferred to plot methods.

Value

predict.dbscan returns a vector of predicted clusters for the points in newdata. dbscan returns an object of class 'dbscan', which is a list with components:

  • cluster: integer vector coding cluster membership, with noise observations (singletons) coded as 0
  • isseed: logical vector indicating whether a point is a seed (not border, not noise)
  • eps: parameter eps
  • MinPts: parameter MinPts
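As a minimal sketch of how these components might be inspected, the snippet below clusters a small synthetic dataset and tabulates the result. It assumes the fpc package is installed; the data and the parameter values are made up for illustration, and the call is written as fpc::dbscan only to avoid a clash with other packages that export a function of the same name.

```r
library(fpc)  # assumption: the fpc package is installed

set.seed(42)
# Illustrative data: two well-separated groups of 50 points each.
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.2), ncol = 2)
)

ds <- fpc::dbscan(x, eps = 0.5, MinPts = 5)

# Cluster sizes; a label of 0 denotes noise.
table(ds$cluster)

# Cross-tabulate cluster label against seed status.
table(cluster = ds$cluster, seed = ds$isseed)

# The parameters are stored alongside the result.
ds$eps
ds$MinPts
```

The isseed component is only present when seeds = TRUE, which is the default.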

Details

A cluster requires a minimum number of points (MinPts) within a maximum distance (eps) around one of its members (the seed). Any point within eps of a point that satisfies the seed condition is a cluster member (recursively). Some points may not belong to any cluster (noise). We clustered a 100,000 x 2 dataset in 40 minutes on a Pentium M 1600 MHz.
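The seed condition can be illustrated with a short base-R sketch that labels each point as seed, border, or noise. This is not the fpc implementation: classify_points is a hypothetical helper, it computes all pairwise distances by brute force, and it assumes that a point counts as a member of its own eps-neighbourhood. It also stops short of the recursive step that joins seeds into clusters.

```r
# Hypothetical helper, for illustration only (not part of fpc).
# Labels each row of x as "seed", "border", or "noise".
classify_points <- function(x, eps, MinPts = 5) {
  d <- as.matrix(dist(x))       # full pairwise Euclidean distances
  neighbours <- d <= eps        # assumption: each point neighbours itself
  # Seed: at least MinPts points within eps.
  is_seed <- rowSums(neighbours) >= MinPts
  # Border: not a seed, but within eps of at least one seed.
  near_seed <- rowSums(neighbours[, is_seed, drop = FALSE]) > 0
  unname(ifelse(is_seed, "seed", ifelse(near_seed, "border", "noise")))
}

set.seed(1)
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.1), ncol = 2),  # dense blob of 20 points
  c(5, 5)                                           # one isolated point
)
labels <- classify_points(x, eps = 0.5, MinPts = 5)
table(labels)
```

The isolated point has no other point within eps, so it fails the seed condition and is not near any seed; it is labelled noise.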

print.dbscan shows, for each cluster, the number of points that are seeds and the number that are border points.

plot.dbscan distinguishes between seed and border points by plot symbol.

References

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).

Examples

# Simulate 600 points scattered around 10 random centres
# (the 10 runif() values are recycled across the n rnorm() draws).
set.seed(665544)
n <- 600
x <- cbind(runif(10, 0, 10) + rnorm(n, sd = 0.2),
           runif(10, 0, 10) + rnorm(n, sd = 0.2))
par(bg = "grey40")
ds <- dbscan(x, 0.2, showplot = TRUE)
ds
plot(ds, x)

# Predict cluster membership for four new points.
x2 <- matrix(0, nrow = 4, ncol = 2)
x2[1, ] <- c(5, 2)
x2[2, ] <- c(8, 3)
x2[3, ] <- c(4, 4)
x2[4, ] <- c(9, 9)
predict(ds, x, x2)

# Compare elapsed time of the three methods on the same data.
n <- 600
x <- cbind((1:3) + rnorm(n, sd = 0.2), (1:3) + rnorm(n, sd = 0.2))

system.time(ds <- dbscan(x, 0.3, countmode = NULL, method = "raw"))[3]
system.time(dsb <- dbscan(x, 0.3, countmode = NULL, method = "hybrid"))[3]
system.time(dsc <- dbscan(dist(x), 0.3, countmode = NULL,
                          method = "dist"))[3]