# dbscan

##### DBSCAN density reachability and connectivity clustering

Generates a density based clustering of arbitrary shape as introduced in Ester et al. (1996).

- Keywords
- multivariate, cluster

##### Usage

```
dbscan(data, eps, MinPts = 5, scale = FALSE, method = c("hybrid", "raw",
"dist"), seeds = TRUE, showplot = FALSE, countmode = NULL)
# S3 method for dbscan
print(x, ...)
# S3 method for dbscan
plot(x, data, ...)
# S3 method for dbscan
predict(object, data, newdata = NULL,
predict.max=1000, ...)
```

##### Arguments

- data
data matrix, data.frame, dissimilarity matrix or

`dist`

-object. Specify`method="dist"`

if the data should be interpreted as dissimilarity matrix or object. Otherwise Euclidean distances will be used.- eps
Reachability distance, see Ester et al. (1996).

- MinPts
Reachability minimum no. of points, see Ester et al. (1996).

- scale
scale the data if

`TRUE`

.- method
"dist" treats data as distance matrix (relatively fast but memory expensive), "raw" treats data as raw data and avoids calculating a distance matrix (saves memory but may be slow), "hybrid" expects also raw data, but calculates partial distance matrices (very fast with moderate memory requirements).

- seeds
FALSE to not include the

`isseed`

-vector in the`dbscan`

-object.- showplot
0 = no plot, 1 = plot per iteration, 2 = plot per subiteration.

- countmode
NULL or vector of point numbers at which to report progress.

- x
object of class

`dbscan`

.- object
object of class

`dbscan`

.- newdata
matrix or data.frame with raw data to predict.

- predict.max
max. batch size for predictions.

- ...
Further arguments transferred to plot methods.

##### Details

Clusters require a minimum no of points (MinPts) within a maximum distance (eps) around one of its members (the seed). Any point within eps around any point which satisfies the seed condition is a cluster member (recursively). Some points may not belong to any clusters (noise).

We have clustered a 100.000 x 2 dataset in 40 minutes on a Pentium M 1600 MHz.

`print.dbscan`

shows a statistic of the number of points
belonging to the clusters that are seeds and border points.

`plot.dbscan`

distinguishes between seed and border points by
plot symbol.

##### Value

`predict.dbscan`

gives out a vector of predicted clusters for the
points in `newdata`

.

`dbscan`

gives out
an object of class 'dbscan' which is a LIST with components

integer vector coding cluster membership with noise observations (singletons) coded as 0

logical vector indicating whether a point is a seed (not border, not noise)

parameter eps

parameter MinPts

##### Note

this is a simplified version of the original algorithm (no K-D-trees used), thus we have \(o(n^2)\) instead of \(o(n*log(n))\)

##### References

Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).

##### Examples

```
# NOT RUN {
set.seed(665544)
n <- 600
x <- cbind(runif(10, 0, 10)+rnorm(n, sd=0.2), runif(10, 0, 10)+rnorm(n,
sd=0.2))
par(bg="grey40")
ds <- dbscan(x, 0.2)
# run with showplot=1 to see how dbscan works.
ds
plot(ds, x)
x2 <- matrix(0,nrow=4,ncol=2)
x2[1,] <- c(5,2)
x2[2,] <- c(8,3)
x2[3,] <- c(4,4)
x2[4,] <- c(9,9)
predict(ds, x, x2)
n <- 600
x <- cbind((1:3)+rnorm(n, sd=0.2), (1:3)+rnorm(n, sd=0.2))
# Not run, but results from my machine are 0.105 - 0.068 - 0.255:
# system.time(ds <- dbscan(x, 0.3, countmode=NULL, method="raw"))[3]
# system.time(dsb <- dbscan(x, 0.3, countmode=NULL, method="hybrid"))[3]
# system.time(dsc <- dbscan(dist(x), 0.3, countmode=NULL,
# method="dist"))[3]
# }
```

*Documentation reproduced from package fpc, version 2.2-3, License: GPL*