
Generates a density-based clustering of arbitrary shape, as introduced in Ester et al. (1996).
dbscan(data, eps, MinPts = 5, scale = FALSE, method = c("hybrid", "raw",
"dist"), seeds = TRUE, showplot = FALSE, countmode = NULL)
# S3 method for dbscan
print(x, ...)
# S3 method for dbscan
plot(x, data, ...)
# S3 method for dbscan
predict(object, data, newdata = NULL,
predict.max=1000, ...)
data: data matrix, data.frame, dissimilarity matrix or dist object. Specify method="dist" if the data should be interpreted as a dissimilarity matrix or dist object; otherwise Euclidean distances will be used.
eps: Reachability distance, see Ester et al. (1996).
MinPts: Reachability minimum number of points, see Ester et al. (1996).
scale: Scale the data if TRUE.
method: "dist" treats data as a distance matrix (relatively fast but memory expensive); "raw" treats data as raw data and avoids calculating a distance matrix (saves memory but may be slow); "hybrid" also expects raw data, but calculates partial distance matrices (very fast with moderate memory requirements).
seeds: FALSE to not include the isseed vector in the dbscan object.
showplot: 0 = no plot, 1 = plot per iteration, 2 = plot per subiteration.
countmode: NULL or vector of point numbers at which to report progress.
x: object of class dbscan.
object: object of class dbscan.
newdata: matrix or data.frame with raw data to predict.
predict.max: maximum batch size for predictions.
...: Further arguments transferred to plot methods.
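The memory trade-off between the "dist" and "raw" settings of method can be made concrete with a small base-R sketch. It does not call the package; dist_gib and neighbours_raw are hypothetical helper names introduced only for illustration, and the on-demand distance computation is just one plausible way to avoid a full distance matrix, not necessarily what the package does internally.

```r
# Storage needed by a full "dist" object: n * (n - 1) / 2 doubles of 8 bytes each.
dist_gib <- function(n) n * (n - 1) / 2 * 8 / 2^30

dist_gib(600)      # negligible for a few hundred points
dist_gib(100000)   # tens of GiB for 100,000 points

# "raw"-style alternative (hypothetical helper): distances from point i to all
# rows are computed on demand, so only one vector of length n is held at a time.
neighbours_raw <- function(x, i, eps) {
  d <- sqrt(rowSums(sweep(x, 2, x[i, ])^2))
  which(d <= eps)
}

set.seed(1)
x <- matrix(runif(40), ncol = 2)
eps <- 0.3

# The on-demand lookup finds the same neighbours as row 3 of the full matrix.
from_matrix <- unname(which(as.matrix(dist(x))[3, ] <= eps))
same <- identical(neighbours_raw(x, 3, eps), from_matrix)
same
```

The full matrix makes every neighbour lookup a simple row read, while the on-demand version recomputes distances each time it is asked, which is the speed-versus-memory trade-off the method descriptions refer to.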
predict.dbscan returns a vector of predicted clusters for the points in newdata.
dbscan returns an object of class 'dbscan', which is a list with components:
cluster: integer vector coding cluster membership, with noise observations (singletons) coded as 0
isseed: logical vector indicating whether a point is a seed (not border, not noise)
eps: parameter eps
MinPts: parameter MinPts
A cluster requires a minimum number of points (MinPts) within a maximum distance (eps) around one of its members (the seed). Any point within eps of a point that satisfies the seed condition is a cluster member (recursively). Some points may not belong to any cluster (noise).
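The seed condition and the recursive expansion can be sketched in a few lines of base R. This is a minimal illustration, not the package's implementation: dbscan_sketch is a hypothetical name, it uses a full distance matrix for simplicity, and it assumes that a point counts as its own neighbour when checking MinPts.

```r
dbscan_sketch <- function(x, eps, MinPts = 5) {
  d <- as.matrix(dist(x))                  # full Euclidean distance matrix
  n <- nrow(d)
  neighbours <- lapply(seq_len(n), function(i) which(d[i, ] <= eps))
  # Assumption: each point is included in its own neighbourhood count.
  isseed <- lengths(neighbours) >= MinPts
  cluster <- integer(n)                    # 0 = noise / not yet assigned
  k <- 0L
  for (i in which(isseed)) {
    if (cluster[i] != 0L) next             # seed already absorbed by a cluster
    k <- k + 1L
    cluster[i] <- k
    queue <- i
    while (length(queue) > 0) {
      p <- queue[1]
      queue <- queue[-1]
      for (q in neighbours[[p]]) {
        if (cluster[q] == 0L) {
          cluster[q] <- k
          # Only seeds expand the cluster further; border points stop here.
          if (isseed[q]) queue <- c(queue, q)
        }
      }
    }
  }
  list(cluster = cluster, isseed = isseed, eps = eps, MinPts = MinPts)
}

# Two tight groups of five points and one isolated point.
x <- rbind(
  cbind(c(0, 0.1, 0, 0.1, 0.05), c(0, 0, 0.1, 0.1, 0.05)),
  cbind(c(5, 5.1, 5, 5.1, 5.05), c(5, 5, 5.1, 5.1, 5.05)),
  c(10, 0)
)
res <- dbscan_sketch(x, eps = 0.2, MinPts = 4)
res$cluster   # 1 1 1 1 1 2 2 2 2 2 0
```

The isolated point has only itself within eps, so it fails the seed condition, is reached by no seed, and keeps the noise code 0.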
We have clustered a 100,000 x 2 dataset in 40 minutes on a Pentium M 1600 MHz.
print.dbscan shows a summary of the number of seed and border points belonging to each cluster.
plot.dbscan distinguishes between seed and border points by plot symbol.
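A seed/border summary of this kind can be approximated directly from the cluster and isseed components. The sketch below uses a hand-made list that mimics those two components rather than a real dbscan result, and the exact layout printed by print.dbscan may differ.

```r
# Hypothetical stand-in for a dbscan result (values invented for illustration).
ds_mock <- list(
  cluster = c(1, 1, 1, 1, 2, 2, 2, 0, 0),
  isseed  = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)
)

# Non-seed points are labelled "border"; in this sketch, noise points
# (cluster 0) therefore fall into the "border" row as well.
role <- ifelse(ds_mock$isseed, "seed", "border")
counts <- table(role, cluster = ds_mock$cluster)

# Append per-cluster totals as an extra row.
summary_tab <- rbind(counts, total = colSums(counts))
summary_tab
```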
Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).
# NOT RUN {
library(fpc)  # package providing this dbscan() interface
set.seed(665544)
n <- 600
x <- cbind(runif(10, 0, 10)+rnorm(n, sd=0.2), runif(10, 0, 10)+rnorm(n,
sd=0.2))
par(bg="grey40")
ds <- dbscan(x, 0.2)
# run with showplot=1 to see how dbscan works.
ds
plot(ds, x)
x2 <- matrix(0,nrow=4,ncol=2)
x2[1,] <- c(5,2)
x2[2,] <- c(8,3)
x2[3,] <- c(4,4)
x2[4,] <- c(9,9)
predict(ds, x, x2)
n <- 600
x <- cbind((1:3)+rnorm(n, sd=0.2), (1:3)+rnorm(n, sd=0.2))
# Not run, but results from my machine are 0.105 - 0.068 - 0.255:
# system.time(ds <- dbscan(x, 0.3, countmode=NULL, method="raw"))[3]
# system.time(dsb <- dbscan(x, 0.3, countmode=NULL, method="hybrid"))[3]
# system.time(dsc <- dbscan(dist(x), 0.3, countmode=NULL,
# method="dist"))[3]
# }