DSC_BICO: BICO - Fast computation of k-means coresets in a data stream

Description

BICO maintains a tree which is inspired by the clustering tree of BIRCH, a SIGMOD Test of Time award-winning clustering algorithm. Each node in the tree represents a subset of these points. Instead of storing all points as individual objects, only the number of points, the sum and the squared sum of the subset's points are stored as key features of each subset. Points are inserted into exactly one node.

Usage

DSC_BICO(k = 5, space = 10, p = 10, iterations = 10)

Arguments

number of centres

space

coreset size

number of random projections used for nearest neighbour search in first level

iterations

number of repetitions for the kmeans++ procedure in the offline component

Details

In this implementation, the nearest neighbour search on the first level of the tree ist sped up by projecting all points to random 1-d subspaces. The first estimation of the optimal clustering cost is computed in a buffer phase at the beginning of the algorithm.

This implementation interfaces the original C++ implementation available here: http://ls2-www.cs.tu-dortmund.de/grav/de/bico. For micro-clustering, the algorithm computes the coreset of the stream. Reclustering is performed by using the kmeans++ algorithm on the coreset.

References

Hendrik Fichtenberger, Marc Gille, Melanie Schmidt, Chris Schwiegelshohn, Christian Sohler: BICO: BIRCH Meets Coresets for k-Means Clustering. ESA 2013: 481-492.

Examples

Run this code

# NOT RUN {
stream <- DSD_Gaussians(k = 3, d = 2)

BICO <- DSC_BICO(k = 3, p = 10, space = 100, iterations = 10)
update(BICO, stream, n = 500)

plot(BICO,stream, type = "both")
# }

Run the code above in your browser using DataLab