DSC_BICO: BICO - Fast computation of k-means coresets in a data stream
Description
Micro Clusterer.
BICO maintains a tree which is inspired by the clustering tree of BIRCH, a
SIGMOD Test of Time award-winning clustering algorithm. Each node in the
tree represents a subset of these points. Instead of storing all points as
individual objects, only the number of points, the sum and the squared sum
of the subset's points are stored as key features of each subset. Points are
inserted into exactly one node.
Usage
DSC_BICO(k = 5, space = 10, p = 10, iterations = 10)
Arguments
k
number of centres
space
coreset size
p
number of random projections used for nearest neighbour search in
first level
iterations
number of repetitions for the kmeans++ procedure in the
offline component
Author
R-Interface: Matthias Carnein
(Matthias.Carnein@uni-muenster.de), Dennis Assenmacher.
C-Implementation: Hendrik Fichtenberger, Marc Gille, Melanie Schmidt, Chris
Schwiegelshohn, Christian Sohler.
Details
In this implementation, the nearest neighbour search on the first level of
the tree ist sped up by projecting all points to random 1-d subspaces. The
first estimation of the optimal clustering cost is computed in a buffer
phase at the beginning of the algorithm. This implementation interfaces the
original C++ implementation available here:
http://ls2-www.cs.tu-dortmund.de/grav/de/bico. For micro-clustering,
the algorithm computes the coreset of the stream. Reclustering is performed
by using the kmeans++ algorithm on the coreset.
References
Hendrik Fichtenberger, Marc Gille, Melanie Schmidt, Chris
Schwiegelshohn, Christian Sohler: BICO: BIRCH Meets Coresets for k-Means
Clustering. ESA 2013: 481-492.