DSC_BIRCH: Balanced Iterative Reducing Clustering using Hierarchies
Description
BIRCH builds a balanced-tree of Clustering Features (CFs) to summarize the stream.
A CF is a tuple (n, LS, SS) which represents a cluster by storing the number of elements (n), their linear sum (LS) and their squared sum (SS).
Each new observation descends the tree by following its closest CF until a leaf node is reached.
It is either merged into its closest leaf-CF or inserted as a new one.
All leaf-CFs form the micro-clusters. Rebuilding the tree is realized by inserting all leaf-CF nodes into a new tree structure with an increased treshold.
treshold used to check whether a new datapoint can be absorbed or not
branching
branching factor (maximum amount of child nodes for a nonleaf node) of the CF-Tree.
maxLeaf
maximum number of entries within a leaf node
outlierThreshold
threshold for identifying outliers when rebuilding the CF-Tree
maxMem
memory limitation for the whole CFTree in bytes. Default is 0, indicating no memory restriction.
References
Zhang T, Ramakrishnan R and Livny M (1996), "BIRCH: An Efficient Data Clustering Method for Very Large Databases", In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data. Montreal, Quebec, Canada , pp. 103-114. ACM.
Zhang T, Ramakrishnan R and Livny M (1997), "BIRCH: A new data clustering algorithm and its applications", Data Mining and Knowledge Discovery. Vol. 1(2), pp. 141-182.