Ckmeans.1d.dp (version 4.3.0)

Ckmeans.1d.dp-package: Optimal, Fast, and Reproducible Univariate Clustering

Description

This package provides fast optimal univariate clustering by dynamic programming. Four types of problem including univariate \(k\)-means, \(k\)-median, \(k\)-segments, and multi-channel weighted \(k\)-means are solved with guaranteed optimality and reproducibility. The core algorithm minimizes the (weighted) sum of within-cluster distances using respective metrics. Its advantage over heuristic clustering in efficiency and accuracy is increasingly pronounced as the number of clusters \(k\) increases. Weighted \(k\)-means can also optimally segment time series to perform peak calling. An auxiliary function generates histograms that are adaptive to patterns in data. This package provides a powerful set of tools for univariate data analysis with guaranteed optimality, efficiency, and reproducibility.

The Ckmeans.1d.dp algorithm clusters (weighted) univariate data given by a numeric vector \(x\) into \(k\) groups by dynamic programming Wang2011CkmeansCkmeans.1d.dp. It guarantees the optimality of clustering---the total of within-cluster sums of squares is always the minimum given the number of clusters \(k\). In contrast, heuristic univariate clustering algorithms may be non-optimal or inconsistent from run to run. As unequal non-negative weights are supported for each point, the algorithm can also segment a time course using the time points as input and the values at each time point as weight. Utilizing the optimal clusters, a function can generate histograms adaptive to patterns in data. This method is numerically stable.

Apart from the time for sorting \(x\), the weighted univariate clustering algorithm takes a runtime of \(O(kn)\), linear in both sample size \(n\) and the number of clusters \(k\). The new approach was first introduced into version 3.4.9 (not publicly released) on July 16, 2016, using a new divide-and-conquer strategy integrating a previous theoretical result on matrix search aggarwal1987geometricCkmeans.1d.dp and a novel in-place search space reduction method. Earlier versions since 3.4.6 used a divide-and-conquer strategy in dynamic programming to reduce the runtime from \(O(kn^2)\) down to \(O(kn\lg n)\).

Bellman1973;textualCkmeans.1d.dp first described a general dynamic programming strategy for solving univariate clustering problems with additive optimality measures. The strategy, however, did not address any specific characteristics of the \(k\)-means problem and its implied general algorithm will have a time complexity of \(O(kn^3)\) on an input of \(n\) points.

The space complexity in all cases is \(O(kn)\).

This package provides a powerful alternative to heuristic clustering and also new functionality for weighted clustering, segmentation, and peak calling with guaranteed optimality. It is practical for Ckmeans.1d.dp to cluster millions of sample points with optional weights in seconds using a single processor on a not very recent desktop computer.

Third parties have ported various versions of this package to JavaScript, MATLAB, Python, Ruby, SAS, and Scala.

Arguments

Disclaimer

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.

Details

Package: Ckmeans.1d.dp
Type: Package
Version: 4.3.0
Initial release version: 1.0
Initial release date: 2010-10-26

The most important function of this package is Ckmeans.1d.dp that implements optimal univariate clustering either weighted or unweighted. It also contains an adaptive histogram function ahist, plotting functions plot.Ckmeans.1d.dp and plotBIC, and a print function print.Ckmeans.1d.dp.

References

See Also

The kmeans function in package stats that implements several heuristic \(k\)-means algorithms.