Ckmeans.1d.dp v4.3.3

0

Monthly downloads

0th

Percentile

Optimal, Fast, and Reproducible Univariate Clustering

Fast, optimal, and reproducible weighted univariate clustering by dynamic programming. Four problem are solved, including univariate k-means (Wang & Song 2011) <doi:10.32614/RJ-2011-015> (Song & Zhong 2020) <doi:10.1093/bioinformatics/btaa613>, k-median, k-segments, and multi-channel weighted k-means. Dynamic programming is used to minimize the sum of (weighted) within-cluster distances using respective metrics. Its advantage over heuristic clustering in efficiency and accuracy is pronounced at a large number of clusters. Weighted k-means can also process time series to perform peak calling. Multi-channel weighted k-means groups multiple univariate signals into k clusters. An auxiliary function generates histograms adaptive to patterns in data. This package provides a powerful set of tools for univariate data analysis with guaranteed optimality, efficiency, and reproducibility.

Readme

The 'Ckmeans.1d.dp' R package

Project Status: Active – The project has reached a stable, usable state and is being actively developed. CRAN_Status_Badge CRAN_latest_release_date metacran downloads metacran downloads lifecycle

Overview

The package provides a powerful set of tools for fast, optimal, and reproducible univariate clustering by dynamic programming. It is practical to cluster millions of sample points into a few clusters in seconds using a single core on a typical desktop computer. It solves four types of problem, including univariate $k$-means, $k$-median, $k$-segments, and multi-channel weighted $k$-means. Dynamic programming is used to minimize the (weighted) sum of within-cluster distances using respective metrics. Its advantage over heuristic clustering in efficiency and accuracy is increasingly pronounced as the number of clusters $k$ increases. Weighted $k$-means can also optimally segment time series to perform peak calling. An auxiliary function generates histograms that are adaptive to patterns in data. The package is used to identify dysregulated genomic zones in human cancers (Song and Zhong, 2020) <10.1093 bioinformatics="" btaa613="">.

The main method

The Ckmeans.1d.dp algorithm clusters (weighted) univariate data given by a numeric vector $x$ into $k$ groups by dynamic programming (Wang and Song, 2011) (Song and Zhong, 2020) <10.1093 bioinformatics="" btaa613="">. It guarantees the optimality of clustering---the total of within-cluster sums of squares is always the minimum given the number of clusters $k$. In contrast, heuristic univariate clustering algorithms may be non-optimal or inconsistent from run to run. As unequal non-negative weights are supported for each point, the algorithm can also segment a time course using the time points as input and the values at each time point as weight. Utilizing the optimal clusters, a function can generate histograms adaptive to patterns in data.

Excluding the time for sorting $x$, the default weighted univariate clustering algorithm takes a runtime of $O(kn)$ (Song and Zhong, 2020) <10.1093 bioinformatics="" btaa613="">, linear in both sample size $n$ and the number of clusters $k$, using a new divide-and-conquer strategy based on a theoretical result on matrix search (Aggarwal et al., 1987) implemented via a novel in-place search space reduction method (Song and Zhong, 2020) <10.1093 bioinformatics="" btaa613="">. The space complexity is $O(kn)$. This method is numerically stable.

When to use the package

This package provides a powerful alternative to heuristic clustering and also new functionality for weighted clustering, segmentation, and peak calling with guaranteed optimality.

To download and install the package

install.packages("Ckmeans.1d.dp")

Functions in Ckmeans.1d.dp

Name Description
print.Cksegs.1d.dp Print Optimal Univariate Segmentation Results
ahist Adaptive Histograms
Univariate Clustering Optimal (Weighted) Univariate Clustering
print.Ckmeans.1d.dp Print Optimal Univariate Clustering Results
plotBIC Plot Bayesian Information Criterion as a Function of Number of Clusters
plot.Ckmeans.1d.dp Plot Optimal Univariate Clustering Results
plot.Cksegs.1d.dp Plot Optimal Univariate Segmentation Results
Ckmeans.1d.dp-package Optimal, Fast, and Reproducible Univariate Clustering
Univariate Segmentation Optimal Univariate Segmentation
MultiChannel.WUC Optimal Multi-channel Weighted Univariate Clustering
No Results!

Vignettes of Ckmeans.1d.dp

Name
Ckmeans.1d.dp.Rmd
Weights.Rmd
ahist.Rmd
No Results!

Last month downloads

Details

Type Package
Date 2020-07-21
License LGPL (>= 3)
Encoding UTF-8
LazyData true
LinkingTo Rcpp
NeedsCompilation yes
RdMacros Rdpack
VignetteBuilder knitr
Packaged 2020-07-22 03:03:50 UTC; joesong
Repository CRAN
Date/Publication 2020-07-22 13:42:15 UTC

Include our badge in your README

[![Rdoc](http://www.rdocumentation.org/badges/version/Ckmeans.1d.dp)](http://www.rdocumentation.org/packages/Ckmeans.1d.dp)