Ckmeans.1d.dp: Optimal K-means Clustering in One-dimension by Dynamic Programming

Description

Perform optimal k-means clustering on one-dimensional data.

Usage

Ckmeans.1d.dp(x, k=c(1,9))

Arguments

a numeric vector of data to be clustered. The vector cannot contain any NA.

either an exact integer number of clusters, or a vector of two integers specifying the minimum and maximum numbers of clusters. The default is c(1,9). When a range is provided, the actual number of clusters will be determined by Bayesian info

Value

An object of class "Ckmeans.1d.dp" with a print method. It is a list containing the following components:
clustera vector of clusters assigned to each element in x. Each cluster is indexed by an integer from 1 to k.
centersa numeric vector of the means for each cluster.
withinssa numeric vector of the within-cluster sum of squares for each cluster.
sizea vector of the number of points in each cluster.
totsstotal sum of squares of the input numbers.
tot.withinsstotal sum of within-cluster distance squares.
betweenssbetween-cluster sum of squares, equal to the sum of squared cluster means weighed by cluster size.

Details

In contrast to the heuristic k-means algorithms implemented in function kmeans, this function optimally assigns elements in numeric vector x into k clusters by dynamic programming (Wang and Song, 2011). It minimizes the sum of squares of within-cluster distances (withinss) from each element to its corresponding cluster center (mean). When a range is provided for k, the exact number of clusters is determined by Bayesian information criterion. Different from the heuristic k-means algorithms whose results may be non-optimal or change from run to run, the result of Ckmeans.1d.dp is guaranteed to be reproducible and optimal, and its advantage over k-means in efficiency and accuracy is most pronounced at k=2 or large k.

References

Wang, H. and Song, M. (2011) Ckmeans.1d.dp: optimal k-means clustering in one dimension by dynamic programming. The R Journal 3(2), 29--33. Retrieved from http://journal.r-project.org/archive/2011-2/RJournal_2011-2_Wang+Song.pdf

Examples

Run this code

# Ex. 1 The number of clusters is provided.
# Generate data from a Gaussian mixture model of two components
x <- c(rnorm(50, sd=0.3), rnorm(50, mean=1, sd=0.3))
# Divide x into 2 clusters
k <- 2
result <- Ckmeans.1d.dp(x, k)
plot(x, col=result$cluster, pch=result$cluster, cex=1.5,
     main="Optimal k-means clustering given k",
     sub=paste("Number of clusters given:", k))
abline(h=result$centers, col=1:k, lty="dashed", lwd=2)
legend("bottom", paste("Cluster", 1:k), col=1:k, pch=1:k, cex=1.5, bty="n")

# Ex. 2 The number of clusters is determined by Bayesian information criterion
# Generate data from a Gaussian mixture model of two components
x <- c(rnorm(50, mean=-1, sd=0.3), rnorm(50, mean=1, sd=1))
# Divide x into k clusters, k automatically selected (default: 1~9)
result <- Ckmeans.1d.dp(x)
k <- max(result$cluster)
plot(x, col=result$cluster, pch=result$cluster, cex=1.5,
     main="Optimal k-means clustering with k estimated",
     sub=paste("Number of clusters is estimated to be", k))
abline(h=result$centers, col=1:k, lty="dashed", lwd=2)
legend("topleft", paste("Cluster", 1:k), col=1:k, pch=1:k, cex=1.5, bty="n")

Run the code above in your browser using DataLab