Perform k-means clustering on a data matrix.
kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"),
       trace = FALSE)

## S3 method for class 'kmeans'
fitted(object, method = c("centers", "classes"), ...)
- x: numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).
- centers: either the number of clusters, say $k$, or a set of initial (distinct) cluster centres. If a number, a random set of (distinct) rows in x is chosen as the initial centres.
- iter.max: the maximum number of iterations allowed.
- nstart: if centers is a number, how many random sets should be chosen?
- algorithm: character: may be abbreviated. Note that "Lloyd" and "Forgy" are alternative names for one algorithm.
- object: an R object of class "kmeans", typically the result ob of ob <- kmeans(..).
- method: character: may be abbreviated. "centers" causes fitted to return cluster centers (one for each input point) and "classes" causes fitted to return a vector of class assignments.
- trace: logical or integer number, currently only used in the default method ("Hartigan-Wong"): if positive (or true), tracing information on the progress of the algorithm is produced. Higher values may produce more tracing information.
- ...: not used.
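The two forms of centers and the two fitted methods can be illustrated with a minimal sketch. The toy data, the seed, and the object names (fit_k, fit_m, init) are assumptions made for illustration only.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Hypothetical toy data: two well-separated groups of 20 points each
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))

# centers as a number: random rows of x serve as initial centres,
# and nstart = 5 asks for five such random sets
fit_k <- kmeans(x, centers = 2, nstart = 5)

# centers as a matrix: one row per initial centre (illustrative values)
init <- rbind(c(0, 0), c(3, 3))
fit_m <- kmeans(x, centers = init, iter.max = 20, algorithm = "Lloyd")

# The two fitted() methods
f_centers <- fitted(fit_k, method = "centers")  # one centre row per observation
f_classes <- fitted(fit_k, method = "classes")  # vector of class assignments

dim(f_centers)
table(f_classes)
```

With a matrix of initial centres, the cluster numbering follows the row order of that matrix, which can make results easier to compare across runs.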
The data given by
x are clustered by the $k$-means method,
which aims to partition the points into $k$ groups such that the
sum of squares from points to the assigned cluster centres is minimized.
At the minimum, all cluster centres are at the mean of their Voronoi
sets (the set of data points which are nearest to the cluster centre).
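As a rough check of this description, the sketch below recomputes the cluster means and the minimized sum of squares by hand and compares them with the fitted object. The data and the names manual_centres and manual_wss are illustrative assumptions.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical data: two groups of 30 points in two dimensions
x <- rbind(matrix(rnorm(60, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 2, sd = 0.3), ncol = 2))
fit <- kmeans(x, centers = 2, nstart = 10)

# Mean of the points assigned to each cluster:
# per-cluster column sums divided by the cluster sizes
manual_centres <- rowsum(x, fit$cluster) / fit$size

# Sum of squared distances from every point to its assigned centre
manual_wss <- sum((x - fit$centers[fit$cluster, ])^2)

all.equal(unname(manual_centres), unname(fit$centers))
all.equal(manual_wss, fit$tot.withinss)
```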
The algorithm of Hartigan and Wong (1979) is used by default. Note
that some authors use $k$-means to refer to a specific algorithm
rather than the general method: most commonly the algorithm given by
MacQueen (1967) but sometimes that given by Lloyd (1957) and Forgy
(1965). The Hartigan--Wong algorithm generally does a better job than
either of those, but trying several random starts (nstart $> 1$) is often recommended. In rare cases, when some of the points
(rows of x) are extremely close, the algorithm may not converge
in the "Quick-Transfer" stage, signalling a warning (and returning ifault = 4). Slight
rounding of the data may be advisable in that case.
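The sensitivity to the initial centres can be explored with a sketch like the one below. It runs several single-start fits by hand and then uses nstart. The data, the choice of five clusters, and the value iter.max = 50 are assumptions for illustration.

```r
set.seed(123)  # arbitrary seed for reproducibility

# Hypothetical data: two groups, deliberately fitted with too many clusters
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

# Ten independent single-start fits, recording the objective of each
wss <- replicate(10,
                 kmeans(x, centers = 5, nstart = 1, iter.max = 50)$tot.withinss)
range(wss)  # any spread here reflects different local minima

# Letting kmeans() try 25 random sets of initial centres itself
best <- kmeans(x, centers = 5, nstart = 25, iter.max = 50)
best$tot.withinss
```

If the range of wss is wide, a single start is unreliable for that data set and a larger nstart is worth the extra computation.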
For ease of programmatic exploration, $k=1$ is allowed, notably
returning the center and withinss.
Except for the Lloyd--Forgy method, $k$ clusters will always be returned if a number is specified. If an initial matrix of centres is supplied, it is possible that no point will be closest to one or more centres, which is currently an error for the Hartigan--Wong method.
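A small sketch of the $k=1$ case and of the number of clusters returned by the default algorithm follows. The data and object names are hypothetical.

```r
set.seed(7)  # arbitrary seed for reproducibility
x <- matrix(rnorm(50), ncol = 2)  # hypothetical data: 25 points

# k = 1: the single centre is the column mean, and the within-cluster
# sum of squares coincides with the total sum of squares
one <- kmeans(x, centers = 1)
all.equal(as.vector(one$centers), colMeans(x))
all.equal(one$withinss, one$totss)

# A number of clusters with the default algorithm: all three are non-empty
three <- kmeans(x, centers = 3)
three$size
```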
kmeans returns an object of class "kmeans" which has a fitted method. It is a list with at least the following components:
- cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
- centers: A matrix of cluster centres.
- totss: The total sum of squares.
- withinss: Vector of within-cluster sum of squares, one component per cluster.
- tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
- betweenss: The between-cluster sum of squares, i.e. totss - tot.withinss.
- size: The number of points in each cluster.
- iter: The number of (outer) iterations.
- ifault: integer: indicator of a possible algorithm problem -- for experts.
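The relationships among these components can be checked directly. The sketch below uses hypothetical data, and the derived quantities size_check and explained are illustrative names, not components of the returned object.

```r
set.seed(2024)  # arbitrary seed for reproducibility

# Hypothetical data: two groups of 40 points
x <- rbind(matrix(rnorm(80, sd = 0.4), ncol = 2),
           matrix(rnorm(80, mean = 2, sd = 0.4), ncol = 2))
fit <- kmeans(x, centers = 2, nstart = 10)

# size counts the entries of cluster, per cluster label
size_check <- as.vector(table(fit$cluster))

# The sums of squares decompose as totss = tot.withinss + betweenss
all.equal(sum(fit$withinss), fit$tot.withinss)
all.equal(fit$betweenss, fit$totss - fit$tot.withinss)

# Share of the total sum of squares accounted for by the clustering
explained <- fit$betweenss / fit$totss
explained
```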
Forgy, E. W. (1965) Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics 21, 768--769.
Hartigan, J. A. and Wong, M. A. (1979). A K-means clustering algorithm. Applied Statistics 28, 100--108.
Lloyd, S. P. (1957, 1982) Least squares quantization in PCM. Technical Note, Bell Laboratories. Published in 1982 in IEEE Transactions on Information Theory 28, 128--137.
MacQueen, J. (1967) Some methods for classification and analysis of
multivariate observations. In Proceedings of the Fifth Berkeley
Symposium on Mathematical Statistics and Probability,
eds L. M. Le Cam & J. Neyman,
require(graphics)

# a 2-dimensional example
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, 2))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)

# sum of squares
ss <- function(x) sum(scale(x, scale = FALSE)^2)

## cluster centers "fitted" to each obs.:
fitted.x <- fitted(cl); head(fitted.x)
resid.x <- x - fitted(cl)

## Equalities : ----------------------------------
cbind(cl[c("betweenss", "tot.withinss", "totss")], # the same two columns
      c(ss(fitted.x), ss(resid.x), ss(x)))
stopifnot(all.equal(cl$totss, ss(x)),
          all.equal(cl$tot.withinss, ss(resid.x)),
          ## these three are the same:
          all.equal(cl$betweenss, ss(fitted.x)),
          all.equal(cl$betweenss, cl$totss - cl$tot.withinss),
          ## and hence also
          all.equal(ss(x), ss(fitted.x) + ss(resid.x))
          )

kmeans(x, 1)$withinss # trivial one-cluster, (its W.SS == ss(x))

## random starts do help here with too many clusters
## (and are often recommended anyway!):
(cl <- kmeans(x, 5, nstart = 25))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8)