biganalytics (version 1.1.1)

bigkmeans: The Bigmemory Project's memory-efficient k-means cluster analysis

Description

k-means cluster analysis without the memory overhead, and possibly in parallel using shared memory.

Usage

bigkmeans(x, centers, iter.max = 10, nstart = 1)

Arguments

x
a big.matrix object.
centers
the number of clusters k, or a k by ncol(x) matrix of starting cluster centers.
iter.max
the maximum number of iterations.
nstart
number of random starts, to be done in parallel if there is a registered backend (see below).

Value

  • An object of class kmeans, just as produced by kmeans.

Details

The real benefit is the lack of memory overhead compared to the standard kmeans() function. Part of the overhead of kmeans() stems from the way it looks for unique starting centers, and could be improved upon. The bigkmeans() function works on either regular R matrix objects or on big.matrix objects. In either case it requires no extra memory beyond the data itself, other than the recording of cluster memberships, whereas kmeans() makes at least two extra copies of the data; the situation is worse still when multiple starts (nstart > 1) are used.
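
For instance, both of the following calls are valid (a minimal sketch; as.big.matrix() from bigmemory converts an ordinary matrix):

  library(biganalytics)               # also loads bigmemory
  m <- matrix(rnorm(300), ncol=3)     # an ordinary R matrix...
  ans1 <- bigkmeans(m, 2)
  bm <- as.big.matrix(m)              # ...or a big.matrix holding the same data
  ans2 <- bigkmeans(bm, 2)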

If nstart > 1 and you are using bigkmeans() in parallel, a vector of cluster memberships will need to be stored for each worker, which could be memory-intensive for large data. This isn't a problem if the multiple starts are run sequentially.

Unless you have a really big data set (where a single run of kmeans() not only burns memory but takes more than a few seconds), using parallel computing for multiple random starts is unlikely to be much faster than running them sequentially.
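
If parallel starts do make sense for your data, register any foreach backend before the call; a sketch using doParallel (the example below uses doSNOW instead):

  library(doParallel)
  cl <- makeCluster(2)
  registerDoParallel(cl)
  ans <- bigkmeans(x, 2, nstart=4)    # starts are spread across the workers
  stopCluster(cl)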

Only the algorithm by MacQueen is used here.
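
When benchmarking against the standard kmeans(), note that kmeans() defaults to the Hartigan-Wong algorithm; forcing MacQueen gives a more direct comparison (a sketch, assuming y holds the same data as an ordinary matrix):

  km <- kmeans(y, 2, nstart=5, algorithm="MacQueen")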

References

Hartigan, J. A. and Wong, M. A. (1979). A K-means clustering algorithm. Applied Statistics 28, 100--108.

MacQueen, J. (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, eds L. M. Le Cam & J. Neyman, 1, pp. 281--297. Berkeley, CA: University of California Press.

See Also

big.matrix, foreach

Examples

# Simple example (with one processor):

  library(bigmemory)

  x <- big.matrix(100000, 3, init=0, type="double")
  x[seq(1,100000,by=2),] <- rnorm(150000)
  x[seq(2,100000,by=2),] <- rnorm(150000, 5, 1)
  head(x)
  ans <- bigkmeans(x, 1)              # One cluster isn't always allowed
                                      # but is convenient.
  ans$centers
  ans$withinss
  ans$size
  apply(x, 2, mean)
  ans <- bigkmeans(x, 2, nstart=5)    # Sequential multiple starts.
  class(ans)
  names(ans)
  ans$centers
  ans$withinss
  ans$size

  # To use a parallel backend, try something like the following,
  # assuming you have at least 3 cores available on this machine.
  # Each processor does incur memory overhead for the storage of
  # cluster memberships.
  library(doSNOW)
  cl <- makeCluster(3, type="SOCK")
  registerDoSNOW(cl)
  ans <- bigkmeans(x, 2, nstart=5)
  stopCluster(cl)                     # shut down the workers when finished
  registerDoSEQ()                     # revert foreach to sequential mode

  # Both of the following runs are sequential, but bigkmeans() has less
  # memory overhead.  Note that the gc() comparisons aren't completely
  # fair, because the big.matrix objects aren't reflected in the gc()
  # summary.  But the savings are there.
  gc(reset=TRUE)
  time.new <- system.time(print(bigkmeans(x, 2, nstart=5)$centers))
  gc()
  y <- x[,]
  rm(x)
  gc(reset=TRUE)
  time.old <- system.time(print(kmeans(y, 2, nstart=5)$centers))
  gc()
  # The bigkmeans() centers should match the kmeans() centers, with less
  # memory overhead and (typically) a shorter running time.
  time.new
  time.old
