KM: K-means over dense representation of data

Description

Multithreaded weighted Minkowski and spherical K-means via Lloyd's algorithm over dense representation of data.

Usage

KM(
  X,
  centroid,
  Xw = rep(1, ncol(X)),
  minkP = 2,
  maxIter = 100L,
  maxCore = 7L,
  verbose = TRUE
  )

Arguments

A d x N numeric matrix where N is the number of data points --- each column is an observation, and d is the dimensionality. Column-observation representation promotes cache locality.

centroid

A d x K numeric matrix where K is the number of clusters. Each column represents a cluster center.

A numeric vector of size N. Xw[i] is the weight on observation X[, i]. Users should normalize Xw such that the elements sum up to N. Default uniform weights for all observations.

minkP

A numeric value or a character string. If numeric, minkP is the power p in the definition of Minkowski distance. If character string, "max" implies Chebyshev distance, "cosine" implies cosine dissimilarity. Default 2.

maxIter

An integer. The maximal number of iterations. Default 100.

maxCore

An integer. The maximal number of threads to invoke. No more than the total number of logical processors on machine. Default 7.

verbose

A boolean value. TRUE prints progress.

Value

A list of size K, the number of clusters. Each element is a list of 3 vectors:

centroid

a numeric vector of size d.

clusterMember

an integer vector of indexes of elements grouped to centroid.

member2centroidDistance

a numeric vector of the same size of clusterMember. The ith element is the Minkowski distance or cosine dissimilarity from centroid to the clusterMember[i]th observation in X.

Empty clusterMember implies empty cluster.

Details

Implementation highlights include:

(i) In Minkowski distance calculation, integer power no greater than 30 uses multiplications. Fractional powers or powers above 30 call std::pow().

(ii) Multithreaded observation-centroid distance calculations. Distances are memorized to avoid unnecessary recomputations if centroids did not change in the last iteration.

(iii) A lookup table is built for storing observation - centroid ID pairs during the assignment step. Observation IDs are then grouped by centroid IDs which allows parallel computing cluster means.

(iv) Function allows non-uniform weights on observations.

(v) Meta-template programming trims branches over different distance functions and other computing methods at compile time.

Examples

Run this code

# NOT RUN {
# ===========================================================================
# Play random numbers. See speed.
# ===========================================================================
N = 5000L # Number of points.
d = 500L # Dimensionality.
K = 50L # Number of clusters.
dat = matrix(rnorm(N * d) + runif(N * d), nrow = d)


# Use kmeans++ initialization.
centroidInd = GMKMcharlie::KMppIni(
  X = dat, K, firstSelection = 1L, minkP = 2, stochastic = FALSE,
  seed = sample(1e9L, 1), maxCore = 2L, verbose = TRUE)


centroid = dat[, centroidInd]


# Euclidean.
system.time({rst = GMKMcharlie::KM(
  X = dat, centroid = centroid, maxIter = 100,
  minkP = 2, maxCore = 2, verbose = TRUE)})


# Cosine dissimilarity.
dat = apply(dat, 2, function(x) x / sum(x ^ 2) ^ 0.5)
centroid = dat[, centroidInd]
system.time({rst2 = GMKMcharlie::KM(
  X = dat, centroid = centroid, maxIter = 100,
  minkP = "cosine", maxCore = 2, verbose = TRUE)})


# ===========================================================================
# Test against R's inbuilt km()
# ===========================================================================
dat = t(iris[1:4])
dimnames(dat) = NULL


# Use kmeans++ initialization.
centroidInd = GMKMcharlie::KMppIni(
  X = dat, K = 3, firstSelection = 1L, minkP = 2, stochastic = FALSE,
  seed = sample(1e9L, 1), maxCore = 2L, verbose = TRUE)
centroid = dat[, centroidInd]


rst = GMKMcharlie::KM(X = dat, centroid = centroid, maxIter = 100,
                      minkP = 2, maxCore = 2, verbose = TRUE)
rst = lapply(rst, function(x) sort(x$clusterMember))


rst2 = kmeans(x = t(dat), centers = t(centroid), algorithm = "Lloyd")
rst2 = aggregate(list(1L : length(rst2$cluster)),
                 list(rst2$cluster), function(x) sort(x))[[2]]


setdiff(rst, rst2)
# }

Run the code above in your browser using DataLab