KMconstrained: K-means over dense data input with constraints on cluster weights

Description

Multithreaded weighted Minkowski and spherical K-means via Lloyd's algorithm over dense representation of data given cluster size (weight) constraints.

Usage

KMconstrained(
  X,
  centroid,
  Xw = rep(1, ncol(X)),
  clusterWeightUB = rep(ncol(X) + 1, ncol(centroid)),
  minkP = 2,
  convergenceTail = 5L,
  tailConvergedRelaErr = 1e-04,
  maxIter = 100L,
  maxCore = 7L,
  paraSortInplaceMerge = FALSE,
  verbose = TRUE
  )

Arguments

A d x N numeric matrix where N is the number of data points --- each column is an observation, and d is the dimensionality. Column-observation representation promotes cache locality.

centroid

A d x K numeric matrix where K is the number of clusters. Each column represents a cluster center.

A numeric vector of size N. Xw[i] is the weight on observation X[, i]. Users should normalize Xw such that the elements sum up to N. Default uniform weights for all observations.

clusterWeightUB

An integer vector of size K. The upper bound of weight for each cluster. If Xw are all 1, clusterWeightUB upper-bound cluster sizes.

minkP

A numeric value or a character string. If numeric, minkP is the power p in the definition of Minkowski distance. If character string, "max" implies Chebyshev distance, "cosine" implies cosine dissimilarity. Default 2.

convergenceTail

An integer. The algorithm may end up with "cyclical convergence" due to the size / weight constraints, that is, every few iterations produce the same clustering. If the cost (total in-cluster distance) of each of the last convergenceTail iterations has a relative difference less than tailConvergedRelaErr against the cost from the prior iteration, the program stops.

tailConvergedRelaErr

A numeric value, explained in convergenceTail.

maxIter

An integer. The maximal number of iterations. Default 100.

maxCore

An integer. The maximal number of threads to invoke. No more than the total number of logical processors on machine. Default 7.

paraSortInplaceMerge

A boolean value. TRUE let the algorithm call std::inplace_merge() (std refers to C++ STL namespace) instead of std::merge() for parallel-sorting the observation-centroid distances. In-place merge is slower but requires no extra memory.

verbose

A boolean value. TRUE prints progress.

Value

A list of size K, the number of clusters. Each element is a list of 3 vectors:

centroid

a numeric vector of size d.

clusterMember

an integer vector of indexes of elements grouped to centroid.

member2centroidDistance

a numeric vector of the same size of clusterMember. The ith element is the Minkowski distance or cosine dissimilarity from centroid to the clusterMember[i]th observation in X.

Empty clusterMember implies empty cluster.

Details

See details in KM() for common implementation highlights. Weight upper bounds are implemented as follows:

In each iteration, all the (observation ID, centroid ID, distance) tuples are sorted by distance. From the first to the last tuple, the algorithm puts observation in the cluster labeled by the centroid ID, if (i) the observation has not already been assigned and (ii) the cluster size has not exceeded its upper bound. The actual implementation is slightly different. A parallel merge sort is crafted for computing speed.

Examples

Run this code

# NOT RUN {
N = 5000L # Number of points.
d = 500L # Dimensionality.
K = 50L # Number of clusters.
dat = matrix(rnorm(N * d) + runif(N * d), nrow = d)


# Use kmeans++ initialization.
centroidInd = GMKMcharlie::KMppIni(
  X = dat, K, firstSelection = 1L, minkP = 2, stochastic = FALSE,
  seed = sample(1e9L, 1), maxCore = 2L, verbose = TRUE)


centroid = dat[, centroidInd]


# Each cluster size should not be greater than N / K * 2.
sizeConstraints = as.integer(rep(N / K * 2, K))
system.time({rst = GMKMcharlie::KMconstrained(
  X = dat, centroid = centroid, clusterWeightUB = sizeConstraints,
  maxCore = 2L, tailConvergedRelaErr = 1e-6, verbose = TRUE)})


# Size upper bounds vary in [N / K * 1.5, N / K * 2]
sizeConstraints = as.integer(round(runif(K, N / K * 1.5, N / K * 2)))
system.time({rst = GMKMcharlie::KMconstrained(
  X = dat, centroid = centroid, clusterWeightUB = sizeConstraints,
  maxCore = 2L, tailConvergedRelaErr = 1e-6, verbose = TRUE)})
# }

Run the code above in your browser using DataLab