KMsparse: K-means over sparse representation of data

Description

Multithreaded weighted Minkowski and spherical K-means via Lloyd's algorithm over sparse representation of data.

Usage

KMsparse(
  X,
  d,
  centroid,
  Xw = rep(1, length(X)),
  minkP = 2,
  maxIter = 100L,
  maxCore = 7L,
  verbose = TRUE
  )

Arguments

A list of size N, the number of observations. X[[i]] is a 2-column data frame. The 1st column is a sorted integer vector of the indexes of nonzero dimensions. Values in these dimensions are stored in the 2nd column as a numeric vector. Internally the algorithm sets a 32-bit int pointer to the beginning of the 1st column and a 64-bit double pointer to the beginning of the 2nd column, so it is critical that the input has the correct type.

An integer. The dimensionality of X. d MUST be no less than the maximum of all index vectors in X.

centroid

A list of size K, the number of clusters. centroid[[i]] can be in dense or sparse representation. If dense, a numeric vector of size d. If sparse, a 2-column data frame in the same sense as X[[i]].

A numeric vector of size N. Xw[i] is the weight on observation X[[i]]. Users should normalize Xw such that the elements sum up to N. Default uniform weights for all observations.

minkP

A numeric value or a character string. If numeric, minkP is the power p in the definition of Minkowski distance. If character string, "max" implies Chebyshev distance, "cosine" implies cosine dissimilarity. Default 2.

maxIter

An integer. The maximal number of iterations. Default 100.

maxCore

An integer. The maximal number of threads to invoke. No more than the total number of logical processors on machine. Default 7.

verbose

A boolean value. TRUE prints progress.

Value

A list of size K, the number of clusters. Each element is a list of 3 vectors:

centroid

a numeric vector of size d.

clusterMember

an integer vector of indexes of elements grouped to centroid.

member2centroidDistance

a numeric vector of the same size of clusterMember. The ith element is the Minkowski distance or cosine dissimilarity from centroid to the clusterMember[i]th observation in X.

Empty clusterMember implies empty cluster.

Details

See details in KM() for implementation highlights. There are some other optimizations such as, except for the maximum norm, cost of computing the distance between a dense centroid vector and a sparse observation is linear to the size of the sparse observation, which should be largely less than the size of the dense vector. This is done by letting every centroid memorize its before-root Minkowski norm. The full distance can then be inferred from adding the residual norm to the partial distance.

Examples

Run this code

# NOT RUN {
# ===========================================================================
# Play random numbers. See speed.
# ===========================================================================
N = 10000L # Number of points.
d = 500L # Dimensionality.
K = 100L # Number of clusters.


# Create a data matrix, about 95% of which are zeros.
dat = matrix(unlist(lapply(1L : N, function(x)
{
  tmp = numeric(d)
  # Nonzero entries.
  Nnz = as.integer(max(1, d * runif(1, 0, 0.05)))
  tmp[sample(d, Nnz)] = runif(Nnz) + rnorm(Nnz)
  tmp
})), nrow = d); gc()


# Convert to sparse representation.
# GMKMcharlie::d2s() acheives the same.
sparsedat = apply(dat, 2, function(x)
{
  nonz = which(x != 0)
  list(nonz, x[nonz])
}); gc()


centroidInd = sample(length(sparsedat), K)


# Test speed using dense representation.
centroid = dat[, centroidInd]
system.time({rst = GMKMcharlie::KM(
  X = dat, centroid = centroid, maxIter = 100,
  minkP = 2, maxCore = 2, verbose = TRUE)})


# Test speed using sparse representation.
sparseCentroid = sparsedat[centroidInd]
system.time({sparseRst = GMKMcharlie::KMsparse(
  X = sparsedat, d = d, centroid = sparseCentroid,
  maxIter = 100, minkP = 2, maxCore = 2, verbose = TRUE)})
# }

Run the code above in your browser using DataLab