KMppIni: Minkowski and spherical, deterministic and stochastic, multithreaded K-means++ initialization over dense representation of data

Description

Find suitable observations as initial centroids.

Usage

KMppIni(
  X,
  K,
  firstSelection = 1L,
  minkP = 2,
  stochastic = FALSE,
  seed = 123,
  maxCore = 7L,
  verbose = TRUE
  )

Arguments

A d x N numeric matrix where N is the number of data points --- each column is an observation, and d is the dimensionality. Column-observation representation promotes cache locality.

An integer, the number of centroids.

firstSelection

An integer, index of the observation selected as the first initial centroid in X. Should be no greater than N.

minkP

A numeric value or a character string. If numeric, minkP is the power p in the definition of Minkowski distance. If character string, "max" implies Chebyshev distance, "cosine" implies cosine dissimilarity. Default 2.

stochastic

A boolean value. TRUE runs the stochastic K-means++ initialization by Arthur and Vassilvitskii (2007). Roughly speaking, the algorithm is stochastic in the sense that each of the remaining observations has a probability of being selected as the next centroid, and the probability is an increasing function of the minimal distance between this observation and the existing centroids. In the same context, the deterministic version selects as the next centroid with probability 1 the observation that has the longest minimal distance to the existing centroids.

seed

Random seed if stochastic.

maxCore

An integer. The maximal number of threads to invoke. No more than the total number of logical processors on machine. Default 7.

verbose

A boolean value. TRUE prints progress.

Value

An integer vector of size K. The vector contains the indexes of observations selected as the initial centroids.

Details

In each iteration, the distances between the newly selected centroid and all the other observations are computed with multiple threads. Scheduling is homemade for minimizing the overhead of thread communication.

Examples

Run this code

# NOT RUN {
N = 30000L
d = 300L
K = 30L
X = matrix(rnorm(N * d) + 2, nrow = d)
# CRAN check allows examples invoking 2 threads at most. Change `maxCore`
# for acceleration.
kmppSt = KMppIni(X, K, firstSelection = 1L, minkP = 2,
                 stochastic = TRUE, seed = sample(1e9L, 1), maxCore = 2L)
kmppDt = KMppIni(X, K, firstSelection = 1L, minkP = 2,
                 stochastic = FALSE, maxCore = 2L)
str(kmppSt)
str(kmppDt)
# }

Run the code above in your browser using DataLab