dkmeans: Distributed kmeans

Description

dkmeans function is intended to be a distributed alternative for kmeans function.

Usage

dkmeans(X, centers, iter.max = 10, nstart = 1,  sampling_threshold = 1e+06, trace = FALSE,  na_action = c("exclude","fail"),  completeModel = FALSE)

Arguments

a darray (dense or sparse) which contains the samples.

centers

either the number of clusters, say k, or a set of initial (distinct) cluster centres. If a number, a random set of (distinct) samples in X is chosen as the initial centres.

iter.max

the maximum number of iterations allowed.

nstart

when the value specified for 'centers' argument is a number, clustering will be performed several times and the best result is reported. The best result would be the one with highest value of 'withinss' regardless of its number of iterations. 'nstart' gives the number of times that a random set of centers is chosen and clustering is performed. When 'centers' argument is a matrix of centers, or when completeModel=TRUE 'nstart' will be discarded.

sampling_threshold

threshold for the method which Randomly finds centers (centralized or distributed). It should be always smaller than 1e9. When (blockSize > sampling_threshold || nSample > 1e9), the distributed sampling is selected, in which first a set of blocks are randomly chosen, and then the centers are randomly selected from the samples of those bloks. Here, blockSize is the number of samples in each partition of X, and nSample is the total number of samples in X.

trace

when this argument is true, intermediate steps of the progress are displayed.

na_action

it indicates what should happen when the data contain missed values. Values of NA, NaN, and Inf in samples are treated as missed values. There are two options for this argument exclude and fail. When exclude is selected (the default choice), any sample with missed values will be ignored in the clustering process. In the darray which will be created for cluster, the value corresponding to these samples will be NA. When fail is selected, the function will stop in the case of any missed value in the dataset.

completeModel

when it is FALSE (default), the function does not return cluster label of the samples and measurements of cluster quality. Therefore, it can perform faster.

Value

cluster: (available only when completeModel=TRUE; otherwise it is NULL) a darray of integers (from 1:k) indicating the cluster to which each point is allocated.
centers: a matrix of cluster centres.
totss: (available only when completeModel=TRUE; otherwise it is NA) the total sum of squares.
withinss: (available only when completeModel=TRUE; otherwise it is NA) vector of within-cluster sum of squares, one component per cluster.
tot.withinss: (available only when completeModel=TRUE; otherwise it is NA) total within-cluster sum of squares, i.e., sum(withinss).
betweenss: (available only when completeModel=TRUE; otherwise it is NA) the between-cluster sum of squares, i.e. totss-tot.withinss.
size: the number of points in each cluster.
iter: the number of iterations used for clustering. Its value will be iter.max+1 when the algorithm is not converged.

Details

The data given by X is clustered by the k-means method, which aims to partition the points into k groups such that the sum of squares from points to the assigned cluster centres is minimized. At the minimum, all cluster centres are at the mean of their Voronoi sets (the set of data points which are nearest to the cluster centre).

The algorithm of Lloyd-Forgy (Lloyd 1957 and Forgy 1965) is used at the current version. If an initial matrix of centres is supplied, it is possible that no point will be closest to one or more centres, which currently generates a warning message.

References

Forgy, E. W. (1965) Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics 21, 768-769.

Lloyd, S. P. (1957, 1982) Least squares quantization in PCM. Technical Note, Bell Laboratories. Published in 1982 in IEEE Transactions on Information Theory 28, 128-137.

Examples

Run this code

 ## Not run: 
#     library(kmeans.ddR)
# 
#     iris2 <- iris[,-5]
#     centers <- matrix(c(5.901613,5.006000,6.850000,2.748387,3.428000,
# 3.073684,4.393548,1.462000,5.742105,1.433871,0.246000,2.071053),3,4)
#     dimnames(centers) <- list(1L:3L, colnames(iris2))
# 
#     X <- as.darray(data.matrix(iris2))
# 
#     (mykm1 <- dkmeans(X,centers=centers))
#     
#     (mykm2 <- dkmeans(X,centers=3, completeModel=TRUE))
#  ## End(Not run)

Run the code above in your browser using DataLab