Optimal_Clusters_KMeans: Optimal number of Clusters for k-means

Description

Optimal number of Clusters for k-means

Usage

Optimal_Clusters_KMeans(data, max_clusters, criterion = "variance_explained",
  fK_threshold = 0.85, num_init = 1, max_iters = 200,
  initializer = "optimal_init", threads = 1, tol = 1e-04,
  plot_clusters = TRUE, verbose = FALSE, tol_optimal_init = 0.3,
  seed = 1)

Arguments

data

matrix or data frame

max_clusters

the maximum number of clusters

criterion

one of variance_explained, WCSSE, dissimilarity, silhouette, distortion_fK, AIC, BIC and Adjusted_Rsquared. See details for more information.

fK_threshold

a float number used in the 'distortion_fK' criterion

num_init

number of times the algorithm will be run with different centroid seeds

max_iters

the maximum number of clustering iterations

initializer

the method of initialization. One of optimal_init, quantile_init, kmeans++ and random. See details for more information.

threads

an integer specifying the number of cores to run in parallel. OpenMP is used to run the initializations (num_init) in parallel

tol

a float number. If at any iteration (iteration > 1 and iteration < max_iters) 'tol' is greater than the squared norm of the centroids, kmeans is considered to have converged

plot_clusters

either TRUE or FALSE, indicating whether the results of the Optimal_Clusters_KMeans function should be plotted

verbose

either TRUE or FALSE, indicating whether progress is printed during clustering

tol_optimal_init

tolerance value for the 'optimal_init' initializer. The higher this value is, the farther apart from each other the centroids are.

seed

integer value for random number generator (RNG)

Value

a vector with the results for the specified criterion (except for 'distortion_fK', which returns the WCSS). If plot_clusters is TRUE, it also plots the results.

Details

---------------criteria--------------------------

variance_explained : the sum of the within-cluster-sum-of-squares-of-all-clusters divided by the total sum of squares

WCSSE : the sum of the within-cluster-sum-of-squares-of-all-clusters

dissimilarity : the average intra-cluster-dissimilarity of all clusters (the distance metric defaults to euclidean)

silhouette : the average silhouette width of all clusters (the distance metric defaults to euclidean)

distortion_fK : this criterion is based on the following paper, 'Selection of K in K-means clustering' (https://www.ee.columbia.edu/~dpwe/papers/PhamDN05-kmeans.pdf)

AIC : the Akaike information criterion

BIC : the Bayesian information criterion

Adjusted_Rsquared : the adjusted R^2 statistic
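The sum-of-squares criteria above can be illustrated with a minimal sketch in base R. It uses stats::kmeans rather than this package's own k-means routine, so the numbers will not match the output of Optimal_Clusters_KMeans exactly. The toy data, the variable names and the helper distortion_fk are assumptions made for illustration only; the f(K) weights follow the definition in the Pham et al. paper cited above as understood here, and the package's implementation may differ in detail.

```r
# Sketch only: stats::kmeans stands in for the package's k-means routine.
set.seed(1)

# Toy data (an assumption for illustration): two well-separated 2-D blobs.
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2))

max_clusters <- 6

# Total sum of squares around the overall column means.
total_ss <- sum(scale(x, center = TRUE, scale = FALSE)^2)

# WCSSE for k = 1..max_clusters: the per-cluster within sums of squares, summed.
wcsse <- sapply(seq_len(max_clusters), function(k) {
  kmeans(x, centers = k, nstart = 10, iter.max = 200)$tot.withinss
})

# The ratio described under 'variance_explained': WCSSE / total sum of squares.
ratio <- wcsse / total_ss

# Hypothetical helper: f(K) of Pham et al. computed from the distortions
# s[k] (here the WCSSE values) and the number of dimensions n_d.
# The paper states the weights alpha_K for n_d > 1.
distortion_fk <- function(s, n_d) {
  k_max <- length(s)
  f <- numeric(k_max)
  alpha <- numeric(k_max)
  f[1] <- 1
  for (k in seq_len(k_max)[-1]) {
    alpha[k] <- if (k == 2) {
      1 - 3 / (4 * n_d)
    } else {
      alpha[k - 1] + (1 - alpha[k - 1]) / 6
    }
    # f(K) is set to 1 when the previous distortion is zero.
    f[k] <- if (s[k - 1] == 0) 1 else s[k] / (alpha[k] * s[k - 1])
  }
  f
}

fk <- distortion_fk(wcsse, n_d = ncol(x))

print(round(ratio, 3))
print(round(fk, 3))
```

In this sketch a sharp drop in ratio between k = 1 and k = 2, together with an f(K) value below a threshold such as the default fK_threshold of 0.85, both point to two clusters for the toy data.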

---------------initializers----------------------

optimal_init : this initializer adds rows of the data incrementally, while checking that they do not already exist in the centroid-matrix

quantile_init : initialization of centroids by using the cumulative distance between observations and by removing potential duplicates

kmeans++ : kmeans++ initialization. Reference : http://theory.stanford.edu/~sergei/papers/kMeansPP-soda.pdf AND http://stackoverflow.com/questions/5466323/how-exactly-does-k-means-work

random : random selection of data rows as initial centroids
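To make the difference between the kmeans++ and random initializers concrete, here is a minimal base-R sketch. The functions kmeanspp_init and random_init are hypothetical helpers written for this illustration and are not part of the package; the package's internal initializers may differ in detail. The kmeans++ helper follows the seeding scheme of the Arthur and Vassilvitskii paper referenced above: each new centroid is a data row sampled with probability proportional to its squared distance from the nearest centroid chosen so far.

```r
# Hypothetical helper: k-means++ seeding, returning k rows of x as centroids.
kmeanspp_init <- function(x, k) {
  x <- as.matrix(x)
  n <- nrow(x)
  idx <- integer(k)
  idx[1] <- sample.int(n, 1)  # first centroid: uniform at random

  # Squared distance from every row to its nearest chosen centroid so far.
  d2 <- rowSums((x - matrix(x[idx[1], ], n, ncol(x), byrow = TRUE))^2)

  for (j in seq_len(k)[-1]) {
    # Rows already chosen have d2 == 0 and cannot be drawn again.
    # (This fails if every row of x is identical, since all weights are zero.)
    idx[j] <- sample.int(n, 1, prob = d2)
    d_new <- rowSums((x - matrix(x[idx[j], ], n, ncol(x), byrow = TRUE))^2)
    d2 <- pmin(d2, d_new)
  }
  x[idx, , drop = FALSE]
}

# Hypothetical helper: plain random selection of k distinct data rows.
random_init <- function(x, k) {
  x <- as.matrix(x)
  x[sample.int(nrow(x), k), , drop = FALSE]
}

# Usage on toy data (an assumption for illustration), feeding the seeds
# to stats::kmeans as starting centers.
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
centers_pp <- kmeanspp_init(x, k = 3)
centers_rnd <- random_init(x, k = 3)
fit <- kmeans(x, centers = centers_pp, iter.max = 200)
```

Spreading the initial centroids out in this way tends to reduce the number of restarts (num_init) needed to reach a good solution, at the cost of one extra pass over the data per centroid.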

Examples

# NOT RUN {
library(ClusterR)

data(dietary_survey_IBS)

dat = dietary_survey_IBS[, -ncol(dietary_survey_IBS)]

dat = center_scale(dat)

opt = Optimal_Clusters_KMeans(dat, max_clusters = 10, plot_clusters = FALSE)

# }
