ClusterR (version 1.0.1)

Optimal_Clusters_KMeans: Optimal number of Clusters for k-means

Description

Optimal number of Clusters for k-means

Usage

Optimal_Clusters_KMeans(data, max_clusters, criterion = "variance_explained", fK_threshold = 0.85, num_init = 1, max_iters = 200, initializer = "optimal_init", threads = 1, tol = 1e-04, plot_clusters = TRUE, verbose = FALSE, tol_optimal_init = 0.5, seed = 1)

Arguments

data
matrix or data frame
max_clusters
the maximum number of clusters
criterion
one of variance_explained, WCSSE, dissimilarity, silhouette, distortion_fK, AIC, BIC and Adjusted_Rsquared. See details for more information.
fK_threshold
a float number used in the 'distortion_fK' criterion
num_init
number of times the algorithm will be run with different centroid seeds
max_iters
the maximum number of clustering iterations
initializer
the method of initialization. One of optimal_init, quantile_init, kmeans++ and random. See details for more information.
threads
an integer specifying the number of cores to run in parallel. OpenMP is used to run the initializations (num_init) in parallel
tol
a float number. If, at any iteration with iteration > 1 and iteration < max_iters, 'tol' is greater than the squared norm of the centroids, then kmeans is considered to have converged
plot_clusters
either TRUE or FALSE, indicating whether the results of the Optimal_Clusters_KMeans function should be plotted
verbose
either TRUE or FALSE, indicating whether progress is printed during clustering
tol_optimal_init
tolerance value for the 'optimal_init' initializer. The higher this value is, the farther apart from each other the centroids are.
seed
integer value for random number generator (RNG)

Value

a vector with the results for the specified criterion (except for 'distortion_fK', which returns the WCSS). If plot_clusters is TRUE, the results are also plotted.

Details

---------------criteria--------------------------

variance_explained : the sum of the within-cluster-sum-of-squares-of-all-clusters divided by the total sum of squares

WCSSE : the sum of the within-cluster-sum-of-squares-of-all-clusters

dissimilarity : the average intra-cluster-dissimilarity of all clusters (the distance metric defaults to euclidean)

silhouette : the average silhouette width of all clusters (the distance metric defaults to euclidean)

distortion_fK : this criterion is based on the following paper, 'Selection of K in K-means clustering' (https://www.ee.columbia.edu/~dpwe/papers/PhamDN05-kmeans.pdf)

AIC : the Akaike information criterion

BIC : the Bayesian information criterion

Adjusted_Rsquared : the adjusted R^2 statistic
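
As a rough illustration of the first two criteria, the sketch below computes the WCSSE and the ratio of within-cluster to total sum of squares for k = 1, ..., max_clusters. It relies on base R's stats::kmeans rather than ClusterR's own routines, and the helper name wcss_by_k and the toy data are assumptions made for this example, so the numbers need not match what Optimal_Clusters_KMeans returns.

```r
# Hypothetical helper, not part of ClusterR: it uses stats::kmeans from base R.
wcss_by_k <- function(x, max_clusters, nstart = 5, seed = 1) {
  x <- as.matrix(x)
  set.seed(seed)
  # total sum of squares around the overall column means
  total_ss <- sum(scale(x, center = TRUE, scale = FALSE)^2)
  # summed within-cluster sum of squares for each candidate k
  wcss <- vapply(seq_len(max_clusters), function(k) {
    stats::kmeans(x, centers = k, nstart = nstart, iter.max = 200)$tot.withinss
  }, numeric(1))
  data.frame(k = seq_len(max_clusters), WCSSE = wcss, ratio = wcss / total_ss)
}

# toy data: two well-separated groups of 50 points in two dimensions
set.seed(42)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 6), ncol = 2))
res <- wcss_by_k(toy, max_clusters = 5)
print(res)
```

With a single cluster the within-cluster sum of squares equals the total sum of squares, so the ratio starts at 1 and drops sharply once k reaches the true number of groups.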

---------------initializers----------------------

optimal_init : this initializer adds rows of the data incrementally, while checking that they do not already exist in the centroid-matrix

quantile_init : initialization of centroids by using the cumulative distance between observations and by removing potential duplicates

kmeans++ : kmeans++ initialization. References: http://theory.stanford.edu/~sergei/papers/kMeansPP-soda.pdf and http://stackoverflow.com/questions/5466323/how-exactly-does-k-means-work

random : random selection of data rows as initial centroids
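
The two simplest initializers can be sketched in base R as follows. This is a minimal illustration of the general ideas (random row selection, and kmeans++ seeding as described in the referenced paper), not ClusterR's internal implementation; the function names random_init and kmeanspp_init are hypothetical. The optimal_init and quantile_init methods are not sketched because the descriptions above do not specify them in enough detail.

```r
# Hypothetical helpers for illustration; not ClusterR's internal code.

# 'random': pick k distinct data rows as the starting centroids
random_init <- function(x, k) {
  x[sample.int(nrow(x), k), , drop = FALSE]
}

# 'kmeans++': the first centroid is drawn uniformly at random; each further
# centroid is drawn with probability proportional to the squared distance
# to the nearest centroid chosen so far
kmeanspp_init <- function(x, k) {
  n <- nrow(x)
  idx <- integer(k)
  idx[1] <- sample.int(n, 1)
  # squared distance of every row to the first centroid
  d2 <- rowSums((x - matrix(x[idx[1], ], n, ncol(x), byrow = TRUE))^2)
  for (j in seq_len(k)[-1]) {
    # rows already chosen have d2 == 0 and therefore cannot be drawn again
    idx[j] <- sample.int(n, 1, prob = d2)
    new_d2 <- rowSums((x - matrix(x[idx[j], ], n, ncol(x), byrow = TRUE))^2)
    d2 <- pmin(d2, new_d2)
  }
  x[idx, , drop = FALSE]
}

set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
c_rand <- random_init(x, 3)
c_pp <- kmeanspp_init(x, 3)

# the seeds can be handed to base R's kmeans as starting centres
fit <- stats::kmeans(x, centers = c_pp)
```

The sketch assumes the data contain at least k distinct rows; otherwise the weighted draw has no positive probabilities left and sample.int stops with an error.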

Examples

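library(ClusterR)

# load the example data set shipped with ClusterR
data(dietary_survey_IBS)

# drop the last column, which holds the class labels
dat = dietary_survey_IBS[, -ncol(dietary_survey_IBS)]

# center and scale the remaining columns
dat = center_scale(dat)

# evaluate 1 to 10 clusters with the default criterion, without plotting
opt = Optimal_Clusters_KMeans(dat, max_clusters = 10, plot_clusters = FALSE)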
