Optimal number of Clusters for k-means
Optimal_Clusters_KMeans(data, max_clusters, criterion = "variance_explained",
fK_threshold = 0.85, num_init = 1, max_iters = 200,
initializer = "optimal_init", threads = 1, tol = 1e-04,
plot_clusters = TRUE, verbose = FALSE, tol_optimal_init = 0.3,
seed = 1)
data : matrix or data frame
max_clusters : the maximum number of clusters
criterion : one of variance_explained, WCSSE, dissimilarity, silhouette, distortion_fK, AIC, BIC and Adjusted_Rsquared. See details for more information.
fK_threshold : a float number used in the 'distortion_fK' criterion
num_init : number of times the algorithm will be run with different centroid seeds
max_iters : the maximum number of clustering iterations
initializer : the method of initialization. One of optimal_init, quantile_init, kmeans++ and random. See details for more information.
threads : an integer specifying the number of cores to run in parallel. OpenMP is used to parallelize the initializations (num_init)
tol : a float number. If, at any iteration (iteration > 1 and iteration < max_iters), 'tol' is greater than the squared norm of the centroids, then kmeans has converged
plot_clusters : either TRUE or FALSE, indicating whether the results of the Optimal_Clusters_KMeans function should be plotted
verbose : either TRUE or FALSE, indicating whether progress is printed during clustering
tol_optimal_init : tolerance value for the 'optimal_init' initializer. The higher this value is, the farther apart from each other the centroids are.
seed : integer value for the random number generator (RNG)
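The stopping rule described for 'tol' can be illustrated with a minimal Lloyd-style loop in base R. This is only a sketch: the function name lloyd_sketch is hypothetical, and it assumes that the quantity compared against 'tol' is the squared norm of the change in the centroids between two consecutive iterations. The package's actual implementation may differ.

```r
# Minimal Lloyd-style loop; the stopping rule is an ASSUMED reading of 'tol'.
# lloyd_sketch is a hypothetical helper, not part of any package.
lloyd_sketch <- function(x, centers, max_iters = 200, tol = 1e-04) {
  x <- as.matrix(x)
  k <- nrow(centers)
  for (iter in seq_len(max_iters)) {
    # Squared distance from every row to every centroid (n x k matrix).
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Assign each row to its nearest centroid.
    cl <- max.col(-d2, ties.method = "first")
    # Recompute each centroid as the mean of its rows (empty clusters keep
    # their previous centroid).
    new_centers <- centers
    for (j in seq_len(k)) {
      if (any(cl == j)) new_centers[j, ] <- colMeans(x[cl == j, , drop = FALSE])
    }
    # Squared norm of the centroid change (assumption, see lead-in).
    shift <- sum((new_centers - centers)^2)
    centers <- new_centers
    if (iter > 1 && shift < tol) break
  }
  list(centers = centers, clusters = cl, iterations = iter)
}

# Synthetic data: two well-separated 2-D groups of 50 rows each.
set.seed(1)
x <- rbind(matrix(rnorm(100, 0, 0.3), ncol = 2),
           matrix(rnorm(100, 3, 0.3), ncol = 2))
res <- lloyd_sketch(x, centers = x[c(1, 51), ])
```

The loop stops as soon as the centroids barely move, so max_iters acts only as an upper bound.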
Returns a vector with the results for the specified criterion (except for 'distortion_fK', which returns the WCSS). If plot_clusters is TRUE, then it also plots the results.
---------------criteria--------------------------
variance_explained : the sum of the within-cluster-sum-of-squares-of-all-clusters divided by the total sum of squares
WCSSE : the sum of the within-cluster-sum-of-squares-of-all-clusters
dissimilarity : the average intra-cluster-dissimilarity of all clusters (the distance metric defaults to euclidean)
silhouette : the average silhouette width of all clusters (the distance metric defaults to euclidean)
distortion_fK : this criterion is based on the following paper, 'Selection of K in K-means clustering' (https://www.ee.columbia.edu/~dpwe/papers/PhamDN05-kmeans.pdf)
AIC : the Akaike information criterion
BIC : the Bayesian information criterion
Adjusted_Rsquared : the adjusted R^2 statistic
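As a rough illustration of three of the criteria above, the following sketch computes the WCSSE, the within-to-total sum-of-squares ratio described under variance_explained, and the f(K) statistic of the cited paper. It uses base R's stats::kmeans on synthetic data rather than this package's internals, so the numbers will not necessarily match what Optimal_Clusters_KMeans returns. The f(K) weights follow the paper's definition for data with more than one dimension.

```r
# Illustrative only: base R stats::kmeans, not this package's internals.
set.seed(1)
# Synthetic data: three well-separated 2-D groups of 50 rows each.
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2)
)
max_clusters <- 6

# WCSSE: total within-cluster sum of squares for k = 1..max_clusters.
wcss <- sapply(seq_len(max_clusters), function(k) {
  kmeans(x, centers = k, nstart = 10, iter.max = 200)$tot.withinss
})

# Total sum of squares around the overall column means.
tss <- sum(scale(x, center = TRUE, scale = FALSE)^2)

# Ratio described under 'variance_explained': WCSS divided by the total SS.
ratio <- wcss / tss

# f(K) from 'Selection of K in K-means clustering'; nd = number of dimensions.
nd <- ncol(x)
alpha <- numeric(max_clusters)
fk <- numeric(max_clusters)
fk[1] <- 1
for (k in 2:max_clusters) {
  alpha[k] <- if (k == 2) 1 - 3 / (4 * nd) else alpha[k - 1] + (1 - alpha[k - 1]) / 6
  fk[k] <- if (wcss[k - 1] == 0) 1 else wcss[k] / (alpha[k] * wcss[k - 1])
}

# Candidate numbers of clusters: those whose f(K) falls below the threshold.
fK_threshold <- 0.85
candidates <- which(fk < fK_threshold)
```

A small f(K) indicates that adding the K-th cluster reduced the distortion much more than would be expected for uniformly distributed data, which is why values below the threshold are treated as candidates.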
---------------initializers----------------------
optimal_init : this initializer adds rows of the data incrementally, while checking that they do not already exist in the centroid-matrix
quantile_init : initialization of centroids by using the cumulative distance between observations and by removing potential duplicates
kmeans++ : kmeans++ initialization. References: http://theory.stanford.edu/~sergei/papers/kMeansPP-soda.pdf and http://stackoverflow.com/questions/5466323/how-exactly-does-k-means-work
random : random selection of data rows as initial centroids
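The kmeans++ initializer listed above can be sketched in a few lines of base R. This follows the seeding scheme of the referenced paper (each new centroid is a data row sampled with probability proportional to its squared distance from the nearest centroid chosen so far). The function name kmeanspp_init is hypothetical, and the sketch is not the package's own code. It also assumes that the data contain at least k distinct rows.

```r
# Sketch of kmeans++ seeding; illustrative only.
# kmeanspp_init is a hypothetical helper, not part of any package.
kmeanspp_init <- function(x, k) {
  x <- as.matrix(x)
  n <- nrow(x)
  idx <- integer(k)
  # First centroid: a data row chosen uniformly at random.
  idx[1] <- sample.int(n, 1)
  # Squared distance from every row to its nearest chosen centroid so far.
  d2 <- rowSums(sweep(x, 2, x[idx[1], ])^2)
  for (j in seq_len(k)[-1]) {
    # Next centroid: sampled with probability proportional to d2, so rows
    # already chosen (d2 = 0) cannot be picked again.
    idx[j] <- sample.int(n, 1, prob = d2)
    d2 <- pmin(d2, rowSums(sweep(x, 2, x[idx[j], ])^2))
  }
  x[idx, , drop = FALSE]
}

# Usage: seed base R's kmeans with the sampled rows.
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
init <- kmeanspp_init(x, k = 3)
fit <- kmeans(x, centers = init, iter.max = 200)
```

In contrast, the random initializer corresponds to drawing all k rows uniformly, which is simpler but more likely to place two starting centroids in the same group.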
# NOT RUN {
data(dietary_survey_IBS)
dat = dietary_survey_IBS[, -ncol(dietary_survey_IBS)]
dat = center_scale(dat)
opt = Optimal_Clusters_KMeans(dat, max_clusters = 10, plot_clusters = FALSE)
# }