Compute the Bayesian cluster validity index (BCVI) for two to kmax groups, using the score function (SF) as the underlying cluster validity index (CVI) together with user-selected Dirichlet prior parameters. Full details of BCVI can be found in Preedasawakul and Wiroonsri (2025).
B_SF.IDX(x, kmax, method = "kmeans", nstart = 100, alpha = "default", mult.alpha = 1/2)
the data frame where the first and the second columns are the number of groups k and BCVI\((k)\), respectively, for k from 2 to kmax.
the data frame where the first and the second columns are the number of groups k and the variance of \(p_k\), respectively, for k from 2 to kmax.
the data frame where the first and the second columns are the number of groups k and the original SF\((k)\), respectively, for k from 2 to kmax.
a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
the maximum number of clusters to be considered.
a character string indicating which clustering method to use ("kmeans", "hclust_complete", "hclust_average", "hclust_single"). The default is "kmeans".
the number of initial random sets used by kmeans when method = "kmeans". The default is 100.
Dirichlet prior parameters \(\alpha_2,\ldots,\alpha_K\) where \(\alpha_k\) is the parameter corresponding to "the probability of having k groups". Selecting each \(\alpha_k\) between 0 and 30 and using the other parameter mult.alpha as its multiplier is recommended. The default is "default".
the power \(s\) in the factor \(n^s\) by which the Dirichlet prior parameters alpha are multiplied (selecting mult.alpha in [0,1) is recommended). The default is \(\frac{1}{2}\).
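The interaction between alpha and mult.alpha can be sketched in R as follows. This is one reading of the two argument descriptions above, not the package's internal code: the helper `scale_alpha` is hypothetical, and the sample size `n = 400` is made up for illustration.

```r
# Hypothetical helper (not part of BayesCVI): multiply each prior
# parameter alpha_k by n^s, where s = mult.alpha and n is the number
# of data points. Assumes this is how the multiplier is applied.
scale_alpha <- function(alpha, n, mult.alpha = 1/2) {
  stopifnot(all(alpha >= 0), n > 0)
  alpha * n^mult.alpha
}

# Illustrative values only: with n = 400 and mult.alpha = 1/2,
# every alpha_k is multiplied by sqrt(400) = 20.
scaled <- scale_alpha(alpha = c(5, 20, 0.5), n = 400, mult.alpha = 1/2)
scaled
```

Under this reading, a larger mult.alpha lets the prior keep more weight relative to the data as n grows, which may be why values in [0,1) are recommended.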
Nathakhun Wiroonsri and Onthada Preedasawakul
BCVI-SF is defined as follows.
Let
$$r_k({\bf x}) = \dfrac{SF(k) - \min_j SF(j)}{\sum_{i=2}^K (SF(i) - \min_j SF(j))}.$$
Assume that
$$f({\bf x}|{\bf p}) = C({\bf p}) \prod_{k=2}^K p_k^{nr_k({\bf x})}$$
represents the conditional probability density function of the dataset given \({\bf p}\), where \(C({\bf p})\) is the normalizing constant. Assume further that \({\bf p}\) follows a Dirichlet prior distribution with parameters \(\boldsymbol{\alpha} = (\alpha_2,\ldots,\alpha_K)\). The posterior distribution of \({\bf p}\) is then again a Dirichlet distribution, with parameters \((\alpha_2+nr_2({\bf x}),\ldots,\alpha_K+nr_K({\bf x}))\).
The BCVI is then defined as
$$BCVI(k) = E[p_k|{\bf x}] = \frac{\alpha_k + nr_k({\bf x})}{\alpha_0+n}$$
where \(\alpha_0 = \sum_{k=2}^K \alpha_k.\)
The variance of \(p_k\) can be computed as $$Var(p_k|{\bf x}) = \dfrac{(\alpha_k + nr_k({\bf x}))(\alpha_0 + n -\alpha_k-nr_k({\bf x}))}{(\alpha_0 + n)^2(\alpha_0 + n +1 )}.$$
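The posterior mean and variance above can be checked numerically with a minimal R sketch. The helper `bcvi_posterior` is hypothetical and not part of the package; it takes the weights \(r_k({\bf x})\) as given (they must sum to 1), and the input values are made up for illustration.

```r
# Hypothetical helper (not part of BayesCVI): posterior mean and
# variance of p_k for a Dirichlet prior `alpha`, normalized weights
# `r` (summing to 1), and sample size `n`.
bcvi_posterior <- function(r, alpha, n) {
  stopifnot(length(r) == length(alpha), abs(sum(r) - 1) < 1e-8)
  post  <- alpha + n * r   # posterior Dirichlet parameters
  total <- sum(post)       # equals alpha_0 + n because sum(r) = 1
  data.frame(
    k    = seq_along(r) + 1,  # k runs from 2 to K
    BCVI = post / total,
    VAR  = post * (total - post) / (total^2 * (total + 1))
  )
}

# Illustrative inputs only: three candidate values k = 2, 3, 4
r     <- c(0.1, 0.6, 0.3)
alpha <- c(1, 1, 1)
res   <- bcvi_posterior(r, alpha, n = 100)
res
```

Because the posterior parameters sum to \(\alpha_0 + n\), the BCVI values sum to 1 over k and can be read as posterior probabilities of having k groups.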
S. Saitta, B. Raphael, I. Smith, "A bounded index for cluster validity," In Perner, P.: Machine Learning and Data Mining in Pattern Recognition, Lecture Notes in Computer Science, 4571, Springer (2007).
O. Preedasawakul, N. Wiroonsri, "A Bayesian cluster validity index," Computational Statistics & Data Analysis, 202, 108053 (2025). doi:10.1016/j.csda.2024.108053
B2_data, B_TANG.IDX, B_WP.IDX, B_Wvalid, B_DB.IDX
library(BayesCVI)
# The data included in this package.
data = B2_data[,1:2]
# Dirichlet prior parameters alpha_2, ..., alpha_10 (one per k from 2 to kmax = 10)
aalpha = c(5,5,5,20,20,20,0.5,0.5,0.5)
B.SF = B_SF.IDX(x = scale(data), kmax=10, method = "kmeans",
                nstart = 100, alpha = aalpha, mult.alpha = 1/2)
# plot the BCVI
pplot = plot_BCVI(B.SF)
pplot$plot_index
pplot$plot_BCVI
pplot$error_bar_plot
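Besides plotting, the number of groups with the largest BCVI can be read off the two-column data frame described under the returned values. The sketch below uses a made-up data frame in place of actual output, since the element names of the returned object are not shown on this page; `best_k` is a hypothetical helper.

```r
# Hypothetical helper: return the k (first column) whose value in the
# second column is largest, e.g. the k with the highest BCVI(k).
best_k <- function(df) df[[1]][which.max(df[[2]])]

# Made-up stand-in for a data frame of k and BCVI(k)
toy  <- data.frame(k = 2:5, BCVI = c(0.10, 0.55, 0.25, 0.10))
best <- best_k(toy)
best
```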