UniversalCVI (version 1.2.0)

SH.IDX: Silhouette index

Description

Computes the SH (Rousseeuw, 1987; Kaufman and Rousseeuw, 2009) index for the result of either kmeans or hierarchical clustering, for each number of clusters from a user-specified kmin to kmax.

Usage

SH.IDX(x, kmax, kmin = 2, method = "kmeans", nstart = 100)

Value

SH

a data frame giving the SH index for each k from kmin to kmax; the first column is k and the second column is the SH index.

Arguments

x

a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.

kmax

a maximum number of clusters to be considered.

kmin

a minimum number of clusters to be considered. The default is 2.

method

a character string indicating which clustering method is to be used: "kmeans", "hclust_complete", "hclust_average", or "hclust_single". The default is "kmeans".

nstart

a maximum number of initial random sets for kmeans, used when method = "kmeans". The default is 100.

Author

Nathakhun Wiroonsri and Onthada Preedasawakul

Details

For \(i \in [n]\), \(l \in [k]\), and \(x_i \in C_l\), let

$$a(i) = \dfrac{1}{|C_l|-1}\sum_{y \in C_l} \left\|x_i-y\right\|$$ and $$b(i) = \min_{r \neq l} \dfrac{1}{|C_r|} \sum_{y \in C_r} \left\|x_i-y\right\|.$$ The silhouette value of a data point \(x_i\) is defined as:

$$s(i) = \begin{cases} \dfrac{b(i) - a(i)}{\max\{a(i),b(i)\}} & \text{if } |C_l| > 1 \\ 0 & \text{if } |C_l| = 1 \end{cases}.$$

The silhouette index is defined as

\(SH(k) = \dfrac{1}{n} \sum_{i = 1}^n s(i).\)

The largest value of \(SH(k)\) indicates a valid optimal partition.
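
The definitions above can be sketched in base R as follows. This is only an illustration of the formulas, assuming Euclidean distance; `silhouette_index` is a hypothetical helper and is not the implementation used inside SH.IDX, whose internals may differ.

```r
# Hypothetical helper (not part of UniversalCVI): silhouette index computed
# directly from the definitions of a(i), b(i) and s(i), Euclidean distance.
# Assumes at least two clusters in `labels`.
silhouette_index <- function(x, labels) {
  d <- as.matrix(dist(x))            # pairwise Euclidean distances
  n <- nrow(d)
  clusters <- unique(labels)
  s <- numeric(n)
  for (i in seq_len(n)) {
    own <- labels == labels[i]
    if (sum(own) == 1) {             # singleton cluster: s(i) = 0
      s[i] <- 0
      next
    }
    # a(i): mean distance to the other members of the same cluster
    a <- sum(d[i, own]) / (sum(own) - 1)
    # b(i): smallest mean distance to the members of any other cluster
    others <- setdiff(clusters, labels[i])
    b <- min(vapply(others, function(r) mean(d[i, labels == r]), numeric(1)))
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)                            # SH(k): average of s(i) over all points
}

# Toy check: four points forming two tight, well-separated pairs.
toy <- rbind(c(0, 0), c(0, 1), c(10, 0), c(10, 1))
sh_good <- silhouette_index(toy, c(1, 1, 2, 2))  # natural grouping
sh_bad  <- silhouette_index(toy, c(1, 2, 1, 2))  # grouping across the gap

# Choosing k: evaluate the index for k = 2..5 on simulated two-group data
# clustered with average linkage, then take the k with the largest value.
set.seed(1)
sim <- rbind(matrix(rnorm(40, 0, 0.3), ncol = 2),
             matrix(rnorm(40, 5, 0.3), ncol = 2))
tree <- hclust(dist(sim), method = "average")
ks <- 2:5
result <- data.frame(
  k  = ks,
  SH = sapply(ks, function(k) silhouette_index(sim, cutree(tree, k)))
)
print(result[which.max(result$SH), ])
```

The natural grouping of the toy points scores higher than the grouping that cuts across the gap, and on the simulated data the largest value of the index picks out the number of groups used to generate it.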

References

Rousseeuw, P.J., 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65.

Kaufman, L. and Rousseeuw, P.J., 2009. Finding groups in data: an introduction to cluster analysis. John Wiley & Sons.

See Also

Hvalid, Wvalid, DI.IDX, FzzyCVIs, R1_data

Examples

library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- Hierarchical ----

# Average linkage

# Compute the SH index
H.SH = SH.IDX(scale(x), kmax = 10, kmin = 2, method = "hclust_average", nstart = 1)
print(H.SH)

# The optimal number of clusters
H.SH[which.max(H.SH$SH),]
