validation_kproto: Validating k Prototypes Clustering

Description

Calculating the prefered validation index for a k-Prototypes clustering with k clusters or computing the optimal number of clusters based on the choosen index for k-Prototype clustering. Possible validation indices are: cindex, dunn, gamma, gplus, mcclain, ptbiserial, silhouette and tau.

Usage

validation_kproto(
  method = NULL,
  object = NULL,
  data = NULL,
  k = NULL,
  lambda = NULL,
  kp_obj = "optimal",
  ...
)

Arguments

method

character specifying the validation index: cindex, dunn, gamma, gplus, mcclain, ptbiserial, silhouette or tau.

object

Object of class kproto resulting from a call with kproto(..., keep.data=TRUE)

data

Original data; only required if object == NULL and neglected if object != NULL.

Vector specifying the search range for optimum number of clusters; if NULL the range will set as 2:sqrt(n). Only required if object == NULL and neglected if object != NULL.

lambda

Factor to trade off between Euclidean distance of numeric variables and simple matching coefficient between categorical variables.

kp_obj

character either "optimal" or "all": Output of the index-optimal clustering (kp_obj == "optimal") or all computed clusterpartitions (kp_obj == "all"); only required if object != NULL.

...

Further arguments passed to kproto, like:

nstart: If > 1 repetetive computations of kproto with random initializations are computed.
verbose: Logical whether information about the cluster procedure should be given. Caution: If verbose=FALSE, the reduction of the number of clusters is not mentioned.

Value

For computing the optimal number of clusters based on the choosen validation index for k-Prototype clustering the output contains:

k_opt

optimal number of clusters (sampled in case of ambiguity)

index_opt

index value of the index optimal clustering

indices

calculated indices for $k=2,...,k_{max}$

kp_obj

if(kp_obj == "optimal") the kproto object of the index optimal clustering and if(kp_obj == "all") all kproto which were calculated

For computing the index-value for a given k-Prototype clustering the output contains:

index

calculated index-value

Details

More information about the implemented validation indices:

cindex $$Cindex = \frac{S_w-S_{min}}{S_{max}-S_{min}}$$ For $S_{min}$ and $S_{max}$ it is nessesary to calculate the distances between all pairs of points in the entire data set ($\frac{n(n-1)}{2}$). $S_{min}$ is the sum of the "total number of pairs of objects belonging to the same cluster" smallest distances and $S_{max}$ is the sum of the "total number of pairs of objects belonging to the same cluster" largest distances. $S_w$ is the sum of the within-cluster distances. The minimum value of the index is used to indicate the optimal number of clusters.
dunn$$Dunn = \frac{\min_{1 \leq i < j \leq q} d(C_i, C_j)}{\max_{1 \leq k \leq q} diam(C_k)}$$ The following applies: The dissimilarity between the two clusters $C_i$ and $C_j$ is defined as $d(C_i, C_j)=\min_{x \in C_i, y \in C_j} d(x,y)$ and the diameter of a cluster is defined as $diam(C_k)=\max_{x,y \in C} d(x,y)$. The maximum value of the index is used to indicate the optimal number of clusters.
gamma$$Gamma = \frac{s(+)-s(-)}{s(+)+s(-)}$$ Comparisons are made between all within-cluster dissimilarities and all between-cluster dissimilarities. $s(+)$ is the number of concordant comparisons and $s(-)$ is the number of discordant comparisons. A comparison is named concordant (resp. discordant) if a within-cluster dissimilarity is strictly less (resp. strictly greater) than a between-cluster dissimilarity. The maximum value of the index is used to indicate the optimal number of clusters.
gplus$$Gplus = \frac{2 \cdot s(-)}{\frac{n(n-1)}{2} \cdot (\frac{n(n-1)}{2}-1)}$$ Comparisons are made between all within-cluster dissimilarities and all between-cluster dissimilarities. $s(-)$ is the number of discordant comparisons and a comparison is named discordant if a within-cluster dissimilarity is strictly greater than a between-cluster dissimilarity. The minimum value of the index is used to indicate the optimal number of clusters.
mcclain$$McClain = \frac{\bar{S}_w}{\bar{S}_b}$$ $\bar{S}_w$ is the sum of within-cluster distances divided by the number of within-cluster distances and $\bar{S}_b$ is the sum of between-cluster distances divided by the number of between-cluster distances. The minimum value of the index is used to indicate the optimal number of clusters.
ptbiserial$$Ptbiserial = \frac{(\bar{S}_b-\bar{S}_w) \cdot (\frac{N_w \cdot N_b}{N_t^2})^{0.5}}{s_d}$$ $\bar{S}_w$ is the sum of within-cluster distances divided by the number of within-cluster distances and $\bar{S}_b$ is the sum of between-cluster distances divided by the number of between-cluster distances. $N_t$ is the total number of pairs of objects in the data, $N_w$ is the total number of pairs of objects belonging to the samecluster and $N_b$ is the total number of pairs of objects belonging to different clusters. $s_d$ is the standard deviation of all distances. The maximum value of the index is used to indicate the optimal number of clusters.
silhouette$$Silhouette = \frac{1}{n} \sum_{i=1}^n \frac{b(i)-a(i)}{max(a(i),b(i))}$$ $a(i)$ is the average dissimilarity of the ith object to all other objects of the same/own cluster. $b(i)=min(d(i,C))$, where $d(i,C)$ is the average dissimilarity of the ith object to all the other clusters except the own/same cluster. The maximum value of the index is used to indicate the optimal number of clusters.
tau$$Tau = \frac{s(+) - s(-)}{((\frac{N_t(N_t-1)}{2}-t)\frac{N_t(N_t-1)}{2})^{0.5}}$$ Comparisons are made between all within-cluster dissimilarities and all between-cluster dissimilarities. $s(+)$ is the number of concordant comparisons and $s(-)$ is the number of discordant comparisons. A comparison is named concordant (resp. discordant) if a within-cluster dissimilarity is strictly less (resp. strictly greater) than a between-cluster dissimilarity. $N_t$ is the total number of distances $\frac{n(n-1)}{2}$ and $t$ is the number of comparisons of two pairs of objects where both pairs represent within-cluster comparisons or both pairs are between-cluster comparisons. The maximum value of the index is used to indicate the optimal number of clusters.

References

Aschenbruck, R., Szepannek, G. (2020): Cluster Validation for Mixed-Type Data. Archives of Data Science, Series A, Vol 6, Issue 1. https://doi.org/10.5445/KSP/1000098011/02.
Charrad, M., Ghazzali, N., Boiteau, V., Niknafs, A. (2014): NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software, Vol 61, Issue 6. www.jstatsoft.org.

Examples

Run this code

# NOT RUN {
# generate toy data with factors and numerics
n   <- 10
prb <- 0.99
muk <- 2.5 

x1 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)
x2 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)
x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x <- data.frame(x1,x2,x3,x4)


# calculate optimal number of cluster, index values and clusterpartition with Silhouette-index
val <- validation_kproto(method = "silhouette", data = x, k = 3:5, nstart = 5)


# apply k-prototypes
kpres <- kproto(x, 4, keep.data = TRUE)

# calculate cindex-value for the given clusterpartition
cindex_value <- validation_kproto(method = "cindex", object = kpres)

# }

Run the code above in your browser using DataLab