gMADD_DI: Modified K-Means Algorithm by Using a New Dissimilarity Measure, MADD and DUNN Index

Description

Performs the modified K-means algorithm using a new dissimilarity measure, MADD, together with the DUNN index, and returns the estimated cluster (class) labels of the observations along with the corresponding DUNN index.

Usage

gMADD_DI(s_psi, s_h, kmax, lb, M)

Arguments

s_psi

selector for the function \(\psi\) used in clustering: 1 for \(t^2\), 2 for \(1-\exp(-t)\), 3 for \(1-\exp(-t^2)\), 4 for \(\log(1+t)\), 5 for \(t\)

s_h

selector for the function \(h\) used in clustering: 1 for \(\sqrt t\), 2 for \(t\)

kmax

maximum number of clusters considered when estimating the total number of clusters in the pooled observations

lb

each observation is partitioned into smaller vectors (blocks) of the same length \(lb\)

M

\(n\times d\) observation matrix of the pooled sample; the observations should be grouped by their respective classes
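To make the roles of `s_psi` and `s_h` more concrete, here is a minimal sketch of a MADD-type dissimilarity in base R. It assumes the definition from the cited references: a base dissimilarity \(\varphi(x,y) = h\{d^{-1}\sum_q \psi(|x_q - y_q|)\}\), and MADD as the mean absolute difference of \(\varphi\) taken over all remaining observations. It also assumes `lb = 1`. The helpers `psi_funs`, `h_funs`, `phi_matrix` and `madd_matrix` are hypothetical names for illustration and are not part of the package; the package's internal computation may differ.

```r
# Hypothetical helpers for illustration only; not the package implementation.
psi_funs <- list(
  function(t) t^2,            # s_psi = 1
  function(t) 1 - exp(-t),    # s_psi = 2
  function(t) 1 - exp(-t^2),  # s_psi = 3
  function(t) log(1 + t),     # s_psi = 4
  function(t) t               # s_psi = 5
)
h_funs <- list(sqrt, identity)  # s_h = 1, s_h = 2

# Assumed base dissimilarity: h(mean over coordinates of psi(|x_q - y_q|)), lb = 1.
phi_matrix <- function(M, s_psi, s_h) {
  psi <- psi_funs[[s_psi]]
  h <- h_funs[[s_h]]
  n <- nrow(M)
  out <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      out[i, j] <- h(mean(psi(abs(M[i, ] - M[j, ]))))
    }
  }
  out
}

# Assumed MADD: mean over m != i, j of |phi(i, m) - phi(j, m)| (needs n >= 3).
madd_matrix <- function(phi) {
  n <- nrow(phi)
  out <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (i != j) {
        others <- setdiff(seq_len(n), c(i, j))
        out[i, j] <- mean(abs(phi[i, others] - phi[j, others]))
      }
    }
  }
  out
}

set.seed(1)
M <- rbind(matrix(rnorm(5 * 50, mean = 0), 5, 50),
           matrix(rnorm(5 * 50, mean = 2), 5, 50))
phi <- phi_matrix(M, s_psi = 1, s_h = 1)  # scaled Euclidean distance under these choices
rho <- madd_matrix(phi)                   # 10 x 10 MADD-type dissimilarity matrix
```

Under these assumptions, `s_psi = 1` with `s_h = 1` reduces \(\varphi\) to the Euclidean distance scaled by \(\sqrt d\).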

Value

a \(kmax \times (n+1)\) matrix: in each row, the first \(n\) columns hold the estimated cluster (class) labels of the observations and the last column holds the corresponding DUNN index
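The layout of the returned matrix can be illustrated with a toy stand-in. The values below are made up for illustration and are not package output; the sketch only shows how the label columns and the DUNN index column can be separated.

```r
# Toy stand-in for the returned matrix (made-up values, not package output):
# kmax = 3 rows, n = 6 observations, last column holds the DUNN index.
res <- rbind(
  c(0, 0, 0, 0, 0, 0, 0.00),
  c(0, 0, 0, 1, 1, 1, 1.80),
  c(0, 0, 1, 1, 2, 2, 0.40)
)

n <- ncol(res) - 1
labels <- res[, seq_len(n), drop = FALSE]  # kmax x n matrix of cluster labels
dunn <- res[, n + 1]                       # length-kmax vector of DUNN indexes

# Number of distinct labels used in each row
n_labels <- apply(labels, 1, function(r) length(unique(r)))
```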

Details

The DUNN index is normally used for cluster validation, but here it is used to estimate the total number of clusters \(k\) by \(\hat k = \operatorname{argmax}_{2\le k' \le k^*} DI(k')\), where \(DI(k')\) denotes the DUNN index of the clustering with \(k'\) clusters and we use \(k^* = 2k\).
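The selection rule can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the Dunn index is taken as the smallest between-cluster distance divided by the largest within-cluster distance, plain Euclidean distance stands in for MADD, and the candidate labelings are written by hand rather than produced by the modified K-means step. The name `dunn_index` is hypothetical.

```r
# Hypothetical helper: Dunn index from a distance matrix D and a label vector.
# Assumes at least two clusters are present.
dunn_index <- function(D, labels) {
  same <- outer(labels, labels, "==")       # TRUE where two points share a cluster
  off_diag <- row(D) != col(D)
  min_between <- min(D[!same])              # closest pair from different clusters
  max_within <- max(D[same & off_diag], 0)  # widest pair inside any cluster
  min_between / max_within
}

x <- c(0, 1, 2, 10, 11, 12)  # six points on a line, two obvious groups
D <- as.matrix(dist(x))      # Euclidean distance as a stand-in for MADD

# Hand-written candidate labelings for k' = 2 and k' = 3
candidates <- list(
  "2" = c(0, 0, 0, 1, 1, 1),
  "3" = c(0, 0, 1, 1, 2, 2)
)

di <- sapply(candidates, function(l) dunn_index(D, l))
k_hat <- as.integer(names(di)[which.max(di)])  # argmax of the Dunn index over k'
```

Here the two-cluster labeling separates the groups cleanly, so it attains the larger index and `k_hat` is 2.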

References

Biplab Paul, Shyamal K De and Anil K Ghosh (2021). Some clustering based exact distribution-free k-sample tests applicable to high dimension, low sample size data, Journal of Multivariate Analysis, doi:10.1016/j.jmva.2021.104897.

Soham Sarkar and Anil K Ghosh (2019). On perfect clustering of high dimension, low sample size data, IEEE transactions on pattern analysis and machine intelligence, doi:10.1109/TPAMI.2019.2912599.

Joseph C Dunn (1973). A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters, doi:10.1080/01969727308546046.

Examples

# NOT RUN {
  # Modified K-means algorithm:
  # multivariate normal distribution
  # generate data with dimension d = 500
  set.seed(151)
  n1=n2=n3=n4=10
  d = 500
  I1 <- matrix(rnorm(n1*d,mean=0,sd=1),n1,d)
  I2 <- matrix(rnorm(n2*d,mean=0.5,sd=1),n2,d) 
  I3 <- matrix(rnorm(n3*d,mean=1,sd=1),n3,d) 
  I4 <- matrix(rnorm(n4*d,mean=1.5,sd=1),n4,d) 
  n_cl <- 4
  N <- n1+n2+n3+n4
  X <- as.matrix(rbind(I1,I2,I3,I4)) 
  dvec_di_mat <-  gMADD_DI(1,1,2*n_cl,1,X)
  est_no_cl <- which.max(dvec_di_mat[ ,(N+1)])
  dvec_di_mat[est_no_cl,1:N]

   ## outputs:
   #[1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
   
# }
