
nomclust (version 2.1.6)

nomclust: Hierarchical Cluster Analysis for Nominal Data

Description

The nomclust() function runs hierarchical cluster analysis (HCA) on objects characterized by nominal (categorical) variables. It covers the whole clustering process, from the calculation of the proximity matrix to the evaluation of the clustering quality. The function offers thirteen similarity measures for nominal data, either summarized in Boriah et al. (2008) or introduced by Morlini and Zani (2012) and by Sulc and Rezankova (2019). It provides three linkage methods suitable for categorical data. The obtained clusters can be assessed by seven evaluation criteria, see Sulc et al. (2018). The output of the nomclust() function may serve as an input for the visualization functions in the nomclust package.
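
A minimal quick-start sketch, using the data20 dataset shipped with the package and the default settings shown in Usage below; the object name hca is illustrative only.

library(nomclust)
data(data20)

# run HCA with the defaults (Lin measure, average linkage, up to 6 clusters)
hca <- nomclust(data20)

# cluster-membership variables for 2 to 6 clusters, viewed as a data frame
head(as.data.frame(hca$mem))

# evaluation criteria and the suggested numbers of clusters
hca$eval
hca$opt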

Usage

nomclust(
  data,
  measure = "lin",
  method = "average",
  clu.high = 6,
  eval = TRUE,
  prox = 100,
  opt = TRUE
)

Arguments

data

A data.frame or a matrix with cases in rows and variables in columns.

measure

A character string defining the similarity measure used for the computation of the proximity matrix in HCA: "eskin", "good1", "good2", "good3", "good4", "iof", "lin", "lin1", "morlini", "of", "sm", "ve", "vm".

method

A character string defining the clustering method. The following methods can be used: "average", "complete", "single".

clu.high

A numeric value expressing the maximal number of clusters for which the cluster-membership variables are produced.

eval

A logical value; if TRUE, an evaluation of the clustering results is performed.

prox

A logical or a numeric value. The logical value TRUE indicates that the proximity matrix should be a part of the output. A numeric (integer) value indicates the maximal number of cases in a dataset for which the proximity matrix is included in the output.

opt

A logical value; if TRUE, the time-optimization method is run to substantially decrease the computation time of the dissimilarity matrix calculation. The time optimization cannot be run if the proximity matrix is to be produced; in such a case, this parameter is automatically set to FALSE.

Value

The function returns a list with up to five components.

The mem component contains cluster membership partitions for the selected numbers of clusters in the form of a list.

The eval component contains seven evaluation criteria as vectors in a list, namely the within-cluster mutability coefficient (WCM), the within-cluster entropy coefficient (WCE), the pseudo F indices based on the mutability (PSFM) and on the entropy (PSFE), the Bayesian (BIC) and Akaike (AIC) information criteria for categorical data, and the BK index. To see them all at once, the form of a data.frame is more appropriate.
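
For instance, assuming hca.object is the result of a nomclust() call with eval = TRUE (as in the Examples below), the criteria can be viewed jointly:

# all seven criteria side by side, one row per number of clusters
# (assumes the vectors stored in the eval list are of equal length)
as.data.frame(hca.object$eval)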

The opt component is present in the output together with the eval component. It displays the optimal number of clusters for the evaluation criteria from the eval component, except for WCM and WCE, where the optimal number of clusters is based on the elbow method.

The prox component contains the dissimilarity matrix in the form of a matrix.

The dend component can be found in the output only together with the prox component. It contains all the necessary information for dendrogram creation.
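
As a sketch of a possible follow-up step, the output can be passed to the package's visualization functions listed in See Also; the calls below rely on their default arguments, which is an assumption, so consult their help pages for details.

# plot of the evaluation criteria (requires eval = TRUE in nomclust())
eval.plot(hca.object)

# dendrogram built from the dend component (requires the proximity matrix, i.e. prox = TRUE)
dend.plot(hca.object)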

References

Boriah S., Chandola V. and Kumar, V. (2008). Similarity measures for categorical data: A comparative evaluation. In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254.

Morlini I. and Zani S. (2012). A new class of weighted similarity indices using polytomous variables. Journal of Classification, 29(2), p. 199-226.

Sulc Z., Cibulkova J., Prochazka J., Rezankova H. (2018). Internal Evaluation Criteria for Categorical Data in Hierarchical Clustering: Optimal Number of Clusters Determination, Metodoloski Zveski, 15(2), p. 1-20.

Sulc Z. and Rezankova H. (2019). Comparison of Similarity Measures for Categorical Data in Hierarchical Clustering. Journal of Classification, 35(1), p. 58-72. DOI: 10.1007/s00357-019-09317-5.

See Also

evalclust, nomprox, eval.plot, dend.plot.

Examples

# NOT RUN {
# sample data
data(data20)

# creating an object with the results of hierarchical clustering
hca.object <- nomclust(data20, measure = "lin", method = "average",
 clu.high = 5, prox = TRUE, opt = FALSE)

# obtaining values of evaluation indices
data20.eval <- hca.object$eval

# getting the optimal numbers of clusters
data20.opt <- hca.object$opt

# extracting cluster membership variables
data20.mem <- hca.object$mem

# extracting cluster membership variables as a data frame
data20.mem <- as.data.frame(hca.object$mem)

# obtaining a proximity matrix
data20.prox <- hca.object$prox

# setting the maximal number of objects for which a proximity matrix is provided in the output to 30
hca.object <- nomclust(data20, measure = "lin", method = "average",
 clu.high = 5, prox = 30, opt = FALSE)

# generating a larger dataset containing repeatedly occurring objects
set.seed(150)
sample150 <- sample(1:nrow(data20), 150, replace = TRUE)
data150 <- data20[sample150, ]

# running hierarchical clustering WITH the time optimization
start <- Sys.time()
hca.object.opt.T <- nomclust(data150, measure = "lin", opt = TRUE)
end <- Sys.time()
end - start

# running hierarchical clustering WITHOUT the time optimization
start <- Sys.time()
hca.object.opt.F <- nomclust(data150, measure = "lin", opt = FALSE)
end <- Sys.time()
end - start

# }
