AIBcat: Cluster Categorical Data Using the Agglomerative Information Bottleneck Algorithm

Description

The AIBcat function implements the Agglomerative Information Bottleneck (AIB) algorithm for hierarchical clustering of datasets containing categorical variables. This method merges clusters so that information retention is maximised at each step to create meaningful clusters, leveraging bandwidth parameters to handle different categorical data types (nominal and ordinal) effectively slonim_aib_1999IBclust.

Usage

AIBcat(X, lambda = -1)

Value

A list containing the following elements:

merges: A data frame with 2 columns and $n$ rows, showing which observations are merged at each step.
merge_costs: A numeric vector tracking the cost incurred by each merge $I(Z_{m} ; Y) - I(Z_{m-1} ; Y)$.
partitions: A list containing $n$ sub-lists. Each sub-list includes the cluster partition at each step.
I_Z_Y: A numeric vector including the mutual information $I(Z_{m}; Y)$ as the number of clusters $m$ increases.
I_X_Y: A numeric value of the mutual information $I(X; Y)$ between observation indices and location.
info_ret: A numeric vector of length $n$ including the fraction of the original information retained after each merge.
dendrogram: A dendrogram visualising the cluster hierarchy. The height is determined by the cost of cluster merges.

Arguments

X: A data frame containing the categorical data to be clustered. All variables should be categorical, either factor (for nominal variables) or ordered (for ordinal variables).
lambda: A numeric value or vector specifying the bandwidth parameter for categorical variables. The default value is $-1$, which enables automatic determination of the optimal bandwidth. For nominal variables, the maximum allowable value of lambda is $(l - 1)/l$, where $l$ represents the number of categories. For ordinal variables, the maximum allowable value of lambda is $1$.

Author

Efthymios Costa, Ioanna Papatsouma, Angelos Markos

Details

The AIBcat function applies the Agglomerative Information Bottleneck algorithm to do hierarchical clustering of datasets containing only categorical variables, both nominal and ordinal. The algorithm uses an information-theoretic criterion to merge clusters so that information retention is maximised at each step to create meaningful clusters with maximal information about the original distribution.

To estimate the distributions of categorical features, the function utilizes specialized kernel functions, as follows:

$$K_u(x = x'; \lambda) = \begin{cases} 1 - \lambda, & \text{if } x = x' \\ \frac{\lambda}{\ell - 1}, & \text{otherwise} \end{cases}, \quad 0 \leq \lambda \leq \frac{\ell - 1}{\ell},$$ where $\ell$ is the number of categories, and $\lambda$ controls the smoothness of the Aitchison & Aitken kernel for nominal variables aitchison_kernel_1976IBclust.

$$K_o(x = x'; \nu) = \begin{cases} 1, & \text{if } x = x' \\ \nu^{|x - x'|}, & \text{otherwise} \end{cases}, \quad 0 \leq \nu \leq 1,$$ where $\nu$ is the bandwidth parameter for ordinal variables, accounting for the ordinal relationship between categories li_nonparametric_2003IBclust.

Here, $\lambda$, and $\nu$ are bandwidth or smoothing parameters, while $\ell$ is the number of levels of the categorical variable. The lambda parameter is automatically determined by the algorithm if not provided by the user. For ordinal variables, the lambda parameter of the function is used to define $\nu$.

References

slonim_aib_1999IBclustaitchison_kernel_1976IBclustli_nonparametric_2003IBclust

Examples

Run this code

# Simulated categorical data
set.seed(123)
X <- data.frame(
  Var1 = as.factor(sample(letters[1:3], 200, replace = TRUE)),  # Nominal variable
  Var2 = as.factor(sample(letters[4:6], 200, replace = TRUE)),  # Nominal variable
  Var3 = factor(sample(c("low", "medium", "high"), 200, replace = TRUE),
                levels = c("low", "medium", "high"), ordered = TRUE)  # Ordinal variable
)

# Run AIBcat with automatic lambda selection 
result <- AIBcat(X = X, lambda = -1)

# Print clustering results
plot(result$dendrogram, xlab = "", sub = "")  # Plot dendrogram

Run the code above in your browser using DataLab