AIBmix: Agglomerative Information Bottleneck Clustering for Mixed-Type Data

Description

The AIBmix function implements the Agglomerative Information Bottleneck (AIB) algorithm for hierarchical clustering of datasets containing mixed-type variables, including categorical (nominal and ordinal) and continuous variables. This method merges clusters so that information retention is maximised at each step to create meaningful clusters, leveraging bandwidth parameters to handle different categorical data types (nominal and ordinal) effectively slonim_aib_1999IBclust.

Usage

AIBmix(X, catcols, contcols, lambda = -1, s = -1, scale = TRUE)

Value

A list containing the following elements:

merges: A data frame with 2 columns and $n$ rows, showing which observations are merged at each step.
merge_costs: A numeric vector tracking the cost incurred by each merge $I(Z_{m} ; Y) - I(Z_{m-1} ; Y)$.
partitions: A list containing $n$ sub-lists. Each sub-list includes the cluster partition at each step.
I_Z_Y: A numeric vector including the mutual information $I(Z_{m}; Y)$ as the number of clusters $m$ increases.
I_X_Y: A numeric value of the mutual information $I(X; Y)$ between observation indices and location.
info_ret: A numeric vector of length $n$ including the fraction of the original information retained after each merge.
dendrogram: A dendrogram visualising the cluster hierarchy. The height is determined by the cost of cluster merges.

Arguments

X: A data frame containing the categorical data to be clustered. All variables should be categorical, either factor (for nominal variables) or ordered (for ordinal variables).
catcols: A vector indicating the indices of the categorical variables in X.
contcols: A vector indicating the indices of the continuous variables in X.
lambda: A numeric value or vector specifying the bandwidth parameter for categorical variables. The default value is $-1$, which enables automatic determination of the optimal bandwidth. For nominal variables, the maximum allowable value of lambda is $(l - 1)/l$, where $l$ represents the number of categories. For ordinal variables, the maximum allowable value of lambda is $1$.
s: A numeric value or vector specifying the bandwidth parameter(s) for continuous variables. The values must be greater than $0$. The default value is $-1$, which enables the automatic selection of optimal bandwidth(s).
scale: A logical value indicating whether the continuous variables should be scaled to have unit variance before clustering. Defaults to TRUE.

Author

Efthymios Costa, Ioanna Papatsouma, Angelos Markos

Details

The AIBmix function produces a hierarchical agglomerative clustering of the data while retaining maximal information about the original variable distributions. The Agglomerative Information Bottleneck algorithm uses an information-theoretic criterion to merge clusters so that information retention is maximised at each step, hence creating meaningful clusters with maximal information about the original distribution. Bandwidth parameters for categorical (nominal, ordinal) and continuous variables are adaptively determined if not provided. This process identifies stable and interpretable cluster assignments by maximizing mutual information while controlling complexity. The method is well-suited for datasets with mixed-type variables and integrates information from all variable types effectively.

The following kernel functions are used to estimate densities for the clustering procedure:

Continuous variables: Gaussian kernel $$K_c\left(\frac{x-x'}{s}\right) = \frac{1}{\sqrt{2\pi}} \exp\left\{ - \frac{\left(x-x'\right)^2}{2s^2} \right\}, \quad s > 0.$$
Nominal categorical variables: Aitchison & Aitken kernel $$K_u\left(x = x' ; \lambda\right) = \begin{cases} 1-\lambda & \text{if } x = x' \\ \frac{\lambda}{\ell-1} & \text{otherwise} \end{cases}, \quad 0 \leq \lambda \leq \frac{\ell-1}{\ell}.$$
Ordinal categorical variables: Li & Racine kernel $$K_o\left(x = x' ; \nu\right) = \begin{cases} 1 & \text{if } x = x' \\ \nu^{|x - x'|} & \text{otherwise} \end{cases}, \quad 0 \leq \nu \leq 1.$$

References

slonim_aib_1999IBclustaitchison_kernel_1976IBclustli_nonparametric_2003IBclustsilverman_density_1998IBclust

Examples

Run this code

# Example dataset with categorical, ordinal, and continuous variables
set.seed(123)
data <- data.frame(
  cat_var = factor(sample(letters[1:3], 100, replace = TRUE)),      # Nominal categorical variable
  ord_var = factor(sample(c("low", "medium", "high"), 100, replace = TRUE),
                   levels = c("low", "medium", "high"),
                   ordered = TRUE),                                # Ordinal variable
  cont_var1 = rnorm(100),                                          # Continuous variable 1
  cont_var2 = runif(100)                                           # Continuous variable 2
)

# Perform Mixed-Type Hierarchical Clustering with Agglomerative IB
result <- AIBmix(X = data, catcols = 1:2, contcols = 3:4, lambda = -1, s = -1, scale = TRUE)

# Print clustering results
plot(result$dendrogram, xlab = "", sub = "")  # Plot dendrogram

Run the code above in your browser using DataLab