AIBmix: Agglomerative Information Bottleneck Clustering for Mixed-Type Data

Description

The AIBmix function implements the Agglomerative Information Bottleneck (AIB) algorithm (Slonim and Tishby, 1999) for hierarchical clustering of datasets containing mixed-type variables: categorical (nominal and ordinal) and continuous. At each step the method merges the pair of clusters whose merger retains the most information, and it uses kernel bandwidth parameters to handle the different variable types.

Usage

AIBmix(X, s = -1, lambda = -1,
       scale = TRUE, contkernel = "gaussian",
       nomkernel = "aitchisonaitken", ordkernel = "liracine",
       cat_first = FALSE)

Value

An object of class "aibclust" representing the final clustering result. The returned object is a list with the following components:

merges: A data frame with 2 columns and $n$ rows, showing which observations are merged at each step.
merge_costs: A numeric vector tracking the cost incurred by each merge $I(T_{m} ; Y) - I(T_{m-1} ; Y)$.
partitions: A list containing $n$ sub-lists. Each sub-list includes the cluster partition at each step.
I_T_Y: A numeric vector including the mutual information $I(T_{m}; Y)$ as the number of clusters $m$ increases.
I_X_Y: A numeric value of the mutual information $I(X; Y)$ between observation indices and location.
info_ret: A numeric vector of length $n$ including the fraction of the original information retained after each merge.
s: A numeric vector of bandwidth parameters used for the continuous variables. A value of $-1$ is returned if all variables are categorical.
lambda: A numeric vector of bandwidth parameters used for the categorical variables. A value of $-1$ is returned if all variables are continuous.
call: The matched call.
n: Number of observations.
contcols: Indices of continuous columns in X.
catcols: Indices of categorical columns in X.
kernels: List with names of kernels used for continuous, nominal, and ordinal features.
obs_names: Names of rows in X; used for plotting the cluster hierarchy using a dendrogram.

Objects of class "aibclust" support the following methods:

print.aibclust: Display a concise description of the cluster hierarchy.
summary.aibclust: Show detailed information including cluster sizes for 2, 3, and 5 clusters, information-theoretic metrics, and hyperparameters.
plot.aibclust: Produce diagnostic plots:
- type = "dendrogram": dendrogram visualising the hierarchy of partitions obtained.
- type = "info": information retention curve; the proportion of information preserved $I(T_m;Y)/I(X;Y)$ by the clustering $T_m$ is plotted against the number of clusters $m$.
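
The retention ratio plotted by type = "info" can be reproduced by hand. The sketch below uses made-up numbers, not output from AIBmix, and assumes that position $m$ of the vector corresponds to $m$ clusters. The 75% threshold is only one possible way to read the curve.

```r
# Hypothetical values for illustration only (NOT produced by AIBmix).
I_X_Y <- 1.2                                   # I(X; Y)
I_T_Y <- c(0, 0.55, 0.80, 0.95, 1.05, 1.2)     # I(T_m; Y) for m = 1, ..., 6

# Fraction of the original information preserved by each partition T_m.
info_ret <- I_T_Y / I_X_Y
m <- seq_along(I_T_Y)

# Smallest number of clusters that keeps at least 75% of the information.
m_star <- min(m[info_ret >= 0.75])
m_star
```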

Arguments

X: A data frame containing the data to be clustered. Variables should be of type numeric (for continuous variables), factor (for nominal variables) or ordered (for ordinal variables).
s: A numeric value or vector specifying the bandwidth parameter(s) for continuous variables. User-supplied values must be greater than $0$. The default value of $-1$ enables automatic selection of the optimal bandwidth(s). Argument is ignored when no variables are continuous.
lambda: A numeric value or vector specifying the bandwidth parameter for categorical variables. The default value is $-1$, which enables automatic determination of the optimal bandwidth. For nominal variables and nomkernel = 'aitchisonaitken', the maximum allowable value of lambda is $(l - 1)/l$, where $l$ represents the number of categories, whereas for nomkernel = 'liracine' the maximum allowable value is $1$. For ordinal variables, the maximum allowable value of lambda is $1$, regardless of what ordkernel is being used. Argument is ignored when all variables are continuous.
scale: A logical value indicating whether the continuous variables should be scaled to have unit variance before clustering. Defaults to TRUE. Argument is ignored when all variables are categorical.
contkernel: Kernel used for continuous variables. Can be one of gaussian (default) or epanechnikov. Argument is ignored when no variables are continuous.
nomkernel: Kernel used for nominal (unordered categorical) variables. Can be one of aitchisonaitken (default) or liracine. Argument is ignored when no variables are nominal.
ordkernel: Kernel used for ordinal (ordered categorical) variables. Can be one of liracine (default) or wangvanryzin. Argument is ignored when no variables are ordinal.
cat_first: A logical value indicating whether bandwidth selection is prioritised for the categorical variables, instead of the continuous. Defaults to FALSE. Set to TRUE if you suspect that the continuous variables are not informative of the cluster structure. Can only be TRUE when all bandwidths are selected automatically (i.e. s = -1, lambda = -1).
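
The upper bounds on lambda described above can be checked before calling AIBmix. The following is a minimal base R sketch; lambda_upper is a hypothetical helper written for this illustration and is not part of the package.

```r
# Hypothetical helper (not part of IBclust): largest allowable lambda per column.
lambda_upper <- function(X, nomkernel = "aitchisonaitken") {
  vapply(X, function(col) {
    if (is.ordered(col)) {
      1                                  # ordinal: bound is 1 for either ordkernel
    } else if (is.factor(col)) {
      l <- nlevels(col)                  # number of categories
      if (nomkernel == "aitchisonaitken") (l - 1) / l else 1
    } else {
      NA_real_                           # continuous: lambda does not apply
    }
  }, numeric(1))
}

X_demo <- data.frame(
  colour = factor(c("r", "g", "b", "g")),
  grade  = factor(c("low", "high", "mid", "low"),
                  levels = c("low", "mid", "high"), ordered = TRUE),
  height = c(1.2, 0.4, 2.2, 1.9)
)
bounds <- lambda_upper(X_demo)
bounds
```

Ordered factors are tested first because is.factor() is also TRUE for them.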

Author

Efthymios Costa, Ioanna Papatsouma, Angelos Markos

Details

The AIBmix function produces a hierarchical agglomerative clustering of the data. At each step, the Agglomerative Information Bottleneck algorithm uses an information-theoretic criterion to choose which clusters to merge, so that the resulting partition retains as much information as possible about the original distribution. Bandwidth parameters for categorical (nominal, ordinal) and continuous variables are determined adaptively if not provided. The method is well suited to datasets with mixed-type variables, since it integrates information from all variable types.
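
To make the merging criterion concrete, the sketch below computes the information lost by merging two clusters, using the criterion from the Agglomerative Information Bottleneck literature: the combined cluster weight times a weighted Jensen-Shannon divergence between the two conditional distributions. The function names kl_div and merge_cost are hypothetical, natural logarithms are an assumption, and this is not necessarily how the package computes the cost internally.

```r
# Kullback-Leibler divergence in nats, with the convention 0 * log(0 / q) = 0.
kl_div <- function(p, q) {
  keep <- p > 0
  sum(p[keep] * log(p[keep] / q[keep]))
}

# Information lost when clusters i and j are merged.
# p_i, p_j  : cluster probabilities p(t_i), p(t_j)
# py_i, py_j: conditional distributions p(y | t_i), p(y | t_j)
merge_cost <- function(p_i, p_j, py_i, py_j) {
  w_i <- p_i / (p_i + p_j)
  w_j <- p_j / (p_i + p_j)
  py_bar <- w_i * py_i + w_j * py_j      # distribution of the merged cluster
  (p_i + p_j) * (w_i * kl_div(py_i, py_bar) + w_j * kl_div(py_j, py_bar))
}

# Identical conditionals cost nothing to merge; disjoint ones are expensive.
cost_same     <- merge_cost(0.5, 0.5, c(0.5, 0.5), c(0.5, 0.5))
cost_disjoint <- merge_cost(0.5, 0.5, c(1, 0), c(0, 1))
```

An agglomerative pass would evaluate this cost for every pair of current clusters and merge the pair with the smallest value.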

The following kernel functions can be used to estimate densities for the clustering procedure. For continuous variables:

Gaussian (RBF) kernel (Silverman, 1998): $$K_c\left(\frac{x - x'}{s}\right) = \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{\left(x - x'\right)^2}{2s^2}\right\}, \quad s > 0.$$
Epanechnikov kernel (Epanechnikov, 1969): $$K_c\left(\frac{x - x'}{s}\right) = \begin{cases} \frac{3}{4\sqrt{5}}\left(1 - \frac{(x-x')^2}{5s^2} \right), & \text{if } \frac{(x - x')^2}{s^2} < 5 \\ 0, & \text{otherwise} \end{cases}, \quad s > 0.$$

For nominal (unordered categorical) variables:

Aitchison & Aitken kernel (Aitchison and Aitken, 1976): $$K_u(x = x'; \lambda) = \begin{cases} 1 - \lambda, & \text{if } x = x' \\ \frac{\lambda}{\ell - 1}, & \text{otherwise} \end{cases}, \quad 0 \leq \lambda \leq \frac{\ell - 1}{\ell}.$$
Li & Racine kernel (Ouyang et al., 2006): $$K_u(x = x'; \lambda) = \begin{cases} 1, & \text{if } x = x' \\ \lambda, & \text{otherwise} \end{cases}, \quad 0 \leq \lambda \leq 1.$$

For ordinal (ordered categorical) variables:

Li & Racine kernel (Li and Racine, 2003): $$K_o(x = x'; \nu) = \begin{cases} 1, & \text{if } x = x' \\ \nu^{|x - x'|}, & \text{otherwise} \end{cases}, \quad 0 \leq \nu \leq 1.$$
Wang & van Ryzin kernel (Wang and van Ryzin, 1981): $$K_o(x = x'; \nu) = \begin{cases} 1 - \nu, & \text{if } x = x' \\ \frac{1-\nu}{2}\nu^{|x - x'|}, & \text{otherwise} \end{cases}, \quad 0 \leq \nu \leq 1.$$

The bandwidth parameters $s$, $\lambda$, and $\nu$ control the smoothness of the density estimate. If not provided by the user, they are selected automatically using the approach of Costa et al. (2025). $\ell$ is the number of levels of the categorical variable. For ordinal variables, the lambda argument of the function is used to define $\nu$.
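
The kernel formulas above translate directly into base R. The sketch below is for illustration only: the function names are hypothetical, not exports of the package, and the ordinal kernels assume that levels have been converted to integer codes, for example with as.integer().

```r
# Continuous kernels; x and x_prime are numeric, s > 0.
gaussian_kernel <- function(x, x_prime, s) {
  exp(-(x - x_prime)^2 / (2 * s^2)) / sqrt(2 * pi)
}
epanechnikov_kernel <- function(x, x_prime, s) {
  u2 <- (x - x_prime)^2 / s^2
  ifelse(u2 < 5, 3 / (4 * sqrt(5)) * (1 - u2 / 5), 0)
}

# Nominal kernels; x and x_prime are character labels, l is the number of levels.
aitchison_aitken_kernel <- function(x, x_prime, lambda, l) {
  ifelse(x == x_prime, 1 - lambda, lambda / (l - 1))
}
li_racine_nominal_kernel <- function(x, x_prime, lambda) {
  ifelse(x == x_prime, 1, lambda)
}

# Ordinal kernels; x and x_prime are integer level codes.
li_racine_ordinal_kernel <- function(x, x_prime, nu) {
  nu^abs(x - x_prime)                    # equals 1 when x == x_prime
}
wang_van_ryzin_kernel <- function(x, x_prime, nu) {
  d <- abs(x - x_prime)
  ifelse(d == 0, 1 - nu, (1 - nu) / 2 * nu^d)
}

# Aitchison & Aitken weights over all levels of a 3-level variable sum to 1.
aa_weights <- aitchison_aitken_kernel("a", c("a", "b", "c"), lambda = 0.3, l = 3)
sum(aa_weights)
```

Setting lambda to its upper bound $(\ell - 1)/\ell$ in the Aitchison & Aitken kernel gives every level the same weight $1/\ell$, which is why larger values are not allowed.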

References

Aitchison and Aitken (1976)
Costa et al. (2025)
Epanechnikov (1969)
Li and Racine (2003)
Ouyang et al. (2006)
Silverman (1998)
Slonim and Tishby (1999)
Wang and van Ryzin (1981)

Examples

# Example dataset with categorical, ordinal, and continuous variables
set.seed(123)
data_mix <- data.frame(
  cat_var = factor(sample(letters[1:3], 100, replace = TRUE)),      # Nominal categorical variable
  ord_var = factor(sample(c("low", "medium", "high"), 100, replace = TRUE),
                   levels = c("low", "medium", "high"),
                   ordered = TRUE),                                # Ordinal variable
  cont_var1 = rnorm(100),                                          # Continuous variable 1
  cont_var2 = runif(100)                                           # Continuous variable 2
)

# Perform Mixed-Type Hierarchical Clustering with Agglomerative IB
result_mix <- AIBmix(X = data_mix, lambda = -1, s = -1, scale = TRUE)

# Plot clustering results
plot(result_mix, type = "dendrogram", xlab = "", sub = "", cex = 0.5)  # Plot dendrogram
plot(result_mix, type = "info", col = "black", pch = 16)  # Plot information retention curve

# Simulated categorical data example
set.seed(123)
data_cat <- data.frame(
  Var1 = as.factor(sample(letters[1:3], 200, replace = TRUE)),  # Nominal variable
  Var2 = as.factor(sample(letters[4:6], 200, replace = TRUE)),  # Nominal variable
  Var3 = factor(sample(c("low", "medium", "high"), 200, replace = TRUE),
                levels = c("low", "medium", "high"), ordered = TRUE)  # Ordinal variable
)

# Run AIBmix with automatic lambda selection
result_cat <- AIBmix(X = data_cat, lambda = -1)

# Plot clustering results
plot(result_cat, type = "dendrogram", xlab = "", sub = "", cex = 0.5)  # Plot dendrogram

# Results summary
summary(result_cat)

# Simulated continuous data example
set.seed(123)
# Continuous data with 200 observations, 5 features
data_cont <- as.data.frame(matrix(rnorm(1000), ncol = 5))

# Run AIBmix with automatic bandwidth selection
result_cont <- AIBmix(X = data_cont, s = -1, scale = TRUE)

# Print concise summary of output
print(result_cont)

# Plot clustering results
plot(result_cont, type = "dendrogram", xlab = "", sub = "", cex = 0.5)  # Plot dendrogram
