GIBmix: Generalised Information Bottleneck Clustering for Mixed-Type Data

Description

The GIBmix function implements the Generalised Information Bottleneck (GIB) algorithm for clustering datasets containing mixed-type variables, including categorical (nominal and ordinal) and continuous variables. This method optimizes an information-theoretic objective to preserve relevant information in the cluster assignments while achieving effective data compression (Strouse and Schwab, 2017).

Usage

GIBmix(X, ncl, beta, alpha, catcols, contcols, randinit = NULL,
       lambda = -1, s = -1, scale = TRUE,
       maxiter = 100, nstart = 100,
       verbose = FALSE)

Value

A list containing the following elements:

Cluster: A cluster membership matrix.
Entropy: A numeric value representing the entropy of the cluster assignment, $H(T)$.
RelEntropy: A numeric value representing the relative entropy of the cluster assignment given the observation weights, $H(T \mid X)$.
MutualInfo: A numeric value representing the mutual information, $I(Y;T)$, between the original labels ($Y$) and the cluster assignments ($T$).
beta: A numeric value of the regularisation strength beta used.
alpha: A numeric value of the strength of relative entropy used.
s: A numeric vector of bandwidth parameters used for the continuous variables.
lambda: A numeric vector of bandwidth parameters used for the categorical variables.
ht: A numeric vector tracking the entropy value of the cluster assignments across iterations.
hy_t: A numeric vector tracking the relative entropy values between the cluster assignments and observation weights across iterations.
iyt: A numeric vector tracking the mutual information values between original labels and cluster assignments across iterations.
losses: A numeric vector tracking the final loss values across iterations.
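
As a rough illustration of how a fuzzy membership matrix relates to the entropy quantities above, the sketch below works on a small hand-made matrix rather than on real GIBmix output. The orientation (observations in rows, clusters in columns), the uniform observation weights, and the base-2 logarithm are all assumptions made for this example; the names membership, hard_labels, q_t, entropy_t and cond_entropy are hypothetical.

```r
# Toy fuzzy membership matrix: 4 observations (rows) by 3 clusters (columns).
# The orientation is an assumption; transpose with t() if your matrix differs.
membership <- matrix(
  c(0.8, 0.1, 0.1,
    0.2, 0.7, 0.1,
    0.1, 0.2, 0.7,
    0.6, 0.3, 0.1),
  nrow = 4, byrow = TRUE
)

# Hard labels: the most probable cluster for each observation.
hard_labels <- max.col(membership, ties.method = "first")
hard_labels  # 1 2 3 1

# Cluster marginal q(t) under uniform observation weights (assumption),
# and its entropy H(T) in bits (log base 2 is also an assumption).
q_t <- colMeans(membership)
entropy_t <- -sum(q_t[q_t > 0] * log2(q_t[q_t > 0]))

# Conditional entropy H(T | X): the average row-wise entropy of the memberships.
row_entropy <- apply(membership, 1, function(p) -sum(p[p > 0] * log2(p[p > 0])))
cond_entropy <- mean(row_entropy)

c(entropy = entropy_t, conditional_entropy = cond_entropy)
```

The conditional entropy can never exceed the entropy of the marginal, so a large gap between the two indicates assignments that are close to hard.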

Arguments

X: A data frame containing the input data to be clustered. It should include categorical variables (factor for nominal and ordered factor, shown as Ord.factor, for ordinal) and continuous variables (numeric).
ncl: An integer specifying the number of clusters.
beta: Regularisation strength.
alpha: Strength of relative entropy term.
catcols: A vector indicating the indices of the categorical variables in X.
contcols: A vector indicating the indices of the continuous variables in X.
randinit: An optional vector specifying the initial cluster assignments. If NULL, cluster assignments are initialized randomly.
lambda: A numeric value or vector specifying the bandwidth parameter(s) for categorical variables. The default value is $-1$, which enables automatic determination of the optimal bandwidth. For nominal variables, the maximum allowable value of lambda is $(\ell - 1)/\ell$, where $\ell$ is the number of categories. For ordinal variables, the maximum allowable value of lambda is $1$.
s: A numeric value or vector specifying the bandwidth parameter(s) for continuous variables. The values must be greater than $0$. The default value is $-1$, which enables the automatic selection of optimal bandwidth(s).
scale: A logical value indicating whether the continuous variables should be scaled to have unit variance before clustering. Defaults to TRUE.
maxiter: The maximum number of iterations allowed for the clustering algorithm. Defaults to $100$.
nstart: The number of random initializations to run. The best clustering solution is returned. Defaults to $100$.
verbose: A logical value. Defaults to FALSE, which suppresses progress messages; set to TRUE to print them.
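
When supplying lambda manually rather than relying on the default of $-1$, it can help to compute the upper bounds stated above for each categorical column first. The following is a minimal sketch in base R; the data frame toy and the vector lambda_upper are hypothetical names introduced for illustration, and the sketch does not call GIBmix itself.

```r
# Hypothetical toy data: one nominal factor, one ordered factor, one numeric column.
set.seed(1)
toy <- data.frame(
  colour = factor(sample(c("red", "green", "blue", "grey"), 50, replace = TRUE),
                  levels = c("red", "green", "blue", "grey")),
  size   = factor(sample(c("S", "M", "L"), 50, replace = TRUE),
                  levels = c("S", "M", "L"), ordered = TRUE),
  height = rnorm(50)
)
catcols <- 1:2

# Upper bound on lambda: 1 for ordinal variables, (l - 1) / l for nominal ones,
# where l is the number of levels.
lambda_upper <- sapply(toy[catcols], function(v) {
  if (is.ordered(v)) 1 else (nlevels(v) - 1) / nlevels(v)
})
lambda_upper
# colour   size
#   0.75   1.00
```

Any user-chosen lambda values would then need to stay at or below these bounds for the corresponding columns.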

Author

Efthymios Costa, Ioanna Papatsouma, Angelos Markos

Details

The GIBmix function produces a fuzzy clustering of the data while retaining maximal information about the original variable distributions. The Generalised Information Bottleneck algorithm iteratively optimizes an information-theoretic objective that trades off information preservation against compression: it seeks cluster assignments that keep the mutual information with the original variables high while limiting the complexity of the clustering. Bandwidth parameters for categorical (nominal, ordinal) and continuous variables are determined automatically if not provided. The method is suited to datasets with mixed-type variables and uses information from all variable types.

Setting $\alpha = 1$ recovers the Information Bottleneck, and setting $\alpha = 0$ recovers its Deterministic variant. If $\alpha = 0$, the algorithm ignores the value of the regularisation parameter $\beta$.
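
For orientation, the generalised objective in the formulation of Strouse and Schwab (2017) can be written as below. This is given as background on the role of $\alpha$ and $\beta$; that the package minimises exactly this form internally is an assumption here, not something stated on this page. $$\mathcal{L}_{\alpha} = H(T) - \alpha\, H(T \mid X) - \beta\, I(Y;T).$$ Here $H(T)$ is the entropy of the cluster assignment, $H(T \mid X)$ is the conditional entropy of the assignment given the observations, and $I(Y;T)$ is the mutual information between the original labels and the cluster assignments. With $\alpha = 1$ the first two terms combine into $I(X;T)$, which gives the Information Bottleneck objective $I(X;T) - \beta\, I(Y;T)$; with $\alpha = 0$ the conditional entropy term drops out.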

The following kernel functions are used to estimate densities for the clustering procedure:

Continuous variables: Gaussian kernel $$K_c\left(\frac{x-x'}{s}\right) = \frac{1}{\sqrt{2\pi}} \exp\left\{ - \frac{\left(x-x'\right)^2}{2s^2} \right\}, \quad s > 0.$$
Nominal categorical variables: Aitchison & Aitken kernel $$K_u\left(x = x' ; \lambda\right) = \begin{cases} 1-\lambda & \text{if } x = x' \\ \frac{\lambda}{\ell-1} & \text{otherwise} \end{cases}, \quad 0 \leq \lambda \leq \frac{\ell-1}{\ell}.$$
Ordinal categorical variables: Li & Racine kernel $$K_o\left(x = x' ; \nu\right) = \begin{cases} 1 & \text{if } x = x' \\ \nu^{|x - x'|} & \text{otherwise} \end{cases}, \quad 0 \leq \nu \leq 1.$$

Here, $s$, $\lambda$, and $\nu$ are bandwidth or smoothing parameters, while $\ell$ is the number of levels of the categorical variable. $s$ and $\lambda$ are automatically determined by the algorithm if not provided by the user. For ordinal variables, the lambda parameter of the function is used to define $\nu$.
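
The three kernels above can be sketched directly in base R. This is an illustrative transcription of the formulas, not the package's internal code; the function names gaussian_kernel, nominal_kernel and ordinal_kernel are hypothetical, and ordinal values are assumed to be supplied as integer level codes.

```r
# Gaussian kernel for continuous variables, evaluated at (x - x') / s.
gaussian_kernel <- function(x, x_prime, s) {
  stopifnot(s > 0)
  exp(-(x - x_prime)^2 / (2 * s^2)) / sqrt(2 * pi)
}

# Aitchison & Aitken kernel for a nominal variable with n_levels categories.
nominal_kernel <- function(x, x_prime, lambda, n_levels) {
  stopifnot(lambda >= 0, lambda <= (n_levels - 1) / n_levels)
  ifelse(x == x_prime, 1 - lambda, lambda / (n_levels - 1))
}

# Li & Racine kernel for an ordinal variable.
# x and x_prime are assumed to be integer level codes (e.g. from as.integer()).
ordinal_kernel <- function(x, x_prime, nu) {
  stopifnot(nu >= 0, nu <= 1)
  ifelse(x == x_prime, 1, nu^abs(x - x_prime))
}

gaussian_kernel(0.2, 0.5, s = 0.3)                    # equals dnorm(1), about 0.242
nominal_kernel("a", "b", lambda = 0.4, n_levels = 3)  # 0.2
ordinal_kernel(1L, 3L, nu = 0.5)                      # 0.25
```

Smaller values of $\lambda$ and $\nu$ concentrate the kernel weight on exact matches, while values near the upper bounds spread weight more evenly across categories.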

References

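Strouse, D. J. and Schwab, D. J. (2017). The deterministic information bottleneck. Neural Computation.

Aitchison, J. and Aitken, C. G. G. (1976). Multivariate binary discrimination by the kernel method. Biometrika.

Li, Q. and Racine, J. (2003). Nonparametric estimation of distributions with categorical and continuous data. Journal of Multivariate Analysis.

Silverman, B. W. (1998). Density Estimation for Statistics and Data Analysis. Routledge.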

Examples

# Example dataset with categorical, ordinal, and continuous variables
set.seed(123)
data <- data.frame(
  cat_var = factor(sample(letters[1:3], 100, replace = TRUE)),      # Nominal categorical variable
  ord_var = factor(sample(c("low", "medium", "high"), 100, replace = TRUE),
                   levels = c("low", "medium", "high"),
                   ordered = TRUE),                                # Ordinal variable
  cont_var1 = rnorm(100),                                          # Continuous variable 1
  cont_var2 = runif(100)                                           # Continuous variable 2
)

# Perform Mixed-Type Fuzzy Clustering with Generalised IB
result <- GIBmix(X = data, ncl = 3, beta = 2, alpha = 0.5, catcols = 1:2,
                 contcols = 3:4, nstart = 20)

# Print clustering results
print(result$Cluster)       # Cluster membership matrix
print(result$Entropy)       # Entropy of final clustering
print(result$RelEntropy)    # Relative entropy of final clustering
print(result$MutualInfo)    # Mutual information between Y and T
