IBmix: Information Bottleneck Clustering for Mixed-Type Data

Description

The IBmix function implements the Information Bottleneck (IB) algorithm for clustering datasets containing continuous, categorical (nominal and ordinal), or mixed-type variables. The method optimizes an information-theoretic objective that preserves relevant information in the cluster assignments while compressing the data (Strouse and Schwab, 2019).

Usage

IBmix(X, ncl, beta, randinit = NULL,
       s = -1, lambda = -1, scale = TRUE,
       maxiter = 100, nstart = 100,
       conv_tol = 1e-5, contkernel = "gaussian",
       nomkernel = "aitchisonaitken", ordkernel = "liracine",
       cat_first = FALSE, verbose = FALSE)

Value

An object of class "gibclust" representing the final clustering result. The returned object is a list with the following components:

Cluster: A cluster membership matrix giving the degree of membership of each data point in each cluster.
Entropy: A numeric value representing the entropy of the cluster assignments at convergence.
CondEntropy: A numeric value representing the conditional entropy of the cluster assignments given the observation weights, $H(T \mid X)$.
MutualInfo: A numeric value representing the mutual information, $I(Y;T)$, between the original labels ($Y$) and the cluster assignments ($T$).
InfoXT: A numeric value representing the mutual information, $I(X;T)$, between the original observation weights ($X$) and the cluster assignments ($T$).
beta: A numeric vector of the final beta values used in the iterative procedure.
alpha: A numeric value of the strength of conditional entropy used, controlling fuzziness of the solution. This is by default equal to $0$ for DIBmix.
s: A numeric vector of bandwidth parameters used for the continuous variables. A value of $-1$ is returned if all variables are categorical.
lambda: A numeric vector of bandwidth parameters used for the categorical variables. A value of $-1$ is returned if all variables are continuous.
call: The matched call.
ncl: Number of clusters.
n: Number of observations.
iters: Number of iterations used to obtain the returned solution.
converged: Logical indicating whether convergence was reached before maxiter.
conv_tol: Numeric convergence tolerance.
contcols: Indices of continuous columns in X.
catcols: Indices of categorical columns in X.
kernels: List with names of kernels used for continuous, nominal, and ordinal features.
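
The Entropy, CondEntropy, and InfoXT components are linked by the identity $I(X;T) = H(T) - H(T \mid X)$. The following is a minimal sketch of that relationship, not the package's internal code. The membership matrix U, the uniform observation weights px, the helper entropy_bits, and the choice of base-2 logarithms are all assumptions made for illustration.

```r
# Hypothetical n x ncl fuzzy membership matrix q(t | x); each row sums to 1.
U <- matrix(c(0.9, 0.1,
              0.8, 0.2,
              0.2, 0.8,
              0.5, 0.5), ncol = 2, byrow = TRUE)

# Assumed uniform observation weights p(x).
px <- rep(1 / nrow(U), nrow(U))

# Shannon entropy in bits (assumed unit), treating 0 * log(0) as 0.
entropy_bits <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

qt <- as.vector(px %*% U)                           # cluster marginal q(t)
H_T <- entropy_bits(qt)                             # H(T)
H_T_given_X <- sum(px * apply(U, 1, entropy_bits))  # H(T | X)
I_XT <- H_T - H_T_given_X                           # I(X; T)

print(c(H_T = H_T, H_T_given_X = H_T_given_X, I_XT = I_XT))
```

A hard (one-hot) membership matrix gives $H(T \mid X) = 0$, so in that case $I(X;T)$ equals $H(T)$.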

Objects of class "gibclust" support the following methods:

print.gibclust: Display a concise description of the cluster assignment.
summary.gibclust: Show detailed information including cluster sizes, information-theoretic metrics, hyperparameters, and convergence details.
plot.gibclust: Produce diagnostic plots:
- type = "sizes": barplot of cluster sizes or hardened sizes (IB/GIB).
- type = "info": barplot of entropy, conditional entropy, and mutual information.
- type = "beta": trajectory of $\log \beta$ over iterations (only available for hard clustering outputs obtained using DIBmix).

Arguments

X: A data frame containing the input data to be clustered. It should include categorical variables (factor for nominal and Ord.factor for ordinal) and continuous variables (numeric).
ncl: An integer specifying the number of clusters.
beta: Regularisation strength characterizing the tradeoff between compression and relevance. Must be non-negative.
randinit: An optional vector specifying the initial cluster assignments. If NULL, cluster assignments are initialized randomly.
s: A numeric value or vector specifying the bandwidth parameter(s) for continuous variables. User-supplied values must be greater than $0$. The default value of $-1$ enables automatic selection of the optimal bandwidth(s). Argument is ignored when no variables are continuous.
lambda: A numeric value or vector specifying the bandwidth parameter(s) for categorical variables. The default value of $-1$ enables automatic determination of the optimal bandwidth. For nominal variables and nomkernel = 'aitchisonaitken', the maximum allowable value of lambda is $(\ell - 1)/\ell$, where $\ell$ is the number of categories, whereas for nomkernel = 'liracine' the maximum allowable value is $1$. For ordinal variables, the maximum allowable value of lambda is $1$, regardless of which ordkernel is used. Argument is ignored when all variables are continuous.
scale: A logical value indicating whether the continuous variables should be scaled to have unit variance before clustering. Defaults to TRUE. Argument is ignored when all variables are categorical.
maxiter: The maximum number of iterations allowed for the clustering algorithm. Defaults to $100$.
nstart: The number of random initializations to run. The best clustering solution is returned. Defaults to $100$.
conv_tol: Convergence tolerance level; for a cluster membership matrix $U^{(m)}$ at iteration $m$, convergence is achieved if $\sum_{i,j}\lvert U_{i,j}^{(m+1)} - U_{i,j}^{(m)} \rvert \le$ conv_tol. Must be in the range $[0, 1]$. Defaults to 1e-5.
contkernel: Kernel used for continuous variables. Can be one of gaussian (default) or epanechnikov. Argument is ignored when no variables are continuous.
nomkernel: Kernel used for nominal (unordered categorical) variables. Can be one of aitchisonaitken (default) or liracine. Argument is ignored when no variables are nominal.
ordkernel: Kernel used for ordinal (ordered categorical) variables. Can be one of liracine (default) or wangvanryzin. Argument is ignored when no variables are ordinal.
cat_first: A logical value indicating whether bandwidth selection is prioritised for the categorical variables rather than the continuous ones. Defaults to FALSE. Set to TRUE if you suspect that the continuous variables are not informative of the cluster structure. Can only be TRUE when the data are of mixed type and all bandwidths are selected automatically (i.e. s = -1, lambda = -1).
verbose: Logical. Defaults to FALSE, which suppresses progress messages. Set to TRUE to print them.
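
The conv_tol criterion can be illustrated with a small sketch. The helper has_converged and the matrices U_prev and U_next are hypothetical names introduced here; the code only mirrors the inequality stated for conv_tol and is not taken from the package source.

```r
# Hypothetical helper: TRUE when the total absolute change in the
# membership matrix between two iterations is at most conv_tol.
has_converged <- function(U_old, U_new, conv_tol = 1e-5) {
  stopifnot(all(dim(U_old) == dim(U_new)), conv_tol >= 0, conv_tol <= 1)
  sum(abs(U_new - U_old)) <= conv_tol
}

# Two illustrative membership matrices from consecutive iterations.
U_prev <- matrix(c(0.70, 0.30,
                   0.40, 0.60), ncol = 2, byrow = TRUE)
U_next <- matrix(c(0.75, 0.25,
                   0.40, 0.60), ncol = 2, byrow = TRUE)

has_converged(U_prev, U_next)   # total change is about 0.1, above the default tolerance
has_converged(U_prev, U_prev)   # identical matrices, so the change is 0
```

Because the criterion sums over all entries, the same tolerance is stricter for larger datasets or more clusters.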

Author

Efthymios Costa, Ioanna Papatsouma, Angelos Markos

Details

The IBmix function produces a fuzzy clustering of the data while retaining maximal information about the original variable distributions. The Information Bottleneck algorithm optimizes an information-theoretic objective that balances information preservation and compression. Bandwidth parameters for categorical (nominal, ordinal) and continuous variables are adaptively determined if not provided. This iterative process identifies stable and interpretable cluster assignments by maximizing mutual information while controlling complexity. The method is well-suited for datasets with mixed-type variables and integrates information from all variable types effectively.
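
For orientation, the trade-off between compression and relevance is commonly written as the generalized objective of Strouse and Schwab (2019), using the notation of the Value section ($X$ for observations, $Y$ for the relevance variable, $T$ for the cluster assignment). This is a sketch of the standard formulation, given on the assumption that IBmix corresponds to the case $\alpha = 1$; it is not taken from the package source.

$$\min_{q(t \mid x)} \; H(T) - \alpha H(T \mid X) - \beta I(Y;T), \qquad \text{which for } \alpha = 1 \text{ reduces to } \; I(X;T) - \beta I(Y;T).$$

Under this formulation, a larger beta places more weight on preserving information about $Y$ and less on compressing $X$, while $\alpha = 0$ gives the deterministic (hard-clustering) variant.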

The following kernel functions can be used to estimate densities for the clustering procedure. For continuous variables:

Gaussian (RBF) kernel (Silverman, 1998): $$K_c\left(\frac{x - x'}{s}\right) = \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{\left(x - x'\right)^2}{2s^2}\right\}, \quad s > 0.$$
Epanechnikov kernel (Epanechnikov, 1969): $$K_c(x - x'; s) = \begin{cases} \frac{3}{4\sqrt{5}}\left(1 - \frac{(x-x')^2}{5s^2} \right), & \text{if } \frac{(x - x')^2}{s^2} < 5 \\ 0, & \text{otherwise} \end{cases}, \quad s > 0.$$
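
The two continuous kernels can be transcribed directly into R. The function names gaussian_kernel and epanechnikov_kernel are hypothetical and chosen for illustration; this is a sketch of the formulas above, not the package's implementation.

```r
# Gaussian kernel evaluated at the scaled difference (x - x') / s.
gaussian_kernel <- function(x, x_prime, s) {
  stopifnot(s > 0)
  exp(-(x - x_prime)^2 / (2 * s^2)) / sqrt(2 * pi)
}

# Epanechnikov kernel with support (x - x')^2 / s^2 < 5.
epanechnikov_kernel <- function(x, x_prime, s) {
  stopifnot(s > 0)
  u2 <- (x - x_prime)^2 / s^2
  ifelse(u2 < 5, 3 / (4 * sqrt(5)) * (1 - u2 / 5), 0)
}

x <- c(-3, -1, 0, 1, 3)
gaussian_kernel(x, 0, s = 1)       # positive everywhere, largest at x = 0
epanechnikov_kernel(x, 0, s = 1)   # exactly 0 at x = -3 and x = 3 (outside the support)
```

The Gaussian kernel gives every pair of observations a positive weight, whereas the Epanechnikov kernel assigns zero weight to pairs further apart than $\sqrt{5}\,s$.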

For nominal (unordered categorical) variables:

Aitchison & Aitken kernel (Aitchison and Aitken, 1976): $$K_u(x = x'; \lambda) = \begin{cases} 1 - \lambda, & \text{if } x = x' \\ \frac{\lambda}{\ell - 1}, & \text{otherwise} \end{cases}, \quad 0 \leq \lambda \leq \frac{\ell - 1}{\ell}.$$
Li & Racine kernel (Ouyang et al., 2006): $$K_u(x = x'; \lambda) = \begin{cases} 1, & \text{if } x = x' \\ \lambda, & \text{otherwise} \end{cases}, \quad 0 \leq \lambda \leq 1.$$
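
A minimal R sketch of the two nominal kernels follows, including the upper bounds on $\lambda$. The function names and the n_levels argument (standing in for $\ell$) are hypothetical; the code transcribes the formulas above rather than the package internals.

```r
# Aitchison & Aitken kernel; n_levels plays the role of ell.
aitchison_aitken_kernel <- function(x, x_prime, lambda, n_levels) {
  stopifnot(lambda >= 0, lambda <= (n_levels - 1) / n_levels)
  ifelse(x == x_prime, 1 - lambda, lambda / (n_levels - 1))
}

# Li & Racine kernel for unordered categories.
li_racine_nominal_kernel <- function(x, x_prime, lambda) {
  stopifnot(lambda >= 0, lambda <= 1)
  ifelse(x == x_prime, 1, lambda)
}

x <- c("a", "b", "c")
aitchison_aitken_kernel(x, "a", lambda = 0.3, n_levels = 3)   # 0.70 0.15 0.15
li_racine_nominal_kernel(x, "a", lambda = 0.3)                # 1.0 0.3 0.3
```

At the upper bound $\lambda = (\ell - 1)/\ell$ the Aitchison & Aitken kernel assigns weight $1/\ell$ to every category, so the variable no longer distinguishes between observations.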

For ordinal (ordered categorical) variables:

Li & Racine kernel (Li and Racine, 2003): $$K_o(x = x'; \nu) = \begin{cases} 1, & \text{if } x = x' \\ \nu^{|x - x'|}, & \text{otherwise} \end{cases}, \quad 0 \leq \nu \leq 1.$$
Wang & van Ryzin kernel (Wang and van Ryzin, 1981): $$K_o(x = x'; \nu) = \begin{cases} 1 - \nu, & \text{if } x = x' \\ \frac{1-\nu}{2}\nu^{|x - x'|}, & \text{otherwise} \end{cases}, \quad 0 \leq \nu \leq 1.$$
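
The ordinal kernels can be sketched in the same way. The function names are hypothetical, and the distance $|x - x'|$ is computed here on the integer codes of an ordered factor, which assumes that adjacent levels are one unit apart.

```r
# Li & Racine kernel for ordered categories.
li_racine_ordinal_kernel <- function(x, x_prime, nu) {
  stopifnot(nu >= 0, nu <= 1)
  d <- abs(as.integer(x) - as.integer(x_prime))  # distance between level codes (assumption)
  ifelse(d == 0, 1, nu^d)
}

# Wang & van Ryzin kernel.
wang_van_ryzin_kernel <- function(x, x_prime, nu) {
  stopifnot(nu >= 0, nu <= 1)
  d <- abs(as.integer(x) - as.integer(x_prime))
  ifelse(d == 0, 1 - nu, (1 - nu) / 2 * nu^d)
}

x <- factor(c("low", "medium", "high"),
            levels = c("low", "medium", "high"), ordered = TRUE)
li_racine_ordinal_kernel(x, x[1], nu = 0.5)   # 1.00 0.50 0.25
wang_van_ryzin_kernel(x, x[1], nu = 0.5)      # 0.5000 0.1250 0.0625
```

Both kernels decay geometrically with the distance between levels, so "low" is treated as closer to "medium" than to "high".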

The bandwidth parameters $s$, $\lambda$, and $\nu$ control the smoothness of the density estimate. If the user does not supply them, the algorithm determines them automatically using the approach of Costa et al. (2025). $\ell$ denotes the number of levels of the categorical variable. For ordinal variables, the lambda argument of the function supplies the value of $\nu$.

References

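Strouse DJ, Schwab DJ (2019). The information bottleneck and geometric clustering. Neural Computation.
Aitchison J, Aitken CGG (1976). Multivariate binary discrimination by the kernel method. Biometrika.
Li Q, Racine J (2003). Nonparametric estimation of distributions with categorical and continuous data. Journal of Multivariate Analysis.
Silverman BW (1998). Density Estimation for Statistics and Data Analysis. Routledge.
Ouyang D, Li Q, Racine J (2006). Cross-validation and the estimation of probability distributions with categorical data. Journal of Nonparametric Statistics.
Wang M, van Ryzin J (1981). A class of smooth estimators for discrete distributions. Biometrika.
Epanechnikov VA (1969). Non-parametric estimation of a multivariate probability density. Theory of Probability & Its Applications.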

Examples

library(IBclust)

# Example dataset with categorical, ordinal, and continuous variables
set.seed(123)
data_mix <- data.frame(
  cat_var = factor(sample(letters[1:3], 100, replace = TRUE)),      # Nominal categorical variable
  ord_var = factor(sample(c("low", "medium", "high"), 100, replace = TRUE),
                   levels = c("low", "medium", "high"),
                   ordered = TRUE),                                # Ordinal variable
  cont_var1 = rnorm(100),                                          # Continuous variable 1
  cont_var2 = runif(100)                                           # Continuous variable 2
)

# Perform Mixed-Type Fuzzy Clustering
result_mix <- IBmix(X = data_mix, ncl = 3, beta = 2, nstart = 1)

# Print clustering results
print(result_mix$Cluster)       # Cluster membership matrix
print(result_mix$InfoXT)        # Mutual information between X and T
print(result_mix$MutualInfo)    # Mutual information between Y and T

# Summary of output
summary(result_mix)

# Simulated categorical data example
set.seed(123)
data_cat <- data.frame(
  Var1 = as.factor(sample(letters[1:3], 100, replace = TRUE)),  # Nominal variable
  Var2 = as.factor(sample(letters[4:6], 100, replace = TRUE)),  # Nominal variable
  Var3 = factor(sample(c("low", "medium", "high"), 100, replace = TRUE),
                levels = c("low", "medium", "high"), ordered = TRUE)  # Ordinal variable
)

# Perform fuzzy clustering on categorical data with standard IB
result_cat <- IBmix(X = data_cat, ncl = 3, beta = 15, lambda = -1, nstart = 2, maxiter = 200)

# Print clustering results
print(result_cat$Cluster)       # Cluster membership matrix
print(result_cat$InfoXT)        # Mutual information between X and T
print(result_cat$MutualInfo)    # Mutual information between Y and T

plot(result_cat, type = "sizes") # Bar plot of cluster sizes (hardened assignments)
plot(result_cat, type = "info")  # Information-theoretic quantities plot

# Simulated continuous data example
set.seed(123)
# Continuous data with 100 observations, 5 features
data_cont <- as.data.frame(matrix(rnorm(500), ncol = 5))

# Perform fuzzy clustering on continuous data with standard IB
result_cont <- IBmix(X = data_cont, ncl = 3, beta = 50, s = -1, nstart = 2)

# Print clustering results
print(result_cont$Cluster)       # Cluster membership matrix
print(result_cont$InfoXT)        # Mutual information between X and T
print(result_cont$MutualInfo)    # Mutual information between Y and T
