GIBcat: Cluster Categorical Data Using the Generalised Information Bottleneck Algorithm

Description

The GIBcat function implements the Generalised Information Bottleneck (GIB) algorithm for fuzzy clustering of datasets containing categorical variables. This method balances information retention and data compression to create meaningful clusters, leveraging bandwidth parameters to handle different categorical data types (nominal and ordinal) effectively strouse_ib_2019IBclust.

Usage

GIBcat(X, ncl, beta, alpha, randinit = NULL, lambda = -1,
       maxiter = 100, nstart = 100, verbose = FALSE)

Value

A list containing the following elements:

Cluster: A cluster membership matrix.
Entropy: A numeric value representing the entropy of the cluster assignment, $H(T)$.
RelEntropy: A numeric value representing the relative entropy of cluster assignment, given the observation weights $H(X \mid T)$.
MutualInfo: A numeric value representing the mutual information, $I(Y;T)$, between the original labels ($Y$) and the cluster assignments ($T$).
beta: A numeric value of the regularisation strength beta used.
alpha: A numeric value of the strength of relative entropy used.
lambda: A numeric vector of bandwidth parameters for categorical variables, controlling how categories are weighted in the clustering.
ht: A numeric vector tracking the entropy value of the cluster assignments across iterations.
hy_t: A numeric vector tracking the relative entropy values between the cluster assignments and observations weights across iterations.
iyt: A numeric vector tracking the mutual information values between original labels and cluster assignments across iterations.
losses: A numeric vector tracking the final loss values across iterations.

Arguments

X: A data frame containing the categorical data to be clustered. All variables should be categorical, either factor (for nominal variables) or ordered (for ordinal variables).
beta: Regularisation strength.
alpha: Strength of relative entropy term.
ncl: An integer specifying the number of clusters to form.
randinit: Optional. A vector specifying initial cluster assignments. If NULL, cluster assignments are initialized randomly.
lambda: A numeric value or vector specifying the bandwidth parameter for categorical variables. The default value is $-1$, which enables automatic determination of the optimal bandwidth. For nominal variables, the maximum allowable value of lambda is $(l - 1)/l$, where $l$ represents the number of categories. For ordinal variables, the maximum allowable value of lambda is $1$.
maxiter: The maximum number of iterations for the clustering algorithm. Defaults to $100$.
nstart: The number of random initializations to run. The best clustering result (based on the information-theoretic criterion) is returned. Defaults to $100$.
verbose: Logical. Default to FALSE to suppress progress messages. Change to TRUE to print.

Author

Efthymios Costa, Ioanna Papatsouma, Angelos Markos

Details

The GIBcat function applies the Generalised Information Bottleneck algorithm to do fuzzy clustering of datasets containing only categorical variables, both nominal and ordinal. The algorithm optimizes an information-theoretic objective to balance the trade-off between data compression and the retention of information about the original distribution. Set $\alpha = 1$ and $\alpha = 0$ to recover the Information Bottleneck and its Deterministic variant, respectively. If $\alpha = 0$, the algorithm ignores the value of the regularisation parameter $\beta$.

To estimate the distributions of categorical features, the function utilizes specialized kernel functions, as follows:

$$K_u(x = x'; \lambda) = \begin{cases} 1 - \lambda, & \text{if } x = x' \\ \frac{\lambda}{\ell - 1}, & \text{otherwise} \end{cases}, \quad 0 \leq \lambda \leq \frac{\ell - 1}{\ell},$$ where $\ell$ is the number of categories, and $\lambda$ controls the smoothness of the Aitchison & Aitken kernel for nominal variables aitchison_kernel_1976IBclust.

$$K_o(x = x'; \nu) = \begin{cases} 1, & \text{if } x = x' \\ \nu^{|x - x'|}, & \text{otherwise} \end{cases}, \quad 0 \leq \nu \leq 1,$$ where $\nu$ is the bandwidth parameter for ordinal variables, accounting for the ordinal relationship between categories li_nonparametric_2003IBclust.

Here, $\lambda$, and $\nu$ are bandwidth or smoothing parameters, while $\ell$ is the number of levels of the categorical variable. The lambda parameter is automatically determined by the algorithm if not provided by the user. For ordinal variables, the lambda parameter of the function is used to define $\nu$.

References

strouse_dib_2017IBclustaitchison_kernel_1976IBclustli_nonparametric_2003IBclust

Examples

Run this code

# Simulated categorical data
set.seed(123)
X <- data.frame(
  Var1 = as.factor(sample(letters[1:3], 200, replace = TRUE)),  # Nominal variable
  Var2 = as.factor(sample(letters[4:6], 200, replace = TRUE)),  # Nominal variable
  Var3 = factor(sample(c("low", "medium", "high"), 200, replace = TRUE),
                levels = c("low", "medium", "high"), ordered = TRUE)  # Ordinal variable
)

# Run GIBcat with automatic lambda selection and multiple initializations
result <- GIBcat(X = X, ncl = 2, beta = 25, alpha = 0.75, lambda = -1, nstart = 20)

# Print clustering results
print(result$Cluster)       # Cluster membership matrix
print(result$Entropy)       # Entropy of final clustering
print(result$RelEntropy)    # Relative entropy of final clustering
print(result$MutualInfo)    # Mutual information between Y and T

Run the code above in your browser using DataLab