IBcont: Cluster Continuous Data Using the Information Bottleneck Algorithm

Description

The IBcont function implements the Information Bottleneck (IB) algorithm for fuzzy clustering of continuous data. This method optimizes an information-theoretic objective to preserve relevant information while forming concise and interpretable cluster representations (Strouse and Schwab, 2019).

Usage

IBcont(X, ncl, beta, randinit = NULL, s = -1, scale = TRUE,
       maxiter = 100, nstart = 100, verbose = FALSE)

Value

A list containing the following elements:

Cluster: A cluster membership matrix.
InfoXT: A numeric value representing the mutual information, $I(X;T)$, between the original observation weights ($X$) and the cluster assignments ($T$).
InfoYT: A numeric value representing the mutual information, $I(Y;T)$, between the original labels ($Y$) and the cluster assignments ($T$).
beta: A numeric value of the regularisation strength beta used.
s: A numeric vector of bandwidth parameters used for the continuous variables.
ixt: A numeric vector tracking the mutual information values between original observation weights and cluster assignments across iterations.
iyt: A numeric vector tracking the mutual information values between original labels and cluster assignments across iterations.
losses: A numeric vector tracking the final loss values across iterations.
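
To make the relationship between the membership matrix and $I(X;T)$ concrete, here is a minimal base R sketch. It rests on assumptions that are not stated on this page: the membership matrix is hypothetical, rows are taken to be observations and columns clusters, the observation weights are uniform, and the logarithm is natural. The helper name mutual_info_xt is illustrative and not part of the package.

```r
# Hypothetical soft membership matrix q(t | x):
# rows = observations, columns = clusters (orientation is an assumption).
membership <- rbind(
  c(0.9, 0.1),
  c(0.8, 0.2),
  c(0.2, 0.8),
  c(0.1, 0.9)
)
stopifnot(all(abs(rowSums(membership) - 1) < 1e-12))

# Hard labels: the cluster with the largest membership for each observation
hard_labels <- apply(membership, 1, which.max)

# I(X;T) in nats, assuming uniform observation weights p(x) = 1/n
mutual_info_xt <- function(q_t_given_x) {
  n <- nrow(q_t_given_x)
  p_x <- rep(1 / n, n)
  q_t <- colSums(p_x * q_t_given_x)         # marginal q(t)
  ratio <- sweep(q_t_given_x, 2, q_t, "/")  # q(t | x) / q(t)
  terms <- p_x * q_t_given_x * log(ratio)
  sum(terms[q_t_given_x > 0])               # treat 0 * log(0) as 0
}

ixt <- mutual_info_xt(membership)
```

Under these assumptions, completely uninformative memberships give a value of zero, and a balanced hard partition into two clusters gives log(2).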

Arguments

X: A numeric matrix or data frame containing the continuous data to be clustered. All variables should be of type numeric.
ncl: An integer specifying the number of clusters to form.
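beta: A numeric value giving the regularisation strength, which controls the trade-off between compressing the data and preserving relevant information.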
randinit: Optional. A vector specifying initial cluster assignments. If NULL, cluster assignments are initialized randomly.
s: A numeric value or vector specifying the bandwidth parameter(s) for continuous variables. User-supplied values must be greater than $0$. The default, $-1$, is a special value that enables automatic selection of the bandwidth(s).
scale: A logical value indicating whether the continuous variables should be scaled to have unit variance before clustering. Defaults to TRUE.
maxiter: The maximum number of iterations allowed for the clustering algorithm. Defaults to $100$.
nstart: The number of random initializations to run. The best clustering result (based on the information-theoretic criterion) is returned. Defaults to 100.
verbose: A logical value indicating whether progress messages should be printed. Defaults to FALSE, which suppresses them.
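
As an illustration of what scaling to unit variance means for the scale argument, the following base R sketch divides each column by its sample standard deviation. This is an assumption about the general technique, not a statement of how IBcont implements it internally (for instance, whether it also centers the columns is not specified here).

```r
# Two columns on very different scales (simulated for illustration)
set.seed(42)
X <- cbind(a = rnorm(30, sd = 10), b = rnorm(30, sd = 0.1))

# Divide each column by its sample standard deviation
col_sd <- apply(X, 2, sd)
X_scaled <- sweep(X, 2, col_sd, "/")

# Each column now has unit sample variance
scaled_sd <- apply(X_scaled, 2, sd)
```

Without such scaling, a variable measured on a large scale would dominate any distance-based or kernel-based comparison between observations.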

Author

Efthymios Costa, Ioanna Papatsouma, Angelos Markos

Details

The IBcont function applies the Information Bottleneck algorithm to perform fuzzy clustering of datasets comprising only continuous variables. This method leverages an information-theoretic objective to optimize the trade-off between data compression and the preservation of relevant information about the underlying data distribution.
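
As a rough guide to how beta, InfoXT and InfoYT relate, the IB objective in its standard form is shown below. Whether IBcont minimises exactly this expression or a generalised variant is an assumption; this page does not say.

$$\min_{q(t \mid x)} \; \mathcal{L} = I(X;T) - \beta \, I(Y;T).$$

In this form, a larger $\beta$ places more weight on preserving $I(Y;T)$, while a smaller $\beta$ favours compression, that is, a lower $I(X;T)$.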

The function uses the Gaussian kernel (Silverman, 1998) to estimate the probability densities of continuous features. The kernel is defined as:

$$K_c\left(\frac{x - x'}{s}\right) = \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{\left(x - x'\right)^2}{2s^2}\right\}, \quad s > 0.$$

The bandwidth parameter $s$, which controls the smoothness of the density estimate, is automatically determined by the algorithm if not provided by the user.
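
The effect of the bandwidth can be sketched in base R. The code below evaluates the kernel formula given above and uses it in a standard one-dimensional kernel density estimate. The function names gaussian_kernel and kde_1d are hypothetical helpers written for this illustration. They are not part of the package, and the package's internal normalisation may differ.

```r
# Gaussian kernel K((x - x') / s), following the formula above
gaussian_kernel <- function(x, x_prime, s) {
  stopifnot(all(s > 0))
  exp(-(x - x_prime)^2 / (2 * s^2)) / sqrt(2 * pi)
}

# Hypothetical helper: standard 1-D kernel density estimate,
# f(g) = (1 / (n * s)) * sum_i K((g - x_i) / s)
kde_1d <- function(grid, data, s) {
  sapply(grid, function(g) mean(gaussian_kernel(g, data, s)) / s)
}

set.seed(1)
x <- rnorm(50)
grid <- seq(-3, 3, by = 0.5)

dens_small_s <- kde_1d(grid, x, s = 0.2)  # small bandwidth: wigglier estimate
dens_large_s <- kde_1d(grid, x, s = 1.0)  # large bandwidth: smoother estimate
```

Plotting dens_small_s and dens_large_s against grid shows how a larger $s$ smooths out local detail in the estimated density.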

References

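Strouse, D. J. and Schwab, D. J. (2019). The information bottleneck and geometric clustering. Neural Computation, 31(3), 596-612.

Silverman, B. W. (1998). Density Estimation for Statistics and Data Analysis. Routledge.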

Examples

# Generate simulated continuous data
set.seed(123)
X <- matrix(rnorm(200), ncol = 5)  # 40 observations, 5 features

# Run IBcont with automatic bandwidth selection and multiple initializations
result <- IBcont(X = X, ncl = 3, beta = 50, s = -1, nstart = 20)

# Print clustering results
print(result$Cluster)       # Cluster membership matrix
print(result$InfoXT)       # Mutual information between X and T
print(result$InfoYT)    # Mutual information between Y and T
