runGMM: Run Gaussian Mixture Model (GMM) Clustering with Multiple Initialization Strategies

Description

Applies the Gaussian Mixture Model (GMM) to a dataset using multiple initialization strategies. It runs the Expectation-Maximization (EM) algorithm for each initialization method and returns results for all methods.

Usage

runGMM(x, k, max_iter = 100, run_number = 10, smax_iter = 3,
       s_iter = 10, c_iter = 10, tol = 1e-6, burn_in = 3, verbose = FALSE)

Value

A named list where each element corresponds to an initialization method and contains the results of the EM algorithm:

"Random": Results using random initialization.
"hierarchical.average": Results using hierarchical clustering (average linkage).
"hierarchical.ward": Results using hierarchical clustering (Ward’s method).
"kmeans": Results using K-means clustering initialization.
"emEM": Results using multi-start Expectation-Maximization (EM).
"emAEM": Results using the alternative EM initialization method.
"sem": Results using the Stochastic Expectation-Maximization (SEM).
"cem": Results using the Classification Expectation-Maximization (CEM).
"mclust": Results using model-based clustering from the mclust package.

Each element in the returned list contains:

BIC: The Bayesian Information Criterion (BIC) value for model selection.
param: A list with the estimated GMM parameters:
- pi_k: Updated mixing proportions.
- mu: Updated cluster means.
- sigma: Updated covariance matrices.
cluster_assignments: Cluster labels assigned to each observation.
Z: Posterior probability matrix of cluster memberships.
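
As a small illustration of how Z relates to hard cluster labels, the sketch below applies the usual maximum-posterior rule to a posterior matrix. The matrix is a hypothetical stand-in for the Z element (its values are invented for illustration), and whether runGMM derives cluster_assignments in exactly this way is an assumption, not something this page confirms.

```r
# Hypothetical 4 x 2 posterior matrix standing in for the documented Z element:
# one row per observation, one column per cluster, rows summing to 1.
Z <- matrix(c(0.90, 0.10,
              0.20, 0.80,
              0.60, 0.40,
              0.05, 0.95), ncol = 2, byrow = TRUE)

# Maximum-posterior rule: each observation goes to its most probable cluster.
labels <- max.col(Z, ties.method = "first")
print(labels)  # 1 2 1 2

# A simple per-observation uncertainty: 1 minus the largest posterior.
uncertainty <- 1 - apply(Z, 1, max)
print(uncertainty)
```

Observations with a high uncertainty value sit close to a cluster boundary and are the ones most likely to change label between initialization methods.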

If an initialization method fails, the corresponding list element will contain an error message.

Arguments

x: A numeric matrix or data frame where rows represent observations and columns represent variables.
k: An integer specifying the number of clusters.
max_iter: An integer specifying the maximum number of iterations for the long EM run (default is 100).
run_number: An integer specifying the number of short EM runs used by the emEM and emAEM initialization methods (default is 10).
smax_iter: An integer specifying the maximum number of iterations for short EM (default is 3).
s_iter: An integer specifying the number of iterations for the Stochastic Expectation-Maximization (SEM) algorithm (default is 10).
c_iter: An integer specifying the number of iterations for the Classification Expectation-Maximization (CEM) algorithm (default is 10).
tol: A numeric value specifying the convergence tolerance threshold (default is 1e-6).
burn_in: An integer specifying the number of burn-in iterations for stochastic methods (default is 3).
verbose: Logical; if TRUE, prints progress messages (default is FALSE).
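
To make the roles of run_number, smax_iter, max_iter and tol more concrete, here is a minimal sketch of the general short-run idea behind emEM: several short EM runs from random starts, followed by one long EM run from the best of them. It uses a univariate mixture and base R only. The helpers em_fit and random_init are hypothetical names written for this illustration and are not the package's internals, which may differ in important ways.

```r
# Hypothetical helper: EM for a univariate Gaussian mixture.
# The returned loglik is evaluated at the parameters entering the last iteration.
em_fit <- function(x, init, max_iter, tol = 1e-6) {
  mu <- init$mu; sigma <- init$sigma; pi_k <- init$pi_k
  k <- length(mu)
  loglik_old <- -Inf
  for (iter in seq_len(max_iter)) {
    # E-step: weighted component densities (n x k) and posterior memberships
    dens <- sapply(seq_len(k), function(j) pi_k[j] * dnorm(x, mu[j], sigma[j]))
    loglik <- sum(log(rowSums(dens)))
    z <- dens / rowSums(dens)
    # M-step: update mixing proportions, means and standard deviations
    n_k <- colSums(z)
    pi_k <- n_k / length(x)
    mu <- colSums(z * x) / n_k
    sigma <- pmax(sqrt(colSums(z * outer(x, mu, "-")^2) / n_k), 1e-3)
    # Stop once the log-likelihood improvement falls below tol
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(mu = mu, sigma = sigma, pi_k = pi_k, loglik = loglik)
}

# Hypothetical helper: random start with means drawn from the data.
random_init <- function(x, k) {
  list(mu = sample(x, k), sigma = rep(sd(x), k), pi_k = rep(1 / k, k))
}

set.seed(1)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 6))
run_number <- 10; smax_iter <- 3; max_iter <- 100

# Short runs: only smax_iter EM iterations from each random start
short_runs <- lapply(seq_len(run_number), function(i) {
  em_fit(x, random_init(x, 2), max_iter = smax_iter)
})
best <- short_runs[[which.max(sapply(short_runs, function(r) r$loglik))]]

# Long run: continue from the best short run for up to max_iter iterations
final <- em_fit(x, best, max_iter = max_iter)
print(sort(final$mu))
```

The short runs are cheap, so more starting points can be explored for the same cost before the full iteration budget is spent on the most promising one.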

Details

runGMM fits a Gaussian Mixture Model (GMM) with the Expectation-Maximization (EM) algorithm once for each initialization strategy. Each method is run independently, and the results for all of them are returned in a single named list.
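
Since every method returns its own fit, a natural next step is to line up the BIC values side by side while skipping methods that failed. The sketch below does this on a hypothetical stand-in list whose numbers are invented for illustration. The shape of a failed element (a plain character message here) is an assumption, as is the helper name has_bic.

```r
# Hypothetical stand-in for the list returned by runGMM(); the values are
# made up and only mirror the documented structure.
mock_results <- list(
  Random = list(BIC = -612.4, cluster_assignments = c(1L, 2L, 1L)),
  kmeans = list(BIC = -598.1, cluster_assignments = c(1L, 2L, 2L)),
  sem    = "Error: initialization failed"  # assumed shape of a failed method
)

# Keep only elements that are lists carrying a numeric BIC.
has_bic <- function(res) is.list(res) && is.numeric(res$BIC)
ok <- Filter(has_bic, mock_results)

# One BIC per successful method, as a named numeric vector.
bic_by_method <- vapply(ok, function(res) res$BIC, numeric(1))
print(bic_by_method)

# Methods that did not produce a usable fit.
failed <- setdiff(names(mock_results), names(ok))
print(failed)
```

Whether a larger or a smaller BIC indicates the better fit depends on the sign convention the package uses, so check that before ranking the methods.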

Examples

# Generate sample data
set.seed(123)
data <- matrix(rnorm(100 * 2), ncol = 2)

# Run GMM clustering with different initialization strategies
results <- runGMM(data, k = 2)
results