gmix: Gaussian Mixture Modelling

Description

Fast implementation of the EM algorithm for ML estimation and clustering of Gaussian mixture models with covariance matrix regularization based on eigenvalue ratio constraints.

Usage

gmix(
   data,
   K = NA,
   erc = 50,
   iter.max = 1000,
   tol = 1e-8,
   init = "kmed",
   init.nstart = 25,
   init.iter.max = 30,
   init.tol = tol,
   save_cluster = TRUE,
   save_params = TRUE,
   save_taus = FALSE)

Value

An S3 object of class 'mbcfit'. Output components are as follows:

info

information on convergence and errors (see notes).

iter

number of iterations performed in the underlying EM-algorithm.

N

number of data points.

P

data dimension.

K

number of clusters.

eloglik

sample expected log-likelihood.

size

cluster size (counts).

cluster

cluster assignment based on the maximum a posteriori rule (MAP).

taus

a matrix of dimension (N x {K}) where tau[i, k] is the estimated posterior probability that the ith observation belongs to the kth cluster.

params

a list containing mixture components parameters. The elements of the list are as follows: $prop=vector of proportions; $mean==matrix of dimension (P x K) containing mean parameters; $cov=array of size (P x P x K) containing covariance matrices.

info

a list with two components named giving information about underlysing EM algorithm. The code obejects can take the following values:

code=1: the algorithm converged within iter.max.
code=2: the algorithm reached iter.max.
code=3: the algorithm did not move from initial values.
code=-1: unexpected memory allocation issues occured.
code=-2: unexpected LAPACK routines errors occured.

The flag obejects can take the following values:

flag=0 no flag.
flag=1 numerically degenerate posterior probabilities. (taus) could not be prevented.
flag=2 the ERC was enforced at least once.
flag=3 if condition of flag=1 and flag=2 occurred.

Arguments

data: a numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables/features. Let N=nrows(data) and P=ncol(data). Categorical variables and NA values are not allowed.
K: the number of mixture components or clusters. It can be left NA for certain specifications of init (see below) where the number of cluster is retrieved from the initial partition.
erc: a numeric value >=1 specifying the eigenvalue ratio constraint (See Details).
iter.max: maximum number of iterations for the EM algorithm.
tol: tolerance for the convergence of the EM algorithm.
init: a character in the set c("kmed", "kmeans", "pam"), a vector, a matrix, or a callable giving the initial assignment of data points (see Details). The default choice is "kmed".
init.nstart: number of initial partitions ((see Details)).
init.iter.max: maximum number of iterations for each run of the kmedian initialization.
init.tol: tolerance for the convergence of each ran of the kmedian initialization.
save_cluster: logical, if TRUE the point-to-cluster assignment based on the maximum a posteriori probability (MAP) rule is returned.
save_params: logical, if TRUE the estimated mixture parameters are returned.
save_taus: logical, if TRUE the posterior class probabilities are returned (these are also known as posterior weights or fuzzy weights).

Details

The function implements the constrained ML estimator studied in Coretto and Hennig (2023). The convariance matrix constraints are computed according to the CM1-step of Algorithm 2 of Coretto and Hennig (2017). This function uses highly optimized C code for fast execution. The constrained M-step extensively uses low-level common linear algebra matrix operations (BLAS/LAPACK routines). Consequently, to maximize computational efficiency, it is recommended that the best available shared libraries, such as OpenBLAS, Intel Math Kernel Library (MKL), etc., be set up.

Initialization. The default method, set with init="kmed", uses fast C implementation of the k-medians algorithm with random initial centers drawn uniformly over the data rows init.iter.max times. Depending on the computer power available it is suggested to set init.iter.max as large as possible particularly in cases where the data set dimensionality is large in terms of both sample size and number of features. Setting init="kmeans" one replaces the K-medians with the K-means. With init="pam" initial clusters are determined using the PAM algorithm based on Euclidian distances. The latter does not perform multiple starts. The user can also set init = x where x is a vector of integers of length N=nrow(data) representing an initial hard assignment of data points to the mixture components or clusters (see Examples). Another possibility is to set init = W where W is a matrix of dimension (N x {K}) containing the initial posterior probabilities that the ith observation belongs to the kth cluster. The assignment provided via W can be hard (0-1 weights with the constraint that only a 1 is possible in each row of W, or smooth (each row of W must sum up to 1). W can be seen as the initial version of the object tau describied in the Value section below. The last alternative is to set init = f(data). Here f(data) is a function that takes data as an input and returns the matrix with an initial hard/smooth assignment as the W matrix previously described (see the example below).

Eigenvalue ratio constraint (erc). It is the maximum allowed ratio between within-cluster covariance matrix eigenvalues. It defines the so-called eigenratio constraint. erc=1 enforces spherical clusters with equal covariance matrices. A large erc allows for large between-cluster covariance discrepancies. It is suggested to never set erc arbitrarily large, its main role is to prevent degenerate covariance parameters and the related emergence of spurious clusters (see Referenceses below). Finally, in order to facilitate the setting of erc, it is suggested to scale the columns of data whenever measurement units of the different variables are grossly incompatible.

References

Coretto, Pietro and Christian Hennig (2017). Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. Journal of Machine Learning Research, Vol. 18(142), pp. 1-39. URL: https://jmlr.org/papers/v18/16-382.html

Coretto, Pietro and Christian Hennig (2023) Nonparametric consistency for maximum likelihood estimation and clustering based on mixtures of elliptically-symmetric distributions. arXiv:2311.06108. URL: https://arxiv.org/abs/2311.06108

Examples

Run this code

# --- load data
data("banknote")
dat <- banknote[-1]
n   <- nrow(dat) #sample size
nc  <- 2         #number of clusters


   
# fit 2 clusters using the default k-median initialization 
# In real applications set 'init.nstart' as large as possibile
set.seed(101)
fit1 <- gmix(dat, K = nc, init.nstart = 1)
print(fit1)

# plot partition (default)
plot(x = fit1, data = dat)


# plot partition onto the first 3 principal component coordinates 
plot(x = fit1, data = prcomp(dat)$x, margins = c(1,2,3),
     pch_cl = c("A", "B"), col_cl = c("#4285F4", "#0F9D58"),
     main = "Principal Components")


 
# user-defined random initialization with hard assignment labels 
set.seed(102)
i2   <- sample(1:nc, size = n, replace = TRUE)
fit2 <- gmix(dat, K = 2, init = i2)
plot(x=fit2, data = dat)



# user-defined smooth "toy" initialization: 
# 50% of the points are assigned to cluster 1 with probability 0.95 and to
# cluster 2 with probability 5%. The remaining data points are assigned to
# cluster 1 with probability 10% and  to cluster 2 with probability 10%
# 
set.seed(103)
idx        <- sample(c(TRUE, FALSE), size = n, replace = TRUE)
i3         <- matrix(0, nrow = n, ncol = nc) 
i3[idx,  ] <- c(0.9, 0.1)
i3[!idx, ] <- c(0.1, 0.9)
# fit
fit3  <- gmix(dat, K = nc, init = i3)
plot(x=fit3, data = dat)



# user-defined function for initialization
# this one produces a 0-1 hard posterior matrix W based on kmeans
#
compute_init <- function(data, K){
  cl  <- kmeans(data, K, nstart=1, iter.max=10)$cluster
  W   <- sapply(seq(K), function(x) as.numeric(cl==x))
  return(W)
} 
fit4 <- gmix(dat, K = nc, init = compute_init)
plot(fit4, data = dat)

Run the code above in your browser using DataLab