predict.mbcfit: Predict Hard Clustering Assignments Using Mixture Models

Description

This function predicts hard cluster assignments for new data from a fitted model of class mbcfit. Each new observation is assigned to one of the clusters estimated by the fitted model.

Usage

# S3 method for mbcfit
predict(object, newdata, ...)

Value

A vector of length nrow(newdata) containing the estimated cluster label for each observation in newdata.

Arguments

object: An object of class mbcfit, representing the fitted mixture model. This is typically the output of the gmix function. See Details.
newdata: A numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables/features. Categorical variables and NA values are not allowed. The number of columns must match the dimensionality implied by object. See Details.
...: Further arguments passed to or from other methods.

Details

The predict.mbcfit function uses the parameters of a previously fitted mbcfit model to assign new data points to the estimated clusters. Before predicting, it checks that the mbcfit model contains valid estimates and that the dimensionality of the new data matches that of the model.

The mbcfit object must contain a component named params, which is itself a list containing the following necessary elements, for a mixture model with K components:

proportions: A numeric vector of length K, with elements summing to 1, representing cluster proportions.
mean: A numeric matrix of dimensions c(P, K), representing cluster centers.
cov: A numeric array of dimensions c(P, P, K), representing cluster covariance matrices.

The data dimensionality is P, and the new data must have the same dimensionality (ncol(newdata) must be equal to P); otherwise the function terminates with an error message.
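The structure described above can be illustrated with a small hand-built example. This is only a sketch: the parameter values are made up, and check_params is a hypothetical helper written for illustration, not the validation code used inside the package.

```r
# Hypothetical parameters for K = 2 clusters in P = 2 dimensions
# (values are invented for illustration only).
K <- 2
P <- 2
params <- list(
  proportions = c(0.4, 0.6),                          # length K, sums to 1
  mean = matrix(c(0, 0, 3, 3), nrow = P, ncol = K),   # P x K, one column per cluster
  cov = array(rep(diag(P), K), dim = c(P, P, K))      # P x P x K, identity covariances
)

# Illustrative check of the documented requirements
# (check_params is a hypothetical name, not a package function).
check_params <- function(params, newdata) {
  newdata <- as.matrix(newdata)
  P <- nrow(params$mean)
  K <- ncol(params$mean)
  stopifnot(
    length(params$proportions) == K,
    isTRUE(all.equal(sum(params$proportions), 1)),
    identical(dim(params$cov), c(P, P, K)),
    ncol(newdata) == P
  )
  invisible(TRUE)
}

# Passes: three observations with P = 2 columns.
check_params(params, matrix(0, nrow = 3, ncol = 2))
```

Calling check_params with a three-column matrix would stop with an error, mirroring the dimensionality requirement stated above.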

The predicted clustering is obtained as the maximum a posteriori (MAP) estimate, using the posterior weights of a Gaussian mixture model with parameters params. Denoting with $z(x)$ the predicted cluster label for point $x$, and with $\phi$ the (multivariate) Gaussian density: $$z(x) = \underset{k \in \{1,\ldots,K\}}{\arg\,\max} \frac{\pi_k\phi(x, \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j\phi(x, \mu_j, \Sigma_j)}$$
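The MAP rule can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: log_dmvnorm and map_predict are hypothetical helper names, and the parameter values are invented. Because the denominator of the posterior weight is the same for every k, the sketch maximizes only the log of the numerator, which yields the same label.

```r
# Log-density of a multivariate Gaussian for each row of x (base R only).
# log_dmvnorm is a hypothetical helper, not part of the package.
log_dmvnorm <- function(x, mu, sigma) {
  x <- as.matrix(x)
  R <- chol(sigma)                                   # sigma = t(R) %*% R
  z <- backsolve(R, t(x) - mu, transpose = TRUE)     # solves t(R) z = x - mu
  -0.5 * colSums(z^2) - sum(log(diag(R))) - 0.5 * ncol(x) * log(2 * pi)
}

# MAP labels: argmax over k of log(pi_k) + log phi(x; mu_k, Sigma_k).
# map_predict is a hypothetical name used for this sketch.
map_predict <- function(params, newdata) {
  x <- as.matrix(newdata)
  P <- nrow(params$mean)
  K <- length(params$proportions)
  log_num <- sapply(seq_len(K), function(k) {
    sigma_k <- matrix(params$cov[, , k], nrow = P, ncol = P)
    log(params$proportions[k]) + log_dmvnorm(x, params$mean[, k], sigma_k)
  })
  # Force an n x K matrix even when there is a single observation.
  log_num <- matrix(log_num, nrow = nrow(x), ncol = K)
  max.col(log_num, ties.method = "first")
}

# Invented parameters: two well-separated clusters in two dimensions.
params <- list(
  proportions = c(0.5, 0.5),
  mean = cbind(c(0, 0), c(5, 5)),
  cov = array(rep(diag(2), 2), dim = c(2, 2, 2))
)
newdata <- rbind(c(0.1, -0.2), c(4.8, 5.3))
map_predict(params, newdata)
```

Working on the log scale avoids numerical underflow when densities are very small, which is a common design choice for this kind of computation.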

References

Coraggio, Luca and Pietro Coretto (2023). Selecting the number of clusters, clustering models, and algorithms. A unifying approach based on the quadratic discriminant score. Journal of Multivariate Analysis, Vol. 196(105181), 1-20. doi:10.1016/j.jmva.2023.105181

Examples

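# Load the package and data
# (assumption: gmix and banknote are provided by the qcluster package)
library(qcluster)
data(banknote)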
dat <- banknote[,-1]

# Estimate a 3-component Gaussian mixture model
set.seed(123)
res <- gmix(dat, K = 3)

# Cluster labels returned by gmix
print(res$cluster)

# Predict the cluster of a single point
# (drop = FALSE keeps the one-row table structure)
predict(res, dat[1, , drop = FALSE])

# Predict cluster on a subset
predict(res, dat[1:10, ])

# Predicted clusters on the original dataset equal the clustering from the gmix model
all(predict(res, dat) == res$cluster)
