Missingness-Aware Gaussian Mixture Models
Zachary McCaw Updated: 2026-02-26
This package performs estimation and inference for Gaussian Mixture Models (GMMs) where the input data may contain missing values. Rather than imputing missing values before fitting the GMM, this package uses an extended EM algorithm to obtain the true maximum likelihood estimates of all model parameters given the observed data. In particular MGMM performs the following tasks:
- Maximum likelihood estimation of cluster means, covariances, and proportions.
- Calculation of cluster membership probabilities and maximum a posteriori classification of the input vectors.
- Deterministic completion of the input data, by imputing missing elements to their posterior means, and stochastic completion of the input data, by drawing missing elements from the fitted GMM.
The method is detailed in Fitting Gaussian mixture models on incomplete data.
Main Functions
FitGMMestimates model parameters, performs classification and imputation.rGMMsimulates observations from a GMM, potentially with missingness.ChooseKprovides guidance on choosing the number of clusters.GenImputationperforms stochastic imputation for multiple imputation-based inference.
Compact Example
set.seed(101)
library(MGMM)
# Parameter settings.
mean_list <- list(
c(1, 1),
c(-1, -1)
)
cov_list <- list(
matrix(c(1, -0.5, -0.5, 1), nrow = 2),
matrix(c(1, 0.5, 0.5, 1), nrow = 2)
)
# Generate data.
data <- rGMM(
n = 1e3,
d = 2,
k = 2,
miss = 0.1,
means = mean_list,
covs = cov_list
)
# Original data.
head(data)## y1 y2
## 1 1.6512855 2.60621938
## 2 -0.5721069 NA
## 2 -2.0045376 -2.31888263
## 2 -0.6229388 -1.51543968
## 1 2.0258413 0.06921658
## 2 -1.3476380 -1.51915826# Choose cluster number.
choose_k <- ChooseK(
data,
k0 = 2,
k1 = 4,
boot = 10,
maxit = 10,
eps = 1e-4,
report = TRUE
)## Cluster size 2 complete. 11 fit(s) succeeded.
## Cluster size 3 complete. 11 fit(s) succeeded.
## Cluster size 4 complete. 11 fit(s) succeeded.# Cluster number recommendations.
show(choose_k$Choices)## Metric k_opt Metric_opt k_1se Metric_1se
## 1 BIC 4 2285.4269195 2 2340.4441587
## 2 CHI 4 4.5081110 4 4.5081110
## 3 DBI 2 0.7818011 2 0.7818011
## 4 SIL 2 0.4785951 2 0.4785951# Estimation.
fit <- FitGMM(
data,
k = 2,
maxit = 10
)## Objective increment: 11.1
## Objective increment: 2.39
## Objective increment: 1.57
## Objective increment: 1.08
## Objective increment: 0.91
## Objective increment: 0.745
## Objective increment: 0.617
## Objective increment: 0.507
## Objective increment: 0.416
## Objective increment: 0.34
## 10 update(s) performed without reaching tolerance limit.# Estimated means.
show(fit@Means)## [[1]]
## y1 y2
## -1.056071 -1.070412
##
## [[2]]
## y1 y2
## 0.9437473 0.9513797# Estimated covariances.
show(fit@Covariances)## [[1]]
## y1 y2
## y1 0.9447484 0.5201638
## y2 0.5201638 0.9611714
##
## [[2]]
## y1 y2
## y1 0.9973684 -0.4489898
## y2 -0.4489898 0.9728258# Cluster assignments.
head(fit@Assignments)## Assignments Entropy
## 1 2 7.957793e-02
## 2 1 8.426200e-01
## 2 1 8.629837e-07
## 2 1 7.790894e-03
## 1 2 8.451183e-02
## 2 1 4.521627e-04# Deterministic imputation.
head(fit@Completed)## y1 y2
## 1 1.6512855 2.60621938
## 2 -0.5721069 -0.14379539
## 2 -2.0045376 -2.31888263
## 2 -0.6229388 -1.51543968
## 1 2.0258413 0.06921658
## 2 -1.3476380 -1.51915826# Stochastic imputation.
imp <- GenImputation(fit)
head(imp)## y1 y2
## 1 1.6512855 2.60621938
## 2 -0.5721069 0.88613853
## 2 -2.0045376 -2.31888263
## 2 -0.6229388 -1.51543968
## 1 2.0258413 0.06921658
## 2 -1.3476380 -1.51915826Documentation
A detailed write-up with derivations and examples (LaTeX-compiled PDF) is here.