SSLfmm (version 0.1.0)

EM_FMM_SemiSupervised: EM for Semi-Supervised FMM with a Mixed-Missingness Mechanism (MCAR + entropy-based MAR)

Description

Runs an EM-like procedure for a semi-supervised finite mixture model whose missing-label mechanism mixes two components: the unlabelled indicator \(m_j\) is MCAR with probability \(\alpha\), and entropy-based MAR enters through the logistic link \(q_j = \text{logit}^{-1}(\xi_0 + \xi_1 \log e_j)\), where \(e_j\) is an entropy-like value for observation \(j\). Supports shared (ncov = 1) or class-specific (ncov = 2) covariance.
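
For intuition, the entropy-based MAR link can be sketched in base R; plogis() is the inverse logit, and the xi0, xi1, and e_j values below are illustrative placeholders, not package defaults:

  xi0 <- 0; xi1 <- 1                    ## illustrative coefficients (init_res$xi)
  e_j <- 0.5                            ## illustrative per-observation entropy
  q_j <- plogis(xi0 + xi1 * log(e_j))   ## logit^(-1)(xi0 + xi1 * log e_j)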

Usage

EM_FMM_SemiSupervised(
  data,
  g = 2,
  init_res,
  max_iter = 5,
  tol = 1e-06,
  ncov = 1,
  verbose = FALSE
)

Value

A list with elements: pi (mixture weights), mu (component means), Sigma (covariance matrix or matrices), xi (logistic coefficients), alpha (MCAR probability), loglik, and ncov.
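
After a fit, the estimated components can be read off the returned list, e.g. with the fit object from the Examples below:

  fit$pi      ## estimated mixture weights
  fit$alpha   ## estimated MCAR probability
  fit$xi      ## estimated logistic coefficients (xi0, xi1)
  fit$loglik  ## log-likelihood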

Arguments

data

A data.frame or matrix with \(p+2\) columns: the first \(p\) columns are features, followed by missing (0 = labelled, 1 = unlabelled) and z (the class label for labelled rows; ignored otherwise).

g

Integer, number of mixture components (classes).

init_res

A list with initial parameters (a minimal hand-built example follows this list):

  • pi: numeric length-g (mixture weights, sum to 1)

  • mu: list of length g, each length-p mean vector

  • Sigma: if ncov = 1, a single p x p matrix; if ncov = 2, a list of g matrices, each p x p

  • alpha: scalar in (0,1)

  • xi: numeric length-2, logistic coefficients (xi0, xi1)
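
For example, a minimal hand-built init_res for g = 2, p = 2, and a shared covariance (ncov = 1) might look as follows; the values are illustrative only, and in practice the complete-data initializer shown in the Examples is the easier route:

  init_manual <- list(
    pi    = c(0.5, 0.5),                ## mixture weights, sum to 1
    mu    = list(c(-1, 0), c(1, 0)),    ## one length-p mean per class
    Sigma = diag(2),                    ## shared p x p covariance (ncov = 1)
    alpha = 0.5,                        ## MCAR probability in (0,1)
    xi    = c(0, 1)                     ## logistic coefficients (xi0, xi1)
  )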

max_iter

Integer, maximum number of EM iterations.

tol

Convergence tolerance on log-likelihood increase.
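
A stopping rule of the usual form is assumed here as a sketch, with illustrative values rather than the package's verbatim code:

  ## sketch: declare convergence once the log-likelihood gain drops below tol
  loglik_old <- -250.3; loglik_new <- -250.2999; tol <- 1e-06
  converged  <- (loglik_new - loglik_old) < tol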

ncov

Integer covariance structure: 1 = shared/equal, 2 = class-specific/unequal.

verbose

Logical; if TRUE, progress messages are printed using message(). Default is FALSE.

Details

This function expects the following helpers to be available:

  • pack_theta(pi_k, mu_k, Sigma_k, g, p, ncov)

  • unpack_theta(theta, g, p, ncov)

  • neg_loglik(theta, Y_all, m_j, Z_all, d2_yj, xi, alpha_k, unpacker)

  • get_entropy(dat, n, p, g, paralist) returning per-observation entropy-like values
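
For intuition only, a per-observation entropy-like value is commonly the Shannon entropy of the posterior class probabilities; the sketch below assumes tau is an n x g matrix of posteriors and is not the package's internal implementation:

  ## Shannon entropy of each row of a posterior-probability matrix (sketch)
  entropy_sketch <- function(tau) {
    -rowSums(tau * log(pmax(tau, .Machine$double.eps)))
  }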

Examples

# \donttest{
  ## Toy example using a simple partially-labelled dataset
  set.seed(1)

  ## 1) Construct an n x (p+2) partially-labelled dataset:
  ##    first p columns = features, then 'missing' and 'z'
  n <- 100; p <- 2; g <- 2

  X <- matrix(rnorm(n * p), nrow = n, ncol = p)

  ## missing: 0 = labelled, 1 = unlabelled
  missing <- rbinom(n, size = 1, prob = 0.3)

  ## z: observed class labels for labelled rows, NA for unlabelled
  z <- rep(NA_integer_, n)
  z[missing == 0] <- sample(1:g, sum(missing == 0), replace = TRUE)

  sim_dat <- data.frame(X, missing = missing, z = z)

  ## 2) Warm-up initialisation using the complete-data initializer
  init <- EM_FMM_SemiSupervised_Complete_Initial(
    data = sim_dat,
    g    = g,
    ncov = 1
  )

  ## 3) Run the main EM algorithm (small number of iterations)
  fit <- EM_FMM_SemiSupervised(
    data     = sim_dat,
    g        = g,
    init_res = init,
    ncov     = 1,
    max_iter = 5,
    verbose  = FALSE
  )

  str(fit)
# }
