GMKMcharlie (version 1.1.5)

GM: Multithreaded Gaussian mixture trainer

Description

The traditional maximum-likelihood training algorithm for parameterizing weighted data with a mixture of Gaussian PDFs. Bounds on covariance matrix eigenvalue ratios and on mixture weights are optional.

Usage

GM(
  X,
  Xw = rep(1, ncol(X)),
  alpha = numeric(0),
  mu = matrix(ncol = 0, nrow = 0),
  sigma = matrix(ncol = 0, nrow = 0),
  G = 5L,
  convergenceEPS = 1e-05,
  convergenceTail = 10,
  alphaEPS = 0,
  eigenRatioLim = Inf,
  embedNoise = 1e-6,
  maxIter = 1000L,
  maxCore = 7L,
  tlimit = 3600,
  verbose = TRUE,
  updateAlpha = TRUE,
  updateMean = TRUE,
  updateSigma = TRUE,
  checkInitialization = FALSE,
  KmeansFirst = TRUE,
  KmeansPPfirst = FALSE,
  KmeansRandomSeed = NULL,
  friendlyOutput = TRUE
  )

Arguments

X

A d x N numeric matrix where N is the number of observations --- each column is an observation, and d is the dimensionality. Column-observation representation promotes cache locality.
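
For instance (a minimal sketch using the built-in iris data), a typical row-per-observation data frame can be transposed into this layout:

X = t(as.matrix(iris[, 1:4]))  # d = 4 dimensions, N = 150 observations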

Xw

A numeric vector of size N. Xw[i] is the weight on observation X[, i]. Users should normalize Xw such that its elements sum up to N. The default assigns uniform weight to every observation.
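
For example, a sketch of rescaling hypothetical raw nonnegative weights w so that they sum to N:

w = runif(ncol(X))          # hypothetical raw weights
Xw = w * ncol(X) / sum(w)   # now sum(Xw) == N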

alpha

A numeric vector of size K, where K is the number of Gaussian kernels in the mixture model. alpha contains the initial mixture weights and should sum up to 1. Default empty.

mu

A d x K numeric matrix. mu[, i] is the initial mean for the ith Gaussian kernel. Default empty.

sigma

Either a list of d x d matrices, or a d^2 x K numeric matrix. For the latter, the ith column is the flattened d x d initial covariance matrix of the ith Gaussian kernel. In R, as.numeric(aMatrix) gives the flattened version of aMatrix. The covariance matrix of each Gaussian kernel MUST be positive-definite. Default empty.
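
As a sketch, a hypothetical list of K positive-definite d x d matrices, covList, can be flattened into the d^2 x K form like so:

sigma = sapply(covList, as.numeric)  # each column is one flattened covariance matrix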

G

An integer. If at least one of the parameters alpha, mu, sigma is empty, the program will initialize G Gaussian kernels via deterministic K-means++ initialization. See KMppIni(). Otherwise G is ignored. Default 5.

convergenceEPS

A numeric value. If the average change of all parameters in the mixture model, relative to those in the previous iteration, is below convergenceEPS, the program ends. Checking convergence this way is faster than recomputing the log-likelihood every iteration. Default 1e-5.

convergenceTail

An integer. If every one of the last convergenceTail iterations produces a relative increase in log-likelihood below convergenceEPS, the program stops. Default 10.

alphaEPS

A numeric value. During training, if any Gaussian kernel's weight is no greater than alphaEPS, the kernel is deleted. Default 0.

eigenRatioLim

A numeric value. During training, if any Gaussian kernel's max:min eigenvalue ratio exceeds eigenRatioLim, the kernel is treated as degenerate and deleted. Thresholding eigenvalue ratios limits the effect of degenerate kernels at an early stage. Default Inf.
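
A rough manual check of this ratio for a single covariance matrix S (a sketch, not the package's internal routine) could be:

ev = eigen(S, symmetric = TRUE, only.values = TRUE)$values
max(ev) / min(ev)  # compare against eigenRatioLim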

embedNoise

A small constant added to the diagonal entries of all covariance matrices. This can prevent covariance matrices from collapsing prematurely. A suggested value is 1e-6. Covariance degeneration is detected during Cholesky decomposition and leads the trainer to remove the corresponding mixture component. For high-dimensional problems, a nonzero embedNoise can create the illusion of a massive log-likelihood: one or more mixture components become nearly singular, which makes the densities around them extremely high.

maxIter

An integer, the maximal number of iterations.

maxCore

An integer. The maximal number of threads to invoke. It should be no more than the total number of logical processors on the machine. Default 7.

tlimit

A numeric value. The program exits with the current model once tlimit seconds have elapsed. Default 3600.

verbose

A boolean value. TRUE prints progress.

updateAlpha

A boolean value or a boolean vector. If a boolean value, TRUE implies the weights on all mixture components are subject to update; otherwise they stay unchanged during training. If a boolean vector, its size should equal the number of mixture components, and updateAlpha[i] == TRUE implies the weight on the ith component is subject to update. Regardless of updateAlpha, the output will have normalized mixture weights.

updateMean

A boolean value or a boolean vector. If a boolean value, TRUE implies the means of all mixture components are subject to update; otherwise they stay unchanged during training. If a boolean vector, its size should equal the number of mixture components, and updateMean[i] == TRUE implies the mean of the ith component is subject to update.

updateSigma

A boolean value or a boolean vector. If a boolean value, TRUE implies the covariances of all mixture components are subject to update; otherwise they stay unchanged during training. If a boolean vector, its size should equal the number of mixture components, and updateSigma[i] == TRUE implies the covariance of the ith component is subject to update.

checkInitialization

A boolean value. TRUE checks whether any of the initial covariance matrices is singular. Default FALSE.

KmeansFirst

A boolean value. TRUE runs K-means clustering first to find the initial means. Default TRUE.

KmeansPPfirst

A boolean value. TRUE runs stochastic K-means++ for the K-means initialization. Default FALSE.

KmeansRandomSeed

An integer or NULL, the random seed for K-means++.

friendlyOutput

A boolean value. TRUE returns the covariance matrices as a list of d x d matrices rather than as a single d^2 x K matrix whose columns are flattened covariance matrices. Default TRUE.

Value

A list of size 5:

alpha

a numeric vector of size K. The mixture weights.

mu

a d x K numeric matrix. Each column is the mean of a Gaussian kernel.

sigma

a d^2 x K numeric matrix when friendlyOutput = FALSE; each column is the flattened covariance matrix of a Gaussian kernel, and matrix(sigma[, i], nrow = d) recovers the covariance matrix of the ith kernel. When friendlyOutput = TRUE, a list of K d x d covariance matrices.
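
As a minimal sketch (assuming fit is the returned list, d the dimensionality, and friendlyOutput = FALSE), all covariance matrices can be recovered at once:

covList = lapply(seq_len(ncol(fit$sigma)), function(i) matrix(fit$sigma[, i], nrow = d))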

fitted

a numeric vector of size N. fitted[i] is the probability density of the ith observation given by the mixture model.

clusterMember

a list of K integer vectors giving the hard clustering inferred from the mixture model. Each vector contains the indices of the observations (columns of X) assigned to the corresponding kernel.
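
If one label per observation is preferred, a minimal sketch (assuming fit is the returned list and X the input data) is:

labels = integer(ncol(X))
for (k in seq_along(fit$clusterMember)) labels[fit$clusterMember[[k]]] = k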

Warning

For one-dimensional data, X should still follow the data structure requirement: a matrix where each column is an observation, i.e., a 1 x N matrix.
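
For example, a univariate sample would be passed as a 1 x N matrix (a sketch):

fit1d = GMKMcharlie::GM(matrix(rnorm(1000), nrow = 1), G = 2L, maxCore = 1L)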

Details

An in-place Cholesky decomposition of the covariance matrix is implemented for space and speed efficiency. Only the upper triangle of the Cholesky factor is stored for each Gaussian kernel, and it is fed to a forward-substitution routine for fast computation of Mahalanobis distances. Each of the three main steps in an iteration --- Gaussian density computation, parameter inference, parameter update --- is multithreaded and hand-scheduled using Intel TBB. Extensive effort has gone into cache-friendliness and into minimizing multithreading overheads such as memory allocation.
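
The following plain-R sketch (not the package's C++ implementation) illustrates the Cholesky-plus-forward-substitution idea behind the Mahalanobis distance computation:

S  = cov(matrix(rnorm(400), ncol = 4))       # example covariance matrix
mu = rnorm(4); x = rnorm(4)
U  = chol(S)                                 # upper-triangular factor, S = t(U) %*% U
z  = backsolve(U, x - mu, transpose = TRUE)  # forward substitution: t(U) z = x - mu
sum(z ^ 2)                                   # equals t(x - mu) %*% solve(S) %*% (x - mu)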

If eigenRatioLim is finite, eigen decomposition employs routines in RcppArmadillo.

Examples

# NOT RUN {
# =============================================================================
# Examples below use 1 thread to pass CRAN check. Speed advantage of multiple
# threads will be more pronounced for larger data.
# =============================================================================


# =============================================================================
# Parameterize the iris data. Let the function initialize Gaussian kernels.
# =============================================================================
X = t(iris[1:4])
# CRAN check only allows 2 threads at most. Increase `maxCore` for
# acceleration.
gmmRst = GMKMcharlie::GM(X, G = 4L, maxCore = 1L, friendlyOutput = FALSE)
str(gmmRst)




# =============================================================================
# Parameterize the iris data given Gaussian kernels.
# =============================================================================
G = 3L
d = nrow(X) # Dimensionality.
alpha = rep(1, G) / G
mu = X[, sample(ncol(X), G)] # Sample observations as initial means.
# Take the average variance and create initial covariance matrices.
meanVarOfEachDim = sum(diag(var(t(X)))) / d
covar = diag(meanVarOfEachDim / G, d)
covars = matrix(rep(as.numeric(covar), G), nrow = d * d)


# Models are sensitive to initialization.
gmmRst2 = GMKMcharlie::GM(
  X, alpha = alpha, mu = mu, sigma = covars, maxCore = 1L,
  friendlyOutput = FALSE)
str(gmmRst2)




# =============================================================================
# For fun, fit Rosenbrock function with a Gaussian mixture.
# =============================================================================
set.seed(123)
rosenbrock <- function(x, y) {(1 - x) ^ 2 + 100 * (y - x ^ 2) ^ 2}
N = 2000L
x = runif(N, -2, 2)
y = runif(N, -1, 3)
z = rosenbrock(x, y)


X = rbind(x, y)
Xw = z * (N / sum(z)) # Weights on observations should sum up to N.
gmmFit = GMKMcharlie::GM(X, Xw = Xw, G = 5L, maxCore = 1L, verbose = FALSE,
  friendlyOutput = FALSE)


oldpar = par()$mfrow
par(mfrow = c(1, 2))
plot3D::points3D(x, y, z, pch = 20)
plot3D::points3D(x, y, gmmFit$fitted, pch = 20)
par(mfrow = oldpar)




# =============================================================================
# For fun, fit a 3D spiral distribution.
# =============================================================================
N = 2000
t = runif(N) ^ 2 * 15
x = cos(t) + rnorm(N) * 0.1
y = sin(t) + rnorm(N) * 0.1
z = t + rnorm(N) * 0.1


X = rbind(x, y, z)
d = 3L
G = 10L
gmmFit = GMKMcharlie::GM(X, G = G, maxCore = 1L, verbose = FALSE,
  KmeansFirst = TRUE, KmeansPPfirst = TRUE, KmeansRandomSeed = 42,
  friendlyOutput = TRUE)
# Sample N points from the Gaussian mixture.
ns = as.integer(round(N * gmmFit$alpha))
sampledPoints = list()
for(i in 1:G)
{
  sampledPoints[[i]] = MASS::mvrnorm(
    ns[i], mu = gmmFit$mu[, i], Sigma = matrix(gmmFit$sigma[[i]], nrow = d))
}
sampledPoints =
  matrix(unlist(lapply(sampledPoints, function(x) t(x))), nrow = d)


# Plot the original data and the samples from the mixture model.
oldpar = par()$mfrow
par(mfrow = c(1, 2))
plot3D::points3D(x, y, z, pch = 20)
plot3D::points3D(x = sampledPoints[1, ],
                 y = sampledPoints[2, ],
                 z = sampledPoints[3, ], pch = 20)
par(mfrow = oldpar)




# =============================================================================
# For fun, fit a 3D spiral distribution. Fix parameters at random.
# =============================================================================
N = 2000
t = runif(N) ^ 2 * 15
x = cos(t) + rnorm(N) * 0.1
y = sin(t) + rnorm(N) * 0.1
z = t + rnorm(N) * 0.1


X = rbind(x, y, z); dimnames(X) = NULL
d = 3L
G = 10L
mu = X[, sample(ncol(X), G)]
s = matrix(rep(as.numeric(cov(t(X))), G), ncol = G)
alpha = rep(1 / G, G)
updateAlpha = sample(c(TRUE, FALSE), G, replace = TRUE)
updateMean = sample(c(TRUE, FALSE), G, replace = TRUE)
updateSigma = sample(c(TRUE, FALSE), G, replace = TRUE)
gmmFit = GMKMcharlie::GM(X, alpha = alpha, mu = mu, sigma = s, G = G,
                         maxCore = 2, verbose = FALSE,
                         updateAlpha = updateAlpha,
                         updateMean = updateMean,
                         updateSigma = updateSigma,
                         convergenceEPS = 1e-5, alphaEPS = 1e-8,
                         friendlyOutput = TRUE)
# }
