mcmc_IMIFA: Adaptive Gibbs Sampler for Nonparameteric Model-based Clustering using models from the IMIFA family

Description

Carries out Gibbs sampling for all models from the IMIFA family, facilitating model-based clustering with dimensionally reduced factor-analytic covariance structures, with automatic estimation of the number of clusters and cluster-specific factors as appropriate to the method employed. Factor analysis with one group (FA/IFA), finite mixtures (MFA/MIFA), overfitted mixtures (OMFA/OMIFA), infinite factor models which employ the multiplicative gamma process (MGP) shrinkage prior (IFA/MIFA/OMIFA/IMIFA), and infinite mixtures which employ Dirichlet Process Mixture Models (IMFA/IMIFA) are all provided. Creates a raw object of class 'IMIFA' from which the optimal/modal model can be extracted by get_IMIFA_results.

Usage

mcmc_IMIFA(dat = NULL, method = c("IMIFA", "IMFA", "OMIFA", "OMFA", "MIFA",
  "MFA", "IFA", "FA", "classify"), n.iters = 25000L, range.G = NULL,
  range.Q = NULL, burnin = n.iters/5, thinning = 2L, centering = TRUE,
  scaling = c("unit", "pareto", "none"), uni.type = c("unconstrained",
  "isotropic", "constrained", "single"), uni.prior = c("unconstrained",
  "isotropic"), alpha = NULL, psi.alpha = NULL, psi.beta = NULL,
  mu.zero = NULL, sigma.mu = NULL, sigma.l = NULL, z.init = c("mclust",
  "kmeans", "list", "priors"), z.list = NULL, adapt = TRUE, prop = NULL,
  epsilon = NULL, alpha.d1 = NULL, alpha.d2 = NULL, beta.d1 = NULL,
  beta.d2 = NULL, nu = NULL, nuplus1 = TRUE, adapt.at = NULL,
  b0 = NULL, b1 = NULL, trunc.G = NULL, learn.alpha = TRUE,
  alpha.hyper = NULL, ind.slice = TRUE, rho = NULL, IM.lab.sw = TRUE,
  discount = NULL, learn.d = FALSE, d.hyper = NULL, kappa = NULL,
  zeta = NULL, tune.zeta = NULL, mu0g = FALSE, psi0g = FALSE,
  delta0g = FALSE, mu.switch = TRUE, score.switch = TRUE,
  load.switch = TRUE, psi.switch = TRUE, pi.switch = TRUE,
  verbose = interactive())

Arguments

dat

A matrix or data frame such that rows correspond to observations (N) and columns correspond to variables (P). Non-numeric variables and rows with missing entries will be removed.

method

An acronym for the type of model to fit where:

"FA": Factor Analysis
"MFA": Mixtures of Factor Analysers
"MIFA": Mixtures of Infinite Factor Analysers
"OMFA": Overfitted Mixtures of Factor Analysers
"OMIFA": Overfitted Mixtures of Infinite Factor Analysers
"IMFA": Infinite Mixtures of Factor Analysers
"IMIFA": Infinite Mixtures of Infinite Factor Analysers

The "classify" method is not yet implemented.

n.iters

The number of iterations to run the Gibbs sampler for.

range.G

Depending on the method employed, either the range of values for the number of clusters, or the conseratively high starting value for the number of clusters. Defaults to 1 for the "FA" and "IFA" methods. For the "MFA" and "MIFA" models this is to be given as a range of candidate models to explore. For the "OMFA", "OMIFA", "IMFA", and "IMIFA" models, this is the number of clusters with which the chain is to be initialised, in which case the default is min(N - 1, max(25, ceiling(3 * log(N)))). For the "OMFA", and "OMIFA" models this upper limit remains fixed for the entire length of the chain; range.G also doubles as the default trunc.G for the "IMFA" and "IMIFA" models. However, when N < P, or when this bound is close to or exceeds N for any of these overfitted/infinite mixture models, it is better to initialise at a value closer to the truth (i.e. ceiling(log(N)) by default), though the upper bound remains the same - as a result the role of range.G when N < P is no longer to specify the upper bound (which can still be modified via trunc.G, at least for the "IMFA" and "IMIFA" methods) and the number of groups used for initialisation, but rather just the number of groups used for initialisation only. If length(range.G) * length(range.Q) is large, consider not storing unnecessary parameters, or breaking up the range of models to be explored into chunks, and sending each chunk to get_IMIFA_results.

range.Q

Depending on the method employed, either the range of values for the number of latent factors, or, for methods ending in IFA the conservatively high starting value for the number of cluster-specific factors, in which case the default starting value is floor(3 * log(P)). For methods ending in IFA, different clusters can be modelled using different numbers of latent factors (incl. zero); for methods not ending in IFA it is possible to fit zero-factor models, corresponding to simple diagonal covariance structures. For instance, fitting the "IMFA" model with range.Q=0 corresponds to a vanilla Dirichlet Process Mixture Model. If length(range.G) * length(range.Q) is large, consider not storing unnecessary parameters or breaking up the range of models to be explored into chunks, and sending each chunk to get_IMIFA_results.

burnin

The number of burn-in iterations for the sampler. Defaults to n.iters/5. Note that chains can also be burned in later, using get_IMIFA_results.

thinning

The thinning interval used in the simulation. Defaults to 2. No thinning corresponds to 1. Note that chains can also be thinned later, using get_IMIFA_results.

centering

A logical value indicating whether mean centering should be applied to the data, defaulting to TRUE.

scaling

The scaling to be applied - one of "unit", "none" or "pareto".

uni.type

This argument specifies the type of constraint, if any, to be placed on the uniquenesses/idiosyncratic variances, i.e. whether a general diagonal matrix or isotropic diagonal matrix is to be assumed, and in turn whether these matrices are constrained to be equal across groups. The default "unconstrained" corresponds to factor analysis (and mixtures thereof), whereas "isotropic" corresponds to probabilistic principal components analysers (and mixtures thereof). Constraints may be particularly useful when N < P, though caution is advised when employing constraints for any of the infinite factor models, especially "isotropic" and "single", which may lead to overestimation of the number of clusters &/or factors if this specification is inappropriate. The four options correspond to the following 4 parsimonious Gaussian mixture models:

"unconstrained": UUU - variable-specific and cluster-specific: \(\Psi_g = \Psi_g\).
"isotropic": UUC - cluster-specific, equal across variables: \(\Psi_g = \sigma_g^2 \mathcal{I}_p\).
"constrained": UCU - variable-specific, equal across clusters: \(\Psi_g = \Psi\).
"single": UCC - a single value equal across clusters and variables: \(\Psi_g = \sigma^2 \mathcal{I}_p\).

The first letter U here corresponds to constraints on loadings (not yet implemented), the second letter corresponds to constrained/unconstrained across clusters, and the third letter corresponds to the isotropic constraint. Of course, only the third letter is of relevance for the single-cluster "FA" and "IFA" models, such that "unconstrained" and "constrained" are equivalent for these models, and so too are "isotropic" and "single".

uni.prior

A switch indicating whether uniquenesses rate hyperparameters are to be "unconstrained" or "isotropic", i.e. variable-specific or not. "uni.prior" must be "isotropic" if the last letter of "uni.type" is C, but can take either value otherwise. Defaults to correspond to the last letter of uni.type if that is supplied and uni.prior is not, otherwise defaults to "unconstrained", but "isotropic" is recommended when N < P. Only relevant when "psi.beta" is not supplied and psi_hyper is therefore invoked.

alpha

Depending on the method employed, either the hyperparameter of the Dirichlet prior for the cluster mixing proportions, or the Dirichlet process concentration parameter. Defaults to 0.5/range.G for the Overfitted methods - if supplied for "OMFA" and "OMIFA" methods, you are supplying the numerator of alpha/range.G, which should be less than half the dimension (per group!) of the free parameters of the smallest model considered in order to ensure superfluous clusters are emptied (for "OMFA", this corresponds to the smallest range.Q; for "OMIFA", this corresponds to a zero-factor model) [see: PGMM_dfree and Rousseau and Mengersen (2011)]. Defaults to 1 for the finite mixture models "MFA" and "MIFA". Defaults to 1 - discount for the "IMFA" and "IMIFA" models if learn.alpha=FALSE or a simulation from the prior if learn.alpha=TRUE. Must be positive, unless discount is supplied for the "IMFA" or "IMIFA" methods.

psi.alpha

The shape of the inverse gamma prior on the uniquenesses. Defaults to 2.5 if uni.type is one of "unconstrained" or "constrained", otherwise defaults to 3.5.

psi.beta

The rate of the inverse gamma prior on the uniquenesses. Can be either a single parameter, a vector of variable specific rates, or a matrix of variable and group-specific rates. If this is not supplied, psi_hyper is invoked to choose sensible values, depending on the value of uni.prior and, for the "MFA" and "MIFA" models, the value of psi0g.

mu.zero

The mean of the prior distribution for the mean parameter. Defaults to the sample mean of the data.

sigma.mu

The covariance of the prior distribution for the mean parameter. Can be a scalar times the identity or a matrix of appropriate dimension. Defaults to the sample covariance matrix.

sigma.l

The covariance of the prior distribution for the loadings. Defaults to 1. Only relevant for the finite factor methods.

z.init

The method used to initialise the cluster labels. Defaults to Mclust. Not relevant for the "FA" and ""IFA" methods.

z.list

A user supplied list of cluster labels. Only relevant if z.init == "z.list".

adapt

A logical value indicating whether adaptation of the number of cluster-specific factors is to take place. Only relevant for methods ending in IFA, in which case the default is TRUE. Specifying FALSE and supplying range.Q provides a means to use the MGP prior in a finite factor context.

prop

Proportion of elements within the neighbourhood epsilon of zero necessary to consider a loadings column redundant. Defaults to floor(0.7 * P)/P. Only relevant for methods ending in IFA.

epsilon

Neighbourhood of zero within which a loadings entry is considered negligible according to prop. Defaults to 0.1. Only relevant for methods ending in IFA.

alpha.d1

Shape hyperparameter of the global shrinkage on the first column of the loadings according to the MGP shrinkage prior. Passed to MGP_check to ensure validity. Defaults to 3. Only relevant for methods ending in IFA.

alpha.d2

Shape hyperparameter of the global shrinkage on subsequent columns of the loadings according to the MGP shrinkage prior. Passed to MGP_check to ensure validity. Defaults to 6. Only relevant for methods ending in IFA.

beta.d1

Rate hyperparameter of the global shrinkage on the first column of the loadings according to the MGP shrinkage prior. Passed to MGP_check to ensure validity. Defaults to 1. Only relevant for methods ending in IFA.

beta.d2

Hyperparameter for the gamma prior on the local shrinkage parameters. Defaults to 2. Passed to MGP_check to ensure validity. Only relevant for methods ending in IFA.

nuplus1

Logical switch indicating whether the shape hyperparameter of the prior on the local shrinkage parameters is equal to nu + 1. If FALSE, it is simply equal to nu. Only relevant for methods ending in IFA.

adapt.at

The iteration at which adaptation is to begin. Defaults to burnin for the "IFA" and "MIFA" methods, defaults to 0 for the "OMIFA" and "IMIFA". Cannot exceed burnin. Only relevant for methods ending in IFA.

Intercept parameter for the exponentially decaying adaptation probability s.t. p(iter) = 1/exp(b0 + b1 * (iter - adapt.at)). Defaults to 0.1. Only relevant for methods ending in IFA.

Slope parameter for the exponentially decaying adaptation probability s.t. p(iter) = 1/exp(b0 + b1 * (iter - adapt.at)). Defaults to 0.00005. Only relevant for methods ending in IFA.

trunc.G

The maximum number of allowable and storable groups if the "IMFA" or "IMIFA" method is employed. Defaults to the same value as range.G (unless N < P, see range.G for details) and must be greater than or equal to this value. The number of active groups to be sampled at each iteration is adaptively truncated, with trunc.G as an upper limit for storage reasons. Note that large values of trunc.G may lead to memory capacity issues.

learn.alpha

Logical indicating whether the Dirichlet process / Pitman concentration parameter is to be learned, or remain fixed for the duration of the chain. If being learned, a Ga(a, b) prior is assumed for alpha; updates take place via Gibbs sampling when discount is zero and via Metropolis-Hastings otherwise. Only relevant for the "IMFA" and "IMIFA" methods, in which case the default is TRUE.

alpha.hyper

A vector of length 2 giving hyperparameters for the Dirichlet process / Pitman-Yor concentration parameter alpha. If isTRUE(learn.alpha), these are shape and rate parameter of a Gamma distribution. Defaults to Ga(2, 1). Only relevant for the "IMFA" and "IMIFA" methods, in which case the default is TRUE. The prior is shifted to have support on (-discount, Inf) when non-zero discount is supplied or learn.d=TRUE.

ind.slice

Logical indicitating whether the independent slice-efficient sampler is to be employed. If FALSE the dependent slice-efficient sampler is employed, whereby the slice sequence xi_1,...,xi_g is equal to the decreasingly ordered mixing proportions. Only relevant for the "IMFA" and "IMIFA" methods. Defaults to TRUE.

rho

Parameter controlling the rate of geometric decay for the independent slice-efficient sampler, s.t. xi = (1 - rho)rho^(g-1). Must lie in the interval (0, 1]. Higher values are associated with better mixing but longer run times. Defaults to 0.75, but 0.5 is an interesting special case which guarantees that the slice sequence xi_1,...,xi_g is equal to the expectation of the decreasingly ordered mixing proportions. Only relevant for the "IMFA" and "IMIFA" methods when ind.slice is TRUE.

IM.lab.sw

Logial indicating whether the two forced label switching moves are to be implemented (defaults to TRUE) when running one of the infinite mixture models, with Dirichlet process or Pitman-Yor process priors. Only relevant for the "IMFA" and "IMIFA" methods.

discount

The discount parameter used when generalising the Dirichlet process to the Pitman-Yor process. Must lie in the interval [0, 1). If non-zero, alpha can be supplied greater than -discount. Defaults to 0. Only relevant for the "IMFA" and "IMIFA" methods.

learn.d

Logical indicating whether the discount parameter is to be updated via Metropolis-Hastings. Only relevant for the "IMFA" and "IMIFA" methods, in which case the default is FALSE.

d.hyper

Hyperparameters for the Beta(a,b) prior on the discount hyperparameter. Only relevant for the "IMFA" and "IMIFA" methods.

kappa

The spike-and-slab prior distribution on the discount hyperparameter is assumed to be a mixture with point-mass at zero and a continuous Beta(a,b) distribution. kappa gives the weight of the point mass at zero (the 'spike'). Must lie in the interval [0,1]. Defaults to 0.5. Only relevant for the "IMFA" and "IMIFA" methods when isTRUE(learn.d). A value of 0 ensures non-zero discount values (i.e. Pitman-Yor) at all times, and vice versa.

zeta

Tuning parameter controlling the acceptance rate of the random-walk proposal for the Metropolis-Hastings steps when learn.alpha=TRUE, where 2 * zeta gives the full width of the uniform proposal distribution. These steps are only invoked when either discount is non-zero and fixed or learn.d=TRUE, otherwise alpha is learned by Gibbs updates. Must be strictly positive. Defauts to 2.

tune.zeta

Used for tuning zeta & the width of the uniform proposal for alpha via diminishing Robbins-Monro type adaptation, when that parameter is learned via Metropolis-Hastings. Must be given in the form of a list with the following named elements:

"heat": The initial adaptation intensity/step-size, such that larger values lead to larger updates. Must be strictly greater than zero. Defaults to 1 if not supplied but other elements of tune.zeta are.
"lambda": Iteration rescaling parameter which controls the speed at which adaptation diminishes, such that lower values cause the contribution of later iterations to diminish more slowly. Must lie in the interval (0.5, 1]. Defaults to 1 if not supplied but other elements of tune.zeta are.
"target": The target acceptance rate. Must lie in the interval [0, 1]. Defaults to 0.441, which is optimum for univariate targets, if not supplied but other elements of tune.zeta are.

tune.zeta is only relevant when isTRUE(learn.alpha) under the "IMFA" or "IMIFA" models, and either the discount remains fixed at a non-zero value, or when isTRUE(learn.d) and kappa < 1. Since Gibbs steps are invoked for updated alpha when discount == 0, adaption occurs according to a running count of the number of iterations with non-zero sampled discount values. If diminishing adaptation invoked, the posterior mean zeta will be stored. Since caution is advised when employing adaptation, note that acceptance rates of between 10-50% are generally considered adequate.

mu0g

Logical indicating whether the mu.zero hyperparameter can be cluster-specific. Defaults to FALSE. Only relevant for the "MFA" and "MIFA" methods when z.list is supplied.

psi0g

Logical indicating whether the psi.beta hyperparameter(s) can be cluster-specific. Defaults to FALSE. Only relevant for the "MFA" and "MIFA" methods when z.list is supplied, and only allowable when uni.type is one of unconstrained or isotropic.

delta0g

Logical indicating whether the alpha.d1 and alpha.d2 hyperparameters can be cluster-specific. Defaults to FALSE. Only relevant for the "MIFA" method when z.list is supplied.

mu.switch

Logical indicating whether the means are to be stored (defaults to TRUE). May be useful not to store if memory is an issue. Warning: posterior inference won't be posssible.

score.switch

Logical indicating whether the factor scores are to be stored. As the array containing each sampled scores matrix tends to be amongst the largest objects to be stored, this defaults to FALSE when length(range.G) * length(range.Q) > 10, otherwise the default is TRUE. May be useful not to store if memory is an issue - for the "MIFA", "OMIFA", and "IMIFA" methods, setting this switch to FALSE also offers a slight speed-up. Warning: posterior inference won't be posssible.

load.switch

Logical indicating whether the factor loadings are to be stored (defaults to TRUE). May be useful not to store if memory is an issue. Warning: posterior inference won't be posssible.

psi.switch

Logical indicating whether the uniquenesses are to be stored (defaults to TRUE). May be useful not to store if memory is an issue. Warning: posterior inference won't be posssible.

pi.switch

Logical indicating whether the mixing proportions are to be stored (defaults to TRUE). May be useful not to store if memory is an issue. Warning: posterior inference won't be posssible.

verbose

Logical indicating whether to print output (e.g. run times) and a progress bar to the screen while the sampler runs. By default is TRUE if the session is interactive, and FALSE otherwise. If FALSE, warnings and error messages will still be printed to the screen, but everything else will be suppressed.

Value

A list of lists of lists of class "IMIFA" to be passed to get_IMIFA_results. If the returned object is x, candidate models accesible via subsetting, where x is of the form x[[1:length(range.G)]][[1:length(range.Q)]]. However, these objects of class "IMIFA" should rarely if ever be manipulated by hand - use of the get_IMIFA_results function is strongly advised. Dedicated print and summary functions exist for objects of class "IMIFA".

References

Murphy, K., Gormley, I. C. and Viroli, C. (2017) Infinite Mixtures of Infinite Factor Analysers: Nonparametric Model-Based Clustering via Latent Gaussian Models, arXiv:1701.07010.

Bhattacharya, A. and Dunson, D. B. (2011) Sparse Bayesian infinite factor models, Biometrika, 98(2): 291-306.

Kalli, M., Griffin, J. E. and Walker, S. G. (2011) Slice sampling mixture models, Statistics and Computing, 21(1): 93-105.

McNicholas, P. D. and Murphy, T. B. (2008) Parsimonious Gaussian Mixture Models, Statistics and Computing, 18(3): 285-296.

Rousseau, J. and Mengersen, K. (2011) Asymptotic Behaviour of the posterior distribution in overfitted mixture models, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5): 689-710.

Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal component analysis, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3): 611-622.

Examples

Run this code

# NOT RUN {
# data(olive)
# data(coffee)

# Fit an IMIFA model to the olive data. Accept all defaults.
# simIMIFA <- mcmc_IMIFA(olive, method="IMIFA")
# summary(simIMIFA)

# Fit an IMIFA model assuming a Pitman-Yor prior.
# Allow the alpha and discount parameter to be learned.
# Control the balance between the DP and PY priors using the kappa parameter.
# simPY    <- mcmc_IMIFA(olive, method="IMIFA", learn.d=TRUE, kappa=0.5)
# summary(simPY)

# Fit a MFA model to the scaled olive data, with isotropic uniquenesses (i.e. MPPCA).
# Allow diagonal covariance as a special case where range.Q = 0. Accept all other defaults.
# simMFA   <- mcmc_IMIFA(olive, method="MFA", n.iters=10000, range.G=3:6,
#                        range.Q=0:3, centering=FALSE, uni.type="isotropic")

# Fit a MIFA model to the centered & scaled coffee data, w/ cluster labels initialised by K-Means.
# Note that range.Q doesn't need to be specified. Allow IFA as a special case where range.G=1.
# simMIFA  <- mcmc_IMIFA(coffee, method="MIFA", n.iters=10000, range.G=1:3, z.init="kmeans")

# Fit an IFA model to the centered and pareto scaled olive data.
# Note that range.G doesn't need to be specified. We can optionally supply a range.Q starting value.
# We can also enforce additional shrinkage using alpha.d1, alpha.d2, prop, and epsilon.
# simIFA   <- mcmc_IMIFA(olive, method="IFA", n.iters=10000, range.Q=4, scaling="pareto",
#                        alpha.d1=3.5, alpha.d2=7, prop=0.6, epsilon=0.12)

# Fit an OMIFA model to the centered & scaled coffee data.
# Supply a sufficiently small alpha value. Try varying other hyperparameters.
# Accept the default value for the starting number of factors,
# but supply a value for the starting number of clusters.
# Try contraining uniquenesses to be common across both variables and clusters.
# simOMIFA <- mcmc_IMIFA(coffee, method="OMIFA", range.G=10, psi.alpha=3,
#                        nu=3, alpha=0.8, uni.type="single")
# }

Run the code above in your browser using DataLab