get_IMIFA_results: Extract results, conduct posterior inference and compute performance metrics for MCMC samples of models from the IMIFA family

Description

This function post-processes simulations generated by mcmc_IMIFA for any of the IMIFA family of models. It can be re-run at little computational cost to extract different models explored by the sampler used for sims, without having to re-run the model itself. New results objects using different numbers of clusters and different numbers of factors (if visited by the model in question), or using different model selection criteria (if necessary), can be generated with ease. The function also performs post-hoc corrections for label switching, as well as post-hoc Procrustes rotation of loadings matrices and scores, to ensure sensible posterior parameter estimates, and constructs credible intervals.
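To make the Procrustes step concrete, here is a minimal base R sketch of an orthogonal Procrustes rotation that aligns one loadings matrix with a template. It is an illustration only and not IMIFA's internal code; the helper procrustes_rotate and the simulated matrices A and B are hypothetical.

```r
# Orthogonal Procrustes: find the orthogonal matrix R minimising
# the Frobenius norm of (B %*% R - A), then apply it to B.
# procrustes_rotate is a hypothetical helper, not an IMIFA function.
procrustes_rotate <- function(B, A) {
  s <- svd(crossprod(B, A))   # SVD of t(B) %*% A
  R <- s$u %*% t(s$v)         # optimal orthogonal rotation
  list(rotated = B %*% R, R = R)
}

set.seed(1)
# Template loadings: 10 variables, 2 factors (simulated for illustration).
A     <- matrix(rnorm(10 * 2), nrow = 10, ncol = 2)
theta <- pi / 3
Rot   <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
# The same loadings after an arbitrary rotation, mimicking the
# rotational invariance of factor models across MCMC samples.
B     <- A %*% Rot

fit <- procrustes_rotate(B, A)
max(abs(fit$rotated - A))     # should be numerically zero
```

Rotating every sampled loadings matrix towards a common template in this way is what makes averaging loadings across iterations meaningful.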

Usage

get_IMIFA_results(sims = NULL, burnin = 0L, thinning = 1L, G = NULL,
  Q = NULL, criterion = c("bicm", "aicm", "log.iLLH", "dic", "bic.mcmc",
  "aic.mcmc"), G.meth = c("mode", "median"), Q.meth = c("mode", "median"),
  dat = NULL, conf.level = 0.95, z.avgsim = FALSE, zlabels = NULL)

Arguments

sims

An object of class "IMIFA" generated by mcmc_IMIFA.

burnin

Optional additional number of iterations to discard. Defaults to 0, corresponding to no burnin.

thinning

Optional interval for extra thinning to be applied. Defaults to 1, corresponding to no thinning.

G

If this argument is not specified, results will be returned with the optimal number of clusters. If different numbers of clusters were explored in sims for the "MFA" or "MIFA" methods, supplying an integer value allows pulling out a specific solution with G clusters, even if the solution is sub-optimal. Similarly, this allows retrieval of samples corresponding to a solution, if visited, with G clusters for the "OMFA", "OMIFA", "IMFA" and "IMIFA" methods.

Q

If this argument is not specified, results will be returned with the optimal number of factors. If different numbers of factors were explored in sims for the "FA", "MFA", "OMFA" or "IMFA" methods, supplying an integer value allows pulling out a specific solution with Q factors, even if the solution is sub-optimal. Similarly, this allows retrieval of samples corresponding to a solution, if visited, with Q factors for the "IFA", "MIFA", "OMIFA" and "IMIFA" methods.

criterion

The criterion to use for model selection. Model selection is only required if more than one model was run under the "FA", "MFA", "MIFA", "OMFA" or "IMFA" methods when sims was created via mcmc_IMIFA. All of the criteria are calculated regardless; this argument merely indicates which one forms the basis of the output. Note that the first three options ("bicm", "aicm" and "log.iLLH") might exhibit bias in favour of zero-factor models for the finite factor "FA", "MFA", "OMFA" and "IMFA" methods, and might exhibit bias in favour of one-cluster models for the "MFA" and "MIFA" methods.

G.meth

If the object in sims arises from the "OMFA", "OMIFA", "IMFA" or "IMIFA" methods, this argument determines whether the optimal number of clusters is given by the mode or median of the posterior distribution of G. Defaults to "mode".

Q.meth

If the object in sims arises from the "IFA", "MIFA", "OMIFA" or "IMIFA" methods, this argument determines whether the optimal number of latent factors is given by the mode or median of the posterior distribution of Q. Defaults to "mode".

dat

The actual data set on which mcmc_IMIFA was originally run. This is necessary for computing error metrics between the estimated and empirical covariance matrix/matrices. If this is not supplied, the function will attempt to find the data set if it is still available in the global environment.

conf.level

The confidence level to be used throughout for credible intervals for all parameters of inferential interest. Defaults to 0.95.

z.avgsim

Logical indicating whether the clustering should also be summarised, in addition to the MAP estimate, by the clustering with minimum squared distance to the similarity matrix obtained by averaging the stored adjacency matrices; this is done via a call to Zsimilarity. Note that the MAP clustering is computed conditional on the estimate of the number of clusters (whether that be the modal estimate or the estimate according to criterion), and other parameters are extracted conditional on this estimate of G. In contrast, the number of distinct clusters in the summarised labels obtained with z.avgsim=TRUE may not coincide with the estimate of G, but may provide a useful alternative summary of the partitions explored during the chain. Be warned that this can take considerable time to compute, and may not even be possible if the number of observations and/or the number of stored iterations is large and the resulting matrix isn't sufficiently sparse, so the default is FALSE. When TRUE, both the summarised clustering and the similarity matrix are stored; the latter can be passed to plot.Results_IMIFA.

zlabels

For any method that performs clustering, the true labels can be supplied if they are known in order to compute clustering performance metrics. This also has the effect of ordering the MAP labels (and thus the ordering of cluster-specific parameters) to most closely correspond to the true labels if supplied.
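The effect of burnin, thinning, G.meth/Q.meth and conf.level can be pictured with a small base R sketch on simulated draws. This is not how IMIFA stores or processes its samples internally; the objects G_draws and mu_draws, the rounding of the median, and the use of equal-tailed quantile intervals are all assumptions made for illustration.

```r
set.seed(42)
n_stored <- 1000
# Hypothetical stored draws: the number of clusters and one scalar parameter.
G_draws  <- sample(2:5, n_stored, replace = TRUE, prob = c(0.1, 0.5, 0.3, 0.1))
mu_draws <- rnorm(n_stored, mean = 1, sd = 0.5)

# Extra burnin and thinning (assumed semantics): drop the first `burnin`
# stored draws, then keep every `thinning`-th draw of the remainder.
burnin   <- 100L
thinning <- 2L
keep     <- seq(from = burnin + 1L, to = n_stored, by = thinning)

# G.meth = "mode" versus G.meth = "median" on the retained draws.
G_kept   <- G_draws[keep]
G_mode   <- as.integer(names(which.max(table(G_kept))))
G_median <- as.integer(round(median(G_kept)))  # rounding is an assumption

# Equal-tailed credible interval at the requested level (an assumption;
# other interval constructions are possible).
conf.level <- 0.95
alpha      <- (1 - conf.level) / 2
ci         <- quantile(mu_draws[keep], probs = c(alpha, 1 - alpha))

G_mode
G_median
ci
```

Because all of these steps operate on draws that are already stored, changing them only requires another call to get_IMIFA_results, not another run of mcmc_IMIFA.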

Value

An object of class "Results_IMIFA" to be passed to plot.Results_IMIFA for visualising results. Dedicated print and summary functions exist for objects of this class. The object, say x, is a list of lists, the most important components of which are:

Clust: Everything pertaining to clustering performance can be found here for all but the "FA" and "IFA" methods, in particular x$Clust$map, the MAP summary of the posterior clustering. More detail is given if known zlabels are supplied: performance is always evaluated against the MAP clustering, with additional evaluation against the alternative clustering computed if z.avgsim=TRUE.
Error: Error metrics (e.g. MSE) between the empirical and estimated covariance matrix/matrices.
GQ.results: Everything pertaining to model choice can be found here, including posterior summaries for the estimated number of clusters and estimated number of factors, if applicable to the method employed. Information criterion values are also accessible here.
Means: Posterior summaries for the means.
Loadings: Posterior summaries for the factor loadings matrix/matrices. Posterior mean loadings given by x$Loadings$post.load are given the loadings class for printing purposes and thus the manner in which they are displayed can be modified.
Scores: Posterior summaries for the latent factor scores.
Uniquenesses: Posterior summaries for the uniquenesses.
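The alternative clustering summary produced when z.avgsim=TRUE can be sketched in base R as follows: average the adjacency matrices of the sampled label vectors, then pick the sampled clustering whose adjacency matrix is closest to that average in squared distance. This is a simplified illustration under assumed inputs (the matrix z_draws of label draws is simulated), not the implementation behind Zsimilarity.

```r
set.seed(7)
n_obs  <- 6
n_iter <- 50
# Hypothetical stored label draws: one row per iteration, one column per observation.
z_draws <- t(replicate(n_iter, sample(1:3, n_obs, replace = TRUE)))

# Adjacency matrix of one clustering: entry (i, j) is 1 if observations
# i and j share a cluster label, and 0 otherwise.
adjacency <- function(z) outer(z, z, "==") * 1

# Similarity matrix: the average of the adjacency matrices over iterations.
adj_list <- lapply(seq_len(n_iter), function(i) adjacency(z_draws[i, ]))
z_sim    <- Reduce(`+`, adj_list) / n_iter

# Summarise by the sampled clustering with minimum squared distance
# between its adjacency matrix and the similarity matrix.
dists <- vapply(adj_list, function(a) sum((a - z_sim)^2), numeric(1))
z_avg <- z_draws[which.min(dists), ]

z_avg
```

The similarity matrix is invariant to relabelling of the clusters, which is why the number of distinct labels in such a summary need not match the estimate of G used for the MAP clustering.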

References

Murphy, K., Gormley, I. C. and Viroli, C. (2017) Infinite Mixtures of Infinite Factor Analysers: Nonparametric Model-Based Clustering via Latent Gaussian Models, arXiv:1701.07010.

Examples

# NOT RUN {
# data(coffee)
# data(olive)

# Run a MFA model on the coffee data over a range of clusters and factors.
# simMFAcoffee  <- mcmc_IMIFA(coffee, method="MFA", range.G=2:3, range.Q=0:3, n.iters=1000)

# Accept all defaults to extract the optimal model.
# resMFAcoffee  <- get_IMIFA_results(simMFAcoffee)


# Instead, let's get results for a 3-cluster model, allowing Q to be chosen by aic.mcmc.
# resMFAcoffee2 <- get_IMIFA_results(simMFAcoffee, G=3, criterion="aic.mcmc")

# Run an IMIFA model on the olive data, accepting all defaults.
# simIMIFAolive <- mcmc_IMIFA(olive, method="IMIFA", n.iters=10000)

# Extract optimum results
# Estimate G & Q by the median of their posterior distributions
# Construct 90% credible intervals and try to return the similarity matrix.
# resIMIFAolive <- get_IMIFA_results(simIMIFAolive, G.meth="median", Q.meth="median",
#                                    conf.level=0.9, z.avgsim=TRUE)
# summary(resIMIFAolive)
# }
