MoE_control: Set control values for use with MoEClust

Description

Supplies a list of arguments (with defaults) for use with MoE_clust.

Usage

MoE_control(init.z = c("hc", "quantile", "kmeans", "mclust", "random", "list"),
            noise.args = list(...),
            asMclust = FALSE,
            equalPro = FALSE,
            exp.init = list(...),
            algo = c("EM", "CEM", "cemEM"),
            criterion = c("bic", "icl", "aic"),
            stopping = c("aitken", "relative"),
            z.list = NULL, 
            nstarts = 1L,
            eps = .Machine$double.eps,
            tol = c(1e-05, sqrt(.Machine$double.eps), 1e-08),
            itmax = c(.Machine$integer.max, .Machine$integer.max, 1000L),
            hc.args = list(...),
            km.args = list(...),
            posidens = TRUE,
            init.crit = c("bic", "icl"),
            warn.it = 0L,
            MaxNWts = 1000L,
            verbose = interactive(),
            ...)

Arguments

init.z

The method used to initialise the cluster labels. Defaults to "hc", i.e. model-based agglomerative hierarchical clustering tree as per hc, for multivariate data (see hc.args), or "quantile"-based clustering as per quant_clust for univariate data (unless there are expert network covariates incorporated via exp.init$joint &/or exp.init$clustMD, in which case the default is again "hc"). The "quantile" option is thus only available for univariate data when expert network covariates are not incorporated via exp.init$joint &/or exp.init$clustMD, or when expert network covariates are not supplied.

Other options include "kmeans" (see km.args), "random" initialisation (see nstarts below), a user-supplied "list", and a full run of Mclust (itself initialised via a model-based agglomerative hierarchical clustering tree, again see hc.args), although this last option "mclust" will be coerced to "hc" if there are no gating &/or expert covariates within MoE_clust (in order to better reproduce Mclust output).

When init.z="list", exp.init$clustMD is forced to FALSE; otherwise, when isTRUE(exp.init$clustMD) and the clustMD library is loaded, the init.z argument instead governs the method by which a call to clustMD is initialised. In this instance, "quantile" will instead default to "hc", and the arguments to hc.args and km.args will be ignored (unless all clustMD model types fail for a given number of components).

When init.z="mclust" or clustMD is successfully invoked (via exp.init$clustMD), the argument init.crit (see below) specifies the model-selection criterion ("bic" or "icl") by which the optimal Mclust or clustMD model type to initialise with is determined, and criterion remains unaffected.

Finally, when the model includes expert network covariates and isTRUE(exp.init$mahalanobis), the argument exp.init$estart (see below) can be used to modify the behaviour of init.z="random" when nstarts > 1, toggling between a full run of the EM algorithm for each random initialisation (i.e. exp.init$estart=FALSE, the default), or a single run of the EM algorithm starting from the best initial partition obtained among the random starts according to the iterative reallocation initialisation routine (i.e. exp.init$estart=TRUE).

noise.args

A list supplying select named parameters to control inclusion of a noise component in the estimation of the mixture. If either or both of the arguments tau0 &/or noise.init are supplied, a noise component is added to the model in the estimation.

tau0

Prior mixing proportion for the noise component. If supplied, a noise component will be added to the model in the estimation, with tau0 giving the prior probability of belonging to the noise component for all observations. Typically supplied as a scalar in the interval (0, 1), e.g. 0.1. Can be supplied as a vector when gating covariates are present and noise.args$noise.gate is TRUE. This argument can be supplied instead of or in conjunction with the argument noise.init below.

noise.init

A logical or numeric vector indicating an initial guess as to which observations are noise in the data. If numeric, the entries should correspond to row indices of the data. If supplied, a noise component will be added to the model in the estimation. This argument can be used in conjunction with tau0 above, or can be replaced by that argument also.

noise.gate

A logical indicating whether gating network covariates influence the mixing proportion for the noise component, if any. Defaults to TRUE, but leads to greater parsimony if FALSE. Only relevant in the presence of a noise component; only affects estimation in the presence of gating covariates.

noise.meth

The method used to estimate the volume when a noise component is invoked. Defaults to hypvol. For univariate data, this argument is ignored and the range of the data is used instead (unless noise.vol below is specified). The options "convexhull" and "ellipsoidhull" require loading the geometry and cluster libraries, respectively. This argument is only relevant if noise.vol below is not supplied.

noise.vol

This argument can be used to override the argument noise.meth by specifying the (hyper)volume directly, i.e. specifying an improper uniform density. This will override the use of the range of the response data for univariate data if supplied. Note that the (hyper)volume, rather than its inverse, is supplied here. This can affect prediction and the location of the MVN ellipses for MoE_gpairs plots (see noise_vol).

equalNoise

Logical which is only invoked when isTRUE(equalPro) and gating covariates are not supplied. Under the default setting (FALSE), the mixing proportion for the noise component is estimated, and remaining mixing proportions are equal; when TRUE all components, including the noise component, have equal mixing proportions.

discard.noise

A logical governing how the means are summarised in parameters$mean and by extension the location of the MVN ellipses in MoE_gpairs plots for models with both expert network covariates and a noise component (otherwise this argument is irrelevant).

The means for models with expert network covariates are summarised by the posterior mean of the fitted values. By default (FALSE), the mean of the noise component is accounted for in the posterior mean. Otherwise, or when the mean of the noise component is unavailable (due to having been manually supplied via noise.args$noise.vol), the z matrix is renormalised after discarding the column corresponding to the noise component prior to computation of the posterior mean. The renormalisation approach can be forced by specifying noise.args$discard.noise=TRUE, even when the mean of the noise component is available. For models with a noise component fitted with algo="CEM", a small extra E-step is conducted for observations assigned to the non-noise components in this case.

In particular, the argument noise.meth will be ignored for high-dimensional n <= d data, in which case the argument noise.vol must be specified. Note that this forces noise.args$discard.noise to TRUE. See noise_vol for more details.

The arguments tau0 and noise.init can be used separately, to provide alternative means to invoke a noise component. However, they can also be supplied together, in which case observations corresponding to noise.init have probability tau0 (rather than 1) of belonging to the noise component.
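As a rough illustration of the uniform noise density described under noise.meth and noise.vol, the base R sketch below computes the log-density implied by a given (hyper)volume. The helper noise_logdens and the example values are hypothetical and are not part of MoEClust; the sketch only assumes, as stated above, that the volume itself (not its inverse) is supplied and that the range of the data is used for univariate responses.

```r
# Univariate response: the volume defaults to the range of the data
y <- c(1.2, 3.4, 2.2, 9.8, 0.5)
V_range <- diff(range(y))   # 9.8 - 0.5

# A user-supplied volume (the volume itself, not its inverse), as with noise.vol
V_user <- 12

# Hypothetical helper: a uniform density 1/V gives every observation
# the same log-density, -log(V)
noise_logdens <- function(V, n) rep(-log(V), n)

ld_range <- noise_logdens(V_range, length(y))
ld_user  <- noise_logdens(V_user,  length(y))
```

A larger volume gives a lower noise density, so fewer observations would tend to be attributed to the noise component.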

asMclust

The default values of stopping and hc.args$hcUse (see below) are such that results for models with no covariates in either network are liable to differ from results for equivalent models obtained via Mclust. MoEClust uses stopping="aitken" and hcUse="VARS" by default, while mclust always implicitly uses stopping="relative" and defaults to hcUse="SVD".

asMclust is a logical variable (FALSE, by default) which functions as a simple convenience tool for overriding these two arguments (even if explicitly supplied!) such that they behave like the function Mclust. Other user-specified arguments which differ from mclust are not affected by asMclust, as their defaults already correspond to mclust. Results may still differ slightly as MoEClust calculates log-likelihood values with greater precision. Finally, note that asMclust=TRUE is invoked even for models with covariates which are not accommodated by mclust.

equalPro

Logical variable indicating whether or not the mixing proportions are to be constrained to be equal in the model. Default: equalPro = FALSE. Only relevant when gating covariates are not supplied within MoE_clust, otherwise ignored. In the presence of a noise component (see noise.args), only the mixing proportions for the non-noise components are constrained to be equal (by default, see equalNoise), after accounting for the noise component.
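The interplay between equalPro and noise.args$equalNoise can be sketched in base R as follows. The function mix_props is a hypothetical illustration rather than MoEClust code, and placing the noise proportion last in the vector is an assumption made purely for display.

```r
# Hypothetical helper: mixing proportions under equalPro=TRUE without gating covariates.
# tau_noise is the (estimated) noise proportion; NULL means no noise component.
mix_props <- function(G, tau_noise = NULL, equalNoise = FALSE) {
  if (is.null(tau_noise)) return(rep(1 / G, G))        # no noise: all equal
  if (equalNoise) return(rep(1 / (G + 1), G + 1))      # noise shares the equal split
  c(rep((1 - tau_noise) / G, G), tau_noise)            # noise proportion listed last (assumption)
}

p_plain <- mix_props(3)
p_noise <- mix_props(2, tau_noise = 0.1)
p_equal <- mix_props(2, tau_noise = 0.1, equalNoise = TRUE)
```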

exp.init

A list supplying select named parameters to control the initialisation routine in the presence of expert network covariates (otherwise ignored):

joint

A logical indicating whether the initial partition is obtained on the joint distribution of the response and expert network covariates (defaults to TRUE) or just the response variables (FALSE). By default, only continuous expert network covariates are considered (see exp.init$clustMD below). Only relevant when init.z is not "random" (unless isTRUE(exp.init$clustMD), in which case init.z specifies the initialisation routine for a call to clustMD). This will render the "quantile" option to init.z for univariate data unusable if continuous expert network covariates are supplied &/or categorical/ordinal expert network covariates are supplied when isTRUE(exp.init$clustMD) and the clustMD library is loaded.

mahalanobis

A logical indicating whether to iteratively reallocate observations during the initialisation phase to the component whose expert network regression yields fitted values closest to the observation in terms of Mahalanobis distance (defaults to TRUE). This will ensure that each component can be well modelled by a single expert prior to running the EM/CEM algorithm.

estart

A logical governing the behaviour of init.z="random" when nstarts > 1 in the presence of expert network covariates. Only relevant when isTRUE(exp.init$mahalanobis). Defaults to FALSE; i.e. all random starts are put through full runs of the EM algorithm. When TRUE, all random starts are put through the initial iterative reallocation routine prior to a full run of EM for only the single best random initial partition obtained. See the last set of Examples below.

clustMD

A logical indicating whether categorical/ordinal covariates should be incorporated when using the joint distribution of the response and expert network covariates for initialisation (defaults to FALSE). Only relevant when isTRUE(exp.init$joint). Requires the use of the clustMD library. Note that initialising in this manner involves fitting all clustMD model types in parallel for all numbers of components considered, and may fail (especially) in the presence of nominal expert network covariates.

Unless init.z="list", supplying this argument as TRUE when the clustMD library is loaded has the effect of superseding the init.z argument: this argument now governs instead how the call to clustMD is initialised (unless all clustMD model types fail for a given number of components, in which case init.z is invoked instead to initialise for G values for which all clustMD model types failed). Similarly, the arguments hc.args and km.args will be ignored (again, unless all clustMD model types fail for a given number of components).

max.init

The maximum number of iterations for the Mahalanobis distance-based reallocation procedure when exp.init$mahalanobis is TRUE. Defaults to .Machine$integer.max.

identity

A logical indicating whether the identity matrix (corresponding to the use of the Euclidean distance) is used in place of the covariance matrix of the residuals (corresponding to the use of the Mahalanobis distance). Defaults to FALSE; only relevant for multivariate response data.

drop.break

When isTRUE(exp.init$mahalanobis) observations will be completely in or out of a component during the initialisation phase. As such, it may occur that constant columns will be present when building a given component's expert regression (particularly for categorical covariates). It may also occur, due to this partitioning, that "unseen" data, when calculating the residuals, will have new factor levels. When isTRUE(exp.init$drop.break), the Mahalanobis distance based initialisation phase will explicitly fail in either of these scenarios.

Otherwise, drop_constants and drop_levels will be invoked when exp.init$drop.break is FALSE (the default) to try to remedy the situation. In any case, only a warning that the initialisation step failed will be printed, regardless of the value of exp.init$drop.break.
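The iterative reallocation idea behind exp.init$mahalanobis can be sketched in base R as below. This is a simplified illustration under stated assumptions, not the routine used by MoEClust: the function reallocate_once, the simulated data, the iteration cap of 20 (standing in for exp.init$max.init), and the minimum cluster size guard are all hypothetical.

```r
set.seed(1)
n     <- 100
x     <- runif(n)
truth <- rep(1:2, each = n / 2)
# Bivariate response whose regression on x differs between the two groups
Y <- cbind(y1 = ifelse(truth == 1, 1 + 2 * x, 5 - 2 * x) + rnorm(n, sd = 0.1),
           y2 = ifelse(truth == 1, -1 + x,    3 + x)     + rnorm(n, sd = 0.1))
dat <- data.frame(x = x)

# One reallocation pass: fit one regression per component, then assign each
# observation to the component with the smallest (squared) Mahalanobis distance
# between the observation and that component's fitted values
reallocate_once <- function(Y, dat, labels, G) {
  dists <- sapply(seq_len(G), function(g) {
    idx <- labels == g
    fit <- lm(Y[idx, , drop = FALSE] ~ x, data = dat[idx, , drop = FALSE])
    res <- Y - predict(fit, newdata = dat)            # residuals for ALL observations
    S   <- crossprod(residuals(fit)) / fit$df.residual # residual covariance of component g
    mahalanobis(res, center = rep(0, ncol(Y)), cov = S)
  })
  max.col(-dists, ties.method = "first")              # index of the smallest distance
}

labels <- sample(1:2, n, replace = TRUE)               # random initial partition
for (it in seq_len(20)) {                              # cap plays the role of max.init
  new_labels <- reallocate_once(Y, dat, labels, G = 2)
  if (min(tabulate(new_labels, nbins = 2)) < 5) break  # guard against tiny components
  if (all(new_labels == labels)) break                 # partition has stabilised
  labels <- new_labels
}
```

Replacing S with the identity matrix in the call to mahalanobis would correspond to the Euclidean variant described under exp.init$identity.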

algo

Switch controlling whether models are fit using the "EM" (the default) or "CEM" algorithm. The option "cemEM" allows running the EM algorithm starting from convergence of the CEM algorithm.

criterion

When either G or modelNames is a vector, criterion determines whether the "bic" (Bayesian Information Criterion), "icl" (Integrated Complete Likelihood), or "aic" (Akaike Information Criterion) is used to determine the 'best' model when gathering output. Note that all criteria will be returned in any case.
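For orientation, the three criteria can be sketched in base R as follows. The function info_criteria is hypothetical, and the sketch assumes the mclust-style convention in which larger values indicate a better model; the exact computations used by the package may differ in detail.

```r
# Hypothetical helper: loglik = maximised log-likelihood, df = number of
# estimated parameters, n = number of observations, z = n x G matrix of
# posterior membership probabilities (only needed for the ICL)
info_criteria <- function(loglik, df, n, z = NULL) {
  bic <- 2 * loglik - df * log(n)
  aic <- 2 * loglik - 2 * df
  icl <- if (!is.null(z)) {
    map <- max.col(z, ties.method = "first")                  # MAP classification
    bic + 2 * sum(log(z[cbind(seq_len(nrow(z)), map)]))       # entropy-type penalty
  } else NA_real_
  c(bic = bic, icl = icl, aic = aic)
}

z_hard <- rbind(c(1, 0), c(0, 1), c(1, 0))
z_soft <- rbind(c(0.9, 0.1), c(0.2, 0.8), c(0.6, 0.4))
crit_hard <- info_criteria(loglik = -100, df = 5, n = 50, z = z_hard)
crit_soft <- info_criteria(loglik = -100, df = 5, n = 50, z = z_soft)
```

Under this convention the ICL equals the BIC for a hard partition and falls below it as the classification becomes more uncertain.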

stopping

The criterion used to assess convergence of the EM/CEM algorithm. The default ("aitken") uses Aitken's acceleration method via aitken, otherwise the "relative" change in log-likelihood is monitored (which may be less strict). The "relative" option corresponds to the stopping criterion used by Mclust: see asMclust above.

Both stopping rules are ultimately governed by tol[1]. When the "aitken" method is employed, the asymptotic estimate of the final converged maximised log-likelihood is also returned as linf for models with 2 or more components, though the largest element of the returned vector loglik still gives the log-likelihood value achieved by the parameters returned at convergence, under both stopping methods (see MoE_clust).
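The idea behind Aitken's acceleration can be sketched in base R using three successive log-likelihood values. The function aitken_step is a hypothetical illustration of the textbook form of the estimate; the precise convergence rule applied by the package's aitken function is not reproduced here and may differ.

```r
# Hypothetical helper: ll holds the three most recent log-likelihood values
aitken_step <- function(ll) {
  stopifnot(length(ll) == 3)
  a    <- (ll[3] - ll[2]) / (ll[2] - ll[1])     # Aitken acceleration factor
  linf <- ll[2] + (ll[3] - ll[2]) / (1 - a)     # asymptotic log-likelihood estimate
  list(a = a, linf = linf, diff = abs(linf - ll[2]))
}

# A sequence converging geometrically to 10
ll  <- 10 - 0.5^(1:3)
res <- aitken_step(ll)

# One possible stopping check against a tolerance such as tol[1] (an assumption)
converged <- res$diff < 1e-05
```

For a geometrically converging sequence the estimate recovers the limit exactly, which is why this rule can be stricter than simply monitoring the relative change between consecutive iterations.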

z.list

A user-supplied list of initial cluster allocation matrices, with number of rows given by the number of observations, and numbers of columns given by the range of component numbers being considered. Only relevant if init.z="list". These matrices may correspond to either soft or hard clusterings, and will be internally normalised so that the rows sum to 1.
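A list of this shape could be built in base R along the following lines. The simulated data, the use of hclust to obtain labels, and the range G_range are assumptions made for illustration only.

```r
set.seed(42)
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))   # 40 observations in total

tree    <- hclust(dist(X))
G_range <- 2:3                                      # component numbers being considered

# One hard allocation (0/1 indicator) matrix per value of G
z.list <- lapply(G_range, function(G) {
  labs <- cutree(tree, k = G)
  outer(labs, seq_len(G), "==") * 1
})

# A soft allocation with arbitrary positive entries, rescaled so rows sum to 1
z_soft <- matrix(runif(nrow(X) * 2), ncol = 2)
z_soft <- z_soft / rowSums(z_soft)
```

Each element of z.list has one row per observation and as many columns as the corresponding number of components.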

nstarts

The number of random initialisations to use when init.z="random". Defaults to 1. When there are no expert covariates (or when exp.init$mahalanobis=FALSE or exp.init$estart=FALSE), the results will be based on the random start yielding the highest estimated log-likelihood after each initial partition is subjected to a full run of the EM algorithm. Note, in this case, that all nstarts random initialisations are affected by exp.init$mahalanobis, if invoked in the presence of expert network covariates, which may remove some of the randomness.

Conversely, if exp.init$mahalanobis=TRUE and exp.init$estart=TRUE, all nstarts random starts are put through the initial iterative reallocation routine and only the single best initial partition uncovered is put through the full run of the EM algorithm. See init.z and exp.init$estart above for more details, though note that exp.init$mahalanobis=TRUE and exp.init$estart=FALSE, by default.

eps

A scalar tolerance associated with deciding when to terminate computations due to computational singularity in covariances. Smaller values of eps allow computations to proceed nearer to singularity. The default is the relative machine precision .Machine$double.eps, which is approximately 2e-16 on IEEE-compliant machines.

tol

A vector of length three giving relative convergence tolerances for 1) the log-likelihood of the EM/CEM algorithm, 2) parameter convergence in the inner loop for models with iterative M-step ("VEI", "VEE", "EVE", "VVE", "VEV"), and 3) optimisation in the multinomial logistic regression in the gating network, respectively. The default is c(1e-05, sqrt(.Machine$double.eps), 1e-08). If only one number is supplied, it is used as the tolerance for all three cases given.

itmax

A vector of length three giving integer limits on the number of iterations for 1) the EM/CEM algorithm, 2) the inner loop for models with iterative M-step ("VEI", "VEE", "EVE", "VVE", "VEV"), and 3) the multinomial logistic regression in the gating network, respectively.

The default is c(.Machine$integer.max, .Machine$integer.max, 1000L), allowing termination to be completely governed by tol[1] & tol[2] for the inner and outer loops of the EM/CEM algorithm. If only one number is supplied, it is used as the iteration limit for the outer loop only and the other elements of itmax retain their usual defaults.

If, for any model with gating covariates, the multinomial logistic regression in the gating network fails to converge in itmax[3] iterations at any stage of the EM/CEM algorithm, an appropriate warning will be printed, prompting the user to modify this argument.
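The different ways in which a single supplied number is handled for tol and itmax can be sketched in base R as follows. The helpers expand_tol and expand_itmax are hypothetical and merely mimic the behaviour described above.

```r
# Hypothetical helper: a single tolerance is used for all three cases
expand_tol <- function(tol) {
  if (length(tol) == 1) rep(tol, 3) else tol
}

# Hypothetical helper: a single iteration limit applies to the outer loop only,
# while the remaining elements keep their defaults
expand_itmax <- function(itmax) {
  defaults <- c(.Machine$integer.max, .Machine$integer.max, 1000L)
  if (length(itmax) == 1) c(as.integer(itmax), defaults[-1]) else as.integer(itmax)
}

tol3   <- expand_tol(1e-06)
itmax3 <- expand_itmax(100)
```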

hc.args

A list supplying select named parameters to control the initialisation of the cluster allocations when init.z="hc" (or when init.z="mclust", which itself relies on hc), unless isTRUE(exp.init$clustMD), the clustMD library is loaded, and none of the clustMD model types fail (otherwise irrelevant):

hcUse: A string specifying the type of input variables to be used. This defaults to "VARS" here, unlike mclust which defaults to "SVD". Other allowable values are documented in mclust.options. See asMclust above.
hc.meth: A character string indicating the model to be used when hierarchical clustering (see hc) is employed for initialisation (either when init.z="hc" or init.z="mclust"). Defaults to "EII" for high-dimensional data, or "VVV" otherwise.

km.args

A list supplying select named parameters to control the initialisation of the cluster allocations when init.z="kmeans", unless isTRUE(exp.init$clustMD), the clustMD library is loaded, and none of the clustMD model types fail (otherwise irrelevant):

kstarts: The number of random initialisations to use. Defaults to 10.
kiters: The maximum number of K-Means iterations allowed. Defaults to 10.
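As an illustration of what the two km.args entries control, the base R sketch below runs stats::kmeans with the default values. It assumes that kstarts and kiters play the roles of the nstart and iter.max arguments of kmeans; this mapping and the simulated data are assumptions rather than a description of the package internals.

```r
set.seed(123)
X <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 6), ncol = 2))   # 60 observations in total

km.args <- list(kstarts = 10L, kiters = 10L)        # the documented defaults

# Assumed mapping: kstarts -> nstart, kiters -> iter.max
km <- kmeans(X, centers = 2, nstart = km.args$kstarts, iter.max = km.args$kiters)
init_labels <- km$cluster
```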

posidens

A logical governing whether to continue running the algorithm even in the presence of positive log-densities. Defaults to TRUE, but setting posidens=FALSE can help to safeguard against spurious solutions, which will be instantly terminated if positive log-densities are encountered. Note that versions of this package prior to and including version 1.3.1 always implicitly assumed posidens=FALSE.

init.crit

The criterion to be used to determine the optimal model type to initialise with, when init.z="mclust" or when isTRUE(exp.init$clustMD) and the clustMD library is loaded (one of "bic" or "icl"). Defaults to "icl" when criterion="icl", otherwise defaults to "bic". The criterion argument remains unaffected.

warn.it

A single number giving the iteration count at which a warning will be printed if the EM/CEM algorithm has failed to converge. Defaults to 0, i.e. no warning (which is true for any warn.it value less than 3), otherwise the message is printed regardless of the value of verbose. If non-zero, warn.it should be moderately large, but obviously less than itmax[1]. A warning will always be printed if one or more models fail to converge in itmax[1] iterations.

MaxNWts

The maximum allowable number of weights in the call to multinom for the multinomial logistic regression in the gating network. There is no intrinsic limit in the code, but increasing MaxNWts will probably allow fits that are very slow and time-consuming. It may be necessary to increase MaxNWts when categorical concomitant variables with many levels are included or the number of components is high.

verbose

Logical indicating whether to print messages pertaining to progress to the screen during fitting. Defaults to TRUE if the session is interactive, and FALSE otherwise. If FALSE, warnings and error messages will still be printed to the screen, but everything else will be suppressed.

...

Catches unused arguments.

Value

A named list in which the names are the names of the arguments and the values are the values supplied to the arguments.

Details

MoE_control is provided for assigning values and defaults within MoE_clust and MoE_stepwise.

While the criterion argument controls the choice of the optimal number of components and GPCM/mclust model type, MoE_compare is provided for choosing between fits with different combinations of covariates or different initialisation settings.

Examples


library(MoEClust)

ctrl1 <- MoE_control(criterion="icl", itmax=100, warn.it=15, init.z="random", nstarts=5)

data(CO2data)
GNP   <- CO2data$GNP
res   <- MoE_clust(CO2data$CO2, G=2, expert = ~ GNP, control=ctrl1)

# Alternatively, specify control arguments directly
res2  <- MoE_clust(CO2data$CO2, G=2, expert = ~ GNP, stopping="relative")

# Supplying ctrl1 without naming it as 'control' can throw an error
# (wrapped in try() so that the remaining examples still run)
try(res3  <- MoE_clust(CO2data$CO2, G=2, expert = ~ GNP, ctrl1))

# Similarly, supplying control arguments via a mix of the ... construct
# and the named argument 'control' also throws an error
try(res4  <- MoE_clust(CO2data$CO2, G=2, expert = ~ GNP, control=ctrl1, init.z="kmeans"))

# Initialise via the mixed-type joint distribution of response & covariates
# Let the ICL criterion determine the optimal clustMD model type
# Constrain the mixing proportions to be equal
ctrl2 <- MoE_control(exp.init=list(clustMD=TRUE), init.crit="icl", equalPro=TRUE)
data(ais)
library(clustMD)
res4  <- MoE_clust(ais[,3:7], G=2, modelNames="EVE", expert= ~ sex,
                   network.data=ais, control=ctrl2)

# Include a noise component by specifying its prior mixing proportion
res5  <- MoE_clust(ais[,3:7], G=2, modelNames="EVE", expert= ~ sex,
                   network.data=ais, tau0=0.1)

# Investigate the use of random starts
sex  <- ais$sex
# resA uses deterministic starting values (by default) for each G value
system.time(resA <- MoE_clust(ais[,3:7], G=2, expert=~sex, equalPro=TRUE))
# resB passes each random start through the entire EM algorithm for each G value
system.time(resB <- MoE_clust(ais[,3:7], G=2, expert=~sex, equalPro=TRUE,
                              init.z="random", nstarts=10))
# resC passes only the "best" random start through the EM algorithm for each G value
system.time(resC <- MoE_clust(ais[,3:7], G=2, expert=~sex, equalPro=TRUE,
                              init.z="random", nstarts=10, estart=TRUE))
# Here, all three settings (listed here in order of speed) converge to the same model
MoE_compare(resA, resC, resB)
