Supplies a list of arguments (with defaults) for use with MoE_clust.
MoE_control(init.z = c("hc", "quantile", "kmeans", "mclust", "random", "list"),
noise.args = list(...),
asMclust = FALSE,
equalPro = FALSE,
exp.init = list(...),
algo = c("EM", "CEM", "cemEM"),
criterion = c("bic", "icl", "aic"),
stopping = c("aitken", "relative"),
z.list = NULL,
nstarts = 1L,
eps = .Machine$double.eps,
tol = c(1e-05, sqrt(.Machine$double.eps), 1e-08),
itmax = c(.Machine$integer.max, .Machine$integer.max, 1000L),
hc.args = list(...),
km.args = list(...),
posidens = TRUE,
init.crit = c("bic", "icl"),
warn.it = 0L,
MaxNWts = 1000L,
verbose = interactive(),
...)
init.z
The method used to initialise the cluster labels. Defaults to "hc", i.e. model-based agglomerative hierarchical clustering tree as per hc, for multivariate data (see hc.args), or "quantile"-based clustering as per quant_clust for univariate data (unless there are expert network covariates incorporated via exp.init$joint and/or exp.init$clustMD, in which case the default is again "hc"). The "quantile" option is thus only available for univariate data when expert network covariates are not incorporated via exp.init$joint and/or exp.init$clustMD, or when expert network covariates are not supplied.
Other options include "kmeans" (see km.args), "random" initialisation (see nstarts below), a user-supplied "list" (see z.list below), and a full run of Mclust (itself initialised via a model-based agglomerative hierarchical clustering tree, again see hc.args), although this last option "mclust" will be coerced to "hc" if there are no gating and/or expert covariates within MoE_clust (in order to better reproduce Mclust output).
When init.z="list", exp.init$clustMD is forced to FALSE; otherwise, when isTRUE(exp.init$clustMD) and the clustMD library is loaded, the init.z argument instead governs the method by which a call to clustMD is initialised. In this instance, "quantile" will instead default to "hc", and the arguments to hc.args and km.args will be ignored (unless all clustMD model types fail for a given number of components).
When init.z="mclust" or clustMD is successfully invoked (via exp.init$clustMD), the argument init.crit (see below) specifies the model-selection criterion ("bic" or "icl") by which the optimal Mclust or clustMD model type to initialise with is determined, and criterion remains unaffected.
Finally, when the model includes expert network covariates and isTRUE(exp.init$mahalanobis), the argument exp.init$estart (see below) can be used to modify the behaviour of init.z="random" when nstarts > 1, toggling between a full run of the EM algorithm for each random initialisation (i.e. exp.init$estart=FALSE, the default), or a single run of the EM algorithm starting from the best initial partition obtained among the random starts according to the iterative reallocation initialisation routine (i.e. exp.init$estart=TRUE).
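For univariate data, the idea behind "quantile"-based initialisation can be sketched in base R as follows. This is only an illustration, assuming equally spaced quantile break points; the actual quant_clust function may differ in its details, and the names x, G, breaks, and labels are made up for the example.

```r
# Hypothetical illustration: split univariate data into G groups at
# equally spaced sample quantiles (not the package's own quant_clust code).
set.seed(1)
x <- rnorm(90)   # simulated univariate response
G <- 3           # number of components to initialise

# G + 1 break points at the 0, 1/G, ..., 1 sample quantiles
breaks <- quantile(x, probs = seq(0, 1, length.out = G + 1))

# Integer labels in 1..G; include.lowest keeps the minimum in group 1
labels <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
table(labels)
```

Such a partition only uses the ordering of the response, which is presumably why the option is restricted to univariate data without jointly modelled covariates.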
noise.args
A list supplying select named parameters to control inclusion of a noise component in the estimation of the mixture. If either or both of the arguments tau0 and noise.init are supplied, a noise component is added to the model in the estimation.
tau0
Prior mixing proportion for the noise component. If supplied, a noise component will be added to the model in the estimation, with tau0 giving the prior probability of belonging to the noise component for all observations. Typically supplied as a scalar in the interval (0, 1), e.g. 0.1. Can be supplied as a vector when gating covariates are present and noise.args$noise.gate is TRUE. This argument can be supplied instead of or in conjunction with the argument noise.init below.
noise.init
A logical or numeric vector indicating an initial guess as to which observations are noise in the data. If numeric, the entries should correspond to row indices of the data. If supplied, a noise component will be added to the model in the estimation. This argument can be supplied instead of or in conjunction with the argument tau0 above.
noise.gate
A logical indicating whether gating network covariates influence the mixing proportion for the noise component, if any. Defaults to TRUE, but leads to greater parsimony if FALSE. Only relevant in the presence of a noise component; only affects estimation in the presence of gating covariates.
noise.meth
The method used to estimate the volume when a noise component is invoked. Defaults to hypvol. For univariate data, this argument is ignored and the range of the data is used instead (unless noise.vol below is specified). The options "convexhull" and "ellipsoidhull" require loading the geometry and cluster libraries, respectively. This argument is only relevant if noise.vol below is not supplied.
noise.vol
This argument can be used to override the argument noise.meth by specifying the (hyper)volume directly, i.e. specifying an improper uniform density. This will override the use of the range of the response data for univariate data if supplied. Note that the (hyper)volume, rather than its inverse, is supplied here. This can affect prediction and the location of the MVN ellipses for MoE_gpairs plots (see noise_vol).
equalNoise
Logical which is only invoked when isTRUE(equalPro) and gating covariates are not supplied. Under the default setting (FALSE), the mixing proportion for the noise component is estimated, and remaining mixing proportions are equal; when TRUE, all components, including the noise component, have equal mixing proportions.
discard.noise
A logical governing how the means are summarised in parameters$mean and by extension the location of the MVN ellipses in MoE_gpairs plots for models with both expert network covariates and a noise component (otherwise this argument is irrelevant).
The means for models with expert network covariates are summarised by the posterior mean of the fitted values. By default (FALSE), the mean of the noise component is accounted for in the posterior mean. Otherwise, or when the mean of the noise component is unavailable (due to the volume having been manually supplied via noise.args$noise.vol), the z matrix is renormalised after discarding the column corresponding to the noise component prior to computation of the posterior mean. The renormalisation approach can be forced by specifying noise.args$discard.noise=TRUE, even when the mean of the noise component is available. For models with a noise component fitted with algo="CEM", a small extra E-step is conducted for observations assigned to the non-noise components in this case.
In particular, the argument noise.meth will be ignored for high-dimensional n <= d data, in which case the argument noise.vol must be specified. Note that this forces noise.args$discard.noise to TRUE. See noise_vol for more details.
The arguments tau0 and noise.init can be used separately, to provide alternative means to invoke a noise component. However, they can also be supplied together, in which case observations corresponding to noise.init have probability tau0 (rather than 1) of belonging to the noise component.
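The role of the (hyper)volume can be illustrated with a rough base R sketch. It assumes the simplest possible estimate, namely the volume of the axis-aligned bounding box of the data; hypvol and the "convexhull" and "ellipsoidhull" options use other estimates, so the numbers here are purely illustrative and all object names are hypothetical.

```r
# Simulated bivariate data on [0, 2] x [0, 5]
set.seed(2)
X <- cbind(runif(100, 0, 2), runif(100, 0, 5))

# Bounding-box (hyper)volume: product of the per-variable ranges.
# For a single variable this reduces to the range of the data.
box_vol <- prod(apply(X, 2, function(v) diff(range(v))))

# The noise component is a uniform density equal to 1 / volume,
# so the volume itself (not its inverse) is the quantity to supply.
noise_dens <- 1 / box_vol
c(volume = box_vol, density = noise_dens)
```

A value computed along these lines is the kind of quantity noise.vol expects; supplying the density instead would invert the intended scale.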
asMclust
The default values of stopping and hc.args$hcUse (see below) are such that results for models with no covariates in either network are liable to differ from results for equivalent models obtained via Mclust. MoEClust uses stopping="aitken" and hcUse="VARS" by default, while mclust always implicitly uses stopping="relative" and defaults to hcUse="SVD".
asMclust is a logical variable (FALSE, by default) which functions as a simple convenience tool for overriding these two arguments (even if explicitly supplied!) such that they behave like the function Mclust. Other user-specified arguments which differ from mclust are not affected by asMclust, as their defaults already correspond to mclust. Results may still differ slightly as MoEClust calculates log-likelihood values with greater precision. Finally, note that asMclust=TRUE is invoked even for models with covariates which are not accommodated by mclust.
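As a brief illustration using only arguments documented on this page, the two control objects below would be expected to request the same mclust-like behaviour. This assumes the MoEClust package is installed; the object names ctrl_a and ctrl_b are arbitrary.

```r
library(MoEClust)

# Convenience switch: mimic the stopping rule and hcUse default of mclust
ctrl_a <- MoE_control(asMclust = TRUE)

# Setting the two affected arguments by hand instead
ctrl_b <- MoE_control(stopping = "relative", hc.args = list(hcUse = "SVD"))
```

Either object can then be passed to MoE_clust through its control argument.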
equalPro
Logical variable indicating whether or not the mixing proportions are to be constrained to be equal in the model. Default: equalPro = FALSE. Only relevant when gating covariates are not supplied within MoE_clust, otherwise ignored. In the presence of a noise component (see noise.args), only the mixing proportions for the non-noise components are constrained to be equal (by default, see equalNoise), after accounting for the noise component.
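The interplay between equalPro and equalNoise can be made concrete with a small numerical sketch. The value used for the noise proportion is made up, and the vectors below simply restate the constraints described above rather than reproduce any package internals.

```r
G         <- 3     # number of non-noise components
tau_noise <- 0.1   # estimated noise mixing proportion (made-up value)

# equalPro = TRUE with equalNoise = FALSE (the default): the noise
# proportion is free and the remaining mass is shared equally
tau_default <- c(rep((1 - tau_noise) / G, G), tau_noise)

# equalPro = TRUE with equalNoise = TRUE: all G + 1 components are equal
tau_equal <- rep(1 / (G + 1), G + 1)

rbind(tau_default, tau_equal)
```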
exp.init
A list supplying select named parameters to control the initialisation routine in the presence of expert network covariates (otherwise ignored):
joint
A logical indicating whether the initial partition is obtained on the joint distribution of the response and expert network covariates (defaults to TRUE) or just the response variables (FALSE). By default, only continuous expert network covariates are considered (see exp.init$clustMD below). Only relevant when init.z is not "random" (unless isTRUE(exp.init$clustMD), in which case init.z specifies the initialisation routine for a call to clustMD). This will render the "quantile" option to init.z for univariate data unusable if continuous expert network covariates are supplied and/or categorical/ordinal expert network covariates are supplied when isTRUE(exp.init$clustMD) and the clustMD library is loaded.
mahalanobis
A logical indicating whether to iteratively reallocate each observation during the initialisation phase to the component whose expert network regression yields the fitted values closest to that observation in terms of Mahalanobis distance (defaults to TRUE). This will ensure that each component can be well modelled by a single expert prior to running the EM/CEM algorithm.
estart
A logical governing the behaviour of init.z="random" when nstarts > 1 in the presence of expert network covariates. Only relevant when isTRUE(exp.init$mahalanobis). Defaults to FALSE; i.e. all random starts are put through full runs of the EM algorithm. When TRUE, all random starts are put through the initial iterative reallocation routine prior to a full run of EM for only the single best random initial partition obtained. See the last set of Examples below.
clustMD
A logical indicating whether categorical/ordinal covariates should be incorporated when using the joint distribution of the response and expert network covariates for initialisation (defaults to FALSE). Only relevant when isTRUE(exp.init$joint). Requires the use of the clustMD library. Note that initialising in this manner involves fitting all clustMD model types in parallel for all numbers of components considered, and may fail (especially) in the presence of nominal expert network covariates.
Unless init.z="list", supplying this argument as TRUE when the clustMD library is loaded has the effect of superseding the init.z argument: init.z now governs instead how the call to clustMD is initialised (unless all clustMD model types fail for a given number of components, in which case init.z is invoked instead to initialise for G values for which all clustMD model types failed). Similarly, the arguments hc.args and km.args will be ignored (again, unless all clustMD model types fail for a given number of components).
max.init
The maximum number of iterations for the Mahalanobis distance-based reallocation procedure when exp.init$mahalanobis is TRUE. Defaults to .Machine$integer.max.
identity
A logical indicating whether the identity matrix (corresponding to the use of the Euclidean distance) is used in place of the covariance matrix of the residuals (corresponding to the use of the Mahalanobis distance). Defaults to FALSE; only relevant for multivariate response data.
drop.break
When isTRUE(exp.init$mahalanobis), observations will be completely in or out of a component during the initialisation phase. As such, it may occur that constant columns will be present when building a given component's expert regression (particularly for categorical covariates). It may also occur, due to this partitioning, that "unseen" data, when calculating the residuals, will have new factor levels. When isTRUE(exp.init$drop.break), the Mahalanobis distance-based initialisation phase will explicitly fail in either of these scenarios.
Otherwise, drop_constants and drop_levels will be invoked when exp.init$drop.break is FALSE (the default) to try to remedy the situation. In any case, only a warning that the initialisation step failed will be printed, regardless of the value of exp.init$drop.break.
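The iterative reallocation idea behind exp.init$mahalanobis can be sketched in base R for a univariate response. This is a simplified illustration under stated assumptions, not the package's routine: the data are simulated, each expert is a plain lm fit, the squared residual scaled by the residual variance stands in for the Mahalanobis distance, and use_identity and max_init are hypothetical stand-ins for the identity and max.init options.

```r
# Simulated data: two groups following different regression lines
set.seed(42)
n   <- 200
x   <- runif(n)
grp <- rep(1:2, each = n / 2)
y   <- ifelse(grp == 1, 1 + 2 * x, 4 - 3 * x) + rnorm(n, sd = 0.2)
dat <- data.frame(x = x, y = y)

G            <- 2
labels       <- sample(seq_len(G), n, replace = TRUE)  # random initial partition
max_init     <- 50      # cap on the number of reallocation sweeps
use_identity <- FALSE   # TRUE would use plain squared (Euclidean) distance

for (iter in seq_len(max_init)) {
  # Squared distance of every observation to each component's fitted values
  d2 <- sapply(seq_len(G), function(g) {
    fit <- lm(y ~ x, data = dat[labels == g, ])
    res <- dat$y - predict(fit, newdata = dat)
    if (use_identity) res^2 else res^2 / summary(fit)$sigma^2
  })
  new_labels <- max.col(-d2, ties.method = "first")    # nearest component
  # Stop if nothing moved, or if a component would be too small to refit
  if (all(new_labels == labels) || min(tabulate(new_labels, G)) < 3) break
  labels <- new_labels
}
table(labels)
```

Because each observation is wholly in or out of a component at every sweep, a component can end up with too few observations or constant covariate columns, which is the situation drop.break is concerned with.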
algo
Switch controlling whether models are fit using the "EM" (the default) or "CEM" algorithm. The option "cemEM" allows running the EM algorithm starting from convergence of the CEM algorithm.
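The essential difference between the two algorithms lies in how the posterior membership probabilities are treated between the E-step and the M-step. The toy matrix below is made up, and the code is a generic sketch of the classification step rather than the package's implementation.

```r
# Posterior membership probabilities for 4 observations and 2 components
z <- matrix(c(0.9, 0.1,
              0.6, 0.4,
              0.2, 0.8,
              0.5, 0.5), ncol = 2, byrow = TRUE)

# EM passes the soft probabilities z to the M-step unchanged.
# CEM adds a classification step: each row becomes a 0/1 indicator of
# its most probable component (ties go to the first maximum here).
hard  <- max.col(z, ties.method = "first")
z_cem <- matrix(0, nrow(z), ncol(z))
z_cem[cbind(seq_len(nrow(z)), hard)] <- 1
z_cem
```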
criterion
When either G or modelNames is a vector, criterion determines whether the "bic" (Bayesian Information Criterion), "icl" (Integrated Complete Likelihood), or "aic" (Akaike Information Criterion) is used to determine the 'best' model when gathering output. Note that all criteria will be returned in any case.
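For orientation, the three criteria can be computed from a fitted model's log-likelihood, parameter count, and posterior probabilities roughly as follows. The sketch assumes the mclust-style convention in which larger values are better and the ICL subtracts a classification-uncertainty term from the BIC; the package's exact computations may differ, and all inputs are made-up values.

```r
loglik <- -100   # maximised log-likelihood (made-up value)
df     <- 5      # number of estimated parameters
n      <- 50     # number of observations

# Toy posterior probabilities for n observations and 2 components
p <- seq(0.55, 0.99, length.out = n)
z <- cbind(p, 1 - p)

# 'Larger is better' convention
bic <- 2 * loglik - df * log(n)
aic <- 2 * loglik - 2 * df
# ICL: BIC penalised by the uncertainty of the hard classification
icl <- bic + 2 * sum(log(apply(z, 1, max)))

c(bic = bic, icl = icl, aic = aic)
```

Since the added term is never positive, the ICL computed this way cannot exceed the BIC, and it equals the BIC only when every observation is classified with certainty.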
stopping
The criterion used to assess convergence of the EM/CEM algorithm. The default ("aitken") uses Aitken's acceleration method via aitken, otherwise the "relative" change in log-likelihood is monitored (which may be less strict). The "relative" option corresponds to the stopping criterion used by Mclust: see asMclust above.
Both stopping rules are ultimately governed by tol[1]. When the "aitken" method is employed, the asymptotic estimate of the final converged maximised log-likelihood is also returned as linf for models with 2 or more components, though the largest element of the returned vector loglik still gives the log-likelihood value achieved by the parameters returned at convergence, under both stopping methods (see MoE_clust).
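The two stopping rules can be contrasted on a toy log-likelihood sequence. The helper aitken_sketch below is hypothetical and follows the textbook form of Aitken's acceleration; the exact conditions applied by the package's aitken function and by Mclust may differ in detail.

```r
# Hypothetical helper: Aitken-accelerated estimate from the last three
# log-likelihood values (not the package's own aitken() function).
aitken_sketch <- function(ll) {
  l    <- tail(ll, 3)
  a    <- (l[3] - l[2]) / (l[2] - l[1])      # estimated rate of convergence
  linf <- l[2] + (l[3] - l[2]) / (1 - a)     # asymptotic log-likelihood
  list(a = a, linf = linf)
}

tol <- 1e-05
ll  <- 10 - 0.5^(1:3)   # toy sequence converging geometrically to 10

ait <- aitken_sketch(ll)
# "aitken": stop once the current value is within tol of the asymptote
stop_aitken   <- abs(ait$linf - ll[3]) < tol
# "relative": stop once the relative change in log-likelihood is below tol
stop_relative <- abs(ll[3] - ll[2]) / (1 + abs(ll[3])) < tol
```

For this geometric sequence the asymptotic estimate recovers the limit of 10 exactly, while neither rule would yet declare convergence after three iterations.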
z.list
A user-supplied list of initial cluster allocation matrices, with number of rows given by the number of observations, and numbers of columns given by the range of component numbers being considered. Only relevant if init.z="list". These matrices may correspond to either soft or hard clusterings, and will be internally normalised so that the rows sum to 1.
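A minimal sketch of what such a list might look like, together with the row normalisation described above, is given here. The matrices are made up, and the normalisation is shown only to illustrate the effect; it is not taken from the package source.

```r
set.seed(4)
n <- 5
# One matrix per number of components under consideration (here G = 2, 3)
z2 <- matrix(c(1, 0,
               0, 1,
               1, 0,
               0, 1,
               1, 0), nrow = n, byrow = TRUE)   # hard clustering
z3 <- matrix(runif(n * 3), nrow = n)            # soft, not yet normalised
z_list <- list(z2, z3)

# Normalise so that every row of every matrix sums to 1
z_norm <- lapply(z_list, function(z) z / rowSums(z))
sapply(z_norm, rowSums)
```

A list like z_list would be supplied through z.list together with init.z="list".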
nstarts
The number of random initialisations to use when init.z="random". Defaults to 1. When there are no expert covariates (or when exp.init$mahalanobis=FALSE or exp.init$estart=FALSE), the results will be based on the random start yielding the highest estimated log-likelihood after each initial partition is subjected to a full run of the EM algorithm. Note, in this case, that all nstarts random initialisations are affected by exp.init$mahalanobis, if invoked in the presence of expert network covariates, which may remove some of the randomness.
Conversely, if exp.init$mahalanobis=TRUE and exp.init$estart=TRUE, all nstarts random starts are put through the initial iterative reallocation routine and only the single best initial partition uncovered is put through the full run of the EM algorithm. See init.z and exp.init$estart above for more details, though note that exp.init$mahalanobis=TRUE and exp.init$estart=FALSE, by default.
eps
A scalar tolerance associated with deciding when to terminate computations due to computational singularity in covariances. Smaller values of eps allow computations to proceed nearer to singularity. The default is the relative machine precision .Machine$double.eps, which is approximately 2e-16 on IEEE-compliant machines.
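The meaning of the relative machine precision can be checked directly in base R. The snippet below only demonstrates the scale of the default value and is unrelated to how the package tests for singularity.

```r
eps <- .Machine$double.eps
eps                   # roughly 2.22e-16 for IEEE double precision

(1 + eps) > 1         # TRUE: eps is the gap between 1 and the next double
(1 + eps / 2) == 1    # TRUE: a smaller increment is rounded away
```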
tol
A vector of length three giving relative convergence tolerances for 1) the log-likelihood of the EM/CEM algorithm, 2) parameter convergence in the inner loop for models with iterative M-step ("VEI", "VEE", "EVE", "VVE", "VEV"), and 3) optimisation in the multinomial logistic regression in the gating network, respectively. The default is c(1e-05, sqrt(.Machine$double.eps), 1e-08). If only one number is supplied, it is used as the tolerance for all three cases given.
itmax
A vector of length three giving integer limits on the number of iterations for 1) the EM/CEM algorithm, 2) the inner loop for models with iterative M-step ("VEI", "VEE", "EVE", "VVE", "VEV"), and 3) the multinomial logistic regression in the gating network, respectively.
The default is c(.Machine$integer.max, .Machine$integer.max, 1000L), allowing termination to be completely governed by tol[1] and tol[2] for the outer and inner loops of the EM/CEM algorithm, respectively. If only one number is supplied, it is used as the iteration limit for the outer loop only and the other elements of itmax retain their usual defaults.
If, for any model with gating covariates, the multinomial logistic regression in the gating network fails to converge in itmax[3] iterations at any stage of the EM/CEM algorithm, an appropriate warning will be printed, prompting the user to modify this argument.
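Note that a single supplied value is treated differently by tol and itmax. The hypothetical helpers below mimic the documented behaviour to make the contrast explicit; they are not functions from the package.

```r
# Hypothetical helpers mimicking the documented recycling rules
expand_tol <- function(tol) {
  # A single tolerance is reused for all three cases
  if (length(tol) == 1) rep(tol, 3) else tol
}
expand_itmax <- function(itmax) {
  defaults <- c(.Machine$integer.max, .Machine$integer.max, 1000L)
  # A single limit applies to the outer loop only
  if (length(itmax) == 1) c(as.integer(itmax), defaults[-1]) else as.integer(itmax)
}

expand_tol(1e-06)    # same tolerance for all three cases
expand_itmax(100)    # outer-loop limit only; the rest keep their defaults
```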
hc.args
A list supplying select named parameters to control the initialisation of the cluster allocations when init.z="hc" (or when init.z="mclust", which itself relies on hc), unless isTRUE(exp.init$clustMD), the clustMD library is loaded, and none of the clustMD model types fail (otherwise irrelevant):
hcUse
A string specifying the type of input variables to be used. This defaults to "VARS" here, unlike mclust which defaults to "SVD". Other allowable values are documented in mclust.options. See asMclust above.
hc.meth
A character string indicating the model to be used when hierarchical clustering (see hc) is employed for initialisation (either when init.z="hc" or init.z="mclust"). Defaults to "EII" for high-dimensional data, or "VVV" otherwise.
km.args
A list supplying select named parameters to control the initialisation of the cluster allocations when init.z="kmeans", unless isTRUE(exp.init$clustMD), the clustMD library is loaded, and none of the clustMD model types fail (otherwise irrelevant):
kstarts
The number of random initialisations to use. Defaults to 10.
kiters
The maximum number of K-Means iterations allowed. Defaults to 10.
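For reference, a K-Means initialisation with these defaults can be reproduced in base R along the following lines. The mapping of kstarts to nstart and of kiters to iter.max in stats::kmeans is an assumption made for the sake of the example, and the data are simulated.

```r
set.seed(3)
# Two well-separated simulated clusters in two dimensions
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))

# Assumed mapping: kstarts -> nstart, kiters -> iter.max (both 10 here)
km <- kmeans(X, centers = 2, nstart = 10, iter.max = 10)
init_labels <- km$cluster
table(init_labels)
```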
posidens
A logical governing whether to continue running the algorithm even in the presence of positive log-densities. Defaults to TRUE, but setting posidens=FALSE can help to safeguard against spurious solutions, which will be instantly terminated if positive log-densities are encountered. Note that versions of this package prior to and including version 1.3.1 always implicitly assumed posidens=FALSE.
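A positive log-density simply means that the density itself exceeds 1, which tends to happen when a component's variance becomes very small. The base R lines below illustrate this with a univariate normal density; the standard deviations are arbitrary choices for the example.

```r
# Log-density at the mean of an ordinary component: negative
ld_wide <- dnorm(0, mean = 0, sd = 1, log = TRUE)

# Log-density at the mean of a near-degenerate component: positive
ld_narrow <- dnorm(0, mean = 0, sd = 0.01, log = TRUE)

c(wide = ld_wide, narrow = ld_narrow)
```

Components that collapse onto a handful of nearly identical observations produce values like the second one, which is why terminating on positive log-densities can act as a safeguard against spurious solutions.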
init.crit
The criterion to be used to determine the optimal model type to initialise with, when init.z="mclust" or when isTRUE(exp.init$clustMD) and the clustMD library is loaded (one of "bic" or "icl"). Defaults to "icl" when criterion="icl", otherwise defaults to "bic". The criterion argument remains unaffected.
warn.it
A single number giving the iteration count at which a warning will be printed if the EM/CEM algorithm has failed to converge. Defaults to 0, i.e. no warning (which is true for any warn.it value less than 3), otherwise the message is printed regardless of the value of verbose. If non-zero, warn.it should be moderately large, but obviously less than itmax[1]. A warning will always be printed if one or more models fail to converge in itmax[1] iterations.
MaxNWts
The maximum allowable number of weights in the call to multinom for the multinomial logistic regression in the gating network. There is no intrinsic limit in the code, but increasing MaxNWts will probably allow fits that are very slow and time-consuming. It may be necessary to increase MaxNWts when categorical concomitant variables with many levels are included or the number of components is high.
verbose
Logical indicating whether to print messages pertaining to progress to the screen during fitting. Defaults to TRUE if the session is interactive, and FALSE otherwise. If FALSE, warnings and error messages will still be printed to the screen, but everything else will be suppressed.
...
Catches unused arguments.
A named list in which the names are the names of the arguments and the values are the values supplied to the arguments.
MoE_control is provided for assigning values and defaults within MoE_clust and MoE_stepwise.
While the criterion argument controls the choice of the optimal number of components and GPCM/mclust model type, MoE_compare is provided for choosing between fits with different combinations of covariates or different initialisation settings.
MoE_clust, MoE_stepwise, aitken, Mclust, hc, mclust.options, quant_clust, clustMD, noise_vol, hypvol, convhulln, ellipsoidhull, MoE_compare, multinom
# NOT RUN {
ctrl1 <- MoE_control(criterion="icl", itmax=100, warn.it=15, init.z="random", nstarts=5)
data(CO2data)
GNP <- CO2data$GNP
# }
# NOT RUN {
res <- MoE_clust(CO2data$CO2, G=2, expert = ~ GNP, control=ctrl1)
# Alternatively, specify control arguments directly
res2 <- MoE_clust(CO2data$CO2, G=2, expert = ~ GNP, stopping="relative")
# }
# NOT RUN {
# Supplying ctrl1 without naming it as 'control' can throw an error
# }
# NOT RUN {
res3 <- MoE_clust(CO2data$CO2, G=2, expert = ~ GNP, ctrl1)
# }
# NOT RUN {
# Similarly, supplying control arguments via a mix of the ... construct
# and the named argument 'control' also throws an error
# }
# NOT RUN {
res4 <- MoE_clust(CO2data$CO2, G=2, expert = ~ GNP, control=ctrl1, init.z="kmeans")
# }
# NOT RUN {
# Initialise via the mixed-type joint distribution of response & covariates
# Let the ICL criterion determine the optimal clustMD model type
# Constrain the mixing proportions to be equal
ctrl2 <- MoE_control(exp.init=list(clustMD=TRUE), init.crit="icl", equalPro=TRUE)
data(ais)
library(clustMD)
res4 <- MoE_clust(ais[,3:7], G=2, modelNames="EVE", expert= ~ sex,
                  network.data=ais, control=ctrl2)
# Include a noise component by specifying its prior mixing proportion
res5 <- MoE_clust(ais[,3:7], G=2, modelNames="EVE", expert= ~ sex,
                  network.data=ais, tau0=0.1)
# Investigate the use of random starts
sex <- ais$sex
# resA uses deterministic starting values (by default) for each G value
system.time(resA <- MoE_clust(ais[,3:7], G=2, expert=~sex, equalPro=TRUE))
# resB passes each random start through the entire EM algorithm for each G value
system.time(resB <- MoE_clust(ais[,3:7], G=2, expert=~sex, equalPro=TRUE,
                              init.z="random", nstarts=10))
# resC passes only the "best" random start through the EM algorithm for each G value
system.time(resC <- MoE_clust(ais[,3:7], G=2, expert=~sex, equalPro=TRUE,
                              init.z="random", nstarts=10, estart=TRUE))
# Here, all three settings (listed here in order of speed) converge to the same model
MoE_compare(resA, resC, resB)
# }