Supplies a list of arguments (with defaults) for use with MEDseq_fit.
MEDseq_control(algo = c("EM", "CEM", "cemEM"),
init.z = c("kmedoids", "kmodes", "kmodes2", "hc", "random", "list"),
z.list = NULL,
dist.mat = NULL,
unique = TRUE,
criterion = c("bic", "icl", "aic", "dbs", "asw", "cv", "nec"),
tau0 = NULL,
noise.gate = TRUE,
random = TRUE,
do.cv = FALSE,
do.nec = FALSE,
nfolds = 10L,
nstarts = 1L,
stopping = c("aitken", "relative"),
equalPro = FALSE,
equalNoise = FALSE,
tol = c(1E-05, 1E-08),
itmax = c(.Machine$integer.max, 1000L),
opti = c("mode", "medoid", "first", "GA"),
ordering = c("none", "decreasing", "increasing"),
MaxNWts = 1000L,
verbose = TRUE,
...)
A named list in which the names are the names of the arguments and the values are the values supplied to the arguments.
Switch controlling whether models are fit using the "EM" (the default) or "CEM" algorithm. The option "cemEM" allows running the EM algorithm starting from convergence of the CEM algorithm.
The method used to initialise the cluster labels. All options respect the presence of sampling weights, if any. Defaults to "kmedoids". Other options include "kmodes", "kmodes2", Ward's hierarchical clustering ("hc", via hclust), "random" initialisation, and a user-supplied "list" (see z.list below). For weighted sequences, "kmedoids" is itself initialised using Ward's hierarchical clustering.
The "kmodes" and "kmodes2" options both internally call the function wKModes, which typically uses random initial modes. Under "kmodes", the algorithm is instead initialised via the medoids of the clusters obtained from a call to hclust. The option "kmodes2" is slightly faster, by virtue of using the random initial medoids. However, final results are by default still subject to randomness under both options (unless set.seed is invoked), as ties for modes and cluster assignments are typically broken at random throughout the algorithm (see the random argument below, and in wKModes itself).
A user-supplied list of initial cluster allocation matrices, with the number of rows given by the number of observations, and the numbers of columns given by the range of component numbers being considered. Only relevant if init.z == "list". These matrices are allowed to correspond to either soft or hard clusterings, and will be internally normalised so that the rows sum to 1.
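As a rough illustration of the structure described above, the following base R sketch builds a list of hard (one-hot) allocation matrices and one soft matrix whose rows sum to 1. The sizes n and G_range, the helper hard_z, and the use of one matrix per candidate number of components are assumptions made for illustration only, not the package's own code.

```r
set.seed(123)
n <- 8          # hypothetical number of observations
G_range <- 2:4  # hypothetical range of component numbers

# Hard clustering: one-hot indicator matrix built from random labels
hard_z <- function(n, G) {
  labels <- sample(rep_len(seq_len(G), n))  # every cluster gets at least one member
  z <- matrix(0, nrow = n, ncol = G)
  z[cbind(seq_len(n), labels)] <- 1
  z
}

# One n x G matrix per candidate G (assumed layout)
z.list <- lapply(G_range, function(G) hard_z(n, G))

# Soft clustering: non-negative scores rescaled so that each row sums to 1
soft <- matrix(runif(n * 3), nrow = n)
soft <- soft / rowSums(soft)
```

A list of this shape would presumably be supplied as z.list together with init.z = "list".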
An optional distance matrix to use for initialisation when init.z is one of "kmedoids" or "hc". Defaults to a Hamming distance matrix. This is an experimental feature and should only be tampered with by expert users.
A logical indicating whether the model is fit only to the unique observations (defaults to TRUE). When there are covariates, this means all unique combinations of covariate and sequence patterns, otherwise only the sequence patterns.
When weights are not supplied to MEDseq_fit and isTRUE(unique), weights are given by the occurrence frequency of the corresponding sequences, and the model is then fit to the unique observations only.
When weights are supplied and isTRUE(unique), the weights are summed for each set of duplicate observations and assigned to one retained copy of each corresponding unique sequence. Hence, observations with different weights that are otherwise duplicates are treated as duplicates and significant computational gains can be made.
In both cases, the results will be unchanged, but setting unique to TRUE can often be much faster.
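The two weighting rules above can be sketched in base R as follows. This is only an illustration on a made-up toy data set (seqs and w are hypothetical), not the package's internal implementation.

```r
# Toy sequences (two time points) and hypothetical sampling weights
seqs <- data.frame(t1 = c("A", "A", "B", "A"),
                   t2 = c("B", "B", "B", "B"),
                   stringsAsFactors = FALSE)
w <- c(1, 2, 0.5, 1.5)

# Collapse each sequence to a single string so duplicates can be detected
key  <- do.call(paste, c(seqs, sep = "-"))
keep <- !duplicated(key)

# Retain one copy of each unique sequence
uniq_seqs <- seqs[keep, , drop = FALSE]

# No weights supplied: weights are the occurrence frequencies
w_freq <- as.numeric(table(key)[key[keep]])

# Weights supplied: weights are summed within each set of duplicates
w_summed <- as.numeric(tapply(w, key, sum)[key[keep]])
```

In either case the total weight is preserved, which is consistent with the statement that results are unchanged.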
When either G or modtype is a vector, criterion governs how the 'best' model is determined when gathering output. Defaults to "bic". Note that all criteria will be returned in any case, if possible.
Prior mixing proportion for the noise component. If supplied, a noise component will be added to the model in the estimation, with tau0 giving the prior probability of belonging to the noise component for all observations. Typically supplied as a scalar in the interval (0, 1), e.g. 0.1. Can be supplied as a vector when gating covariates are present and noise.gate is TRUE.
A logical indicating whether gating network covariates influence the mixing proportion for the noise component, if any. Defaults to TRUE, but leads to greater parsimony if FALSE. Only relevant in the presence of a noise component (i.e. the "CCN", "UCN", "CUN", and "UUN" models); only affects estimation in the presence of gating covariates.
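A minimal sketch of requesting a noise component whose mixing proportion is not influenced by gating covariates might look as follows. It uses only the tau0 and noise.gate arguments documented here; the object name ctrl_noise is hypothetical, and the call is guarded so that nothing is run when the MEDseq package is not installed.

```r
# Guarded so the snippet is a no-op when MEDseq is not installed
if (requireNamespace("MEDseq", quietly = TRUE)) {
  # Prior noise proportion of 0.1, with gating covariates (if any)
  # not affecting the noise component's mixing proportion
  ctrl_noise <- MEDseq::MEDseq_control(tau0 = 0.1, noise.gate = FALSE)
}
```

The resulting list would then be passed to MEDseq_fit via its ctrl argument, together with one of the noise model types such as "CCN".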
A logical governing how ties for estimated central sequence positions are handled. When TRUE (the default), such ties are broken at random. When FALSE (the implied default prior to version 1.2.0 of this package), the first candidate state is always chosen. This argument affects all opti options. If verbose is TRUE and there are tie-breaking operations performed, a warning message is printed once per model, regardless of the number of such operations.
Note that this argument is also passed to wKModes if init.z is "kmodes" or "kmodes2". In certain rare cases, when the "CEM" algo is invoked, equalPro is TRUE, and the precision parameter(s) are somehow constrained across clusters, this argument also governs ties for cluster assignments within MEDseq_fit.
A logical indicating whether cross-validated log-likelihood scores should also be computed (see nfolds). Defaults to FALSE due to the significant computational burden incurred.
A logical indicating whether the normalised entropy criterion (NEC) should also be computed (for models with more than one component). Defaults to FALSE. When TRUE, models with G=1 are always fitted.
The number of folds to use when isTRUE(do.cv).
The number of random initialisations to use when init.z="random". Defaults to 1. Results will be based on the random start yielding the highest estimated log-likelihood.
The criterion used to assess convergence of the EM/CEM algorithm. The default ("aitken") uses Aitken's acceleration method, otherwise the "relative" change in log-likelihood is monitored (which may be less strict).
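To make the distinction concrete, here is a generic base R sketch of the two kinds of stopping rule, using the textbook form of Aitken's acceleration. The function aitken_linf and the log-likelihood trace ll are hypothetical, and the exact quantities compared against tol inside the package may differ.

```r
# Aitken-accelerated estimate of the limiting log-likelihood, based on
# the three most recent values l[k-1], l[k], l[k+1] (generic sketch)
aitken_linf <- function(ll) {
  stopifnot(length(ll) >= 3)
  l <- tail(ll, 3)
  a <- (l[3] - l[2]) / (l[2] - l[1])  # estimated linear convergence rate
  l[2] + (l[3] - l[2]) / (1 - a)      # extrapolated limit
}

# Hypothetical log-likelihood trace converging geometrically
ll <- c(-110, -105, -102.5)
l_inf <- aitken_linf(ll)

# Aitken-style check: distance from the current value to the extrapolated limit
aitken_gap <- abs(l_inf - ll[3])

# Relative-change check: change over the last iteration, scaled by the previous value
relative_gap <- abs(ll[3] - ll[2]) / abs(ll[2])
```

On this trace the relative change is much smaller than the Aitken gap, so a relative-change rule with the same tolerance would stop earlier, which is one way to read "less strict".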
Logical variable indicating whether or not the mixing proportions are to be constrained to be equal in the model. Default: equalPro = FALSE. Only relevant when gating covariates are not supplied within MEDseq_fit, otherwise ignored. In the presence of a noise component, only the mixing proportions for the non-noise components are constrained to be equal (by default, see equalNoise), after accounting for the noise component.
Logical which is only invoked when isTRUE(equalPro) and gating covariates are not supplied. Under the default setting (FALSE), the mixing proportion for the noise component is estimated, and remaining mixing proportions are equal; when TRUE, all components, including the noise component, have equal mixing proportions.
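The arithmetic implied by the two settings can be sketched in base R. The number of non-noise components G and the estimated noise proportion tau_noise below are made-up values for illustration.

```r
G <- 3            # hypothetical number of non-noise components
tau_noise <- 0.1  # hypothetical estimated noise mixing proportion

# equalPro = TRUE, equalNoise = FALSE: the noise proportion is estimated and
# the remaining mass is shared equally among the non-noise components
tau_equalNoise_FALSE <- c(rep((1 - tau_noise) / G, G), tau_noise)

# equalPro = TRUE, equalNoise = TRUE: all G + 1 components share equally
tau_equalNoise_TRUE <- rep(1 / (G + 1), G + 1)
```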
A vector of length two giving relative convergence tolerances for 1) the log-likelihood of the EM/CEM algorithm, and 2) optimisation in the multinomial logistic regression in the gating network, respectively. The default is c(1e-05, 1e-08). If only one number is supplied, it is used as the tolerance in both cases.
A vector of length two giving integer limits on the number of iterations for 1) the EM/CEM algorithm, and 2) the multinomial logistic regression in the gating network, respectively. The default is c(.Machine$integer.max, 1000). This allows termination of the EM/CEM algorithm to be completely governed by tol[1]. If only one number is supplied, it is used as the iteration limit for the EM/CEM algorithm only and the other element of itmax retains its usual default.
If, for any model with gating covariates, the multinomial logistic regression in the gating network fails to converge in itmax[2] iterations at any stage of the EM/CEM algorithm, an appropriate warning will be printed, prompting the user to modify this argument.
Character string indicating how central sequence parameters should be estimated. The default "mode" is exact and thus this experimental argument should only be tampered with by expert users. The option "medoid" fixes the central sequence(s) to be one of the observed sequences (like k-medoids). The other options "first" and "GA" use stochastic local search with the first-improvement and genetic algorithms, respectively, to mutate the medoid. Pre-computation of the Hamming distance matrix for the observed sequences speeds up computation of all options other than "mode".
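The difference between the "mode" and "medoid" options can be illustrated with a small unweighted base R sketch for a single cluster. The toy sequences and helper functions are hypothetical, ties are resolved by taking the first candidate, and none of this is the package's own code.

```r
# Five toy sequences of length 3 (rows are sequences, columns are time points)
seqs <- rbind(c("A", "A", "B"),
              c("A", "B", "A"),
              c("B", "A", "A"),
              c("A", "A", "C"),
              c("A", "B", "A"))

# "mode": most frequent state at each position (first candidate on ties)
col_mode <- function(x) names(which.max(table(x)))
central_mode <- apply(seqs, 2, col_mode)

# Pairwise Hamming distances between the observed sequences
hamming <- function(a, b) sum(a != b)
n <- nrow(seqs)
D <- outer(seq_len(n), seq_len(n),
           Vectorize(function(i, j) hamming(seqs[i, ], seqs[j, ])))

# "medoid": the observed sequence with the smallest total distance to the others
central_medoid <- seqs[which.min(rowSums(D)), ]

# Total Hamming distance from each candidate central sequence to all sequences
total_mode   <- sum(apply(seqs, 1, hamming, b = central_mode))
total_medoid <- min(rowSums(D))
```

Here the position-wise mode is not one of the observed sequences and attains a smaller total Hamming distance than the medoid, which is in keeping with the description of "mode" as exact.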
Experimental feature that should only be tampered with by experienced users. Allows sequences to be reordered on the basis of the column-wise entropy when opti is "first" or "GA".
The maximum allowable number of weights in the call to multinom for the multinomial logistic regression in the gating network. There is no intrinsic limit in the code, but increasing MaxNWts will probably allow fits that are very slow and time-consuming. It may be necessary to increase MaxNWts when categorical concomitant variables with many levels are included or the number of components is high.
Logical indicating whether to print messages pertaining to progress to the screen during fitting. Defaults to TRUE if the session is interactive, and FALSE otherwise. If FALSE, warnings and error messages will still be printed to the screen, but everything else will be suppressed.
Catches unused arguments, and also allows the optional arguments ztol and summ to be passed to dbs (ztol and summ) as well as the ASW computation (summ), and the optional wKModes arguments iter.max, freq.weighted, and fast (provided init.z is one of "kmodes" or "kmodes2"). In such cases, the wKModes argument random is already controlled by the random argument above.
Keefe Murphy - <keefe.murphy@mu.ie>
MEDseq_control is provided for assigning values and defaults within MEDseq_fit. While the criterion argument controls the choice of the optimal number of components and MEDseq model type (in terms of the constraints or lack thereof on the precision parameters), MEDseq_compare is provided for choosing between fits with different combinations of covariates or different initialisation settings.
Murphy, K., Murphy, T. B., Piccarreta, R., and Gormley, I. C. (2021). Clustering longitudinal life-course sequences using mixtures of exponential-distance models. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(4): 1414-1451. <doi:10.1111/rssa.12712>.
Menardi, G. (2011). Density-based silhouette diagnostics for clustering methods. Statistics and Computing, 21(3): 295-308.
Hoos, H. and Stützle, T. (2004). Stochastic Local Search: Foundations and Applications. The Morgan Kaufmann Series in Artificial Intelligence. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
# The CC MEDseq model is almost equivalent to k-medoids when the
# CEM algorithm is employed, mixing proportions are constrained,
# and the central sequences are restricted to the observed sequences
ctrl <- MEDseq_control(algo="CEM", equalPro=TRUE, opti="medoid", criterion="asw")
data(mvad)
# Note that ctrl must be explicitly named 'ctrl'
mod <- MEDseq_fit(seqdef(mvad[,17:86]), G=11, modtype="CC", weights=mvad$weight, ctrl=ctrl)
# Alternatively, specify the control arguments directly
mod <- MEDseq_fit(seqdef(mvad[,17:86]), G=11, modtype="CC", weights=mvad$weight,
                  algo="CEM", equalPro=TRUE, opti="medoid", criterion="asw")
# Note that supplying control arguments via a mix of the ... construct and the named argument
# 'ctrl' or supplying MEDseq_control output without naming it 'ctrl' can throw an error