MEDseq_control: Set control values for use with MEDseq_fit

Description

Supplies a list of arguments (with defaults) for use with MEDseq_fit.

Usage

MEDseq_control(algo = c("EM", "CEM", "cemEM"), 
               init.z = c("kmedoids", "kmodes", "kmodes2", "hc", "random", "list"), 
               z.list = NULL, 
               dist.mat = NULL, 
               unique = TRUE, 
               criterion = c("bic", "icl", "aic", "dbs", "asw", "cv", "nec"), 
               tau0 = NULL, 
               noise.gate = TRUE, 
               random = TRUE,
               do.cv = FALSE, 
               do.nec = FALSE, 
               nfolds = 10L, 
               nstarts = 1L, 
               stopping = c("aitken", "relative"), 
               equalPro = FALSE, 
               equalNoise = FALSE, 
               tol = c(1E-05, 1E-08), 
               itmax = c(.Machine$integer.max, 1000L), 
               opti = c("mode", "medoid", "first", "GA"), 
               ordering = c("none", "decreasing", "increasing"), 
               MaxNWts = 1000L, 
               verbose = TRUE, 
               ...)

Value

A named list in which the names are the names of the arguments and the values are the values supplied to the arguments.

Arguments

algo

Switch controlling whether models are fit using the "EM" (the default) or "CEM" algorithm. The option "cemEM" allows running the EM algorithm starting from convergence of the CEM algorithm.

init.z

The method used to initialise the cluster labels. All options respect the presence of sampling weights, if any. Defaults to "kmedoids". Other options include "kmodes", "kmodes2", Ward's hierarchical clustering ("hc", via hclust), "random" initialisation, and a user-supplied "list" (see z.list below). For weighted sequences, "kmedoids" is itself initialised using Ward's hierarchical clustering.

The "kmodes" and "kmodes2" options both internally call the function wKModes, which typically uses random initial modes. Under "kmodes", the algorithm is instead initialised via the medoids of the clusters obtained from a call to hclust. The option "kmodes2" is slightly faster, by virtue of using the random initial medoids. However, final results are by default still subject to randomness under both options (unless set.seed is invoked), as ties for modes and cluster assignments are typically broken at random throughout the algorithm (see the random argument below, and in wKModes itself).

z.list

A user-supplied list of initial cluster allocation matrices, with the number of rows given by the number of observations, and the numbers of columns given by the range of component numbers being considered. Only relevant if init.z == "list". These matrices may correspond to either soft or hard clusterings, and will be internally normalised so that the rows sum to 1.
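
As a rough illustration of the expected shape, the sketch below builds soft and hard allocation matrices in base R and normalises the rows to sum to 1. The names n, G_range, make_soft, and make_hard are hypothetical, and the assumption that the list holds one matrix per candidate number of components, in order, is a reading of the description above rather than confirmed behaviour.

```r
set.seed(1)
n       <- 6L     # hypothetical number of observations
G_range <- 2:3    # hypothetical range of component numbers

# Soft clustering: positive entries, normalised so each row sums to 1
make_soft <- function(n, G) {
  z <- matrix(runif(n * G), nrow = n, ncol = G)
  z / rowSums(z)
}

# Hard clustering: one-hot rows built from random labels
make_hard <- function(n, G) {
  labels <- sample.int(G, n, replace = TRUE)
  z      <- matrix(0, nrow = n, ncol = G)
  z[cbind(seq_len(n), labels)] <- 1
  z
}

# One matrix per candidate number of components (assumed layout)
z.list <- lapply(G_range, function(G) make_soft(n, G))
z.hard <- lapply(G_range, function(G) make_hard(n, G))
```

Either list could then be supplied via the init.z = "list" and z.list arguments.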

dist.mat

An optional distance matrix to use for initialisation when init.z is one of "kmedoids" or "hc". Defaults to a Hamming distance matrix. This is an experimental feature and should only be tampered with by expert users.

unique

A logical indicating whether the model is fit only to the unique observations (defaults to TRUE). When there are covariates, this means all unique combinations of covariate and sequence patterns, otherwise only the sequence patterns.

When weights are not supplied to MEDseq_fit and isTRUE(unique), weights are given by the occurrence frequency of the corresponding sequences, and the model is then fit to the unique observations only.

When weights are supplied and isTRUE(unique), the weights are summed for each set of duplicate observations and assigned to one retained copy of each corresponding unique sequence. Hence, observations with different weights that are otherwise duplicates are treated as duplicates and significant computational gains can be made.

In both cases, the results will be unchanged, but setting unique to TRUE can often be much faster.
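
The two weighting cases above can be sketched in base R as follows. The toy sequences and weights are hypothetical, and the code only illustrates the aggregation idea, not the package's internal implementation.

```r
# Hypothetical toy data: sequences stored as strings, with sampling weights
seqs    <- c("AAB", "ABB", "AAB", "BBB", "ABB", "AAB")
weights <- c(1.0, 0.5, 2.0, 1.5, 0.5, 1.0)

# Case 1: no weights supplied, so each unique sequence is weighted
# by its occurrence frequency
freq_w <- table(seqs)

# Case 2: weights supplied, so they are summed within each set of duplicates
agg_w  <- tapply(weights, seqs, sum)

# In both cases the model only needs the unique sequences
unique_seqs <- names(agg_w)
```

The total weight is preserved by the aggregation, which is why the results are unchanged while fewer observations need to be processed.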

criterion

When either G or modtype is a vector, criterion governs how the 'best' model is determined when gathering output. Defaults to "bic". Note that all criteria will be returned in any case, if possible.

tau0

Prior mixing proportion for the noise component. If supplied, a noise component will be added to the model in the estimation, with tau0 giving the prior probability of belonging to the noise component for all observations. Typically supplied as a scalar in the interval (0, 1), e.g. 0.1. Can be supplied as a vector when gating covariates are present and noise.gate is TRUE.

noise.gate

A logical indicating whether gating network covariates influence the mixing proportion for the noise component, if any. Defaults to TRUE, but leads to greater parsimony if FALSE. Only relevant in the presence of a noise component (i.e. the "CCN", "UCN", "CUN", and "UUN" models); only affects estimation in the presence of gating covariates.

random

A logical governing how ties for estimated central sequence positions are handled. When TRUE (the default), such ties are broken at random. When FALSE (the implied default prior to version 1.2.0 of this package), the first candidate state is always chosen. This argument affects all opti options. If verbose is TRUE and there are tie-breaking operations performed, a warning message is printed once per model, regardless of the number of such operations.

Note that this argument is also passed to wKModes if init.z is "kmodes" or "kmodes2". In certain rare cases, when the "CEM" algo is invoked with equalPro set to TRUE and the precision parameter(s) are constrained across clusters, this argument also governs ties for cluster assignments within MEDseq_fit.

do.cv

A logical indicating whether cross-validated log-likelihood scores should also be computed (see nfolds). Defaults to FALSE due to the significant computational burden incurred.

do.nec

A logical indicating whether the normalised entropy criterion (NEC) should also be computed (for models with more than one component). Defaults to FALSE. When TRUE, models with G=1 are always fitted.
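
The need for a G=1 fit follows from the commonly used definition of the NEC, which divides the classification entropy of a G-component fit by the gain in log-likelihood over the one-component fit. The sketch below assumes that definition; the function names entropy and nec and all numeric values are hypothetical, and the package's own computation may differ in detail.

```r
# Entropy of a matrix of posterior membership probabilities (rows sum to 1),
# treating 0 * log(0) as 0
entropy <- function(z) {
  -sum(ifelse(z > 0, z * log(z), 0))
}

# NEC for a G-component fit, given its log-likelihood and that of the G = 1 fit
nec <- function(z, loglik_G, loglik_1) {
  entropy(z) / (loglik_G - loglik_1)
}

# Hypothetical values for illustration only
z <- rbind(c(0.9, 0.1),
           c(0.2, 0.8),
           c(1.0, 0.0))
nec_value <- nec(z, loglik_G = -120, loglik_1 = -150)
```

Smaller values indicate better-separated clusters under this definition.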

nfolds

The number of folds to use when isTRUE(do.cv). Defaults to 10.

nstarts

The number of random initialisations to use when init.z="random". Defaults to 1. Results will be based on the random start yielding the highest estimated log-likelihood.

stopping

The criterion used to assess convergence of the EM/CEM algorithm. The default ("aitken") uses Aitken's acceleration method, otherwise the "relative" change in log-likelihood is monitored (which may be less strict).
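
To make the difference between the two rules concrete, the following base R sketch implements a generic Aitken-accelerated check and a relative-change check on three successive log-likelihood values. The function names and the exact form of each rule are assumptions made for illustration; the package may compare different iterates or scale the differences differently.

```r
# Aitken-accelerated estimate of the limiting log-likelihood from three
# successive values ll = c(l_{k-1}, l_k, l_{k+1})
aitken_limit <- function(ll) {
  stopifnot(length(ll) == 3L)
  a <- (ll[3] - ll[2]) / (ll[2] - ll[1])   # estimated linear convergence rate
  ll[2] + (ll[3] - ll[2]) / (1 - a)        # extrapolated limit
}

# Converged when the extrapolated limit is within tol of the latest value
converged_aitken <- function(ll, tol = 1e-05) {
  abs(aitken_limit(ll) - ll[3]) < tol
}

# Converged when the relative change in log-likelihood is below tol
converged_relative <- function(ll, tol = 1e-05) {
  abs((ll[3] - ll[2]) / ll[3]) < tol
}

# Hypothetical log-likelihood trace converging geometrically to -100
ll <- -100 - 0.5^(10:12)
aitken_limit(ll)
converged_aitken(ll)
converged_relative(ll)
```

For this trace the relative rule already signals convergence while the Aitken rule does not, which is one way the relative criterion can be less strict.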

equalPro

Logical variable indicating whether or not the mixing proportions are to be constrained to be equal in the model. Default: equalPro = FALSE. Only relevant when gating covariates are not supplied within MEDseq_fit, otherwise ignored. In the presence of a noise component, only the mixing proportions for the non-noise components are constrained to be equal (by default, see equalNoise), after accounting for the noise component.

equalNoise

Logical which is only invoked when isTRUE(equalPro) and gating covariates are not supplied. Under the default setting (FALSE), the mixing proportion for the noise component is estimated, and remaining mixing proportions are equal; when TRUE all components, including the noise component, have equal mixing proportions.
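
The interaction between equalPro and equalNoise can be sketched as below. The helper equal_props is hypothetical, as is the convention of placing the noise component last; tau_hat stands in for whatever noise proportion the algorithm estimates.

```r
# Mixing proportions under equalPro = TRUE with a noise component
# G       : number of non-noise components
# tau_hat : estimated noise proportion (only used when equalNoise = FALSE)
equal_props <- function(G, tau_hat, equalNoise = FALSE) {
  if (equalNoise) {
    rep(1 / (G + 1), G + 1)                 # all components equal, noise included
  } else {
    c(rep((1 - tau_hat) / G, G), tau_hat)   # noise estimated, remainder shared
  }
}

p_default <- equal_props(G = 3, tau_hat = 0.1)
p_equal   <- equal_props(G = 3, tau_hat = 0.1, equalNoise = TRUE)
```

With three non-noise components, the default gives 0.3 to each of them and 0.1 to the noise component, whereas equalNoise = TRUE gives 0.25 to all four.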

tol

A vector of length two giving relative convergence tolerances for 1) the log-likelihood of the EM/CEM algorithm, and 2) optimisation in the multinomial logistic regression in the gating network, respectively. The default is c(1e-05, 1e-08). If only one number is supplied, it is used as the tolerance in both cases.

itmax

A vector of length two giving integer limits on the number of iterations for 1) the EM/CEM algorithm, and 2) the multinomial logistic regression in the gating network, respectively. The default is c(.Machine$integer.max, 1000). This allows termination of the EM/CEM algorithm to be completely governed by tol[1]. If only one number is supplied, it is used as the iteration limit for the EM/CEM algorithm only and the other element of itmax retains its usual default.

If, for any model with gating covariates, the multinomial logistic regression in the gating network fails to converge in itmax[2] iterations at any stage of the EM/CEM algorithm, an appropriate warning will be printed, prompting the user to modify this argument.

opti

Character string indicating how central sequence parameters should be estimated. The default "mode" is exact and thus this experimental argument should only be tampered with by expert users. The option "medoid" fixes the central sequence(s) to be one of the observed sequences (like k-medoids). The other options "first" and "GA" use stochastic local search with the first-improvement and genetic algorithms, respectively, to mutate the medoid. Pre-computation of the Hamming distance matrix for the observed sequences speeds up computation of all options other than "mode".
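
The distinction between the "mode" and "medoid" options can be illustrated on toy data in base R. Everything below is a hypothetical sketch: the data, the weights, and the tie-handling (first state wins) are assumptions, and the code is not the package's implementation.

```r
# Hypothetical toy sequences: rows are observations, columns are time points
S <- rbind(c("A", "A", "B", "B"),
           c("A", "B", "B", "B"),
           c("A", "A", "A", "B"),
           c("B", "A", "B", "A"))
w <- c(1, 1, 1, 1)   # hypothetical observation weights

# "mode": position-wise weighted modal state (first state wins ties here)
central_mode <- apply(S, 2, function(col) {
  totals <- tapply(w, col, sum)
  names(totals)[which.max(totals)]
})

# Pairwise Hamming distances between the observed sequences
hamming <- function(x, y) sum(x != y)
D <- outer(seq_len(nrow(S)), seq_len(nrow(S)),
           Vectorize(function(i, j) hamming(S[i, ], S[j, ])))

# "medoid": the observed sequence minimising the weighted sum of distances
central_medoid <- S[which.min(as.vector(D %*% w)), ]
```

Here both approaches return the first sequence, but in general the position-wise mode need not coincide with any observed sequence, whereas the medoid always does. The matrix D is the kind of pre-computed Hamming distance matrix that the non-"mode" options can reuse.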

ordering

Experimental feature that should only be tampered with by experienced users. Allows sequences to be reordered on the basis of the column-wise entropy when opti is "first" or "GA".

MaxNWts

The maximum allowable number of weights in the call to multinom for the multinomial logistic regression in the gating network. There is no intrinsic limit in the code, but increasing MaxNWts will probably allow fits that are very slow. It may be necessary to increase MaxNWts when categorical concomitant variables with many levels are included or the number of components is high.

verbose

Logical indicating whether to print messages pertaining to progress to the screen during fitting. Defaults to TRUE if the session is interactive, and FALSE otherwise. If FALSE, warnings and error messages will still be printed to the screen, but everything else will be suppressed.

...

Catches unused arguments, and also allows the optional arguments ztol and summ to be passed to dbs (ztol and summ) as well as the ASW computation (summ), and the optional wKModes arguments iter.max, freq.weighted, and fast (provided init.z is one of "kmodes" or "kmodes2"). In such cases, the wKModes argument random is already controlled by the random argument above.

Author

Keefe Murphy - <keefe.murphy@mu.ie>

Details

MEDseq_control is provided for assigning values and defaults within MEDseq_fit. While the criterion argument controls the choice of the optimal number of components and MEDseq model type (in terms of the constraints or lack thereof on the precision parameters), MEDseq_compare is provided for choosing between fits with different combinations of covariates or different initialisation settings.

References

Murphy, K., Murphy, T. B., Piccarreta, R., and Gormley, I. C. (2021). Clustering longitudinal life-course sequences using mixtures of exponential-distance models. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(4): 1414-1451. <doi:10.1111/rssa.12712>.

Menardi, G. (2011). Density-based silhouette diagnostics for clustering methods. Statistics and Computing, 21(3): 295-308.

Hoos, H. and Stützle, T. (2004). Stochastic Local Search: Foundations and Applications. The Morgan Kaufmann Series in Artificial Intelligence. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Examples


library(MEDseq)
library(TraMineR)   # provides seqdef()

# The CC MEDseq model is almost equivalent to k-medoids when the
# CEM algorithm is employed, mixing proportions are constrained,
# and the central sequences are restricted to the observed sequences
ctrl  <- MEDseq_control(algo="CEM", equalPro=TRUE, opti="medoid", criterion="asw")

data(mvad)
# Note that ctrl must be explicitly named 'ctrl'
mod   <- MEDseq_fit(seqdef(mvad[,17:86]), G=11, modtype="CC", weights=mvad$weight, ctrl=ctrl)

# Alternatively, specify the control arguments directly
mod   <- MEDseq_fit(seqdef(mvad[,17:86]), G=11, modtype="CC", weights=mvad$weight,
                    algo="CEM", equalPro=TRUE, opti="medoid", criterion="asw")

# Note that supplying control arguments via a mix of the ... construct and the named argument
# 'ctrl', or supplying MEDseq_control output without naming it 'ctrl', can throw an error
