Fits MEDseq models: mixtures of Exponential-Distance models with gating covariates and sampling weights. Typically used for clustering categorical/longitudinal life-course sequences. Additional arguments are available via the function MEDseq_control.
MEDseq_fit(seqs,
G = 1L:9L,
modtype = c("CC", "UC", "CU", "UU",
"CCN", "UCN", "CUN", "UUN"),
gating = NULL,
weights = NULL,
ctrl = MEDseq_control(...),
covars = NULL,
...)# S3 method for MEDseq
summary(object,
classification = TRUE,
parameters = FALSE,
network = FALSE,
SPS = FALSE,
...)
# S3 method for MEDseq
print(x,
digits = 3L,
...)
A list (of class "MEDseq") with the following named entries (of which some may be missing, depending on the criterion employed), mostly corresponding to the chosen optimal model (as determined by the criterion within MEDseq_control):
callThe matched call.
dataThe input data, seqs.
modtypeA character string denoting the MEDseq model type at which the optimal criterion occurs.
GThe optimal number of mixture components according to criterion.
paramsA list with the following named components:
thetaA matrix with G rows and T columns, where T is the number of sequence positions, giving the central sequences of each cluster. The mean of the noise component is not reported, as it does not contribute in any way to the likelihood. A dedicated print function is provided.
lambdaA matrix of precision parameters. Will contain 1 row if the 1st letter of modtype is `C' and G columns otherwise. Will contain 1 column if the 2nd letter of modtype is `C' and T columns otherwise, where T is the number of sequence positions. Precision parameter values of zero are reported for the noise component, if any. Note that values of Inf are also possible, corresponding to zero-variance, which is most likely under the "UU" or "UUN" models. A dedicated print function is provided.
tauThe mixing proportions: either a vector of length G or, if gating covariates were supplied, a matrix with an entry for each observation (rows) and component (columns).
gatingAn object of class "MEDgating" (for which dedicated print, summary, and predict methods exist) and either "multinom" or "glm" (only for single-component models) giving the multinom regression coefficients of the gating network. If gating covariates were NOT supplied (or the best model has just one component), this corresponds to a RHS of ~1, otherwise the supplied gating formula. As such, a fitted gating network is always returned even in the absence of supplied covariates or clusters. If there is a noise component (and the option noise.gate=TRUE is invoked), its coefficients are those for the last component. Users are cautioned against making inferences about statistical significance from summaries of the coefficients in the gating network. Users are instead advised to use the function MEDseq_stderr.
zThe final responsibility matrix whose [i,k]-th entry is the probability that observation i belongs to the k-th component. If there is a noise component, its values are found in the last column.
MAPThe vector of cluster labels for the chosen model corresponding to z, i.e. max.col(z). Observations belonging to the noise component, if any, will belong to component 0.
BICA matrix of all BIC values with length{G} rows and length(modtype) columns. See Note.
ICLA matrix of all ICL values with length{G} rows and length(modtype) columns. See Note.
AICA matrix of all AIC values with length{G} rows and length(modtype) columns. See Note.
DBSA matrix of all (weighted) mean/median DBS values with length{G} rows and length(modtype) columns. See Note and dbs.
DBSvalsA list of lists giving the observation-specific DBS values for all fitted models. The first level of the list corresponds to numbers of components, the second to the MEDseq model types.
dbsThe (weighted) mean/median DBS value corresponding to the optimal model. May not necessarily be the optimal DBS.
dbsvalsObservation-specific DBS values corresponding to the optimum model, which may not be optimal in terms of DBS.
ASWA matrix of all (weighted) mean/median ASW values with length{G} rows and length(modtype) columns. See Note and wcSilhouetteObs.
ASWvalsA list of lists giving the observation-specific ASW values for all fitted models. The first level of the list corresponds to numbers of components, the second to the MEDseq model types.
aswThe (weighted) mean/median ASW value corresponding to the optimal model. May not necessarily be the optimal ASW.
aswvalsObservation-specific ASW values corresponding to the optimum model, which may not be optimal in terms of ASW.
LOGLIKA matrix of all maximal log-likelihood values with length{G} rows and length(modtype) columns. See Note.
DFA matrix giving the numbers of estimated parameters (i.e. the number of `used' degrees of freedom) for all visited models, with length{G} rows and length(modtype) columns. Subtract these numbers from the sample size to get the degrees of freedom. See Note.
ITERSA matrix giving the total number of EM/CEM iterations for all visited models, with length{G} rows and length(modtype) columns. See Note.
CVA matrix of all cross-validated log-likelihood values with length{G} rows and length(modtype) columns, if available. See Note and the arguments do.cv and nfolds to MEDseq_control.
NECA matrix of all NEC values with length{G} rows and length(modtype) columns, if available. See Note and the argument do.nec to MEDseq_control.
bicThe BIC value corresponding to the optimal model. May not necessarily be the optimal BIC.
iclThe ICL value corresponding to the optimal model. May not necessarily be the optimal ICL.
aicThe AIC value corresponding to the optimal model. May not necessarily be the optimal AIC.
loglikThe vector of increasing log-likelihood values for every EM/CEM iteration under the optimal model. The last element of this vector is the maximum log-likelihood achieved by the parameters returned at convergence.
dfThe number of estimated parameters in the optimal model (i.e. the number of `used' degrees of freedom). Subtract this number from the sample size to get the degrees of freedom.
itersThe total number of EM/CEM iterations for the optimal model.
cvThe cross-validated log-likelihood value corresponding to the optimal model, if available. May not necessarily be the optimal one.
necThe NEC value corresponding to the optimal model, if available. May not necessarily be the optimal NEC.
ZSA list of lists giving the z matrices for all fitted models. The first level of the list corresponds to numbers of components, the second to the MEDseq model types.
uncertThe uncertainty associated with the classification.
covarsA data frame gathering the set of covariates used in the gating network, if any. Will contain zero columns in the absence of gating covariates. Supplied gating covariates will be excluded if the optimal model has only one component. May have fewer columns than covariates supplied via the covars argument also, as only the included covariates are gathered here.
Dedicated plot, print, and summary functions exist for objects of class "MEDseq".
A state-sequence object of class "stslist" as created by the seqdef function in the TraMineR package (which is reexported by MEDseq for convenience). Note that the data set must have equal sequence lengths, the intervals are assumed to be evenly spaced, and neither missing values nor void values are allowed. These conditions can be checked using is.stslist and seqhasmiss.
A positive integer vector specifying the numbers of mixture components (clusters) to fit. Defaults to G=1:9.
A vector of character strings indicating the type of MEDseq models to be fitted, in terms of the constraints or lack thereof on the precision parameters. By default, all valid model types are fitted (except some only where G > 1 or G > 2, see Note).
The models are named "CC", "CU", "UC", "UU", "CCN", "CUN", "UCN", and "UUN". The first letter denotes whether the precision parameters are constrained/unconstrained across clusters. The second letter denotes whether the precision parameters are constrained/unconstrained across sequence positions (i.e. time points). The third letter denotes whether one of the components is constrained to have zero-precision/infinite variance. Such a noise component assumes sequences in that cluster follow a uniform distribution.
A formula for determining the model matrix for the multinomial logistic regression in the gating network when fixed covariates enter the mixing proportions. Defaults to ~1, i.e. no covariates. This will be ignored where G=1. Continuous, categorical, and/or ordinal covariates are allowed. Logical covariates will be coerced to factors. Interactions, transformations, and higher order terms are permitted: the latter must be specified explicitly using the AsIs operator (I). The specification of the LHS of the formula is ignored. Intercept terms are included by default.
Optional numeric vector containing observation-specific sampling weights, which are accounted for in the model fitting and other functions where applicable. All weights must be non-negative and non-missing. weights are always internally normalised to sum to the sample size. See the unique argument to MEDseq_control to see how incorporating weights also yields computational benefits. Note that weights must always be explicitly supplied here; it is not enough to use weights when constructing the state sequence object via seqdef (reexported by MEDseq for convenience). If you are using a weighted "stslist" state sequence object and do not specify weights, you will be prompted to explicitly specify weights=attr(seqs, "weights") for a weighted model or weights=NULL for an unweighted model.
A list of control parameters for the EM/CEM and other aspects of the algorithm. The defaults are set by a call to MEDseq_control.
An optional data frame (or a matrix with named columns) in which to look for the covariates in the gating network formula, if any. If not found in covars, any supplied gating covariates are taken from the environment from which MEDseq_fit is called. Try to ensure the names of variables in covars do not match any of those in seqs.
Catches unused arguments (see MEDseq_control).
Arguments required for the print and summary functions: x and object are objects of class "MEDseq" resulting from a call to MEDseq_fit, while digits gives the number of decimal places to round to for printing purposes (defaults to 3). classification, parameters, and network are logicals which govern whether a table of the MAP classification of observations, the mixture component parameters, and the gating network coefficients are printed, respectively. SPS governs the printing of the relevant quantities in "summaryMEDseq" objects when any of classification, parameters, &/or network are TRUE (see MEDseq_clustnames and seqformat).
Keefe Murphy - <keefe.murphy@mu.ie>
The function effectively allows 8 different MEDseq precision parameter settings for models with or without gating network covariates. By constraining the mixing proportions to be equal (see equalPro in MEDseq_control) an extra special case is facilitated in the latter case.
While model selection in terms of choosing the optimal number of components and the MEDseq model type is performed within MEDseq_fit, using one of the criterion options within MEDseq_control, choosing between multiple fits with different combinations of covariates or different initialisation settings can be done by supplying objects of class "MEDseq" to MEDseq_compare.
Murphy, K., Murphy, T. B., Piccarreta, R., and Gormley, I. C. (2021). Clustering longitudinal life-course sequences using mixtures of exponential-distance models. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(4): 1414-1451. <tools:::Rd_expr_doi("10.1111/rssa.12712")>.
seqdef (reexported by MEDseq for convenience), MEDseq_control, MEDseq_compare, plot.MEDseq, predict.MEDgating, MEDseq_stderr, MEDseq_clustnames, dbs, wcSilhouetteObs, seqformat, is.stslist, seqhasmiss, I
if (FALSE) { # interactive()
# Load the MVAD data
data(mvad)
mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x)
which(x == "yes")), labels = colnames(mvad[,5:9]))
mvad <- list(covariates = mvad[c(3:4,10:14,87)],
sequences = mvad[,15:86],
weights = mvad[,2])
mvad.cov <- mvad$covariates
# Create a state sequence object with the first two (summer) time points removed
states <- c("EM", "FE", "HE", "JL", "SC", "TR")
labels <- c("Employment", "Further Education", "Higher Education",
"Joblessness", "School", "Training")
mvad.seq <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels)
# Fit a range of exponential-distance models without clustering
mod0 <- MEDseq_fit(mvad.seq, G=1)
# \donttest{
# Fit a range of unweighted mixture models without covariates
# Only consider models with a noise component
# Supply some MEDseq_control() arguments
# mod1 <- MEDseq_fit(mvad.seq, G=9:10, modtype=c("CCN", "CUN", "UCN", "UUN"),
# algo="CEM", init.z="kmodes", criterion="icl")
# Fit a model with weights and a gating covariate
# Have the probability of noise-component membership be constant
mod2 <- MEDseq_fit(mvad.seq, G=11, modtype="UUN", weights=mvad$weights,
gating=~ gcse5eq, covars=mvad.cov, noise.gate=FALSE)
# Examine this model in greater detail
summary(mod2, classification=TRUE, parameters=TRUE)
summary(mod2$gating, SPS=TRUE)
print(mod2$params$theta, SPS=TRUE)
plot(mod2, "clusters")# }
}
Run the code above in your browser using DataLab