MEDseq_fit: MEDseq: Mixtures of Exponential-Distance Models with Covariates

Description

Fits MEDseq models: mixtures of Exponential-Distance models with gating covariates and sampling weights. Typically used for clustering categorical/longitudinal life-course sequences. Additional arguments are available via the function MEDseq_control.

Usage

MEDseq_fit(seqs, 
           G = 1L:9L, 
           modtype = c("CC", "UC", "CU", "UU", 
                       "CCN", "UCN", "CUN", "UUN"), 
           gating = NULL, 
           weights = NULL, 
           ctrl = MEDseq_control(...), 
           covars = NULL, 
           ...)
# S3 method for MEDseq
summary(object,
        classification = TRUE,
        parameters = FALSE,
        network = FALSE,
        SPS = FALSE,
        ...)
# S3 method for MEDseq
print(x,
      digits = 3L,
      ...)

Value

A list (of class "MEDseq") with the following named entries (of which some may be missing, depending on the criterion employed), mostly corresponding to the chosen optimal model (as determined by the criterion within MEDseq_control):

call

The matched call.

data

The input data, seqs.

modtype

A character string denoting the MEDseq model type at which the optimal criterion occurs.

G

The optimal number of mixture components according to criterion.

params

A list with the following named components:

theta: A matrix with G rows and T columns, where T is the number of sequence positions, giving the central sequences of each cluster. The mean of the noise component is not reported, as it does not contribute in any way to the likelihood. A dedicated print function is provided.

lambda

A matrix of precision parameters. Will contain 1 row if the 1st letter of modtype is `C' and G columns otherwise. Will contain 1 column if the 2nd letter of modtype is `C' and T columns otherwise, where T is the number of sequence positions. Precision parameter values of zero are reported for the noise component, if any. Note that values of Inf are also possible, corresponding to zero-variance, which is most likely under the "UU" or "UUN" models. A dedicated print function is provided.

tau

The mixing proportions: either a vector of length G or, if gating covariates were supplied, a matrix with an entry for each observation (rows) and component (columns).

gating

An object of class "MEDgating" (for which dedicated print, summary, and predict methods exist) and either "multinom" or "glm" (only for single-component models) giving the multinom regression coefficients of the gating network. If gating covariates were NOT supplied (or the best model has just one component), this corresponds to a RHS of ~1, otherwise the supplied gating formula. As such, a fitted gating network is always returned even in the absence of supplied covariates or clusters. If there is a noise component (and the option noise.gate=TRUE is invoked), its coefficients are those for the last component. Users are cautioned against making inferences about statistical significance from summaries of the coefficients in the gating network. Users are instead advised to use the function MEDseq_stderr.

z

The final responsibility matrix whose [i,k]-th entry is the probability that observation i belongs to the k-th component. If there is a noise component, its values are found in the last column.

MAP

The vector of cluster labels for the chosen model corresponding to z, i.e. max.col(z). Observations belonging to the noise component, if any, will belong to component 0.

BIC

A matrix of all BIC values with length{G} rows and length(modtype) columns. See Note.

ICL

A matrix of all ICL values with length{G} rows and length(modtype) columns. See Note.

AIC

A matrix of all AIC values with length{G} rows and length(modtype) columns. See Note.

DBS

A matrix of all (weighted) mean/median DBS values with length{G} rows and length(modtype) columns. See Note and dbs.

DBSvals

A list of lists giving the observation-specific DBS values for all fitted models. The first level of the list corresponds to numbers of components, the second to the MEDseq model types.

dbs

The (weighted) mean/median DBS value corresponding to the optimal model. May not necessarily be the optimal DBS.

dbsvals

Observation-specific DBS values corresponding to the optimum model, which may not be optimal in terms of DBS.

ASW

A matrix of all (weighted) mean/median ASW values with length{G} rows and length(modtype) columns. See Note and wcSilhouetteObs.

ASWvals

A list of lists giving the observation-specific ASW values for all fitted models. The first level of the list corresponds to numbers of components, the second to the MEDseq model types.

asw

The (weighted) mean/median ASW value corresponding to the optimal model. May not necessarily be the optimal ASW.

aswvals

Observation-specific ASW values corresponding to the optimum model, which may not be optimal in terms of ASW.

LOGLIK

A matrix of all maximal log-likelihood values with length{G} rows and length(modtype) columns. See Note.

DF

A matrix giving the numbers of estimated parameters (i.e. the number of `used' degrees of freedom) for all visited models, with length{G} rows and length(modtype) columns. Subtract these numbers from the sample size to get the degrees of freedom. See Note.

ITERS

A matrix giving the total number of EM/CEM iterations for all visited models, with length{G} rows and length(modtype) columns. See Note.

CV

A matrix of all cross-validated log-likelihood values with length{G} rows and length(modtype) columns, if available. See Note and the arguments do.cv and nfolds to MEDseq_control.

NEC

A matrix of all NEC values with length{G} rows and length(modtype) columns, if available. See Note and the argument do.nec to MEDseq_control.

bic

The BIC value corresponding to the optimal model. May not necessarily be the optimal BIC.

icl

The ICL value corresponding to the optimal model. May not necessarily be the optimal ICL.

aic

The AIC value corresponding to the optimal model. May not necessarily be the optimal AIC.

loglik

The vector of increasing log-likelihood values for every EM/CEM iteration under the optimal model. The last element of this vector is the maximum log-likelihood achieved by the parameters returned at convergence.

df

The number of estimated parameters in the optimal model (i.e. the number of `used' degrees of freedom). Subtract this number from the sample size to get the degrees of freedom.

iters

The total number of EM/CEM iterations for the optimal model.

cv

The cross-validated log-likelihood value corresponding to the optimal model, if available. May not necessarily be the optimal one.

nec

The NEC value corresponding to the optimal model, if available. May not necessarily be the optimal NEC.

ZS

A list of lists giving the z matrices for all fitted models. The first level of the list corresponds to numbers of components, the second to the MEDseq model types.

uncert

The uncertainty associated with the classification.

covars

A data frame gathering the set of covariates used in the gating network, if any. Will contain zero columns in the absence of gating covariates. Supplied gating covariates will be excluded if the optimal model has only one component. May have fewer columns than covariates supplied via the covars argument also, as only the included covariates are gathered here.

Dedicated plot, print, and summary functions exist for objects of class "MEDseq".

Arguments

seqs: A state-sequence object of class "stslist" as created by the seqdef function in the TraMineR package (which is reexported by MEDseq for convenience). Note that the data set must have equal sequence lengths, the intervals are assumed to be evenly spaced, and neither missing values nor void values are allowed. These conditions can be checked using is.stslist and seqhasmiss.
G: A positive integer vector specifying the numbers of mixture components (clusters) to fit. Defaults to G=1:9.
modtype: A vector of character strings indicating the type of MEDseq models to be fitted, in terms of the constraints or lack thereof on the precision parameters. By default, all valid model types are fitted (except some only where G > 1 or G > 2, see Note). The models are named "CC", "CU", "UC", "UU", "CCN", "CUN", "UCN", and "UUN". The first letter denotes whether the precision parameters are constrained/unconstrained across clusters. The second letter denotes whether the precision parameters are constrained/unconstrained across sequence positions (i.e. time points). The third letter denotes whether one of the components is constrained to have zero-precision/infinite variance. Such a noise component assumes sequences in that cluster follow a uniform distribution.
gating: A formula for determining the model matrix for the multinomial logistic regression in the gating network when fixed covariates enter the mixing proportions. Defaults to ~1, i.e. no covariates. This will be ignored where G=1. Continuous, categorical, and/or ordinal covariates are allowed. Logical covariates will be coerced to factors. Interactions, transformations, and higher order terms are permitted: the latter must be specified explicitly using the AsIs operator (I). The specification of the LHS of the formula is ignored. Intercept terms are included by default.
weights: Optional numeric vector containing observation-specific sampling weights, which are accounted for in the model fitting and other functions where applicable. All weights must be non-negative and non-missing. weights are always internally normalised to sum to the sample size. See the unique argument to MEDseq_control to see how incorporating weights also yields computational benefits. Note that weights must always be explicitly supplied here; it is not enough to use weights when constructing the state sequence object via seqdef (reexported by MEDseq for convenience). If you are using a weighted "stslist" state sequence object and do not specify weights, you will be prompted to explicitly specify weights=attr(seqs, "weights") for a weighted model or weights=NULL for an unweighted model.
ctrl: A list of control parameters for the EM/CEM and other aspects of the algorithm. The defaults are set by a call to MEDseq_control.
covars: An optional data frame (or a matrix with named columns) in which to look for the covariates in the gating network formula, if any. If not found in covars, any supplied gating covariates are taken from the environment from which MEDseq_fit is called. Try to ensure the names of variables in covars do not match any of those in seqs.
...: Catches unused arguments (see MEDseq_control).
x, object, digits, classification, parameters, network, SPS: Arguments required for the print and summary functions: x and object are objects of class "MEDseq" resulting from a call to MEDseq_fit, while digits gives the number of decimal places to round to for printing purposes (defaults to 3). classification, parameters, and network are logicals which govern whether a table of the MAP classification of observations, the mixture component parameters, and the gating network coefficients are printed, respectively. SPS governs the printing of the relevant quantities in "summaryMEDseq" objects when any of classification, parameters, &/or network are TRUE (see MEDseq_clustnames and seqformat).

Author

Keefe Murphy - <keefe.murphy@mu.ie>

Details

The function effectively allows 8 different MEDseq precision parameter settings for models with or without gating network covariates. By constraining the mixing proportions to be equal (see equalPro in MEDseq_control) an extra special case is facilitated in the latter case.

While model selection in terms of choosing the optimal number of components and the MEDseq model type is performed within MEDseq_fit, using one of the criterion options within MEDseq_control, choosing between multiple fits with different combinations of covariates or different initialisation settings can be done by supplying objects of class "MEDseq" to MEDseq_compare.

References

Murphy, K., Murphy, T. B., Piccarreta, R., and Gormley, I. C. (2021). Clustering longitudinal life-course sequences using mixtures of exponential-distance models. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(4): 1414-1451. <tools:::Rd_expr_doi("10.1111/rssa.12712")>.

Examples

Run this code

if (FALSE) { # interactive()
# Load the MVAD data
data(mvad)
mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x) 
                 which(x == "yes")), labels = colnames(mvad[,5:9]))
mvad          <- list(covariates = mvad[c(3:4,10:14,87)],
                      sequences = mvad[,15:86], 
                      weights = mvad[,2])
mvad.cov      <- mvad$covariates

# Create a state sequence object with the first two (summer) time points removed
states        <- c("EM", "FE", "HE", "JL", "SC", "TR")
labels        <- c("Employment", "Further Education", "Higher Education", 
                   "Joblessness", "School", "Training")
mvad.seq      <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels)

# Fit a range of exponential-distance models without clustering
mod0          <- MEDseq_fit(mvad.seq, G=1)

# \donttest{
# Fit a range of unweighted mixture models without covariates
# Only consider models with a noise component
# Supply some MEDseq_control() arguments
# mod1        <- MEDseq_fit(mvad.seq, G=9:10, modtype=c("CCN", "CUN", "UCN", "UUN"),
#                           algo="CEM", init.z="kmodes", criterion="icl")

# Fit a model with weights and a gating covariate
# Have the probability of noise-component membership be constant
mod2          <- MEDseq_fit(mvad.seq, G=11, modtype="UUN", weights=mvad$weights, 
                            gating=~ gcse5eq, covars=mvad.cov, noise.gate=FALSE)
                            
# Examine this model in greater detail
summary(mod2, classification=TRUE, parameters=TRUE)
summary(mod2$gating, SPS=TRUE)
print(mod2$params$theta, SPS=TRUE)
plot(mod2, "clusters")# }
}

Run the code above in your browser using DataLab