seqCompare: BIC and Likelihood ratio test for comparing two groups

Description

Functions seqCompare and dissCompare compute the likelihood ratio test (LRT) and Bayesian Information Criterion (BIC) difference for comparing two groups within each of a series of sets. seqCompare compares two groups of sequences. dissCompare is more general and compares any two groups of data represented by a pairwise dissimilarity matrix. Functions seqBIC and seqLRT are aliases of seqCompare that return only the \(\Delta\)BIC and only the LRT, respectively.

Usage

seqCompare(seqdata, seqdata2=NULL, group=NULL, set=NULL,
    s=100, seed=36963, stat="all", squared="LRTonly",
    weighted=TRUE, opt=NULL, BFopt=NULL, method, ...)
dissCompare(diss, group, set=NULL,
    s=100, seed=36963, stat="all", squared="LRTonly",
    weighted=TRUE, weights=NULL, BFopt=NULL)
seqLRT(seqdata, seqdata2=NULL, group=NULL, set=NULL, s=100,
    seed=36963, squared="LRTonly", weighted=TRUE, opt=NULL,
    BFopt=NULL, method, ...)
seqBIC(seqdata, seqdata2=NULL, group=NULL, set=NULL, s=100,
    seed=36963, squared="LRTonly", weighted=TRUE, opt=NULL,
    BFopt=NULL, method, ...)

Value

Function seqLRT (and seqCompare with stat="LRT") outputs a matrix with two columns, LRT and p.value.

LRT: This is the likelihood ratio test statistic for comparing the two groups.
p.value: This is the upper tail probability associated with the LRT.

Function seqBIC (and seqCompare with stat="BIC") outputs a matrix with two columns, Delta BIC and BF.

Delta BIC: This is the difference between BIC values of the model that does not distinguish the two groups and the model taking account of the distinction.
Bayes Factor: This is the Bayes factor associated with the BIC difference.

seqCompare and dissCompare with stat="all" return a matrix with all four indicators.

When set=NULL, the matrix has a single row. Otherwise, there is one row per distinct set value.

Arguments

seqdata: Either a state sequence object of class stslist created with seqdef or a list of state sequence objects, e.g., list(cohort1.seq,cohort2.seq,cohort3.seq).
seqdata2: Either a state sequence object of class stslist or a list of state sequence objects. Must be NULL when group is not NULL. If not NULL, must be of the same type as seqdata. See details.
diss: Matrix or distance object. Symmetric pairwise dissimilarities.
group: Vector of length equal to number of sequences in seqdata. A dichotomous grouping variable. See details.
set: Vector of length equal to number of sequences in seqdata. Variable defining the sets. See details.
s: Integer. Default 100. The size of random samples of sequences. When 0, no sampling is done.
seed: Integer. Default 36963. Using the same seed number guarantees the same results each time. Set seed=NULL if you don't want to set a seed. The random generator can be chosen with RNGkind.
stat: String. The requested statistics. One of "LRT", "BIC", or "all" (default).
squared: Logical. Should squared distances be used? Can also be "LRTonly", in which case the distances to the centers are computed using non-squared distances and LRT is computed with squared distances.
weighted: Logical or String. Should weights be taken into account when available? Can also be "by.group", in which case weights are used and normalized to respect group sizes.
weights: Vector of length equal to the number of rows of diss. Case weights. If NULL, weights are set to rep(1, nrow(diss)).
opt: Integer or NULL. Either 1 or 2. Computation option. When 1, the distance matrix is computed successively for each pair of samples of size s. When 2, the distances are computed only once for each pair of sets of observed sequences and the distances for the samples are extracted from that matrix. When NULL (default), 1 is chosen when the sum of sizes of the two groups is larger than 2*s and 2 otherwise.
BFopt: Integer or NULL. Either 1 or 2. Applies only when BIC is computed on multiple samples. When 1, the displayed Bayes Factor (BF) is the averaged BF. When 2, the displayed BF is obtained from the averaged BIC. When NULL, both BFs are displayed.
method: String. Method for computing sequence distances. See documentation for seqdist. Additional arguments may be required depending on the method chosen.
...: Additional arguments passed to seqdist.
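
To make the two BFopt choices concrete, the base-R sketch below contrasts them on made-up per-sample Delta BIC values. It assumes the usual approximation BF = exp(Delta BIC / 2); that conversion, the vector delta_bic, and the helper bf_from_bic are illustrative assumptions and are not taken from the package code.

```r
# Hypothetical Delta BIC values from five random samples (illustrative numbers only)
delta_bic <- c(1.2, 3.5, 0.4, 2.8, 5.1)

# Assumed conversion from a BIC difference to a Bayes factor (not taken from the package)
bf_from_bic <- function(d) exp(d / 2)

# Analogue of BFopt=1: average the per-sample Bayes factors
bf_avg <- mean(bf_from_bic(delta_bic))

# Analogue of BFopt=2: Bayes factor obtained from the averaged Delta BIC
bf_of_avg_bic <- bf_from_bic(mean(delta_bic))

print(c(averaged_BF = bf_avg, BF_from_averaged_BIC = bf_of_avg_bic))
```

Under this assumed conversion, the averaged BF can never be smaller than the BF obtained from the averaged BIC, because the exponential function is convex. The two displayed values may therefore differ noticeably when the per-sample Delta BIC values are dispersed.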

Author

Tim Liao and Gilbert Ritschard

Details

The group and set arguments can only be used when seqdata is an stslist object (a state sequence object).

When seqdata and seqdata2 are both provided, the LRT and Delta BIC statistics are computed for comparing these two sets. In that case both group and set should be left at their default NULL value.

When seqdata is a list of stslist objects, seqdata2 must be a list of the same number of stslist objects.

The default option squared="LRTonly" corresponds to the initial proposition of Liao and Fasang (2021). With that option, the distances to the virtual center are obtained from the pairwise non-squared dissimilarities and the resulting distances to the virtual center are squared when computing the LRT (which is in turn used to compute \(\Delta\)BIC). With squared=FALSE, non-squared distances are used in both cases, and with squared=TRUE, squared distances are used in both cases.
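
The three squared settings can be illustrated with a small base-R sketch. It is only one possible reading of the paragraph above, not the package's implementation: it assumes the distance to the virtual center is derived from the pairwise matrix as the row mean minus the overall sum divided by 2n^2, and it uses Euclidean distances between random points in place of sequence dissimilarities. The names X, d, and dist_to_center are hypothetical.

```r
set.seed(1)
X <- matrix(rnorm(20), ncol = 2)   # 10 illustrative points, not sequences
d <- as.matrix(dist(X))            # pairwise non-squared dissimilarities

# Assumed formula: distance of each case to the virtual center,
# derived from a pairwise dissimilarity matrix D
dist_to_center <- function(D) {
  n <- nrow(D)
  rowMeans(D) - sum(D) / (2 * n^2)
}

# Reading of squared="LRTonly": center distances from non-squared
# dissimilarities, squared afterwards when forming the sum used by the LRT
ss_lrtonly <- sum(dist_to_center(d)^2)

# Reading of squared=FALSE: non-squared distances in both steps
ss_false <- sum(dist_to_center(d))

# Reading of squared=TRUE: squared dissimilarities in both steps
ss_true <- sum(dist_to_center(d^2))

print(c(LRTonly = ss_lrtonly, not_squared = ss_false, squared = ss_true))
```

For Euclidean data, the squared=TRUE variant of this sketch reproduces the ordinary sum of squared distances to the centroid, which is a convenient sanity check on the assumed formula.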

The computation is based on the pairwise distances between the sequences. The opt argument selects one of two strategies. With opt=1, the matrix of distances is computed successively for each pair of samples of size s. With opt=2, the matrix of distances is computed once for the observed sequences and the distances for the samples are extracted from that matrix. Option 2 is often more efficient, especially for distances based on spells. It may be slower for methods such as OM or LCS when the number of observed sequences becomes large.
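
The two strategies can be mimicked in base R, with dist() on random numeric data standing in for seqdist() on sequences. This is a sketch of the idea only: X, idx, and the helper choose_opt are hypothetical, and choose_opt simply restates the default rule given in the description of the opt argument.

```r
set.seed(36963)
X <- matrix(rnorm(400), ncol = 2)  # 200 illustrative observations
s <- 50
idx <- sample(nrow(X), s)          # one random sample of size s

# Analogue of opt=1: compute the distance matrix for the sample only
d_opt1 <- as.matrix(dist(X[idx, ]))

# Analogue of opt=2: compute the full matrix once, then extract the sample
d_full <- as.matrix(dist(X))
d_opt2 <- d_full[idx, idx]

# Both strategies yield the same distances; only the computing cost differs
stopifnot(isTRUE(all.equal(unname(d_opt1), unname(d_opt2))))

# Default rule stated for opt=NULL: 1 when n1 + n2 > 2*s, 2 otherwise
choose_opt <- function(n1, n2, s) if (n1 + n2 > 2 * s) 1L else 2L
print(choose_opt(150, 120, 100))
print(choose_opt(60, 70, 100))
```

The trade-off is that the second strategy pays once for a matrix over all observed cases, while the first repeats a smaller computation for every sample.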

References

Tim F. Liao & Anette E. Fasang (2021). "Comparing Groups of Life Course Sequences Using the Bayesian Information Criterion and the Likelihood Ratio Test." Sociological Methodology, 51 (1), 44-85. doi:10.1177/0081175020959401.

Examples

## Load the packages; the biofam data set ships with TraMineR
library(TraMineR)
library(TraMineRextras)
data(biofam)
biofam.lab <- c("Parent", "Left", "Married", "Left+Marr",
                "Child", "Left+Child", "Left+Marr+Child", "Divorced")
alph <- seqstatl(biofam[10:25])
## To illustrate, we use only a sample of 150 cases
set.seed(10)
biofam <- biofam[sample(nrow(biofam),150),]
biofam.seq <- seqdef(biofam, 10:25, alphabet=alph, labels=biofam.lab)

## Defining the grouping variable
lang <- as.vector(biofam[["plingu02"]])
lang[is.na(lang)] <- "unknown"
lang <- factor(lang)

## Chronogram by language group
seqdplot(biofam.seq, group=lang)

## Extracting the sequence subsets by language
lev <- levels(lang)
l <- length(lev)
seq.list <- list()
for (i in 1:l){
  seq.list[[i]] <- biofam.seq[lang==lev[i],]
}

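## Compare the first two language groups, passed as lists of stslist objects
seqCompare(list(seq.list[[1]]), list(seq.list[[2]]), stat="all", method="OM", sm="CONSTANT")
## Delta BIC and Bayes factor for the difference between sexes
seqBIC(biofam.seq, group=biofam$sex, method="HAM")
## LRT between sexes within each language group, using samples of size 80
seqLRT(biofam.seq, group=biofam$sex, set=lang, s=80, method="HAM")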
