TraMineR (version 2.2-5)

seqMD: Multidomain sequences

Description

Build multidomain (MD) sequences of combined individual domain states (expanded alphabet), derive multidomain indel and substitution costs from domain costs by means of an additive trick (AT), and compute OM pairwise distances using AT costs.

Usage

seqMD(channels, method=NULL, norm="none", indel="auto", sm=NULL,
     with.missing=FALSE, full.matrix=TRUE, link="sum", cval=2,
     miss.cost=2, cweight=NULL, what="MDseq", ch.sep="+")

seqdistmc(channels, what="diss", ch.sep="@@@@TraMineRSep@@@@", ...)

Value

When what="MDseq", the MD sequences of combined states as a stslist sequence object.

When what="cost", the matrix of AT-substitution costs with three attributes: indel (the AT-indel cost(s)), alphabet (the alphabet of the combined state sequences), and cweight (the channel weights used).

When what="diss", a matrix of pairwise distances between MD sequences.
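
The two return forms for distances (a full matrix, or an object of class dist when full.matrix=FALSE) hold the same information. The base-R sketch below illustrates that relationship on a small made-up matrix; d.full is a hypothetical stand-in for seqMD output, not an actual result.

```r
# Hypothetical 3 x 3 distance matrix standing in for what="diss" output.
d.full <- matrix(c(0, 2, 4,
                   2, 0, 6,
                   4, 6, 0), nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("s", 1:3), paste0("s", 1:3)))

# Compact form: only the lower triangle is stored, class "dist".
d.compact <- as.dist(d.full)

# Back to a full symmetric matrix.
d.back <- as.matrix(d.compact)

print(d.compact)
```

The compact form is what functions such as hclust expect, while the full matrix is convenient for indexing individual pairs.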

Arguments

channels

A list of domain state sequence objects defined with the seqdef function, each state sequence object corresponding to a domain.

method

A character string indicating a dissimilarity measure between sequences. One of "OM" (Optimal Matching), "LCS" (Longest Common Subsequence), "HAM" (Hamming distance), or "DHD" (Dynamic Hamming distance).

norm

String. Default: "none". The normalization method to use. See seqdist.

indel

Double, vector of doubles, or list with an insertion/deletion cost or a vector of state dependent indel costs for each domain. Can also be "auto" (default), in which case the indel cost of each domain is automatically set in accordance with the sm value of the domain. See indel argument of seqdist.

sm

A list with a substitution-cost matrix for each domain or a list of method names for generating the domain substitution costs (see seqcost). Ignored when method="LCS".

with.missing

Logical or vector of logicals. Must be TRUE for channels with non-deleted gaps (missing values). See details.

full.matrix

Logical. If TRUE (default), the full distance matrix between MD sequences is returned. If FALSE, an object of class dist is returned.

link

Character string. One of "sum" or "mean". Method to compute the "link" between domains. Default is to sum substitution and indel costs.

cval

Double. Domain substitution cost for "CONSTANT" matrix, see seqcost.

miss.cost

Double. Cost to substitute missing values at domain level, see seqcost.

cweight

A vector of domain weights. Default is 1 (same weight for each domain).

what

Character string. What output should be returned? One of "MDseq", "cost", "diss". The deprecated value what="sm" is treated as what="cost".

ch.sep

Character string. Separator used for building state names of the expanded alphabet.

...

Arguments passed to seqMD.
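
To make the roles of channels and ch.sep more concrete, here is a minimal base-R sketch of how combined state names can be formed by pasting domain states position by position. The toy matrices left and married are hypothetical, and the code only illustrates the idea; it is not TraMineR's internal implementation.

```r
# Illustrative base-R sketch; not TraMineR's internal code.
# Two hypothetical domains observed for 3 individuals over 4 positions.
left    <- matrix(c("0", "0", "1", "1",
                    "0", "1", "1", "1",
                    "0", "0", "0", "0"), nrow = 3, byrow = TRUE)
married <- matrix(c("0", "0", "0", "1",
                    "0", "0", "1", "1",
                    "0", "0", "0", "0"), nrow = 3, byrow = TRUE)

ch.sep <- "+"  # plays the same role as the ch.sep argument

# Paste the domain states position by position, keeping the matrix shape.
md <- matrix(paste(left, married, sep = ch.sep), nrow = nrow(left))

# Combined states that actually occur in this toy data.
md.alphabet <- sort(unique(as.vector(md)))

print(md)
print(md.alphabet)
```

A separator that cannot appear in any domain state label keeps the combined names unambiguous.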

Author

Gilbert Ritschard and Matthias Studer

Details

The seqMD function builds MD sequences by combining the domain states. When what="cost", it derives multidomain indel and substitution costs from the indel and substitution costs of each domain by means of the additive trick (AT) proposed by Pollock (2007). When what="diss", it computes multidomain distances using the AT-multidomain costs. The available metrics (see the method argument) are optimal matching ("OM"), longest common subsequence ("LCS"), Hamming distance ("HAM"), and dynamic Hamming distance ("DHD"). The selected metric is used to compute pairwise domain dissimilarities. It is also used to compute MD distances, except when method="LCS", in which case MD distances are obtained with OM. For other edit distances, extract the combined state sequence object (by setting what="MDseq") and the AT-multidomain substitution and indel costs (by setting what="cost"). Then use these outcomes as input in a call to seqdist. See seqdist for more information about available distance measures.
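
The additive idea described above can be sketched in base R for two small domains: the cost of substituting one combined state for another is taken as the weighted sum of the corresponding domain substitution costs. All objects below (sm1, sm2, the weights, and the rule setting each domain indel to half the maximum substitution cost) are assumptions made for illustration, not values or code taken from TraMineR.

```r
# Minimal base-R sketch of an additive combination of domain costs.
# All objects are hypothetical; this is not TraMineR's implementation.
states <- c("0", "1")
sm1 <- matrix(c(0, 2, 2, 0), 2, 2, dimnames = list(states, states))  # domain 1
sm2 <- matrix(c(0, 1, 1, 0), 2, 2, dimnames = list(states, states))  # domain 2
cweight <- c(1, 2)   # domain weights
ch.sep <- "+"

# Expanded alphabet: every combination of one state per domain.
grid <- expand.grid(d1 = states, d2 = states, stringsAsFactors = FALSE)
md.states <- paste(grid$d1, grid$d2, sep = ch.sep)

# MD substitution cost = weighted sum of the domain substitution costs.
md.sm <- outer(seq_along(md.states), seq_along(md.states),
  function(i, j) cweight[1] * sm1[cbind(grid$d1[i], grid$d1[j])] +
                 cweight[2] * sm2[cbind(grid$d2[i], grid$d2[j])])
dimnames(md.sm) <- list(md.states, md.states)

# MD indel = weighted sum of the domain indels (each set here to
# max(sm)/2, an assumption made for this sketch).
md.indel <- sum(cweight * c(max(sm1), max(sm2)) / 2)

print(md.sm)
print(md.indel)
```

In this sketch, changing both domains at once (for example "0+0" to "1+1") costs the sum of the two weighted domain costs, which is what makes the combined costs additive across domains.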

Normalization may be useful when dealing with sequences that are not all of the same length. For details on the applied normalization, see seqdist.

References

Pollock, Gary (2007) Holistic trajectories: a study of combined employment, housing and family careers by using multiple-sequence analysis. Journal of the Royal Statistical Society: Series A 170, Part 1, 167-183.

See Also

seqcost, seqdef, seqdist.

Examples

data(biofam)

## Building one channel per type of event: left home, married, and child
cases <- 200
bf <- as.matrix(biofam[1:cases, 10:25])
left <- bf==1 | bf==3 | bf==5 | bf==6
married <- bf == 2 | bf== 3 | bf==6
children <-  bf==4 | bf==5 | bf==6

## Building sequence objects
left.seq <- seqdef(left)
marr.seq <- seqdef(married)
child.seq <- seqdef(children)
channels <- list(LeftHome=left.seq, Marr=marr.seq, Child=child.seq)

## AT-multidomain distances based on channel specific cost methods
MDdist <- seqMD(channels, method="OM",
    sm =list("INDELSLOG", "INDELSLOG", "TRATE"), what="diss")

## Providing channel specific substitution costs
smatrix <- list()
smatrix[[1]] <- seqsubm(left.seq, method="TRATE")
smatrix[[2]] <- seqsubm(marr.seq, method="CONSTANT")
smatrix[[3]] <- seqsubm(child.seq, method="CONSTANT")

## Retrieving the MD sequences
MDseq <- seqMD(channels)
alphabet(MDseq)

## Retrieving the AT-multidomain substitution costs
## Using a weight of 2 for domain "child"
MDcost <- seqMD(channels,
    sm=smatrix, cweight=c(1,1,2), what="cost")

## OMspell distances between MD sequences
MDdist2 <- seqdist(MDseq, method="OMspell",
    sm = MDcost, indel=attr(MDcost,"indel"))