TraMineR (version 2.2-4)

seqdistmc: Multichannel distances between sequences

Description

Compute multichannel pairwise optimal matching (OM) distances between sequences by deriving the substitution costs from the costs of the single channels. Works with OM and the following variants: the distance based on the longest common subsequence (LCS), the Hamming distance (HAM), and the Dynamic Hamming distance (DHD).

Usage

seqdistmc(channels, method, norm="none", indel="auto", sm=NULL,
     with.missing=FALSE, full.matrix=TRUE, link="sum", cval=2,
     miss.cost=2, cweight=NULL, what="diss", ch.sep="@@@@TraMineRSep@@@@")

Value

When what="diss", a matrix of pairwise distances between multichannel sequences.

When what="cost", the matrix of AT-substitution costs with three attributes: indel (the AT-indel cost(s)), alphabet (the alphabet of the combined state sequences), and cweight (the channel weights used).

When what="seqmc", the combined state sequence object.

Arguments

channels

A list of state sequence objects defined with the seqdef function, each state sequence object corresponding to a "channel".

method

Character string indicating the metric to be used. One of "OM" (Optimal Matching), "LCS" (Longest Common Subsequence), "HAM" (Hamming distance), "DHD" (Dynamic Hamming distance).

norm

String. Default: "none". The normalization method to use. See seqdist.

indel

Double, vector of doubles, or list with an insertion/deletion cost or a vector of state-dependent indel costs for each channel. Can also be "auto" (default), in which case the indel cost of each channel is automatically set in accordance with the sm value of the channel. See the indel argument of seqdist.

sm

A list with a substitution-cost matrix for each channel or a list of method names for generating the channel substitution costs (see seqcost). Ignored when method="LCS".

with.missing

Logical or vector of logicals. Must be TRUE for channels with non-deleted gaps (missing values). See details.

full.matrix

Logical. If TRUE (default), the full distance matrix is returned. If FALSE, an object of class dist is returned.

link

Character string. One of "sum" or "mean". Method to compute the "link" between channels. Default is to sum the substitution costs.

cval

Double. Substitution cost for "CONSTANT" matrix, see seqcost.

miss.cost

Double. Cost to substitute missing values, see seqcost.

cweight

A vector of channel weights. With the default (NULL), each channel gets a weight of 1 (same weight for each channel).

what

Character string. What output should be returned? One of "diss", "cost", "seqmc". The deprecated value what="sm" is treated as what="cost".

ch.sep

Character string. Separator used for building state names of the expanded alphabet.

Author

Gilbert Ritschard and Matthias Studer

Details

The seqdistmc function first builds a state sequence by combining the channels. Then, it derives the multichannel indel and substitution costs from the indel and substitution costs of each channel by means of the additive trick (AT) proposed by Pollock (2007). Finally, it computes the multichannel distances using the AT-multichannel costs. The available metrics (see method argument) are optimal matching ("OM"), longest common subsequence ("LCS"), Hamming distance ("HAM"), and Dynamic Hamming Distance ("DHD"). For other edit distances, extract the combined state sequence object (by setting what="seqmc") and the AT-multichannel substitution and indel costs (by setting what="cost"). Then use these outcomes as input in a call to seqdist. See seqdist for more information about available distance measures.
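
The first two steps described above can be illustrated with a minimal base R sketch. It is not the internal seqdistmc code: the two toy channels (work, house), the "+" separator, the constant per-channel substitution cost of 2, and the per-channel indel cost of 1 are all assumptions made for illustration. It only shows how combined states are formed and how a weighted sum of channel costs yields the cost of substituting one combined state for another.

```r
# Two hypothetical channels observed for three individuals over four positions.
work  <- rbind(c("E", "E", "U", "E"),
               c("U", "U", "E", "E"),
               c("E", "U", "U", "U"))
house <- rbind(c("R", "R", "O", "O"),
               c("R", "O", "O", "O"),
               c("R", "R", "R", "O"))

# Step 1: combine the channels position by position into expanded states.
sep <- "+"  # assumed separator, plays the role of ch.sep
combined <- matrix(paste(work, house, sep = sep), nrow = nrow(work))

# Hypothetical per-channel substitution costs (constant cost of 2).
sm_work  <- matrix(c(0, 2, 2, 0), 2, 2,
                   dimnames = list(c("E", "U"), c("E", "U")))
sm_house <- matrix(c(0, 2, 2, 0), 2, 2,
                   dimnames = list(c("O", "R"), c("O", "R")))
cweight <- c(1, 1)  # same weight for each channel

# Step 2: the cost of substituting two combined states is the weighted
# sum of the substitution costs of their channel components.
alphabet_mc <- sort(unique(as.vector(combined)))
parts <- strsplit(alphabet_mc, sep, fixed = TRUE)
w <- vapply(parts, `[`, character(1), 1)  # work component of each state
h <- vapply(parts, `[`, character(1), 2)  # housing component of each state
at_sm <- cweight[1] * sm_work[w, w] + cweight[2] * sm_house[h, h]
dimnames(at_sm) <- list(alphabet_mc, alphabet_mc)

# Assumed per-channel indel cost of 1, combined the same additive way.
at_indel <- sum(cweight * c(1, 1))

print(at_sm)
```

In this sketch, substituting "E+O" for "U+R" changes both channels and costs 4, whereas substituting "E+O" for "E+R" changes only the housing channel and costs 2.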

Normalization may be useful when dealing with sequences that are not all of the same length. For details on the applied normalization, see seqdist.
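
As described in the Details, the combined sequences and the AT costs can be passed to seqdist to compute a distance that seqdistmc does not offer directly. The sketch below assumes that the TraMineR package is installed, that the "OMspell" method of seqdist is available in the installed version, and that the indel attribute returned with what="cost" can be passed as the indel argument of seqdist. The subset of 100 cases and the choice of two channels are arbitrary and only keep the example short.

```r
library(TraMineR)

data(biofam)
# Keep the first 100 cases so that the example runs quickly (arbitrary choice).
bf <- as.matrix(biofam[1:100, 10:25])

# Two logical channels derived from the original state coding.
child.seq <- seqdef(bf == 4 | bf == 5 | bf == 6)
marr.seq  <- seqdef(bf == 2 | bf == 3 | bf == 6)
channels  <- list(child.seq, marr.seq)
sm.spec   <- list("CONSTANT", "CONSTANT")

# Combined state sequence object and AT-multichannel costs.
mcseq <- seqdistmc(channels, method = "OM", sm = sm.spec, what = "seqmc")
mcsm  <- seqdistmc(channels, method = "OM", sm = sm.spec, what = "cost")

# Use both as input for an edit distance not covered by seqdistmc.
mcdist.spell <- seqdist(mcseq, method = "OMspell", sm = mcsm,
                        indel = attr(mcsm, "indel"))
```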

References

Pollock, Gary (2007) Holistic trajectories: a study of combined employment, housing and family careers by using multiple-sequence analysis. Journal of the Royal Statistical Society: Series A 170, Part 1, 167-183.

See Also

seqcost, seqdef, seqdist.

Examples

data(biofam)

## Building one channel per type of event left, children or married
bf <- as.matrix(biofam[, 10:25])
children <- bf == 4 | bf == 5 | bf == 6
married <- bf == 2 | bf == 3 | bf == 6
left <- bf == 1 | bf == 3 | bf == 5 | bf == 6

## Building sequence objects
child.seq <- seqdef(children)
marr.seq <- seqdef(married)
left.seq <- seqdef(left)

## Using data-driven substitution costs on each channel:
##   "INDELSLOG" for the first two channels, transition rates ("TRATE") for the third
mcdist <- seqdistmc(channels=list(child.seq, marr.seq, left.seq),
    method="OM", sm=list("INDELSLOG", "INDELSLOG", "TRATE"))

## Using a weight of 2 for the children channel and specifying
##   channel-specific substitution costs
smatrix <- list()
smatrix[[1]] <- seqsubm(child.seq, method="CONSTANT")
smatrix[[2]] <- seqsubm(marr.seq, method="CONSTANT")
smatrix[[3]] <- seqsubm(left.seq, method="TRATE")
mcdist2 <- seqdistmc(channels=list(child.seq, marr.seq, left.seq),
    method="OM", sm=smatrix, cweight=c(2,1,1))

## Retrieving the multichannel sequences
mcseq <- seqdistmc(channels=list(child.seq, marr.seq, left.seq),
    method="OM", sm=smatrix, cweight=c(2,1,1), what="seqmc", ch.sep="+")
alphabet(mcseq)

## Retrieving the AT-multichannel substitution costs
mcsm <- seqdistmc(channels=list(child.seq, marr.seq, left.seq),
    method="OM", sm=smatrix, cweight=c(2,1,1), what="cost", ch.sep="+")