cprob: Empirical conditional probability distributions of order `L`

Description

Compute the empirical conditional probability distributions of order L from a set of sequences.

Usage

# S4 method for stslist
cprob(object, L, cdata=NULL, context, stationary=TRUE, nmin=1, prob=TRUE, 
weighted=TRUE, with.missing=FALSE, to.list=FALSE)

Value

If stationary=TRUE, a matrix with one row for each subsequence of length $L$ that appears in object with a frequency of at least $nmin$. If stationary=FALSE, a list where each element corresponds to one subsequence and contains a matrix with the probability distribution at each position $p$ where a state is preceded by the subsequence.

Arguments

object: a sequence object, that is, an object of class stslist as created by the TraMineR seqdef function.
L: integer. Context length.
cdata: under development
context: character. An optional subsequence (a character string where symbols are separated by '-') for which the conditional probability distribution is to be computed.
stationary: logical. If FALSE, probability distributions are computed for each sequence position L+1 ... l, where l is the maximum sequence length. If TRUE, the probability distributions are stationary, that is, time-homogeneous.
nmin: integer. Minimal frequency of a context. See details.
prob: logical. If TRUE the probability distributions are returned. If FALSE the function returns the empirical counts on which the probability distributions are computed.
weighted: logical. If TRUE case weights attached to the sequence object are used in the computation of the probabilities.
with.missing: logical. If FALSE, only contexts containing no missing state are considered.
to.list: logical. If TRUE and stationary=TRUE, a list instead of a matrix is returned. See Value.

Author

Alexis Gabadinho

Details

The empirical conditional probability $\hat{P}(\sigma | c)$ of observing a symbol $\sigma \in A$ after the subsequence $c=c_{1}, \ldots, c_{k}$ of length $k=L$ is computed as $$ \hat{P}(\sigma | c) = \frac{N(c\sigma)}{\sum_{\alpha \in A} N(c\alpha)} $$ where $$ N(c)=\sum_{i=1}^{\ell} 1 \left[x_{i}, \ldots, x_{i+|c|-1}=c \right], \; x=x_{1}, \ldots, x_{\ell}, \; c=c_{1}, \ldots, c_{k} $$ is the number of occurrences of the subsequence $c$ in the sequence $x$ and $c\sigma$ is the concatenation of the subsequence $c$ and the symbol $\sigma$.
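
As a hand-worked illustration (the toy sequence is an assumption chosen for this example, not package data), take the single sequence $x = a, b, a, a, b, a, b$ with $\ell = 7$, alphabet $A = \{a, b\}$ and context $c = a$, so $L = 1$. The windows of length 2 are $ab, ba, aa, ab, ba, ab$, hence $N(aa) = 1$ and $N(ab) = 3$, which gives $$ \hat{P}(a | a) = \frac{1}{1+3} = 0.25, \qquad \hat{P}(b | a) = \frac{3}{1+3} = 0.75 $$ Only windows that fit entirely within the sequence can match, so a window starting at the last position contributes no count when $L = 1$.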

For a possibly weighted sample of $m$ sequences with weights $w^{j}, \; j=1, \ldots, m$, the function $N(c)$ is replaced by $$ N(c)=\sum_{j=1}^{m} w^{j} \sum_{i=1}^{\ell} 1 \left[x_{i}^{j}, \ldots, x_{i+|c|-1}^{j}=c \right], \; c=c_{1}, \ldots, c_{k} $$ where $x^{j}=x_{1}^{j}, \ldots, x_{\ell}^{j}$ is the $j$th sequence in the sample. For more details, see Gabadinho & Ritschard (2016).
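
The weighted counting rule can be sketched in a few lines of base R. This is only an illustration of the formula under stated assumptions, not the code used inside cprob: the helper names count_pattern and cond_prob, the two toy sequences and their weights are all hypothetical.

```r
# Toy data (hypothetical): two sequences over the alphabet {a, b} with case weights
seqs <- list(c("a", "b", "a", "a", "b"),
             c("b", "a", "b", "b", "a"))
w <- c(2, 1)
alphabet <- c("a", "b")

# Weighted number of occurrences N(c) of the pattern `pat` over all sequences
count_pattern <- function(pat, seqs, w) {
  k <- length(pat)
  total <- 0
  for (j in seq_along(seqs)) {
    x <- seqs[[j]]
    n_start <- length(x) - k + 1      # number of windows of length k in x
    if (n_start < 1) next             # sequence shorter than the pattern
    for (i in seq_len(n_start)) {
      if (all(x[i:(i + k - 1)] == pat)) total <- total + w[j]
    }
  }
  total
}

# Empirical conditional distribution P(sigma | context) over the alphabet
cond_prob <- function(context, seqs, w, alphabet) {
  counts <- sapply(alphabet,
                   function(s) count_pattern(c(context, s), seqs, w))
  counts / sum(counts)
}

# Context "a" (L = 1): N(aa) = 2 and N(ab) = 2*2 + 1 = 5, so P = (2/7, 5/7)
p <- cond_prob("a", seqs, w, alphabet)
print(p)
```

Passing character(0) as the context gives the order 0 distribution, and setting all weights to 1 reproduces the unweighted formula.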

References

Gabadinho, A. & Ritschard, G. (2016). Analyzing State Sequences with Probabilistic Suffix Trees: The PST R Package. Journal of Statistical Software, 72(3), pp. 1-39.

Examples

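## Load the packages used below: PST (cprob and the s1 and SRH data sets),
## TraMineR (seqdef) and RColorBrewer (brewer.pal)
library(TraMineR)
library(RColorBrewer)
library(PST)

## Example with the single sequence s1
data(s1)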
s1 <- seqdef(s1)
cprob(s1, L=0, prob=FALSE)
cprob(s1, L=1, prob=TRUE)

## Preparing a sequence object with the SRH data set
data(SRH)
state.list <- levels(SRH$p99c01)
## sequential color palette
mycol5 <- rev(brewer.pal(5, "RdYlGn"))
SRH.seq <- seqdef(SRH, 5:15, alphabet=state.list, states=c("G1", "G2", "M", "B2", "B1"), 
	labels=state.list, weights=SRH$wp09lp1s, right=NA, cpal=mycol5)
names(SRH.seq) <- 1999:2009

## Example 1: 0th order: weighted and unweighted counts
cprob(SRH.seq, L=0, prob=FALSE, weighted=FALSE)
cprob(SRH.seq, L=0, prob=FALSE, weighted=TRUE)

## Example 2: 2nd order: weighted and unweighted probability distributions
cprob(SRH.seq, L=2, prob=TRUE, weighted=FALSE)
cprob(SRH.seq, L=2, prob=TRUE, weighted=TRUE)
