PST (version 0.94.1)

cprob: Empirical conditional probability distributions of order L

Description

Compute the empirical conditional probability distributions of order L from a set of sequences.

Usage

# S4 method for stslist
cprob(object, L, cdata=NULL, context, stationary=TRUE, nmin=1, prob=TRUE, 
weighted=TRUE, with.missing=FALSE, to.list=FALSE)

Value

If stationary=TRUE, a matrix with one row for each subsequence of length \(L\) that appears in object with a frequency of at least \(nmin\). If stationary=FALSE, a list where each element corresponds to one subsequence and contains a matrix with the probability distribution at each position \(p\) where a state is preceded by the subsequence.

Arguments

object

a sequence object, that is, an object of class stslist as created by the TraMineR seqdef function.

L

integer. Context length.

cdata

under development

context

character. An optional subsequence (a character string where symbols are separated by '-') for which the conditional probability distribution is to be computed.

stationary

logical. If FALSE, probability distributions are computed for each sequence position L+1 ... l, where l is the maximum sequence length. If TRUE, the probability distributions are stationary, that is, time-homogeneous.

nmin

integer. Minimal frequency of a context. See details.

prob

logical. If TRUE the probability distributions are returned. If FALSE the function returns the empirical counts on which the probability distributions are computed.

weighted

logical. If TRUE case weights attached to the sequence object are used in the computation of the probabilities.

with.missing

logical. If FALSE, only contexts containing no missing state are considered.

to.list

logical. If TRUE and stationary=TRUE, a list instead of a matrix is returned. See value.

Author

Alexis Gabadinho

Details

The empirical conditional probability \(\hat{P}(\sigma | c)\) of observing a symbol \(\sigma \in A\) after the subsequence \(c=c_{1}, \ldots, c_{k}\) of length \(k=L\) is computed as $$ \hat{P}(\sigma | c) = \frac{N(c\sigma)}{\sum_{\alpha \in A} N(c\alpha)} $$ where $$ N(c)=\sum_{i=1}^{\ell} 1 \left[x_{i}, \ldots, x_{i+|c|-1}=c \right], \; x=x_{1}, \ldots, x_{\ell}, \; c=c_{1}, \ldots, c_{k} $$ is the number of occurrences of the subsequence \(c\) in the sequence \(x\) and \(c\sigma\) is the concatenation of the subsequence \(c\) and the symbol \(\sigma\).
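
As a small worked illustration (the sequence is made up for this purpose and is not one of the package data sets), take the single sequence \(x = a, b, a, a, b\) over the alphabet \(A = \{a, b\}\) and the context \(c = a\), so that \(L = 1\). The subsequence \(aa\) occurs once and \(ab\) occurs twice, that is \(N(aa) = 1\) and \(N(ab) = 2\), which gives $$ \hat{P}(a | a) = \frac{N(aa)}{N(aa) + N(ab)} = \frac{1}{3}, \qquad \hat{P}(b | a) = \frac{N(ab)}{N(aa) + N(ab)} = \frac{2}{3} $$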

Considering a (possibly weighted) sample of \(m\) sequences having weights \(w^{j}, \; j=1 \ldots m\), the function \(N(c)\) is replaced by $$ N(c)=\sum_{j=1}^{m} w^{j} \sum_{i=1}^{\ell} 1 \left[x_{i}^{j}, \ldots, x_{i+|c|-1}^{j}=c \right], \; c=c_{1}, \ldots, c_{k} $$ where \(x^{j}=x_{1}^{j}, \ldots, x_{\ell}^{j}\) is the \(j\)th sequence in the sample. For more details, see Gabadinho & Ritschard (2016).
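
The weighted counting described above can be sketched in base R. This is only an illustration of the formula, not the package's implementation: the helper cond_counts and the toy sequences are hypothetical, sequences are assumed to be plain character vectors without missing states, and the nmin, with.missing and stationary=FALSE options of cprob are not reproduced.

```r
# Hypothetical helper (not part of PST): weighted counts N(c sigma) for every
# context c of length L, pooled over a list of character-vector sequences.
cond_counts <- function(seqs, weights = rep(1, length(seqs)), L = 1) {
  stopifnot(L >= 1, length(weights) == length(seqs))
  alphabet <- sort(unique(unlist(seqs)))
  rows <- list()  # one named count vector per observed context
  for (j in seq_along(seqs)) {
    x <- seqs[[j]]
    if (length(x) <= L) next  # too short to hold a context plus a next symbol
    for (i in seq_len(length(x) - L)) {
      ctx <- paste(x[i:(i + L - 1)], collapse = "-")
      if (is.null(rows[[ctx]])) {
        rows[[ctx]] <- setNames(numeric(length(alphabet)), alphabet)
      }
      nxt <- x[i + L]  # symbol observed right after the context
      rows[[ctx]][nxt] <- rows[[ctx]][nxt] + weights[j]
    }
  }
  do.call(rbind, rows)  # matrix: contexts in rows, next symbols in columns
}

# Toy data (made up): two sequences, the second one with weight 2
seqs <- list(c("a", "b", "a", "a", "b"), c("b", "b", "a"))
counts <- cond_counts(seqs, weights = c(1, 2), L = 1)

# Dividing each row by its total gives the conditional distributions
probs <- counts / rowSums(counts)
print(counts)
print(probs)
```

Each row of probs sums to one, mirroring the normalisation by \(\sum_{\alpha \in A} N(c\alpha)\) in the formula.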

References

Gabadinho, A. & Ritschard, G. (2016). Analyzing State Sequences with Probabilistic Suffix Trees: The PST R Package. Journal of Statistical Software, 72(3), pp. 1-39.

Examples

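## Load the packages used below: TraMineR provides seqdef(),
## RColorBrewer provides brewer.pal()
library(TraMineR)
library(PST)
library(RColorBrewer)

## Example with the single sequence s1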
data(s1)
s1 <- seqdef(s1)
cprob(s1, L=0, prob=FALSE)
cprob(s1, L=1, prob=TRUE)

## Preparing a sequence object with the SRH data set
data(SRH)
state.list <- levels(SRH$p99c01)
## sequential color palette
mycol5 <- rev(brewer.pal(5, "RdYlGn"))
SRH.seq <- seqdef(SRH, 5:15, alphabet=state.list, states=c("G1", "G2", "M", "B2", "B1"), 
	labels=state.list, weights=SRH$wp09lp1s, right=NA, cpal=mycol5)
names(SRH.seq) <- 1999:2009

## Example 1: 0th order: weighted and unweighted counts
cprob(SRH.seq, L=0, prob=FALSE, weighted=FALSE)
cprob(SRH.seq, L=0, prob=FALSE, weighted=TRUE)

## Example 2: 2nd order: weighted and unweighted probability distributions
cprob(SRH.seq, L=2, prob=TRUE, weighted=FALSE)
cprob(SRH.seq, L=2, prob=TRUE, weighted=TRUE)
