```
## S3 method for class 'PSTf':
predict(object, data, cdata, group, L=NULL, p1=NULL, output="prob", decomp=FALSE, base=2)
```

object

data

a sequence object, i.e., an object of class `'stslist'` as created by the TraMineR `seqdef` function, containing the sequences to predict.

cdata

not implemented yet.

group

if `object` is a segmented PST, a vector of group membership, so that each sequence probability is predicted with the conditional probability distributions of the group it belongs to. If `object` is a segmented PST and

L

integer. Maximal context length for sequence prediction. This is the same as pruning the PST by removing all nodes of depth

p1

vector. A probability distribution for the first position in the sequence that will be used instead of the root node of the tree.

output

character. One of `'prob'`, `'logloss'`, `'SIMn'` or `'SIMo'`. See details.

decomp

logical. If `TRUE`, the predicted probability for each state in the sequence(s) is returned instead of the whole sequence probability.

base

integer. Base of the logarithm, if the selected prediction measure uses one.

- Either a vector of sequence probabilities (if `decomp=FALSE`) or a matrix (if `decomp=TRUE`) containing, for each sequence (row), the probability of each state in columns.

The probability of the sequence `a-b-a-a-b` given a PST `S1` fitted to the example sequence `s1` (see example) is
$$P^{S1}(abaab)= P^{S1}(a) \times P^{S1}(b|a) \times P^{S1}(a|ab) \times P^{S1}(a|aba) \times P^{S1}(b|abaa)$$
The probability of each state is retrieved from the PST. To get, for example, `P(a|a-b-a)`, the tree is scanned for the node labelled with the string `a-b-a`; if this node does not exist, it is scanned for the node labelled with the longest suffix of this string, that is `b-a`, and so on. Here, the node `a-b-a` is not found in the tree (it was removed during the pruning stage), and the longest suffix of `a-b-a` found is `b-a`. The probability `P(a|b-a)` is then used instead of `P(a|a-b-a)`.
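The suffix lookup described above can be sketched in base R. This is only an illustration, not the package's internal implementation: `toy_pst`, `longest_suffix` and `symbol_probs` are hypothetical names, the tree is represented as a plain named list, the root is labelled `"e"` by assumption, and all probabilities are made up.

```r
# Toy stand-in for a pruned PST: context labels mapped to next-symbol
# distributions. "e" denotes the root (empty context). The node "a-b-a"
# is deliberately absent, as if it had been pruned.
toy_pst <- list(
  "e"   = c(a = 0.6, b = 0.4),
  "a"   = c(a = 0.3, b = 0.7),
  "b"   = c(a = 0.8, b = 0.2),
  "a-b" = c(a = 0.9, b = 0.1),
  "b-a" = c(a = 0.5, b = 0.5)
)

# Label of the longest suffix of `context` that exists as a node in `pst`.
longest_suffix <- function(pst, context) {
  while (length(context) > 0) {
    label <- paste(context, collapse = "-")
    if (!is.null(pst[[label]])) return(label)
    context <- context[-1]  # drop the oldest symbol and retry
  }
  "e"  # fall back to the root
}

# Probability of each symbol given the longest available context.
symbol_probs <- function(pst, x) {
  vapply(seq_along(x), function(i) {
    node <- longest_suffix(pst, x[seq_len(i - 1)])
    unname(pst[[node]][x[i]])
  }, numeric(1))
}

x <- c("a", "b", "a", "a", "b")
longest_suffix(toy_pst, c("a", "b", "a"))  # "b-a", since "a-b-a" is absent
p <- symbol_probs(toy_pst, x)
prod(p)  # probability of the whole sequence under the toy tree
```

The vector `p` plays the role of one row of the matrix returned with `decomp=TRUE`, and its product plays the role of the whole-sequence probability.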
The sequence likelihood is returned by the `predict` function. By setting `decomp=TRUE`, the output is a matrix containing the probability of each symbol in the sequence. The score $P^S(x)$ of a sequence $x$ represents the probability that the VLMC model stored by the PST $S$ generates $x$. It can be turned into a more readable prediction quality measure such as the *average log-loss*
$$logloss(S,x)=-\frac{1}{\ell} \sum_{i=1}^{\ell} \log_{2} P^{S}(x_{i}| x_{1}, \ldots, x_{i-1})=-\frac{1}{\ell} \log_{2} P^{S}(x)$$
where $\ell$ is the length of $x$. This measure is obtained by setting `output="logloss"`.
The returned value is the average log-loss of each state in the sequence, which makes it possible to compare predictions for sequences of unequal lengths. The average log-loss can be interpreted as a residual, that is, the distance between the prediction of a sequence by a PST $S$ and the perfect prediction $P(x)=1$, which yields $logloss(S,x)=0$. The lower the value of $logloss(S,x)$, the better the sequence is predicted.
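As a small numerical check of the log-loss formula, the sketch below computes the average log-loss from a vector of per-state probabilities in base R. The helper `avg_logloss` and the probabilities are hypothetical and only illustrate the arithmetic; they are not part of the package.

```r
# Average log-loss from per-state conditional probabilities.
# `p` stands for the probabilities of the successive states of one sequence.
avg_logloss <- function(p, base = 2) {
  -mean(log(p, base = base))
}

p <- c(0.6, 0.7, 0.9, 0.5, 0.7)  # made-up probabilities

# Both forms of the formula: mean of the per-state logs, and the log of
# the whole-sequence probability divided by the sequence length.
ll_sum  <- avg_logloss(p)
ll_prod <- -log(prod(p), base = 2) / length(p)
all.equal(ll_sum, ll_prod)

# A perfectly predicted sequence has an average log-loss of 0.
avg_logloss(c(1, 1, 1))
```

Dividing by the sequence length is what makes values comparable across sequences of unequal lengths.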

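    # Load the example sequence and convert it to a state sequence object
    data(s1)
    s1 <- seqdef(s1)

    # Fit a PST with maximal depth 3, then prune it
    S1 <- pstree(s1, L=3, nmin=2, ymin=0.001)
    S1 <- prune(S1, gain="G1", C=1.20, delete=FALSE)

    # Probability of each state, then probability of the whole sequence
    predict(S1, s1, decomp=TRUE)
    predict(S1, s1)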