## S3 method for class 'PSTf':
predict(object, data, cdata, group, L=NULL, p1=NULL, output="prob", decomp=FALSE, base=2)
data: a sequence object of class 'stslist', as created by the TraMineR seqdef function, containing the sequences to predict.
group: if object is a segmented PST, a vector of group membership, so that each sequence probability is predicted with the conditional probability distributions of the group it belongs to. If object is a segmented PST and
output: one of 'prob', 'logloss', 'SIMn' or 'SIMo'. See details.
decomp: if TRUE, the predicted probability for each state in the sequence(s) is returned instead of the whole sequence probability.
The probability of the sequence a-b-a-a-b given a PST S1 fitted to the example sequence s1 (see example) is
$$P^{S1}(abaab)= P^{S1}(a) \times P^{S1}(b|a) \times P^{S1}(a|ab) \times P^{S1}(a|aba) \times P^{S1}(b|abaa)$$
The probability of each state is retrieved from the PST. To get, for example, P(a|a-b-a), the tree is scanned for the node labelled with the string a-b-a. If this node does not exist, the tree is scanned for the node labelled with the longest suffix of this string, that is b-a, and so on. In this example the node a-b-a is not found in the tree (it was removed during the pruning stage), and the longest suffix of a-b-a found is b-a. The probability P(a|b-a) is then used instead of P(a|a-b-a).
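The suffix lookup described above can be sketched in base R. This is a minimal illustration and not the PST package's internal implementation: the list `toy_tree`, the helpers `lookup_context` and `predict_states`, and all probabilities are hypothetical, chosen only so that the node a-b-a is absent while b-a is present.

```r
# Toy context tree (hypothetical values). Names are contexts written
# oldest state first; "e" stands for the empty context (root node).
toy_tree <- list(
  "e"   = c(a = 0.6, b = 0.4),
  "a"   = c(a = 0.5, b = 0.5),
  "b"   = c(a = 0.7, b = 0.3),
  "b-a" = c(a = 0.2, b = 0.8)
)

# Return the node labelled with the longest suffix of `context`
# that exists in the tree, falling back to the root.
lookup_context <- function(tree, context) {
  while (length(context) > 0) {
    key <- paste(context, collapse = "-")
    if (!is.null(tree[[key]])) {
      return(list(node = key, prob = tree[[key]]))
    }
    context <- context[-1]  # drop the oldest state, keep the suffix
  }
  list(node = "e", prob = tree[["e"]])
}

# Probability of each state of `x` given at most L preceding states.
predict_states <- function(tree, x, L = 3) {
  vapply(seq_along(x), function(i) {
    context <- if (i > 1) x[max(1, i - L):(i - 1)] else character(0)
    unname(lookup_context(tree, context)$prob[x[i]])
  }, numeric(1))
}

x <- c("a", "b", "a", "a", "b")
lookup_context(toy_tree, c("a", "b", "a"))$node  # "b-a"
p <- predict_states(toy_tree, x)                 # one probability per state
prod(p)                                          # whole-sequence probability
```

With these made-up values the per-state probabilities are 0.6, 0.5, 0.7, 0.2 and 0.5, so the product is 0.021. The vector `p` plays the role of the per-state output obtained with decomp=TRUE, and `prod(p)` the role of the whole-sequence probability.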
The sequence likelihood is returned by the predict function. With decomp=TRUE, the output is instead a matrix containing the probability of each of the symbols composing the sequence. The score $P^S(x)$ of a sequence $x$ represents the probability that the VLMC model stored by the PST $S$ generates $x$. It can be turned into a more readable prediction quality measure such as the average log-loss
$$logloss(S,x)=-\frac{1}{\ell} \sum_{i=1}^{\ell} \log_{2} P^{S}(x_{i}| x_{1}, \ldots, x_{i-1})=-\frac{1}{\ell} \log_{2} P^{S}(x)$$
by using output="logloss".
The returned value is the average log-loss of each state in the sequence, which makes it possible to compare the predictions for sequences of unequal lengths. The average log-loss can be interpreted as a residual, that is, the distance between the prediction of a sequence by a PST $S$ and the perfect prediction $P^{S}(x)=1$, which yields $logloss(S,x)=0$. The lower the value of $logloss(S,x)$, the better the sequence is predicted.
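As a rough illustration of the average log-loss formula, the following base R sketch computes it from a vector of per-state probabilities. The helper `avg_logloss` and the probabilities are hypothetical and are not part of the PST package; its `base` argument is assumed here to be the base of the logarithm.

```r
# Average log-loss from per-state probabilities (hypothetical helper).
avg_logloss <- function(p, base = 2) {
  -mean(log(p, base = base))
}

p <- c(0.5, 0.25, 0.5, 0.25)       # made-up per-state probabilities
avg_logloss(p)                     # 1.5
-log2(prod(p)) / length(p)         # same value, from the sequence probability
avg_logloss(c(1, 1, 1))            # 0: perfect prediction
```

Both forms of the formula give the same value, and dividing by the sequence length is what makes the measure comparable across sequences of different lengths.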
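library(PST)

## Load the example sequence and turn it into a state sequence object
data(s1)
s1 <- seqdef(s1)

## Fit a PST of maximal depth 3, then prune it
S1 <- pstree(s1, L=3, nmin=2, ymin=0.001)
S1 <- prune(S1, gain="G1", C=1.20, delete=FALSE)

## Probability of each state in the sequence, then of the whole sequence
predict(S1, s1, decomp=TRUE)
predict(S1, s1)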