## S3 method for class 'PSTf':
prune(object, nmin, L, gain, C, keep, drop, state, delete = TRUE)
Details

Nodes whose label contains one of the states listed in the state argument are pruned from the PSTf object.
The pstree function may yield an overly complex model containing all contexts of maximal length $L$ and frequency $N(c) \geq nmin$ found in the learning sample. The pruning stage potentially reduces the number of nodes in the tree, and thus the model complexity. It compares the conditional probabilities associated with a node labelled by a subsequence $c=c_{1},c_{2},\ldots,c_{k}$ to the conditional probabilities of its parent node, labelled by the longest suffix of $c$, $suf(c)=c_{2},\ldots,c_{k}$. The general idea is to remove a node if it does not contribute additional information with respect to its parent in predicting the next symbol, that is, if $\hat{P}(\sigma | c)$ is not significantly different from $\hat{P}(\sigma | suf(c))$ for all $\sigma \in A$. The pruning procedure starts from the terminal nodes and is applied recursively until all terminal nodes remaining in the tree represent an information gain relative to their parent.

A gain function, whose outcome determines the pruning decision, is used to compare the two probability distributions. The gain function is driven by a cut-off, and different values of this parameter yield more or less complex trees. A method for selecting the pruning cut-off is described in the tune help page.
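To make the bottom-up recursion concrete, here is a minimal sketch on a toy tree stored as a nested list; the function prune_sketch, the gain_fn argument and the fields prob, count and children are illustrative assumptions, not the internal representation used by PST.

prune_sketch <- function(node, gain_fn) {
  ## prune the subtrees first: an internal node whose children are all
  ## removed becomes terminal and is then tested by its own parent
  node$children <- lapply(node$children, prune_sketch, gain_fn = gain_fn)
  ## a terminal child is kept only if it represents an information gain
  ## relative to this node, as judged by the gain function
  keep <- vapply(node$children, function(child) {
    length(child$children) > 0 || gain_fn(child$prob, node$prob, child$count)
  }, logical(1))
  node$children <- node$children[keep]
  node
}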
A first implemented gain function, which is used by the Learn-PSA algorithm, is based on the ratio between $\hat{P}(\sigma|c)$ and $\hat{P}(\sigma|suf(c))$ for each $\sigma \in A$. A node represents an information gain if for at least one symbol $\sigma \in A$ the ratio is greater than the cut-off $C$ or lower than $1/C$, that is if $$G_{1}(c)=\sum_{\sigma \in A} \mathbf{1}\left[ \frac{\hat{P}(\sigma |c)}{\hat{P}(\sigma | suf(c))} \geq C \;\vee\; \frac{\hat{P}(\sigma |c)}{\hat{P}(\sigma | suf(c))} \leq \frac{1}{C} \right] \geq 1$$ where $C$ is a user defined cut-off value. Nodes that do not satisfy the above condition are pruned. For $C=1$ no node is removed, since each ratio is always either $\geq 1$ or $\leq 1$, so that even a node whose next-symbol distribution is identical to that of its parent satisfies the condition.
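As an illustration, a minimal sketch of the $G_{1}$ decision for a single node, assuming the next-symbol distributions are given as plain probability vectors; gain_G1 is a hypothetical helper, not a function of the PST package.

gain_G1 <- function(p_node, p_parent, C) {
  ratio <- p_node / p_parent
  ## the node is kept (information gain) if at least one ratio falls
  ## outside the band [1/C, C]; otherwise it is pruned
  sum(ratio >= C | ratio <= 1 / C) >= 1
}

## with C=1.20 this node differs enough from its parent to be kept
gain_G1(c(0.70, 0.30), c(0.50, 0.50), C = 1.20)  # TRUE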
The context algorithm uses another gain function, namely $$G_{2}(c)= N(c) \sum_{\sigma \in A} \hat{P}(\sigma|c)\log \left( \frac{\hat{P}(\sigma|c)}{\hat{P}(\sigma|suf(c))} \right)$$ where $c$ is the context labelling the terminal node and $N(c)$ is the number of occurrences of $c$ in the data; the node is kept if $G_{2}(c) > C$. The cut-off $C$ is specified on the scale of $\chi^{2}$-quantiles (Maechler & Buehlmann, 2004) $$C=C(\alpha)=\frac{1}{2}\,qchisq(1-\alpha,v), \quad v=|A|-1$$ where $qchisq(p=1-\alpha,v)$ is the quantile function of a $\chi^{2}$ distribution with $v$ degrees of freedom. The cut-off $C$ is a threshold for the difference of deviances between a tree $S^{1}$ and its subtree $S^{2}$ obtained by pruning the terminal node $c$. Typical values for $\alpha$ are $5\%$ and $1\%$, yielding $p=0.95$ and $p=0.99$ respectively.
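Under the same assumptions (probability vectors without zero entries, for instance after ymin smoothing; gain_G2 is again a hypothetical helper, not a PST function), the $G_{2}$ criterion can be sketched as:

gain_G2 <- function(p_node, p_parent, n_c, C) {
  ## N(c) times the Kullback-Leibler divergence between the next-symbol
  ## distributions of the node and of its parent
  n_c * sum(p_node * log(p_node / p_parent)) > C
}

## cut-off C(alpha) for alpha=0.05 and an alphabet of size |A|=2
C05 <- qchisq(1 - 0.05, df = 2 - 1) / 2
gain_G2(c(0.70, 0.30), c(0.50, 0.50), n_c = 20, C = C05)  # FALSE: pruned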
References

Bejerano, G. & Yona, G. (2001). Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics, 17, 23-43.

Maechler, M. & Buehlmann, P. (2004). Variable Length Markov Chains: Methodology, Computing, and Software. Journal of Computational and Graphical Statistics, 13, 435-455.
See Also

tune, ppplot

Examples
library(TraMineR)  # provides seqdef()
library(PST)       # provides pstree() and prune()
data(s1)
s1.seq <- seqdef(s1)
S1 <- pstree(s1.seq, L=3, nmin=2, ymin=0.001)
## prune with gain function G1 and cut-off C=1.20
S1.p1 <- prune(S1, gain="G1", C=1.20, delete=FALSE)
summary(S1.p1)
plot(S1.p1)
## prune with gain function G2; cut-off on the chi-squared scale,
## C(alpha) = qchisq(1-alpha, |A|-1)/2 with alpha=0.05 and |A|=2
C95 <- qchisq(0.95, 1)/2
S1.p2 <- prune(S1, gain="G2", C=C95, delete=FALSE)
plot(S1.p2)