```
## S3 method for class 'PSTf':
prune(object, nmin, L, gain, C, keep, drop, state, delete = TRUE)
```

object

nmin

integer. All strings having counts less than nmin are removed.

L

integer. If specified, the tree is cut at depth L, that is, all nodes with depth > L are removed.

gain

character. Function for measuring information gain. See `details`.

C

numeric. Cutoff value to use with the gain function.

keep

character. A vector of character strings containing the names of the nodes to keep in the tree. All nodes that are not a suffix of contexts in keep are removed from the tree.

drop

character. A vector of character strings containing the names of the nodes to remove from the tree. All nodes that are a suffix of contexts in drop are removed from the tree as well.

state

character. All nodes corresponding to contexts which include `state` are pruned.

delete

logical. If FALSE, the pruned nodes are not removed from the tree but tagged as pruned=FALSE, so that when plotting the pruned tree these nodes will appear surrounded with red lines (the color can be changed).

A probabilistic suffix tree, i.e., an object of class `PSTf`.

The `pstree` function may yield an overly complex model containing all contexts of maximal length $L$ and frequency $N(c) \geq nmin$ found in the learning sample. The pruning stage potentially reduces the number of nodes in the tree, and thus the model complexity. It compares the conditional probabilities associated with a node labelled by a subsequence $c=c_{1},c_{2}, \ldots, c_{k}$ to the conditional probabilities of its parent node, labelled by the longest suffix of $c$, $suf(c)=c_{2}, \ldots, c_{k}$. The general idea is to remove a node if it does not contribute additional information with respect to its parent in predicting the next symbol, that is, if $\hat{P}(\sigma | c)$ is not significantly different from $\hat{P}(\sigma | suf(c))$ for all $\sigma \in A$.

The pruning procedure starts from the terminal nodes and is applied recursively until all terminal nodes remaining in the tree represent an information gain relative to their parent.

A gain function, whose outcome determines the pruning decision, is used to compare the two probability distributions. The gain function is driven by a cut-off, and different values of this parameter yield more or less complex trees. A method for selecting the pruning cut-off is described in the `tune` help page.

A first implemented gain function, which is used by the *Learn-PSA* algorithm, is based on the ratio between $\hat{P}(\sigma|c)$ and $\hat{P}(\sigma|suf(c))$ for each $\sigma \in A$. A node represents an information gain if for any symbol $\sigma \in A$ the ratio is greater than the cut-off $C$ or lower than $1/C$, that is if
$$G_{1}(c)=\sum_{\sigma \in A} 1 \left[ \frac{\hat{P}(\sigma |c)}{\hat{P}(\sigma | suf(c))} \geq C \; \cup \;
\frac{\hat{P}(\sigma |c)}{\hat{P}(\sigma | suf(c))} \leq \frac{1}{C} \right] \geq 1$$
where $C$ is a user-defined cut-off value. Nodes that do not satisfy the above condition are pruned. For $C=1$ no node is removed, since even a node whose next-symbol probability distribution is identical to that of its parent has all ratios equal to 1 and therefore satisfies the gain condition.
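The $G_{1}$ criterion can be illustrated with a few lines of base R. This is only a sketch of the formula above, not the code used inside the package: the helper name `g1_gain` and the toy probability vectors are hypothetical, and strictly positive parent probabilities are assumed so that the ratios are well defined.

```r
# Hypothetical helper, not part of the PST package.
# Returns TRUE when the node represents an information gain under G1
# (the node is kept) and FALSE when it would be pruned.
g1_gain <- function(p_node, p_parent, C) {
  stopifnot(length(p_node) == length(p_parent), all(p_parent > 0), C >= 1)
  ratio <- p_node / p_parent
  # Count the symbols whose ratio is >= C or <= 1/C; one is enough.
  sum(ratio >= C | ratio <= 1 / C) >= 1
}

# Toy next-symbol distributions over the alphabet {a, b} (assumed values).
p_parent <- c(a = 0.50, b = 0.50)

g1_gain(c(a = 0.55, b = 0.45), p_parent, C = 1.20)  # FALSE: ratios 1.1 and 0.9
g1_gain(c(a = 0.70, b = 0.30), p_parent, C = 1.20)  # TRUE: ratios 1.4 and 0.6
g1_gain(p_parent, p_parent, C = 1)                  # TRUE: with C = 1 nothing is pruned
```

The last call reproduces the remark about $C=1$: a ratio of exactly 1 already meets the condition, so no node is pruned.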

The *context* algorithm uses another gain function, namely
$$G_{2}(c)=\sum_{\sigma \in A} \hat{P}(\sigma|c)\log \left( \frac{\hat{P}(\sigma|c)}{\hat{P}(\sigma|suf(c))} \right) N(c) > C$$
where $c$ is the context labelling the terminal node and $N(c)$ is the number of occurrences of $c$ in the data. The cutoff $C$ is specified on the scale of $\chi^{2}$-quantiles (Maechler & Buehlmann, 2004):
$$C=C(\alpha)=\frac{1}{2}\,\mathrm{qchisq}(1-\alpha,v), \quad v=|A|-1$$
where $\mathrm{qchisq}(p=1-\alpha,v)$ is the quantile function of a $\chi^{2}$ distribution with $v$ degrees of freedom. The cutoff $C$ is a threshold for the difference of deviances between a tree $S^{1}$ and its subtree $S^{2}$ obtained by pruning the terminal node $c$. Typical values for $\alpha$ are $5\%$ and $1\%$, yielding $p=0.95$ and $p=0.99$ respectively.

Bejerano, G. & Yona, G. Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. *Bioinformatics*, 2001, 17, pp. 23-43.

Maechler, M. & Buehlmann, P. Variable Length Markov Chains: Methodology, Computing, and Software. *Journal of Computational and Graphical Statistics*, 2004, 13, pp. 435-455.

`tune`, `ppplot`

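library(PST)
data(s1)
s1.seq <- seqdef(s1)
S1 <- pstree(s1.seq, L=3, nmin=2, ymin=0.001)
## -- Pruning with the G1 gain function; pruned nodes are tagged, not deleted
S1.p1 <- prune(S1, gain="G1", C=1.20, delete=FALSE)
summary(S1.p1)
plot(S1.p1)
## -- Pruning with the G2 gain function; cutoff from the 95% chi-squared quantile, 1 degree of freedom
C95 <- qchisq(0.95,1)/2
S1.p2 <- prune(S1, gain="G2", C=C95, delete=FALSE)
plot(S1.p2)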