prune: Prune a probabilistic suffix tree

Description

Prune a PST, using either a gain function, a maximal depth or a list of nodes to keep or remove. Optionally, nodes are not removed from the tree but tagged as deleted, helping to visualize the pruning process.

Usage

## S3 method for class 'PSTf':
prune(object, nmin, L, gain, C, keep, drop, state, delete = TRUE)

Arguments

object

a probabilistic suffix tree, i.e., an object of class "PSTf" as returned by the pstree, prune or

nmin

integer. All strings having counts less than nmin are removed.

L

integer. If specified, the tree is cut at depth L, that is, all nodes with depth > L are removed.

gain

character. Function for measuring information gain. See details.

C

numeric. Cutoff value to use with the gain function.

keep

character. A vector of character strings containing the names of the nodes to keep in the tree. All nodes that are not a suffix of contexts in keep are removed from the tree.

drop

character. A vector of character strings containing the names of the nodes to remove from the tree. All nodes that are a suffix of contexts in drop are removed from the tree as well.

state

character. All nodes corresponding to contexts which include state are pruned.

delete

logical. If FALSE, the pruned nodes are not removed from the tree but tagged as pruned=FALSE, so that when the pruned tree is plotted these nodes appear outlined in red (the color can be changed).

Value

A probabilistic suffix tree, i.e., an object of class PSTf.

Details

The initial tree returned by the pstree function may yield an overly complex model containing all contexts of maximal length $L$ and frequency $N(c) \geq nmin$ found in the learning sample. The pruning stage potentially reduces the number of nodes in the tree, and thus the model complexity. It compares the conditional probabilities associated with a node labelled by a subsequence $c=c_{1},c_{2}, \ldots, c_{k}$ to the conditional probabilities of its parent node, labelled by the longest suffix of $c$, $suf(c)=c_{2}, \ldots, c_{k}$. The general idea is to remove a node if it does not contribute additional information with respect to its parent in predicting the next symbol, that is, if $\hat{P}(\sigma | c)$ is not significantly different from $\hat{P}(\sigma | suf(c))$ for all $\sigma \in A$.

The pruning procedure starts from the terminal nodes and is applied recursively until all terminal nodes remaining in the tree represent an information gain relative to their parent. A gain function, whose outcome determines the pruning decision, is used to compare the two probability distributions. The gain function is driven by a cut-off, and different values of this parameter yield more or less complex trees. A method for selecting the pruning cut-off is described in the tune help page.
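The bottom-up procedure can be illustrated on a toy tree in base R. This is only a sketch of the idea, not the code used by prune: the node names, the dash-separated context labels, the "e" label for the root and all helper functions are assumptions made for the illustration, and the gain test is the ratio-based criterion described in the next paragraph.

```r
# Toy tree over the alphabet A = {a, b}: each context maps to its
# next-symbol distribution. "e" stands for the root (empty context).
# All names and numbers here are hypothetical.
probs <- list(
  "e"   = c(a = 0.50, b = 0.50),
  "a"   = c(a = 0.52, b = 0.48),
  "b"   = c(a = 0.20, b = 0.80),
  "a-a" = c(a = 0.53, b = 0.47),
  "b-b" = c(a = 0.05, b = 0.95)
)

# Parent = longest proper suffix: drop the first symbol of the context.
parent_of <- function(ctx) {
  parts <- strsplit(ctx, "-", fixed = TRUE)[[1]]
  if (length(parts) == 1) "e" else paste(parts[-1], collapse = "-")
}

# A node is terminal if no remaining node has it as parent.
is_terminal <- function(ctx, nodes) {
  others <- setdiff(nodes, "e")
  !any(vapply(others, function(x) parent_of(x) == ctx, logical(1)))
}

# Ratio-based gain test: at least one ratio is >= C or <= 1/C.
has_gain <- function(ctx, C) {
  ratio <- probs[[ctx]] / probs[[parent_of(ctx)]]
  any(ratio >= C | ratio <= 1 / C)
}

# Repeatedly drop terminal nodes without gain until none is left.
prune_toy <- function(nodes, C) {
  repeat {
    leaves <- Filter(function(x) x != "e" && is_terminal(x, nodes), nodes)
    to_drop <- Filter(function(x) !has_gain(x, C), leaves)
    if (length(to_drop) == 0) return(nodes)
    nodes <- setdiff(nodes, to_drop)
  }
}

kept <- prune_toy(names(probs), C = 1.2)
print(kept)
```

In this example "a-a" is removed in the first pass, which turns "a" into a terminal node that is removed in the second pass, while "b-b" is kept because it differs markedly from "b".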

A first implemented gain function, which is used by the Learn-PSA algorithm, is based on the ratio between $\hat{P}(\sigma|c)$ and $\hat{P}(\sigma|suf(c))$ for each $\sigma \in A$. A node represents an information gain if, for at least one symbol $\sigma \in A$, the ratio is greater than or equal to the cut-off $C$ or lower than or equal to $1/C$, that is if $$G_{1}(c)=\sum_{\sigma \in A} 1 \left[ \frac{\hat{P}(\sigma |c)}{\hat{P}(\sigma | suf(c))} \geq C \; \lor \; \frac{\hat{P}(\sigma |c)}{\hat{P}(\sigma | suf(c))} \leq \frac{1}{C} \right] \geq 1$$ where $C$ is a user defined cut-off value. Nodes that do not satisfy the above condition are pruned. For $C=1$ no node is removed, since every ratio is then either $\geq 1$ or $\leq 1$, so that even a node whose next-symbol probability distribution is identical to that of its parent satisfies the condition.
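As a rough numerical illustration of $G_{1}$, the following base R sketch counts the symbols whose ratio falls outside the interval $(1/C, C)$. The function g1_gain and the probability vectors are hypothetical and not part of the package; parent probabilities are assumed to be strictly positive.

```r
# Hypothetical helper (not part of PST): number of symbols whose
# probability ratio is >= C or <= 1/C. The node is kept if the
# result is >= 1.
g1_gain <- function(p_node, p_parent, C) {
  stopifnot(length(p_node) == length(p_parent),
            all(p_parent > 0), C >= 1)
  ratio <- p_node / p_parent
  sum(ratio >= C | ratio <= 1 / C)
}

p_parent    <- c(a = 0.50, b = 0.50)
p_similar   <- c(a = 0.55, b = 0.45)  # close to the parent
p_different <- c(a = 0.80, b = 0.20)  # far from the parent

g_sim  <- g1_gain(p_similar,   p_parent, C = 1.20)  # ratios 1.1 and 0.9
g_diff <- g1_gain(p_different, p_parent, C = 1.20)  # ratios 1.6 and 0.4
g_c1   <- g1_gain(p_similar,   p_parent, C = 1)     # C = 1: every ratio counts

print(c(similar = g_sim, different = g_diff, C1 = g_c1))
```

With $C=1.20$ the first node has no gain and would be pruned, the second one is kept, and with $C=1$ even the similar node passes the test.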

The context algorithm uses another gain function, namely $$G_{2}(c)=\sum_{\sigma \in A} \hat{P}(\sigma|c)\log \left( \frac{\hat{P}(\sigma|c)}{\hat{P}(\sigma|suf(c))} \right) N(c) > C$$ where $c$ is the context labelling the terminal node and $N(c)$ is the number of occurrences of $c$ in the data. The cutoff $C$ is specified on the scale of $\chi^{2}$-quantiles (Maechler and Buehlmann, 2004): $$C=C(\alpha)=\frac{1}{2}qchisq(1-\alpha,v), \quad v=|A|-1$$ where $qchisq(p=1-\alpha,v)$ is the quantile function of a $\chi^{2}$ distribution with $v$ degrees of freedom. The cutoff $C$ is a threshold for the difference of deviances between a tree $S^{1}$ and its subtree $S^{2}$ obtained by pruning the terminal node $c$. Typical values for $\alpha$ are $5\%$ and $1\%$, yielding $p=0.95$ and $p=0.99$ respectively.
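The $G_{2}$ criterion and its cutoff can be reproduced by hand in base R. The sketch below is illustrative only: g2_gain and g2_cutoff are hypothetical helpers, the probabilities and counts are made up, and terms with $\hat{P}(\sigma|c)=0$ are taken to contribute zero.

```r
# Hypothetical helpers (not part of PST).
# G2: N(c) times the Kullback-Leibler divergence between the node
# distribution and the parent distribution.
g2_gain <- function(p_node, p_parent, n) {
  pos <- p_node > 0  # 0 * log(0) is treated as 0
  n * sum(p_node[pos] * log(p_node[pos] / p_parent[pos]))
}

# Cutoff on the chi-squared quantile scale, with |A| - 1 degrees of freedom.
g2_cutoff <- function(alpha, alphabet_size) {
  qchisq(1 - alpha, df = alphabet_size - 1) / 2
}

p_parent <- c(a = 0.5, b = 0.5)
p_node   <- c(a = 0.8, b = 0.2)

C95 <- g2_cutoff(alpha = 0.05, alphabet_size = 2)

# The same distributions give a gain or not depending on N(c).
keep_n20 <- g2_gain(p_node, p_parent, n = 20) > C95
keep_n5  <- g2_gain(p_node, p_parent, n = 5)  > C95

print(c(C95 = C95, keep_n20 = keep_n20, keep_n5 = keep_n5))
```

Unlike $G_{1}$, this criterion weights the divergence by the number of occurrences of the context, so a rarely observed context needs a larger difference from its parent to be kept.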

References

Ron, D.; Singer, Y. & Tishby, N. The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 1996, 25, 117-149

Bejerano, G. & Yona, G. Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics, 2001, 17, 23-43

Maechler, M. & Buehlmann, P. Variable Length Markov Chains: Methodology, Computing, and Software. Journal of Computational and Graphical Statistics, 2004, 13, 435-455

Examples


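## Load PST; seqdef() below comes from TraMineR, on which PST depends
library(PST)

## Build a PST of maximal depth 3 on the s1 example data
data(s1)
s1.seq <- seqdef(s1)
S1 <- pstree(s1.seq, L=3, nmin=2, ymin=0.001)

## Prune with the G1 gain function; pruned nodes are tagged, not removed
S1.p1 <- prune(S1, gain="G1", C=1.20, delete=FALSE)
summary(S1.p1)
plot(S1.p1)

## Prune with the G2 gain function, using the cutoff given by the
## 95% chi-squared quantile with 1 degree of freedom
C95 <- qchisq(0.95,1)/2
S1.p2 <- prune(S1, gain="G2", C=C95, delete=FALSE)
plot(S1.p2)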
