sdists.trace: Edit Transcripts and Sequence Alignments

Description

This function computes and returns the set of all optimal but equivalent edit transcripts that transforms one sequences into another at minimum cost, as well as the corresponding aligned sequences, or, alternatively a combined edit graph.

Usage

sdists.trace(x, y, method = "ow", weight = c(1, 0, 2),
	     exclude = c(NA,NaN,Inf,-Inf), graph = FALSE)

Arguments

x,y

a numeric or string vector.

method

a mnemonic string referencing a distance measure.

weight

vector or matrix of parameter values.

exclude

argument to factor.

graph

option to compute the combined edit graph.

Value

A list with components each a list of two factors, the aligned sequences. The names of the components are the edit transcripts, and the attribute value contains the minimum cost, i.e. the distance (or negative similarity). If graph = TRUE a vector of edit transcripts is returned with attributes value, table, pointer, and graph. The second contains the values of the dynamic programming table and the third a list of vectors x0, y0, x1, y1 representing the (back)pointers. Similarly, the fourth attribute is a list of vectors x0, y0, x1, y1, weight representing the edge set of all optimal paths. That is, each tuple contains the from and to coordinates as used by segments, each representing a pair of indexes into the first and second sequence, and the number of times an edge occurs on a path. Note that the origin of the coordinate system (0,0) corresponds to the element of table indexed by ("",""), where "" indicates the space symbol. Thus, if used as subscripts the coordinates have to be offset by one.

Warning

The interface is experimental and may change in the future

Details

Function sdists.trace complements the distance computation between sequences by sdists. So, please, see the details of method, weight, and exclude there. However, note the following differences: 1) you can supply only two sequences, either as vectors of numeric symbol codes, factors, or as strings, i.e. scalar vectors of type character. 2) you can supply a weight matrix with the rownames and colnames representing the symbol sets of the first and second sequence. For instance, this allows you to align a sequence with the profile of a multiple alignment. 3) if method = "ow" the space symbol "" is included in the factor levels so that you can conveniently replace NA in the aligned sequences.

A transcript uses the character codes I, D, R, and M, for insert, delete, replace, and match operations, which transform the first into the second sequence. Thus, conceptually a symbol has to be inserted into the first, deleted from the second, replaced in the first sequence, or matched in both, to obtain the second sequence. However, in the aligned sequences you will see NA, where an insert or delete would take place, indicating space.

In the case of a local alignment different symbols are used for the prefix and/or suffix of the alignment: i, d, and ? for insert, delete, and replace or match operations. However, note that their sole purpose is to obtain a common representation of the two sequences. Finally, only alignments of maximal length are reported.

The time complexity of finding a transcript is $O(n+m)$ for two sequences of length n and m, respectively $O(n*m)$ for the local alignment problem. However, note that the runtime for generating all transcripts can be $O((n*m)^3)$ in the worst case.

References

D. Gusfield (1997). Algorithms on Strings, Trees, and Sequences. Cambridge University Press, Chapter 11.

Examples

Run this code

### from the book
x1 <- "vintner"
y1 <- "writers"
b1 <- sdists.trace(x1, y1, weight=c(1,0,1))
b1
## longest common subsequence ?
sdists.trace("a","b", weight=c(0,-1,0))
## from the book
w2 <- matrix(-2,ncol=13,nrow=13)
w2[1,] <- w2[,1] <- -1
diag(w2) <- c(0,rep(2,12))
x2 <- "pqraxabcstvq"
y2 <- "xyaxbacsll"
colnames(w2) <- c("",unique(strsplit(paste(x2, y2, sep = ""),"")[[1]]))
b2 <- sdists.trace(x2, y2, method="awl", weight=w2)
b2
## alignment with different symbol sets
x3 <- "121314"
y3 <- "ABACAD"
w3 <- matrix(-1,nrow=5,ncol=5)
diag(w3) <- 0
rownames(w3) <- c("","1","2","3","4")
colnames(w3) <- c("","A","B","C","D")
b3 <- sdists.trace(x3, y3, method="aw", weight=w3)
b3
### test graphics computations
b11 <- sdists.trace(x1, y1, weight=c(1,0,1), graph = TRUE)
b11

Run the code above in your browser using DataLab

Join us for
RADAR: AI Edition

sdists.trace: Edit Transcripts and Sequence Alignments

Description

Usage

Arguments

Value

Warning

Details

References

See Also

Examples

Join us for RADAR: AI Edition

Description

Usage

Arguments

Value

Warning

Details

References

See Also

Examples

Join us for
RADAR: AI Edition