sdists: Sequence Distance Computation

Description

This function computes and returns the auto-distance matrix between the vectors of a list or between the character strings of a vector treating them as sequences of symbols, as well as the cross-distance matrix between two such lists or vectors.

Usage

sdists(x, y = NULL, method = "ow", weight = c(1, 0, 2), 
       exclude = c(NA,NaN,Inf,-Inf))

Arguments

x,y

a list (of vectors) or a vector of character.

method

a mnemonic string referencing a distance measure.

weight

vector or matrix of parameter values.

exclude

argument to factor.

Value

Auto distances are returned as an object of class dist and cross-distances as an object of class matrix.

Warning

The interface is experimental and may change in the future

Details

This function provides a common interface to different methods for computation of distances between sequences, such as the edit a.k.a. Levenshtein distance. Conversely, in the context of sequence alignment the similarity of the maximizing alignment is computed. Note that negative similarities are returned as distances. So be careful to use a proper weighting (scoring) scheme.

The following methods are currently implemented:

[object Object],[object Object],[object Object]

Missing (and non-finite) values should be avoided, i.e. either be removed or recoded (and appropriately weighted). By default they are excluded when coercing to factor and therefore mapped to NA. The result is then defined to be NA as we cannot determine a match!

The time complexity is O(n*m) for two sequences of length n and m.

Note that in the case of auto-distances the weight matrix must be (exactly) symmetric. Otherwise, for asymmetric weights y must not be NULL. For instance, x may be supplied twice (see the examples).

References

D. Gusfield (1997). Algorithms on Strings, Trees, and Sequences. Cambridge University Press, Chapter 11.

Examples

Run this code

### numeric data
sdists(list(c(2,2,3),c(2,4,3)))			# 2
sdists(list(c(2,2,3),c(2,4,3)),weight=c(1,0,1)) # 1

### character data
w <- matrix(-1,nrow=8,ncol=8)			# weight/score matrix for
diag(w) <- 0					# longest common subsequence
colnames(w) <- c("",letters[1:7])
x <- sapply(rbinom(3,64,0.5),function(n,x)
    paste(sample(x,n,rep=TRUE),collapse=""),
    colnames(w)[-1])
x
sdists(x,method="aw",weight=w)
sdists(x,x,method="aw",weight=w)		# check
diag(w) <- seq(0,7)
sdists(x,method="aw", weight=w)			# global alignment
sdists(x,method="awl",weight=w)			# local alignment

## empty strings
sdists("", "FOO")
sdists("", list(c("F","O","O")))
sdists("", list(""))				# space symbol
sdists("", "abc", method="aw", weight=w)
sdists("", list(""), method="aw", weight=w)

### asymmetric weights
w[] <- matrix(-sample(0:5,64,TRUE),ncol=8)
diag(w) <- seq(0,7)
sdists(x,x,method="aw", weight=w)
sdists(x,x,method="awl",weight=w)

### missing values
sdists(list(c(2,2,3),c(2,NA,3)),exclude=NULL)	# 2 (include anything)
sdists(list(c(2,2,3),c(2,NA,3)),exclude=NA)	# NA

Run the code above in your browser using DataLab