dissimilarity: Dissimilarities and Correlations Between Seriation Orders

Description

Calculates dissimilarities/correlations between seriation orders in a list.

Usage

seriation_dist(x, method = "spearman", align = TRUE)
seriation_align(x, method = "spearman")
seriation_cor(x, method = "spearman")

Arguments

x

seriation orders as a list with elements of class ser_permutation_vector.

method

a character string with the name of the measure to use. Available measures are: "kendall", "spearman", "manhattan", "euclidean", "hamming", and "ppc" (positional proximity coefficient).

align

a logical indicating if the orders should be pairwise aligned (i.e., also check reversed order) for calculating the distances.

Value

seriation_dist returns an object of class dist. seriation_align returns a new list with elements of class ser_permutation.

Details

For seriation_dist, the correlation coefficients (Kendall's tau and Spearman's rho) are converted into a dissimilarity by taking one minus the absolute value. For these and the ranking-based distance measures (Manhattan, Euclidean and Hamming), the distances between all seriations in forward and reverse order are calculated, and the pairwise minimum is used if align = TRUE. Note that the Manhattan distance between the ranks in a linear order is equivalent to Spearman's footrule metric (Diaconis 1988).
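As a rough illustration of the conversion described above, the following base R sketch works directly on rank (position) vectors instead of ser_permutation_vector objects. The helper names cor_dissimilarity() and aligned_manhattan() are hypothetical and not part of the package, and the package implementation may differ in details.

```r
# Hypothetical helpers for illustration only; not part of the seriation package.
# Inputs are rank (position) vectors for the same set of objects.

# One minus the absolute correlation: a reversed order counts as identical.
cor_dissimilarity <- function(r1, r2, method = "spearman") {
  1 - abs(cor(r1, r2, method = method))
}

# Manhattan distance between ranks, taking the minimum over the
# forward and the reversed direction of the second order.
aligned_manhattan <- function(r1, r2) {
  n <- length(r2)
  r2_rev <- n + 1 - r2  # ranks after reversing the order
  min(sum(abs(r1 - r2)), sum(abs(r1 - r2_rev)))
}

r1 <- 1:5
r2 <- 5:1  # the same order, reversed

cor_dissimilarity(r1, r2)   # reversed orders have dissimilarity 0
aligned_manhattan(r1, r2)   # aligned distance
sum(abs(r1 - r2))           # unaligned Manhattan distance for comparison
```

In this sketch the reversed order is treated as equivalent to the original, which is the behavior that align = TRUE is described as providing for the ranking-based measures.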

The positional proximity coefficient (ppc) is a precedence-invariant measure based on the squared positional distances in two permutations (see Goulermas et al. 2015). The similarity measure is converted into a dissimilarity via $1 - ppc$. The align argument is ignored for this measure.

seriation_align normalizes the direction in a list of seriations such that ranking-based methods can be used. For the correlation coefficients "spearman" and "kendall" we first find the order which has the largest sum of positive correlations with all other orders. We use this order as the seed and reverse all orders that are negatively correlated. For "manhattan" and "euclidean" we add all reversed orders to the set and then use a modified version of Prim's algorithm for finding a minimum spanning tree (MST) to choose whether the original seriation order or its reverse should be used. We use the orders first added to the MST. Every time an order is added, its reverse is removed from the set of possible orders.
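The correlation-based alignment can be sketched in base R as follows. This is a minimal illustration, not the package implementation: the function align_by_correlation() is hypothetical, it operates on a matrix of ranks rather than on a list of seriation objects, and it assumes that "negatively correlated" means negatively correlated with the seed.

```r
# Hypothetical sketch for illustration; not the package implementation.
# `ranks` is a matrix with one column of ranks per seriation order.
align_by_correlation <- function(ranks, method = "spearman") {
  co <- cor(ranks, method = method)
  diag(co) <- 0

  # Seed: the order with the largest sum of positive correlations
  # with all other orders.
  pos_sum <- colSums(pmax(co, 0))
  seed <- which.max(pos_sum)

  # Assumption: reverse every order negatively correlated with the seed.
  flip <- co[seed, ] < 0
  n <- nrow(ranks)
  ranks[, flip] <- n + 1 - ranks[, flip]
  ranks
}

# Three orders of six objects; "b" is the reverse of "a".
ranks <- cbind(a = 1:6, b = 6:1, c = c(2, 1, 3, 4, 6, 5))
aligned <- align_by_correlation(ranks)

# After alignment no pair of orders is negatively correlated.
cor(aligned, method = "spearman")
```

Reversing an order of n objects maps rank r to n + 1 - r, which is why the flip can be applied to the rank matrix directly.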

References

P. Diaconis (1988): Group Representations in Probability and Statistics. Institute of Mathematical Statistics, Hayward, CA.

J.Y. Goulermas, A. Kostopoulos, and T. Mu (2015): A New Measure for Analyzing and Fusing Sequences of Objects. IEEE Transactions on Pattern Analysis and Machine Intelligence. Forthcoming.

Examples


library(seriation)

set.seed(1234)
## seriate dist of 50 flowers from the iris data set
data("iris")
x <- as.matrix(iris[-5])
x <- x[sample(1:nrow(x), 50),]
rownames(x) <- 1:50
d <- dist(x)

## create a list of different seriations
methods <- c("HC_single", "HC_complete", "OLO", "GW", "R2E", "VAT", 
  "TSP", "Spectral", "SPIN", "MDS", "Identity", "Random")

os <- sapply(methods, function(m) {
  cat("Doing ", m, "... ")
  tm <- system.time(o <- seriate(d, method = m))
  cat("took ", tm[3], "s.\n")
  o
})

## compare the methods using distances (default is based on 
## Spearman's rank correlation coefficient)
ds <- seriation_dist(os)
hmap(ds, margin=c(7,7))

## compare using actual correlation (reversed orders are neg. correlated!)
cs <- seriation_cor(os)
hmap(cs, margin=c(7,7))

## normalize direction of the seriation orders. 
## Now all but random and identity are highly pos. correlated
os2 <- seriation_align(os)
cs2 <- seriation_cor(os2)
hmap(cs2, margin=c(7,7))
  
## use Manhattan distance of the ranks (i.e., Spearman's foot rule)
## first without and then with pairwise alignment
ds <- seriation_dist(os, method="manhattan", align=FALSE)
plot(hclust(ds))

ds <- seriation_dist(os, method="manhattan", align=TRUE)
plot(hclust(ds))
