RankAggreg: Weighted Rank Aggregation of partial ordered lists

Description

Performs aggregation of ordered lists based on the ranks (optionally with additional weights) via the Cross-Entropy Monte Carlo algorithm or the Genetic Algorithm.

Usage

RankAggreg(x, k, weights=NULL, method=c("CE", "GA"),
distance=c("Spearman", "Kendall"), seed=NULL, maxIter = 1000,
convIn=ifelse(method=="CE", 7, 30), importance=rep(1,nrow(x)),
rho=.1, weight=.25, N=10*k^2, v1=NULL,
popSize=100, CP=.4, MP=.01, verbose=TRUE, standardizeWeights = TRUE, ...)

Arguments

a matrix of ordered lists to be combined (lists must be in rows)

size of the top-k list

weights

a matrix of scores (weights) to be used in the aggregation process. Weights in each row must be ordered either in decreasing or increasing order and must correspond to the elements in x

method

method to be used to perform rank aggregation: Cross Entropy Monte Carlo (CE) or Genetic Algorithm (GA)

distance

distance to be used which "measures" the similarity of ordered lists

seed

a random seed specified for reproducability; default: NULL

maxIter

the maximum number of iterations allowed; default: 1000

convIn

stopping criteria for both CE and GA algorithms. If the best solution does not change in convIn iterations, the algorithm converged; default: 7 for CE, 30 for GA

importance

vector of weights indicating the importance of each list in x; default: a vector of 1's ( equal weights are given to all lists

rho

(rho*N) is the "quantile" of candidate lists sorted by the function values. Used only by the Cross-Entropy algorithm

weight

a learning factor used in the probability update procedure of the CE algorithm

a number of samples to be generated by the MCMC; default: 10nk, where n is the number of unique elements in x. Used only by the Cross-Entropy algorithm

optional, can be used to specify the initial probability matrix; if v1=NULL, the initial probability matrix is set to 1/n, where n is the number of unique elements in x

popSize

population size in each generation of Genetic Algorithm; default: 100

Cross-over probability for the GA; the default value is .4

Mutation probability for the GA. This value should be small and the number of mutations in the population of size popSize and the number of features k is computed as popSize*k*MP.

verbose

boolean, if console output is to be displayed at each iteration

standardizeWeights

boolean, default is true which standardizes weights to [0,1]

…

additional arguments can be passed to the internal procedures:

-- p - penalty for the Kendall's tau distance; default: 0

Value

top.list

Top-k aggregated list

optimal.value

the minimum value of the objective function corresponding to the top-k list

sample.size

the number of samples generated by the MCMC at each iteration

num.iter

the number of iterations until convergence

method

which algorithm was used

distance

which distance was used

importance

an importance vector used

lists

the original ordered lists

weights

scaled weights if specified

sample

objective function scores from the last iteration

summary

matrix containing minimum and median objective function scores for each iteration

Details

The function performs rank aggregation via the Cross-Entropy Monte Carlo algorithm or the Genetic Algorithm. Both approaches can and should be used when k is relatively large (k > 10). If k is small, one can enumerate all possible candidate lists and find the minimum directly using the BruteAggreg function available in this package.

The Cross-Entropy Monte Carlo algorithm is an iterative procedure for solving difficult combinatorial problems in which it is computationally not feasable to find the solution directly. In the context of rank aggregation, the algorithm searches for the "super"-list which is as close as possible to the ordered lists in x. We use either the Spearman footrule distance or the Kendall's tau to measure the "closeness" of any two ordered lists (or modified by us the weighted versions of these distances). Please refer to the paper in the references for further details.

The Genetic Algorithm requires setting CP and MP parameters which effect the rate of "evolution" in the population. If both CP and MP are small, the algorithms is very conservative and may take a long time to search the solution space of all ordered candidate lists. On the other hand, setting CP and MP (especially MP) large will introduce a large number of mutations in the population which can result in a local optima.

The convergence criteria used by both algorithms is the repetition of the same minimum value of the objective function in convIn consecutive iterations.

References

Pihur, V., Datta, S., and Datta, S. (2007) "Weighted rank aggregation of cluster validation measures: a Monte Carlo cross-entropy approach" Bioinformatics, 23(13):1607-1615

Examples

Run this code

# NOT RUN {
# rank aggregation without weights
x <- matrix(c("A", "B", "C", "D", "E",
        "B", "D", "A", "E", "C",
        "B", "A", "E", "C", "D",
        "A", "D", "B", "C", "E"), byrow=TRUE, ncol=5)

(CESnoweights <- RankAggreg(x, 5, method="CE", distance="Spearman", N=100, convIn=5, rho=.1))

# weighted rank aggregation
set.seed(100)
w <- matrix(rnorm(20), ncol=5)
w <- t(apply(w, 1, sort))

# using the Cross-Entropy Monte-Carlo algorithm
(CES <- RankAggreg(x, 5, w, "CE", "Spearman", rho=.1, N=100, convIn=5))
plot(CES)
(CEK <- RankAggreg(x, 5, w, "CE", "Kendall", rho=.1, N=100, convIn=5))

# using the Genetic algorithm
(GAS <- RankAggreg(x, 5, w, "GA", "Spearman"))
plot(GAS)
(GAK <- RankAggreg(x, 5, w, "GA", "Kendall"))

# more complex example (to get a better solution, increase maxIter)
data(geneLists)
topGenes <- RankAggreg(geneLists, 25, method="GA", maxIter=100)
plot(topGenes)
# }

Run the code above in your browser using DataLab