wmwTest: Wilcoxon-Mann-Whitney rank sum test for high-throughput expression profiling data

Description

We have implemented an highly efficient Wilcoxon-Mann-Whitney rank sum test for high-throughput expression profiling data. For datasets with more than 100 features (genes), the function performs almost identical to its R implementations (wilcox.test in stats, or rankSumTestWithCorrelation in limma) can be more than 1000 times faster.

Usage

wmwTest(x, ind.list, alternative = c("greater", "less", "two.sided",
"U","abs.log10.greater","log10.less","abs.log10.two.sided","Q"), simplify = TRUE)

Arguments

A numeric matrix. All other data types (e.g. numeric vectors or ExpressionSet objects) are coerced into matrix.

ind.list

A list of integer indices (starting from 1) indicating signature genes. Can be of length zero. Other data types (e.g. a list of numeric or logical vectors, or a numeric or logical vector) are coerced into such a list. See details below for a special case using GMT files.

alternative

The value type to be returned, allowed values include greater and less (one-sided tests), two.sided, and U statistic, and their log10 transformation variants. See details below.

simplify

Logical. If not, the returning value is in matrix format; if set to TRUE, the results are simplified into vectors when possible (default).

Value

A numeric matrix or vector containing the statistic.

Details

The basic application of the function is to test the enrichment of gene sets in expression profiling data or differentially expressed data (the matrix with feature/gene in rows and samples in columns).

A special case is when x is an eSet object (e.g. ExpressionSet), and ind.list is a list returned from readGmt function. In this case, the only requirement is that one column named GeneSymbol in the featureData contain gene symbols used in the GMT file. See the example below.

Besides the conventional alternatives such as ‘greater’, ‘less’, ‘two.sided’, and ‘U’, wmwTest (from version 0.99-1) provides further alternatives: abs.log10.greater and log10.less perform log10 transformation on respective p-values and give the transformed value a proper sign (positive for greater than, and negative for less than); abs.log10.two.sided transforms two-sided p-values to non-negative values; and Q score reports absolute log10-transformation of p-value of the two-side variant, and gives a proper sign to it, depending on whether it is rather greater than (positive) or less than (negative).

References

Barry, W.T., Nobel, A.B., and Wright, F.A. (2008). A statistical framework for testing functional categories in microarray data. _Annals of Applied Statistics_ 2, 286-315.

Wu, D, and Smyth, GK (2012). Camera: a competitive gene set test accounting for inter-gene correlation. _Nucleic Acids Research_ 40(17):e133

Zar, JH (1999). _Biostatistical Analysis 4th Edition_. Prentice-Hall International, Upper Saddle River, New Jersey.

Examples

Run this code

## R-native data structures
set.seed(1887)
rd <- rnorm(1000)
rl <- sample(c(TRUE, FALSE), 1000, replace=TRUE)
wmwTest(rd, rl, alternative="two.sided")
wmwTest(rd, which(rl), alternative="two.sided")
rd1 <- rd + ifelse(rl, 0.5, 0)
wmwTest(rd1, rl, alternative="greater")
wmwTest(rd1, rl, alternative="U")
rd2 <- rd - ifelse(rl, 0.2, 0)
wmwTest(rd2, rl, alternative="greater")
wmwTest(rd2, rl, alternative="two.sided")
wmwTest(rd2, rl, alternative="less")

## matrix forms
rmat <- matrix(c(rd, rd1, rd2), ncol=3, byrow=FALSE)
wmwTest(rmat, rl, alternative="two.sided")
wmwTest(rmat, rl, alternative="greater")

wmwTest(rmat, which(rl), alternative="two.sided")
wmwTest(rmat, which(rl), alternative="greater")

## other alternatives
wmwTest(rmat, which(rl), alternative="U")
wmwTest(rmat, which(rl), alternative="abs.log10.greater")
wmwTest(rmat, which(rl), alternative="log10.less")
wmwTest(rmat, which(rl), alternative="abs.log10.two.sided")
wmwTest(rmat, which(rl), alternative="Q")

## using ExpressionSet
data(sample.ExpressionSet)
testSet <- sample.ExpressionSet
fData(testSet)$GeneSymbol <- paste("GENE_",1:nrow(testSet), sep="")
mySig1 <- sample(c(TRUE, FALSE), nrow(testSet), prob=c(0.25, 0.75), replace=TRUE)
wmwTest(testSet, which(mySig1), alternative="greater")

## using integer
exprs(testSet)[,1L] <- exprs(testSet)[,1L] + ifelse(mySig1, 50, 0)
wmwTest(testSet, which(mySig1), alternative="greater")

## using lists
mySig2 <- sample(c(TRUE, FALSE), nrow(testSet), prob=c(0.6, 0.4), replace=TRUE)
wmwTest(testSet, list(first=mySig1, second=mySig2))

## using GMT file
gmt_file <- system.file("extdata/exp.tissuemark.affy.roche.symbols.gmt", package="BioQC")
gmt_list <- readGmt(gmt_file)

gss <- sample(unlist(sapply(gmt_list, function(x) x$genes)), 1000)
eset<-new("ExpressionSet",
          exprs=matrix(rnorm(10000), nrow=1000L),
          phenoData=new("AnnotatedDataFrame", data.frame(Sample=LETTERS[1:10])),
          featureData=new("AnnotatedDataFrame",data.frame(GeneSymbol=gss)))
esetWmwRes <- wmwTest(eset ,gmt_list, alternative="greater")
summary(esetWmwRes)

Run the code above in your browser using DataLab