GeneOverlap: Test overlap between two gene lists using Fisher's exact test.

Description

Given two gene lists, tests the significance of their overlap in comparison with a genomic background. The null hypothesis is that the odds ratio is no larger than 1. The alternative is that the odds ratio is larger than 1.0. It returns the p-value, estimated odds ratio and intersection.

Usage

"show"(object)
"print"(x, ...)

Arguments

object

A GeneOverlap object.

...

They are not used.

Details

The problem of gene overlap testing can be described by a hypergeometric distribution where one gene list A defines the number of white balls in the urn and the other gene list B defines the number of white balls in the draw. Assume the total number of genes is n, the number of genes in A is a and the number of genes in B is b. If the intersection between A and B is t, the probability density of seeing t can be calculated as:

dhyper(t, a, n - a, b)

without loss of generality, we can assume b <= a. So the largest possible value for t is b. Therefore, the p-value of seeing intersection t is:

sum(dhyper(t:b, a, n - a, b))

The Fisher's exact test forms this problem slightly different but its calculation is also based on the hypergeometric distribution. It starts by constructing a contingency table:

matrix(c(n - union(A,B), setdiff(A,B), setdiff(B,A), intersect(A,B)), nrow=2)

It therefore tests the independence between A and B and is conceptually more straightforward. The GeneOverlap class is implemented using Fisher's exact test.

It is better to illustrate a concept using some example. Let's assume we have a genome of size 200 and two gene lists with 70 and 30 genes each. If the intersection between the two is 10, the hypergeometric way to calculate the p-value is:

sum(dhyper(10:30, 70, 130, 30))

which gives us p-value 0.6561562. If we use Fisher's exact test, we should do:

fisher.test(matrix(c(110, 20, 60, 10), nrow=2), alternative="greater")

which gives exactly the same p-value. In addition, the Fisher's test function also provides an estimated odds ratio, confidence interval, etc. The Jaccard index is a measurement of similarity between two sets. It is defined as the number of intersections over the number of unions.

References

http://en.wikipedia.org/wiki/Fisher's_exact_test

http://en.wikipedia.org/wiki/Jaccard_index

Examples

Run this code

data(GeneOverlap)
go.obj <- newGeneOverlap(hESC.ChIPSeq.list$H3K4me3, 
                         hESC.ChIPSeq.list$H3K9me3, 
                         gs.RNASeq)
go.obj <- testGeneOverlap(go.obj)
go.obj  # show.
print(go.obj)  # more details.
getContbl(go.obj)  # contingency table.