The problem of gene overlap testing can be described by a hypergeometric
distribution, where one gene list A defines the white balls in the urn and
the other gene list B defines the balls drawn from it. Assume the total
number of genes is n, the number of genes in A is a, and the number of
genes in B is b. If the intersection between A and B has size t, the
probability of observing exactly t shared genes can be calculated as:
dhyper(t, a, n - a, b)
Without loss of generality, we can assume b <= a, so the largest possible
value for t is b. Therefore, the p-value of seeing an intersection of at
least t is:
sum(dhyper(t:b, a, n - a, b))
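The upper-tail sum above can also be obtained from the cumulative distribution function. The following is a minimal sketch; the sizes n, a, b and the overlap t_obs are hypothetical values chosen only for illustration and do not come from any real data set.

```r
# Hypothetical sizes, chosen only for illustration.
n <- 1000     # total number of genes
a <- 100      # number of genes in list A
b <- 50       # number of genes in list B (b <= a)
t_obs <- 12   # observed size of the intersection

# Upper tail summed term by term, as in the text.
p_sum <- sum(dhyper(t_obs:b, a, n - a, b))

# Same tail via the distribution function: P(X >= t) = P(X > t - 1).
p_tail <- phyper(t_obs - 1, a, n - a, b, lower.tail = FALSE)

# The two results agree up to floating-point error.
stopifnot(isTRUE(all.equal(p_sum, p_tail)))
```

Note the `t_obs - 1`: with `lower.tail = FALSE`, phyper returns P(X > q), so the observed value itself would be excluded without the shift.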
Fisher's exact test formulates this problem slightly differently, but its
calculation is also based on the hypergeometric distribution. It starts by
constructing a contingency table:
matrix(c(n - length(union(A, B)), length(setdiff(A, B)),
         length(setdiff(B, A)), length(intersect(A, B))),
       nrow=2)
It therefore tests the independence between A and B and is conceptually
more straightforward. The GeneOverlap class is implemented using Fisher's
exact test.
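As a rough sketch of how such a table might be built from two gene lists, the example below uses made-up gene identifiers (g1 to g20); the names genome, A, B and tab are assumptions for illustration only, and this is not the code of the GeneOverlap class itself.

```r
# Hypothetical toy data: a "genome" of 20 genes and two gene lists.
genome <- paste0("g", 1:20)
A <- paste0("g", 1:8)
B <- paste0("g", 6:11)

n <- length(genome)

# matrix() fills column by column, so the first column holds genes
# outside B and the second column holds genes inside B.
tab <- matrix(c(n - length(union(A, B)), length(setdiff(A, B)),
                length(setdiff(B, A)), length(intersect(A, B))),
              nrow = 2,
              dimnames = list(A = c("notA", "inA"),
                              B = c("notB", "inB")))

# One-sided test for over-representation of the overlap.
res <- fisher.test(tab, alternative = "greater")
res$p.value
```

The four cells partition the genome, so they should always sum to n; checking this is a cheap way to catch gene identifiers that are missing from the background set.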
A concept is best illustrated with an example. Let's assume we have a
genome of 200 genes and two gene lists with 70 and 30 genes, respectively.
If the two lists share 10 genes, the hypergeometric way to calculate the
p-value is:
sum(dhyper(10:30, 70, 130, 30))
which gives a p-value of 0.6561562. With Fisher's exact test, the
equivalent call is:
fisher.test(matrix(c(110, 20, 60, 10), nrow=2),
alternative="greater")
which gives exactly the same p-value. In addition, the fisher.test
function reports an estimated odds ratio and its confidence interval,
among other statistics.
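The components of the test result can be inspected directly. The sketch below reuses the numbers from the example; the variable names tab, res and p_hyper are only illustrative.

```r
# Contingency table from the example: genome of 200, lists of 70 and 30,
# intersection of 10.
tab <- matrix(c(110, 20, 60, 10), nrow = 2)
res <- fisher.test(tab, alternative = "greater")

res$p.value    # one-sided p-value
res$estimate   # conditional maximum likelihood estimate of the odds ratio
res$conf.int   # one-sided interval; the upper bound is Inf for "greater"

# Agreement with the hypergeometric calculation, up to floating-point error.
p_hyper <- sum(dhyper(10:30, 70, 130, 30))
stopifnot(isTRUE(all.equal(res$p.value, p_hyper)))
```

The odds ratio reported by fisher.test is the conditional estimate, so it will generally differ slightly from the simple cross-product ratio of the table.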
The Jaccard index is a measure of similarity between two sets. It is
defined as the size of the intersection divided by the size of the union.
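A minimal sketch of this definition is given below. The helper name jaccard and the toy gene identifiers are hypothetical and are not part of any package API.

```r
# Hypothetical helper: Jaccard index of two vectors treated as sets.
# intersect() and union() already discard duplicated elements.
jaccard <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

# Toy gene lists sharing g6, g7 and g8 out of 11 distinct genes.
A <- paste0("g", 1:8)
B <- paste0("g", 6:11)
jaccard(A, B)   # 3 / 11

# With only set sizes available, as in the example above
# (lists of 70 and 30 genes sharing 10):
10 / (70 + 30 - 10)   # 10 / 90
```

The index ranges from 0 (no shared genes) to 1 (identical sets). If both sets are empty, this sketch returns NaN, which a caller may want to handle explicitly.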