Takes a presence/absence matrix with genes as the rows and modules as columns and calculates a matrix of log-transformed Fisher p-values.
scoreLFMatrix_C( geneSetCollection_m,
                 e_precision = as.numeric(c(12)),
                 alternative = as.integer(c(1)))
# # NOTE: The following also works and may be preferable for
# # many users:
# scoreLFMatrix_C( geneSetCollection_m,
# e_precision = 12,
# alternative = 1 )
A numerical matrix containing the specified log Fisher p-values for all non-self pairs. Values on the
diagonal (which would correspond to self-self comparison p-values) are NA. The 'lower_is_closer'
attribute on the matrix is set to TRUE, except in the case of alternative = 2, where it is set to FALSE.
The distance attribute in the output matrix is set to 'stlf' for option 1 (single, upper tail), 'ltlf' for option 2 (lower tail), 'ttlf' for option 3 (two-tailed), and 'lf' for option 4 (log partial Fisher p-value).
geneSetCollection_m: (required) A logical presence/absence matrix representation of a gene set collection in which columns correspond to gene sets, rows correspond to genes, and values are TRUE if a gene is present in a gene set and FALSE otherwise. Row and column names correspond to gene symbols and gene set identifiers, respectively. NOTE: for a typical GSNA analysis, this matrix would include only observed, filtered genes and significant gene set hits from pathways analysis. Using a matrix version of the full MSigDB without filtering genes, for example, would likely be unworkably slow and memory intensive.
e_precision: (optional, default 12) Numeric value controlling the precision of the calculated log p-value. Because of the precision limits inherent in C++ double-precision numbers, log p-values whose untransformed p-values differ by more than a certain magnitude cannot effectively be added. This parameter was introduced to accelerate the summation of p-values by cutting the summation off once the acceptable level of precision has been reached; it was found that it also appears to prevent artifacts caused by arithmetic underflow.
alternative: (optional, default 1) An integer value specifying one of 4 alternative p-value calculations, where 1 specifies the single, upper-tail log Fisher p-value, 2 signifies the single, lower-tail Fisher p-value, 3 signifies the 2-tailed Fisher p-value, and 4 signifies the partial Fisher p-value (see below).
We use the Fisher test to assess the statistical significance of the overlap of two gene sets. For our purposes the test determines whether two gene sets share more (or in some cases fewer) common members than expected, assuming a null hypothesis of random membership. The two sets need not be of the same size, but for the purposes of the test their sizes are assumed to be fixed.
Consider a 2x2 contingency matrix of the following form:
$$\biggl[\begin{matrix}a & b \\ c & d\end{matrix}\biggr]$$
Given a background of observable genes and two gene sets, i and j, that may overlap, this contingency table is used to represent four numbers:
a: the number of genes observed in the background but not in i or j
b: the number of observed genes in i but not j
c: the number of observed genes in j but not i and
d: the number of observed genes in both j and i, i.e. the overlap.
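The four counts can be sketched in base R as follows. This is only an illustration: the toy matrix `pa`, its gene and set names, and the variable names `n_a` to `n_d` are assumptions made for this example and are not part of the package.

```r
# Hypothetical toy presence/absence matrix: 10 genes (rows) x 2 gene sets (columns).
pa <- matrix(FALSE, nrow = 10, ncol = 2,
             dimnames = list(paste0("GENE", 1:10), c("set_i", "set_j")))
pa[1:5, "set_i"] <- TRUE   # set i contains GENE1..GENE5
pa[4:7, "set_j"] <- TRUE   # set j contains GENE4..GENE7

in_i <- pa[, "set_i"]
in_j <- pa[, "set_j"]

n_a <- sum(!in_i & !in_j)  # a: in the background but in neither set
n_b <- sum( in_i & !in_j)  # b: in i but not j
n_c <- sum(!in_i &  in_j)  # c: in j but not i
n_d <- sum( in_i &  in_j)  # d: in both sets (the overlap)

# Arrange the counts as the 2x2 contingency matrix [a b; c d].
contingency <- matrix(c(n_a, n_b, n_c, n_d), nrow = 2, byrow = TRUE)
print(contingency)
```

The four counts always sum to the number of rows in the presence/absence matrix, i.e. the size of the observable background.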
The partial Fisher p-value, i.e. the probability of that particular contingency table given fixed row and column sums, is given by:
$$p = \dfrac{(a + b)! (c + d)! (a + c)! (b + d)!}{a! b! c! d! (a+b+c+d)!}$$
The log-transformed value of this partial p-value is what is returned in the distance matrix when the argument alternative = 4. In most cases it is less than the two-tailed p-value, though it tracks closely with it.
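As a minimal sketch in base R (not the package's C++ implementation), the factorial formula can be evaluated for hypothetical toy counts and compared with `stats::dhyper()`, which gives the same hypergeometric probability. The counts and variable names below are assumptions chosen for illustration.

```r
# Hypothetical toy counts for the 2x2 table (not taken from any real data set).
n_a <- 3; n_b <- 3; n_c <- 2; n_d <- 2

# Partial Fisher p-value written directly from the factorial formula.
partial_p <- factorial(n_a + n_b) * factorial(n_c + n_d) *
  factorial(n_a + n_c) * factorial(n_b + n_d) /
  (factorial(n_a) * factorial(n_b) * factorial(n_c) * factorial(n_d) *
     factorial(n_a + n_b + n_c + n_d))

# The same probability from the hypergeometric density: an overlap of n_d when
# (n_c + n_d) genes are drawn from (n_b + n_d) members and (n_a + n_c)
# non-members of set i.
partial_p_hyper <- dhyper(n_d, n_b + n_d, n_a + n_c, n_c + n_d)

# Two-tailed Fisher p-value for comparison; the partial p-value is one of the
# terms in that sum, so it cannot exceed it.
tab <- matrix(c(n_a, n_b, n_c, n_d), nrow = 2, byrow = TRUE)
two_tailed_p <- fisher.test(tab)$p.value

print(c(partial = partial_p, hypergeometric = partial_p_hyper,
        two_tailed = two_tailed_p))
```

Plain factorials overflow for realistic gene counts, which is why the calculation is carried out on log-transformed values in practice.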
The actual single- and two-tailed p-values are derived from this number by summation, keeping the sum of
each row and column of the 2x2 contingency matrix constant, as per the assumptions of the Fisher test.
For the single-tailed alternative representing the upper-tail ('greater-than') expected overlap of the two gene sets (alternative = 1), the terms start with d as the observed number of shared members between set i and set j. Then d is incremented toward the maximal possible number of shared genes (the lesser of the number of genes in sets i and j), with a, b, and c adjusted accordingly to keep the row and column sums constant, and the resulting partial p-values are summed.
For the lower-tail ('less-than') alternative (alternative = 2), the summation starts with d as the number of shared members between sets i and j (as with the upper-tail calculation) but then decrements it to 0.
For the 2-tailed alternative (alternative = 3), the function sums all the terms with values equal to or less than the partial p-value defined above.
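The three summations described above can be sketched in base R as follows. This is an illustration on untransformed probabilities with hypothetical toy counts, not the package's C++ code; the helper `p_table()` and all variable names are assumptions. The results are compared with `stats::fisher.test()`, whose two-sided p-value uses the same "terms no larger than the observed one" rule with a small relative tolerance.

```r
# Hypothetical toy counts for the 2x2 table [a b; c d].
n_a <- 3; n_b <- 3; n_c <- 2; n_d <- 2

size_i  <- n_b + n_d               # number of genes in set i
size_j  <- n_c + n_d               # number of genes in set j
n_total <- n_a + n_b + n_c + n_d   # observable background

# Partial p-value of the table with overlap k, row and column sums held fixed.
p_table <- function(k) dhyper(k, size_i, n_total - size_i, size_j)

d_max <- min(size_i, size_j)       # largest possible overlap

p_upper <- sum(p_table(n_d:d_max)) # upper tail: increment d up to d_max
p_lower <- sum(p_table(0:n_d))     # lower tail: decrement d down to 0

# Two-tailed: sum every term no larger than the observed table's term.
# The 1e-7 relative tolerance mirrors the one used inside fisher.test().
p_all <- p_table(0:d_max)
p_two <- sum(p_all[p_all <= p_table(n_d) * (1 + 1e-7)])

# Cross-check against the reference implementation in the stats package.
tab <- matrix(c(n_a, n_b, n_c, n_d), nrow = 2, byrow = TRUE)
all.equal(p_upper, fisher.test(tab, alternative = "greater")$p.value)
all.equal(p_lower, fisher.test(tab, alternative = "less")$p.value)
all.equal(p_two,   fisher.test(tab, alternative = "two.sided")$p.value)
```

Overlaps that the fixed margins make impossible contribute zero probability in `dhyper()`, so summing down to 0 is safe even when the smallest feasible overlap is larger than 0.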
All calculations are done on log-transformed values to avoid arithmetic underflow:
$$\ln(p) = \ln((a + b)!) + \ln((c + d)!) + \ln((a + c)!) + \ln((b + d)!) - \ln(a!) - \ln(b!) - \ln(c!) - \ln(d!) - \ln((a + b + c + d)!)$$
Since log-transformed p-values cannot be directly added, the so-called log-sum-exponential trick is used to combine them.
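A minimal base R sketch of the log-space calculation and the log-sum-exp combination follows. The helpers `log_table_p()` and `log_sum_exp()` are hypothetical names introduced for this illustration and are not functions exported by the package; the package's own C++ implementation, including its handling of e_precision, may differ.

```r
# ln(p) of one 2x2 table, following the log-factorial formula above.
log_table_p <- function(a, b, c, d) {
  lfactorial(a + b) + lfactorial(c + d) + lfactorial(a + c) + lfactorial(b + d) -
    lfactorial(a) - lfactorial(b) - lfactorial(c) - lfactorial(d) -
    lfactorial(a + b + c + d)
}

# log(sum(exp(log_x))) computed stably by factoring out the largest term.
log_sum_exp <- function(log_x) {
  m <- max(log_x)
  if (!is.finite(m)) return(m)   # all terms are -Inf (or some term is +Inf)
  m + log(sum(exp(log_x - m)))
}

# Hypothetical toy counts for the 2x2 table [a b; c d].
n_a <- 3; n_b <- 3; n_c <- 2; n_d <- 2

# Upper tail: raise d by s while adjusting a, b, c so that margins stay fixed.
s <- 0:min(n_b, n_c)
log_terms   <- log_table_p(n_a + s, n_b - s, n_c - s, n_d + s)
log_p_upper <- log_sum_exp(log_terms)

# Why the trick matters: exp(-800) underflows to 0 in double precision,
# so log(exp(-800) + exp(-801)) evaluates to -Inf, whereas the stabilized
# sum stays finite.
naive  <- log(exp(-800) + exp(-801))
stable <- log_sum_exp(c(-800, -801))
print(c(log_p_upper = log_p_upper, naive = naive, stable = stable))
```

Factoring out the largest term means at least one exponential equals 1, so the sum inside the logarithm can never underflow to zero.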
Fisher p-values have long been used to assess the statistical significance of over- or underrepresentation of a component of a mixture, and thereby whether a sample is drawn from a particular mixture. The test has also long been used in pathways analysis as a way to assess whether an experimentally derived list of genes contains a statistical overrepresentation of genes from predefined gene sets or modules. Such experimental gene lists may include differentially expressed genes from a transcriptomic experiment, genes possessing promoters with differential chromatin accessibility from ATAC-Seq experiments, genes that were positive in screens of mutants, genes that were identified from GWAS experiments, and genes from other analyses. Likewise, the gene sets or modules are generally drawn from databases of experimentally characterized pathways, sets of genes over- or under-expressed in particular conditions, or genes associated with particular biological processes, chromosome regions, etc.
In the case of GSNA, we use the Fisher test to assess the overlap of genes not between an experimentally derived gene list and predefined gene sets from a database, but between the predefined gene sets themselves given their observability in a particular experiment.
buildGeneSetNetworkLFFast
scoreJaccardMatrix_C
library( GSNA )
# Get the background of observable genes set from
# expression data:
gene_background <- toupper(rownames( Bai_empty_expr_mat ))
# Using the sample gene set collection **Bai_gsc.tmod**,
# generate a gene presence-absence matrix filtered for the
# ref.background of observable genes:
presence_absence.mat <-
  makeFilteredGenePresenceAbsenceMatrix( ref.background = gene_background,
                                         geneSetCollection = Bai_gsc.tmod )
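# Calculate single, upper-tail log Fisher p-values. The argument is
# named because the second positional argument is e_precision:
lf.mat <- scoreLFMatrix_C( presence_absence.mat, alternative = 1 )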