scoreLFMatrix_C: scoreLFMatrix_C

Description

Takes a presence/absence matrix with genes as the rows and modules as columns and calculates a matrix of log-transformed Fisher p-values.

Usage

scoreLFMatrix_C( geneSetCollection_m,
                  e_precision = as.numeric(c(12)),
                  alternative = as.integer(c(1)))
# # NOTE: The following also works and may be preferable for
# # many users:
# scoreLFMatrix_C( geneSetCollection_m,
#                  e_precision = 12,
#                  alternative = 1 )

Value

A numerical matrix containing the specified log Fisher p-values for all non-self pairs. Values on the diagonal (which would correspond to self-self comparison p-values) are NA. The 'lower_is_closer' attribute on the matrix is set to TRUE, except in the case of alternative=2 where it is set to FALSE.

The distance attribute in the output matrix is set to 'stlf' for option 1 (single, upper tail), 'ltlf' for option 2 (lower tail), 'ttlf' for option 3 (two-tailed), and 'lf' for option 4 (log partial Fisher p-value).

Arguments

geneSetCollection_m: (required) A logical presence/absence matrix representation of a gene set collection in which columns correspond to gene sets, rows correspond to genes and values are TRUE if a gene is present in a gene set and FALSE otherwise. Row and column names correspond to gene symbols and gene set identifiers, respectively. NOTE: for a typical GSNA analysis, this matrix would include only observed filtered genes and significant gene set hits from pathways analysis. Using a matrix version of the full MSigDB without filtering genes, for example, would likely be unworkably slow and memory intensive.
e_precision: (optional, default 12) Numeric to control the precision of the log p-value calculated. Due to precision limits inherent in C++ double precision numbers, log p-values for which the corresponding untransformed p-values differ by more than a certain magnitude cannot effectively be added. This feature was introduced as a way to accelerate summation of p-values so as to allow summation to be cut off when the acceptable level of precision had been reached, but it was found that it also seems to prevent artifacts caused by arithmetic underflow.
alternative: (optional, default 1) An integer value specifying one of 4 alternative p-value calculations: 1 specifies the single, upper-tail log Fisher p-value; 2 specifies the single, lower-tail log Fisher p-value; 3 specifies the two-tailed log Fisher p-value; and 4 specifies the log partial Fisher p-value (see below).
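
The double-precision limit mentioned under e_precision can be illustrated in base R. This is only a sketch of the underlying arithmetic: it does not call scoreLFMatrix_C, it makes no claim about how e_precision is turned into a cutoff inside the C++ code, and the gaps of 10 and 40 log units are arbitrary illustrative values.

```r
# Add two p-values on the log scale: log(p1 + p2) = log(p1) + log1p(p2 / p1).
add_log_p <- function(log_p1, log_p2) {
  log_p1 + log1p(exp(log_p2 - log_p1))
}

log_p_big <- log(1e-5)

# A term 10 log units smaller still changes the sum.
sum_near <- add_log_p(log_p_big, log_p_big - 10)

# A term 40 log units smaller is lost entirely to rounding,
# so including it in a summation would be wasted work.
sum_far <- add_log_p(log_p_big, log_p_big - 40)

sum_near > log_p_big          # TRUE
identical(sum_far, log_p_big) # TRUE
```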

Implementation

We use the Fisher test to assess the statistical significance of the overlap of two gene sets. For our purposes the test determines whether two gene sets share a greater (or, in some cases, smaller) than expected number of common members, assuming a null hypothesis of random membership. The two sets need not be of the same size, but for the purposes of the test they are assumed to have fixed set sizes.

Consider a 2x2 contingency matrix of the following form:

$$\biggl[\begin{matrix}a & b \\ c & d\end{matrix}\biggr]$$

Given a background of observable genes and two gene sets, i and j, that may overlap, this contingency table is used to represent four numbers:

a: the number of genes observed in the background but not in i or j
b: the number of observed genes in i but not j
c: the number of observed genes in j but not i and
d: the number of observed genes in both j and i, i.e. the overlap.
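
As a rough illustration, the four counts can be obtained from two logical membership vectors, such as two columns of a presence/absence matrix. The gene names and set memberships below are hypothetical toy data, not taken from the package.

```r
# Hypothetical toy data: 10 observable genes and two gene sets, i and j.
genes <- paste0("G", 1:10)
in_i  <- genes %in% c("G1", "G2", "G3", "G4")
in_j  <- genes %in% c("G3", "G4", "G5")

# The four cells of the contingency table, named as in the text.
counts <- c(
  a = sum(!in_i & !in_j),  # in the background but in neither set
  b = sum( in_i & !in_j),  # in i but not j
  c = sum(!in_i &  in_j),  # in j but not i
  d = sum( in_i &  in_j)   # in both, i.e. the overlap
)

# Arrange as the 2x2 matrix [a b; c d].
contingency <- matrix(counts, nrow = 2, byrow = TRUE)
contingency
```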

The partial-Fisher p-value, signifying the likelihood of that particular contingency table, is given by:

$$p = \dfrac{(a + b)! (c + d)! (a + c)! (b + d)!}{a! b! c! d! (a+b+c+d)!}$$

This partial p-value is what is returned in the distance matrix when the argument alternative = 4. In most cases it is less than the two-tailed p-value, though it tracks closely with it.
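
The formula can be checked in base R on a small table. The sketch below uses a hypothetical table with a = 5, b = 2, c = 1, d = 2 and the helper name partial_fisher_p, which is not part of GSNA. The result should agree with the hypergeometric point probability from dhyper(), since both describe the probability of one table with fixed margins.

```r
# Partial Fisher p-value computed directly from the factorial formula.
# Only suitable for small counts, because factorial() overflows quickly.
partial_fisher_p <- function(a, b, c, d) {
  numerator   <- factorial(a + b) * factorial(c + d) *
                 factorial(a + c) * factorial(b + d)
  denominator <- factorial(a) * factorial(b) * factorial(c) * factorial(d) *
                 factorial(a + b + c + d)
  numerator / denominator
}

p_partial <- partial_fisher_p(5, 2, 1, 2)

# Same quantity as a hypergeometric probability: overlap d = 2,
# set i has b + d = 4 genes, 6 genes are outside i, set j has c + d = 3 genes.
p_check <- dhyper(2, m = 4, n = 6, k = 3)

all.equal(p_partial, p_check)
```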

The actual single- and two-tailed p-values are derived from this number by summation, keeping the sum of each row and column of the 2x2 contingency matrix constant, as per the assumptions of the Fisher test. For the single-tailed alternative representing the upper-tail ('greater-than') expected overlap of the two gene sets (alternative = 1), the summation starts with d as the observed number of shared members between set i and set j. Then d is incremented toward the maximal possible number of shared genes (the lesser of the number of genes in sets i and j). At each step a, b, and c are adjusted accordingly to keep the row and column sums constant, and the resulting partial p-values are summed.

For the lower-tail ('less-than') alternative (alternative = 2), the summation starts with d as the number of shared members between sets i and j (as with the upper-tail calculation), but then decrements d to 0.

For the 2-tailed alternative, the function sums all the terms with values equal to or less than the partial p-value defined above.
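
The three summations can be sketched in base R as follows. This is an illustration on the untransformed scale, not the package's C++ code: fisher_tails is a hypothetical helper, dhyper() stands in for the partial p-value of each table, and the small relative tolerance in the two-tailed sum is an assumption of this sketch. The example table (a = 5, b = 2, c = 1, d = 2) is also hypothetical.

```r
fisher_tails <- function(a, b, c, d) {
  size_i <- b + d            # genes in set i
  size_j <- c + d            # genes in set j
  n      <- a + b + c + d    # background size

  # Overlaps compatible with the fixed margins. The lower bound is 0
  # unless the two sets are too large to avoid each other.
  d_all <- max(0, size_i + size_j - n):min(size_i, size_j)

  # Partial p-value of every table with the same row and column sums.
  terms <- dhyper(d_all, m = size_i, n = n - size_i, k = size_j)
  p_obs <- terms[d_all == d]

  list(
    upper      = sum(terms[d_all >= d]),                    # d incremented
    lower      = sum(terms[d_all <= d]),                    # d decremented
    two_tailed = sum(terms[terms <= p_obs * (1 + 1e-7)])    # terms <= observed
  )
}

tails <- fisher_tails(5, 2, 1, 2)

# Cross-check against stats::fisher.test on the same table.
tab <- matrix(c(5, 2, 1, 2), nrow = 2, byrow = TRUE)
p_greater   <- fisher.test(tab, alternative = "greater")$p.value
p_less      <- fisher.test(tab, alternative = "less")$p.value
p_two_sided <- fisher.test(tab, alternative = "two.sided")$p.value
```

Taking log() of each element of tails gives values on the same scale as the log p-values described here, although very small p-values would underflow in this untransformed version.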

All calculations are done on log-transformed values to avoid arithmetic underflow:

$$ \ln(p) = \ln((a + b)!) + \ln((c + d)!) + \ln((a + c)!) + \ln((b + d)!) - \ln(a!) - \ln(b!) - \ln(c!) - \ln(d!) - \ln((a + b + c + d)!) $$
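
In base R the same expression can be written with lfactorial(), which returns log-factorials without forming the factorials themselves. The helper name log_partial_fisher_p and the example counts are hypothetical; this is a sketch of the formula, not the package's implementation.

```r
# Log partial Fisher p-value, term by term as in the equation above.
log_partial_fisher_p <- function(a, b, c, d) {
  lfactorial(a + b) + lfactorial(c + d) +
    lfactorial(a + c) + lfactorial(b + d) -
    lfactorial(a) - lfactorial(b) - lfactorial(c) - lfactorial(d) -
    lfactorial(a + b + c + d)
}

# Small table: matches the log of the directly computed value, 0.3.
log_p_small <- log_partial_fisher_p(5, 2, 1, 2)

# Larger table: the plain factorial of the total overflows to Inf,
# but the log-scale calculation remains finite.
overflowed  <- suppressWarnings(is.infinite(factorial(10000)))
log_p_large <- log_partial_fisher_p(9000, 400, 300, 300)

# Cross-check with the log hypergeometric density
# (set i has 700 genes, 9300 are outside i, set j has 600 genes).
log_p_check <- dhyper(300, m = 700, n = 9300, k = 600, log = TRUE)
```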

Since log-transformed p-values cannot be directly added, the so-called log-sum-exponential trick is used to combine them.
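
A minimal base R sketch of the log-sum-exp trick follows. The function name log_sum_exp and the example values are illustrative only; the package performs this step in C++, and its exact form is not shown here.

```r
# log(sum(exp(x))) computed stably by factoring out the largest term.
log_sum_exp <- function(log_values) {
  m <- max(log_values)
  if (is.infinite(m)) return(m)  # all terms are zero (or one is infinite)
  m + log(sum(exp(log_values - m)))
}

# Three log p-values far too small to exponentiate directly.
log_terms <- c(-1000, -1001, -1002)

naive  <- log(sum(exp(log_terms)))  # exp() underflows to 0, so this is -Inf
stable <- log_sum_exp(log_terms)    # finite, slightly above -1000

# On ordinary values the result agrees with direct addition.
check <- log_sum_exp(log(c(0.1, 0.2)))
```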

Details

Fisher p-values have long been used to assess the statistical significance of over- or underrepresentation of a component of a mixture to assess whether a sample is drawn from a particular mixture. The test has also long been used in pathways analysis as a way to assess whether an experimentally derived list of genes contains a statistical overrepresentation of genes from predefined gene sets or modules. Such experimental gene lists may include differentially expressed genes from a transcriptomic experiment, genes possessing promoters with differential chromatin accessibility from ATAC-Seq experiments, genes that were positive in screens of mutants, genes that were identified from GWAS experiments, and genes from other analyses. Likewise, the gene sets or modules are generally drawn from databases of experimentally characterized pathways, sets of genes over- or under-expressed in particular conditions, or associated with particular biological processes, chromosome regions, etc.

In the case of GSNA, we use the Fisher test to assess the overlap of genes not between an experimentally derived gene list and predefined gene sets from a database, but between the predefined gene sets themselves given their observability in a particular experiment.

Examples

library( GSNA )

# Get the background of observable genes set from
# expression data:
gene_background <- toupper(rownames( Bai_empty_expr_mat ))

# Using the sample gene set collection **Bai_gsc.tmod**,
# generate a gene presence-absence matrix filtered for the
# ref.background of observable genes:
presence_absence.mat <-
  makeFilteredGenePresenceAbsenceMatrix( ref.background = gene_background,
                                         geneSetCollection = Bai_gsc.tmod )

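# Name the argument: per the Usage section, the second positional
# argument is e_precision, not alternative.
lf.mat <- scoreLFMatrix_C( presence_absence.mat, alternative = 1 )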
