NetGSA: Network-based Gene Set Analysis

Description

Tests the significance of pre-defined sets of genes (pathways) with respect to an outcome variable, such as a condition indicator (e.g. cancer vs. normal), based on the underlying biological network.

Usage

NetGSA(A1, A2, x, y, B, lklMethod = c("REML", "ML"), directed = FALSE, eta = 0.1,
       lim4kappa = 500)

Arguments

A1

The weighted adjacency matrix for condition 1.

A2

The weighted adjacency matrix for condition 2.

x

The $p \times n$ data matrix.

y

Vector of class indicators of length $n$.

B

The npath by $p$ indicator matrix for pathways.

lklMethod

Method used for likelihood calculation: options are ML (maximum likelihood) or REML (restricted maximum likelihood).

directed

Whether the networks are directed.

eta

Approximation limit for the influence matrix. See 'Details'.

lim4kappa

Limit for condition number (used to adjust eta). See 'Details'.
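The dimensions listed above can be checked on toy inputs before calling the function. The following is a minimal sketch with made-up values; the sizes, the edge weights, and the 1/2 coding of y are assumptions chosen for illustration, and NetGSA itself is not called.

```r
set.seed(1)
p <- 4       # number of genes (illustrative)
n <- 6       # number of samples (illustrative)
npath <- 2   # number of pathways (illustrative)

# Symmetric weighted adjacency matrices for two undirected toy networks
A1 <- matrix(0, p, p)
A1[1, 2] <- A1[2, 1] <- 0.3
A1[3, 4] <- A1[4, 3] <- 0.5
A2 <- A1
A2[1, 2] <- A2[2, 1] <- 0.6

# p x n data matrix: genes in rows, samples in columns
x <- matrix(rnorm(p * n), nrow = p, ncol = n)

# Class indicators of length n (the 1/2 coding is an assumption)
y <- rep(c(1, 2), each = n / 2)

# npath x p pathway indicator matrix: B[k, j] = 1 if gene j is in pathway k
B <- rbind(c(1, 1, 0, 0),
           c(0, 0, 1, 1))

# Dimension checks matching the argument descriptions
stopifnot(
  all(dim(A1) == c(p, p)), all(dim(A2) == c(p, p)),
  all(dim(x) == c(p, n)),
  length(y) == n,
  all(dim(B) == c(npath, p))
)
```

Checks of this kind catch the most common input mistake, a transposed data matrix, before any model fitting starts.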

Value

A list with components:

beta

Vector of fixed effects of length $2p$, of which the first half is for condition 1 and the second half for condition 2.

df

Degrees of freedom for the test statistics.

p.value

P-values for gene sets (pathways).

s2.epsilon

Variance of the random errors $\epsilon$.

s2.gamma

Variance of the random effects $\gamma$.
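As a sketch of how the returned list might be used, the snippet below splits beta into its two condition-specific halves and adjusts the pathway p-values for multiple testing. The list out_mock is a hypothetical stand-in with invented numbers, not real NetGSA output, and the Benjamini-Hochberg adjustment is an optional follow-up step rather than something the function is documented to perform.

```r
p <- 3  # number of genes in this toy example

# Hypothetical stand-in for a NetGSA result (values are invented)
out_mock <- list(
  beta    = c(0.1, 0.4, -0.2, 0.3, 0.9, -0.1),  # length 2p
  p.value = c(0.004, 0.20, 0.03)                # one per pathway
)

# First half of beta belongs to condition 1, second half to condition 2
beta_cond1 <- out_mock$beta[seq_len(p)]
beta_cond2 <- out_mock$beta[p + seq_len(p)]

# Optional: Benjamini-Hochberg adjustment across pathways
p_adj <- p.adjust(out_mock$p.value, method = "BH")
```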

Details

The function NetGSA carries out a Network-based Gene Set Analysis, using the method described in Shojaie and Michailidis (2009) and Shojaie and Michailidis (2010). It differs from Gene Set Analysis (Efron and Tibshirani, 2007) in that it incorporates the underlying biological networks.

The NetGSA method is formulated in terms of a mixed linear model. Let $X$ represent the rearrangement of the data x into an $np \times 1$ column vector. The model is $$X=\Psi \beta + \Pi \gamma + \epsilon$$ where $\beta$ is the vector of fixed effects, and $\gamma$ and $\epsilon$ are the random effects and random errors, respectively. The underlying biological networks are encoded in the weighted adjacency matrices A1 and A2, which determine the influence matrix under each condition. The influence matrices in turn determine the design matrices $\Psi$ and $\Pi$ in the mixed linear model. Formally, the influence matrix under each condition represents the effect of each gene on all the other genes in the network and is calculated from the adjacency matrix (A1 or A2). A small value of eta is used to make sure that the influence matrices are well-conditioned (i.e. their condition numbers are bounded by lim4kappa).
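To make the role of the condition number concrete, here is a rough sketch for a small directed acyclic network. It assumes an influence matrix of the form $(I - A)^{-1}$ and the convention that A[i, j] is the effect of gene j on gene i; both are assumptions made for illustration, and the code does not reproduce how NetGSA builds its influence matrices or how it uses eta.

```r
p <- 3
A <- matrix(0, p, p)
A[2, 1] <- 0.5   # assumed convention: gene 1 affects gene 2
A[3, 2] <- 0.4   # assumed convention: gene 2 affects gene 3

# Assumed form of the influence matrix for a directed acyclic network
Lambda <- solve(diag(p) - A)

# Lambda[3, 1] captures the indirect effect of gene 1 on gene 3 (0.4 * 0.5)
indirect_effect <- Lambda[3, 1]

# 2-norm condition number, compared against the default limit from 'Usage'
lim4kappa <- 500
cond_num <- kappa(Lambda, exact = TRUE)
well_conditioned <- cond_num < lim4kappa
```

A matrix whose condition number exceeds the limit would make the downstream estimates numerically unstable, which is the situation the eta adjustment is meant to prevent.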

The problem is then to test the null hypothesis $\ell\beta = 0$ vs. the alternative $\ell\beta \neq 0$, where $\ell$ is a contrast vector, optimally defined through the underlying networks. The test statistic $T$ for each gene set is then a function of $\beta$, the variances of $\gamma$ and $\epsilon$, the contrast vector $\ell$ and the underlying biological network(s) in both conditions. Under the null hypothesis, $T$ has approximately a $t$-distribution, whose degrees of freedom are estimated using the Satterthwaite approximation method. The fixed effects $\beta$ are estimated by generalized least squares, and the estimate depends on estimates of the variance components of $\gamma$ and $\epsilon$. The variance components ($\sigma^2_{\epsilon}$ and $\sigma^2_{\gamma}$) are estimated using Newton's method after profiling out $\sigma_{\epsilon}$.
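The last step of such a test, turning a statistic and its estimated degrees of freedom into a p-value, can be sketched as follows. The values of t_stat and df_hat are invented for illustration, and the two-sided form is an assumption based on the alternative $\ell\beta \neq 0$; the snippet does not reproduce NetGSA's internal computation.

```r
# Hypothetical test statistics and Satterthwaite-type degrees of freedom
t_stat <- c(pathway1 = 2.5, pathway2 = -0.3)
df_hat <- c(pathway1 = 12.4, pathway2 = 30.1)

# Two-sided p-value from the t-distribution; pt() accepts non-integer df
p_val <- 2 * pt(-abs(t_stat), df = df_hat)
```

Non-integer degrees of freedom are expected here, since the Satterthwaite approximation estimates them from the variance components.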

This function can deal with both directed and undirected networks, which are specified via the option directed. Note that NetGSA uses slightly different procedures to calculate the influence matrices for directed and undirected networks. In the case of undirected networks, the user can still apply NetGSA if only partial information on the adjacency matrices is available. The function covsel provides one way to estimate the weighted adjacency matrices from data based on available network information.

References

Ma, J., Shojaie, A. & Michailidis, G. (2014). Network-based pathway enrichment analysis with incomplete network information, submitted. http://arxiv.org/abs/1411.7919

Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1), Article 22.

Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407-426.

Examples

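set.seed(1)
library(netgsa)  # provides NetGSA() and the netgsaex example data
library(igraph)  # needed to plot the igraph network objects
data(netgsaex)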

##Extract the inputs from the example data
A1 = netgsaex$A1
A2 = netgsaex$A2
B = netgsaex$B
x = netgsaex$x
y = netgsaex$y

##Visualize the networks
par(mar = c(0.5, 0.5, 3, 0.5))
plot(netgsaex$g.alt, vertex.size = 5, vertex.label = NA, main="Network - alt")

par(mar = c(0.5, 0.5, 3, 0.5))
plot(netgsaex$g.null, vertex.size = 5, vertex.label = NA, main="Network - null")

##Fit the model with each likelihood method
out = NetGSA(A1, A2, x, y, B, lklMethod = "REML")
out2 = NetGSA(A1, A2, x, y, B, lklMethod = "ML")
