NetGSA(A1, A2, x, y, B, lklMethod = c("REML", "ML"), directed = FALSE, eta = 0.1,
       lim4kappa = 500)
ML (maximum likelihood) or REML (restricted maximum likelihood).
eta ). See 'Details'.
NetGSA carries out a Network-based Gene Set Analysis, using the method described in Shojaie and Michailidis (2009) and Shojaie and Michailidis (2010). It differs from Gene Set Analysis (Efron and Tibshirani, 2007) in that it incorporates the underlying biological networks.
The NetGSA method is formulated in terms of a mixed linear model. Let $X$ represent the rearrangement of the data x into an $np \times 1$ column vector.
$$X=\Psi \beta + \Pi \gamma + \epsilon$$
where $\beta$ is the vector of fixed effects, and $\gamma$ and $\epsilon$ are the random effects and random errors, respectively. The underlying biological networks are encoded in the weighted adjacency matrices A1 and A2, which determine the influence matrix under each condition. The influence matrices in turn determine the design matrices $\Psi$ and $\Pi$ in the mixed linear model. Formally, the influence matrix under each condition represents the effect of each gene on all the other genes in the network and is calculated from the adjacency matrix (A1 or A2). A small value of eta is used to make sure that the influence matrices are well-conditioned (i.e. their condition numbers are bounded by lim4kappa).
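To make the idea of an influence matrix concrete, here is a minimal sketch in base R. It assumes the construction used for directed acyclic networks in Shojaie and Michailidis (2009), $\Lambda = (I - A)^{-1}$, and a toy 3-gene chain; the helper name influence_dag and the edge orientation are hypothetical, and the sketch does not show how NetGSA itself applies eta or handles undirected networks.

```r
# Toy weighted adjacency matrix for the chain 1 -> 2 -> 3.
# Assumption of this sketch: A[i, j] holds the weight of the edge j -> i.
A <- matrix(0, 3, 3)
A[2, 1] <- 0.5
A[3, 2] <- 0.5

# Hypothetical helper: influence matrix of a DAG, Lambda = (I - A)^{-1}.
# For a DAG this equals I + A + A^2 + ..., i.e. direct plus indirect effects.
influence_dag <- function(A) {
  p <- nrow(A)
  solve(diag(p) - A)
}

Lambda <- influence_dag(A)

# Compare the 2-norm condition number with a bound such as lim4kappa.
lim4kappa <- 500
cond_number <- kappa(Lambda, exact = TRUE)
is_well_conditioned <- cond_number <= lim4kappa
```

In this toy case Lambda[3, 1] picks up the indirect effect of gene 1 on gene 3 through gene 2, even though A[3, 1] is zero.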
The problem is then to test the null hypothesis $\ell\beta = 0$ against the alternative $\ell\beta \neq 0$, where $\ell$ is a contrast vector, optimally defined through the underlying networks. The test statistic $T$ for each gene set is then a function of $\beta$, the variances of $\gamma$ and $\epsilon$, the contrast vector $\ell$ and the underlying biological network(s) in both conditions. Under the null hypothesis, $T$ has approximately a $t$-distribution, whose degrees of freedom are estimated using the Satterthwaite approximation method. The fixed effects $\beta$ are estimated by generalized least squares, and the estimate depends on estimates of the variance components of $\gamma$ and $\epsilon$. The variance components ($\sigma^2_{\epsilon}$ and $\sigma^2_{\gamma}$) are estimated using Newton's method after profiling out $\sigma_{\epsilon}$.
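The generalized least squares step can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's internal code: it takes the variance components as known, assumes $\mathrm{Var}(X) = W = \sigma^2_{\gamma}\Pi\Pi^{T} + \sigma^2_{\epsilon}I$, and uses random toy design matrices. The helper names gls_fit and contrast_stat are hypothetical, and the Satterthwaite degrees of freedom and the Newton updates are omitted.

```r
set.seed(1)

# Hypothetical toy dimensions and design matrices (not derived from a network).
n_obs <- 12; n_fixed <- 3; n_random <- 4
Psi    <- matrix(rnorm(n_obs * n_fixed),  n_obs, n_fixed)
Pi_mat <- matrix(rnorm(n_obs * n_random), n_obs, n_random)

# Variance components, treated as known in this sketch.
sigma2_gamma <- 0.5
sigma2_eps   <- 1
W <- sigma2_gamma * tcrossprod(Pi_mat) + sigma2_eps * diag(n_obs)

# GLS estimate: beta_hat = (Psi' W^-1 Psi)^-1 Psi' W^-1 X.
gls_fit <- function(X, Psi, W) {
  W_inv_Psi <- solve(W, Psi)                    # W^-1 Psi
  beta_cov  <- solve(crossprod(Psi, W_inv_Psi)) # (Psi' W^-1 Psi)^-1
  beta_hat  <- beta_cov %*% crossprod(W_inv_Psi, X)
  list(beta = drop(beta_hat), cov = beta_cov)
}

# Contrast statistic: ell' beta_hat / sqrt(ell' Cov(beta_hat) ell).
contrast_stat <- function(fit, ell) {
  sum(ell * fit$beta) / sqrt(drop(t(ell) %*% fit$cov %*% ell))
}

# Noise-free data, so GLS should recover beta_true up to rounding error.
beta_true <- c(1, -2, 0.5)
fit <- gls_fit(Psi %*% beta_true, Psi, W)

# A contrast that is exactly zero under beta_true: 1 - 2 + 2 * 0.5 = 0.
ell <- c(1, 1, 2)
T_stat <- contrast_stat(fit, ell)
```

Solving against W with solve(W, Psi) avoids forming the explicit inverse of W, which is usually the more stable choice when W is large.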
This function can deal with both directed and undirected networks, which are specified via the option directed. Note that NetGSA uses slightly different procedures to calculate the influence matrices for directed and undirected networks.
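Before choosing a value for directed, it can help to check whether an adjacency matrix is symmetric. The following is a small heuristic sketch in base R; the helper looks_undirected and the toy matrices are hypothetical, and NetGSA is not known to perform this check itself.

```r
# Hypothetical helper: a symmetric weighted adjacency matrix is consistent
# with an undirected network. Dimnames are dropped so that differing row and
# column names do not affect the comparison.
looks_undirected <- function(A) {
  isSymmetric(unname(A))
}

# Toy examples: a symmetric (undirected) and a one-way (directed) 2-gene network.
A_undirected <- matrix(c(0, 0.3,
                         0.3, 0), 2, 2, byrow = TRUE)
A_directed   <- matrix(c(0,   0,
                         0.3, 0), 2, 2, byrow = TRUE)

looks_undirected(A_undirected)
looks_undirected(A_directed)
```

If both A1 and A2 pass the check, directed = FALSE is the natural setting; an asymmetric matrix suggests directed = TRUE.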
In the case of undirected networks, the user can still apply NetGSA if only partial information on the adjacency matrices is available. The function covsel provides one way to estimate the weighted adjacency matrices from data based on available network information.
Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1), Article 22.
Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407-426.
edgelist2adj, covsel
set.seed(1)
library(igraph)
## Load the example data: adjacency matrices, gene set indicator and expression data
data(netgsaex)
A1 = netgsaex$A1
A2 = netgsaex$A2
B = netgsaex$B
x = netgsaex$x
y = netgsaex$y
## Visualize the networks
par(mar = c(0.5, 0.5, 3, 0.5))
plot(netgsaex$g.alt, vertex.size = 5, vertex.label = NA, main = "Network - alt")
par(mar = c(0.5, 0.5, 3, 0.5))
plot(netgsaex$g.null, vertex.size = 5, vertex.label = NA, main = "Network - null")
## Run NetGSA with each likelihood method
out = NetGSA(A1, A2, x, y, B, lklMethod = "REML")
out2 = NetGSA(A1, A2, x, y, B, lklMethod = "ML")