Tests the significance of pre-defined sets of genes (pathways) with respect to an outcome variable, such as a condition indicator (e.g. cancer vs. normal), based on the underlying biological network.
NetGSA(A, x, y, B, lklMethod = c("REML", "ML"), directed = FALSE, eta = 0.1,
lim4kappa = 500)
A list of weighted adjacency matrices.
The \(p \times n\) data matrix.
Vector of class indicators of length \(n\).
The npath by \(p\) indicator matrix for pathways.
Method used for likelihood calculation: options are ML (maximum likelihood) or REML (restricted maximum likelihood).
Whether the networks are directed. By default, directed=FALSE.
Approximation limit for the influence matrix. See 'Details'.
Limit for condition number (used to adjust eta). See 'Details'.
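The expected shapes of the inputs can be illustrated with a small base R sketch. Everything below is hypothetical toy data: the sizes, the 1/2 coding of the class labels, the helper make_adj, and the assumption that each adjacency matrix is \(p \times p\) are illustrative choices, not taken from the package.

```r
set.seed(1)
p <- 6       # number of genes (toy value)
n <- 10      # number of samples (toy value)
npath <- 2   # number of pathways (toy value)

# Data matrix: genes in rows, samples in columns (p x n)
x <- matrix(rnorm(p * n), nrow = p, ncol = n)

# Class indicators, one per sample (length n); the 1/2 coding is an assumption
y <- rep(c(1, 2), each = n / 2)

# Pathway indicator matrix (npath x p): B[i, j] = 1 if gene j belongs to pathway i
B <- matrix(0, nrow = npath, ncol = p)
B[1, 1:3] <- 1
B[2, 4:6] <- 1

# Hypothetical helper: symmetric weighted adjacency matrix of a chain network
make_adj <- function(p, w = 0.3) {
  M <- matrix(0, p, p)
  M[cbind(1:(p - 1), 2:p)] <- w
  M + t(M)
}

# One adjacency matrix per condition, collected in a list
A <- list(make_adj(p), make_adj(p))
```

With objects of these shapes, a call would follow the usage shown above, e.g. NetGSA(A, x, y, B, lklMethod = "REML", directed = FALSE).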
A list with components
Vector of fixed effects of length \(2p\), of which the first half is for condition 1 and the second half for condition 2.
Test statistics for gene sets (pathways).
Degrees of freedom for the test statistics.
P-values for gene sets (pathways).
Variance of the random errors \(\epsilon\).
Variance of the random effects \(\gamma\).
The function NetGSA carries out a Network-based Gene Set Analysis, using the method described in Shojaie and Michailidis (2009) and Shojaie and Michailidis (2010). It differs from Gene Set Analysis (Efron and Tibshirani, 2007) in that it incorporates the underlying biological networks.
The NetGSA method is formulated in terms of a mixed linear model. Let \(X\) represent the rearrangement of data x into an \(np \times 1\) column vector.
$$X=\Psi \beta + \Pi \gamma + \epsilon$$
where \(\beta\) is the vector of fixed effects, \(\gamma\) and \(\epsilon\) are random effects and random errors, respectively. The underlying biological networks are encoded in the weighted adjacency matrices A, which determine the influence matrix under each condition. The influence matrices further determine the design matrices \(\Psi\) and \(\Pi\) in the mixed linear model. Formally, the influence matrix under each condition represents the effect of each gene on all the other genes in the network and is calculated from the adjacency matrix (A[[k]] for the \(k\)-th condition). A small value of eta is used to make sure that the influence matrices are well-conditioned (i.e. their condition numbers are bounded by lim4kappa).
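As a rough illustration of how an influence matrix can be obtained from an adjacency matrix, the sketch below assumes the form \((I - A)^{-1}\), a common choice for directed acyclic networks. It is not necessarily the computation NetGSA performs internally, and it does not reproduce how eta is applied; the toy network and the threshold check are hypothetical.

```r
# Toy directed chain on 3 genes: gene 1 -> gene 2 -> gene 3
# Convention assumed here: A[i, j] is the weight of the edge from gene j to gene i
p <- 3
A <- matrix(0, p, p)
A[2, 1] <- 0.5
A[3, 2] <- 0.4

# Assumed influence matrix: accumulates direct and indirect effects along paths
Lambda <- solve(diag(p) - A)

# For an acyclic network A is nilpotent, so the inverse equals a finite sum
Lambda_series <- diag(p) + A + A %*% A

# 2-norm condition number, compared against a limit in the spirit of lim4kappa
lim4kappa <- 500
cond <- kappa(Lambda, exact = TRUE)
well_conditioned <- cond <= lim4kappa
```

Here Lambda[3, 1] is the product of the two edge weights along the path from gene 1 to gene 3, which is how an indirect effect enters the influence matrix under this assumed form.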
The problem is then to test the null hypothesis \(\ell\beta = 0\) vs. the alternative \(\ell\beta \neq 0\), where \(\ell\) is a contrast vector, optimally defined through the underlying networks. For a two-sample test, the test statistic \(T\) for each gene set is a function of \(\beta\), the variances of \(\gamma\) and \(\epsilon\), the contrast vector \(\ell\) and the underlying biological network(s) in both conditions. Under the null hypothesis, \(T\) has approximately a \(t\)-distribution, whose degrees of freedom are estimated using the Satterthwaite approximation method.
When analyzing complex experiments involving multiple conditions, often multiple contrast vectors of interest are considered for a specific subnetwork. Alternatively, one can combine the contrast vectors into a contrast matrix \(L\), in which case a different test statistic \(F\) is used. Under the null, \(F\) has an F-distribution, whose degrees of freedom are calculated based on the contrast matrix \(L\) as well as the variances of \(\gamma\) and \(\epsilon\).
The fixed effects \(\beta\) are estimated by generalized least squares, and the estimate depends on estimates of the variance components of \(\gamma\) and \(\epsilon\). The variance components (\(\sigma^2_{\epsilon}\) and \(\sigma^2_{\gamma}\)) are estimated using Newton's method after profiling out \(\sigma_{\epsilon}\).
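The generalized least squares step and a single-contrast statistic can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the design matrices are toy stand-ins, the variance components are treated as known rather than estimated, the random effects and errors are assumed independent with covariances \(\sigma^2_{\gamma} I\) and \(\sigma^2_{\epsilon} I\), and the Satterthwaite degrees of freedom are not computed.

```r
set.seed(1)
n_obs <- 12

# Toy fixed-effect design: two groups of six observations (stand-in for Psi)
Psi <- cbind(rep(c(1, 0), each = 6), rep(c(0, 1), each = 6))

# Toy random-effect design with some off-diagonal influence (stand-in for Pi)
Pi <- diag(n_obs)
Pi[cbind(2:n_obs, 1:(n_obs - 1))] <- 0.3

# Variance components, assumed known for this sketch
s2g <- 0.5
s2e <- 1

# Simulate X = Psi beta + Pi gamma + epsilon
beta <- c(1, 2)
X <- Psi %*% beta + Pi %*% rnorm(n_obs, sd = sqrt(s2g)) + rnorm(n_obs, sd = sqrt(s2e))

# Marginal covariance of X implied by the model
W <- s2g * Pi %*% t(Pi) + s2e * diag(n_obs)
Winv <- solve(W)

# GLS estimate of the fixed effects and its covariance
C <- solve(t(Psi) %*% Winv %*% Psi)
beta_hat <- C %*% t(Psi) %*% Winv %*% X

# Contrast comparing the two groups, and the corresponding standardized statistic
ell <- matrix(c(-1, 1), nrow = 1)
T_stat <- drop(ell %*% beta_hat) / sqrt(drop(ell %*% C %*% t(ell)))
```

In practice the variance components would be replaced by their ML or REML estimates, which is why the reference distribution of \(T\) is only approximately a \(t\)-distribution.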
This function can deal with both directed and undirected networks, which are specified via the option directed. Note that NetGSA uses slightly different procedures to calculate the influence matrices for directed and undirected networks.
In the case of undirected networks, the user can still apply NetGSA if only partial information on the adjacency matrices is available. The function covsel provides one way to estimate the weighted adjacency matrices from data based on available network information.
Ma, J., Shojaie, A. & Michailidis, G. (2014). Network-based pathway enrichment analysis with incomplete network information, submitted. http://arxiv.org/abs/1411.7919
Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical applications in genetics and molecular biology, 9(1), Article 22. http://www.ncbi.nlm.nih.gov/pubmed/20597848.
Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407-426. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3131840/
# NOT RUN {
set.seed(1)
## NetGSA with directed networks
## NetGSA with undirected networks
data(netgsaex2)
A = netgsaex2$A
B = netgsaex2$B
x = netgsaex2$x
y = netgsaex2$y
# -Not-run-
# fit = NetGSA(A, x, y, B, lklMethod="REML", directed=FALSE)
# }