NetGSA: Network-based Gene Set Analysis

Description

Tests the significance of pre-defined sets of genes (pathways) with respect to an outcome variable, such as a condition indicator (e.g. cancer vs. normal), based on the underlying biological network.

Usage

NetGSA(A, x, y, B, lklMethod = c("REML", "ML"), directed = FALSE, eta = 0.1, 
       lim4kappa = 500)

Arguments

A

A list of weighted adjacency matrices.

x

The $p \times n$ data matrix.

y

Vector of class indicators of length $n$.

B

The npath by $p$ indicator matrix for pathways.

lklMethod

Method used for likelihood calculation: options are ML (maximum likelihood) or REML (restricted maximum likelihood).

directed

Whether the networks are directed. By default, directed=FALSE.

eta

Approximation limit for the Influence matrix. See 'Details'.

lim4kappa

Limit for condition number (used to adjust eta). See 'Details'.
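The shapes of the four main inputs can be illustrated with a small simulated setup. This is only a sketch: the toy sizes, the helper `make_adj`, the 1/2 coding of `y`, and the chain-shaped network are assumptions made for illustration and do not come from the package.

```r
set.seed(1)
p <- 6       # number of genes (toy size)
n <- 10      # number of samples (toy size)
npath <- 2   # number of pathways (toy size)

# x: p x n data matrix (genes in rows, samples in columns); pure noise here
x <- matrix(rnorm(p * n), nrow = p, ncol = n)

# y: class indicators of length n; coding the two conditions as 1 and 2 is an assumption
y <- rep(c(1, 2), each = n / 2)

# B: npath x p 0/1 indicator matrix, B[i, j] = 1 if gene j belongs to pathway i
B <- matrix(0, nrow = npath, ncol = p)
B[1, 1:3] <- 1
B[2, 4:6] <- 1

# A: list with one p x p weighted adjacency matrix per condition
# make_adj is a hypothetical helper that builds a symmetric chain network
make_adj <- function(p, w = 0.3) {
  M <- matrix(0, p, p)
  M[cbind(1:(p - 1), 2:p)] <- w
  M + t(M)
}
A <- list(make_adj(p), make_adj(p))

# Consistency checks on the dimensions described in 'Arguments'
stopifnot(ncol(x) == length(y),
          nrow(x) == ncol(B),
          all(sapply(A, dim) == p),
          length(A) == length(unique(y)))
```

Checking the dimensions before calling NetGSA tends to catch transposed data matrices early, since x is expected with genes in rows.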

Value

A list with components

beta

Vector of fixed effects of length $2p$, of which the first half is for condition 1 and the second half for condition 2.

teststat

Test statistics for gene sets (pathways).

df

Degrees of freedom for the test statistics.

p.value

P-values for gene sets (pathways).

s2.epsilon

Variance of the random errors $\epsilon$.

s2.gamma

Variance of the random effects $\gamma$.

Details

The function NetGSA carries out a Network-based Gene Set Analysis, using the method described in Shojaie and Michailidis (2009) and Shojaie and Michailidis (2010). It differs from Gene Set Analysis (Efron and Tibshirani, 2007) in that it incorporates the underlying biological networks.

The NetGSA method is formulated in terms of a mixed linear model. Let $X$ represent the rearrangement of the data x into an $np \times 1$ column vector. Then

$$X=\Psi \beta + \Pi \gamma + \epsilon,$$

where $\beta$ is the vector of fixed effects, and $\gamma$ and $\epsilon$ are the random effects and random errors, respectively. The underlying biological networks are encoded in the weighted adjacency matrices A, which determine the influence matrix under each condition. The influence matrices in turn determine the design matrices $\Psi$ and $\Pi$ in the mixed linear model. Formally, the influence matrix under each condition represents the effect of each gene on all the other genes in the network and is calculated from the adjacency matrix (A[[k]] for the $k$-th condition). A small value of eta is used to make sure that the influence matrices are well-conditioned (i.e. their condition numbers are bounded by lim4kappa).
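As a rough illustration of what "well-conditioned" means here, the sketch below builds an influence matrix for a toy directed acyclic network and compares its condition number with lim4kappa. The form $(I - A)^{-1}$ is an assumption taken from the directed acyclic formulation in Shojaie and Michailidis (2009); this page does not state the formula, and the package's actual computation (including how eta enters) may differ.

```r
# Toy directed acyclic network on 4 genes (weights invented for illustration).
# Convention assumed here: A1[i, j] is the weight of the edge from gene j to gene i.
p <- 4
A1 <- matrix(0, p, p)
A1[2, 1] <- 0.5
A1[3, 2] <- 0.4
A1[4, 2] <- -0.3

# Assumed influence matrix for a DAG: (I - A)^{-1}.
# Its (i, j) entry sums the effects of gene j on gene i over all directed paths.
Lambda <- solve(diag(p) - A1)

# 2-norm condition number, compared with the default bound from 'Usage'
kappa_val <- kappa(Lambda, exact = TRUE)
lim4kappa <- 500
well_conditioned <- kappa_val <= lim4kappa
```

In this toy network gene 1 affects gene 3 only through gene 2, so `Lambda[3, 1]` equals the product of the two edge weights along that path.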

The problem is then to test the null hypothesis $\ell\beta = 0$ vs. the alternative $\ell\beta \neq 0$, where $\ell$ is a contrast vector, optimally defined through the underlying networks. For a two-sample test, the test statistic $T$ for each gene set is a function of $\beta$, the variances of $\gamma$ and $\epsilon$, the contrast vector $\ell$ and the underlying biological network(s) in both conditions. Under the null hypothesis, $T$ has approximately a $t$-distribution, whose degrees of freedom are estimated using the Satterthwaite approximation method.

When analyzing complex experiments involving multiple conditions, often multiple contrast vectors of interest are considered for a specific subnetwork. Alternatively, one can combine the contrast vectors into a contrast matrix $L$, in which case a different test statistic $F$ is used. Under the null, $F$ has an F-distribution, whose degrees of freedom are calculated based on the contrast matrix $L$ as well as the variances of $\gamma$ and $\epsilon$.

The fixed effects $\beta$ are estimated by generalized least squares, and the estimate depends on estimates of the variance components of $\gamma$ and $\epsilon$. The variance components ($\sigma^2_{\epsilon}$ and $\sigma^2_{\gamma}$) are estimated using Newton's method after profiling out $\sigma_{\epsilon}$.
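The generalized least squares step can be sketched on simulated data. Everything below is an assumption made for illustration: the design matrices are random rather than network-derived, the variance components are treated as known instead of being estimated, and the Wald-type ratio is the textbook form for a contrast in a mixed linear model, not necessarily the exact statistic computed by NetGSA. The Satterthwaite degrees of freedom are not computed here.

```r
set.seed(1)
n_obs <- 30   # length of the stacked response vector X (toy size)
q <- 3        # number of fixed effects (toy size)
r <- 5        # number of random effects (toy size)

# Toy design matrices standing in for Psi and Pi
Psi <- cbind(1, matrix(rnorm(n_obs * (q - 1)), n_obs))
Pi  <- matrix(rnorm(n_obs * r), n_obs)

# Variance components, treated as known for this sketch
s2_gamma <- 0.5
s2_epsilon <- 1

# Simulate X = Psi beta + Pi gamma + epsilon
beta_true <- c(1, -2, 0.5)
X <- Psi %*% beta_true +
  Pi %*% rnorm(r, sd = sqrt(s2_gamma)) +
  rnorm(n_obs, sd = sqrt(s2_epsilon))

# Marginal covariance of X: W = s2_gamma * Pi Pi' + s2_epsilon * I
W <- s2_gamma * tcrossprod(Pi) + s2_epsilon * diag(n_obs)

# GLS estimate: beta_hat = (Psi' W^{-1} Psi)^{-1} Psi' W^{-1} X
W_inv_Psi <- solve(W, Psi)
C <- solve(crossprod(Psi, W_inv_Psi))          # covariance of beta_hat
beta_hat <- C %*% crossprod(W_inv_Psi, X)

# Wald-type ratio for a contrast vector ell (hypothetical contrast)
ell <- c(0, 1, -1)
T_stat <- drop(ell %*% beta_hat) / sqrt(drop(ell %*% C %*% ell))
```

Solving against `W` with `solve(W, Psi)` avoids forming the explicit inverse of `W`, which is usually the more stable choice when `W` is large.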

This function can deal with both directed and undirected networks, which are specified via the option directed. Note that NetGSA uses slightly different procedures to calculate the influence matrices for directed and undirected networks. In the case of undirected networks, the user can still apply NetGSA if only partial information on the adjacency matrices is available. The function covsel provides one way to estimate the weighted adjacency matrices from data based on available network information.

References

Ma, J., Shojaie, A. & Michailidis, G. (2014). Network-based pathway enrichment analysis with incomplete network information, submitted. http://arxiv.org/abs/1411.7919

Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1), Article 22. http://www.ncbi.nlm.nih.gov/pubmed/20597848

Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407-426. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3131840/

Examples

# NOT RUN {
library(netgsa)  # provides NetGSA and the example data set netgsaex2
set.seed(1)

## NetGSA with directed networks

## NetGSA with undirected networks
data(netgsaex2)

A = netgsaex2$A
B = netgsaex2$B
x = netgsaex2$x
y = netgsaex2$y

# -Not-run-
# fit = NetGSA(A, x, y, B, lklMethod="REML", directed=FALSE)

# }
