goana: Gene Ontology or KEGG Pathway Analysis

Description

Test for over-representation of gene ontology (GO) terms or KEGG pathways in one or more sets of genes, optionally adjusting for abundance or gene length bias.

Usage

## S3 method for class 'MArrayLM':
goana(de, coef = ncol(de), geneid = rownames(de), FDR = 0.05, trend = FALSE, ...)
## S3 method for class 'default':
goana(de, universe = NULL, species = "Hs", prior.prob = NULL, covariate=NULL,
      plot=FALSE, \dots)
## S3 method for class 'MArrayLM':
kegga(de, coef = ncol(de), geneid = rownames(de), FDR = 0.05, trend = FALSE, ...)
## S3 method for class 'default':
kegga(de, universe = NULL, species = "Hs", species.KEGG = NULL, convert = FALSE,
      gene.pathway = NULL, pathway.names = NULL,
      prior.prob = NULL, covariate=NULL, plot=FALSE, \dots)
getGeneKEGGLinks(species.KEGG = "hsa", convert = FALSE)
getKEGGPathwayNames(species.KEGG = NULL, remove.qualifier = FALSE)

Arguments

a vector of Entrez Gene IDs, or a list of such vectors, or an MArrayLM fit object.

coef

column number or column name specifying for which coefficient or contrast differential expression should be assessed.

geneid

Entrez Gene identifiers. Either a vector of length nrow(de) or the name of the column of de$genes containing the Entrez Gene IDs.

FDR

false discovery rate cutoff for differentially expressed genes. Numeric value between 0 and 1.

species

two-letter Bioconductor species identifier. Possible values are "Hs", "Mm", "Rn", "Dm" or "Pt". Ignored if species.KEGG or is not NULL or if gene.pathway and pathway.names are not NULL.

species.KEGG

three-letter KEGG species identifier. See http://www.kegg.jp/kegg/catalog/org_list.html or http://rest.kegg.jp/list/organism for possible values. Ignored if gene.pathway and pathway.names are not NULL.

convert

if TRUE then KEGG gene identifiers will be converted to NCBI Entrez Gene identifiers. Note that KEGG IDs are the same as Entrez Gene IDs for most species anyway.

gene.pathway

data.frame linking genes to pathways. First column gives gene IDs, second column gives pathway IDs. By default this is obtained automatically by getGeneKEGGLinks(species.KEGG).

remove.qualifier

if TRUE, the species qualifier will be removed from the pathway names.

pathway.names

data.frame giving full names of pathways. First column gives pathway IDs, second column gives pathway names. By default this is obtained automatically using getKEGGPathwayNames(species.KEGG, remove=TRUE).

trend

adjust analysis for gene length or abundance? Can be logical, or a numeric vector of covariate values, or the name of the column of de$genes containing the covariate values. If TRUE, then de$Amean is used as the covariate.

universe

vector specifying the set of Entrez Gene identifiers to be the background universe. If NULL then all Entrez Gene IDs associated with any gene ontology term will be used as the universe.

prior.prob

optional numeric vector of the same length as universe giving the prior probability that each gene in the universe appears in a gene set. Will be computed from covariate if the latter is provided. Ignored if universe is NULL.

covariate

optional numeric vector of the same length as universe giving a covariate against which prior.prob should be computed. Ignored if universe is NULL.

plot

logical, should the prior.prob vs covariate trend be plotted?

...

any other arguments in a call to the MArrayLM method are passed to the default method.

Value

The goana default method produces a data frame with a row for each GO term and the following columns:
TermGO term.
Ontontology that the GO term belongs to. Possible values are "BP", "CC" and "MF".
Nnumber of genes in the GO term.
DEnumber of genes in the DE set.
P.DEp-value for over-representation of the GO term in the set.
The last two column names above assume one gene set with the name DE. In general, there will be a pair of such columns for each gene set and the name of the set will appear in place of "DE". The goana method for MArrayLM objects produces a data frame with a row for each GO term and the following columns:
TermGO term.
Ontontology that the GO term belongs to. Possible values are "BP", "CC" and "MF".
Nnumber of genes in the GO term.
Upnumber of up-regulated differentially expressed genes.
Downnumber of down-regulated differentially expressed genes.
P.Upp-value for over-representation of GO term in up-regulated genes. Not adjusted for multiple testing.
P.Downp-value for over-representation of GO term in down-regulated genes. Not adjusted for multiple testing.
The row names of the data frame give the GO term IDs. The output from kegga is the same except that row names become KEGG pathway IDs, Term becomes Pathway and there is no Ont column.

Details

These functions performs a over-representation analysis for Gene Ontology terms or KEGG pathways in a list of Entrez Gene IDs. The default method accepts the gene list as a vector of gene IDs, while the MArrayLM method extracts the gene lists automatically from a linear model fit object. goana uses annotation from the appropriate Bioconductor organism package. goana can be used for any of the five species specified. kegga reads KEGG pathway annotation from the KEGG website. Note that the species name can be provided in either Bioconductor or KEGG format. kegga can be used for any species supported by KEGG, of which there are more than 14,000 possibilities. By default, kegga obtains the KEGG annotation for the specified species from the http://rest.kegg.jp website. Alternatively one can supply the required pathway annotation to kegga in the form of two data.frames. If this is done, then an internet connection is not required. The ability to supply data.frame annotation to kegga means that kegga can in principle be used to analyze any user-supplied gene sets. The default goana and kegga methods accept a vector prior.prob giving the prior probability that each gene in the universe appears in a gene set. This vector can be used to correct for unwanted trends in the differential expression analysis associated with gene length, gene abundance or any other covariate. The MArrayLM object computes the prior.prob vector automatically when trend is non-NULL. If prior.prob=NULL, the function computes one-sided hypergeometric tests equivalent to Fisher's exact test. If prior probabilities are specified, then a test based on the Wallenius' noncentral hypergeometric distribution is used to adjust for the relative probability that each gene will appear in a gene set, following the approach of Young et al (2010). The MArrayLM methods performs over-representation analyses for the up and down differentially expressed genes from a linear model analysis. In this case, the universe is all the genes found in the fit object. trend=FALSE is equivalent to prior.prob=NULL. If trend=TRUE or a covariate is supplied, then a trend is fitted to the differential expression results and this is used to set prior.prob. The statistical approach provided here is the same as that provided by the goseq package, with one methodological difference and a few restrictions. Unlike the goseq package, the gene identifiers here must be Entrez Gene IDs and the user is assumed to be able to supply gene lengths if necessary. The goseq package has additional functionality to convert gene identifiers and to provide gene lengths. The only methodological difference is that goana and kegga computes gene length or abundance bias using tricubeMovingAverage instead of monotonic regression. While tricubeMovingAverage does not enforce monotonicity, it has the advantage of numerical stability when de contains only a small number of genes.

References

Young, M. D., Wakefield, M. J., Smyth, G. K., Oshlack, A. (2010). Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology 11, R14. http://genomebiology.com/2010/11/2/R14

Examples

Run this code

## Linear model usage:

fit <- lmFit(y, design)
fit <- eBayes(fit)

# Standard GO analysis
go.fisher <- goana(fit, species="Hs")
topGO(go.fisher, sort = "up")
topGO(go.fisher, sort = "down")

# GO analysis adjusting for gene abundance
go.abund <- goana(fit, geneid = "GeneID", trend = TRUE)
topGO(go.abund, sort = "up")
topGO(go.abund, sort = "down")

# GO analysis adjusting for gene length bias
# (assuming that y$genes$Length contains gene lengths)
go.len <- goana(fit, geneid = "GeneID", trend = "Length")
topGO(go.len, sort = "up")
topGO(go.len, sort = "down")

## Default usage with a gene list:

go.de <- goana(list(DE1 = EG.DE1, DE2 = EG.DE2, DE3 = EG.DE3))
topGO(go.de, sort = "DE1")
topGO(go.de, sort = "DE2")
topGO(go.de, ontology = "BP", sort = "DE3")
topGO(go.de, ontology = "CC", sort = "DE3")
topGO(go.de, ontology = "MF", sort = "DE3")

## Standard KEGG analysis
k <- kegga(fit, species="Hs")
k <- kegga(fit, species.KEGG="hsa") # equivalent to previous

Run the code above in your browser using DataLab