goseq Gene Ontology analyser
Does selection-unbiased testing for category enrichment amongst differentially expressed (DE) genes for RNA-seq data. By default, tests gene ontology (GO) categories, but any categories may be tested.
goseq(pwf, genome, id, gene2cat = NULL, test.cats=c("GO:CC", "GO:BP", "GO:MF"), method = "Wallenius", repcnt = 2000, use_genes_without_cat=FALSE)
An object containing gene names, DE calls, the probability weighting function. Usually generated by
A string identifying the genome that
genesrefer to. For a list of supported organisms run
A string identifying the gene identifier used by
genes. For a list of supported gene IDs run
A data frame with two columns containing the mapping between genes and the categories of interest. Alternatively, a list where the names are genes and each entry is a vector containing GO categories associated with that gene (this is the output produced by
getgo). If set to
NULLgoseq attempts to fetch GO categories automatically using
- A vector specifying which categories to test for over representation amongst DE genes. See details for allowed options.
- The method to use to calculate the unbiased category enrichment scores. Valid options are "Wallenius", "Sampling" & "Hypergeometric". "Hypergeometric" and "Sampling" should almost never be used (see details).
Number of random samples to be calculated when random sampling is used. Ignored unless
- A boolean to indicate whether genes without a categorie should still be used. For example, a large number of gene may have no GO term annotated. If this option is set to FALSE, those genes will be ignored in the calculation of p-values (default behaviour). If this option is set to TRUE, then these genes will count towards the total number of genes outside the category being tested (default behaviour prior to version 1.15.2).
pwf argument is almost always the output of the function
nullp. This is a data frame with 3 columns, named "DEgenes", "bias.data" and "pwf" with the rownames set to the gene names. Each row corresponds to a gene with the DEgenes column specifying if the gene is DE (1 for DE, 0 for not DE), the bias.data column giving the numeric value of the DE bias being accounted for (usually the gene length or number of counts) and the pwf column giving the genes value on the probability weighting function.
goseq obtains length data from UCSC and GO mappings from the organim packages (see
getlength for details). If your data is in an unsupported format you will need to obtain the GO category mapping and supply them to the
goseq function using the
To use your own gene to category mapping with
goseq, use the
gene2cat arguement. This arguement takes a data.frame, with one column containing gene IDs and the other containing the associated categories. As the mapping from gene <-> category is in general many to many there will be multiple rows containing the same gene identifier. Alternatively,
gene2cat can take a list, where the names are the genes and the entries are the GO categories associated with the genes. This is the format produced by the
getgo function and is more space efficient than the data.frame representation.
gene2cat is left as
goseq attempts to use
getgo to fetch GO catgeory to gene identifier mappings.
The PWF is usually calculated using the
nullp function to correct for length bias. However,
goseq will work with any vector of weights. Any bias can be accounted for so long as a weight for each gene is supplied using this arguement.
NAs are allowed in the "pwf" and "bias.data" columns of the PWF data frame (these usually occur as a result of missing length data for some genes). Any entry which is
NA is set to the weighting of the median gene.
Valid options for the
test.cats arguement are any combination of "GO:CC", "GO:BP", "GO:MF" & "KEGG". The three GO terms refer to the Cellular Component, Biological Process and Molecular Function respectively. "KEGG" refers to KEGG pathways.
The three methods, "Wallenius", "Sampling" & "Hypergeometric", calculate the p-values as follows.
"Wallenius" approximates the true distribution of numbers of members of a category amongst DE genes by the Wallenius non-central hypergeometric distribution. This distribution assumes that within a category all genes have the same probability of being chosen. Therefore, this approximation works best when the range in probabilities obtained by the probability weighting function is small. "Wallenius" is the recommended method for calculating p-values.
"Sampling" uses random sampling to approximate the true distribution and uses it to calculate the p-values for over (and under) representation of categories. In practice, its use quickly becomes computationally prohibitive because
repcnt would need to be set very high for most applications.
CAUTION: "Hypergeometric" should NEVER be used for producing results for biological interpretation. If there is genuinly no bias in power to detect DE in your experiment, the PWF will reflect this and the other methods will produce accuracte results.
"Hypergeometric" assumes there is no bias in power to detect differential expression at all and calculates the p-values using a standard hypergeometric distribution. Useful if you wish to test the effect of selection bias on your results.
goseq returns a data frame with several columns. The first column gives the name of the category, the second gives the p-value for the associated category being over represented amongst DE genes. The third column gives the p-value for the associated category being under represented amongst DE genes. The p-values have not been corrected for multiple hypothesis testing. The fourth and fifth columns give the number of differentially expressed genes in the category and total genes in the category respectively. If any of the categories was a GO term, there will be two additional columns for the GO term and its ontology.
Young, M. D., Wakefield, M. J., Smyth, G. K., Oshlack, A. (2010) Gene ontology analysis for RNA-seq: accounting for selection bias Genome Biology Date: Feb 2010 Vol: 11 Issue: 2 Pages: R14
data(genes) pwf <- nullp(genes,'hg19','ensGene') pvals <- goseq(pwf,'hg19','ensGene') head(pvals)