buildSetCollection: Create a gene set collection

Description

Builds an object containing the collection of all gene sets to be used by the setRankAnalysis function.

Usage

buildSetCollection(..., referenceSet = NULL, maxSetSize = 500)

Arguments

referenceSet

Optional, but very strongly recommended. A vector of gene IDs specifying the background gene set against which to test for over-representation of gene sets. The default is to use all genes present in the supplied gene annotation tables. However, many expe

maxSetSize

The maximum number of genes in a gene set. Any gene sets with more genes will not be considered during the analysis.

...

One or more data frame objects containing the annotation of genes with pathway identifiers and descriptions. The idea is to provide one data frame per pathway database. Several gene set databases are provided in the organism-specific GeneSets packages. A
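How these three arguments interact can be illustrated with a minimal base-R sketch. This is not the package's implementation, only an approximation under stated assumptions: the column names termID, geneID, termName, and dbName follow the Examples section, the toy data and variable names are hypothetical, and the order of the two filtering steps is assumed.

```r
# Hypothetical annotation table using the column names from the Examples section.
annotation <- data.frame(
  termID   = c("set_A", "set_A", "set_A", "set_B", "set_B"),
  geneID   = c("g1", "g2", "g3", "g2", "g9"),
  termName = c("alpha", "alpha", "alpha", "beta", "beta"),
  dbName   = "dummyDB",
  stringsAsFactors = FALSE
)
referenceSet <- c("g1", "g2", "g3", "g4")  # background genes (toy data)
maxSetSize   <- 2                          # deliberately tiny for illustration

# Assumption: only annotated genes that belong to the background are kept.
inReference <- annotation[annotation$geneID %in% referenceSet, ]

# One character vector of gene IDs per term.
sets <- split(inReference$geneID, inReference$termID)

# Separate the sets that exceed the size limit from those that are retained.
setSizes <- lengths(sets)
bigSets  <- names(sets)[setSizes > maxSetSize]
kept     <- sets[setSizes <= maxSetSize]
```

In this toy case, gene g9 is dropped because it is not part of the background, and set_A is set aside because it holds more genes than the chosen maxSetSize.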

Value

A gene set collection, which is a list object containing the following fields:
- maxSetSize: The maximum set size applied when constructing the collection.
- referenceSet: A vector listing all gene IDs that are part of the reference.
- sets: A list of vectors. The list names are the pathway IDs as supplied in the termID column of the annotation frame(s). Each vector contains all gene IDs of the gene set and has three attributes set: ID, name, and db, which correspond respectively to the termID, termName, and dbName fields of the annotation frame.
- g: The size of the reference set.
- bigSets: A list of pathway IDs of gene sets with sizes bigger than the specified maximum set size.
- intersection.p.cutoff: The p-value cutoff used to determine which intersections of pairs of gene sets (see Details) are significant.
- intersections: A data frame listing all significant intersections together with the p-value.
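To give a feel for what a significant intersection might look like, the following sketch scores the overlap of two gene sets with a hypergeometric tail probability. The test statistic is an assumption made for illustration and the helper intersectionPValue is hypothetical; the package may compute its p-values differently.

```r
# Hypothetical helper: probability of seeing an overlap at least this large
# between two sets of sizes m and n drawn from a reference of g genes.
intersectionPValue <- function(overlap, m, n, g) {
  # P(X >= overlap), X ~ Hypergeometric(m successes, g - m failures, n draws)
  phyper(overlap - 1, m, g - m, n, lower.tail = FALSE)
}

setA <- sprintf("gene_%02d", 1:10)
setB <- sprintf("gene_%02d", 6:15)
g    <- 50  # size of the reference set (toy value)

p <- intersectionPValue(length(intersect(setA, setB)),
                        length(setA), length(setB), g)
```

Here the two sets share 5 genes, whereas about 2 would be expected by chance for two sets of 10 genes in a reference of 50. Under this assumed test, a pair would be reported when p falls below intersection.p.cutoff.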

Execution time

This function typically takes some time to execute because it pre-calculates all significant intersections between pairs of gene sets in the collection. An intersection between two gene sets is considered significant if it contains more elements than expected by chance, given the sizes of both sets.

Computation can be sped up dramatically by running this function on multiple CPU cores. To do so, set the mc.cores option to the desired number of cores, like so: options(mc.cores = 4)

Performing this calculation beforehand makes it possible to reuse the same setCollection object for different analyses. It is therefore recommended to separate the creation of the setCollection object and the actual analysis into different scripts. Once the collection is created, it can be stored on disk using the save command. The analysis script can then load the collection using the load command.
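The recommended split into a set-up script and an analysis script might look roughly like the sketch below. The file name and the stand-in collection object are assumptions made so that the snippet runs without the package; in practice the object would come from buildSetCollection.

```r
# Request four worker processes for the pre-calculation; pick a value that
# suits the machine at hand.
options(mc.cores = 4)

# Stand-in for the result of buildSetCollection() (hypothetical contents).
collection <- list(maxSetSize = 500,
                   referenceSet = c("gene_01", "gene_02"))

# Set-up script: store the collection on disk.
collectionFile <- file.path(tempdir(), "collection.RData")
save(collection, file = collectionFile)

# Analysis script: restore the object under its original name.
rm(collection)
load(collectionFile)
```

Because load restores objects under the names they were saved with, the analysis script can refer to collection directly after the call.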

Examples

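library(SetRank)

# Use a single core for this small example
options(mc.cores=1)
# Background of 50 dummy genes
referenceSet = sprintf("gene_%02d", 1:50)
# Nine overlapping gene sets of five genes each, sampled from sliding windows
geneSets = lapply(1:9, function(i) sample(referenceSet[((i-1)*5):((i+1)*5)], 5))
# Annotation table with one row per gene/set pair
annotationTable = data.frame(termID=sprintf("set_%02d", rep(1:9, each=5)), 
        geneID=unlist(geneSets),
        termName = sprintf("dummy gene set %d", rep(1:9, each=5)),
        dbName = "dummyDB",
        description = "A dummy gene set DB for testing purposes")
# Build the collection against the explicit background
collection = buildSetCollection(annotationTable, referenceSet=referenceSet)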
