GO_analyse: Identifies gene ontologies clustering samples according to predefined factor.

Description

Combines gene expression data with Gene Ontology (GO) annotations to rank and visualise genes and GO terms enriched for genes best clustering predefined groups of samples based on gene expression levels.

This methods semi-automatically retrieves the latest information from Ensembl using the biomaRt package, except if custom GO annotations are provided. Custom GO annotations have two main benefits: firstly they allow the analysis of species not supported in the Ensembl BioMart server, and secondly they save time skipping calls to the Ensembl BioMart server for species that are supported. The latter also presents the possiblity of using an older release of the Ensembl annotations, for example.

Using default settings, a random forest analysis is performed to evaluate the ability of each gene to cluster samples according to a predefined grouping factor (one-way ANOVA available as an alteranative). Each GO term is scored and ranked according to the average rank (alternatively, average power) of all associated genes to cluster the samples according to the factor. The ranked list of GO terms is returned, with tools allowing to visualise the statistics on a gene- and ontology-basis.

Usage

GO_analyse( eSet, f, subset=NULL, biomart_dataset="", microarray="", method="randomForest", rank.by="rank", do.trace=100, ntree=1000, mtry=ceiling(2*sqrt(nrow(eSet))), GO_genes=NULL, all_GO=NULL, all_genes=NULL, FUN.GO=mean, ...)

Arguments

eSet

ExpressionSet of the Biobase package including a gene-by-sample expression matrix in the AssayData slot, and a phenotypic information data-frame in the phenodate slot. In the expression matrix, row names are identifiers of expressed features, and column names are identifiers of the individual samples. In the phenotypic data-frame, row names are sample idenfifiers, column names are grouping factors and phenotypic traits usable for statistical tests and visualisation methods.

A column name in phenodata used as the grouping factor for the analysis.

subset

A named list to subset eSet for the analysis. Names must be column names existing in colnames(pData(eSet)). Values must be vectors of values existing in the corresponding column of pData(eSet). The original ExpressionSet will be left unchanged.

biomart_dataset

The Ensembl BioMart dataset identifier corresponding to the species studied. If not specified and no custom annotations were provided, the method will attempt to automatically identify the adequate dataset from the first feature identifier in the dataset. Use data(prefix2dataset) to access a table listing valid choices.

microarray

The identifier in the Ensembl BioMart corresponding to the microarray platform used. If not specified and no custom annotations were provided, the method will attempt to automatically identify the platform used from the first feature identifier in the dataset. Use data(microarray2dataset) to access a table listing valid choices.

method

The statistical framework to score genes and gene ontologies. Either "randomForest" or "rf" to use the random forest algorithm, or alternatively either of "anova" or "a" to use the one-way ANOVA model. Default is "randomForest".

rank.by

Either of "rank" or "score" to chose the metric used to order the gene and GO term result tables. Default to 'rank'.

do.trace

Only used if method="randomForest". If set to TRUE, gives a more verbose output as randomForest is run. If set to some integer, then running output is printed for every do.trace trees. Default is 100.

ntree

Only used if method="randomForest". Number of trees to grow. This should be set to a number large enough to ensure that every input row gets predicted at least a few times

mtry

Only used if method="randomForest". Number of features randomly sampled as candidates at each split. Default value is 2*sqrt(gene_count) which is approximately 220 genes for a dataset of 12,000 genes.

GO_genes

Custom annotations associating features present in the expression dataset to gene ontology identifiers. This must be provided as a data-frame of two columns, named gene_id and go_id. If provided, no call to the Ensembl BioMart server will be done, and arguments all_GO and all_genes should be provided as well, to enable all downstream features of GOexpress. An example is provided in AlvMac_GOgenes.

all_GO

Custom annotations used to annotate each GO identifier present in GO_genes with the ontology name (e.g. "apoptotic process") and namespace (i.e. "biological_process", "molecular_function", or "cellular_component"). This must be provided as a data-frame containing at least one column named go_id, and preferably two more columns named name_1006 and namespace_1003 for consistency with the Ensembl BioMart. Supported alternative column headers are name and namespace. Respectively, name should be used to provide the description of the GO term, and namespace should contain one of "biological_process", "molecular_function" and "cellular_component". name is used to generate the title of ontology-based figured, and namespace is important to enable subsequent filtering of results by their corresponding value. An example is provided in data(AlvMac_allGO).

all_genes

Custom annotations used to annotate each feature identifier in the expression dataset with the gene name or symbol (e.g. "TNF"), and an optional description. This must be provided as a data-frame containing at least a column named gene_id and preferably two more columns named external_gene_name and description for consistency with the Ensembl BioMart. A supported alternative header is name. While external_gene_name is important to enable subsequent visualisation of results by gene symbol, description is only displayed for readability of result tables. An example is provided in data(AlvMac_allgenes).

FUN.GO

Function to summarise the score and rank of all feature associated with each gene ontology. Default is mean function. If using "lambda-like" (anonymous) functions, these must take a list of numeric values as an input, and return a single numeric value as an output.

...

Additional arguments passed on to the randomForest() method, if applicable.

Value

GO: A table ranking all GO terms related to genes in the expression dataset based on the average ability of their related genes to cluster the samples according to the predefined grouping factor.
mapping: The table mapping genes present in the dataset to GO terms.
genes: A table ranking all genes present in the expression dataset based on their ability to cluster the samples according to the predefined grouping factor.
factor: The predefined grouping factor.
method: The statistical framework used.
subset: The filters used to run the analysis only on a subet of the samples. NULL if no filter was applied.
rank.by: The metric used to rank order the genes and gene ontologies.
FUN.GO: The function used to summarise the score and rank of all gene features associated with each gene ontology.
ntree: Number of trees grown.
mtry: Number of variables randomly sampled as candidates at each split.

Warning

Make sure that the factor f is an actual factor in the R language meaning. This is important for the underlying statistical framework to identify the groups of samples defined by their level of this factor. If the column defining the factor (e.g. "Treatment") in phenodata is not an R factor, use pData(targets)$Treatment = factor(pData(targets)$Treatment) to convert the character values into an actual R factor with appropriate levels.

Details

The default scoring functions strongly favor GO terms associated with fewer genes at the top of the ranking. This bias may actually be seen as a valuable feature which enables the user to browse through GO terms of increasing "granularity", i.e. associated with increasingly large sets of genes, although consequently being increasingly vague and blurry (e.g. "protein binding" molecular function associated with over 6,000 genes). It is suggested to use the subset_scores() function to subsequently filter out GO terms with fewer than 5+ genes associated with it. Indeed, those GO terms are more sensitive to outlier genes as they were scored on the average of a handful of genes. Additionally, the pValue_GO function may be used to generate a permutation-based P-value indicating the chance of seeing each GO term reaching an equal or higher rank -- or score -- by chance.

Examples

Run this code

# Load example data subset
data(AlvMac)
# Load a local copy of annotations obtained from the Ensembl Biomart server
data(AlvMac_GOgenes)
data(AlvMac_allgenes)
data(AlvMac_allGO)

# Run the analysis on factor "Treatment",
# considering only treatments "MB" and "TB" at time-point "48H"
# using a local copy of annotations obtained from the Ensembl BioMart server
AlvMac_results <- GO_analyse(  
    eSet=AlvMac, f="Treatment",
    subset=list(Time=c("48H"), Treatment=c("MB", "TB")),
    GO_genes=AlvMac_GOgenes, all_genes=AlvMac_allgenes, all_GO=AlvMac_allGO
    )

# Valid Ensembl BioMart datasets are listed in the following variable
data(prefix2dataset)

# Valid microarray= values are listed in the following variable
data(microarray2dataset)

## Not run: 
# # Other valid but time-consuming examples:
# 
# # Run the analysis on factor "Treatment" including all samples
# GO_analyse(eSet=AlvMac, f="Treatment")
# 
# # Run the analysis on factor "Treatment" using ANOVA method
# GO_analyse(eSet=AlvMac, f="Treatment", method="anova")
# 
# 
# # Use alternative GO scoring/summarisation functions (Default is: average)
# 
# # Named functions
# GO_analyse(eSet=AlvMac, f="Treatment", FUN.GO = median)
# 
# # Anonymous functions (simple example without scientific value)
# GO_analyse(eSet=AlvMac, f="Treatment", FUN.GO = function(x){median(x)/100})
# 
# 
# # Syntax examples without actual data:
# 
# # To force the use of the Ensembl BioMart for the human species, use:
# GO_analyse(eSet, f, biomart_dataset = "hsapiens_gene_ensembl")
# 
# # To force use of the bovine affy_bovine microarray annotations use:
# GO_analyse(eSet, f, microarray = "affy_bovine")
# ## End(Not run)

Run the code above in your browser using DataLab