adSplit: Annotation-Driven Splits

Description

This function searches for annotation-driven splits of patients in microarray data. A split is a partitioning of patients into two groups. In order to do so it refers to GO terms and KEGG pathways. In addition, a significance measure can be computed by simulating a random distribution of scores. DLD-scores are used to judge the quality of a split.

Usage

adSplit(mydata, annotation.ids, chip.name,  min.probes = 20, max.probes = NULL,  B = NULL, min.group.size = 5, ngenes = 50,  ignore.genes = 5)

Arguments

mydata

either an expression set as defined by the package Biobase or a matrix of expression levels (rows=genes, columns=samples).

annotation.ids

a vector of GO or KEGG identifiers in the form "GO:..." or "KEGG:..." respectively. The prefix "KEGG:" is removed from the KEGG-identifiers before accessing the chip's "...PATH2PROBES" hash.

chip.name

the name of the chip by which the expression set is measured. adSplit attempts to load a library of the same name and expects to find a hash called "GO2ALLPROBES" and one called "PATH2PROBES" there.

min.probes

annotation identifiers with fewer than this associated genes are skipped.

max.probes

annotation identifiers with more than this associated genes are skipped. The default is ten percent of the genes on the chip.

the number of random gene set samplings to be performed to compute empirical p-values.

min.group.size

filter criteria to avoid splits suggesting tiny groups. Splits where one of the two suggested groups are smaller than this number are removed from the split set.

ngenes

number of genes used to compute DLD scores.

ignore.genes

number of best scoring genes to be ignored when computing DLD scores.

Value

cuts: a matrix of split attributions. One row per annotation identifier (GO term or KEGG pathway for which a split has been generated. One column per object in the dataset.
score: one score per generated split.
pvalue: one empirical p-value per generated split, or NULL
qvalue: one q-value computed according Benjamini-Hochberg's correction for multiple testing per generated split, or NULL

Details

This function applies the same splitting procedure to all annotation identifiers provided. Firstly, the associated genes for one identifier are determined and extracted from the expression data. Then the diana2means function is applied to the restricted data and the different splits generated are collected into a single splitSet object.

As annotation identifiers vectors of identifiers of the KEGG:nnnnn and GO:nnnnnn are valid. In addition, the keywords "KEGG", "GO" and "all" are allowed, representing all terms in the corresponding ontology.

If B is set to a integer number this number of samplings are used to generate a null-distribution of DLD-scores. This distribution is used to compute empirical p-values for each split. If more than one valid split is found, multiple testing is corrected for by applying Benjamini-Hochbergs correction from the multtest package.

Examples

Run this code

# prepare data
library(golubEsets) 
data(Golub_Merge) 

# generate annotation-driven splits for apoptosis and signal transduction
x <- adSplit(Golub_Merge, "GO:0006915", "hu6800")
x <- adSplit(Golub_Merge, c("GO:0007165","GO:0006915"), "hu6800", max.probes=7000)

# generate a split for glutamate metabolism including 
# an empirical p-value
x <- adSplit(Golub_Merge, "KEGG:00251", "hu6800", B=100)

## Not run: 
# # generate splits for all KEGG pathways.
# x <- adSplit(Golub_Merge, "KEGG", "hu6800")
# image(x)
# ## End(Not run)

Run the code above in your browser using DataCamp Workspace