subtype: Cluster analysis to find molecular subtypes and their assessment

Description

subtype performs a biclustering procedure on a input dataset and assess whether resulting clusters are promising subtypes.

Usage

subtype(GEset, outcomeLabels, treatment=NULL, Npermutes=10, Nchunks = 25, minClusterSizeB = 20, NclustersASet = 100, FDRpermutation = TRUE, nFDRperm = 50, seed = NULL, testMode="quick",survivaltimes=NULL,method="penalized", top_best_probes=100, Niter=20, showMovie=0, redefineSubtypeMembers=0,holdOut=10 )

Arguments

GEset

p-by-n data matrix, where p is the number of variables (e.g. genes) and n is the number of subjects. Row and column names are necessary.

outcomeLabels

n-by-1 vector. Binary prognosis labels assigned to the subjects. The order of subjects should be equalized to that of GEset.

treatment

NULL.

Npermutes

Number of permutations for the variables. For each permutation, the variables belong to different chunks.

Nchunks

Number of chunks of the variables. When the number of variables is too large for clustering analysis, we split the variables into several(=Nchunks) chunks.

minClusterSizeB

The minimum number of subjects per each selected subtype. The default is 20.

NclustersASet

Cut a tree from hierarchical clustering into several groups. The default is 100.

FDRpermutation

Determine whether FDR computation is based on permutation procedure. The default is TRUE.

nFDRperm

Number of permutation to compute FDR. The default is 50.

seed

seed number for reproducibility.

testMode

the mode is fixed at "quick".

survivaltimes

NULL.

method

penalized is used.

top_best_probes

top-ranked probes are used in t-test, and this is input for penalized. The default is 100.

Niter

The number of iterations of (TrainingSet, TestSet)->training->test->recordResults . The defualt is 20.

showMovie

display RUC/Surv curves and heatmaps. The default is 0.

redefineSubtypeMembers

detect subtype members after every hold-out. The defualt is 0.

holdOut

out of the subtype, i.e. Nsubtype - holdOut = Ntraining_set. The defualt is 10.

Value

resultsAll:

a matrix including subtypeID and summary statistics for each subtypeID. For a specific subtypeID, it includes the number of genes, the number of subjects, area of low p-values (low_pValue_Area).

GenesDefiningSubtypes:

Variables in each subtypeID. This can be identified with "subtypeID".

SubtypePatients:

Subjects in each subtypeID. This can be identified with subtypeID.

Details

This implements a biclustering algorithm to find hidden subtypes in a dataset. summary provides a measure based on FDR and its p-value for assessing the subtypes. Note that the R-package rsmooth should be installed before implementing subtype. rsmooth can be downloaded from http://www.meb.ki.se/~yudpaw. For large dataset, the computation can be heavy, so it is desirable for users to consider parallel processing in R.

References

Alexeyenko, A. et al. (2011) Estimation of false discovery rate in a heterogeneous population.

Examples

Run this code


set.seed(1234)
p<-100   #num.variables
n1<-5    #number of sample in population 1
n2<-5    #num.samples from population 2 

group<-c(rep(1,length.out=n1),rep(2,length.out=n2))
data<-matrix(rnorm((n1+n2)*p),(n1+n2),p)

############################

dimnames(data)[[1]]<-as.character(paste("P",runif(nrow(data),0,1),sep="")) ### making row names
dimnames(data)[[2]]<-as.character(paste("G",runif(ncol(data),0,1),sep="")) ### making column names

### The following procedure takes ~ 1 minute.
A=subtype(
   GEset = t(data),
   outcomeLabels = group,
   Npermutes = 2, 
   Nchunks = 5, 
   NclustersASet = 3,
   seed=1234
)

summary(A,f.out=0)  ### f.out can be used for filtering out uninteresting subtypes. e.g. if f.out=2, we ignore subtypes having N01_0<=2.

Run the code above in your browser using DataLab