SigTree: Perform statistical analysis of tightness for branches of a hierarchical cluster.

Description

Given data from which a hierarchical tree is grown, compute measures of tightness for each branch, sample from the null distribution of these measures under randomization of the data, and compute the corresponding p-values.

Usage

SigTree(myinput,mystat=c("all","fldc","bldc","fldcc","slb"),
        mymethod="complete",mymetric="euclidean",rand.fun=NA,
        by.block=NA,distrib=c("vanilla","Rparallel"),Ptail=TRUE,
        tailmethod=c("ML","MOM"),njobs=1,seed=NA,
        Nperm=ifelse(Ptail,1000,1000*nrow(myinput)),
        metric.args=list(),rand.args=list())

Arguments

myinput

A matrix with rows corresponding to items to be clustered.

mystat

A character string specifying the measure or measures of tightness to be computed and tested for significance. See Details for the definitions of these measures. If "all" is chosen, the first three measures, "fldc", "bldc" and "fldcc", are computed together with their p-values. Otherwise, only the specified measure and its p-value are computed.

mymethod

A character string specifying the linkage method for hierarchical clustering, to be used by the hclust function. See hclust argument method for method options.

mymetric

A character string specifying the definition of dissimilarity (distance) among the data items. The options, in addition to those for the method argument of the dist function, are "pearson", "kendall", and "spearman". If one of the latter three is chosen, the distances are computed as as.dist(1 - cor(myinput)), with the corresponding option for the method argument of the cor function. It can also be a character string naming a user-supplied dissimilarity (distance) function for myinput. See Details and Examples below for further explanation.

rand.fun

A character string specifying the permutation method to be applied to myinput. If NA (default), no permutation is performed. "shuffle.column" performs a random permutation independently within each column. With "shuffle.block", a random permutation is performed independently within each block of columns, as specified by the by.block argument, and independently of the other blocks. It can also be a character string naming a user-supplied randomization function for myinput. See Details and Examples below for further explanation.

by.block

A vector of the same length as the column dimension of myinput, to specify the blocking of columns of myinput. It is used in conjunction with rand.fun = "shuffle.block", and is ignored otherwise.

distrib

One of "vanilla" or "Rparallel", specifying the distributed computing option for the cluster assignment step. For "vanilla" (default), no distributed computing is performed. For "Rparallel", the parallel package of R core is used for multi-core processing.

Ptail

Logical. If Ptail is TRUE (default), the Generalized Pareto Distribution is used to approximate the tail of the null distribution for each of the chosen measures. Otherwise, empirical p-values are computed directly from the corresponding samples.

tailmethod

A character string, needed only if Ptail is set to TRUE. For "ML", the parameters of the Generalized Pareto Distribution are estimated by likelihood maximization; for "MOM", they are estimated by the method of moments.

njobs

A single integer specifying the number of worker jobs to create for distributed computation when distrib = "Rparallel"; ignored otherwise.

seed

An optional single integer value, to be used to set the random number generator seed (see details).

Nperm

A single integer specifying the size of a sample from the null distribution. See details for the default sample size.

metric.args

Additional arguments for user-supplied dissimilarity (distance) function. See details and examples below for further explanation.

rand.args

Additional arguments for user-supplied randomization function. See details and examples below for further explanation.

Value

If rand.fun is set to NA, the function returns a matrix whose rows correspond to the internal nodes of the tree and whose columns contain the tree structure, as in the merge component of class hclust; the height component of hclust; and the values of the measures of tightness specified by the mystat argument. If rand.fun is set to a specific randomization method, an object of class best is returned. See ?best for details.

Details

When rand.fun is set to the name of a user supplied randomization function, the first argument of that function should be set to myinput. See examples below.
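As an illustration of this calling convention, the following sketch writes the two permutation schemes described under rand.fun as plain R functions whose first argument is the data matrix. The names shuffle_column and shuffle_block are hypothetical and this is not the package's own code. In particular, the sketch assumes that rows are permuted jointly across all columns of a block, which is one possible reading of the description of "shuffle.block".

```r
# Permute each column independently (one reading of "shuffle.column").
shuffle_column <- function(x) {
  apply(x, 2, sample)
}

# Permute rows jointly within each block of columns, independently across
# blocks (an assumed reading of "shuffle.block"; by.block labels the columns).
shuffle_block <- function(x, by.block) {
  out <- x
  for (b in unique(by.block)) {
    cols <- which(by.block == b)
    out[, cols] <- x[sample(nrow(x)), cols, drop = FALSE]
  }
  out
}

set.seed(42)
x <- matrix(1:24, nrow = 6)      # 6 items, 4 columns
blk <- c(1, 1, 2, 2)             # columns 1-2 and 3-4 form two blocks
y_col <- shuffle_column(x)
y_blk <- shuffle_block(x, blk)
```

Both functions leave the set of values in every column unchanged; only the assignment of values to rows differs.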

The measures of tightness are defined as follows. Denote a node in the tree by a, its sibling node by b, and their parent node by p. Let their respective heights be ha, hb, hp. Finally, let Sx mean that the measure S is computed for the node x. Then the definitions are

fldc:

Sa = (hp-ha)/hp

fldcc:

Sa = (hp-(ha-hb)/2)/ha

bldc:

Sp = (2*hp-ha-hb)/(2*hp)

slb:

Sp = 2*hp-ha-hb
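The definitions of fldc, bldc and slb can be evaluated directly from the merge and height components of an hclust object. The sketch below does so in base R on simulated data. It assumes that leaves are assigned height 0 and that the root, which has no parent, gets NA for fldc; whether SigTree follows the same conventions is not confirmed here. The helper child_height is hypothetical.

```r
set.seed(1)
x <- matrix(rnorm(40), nrow = 8)            # 8 items, 5 features
hc <- hclust(dist(x), method = "complete")

# Height of an entry of hc$merge: negative entries are leaves (assumed
# height 0), positive entries index earlier merges (internal nodes).
child_height <- function(k) if (k < 0) 0 else hc$height[k]

n_nodes <- nrow(hc$merge)
hp <- hc$height                              # height of each internal node p
ha <- sapply(hc$merge[, 1], child_height)    # height of first child
hb <- sapply(hc$merge[, 2], child_height)    # height of second child

# Measures attached to the parent node p
bldc <- (2 * hp - ha - hb) / (2 * hp)
slb  <- 2 * hp - ha - hb

# fldc is attached to the child node a; the root has no parent and stays NA
fldc <- rep(NA_real_, n_nodes)
for (p in seq_len(n_nodes)) {
  for (k in hc$merge[p, ]) {
    if (k > 0) fldc[k] <- (hp[p] - hc$height[k]) / hp[p]
  }
}

cbind(hc$merge, height = hp, fldc = fldc, bldc = bldc, slb = slb)
```

With a linkage method that yields monotone heights, such as complete linkage, bldc and fldc fall between 0 and 1.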

The first three measures test the tightness of all internal nodes at the same time, while slb tests only the two-way split of the input data.

The seed argument is optional. Setting the seed ensures reproducibility of sampling from the null distribution.
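Once a sample from the null distribution is available, the tail approximation selected by Ptail with tailmethod = "MOM" can be illustrated as follows. This is a minimal sketch and not the package's implementation: the function name gpd_tail_pvalue, the number of exceedances n_exceed, and the choice of threshold are all assumptions made for illustration. The Generalized Pareto Distribution is fitted to the largest values of the null sample by the method of moments, and the empirical p-value is used when the observed value does not exceed the threshold.

```r
gpd_tail_pvalue <- function(null_sample, observed, n_exceed = 250) {
  n <- length(null_sample)
  sorted <- sort(null_sample, decreasing = TRUE)

  # Threshold halfway between the n_exceed-th and next largest value (assumed)
  threshold <- (sorted[n_exceed] + sorted[n_exceed + 1]) / 2
  z <- observed - threshold

  # Observed value is not in the tail: use the empirical p-value
  if (z <= 0) return(mean(null_sample >= observed))

  # Method-of-moments estimates of the GPD shape (xi) and scale (sigma)
  excess <- sorted[seq_len(n_exceed)] - threshold
  m <- mean(excess)
  v <- var(excess)
  xi <- 0.5 * (1 - m^2 / v)
  sigma <- 0.5 * m * (m^2 / v + 1)

  # GPD survival function at z, with the exponential limit for xi near 0
  surv <- if (abs(xi) < 1e-8) {
    exp(-z / sigma)
  } else {
    pmax(1 + xi * z / sigma, 0)^(-1 / xi)
  }

  # Probability of reaching the tail times probability of exceeding z in it
  (n_exceed / n) * surv
}

set.seed(123)
null_sample <- rexp(1000)                   # stand-in for a permutation sample
p_tail <- gpd_tail_pvalue(null_sample, 10)  # far beyond the sampled range
p_body <- gpd_tail_pvalue(null_sample, 0.5) # inside the bulk of the sample
```

The fitted tail allows p-values smaller than 1/Nperm to be estimated, which is why the default Nperm is smaller when Ptail is TRUE.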

References

Theo A. Knijnenburg, Lodewyk F. A. Wessels et al. (2009). Fewer permutations, more accurate P-values. Bioinformatics, 25(12), i161-i168.

Examples

# NOT RUN {
####Each column is a gene expression profile for a case of leukemia. 
####Each case belongs to one of three subtypes.
data(leukemia)
#output only statistic table
mytable<-SigTree(data.matrix(leukemia),mystat="all",
        mymethod="ward",mymetric="euclidean")
class(mytable)
# }
# NOT RUN {
#use multicore processing to detect significant sub-clusters
mytable<-SigTree(data.matrix(leukemia),mystat="all",
	mymethod="ward",mymetric="euclidean",rand.fun="shuffle.column",
	distrib="Rparallel",njobs=2,Ptail=TRUE,tailmethod="ML")
class(mytable)
####Each row after the 1st describes an item belonging to one of four subtypes. 
####Each column corresponds to a genomic location in one of 22 human chromosomes. 
####The 1st row contains the chromosome numbers.
data(T10)
#Perform randomization within each chromosome
chrom<-as.numeric(T10[1,])
mydata<-T10[-1,] 
mytable<-SigTree(data.matrix(mydata),mystat="fldc",        
	mymethod="ward",mymetric="euclidean",rand.fun="shuffle.block",
	by.block=chrom,distrib="Rparallel",njobs=2,Ptail=TRUE,tailmethod="ML")
#Compute dissimilarity using a user-supplied distance function,
#and perform randomization using a user-supplied randomization function, 
#with additional arguments. 
#Both user-supplied functions are only useful as illustration.
mydist<-function(x,y){return(dist(x)/y)}
myrand<-function(x,z){return(apply(x+z,2,sample))}
mytable<-SigTree(data.matrix(leukemia),mystat="fldc",
	mymethod="ward",mymetric="mydist",rand.fun="myrand",
	distrib="Rparallel",njobs=2,Ptail=TRUE,tailmethod="MOM",
	metric.args=list(3),rand.args=list(2))
# }
