TwoHC_assign: Function to assign new samples to one of the two given hierarchical clustering trees in a semi-supervised way

Description

Given molecular data sets from two non-overlapping groups of patients, this function constructs two independent HC trees and assigns new samples to one of them in a semi-supervised way. See Details.

Usage

TwoHC_assign(X, index1, index2, new.X, dis.method = "cor", link.method = "ward",  minclus = 4, maxmiss = 30, surv.time, status, method1 = "BIC",  method2 = "g2")

Arguments

X

An object of class ExpressionSet or a data matrix from which the two HC trees are derived. Columns are assumed to represent the samples, and rows the samples' features. Missing values are allowed.

index1

Column indices of the patients in X that belong to the first group.

index2

Column indices of the patients in X that belong to the second group.

new.X

An object of class ExpressionSet or a data matrix containing the new samples. Columns are assumed to represent the samples, and rows the samples' features. Missing values are allowed.

dis.method

The distance measure to be used. This must be one of the methods accepted by the dist function, or "cor" for the Pearson correlation (default).

link.method

The agglomeration method to be used. This should be one of "ward" (default), "single", "complete", "average", "mcquitty", "median" or "centroid".

minclus

The minimum number of samples allowed to form a cluster. This parameter is inversely related to the number of partitions returned from an HC tree: a large value returns a small number of partitions, and vice versa.

maxmiss

Maximum percentage of missing values per row in X.

surv.time

A numeric vector containing the follow-up times of the patients in X.

status

A binary vector containing the survival status of the patients in X, normally 0 = alive, 1 = dead.

method1

Type of partition evaluation measure used to assess the relationship between the follow-up data and a partition. Default is "BIC".

method2

Type of partition evaluation measure used to assess the relationship between the data matrix X and a partition. Default is the Goodman and Kruskal index, "g2".
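
As a rough illustration of what maxmiss, dis.method and link.method control, the following base R sketch filters rows by their percentage of missing values, builds a correlation-based distance between samples, and derives one tree. The toy data and every object name (X.toy, X.kept, d.cor, hc.toy) are made up for the example. The form of the correlation distance (one minus the Pearson correlation), the inclusive threshold, and the pairwise-complete handling of the remaining missing values are assumptions, not a description of the function's internals.

```r
set.seed(1)

# Toy data: 20 features (rows) x 12 samples (columns); names are illustrative
X.toy <- matrix(rnorm(20 * 12), nrow = 20,
                dimnames = list(paste0("f", 1:20), paste0("s", 1:12)))
X.toy[1, 1:6] <- NA   # 50% missing in row 1
X.toy[2, 1]   <- NA   # about 8% missing in row 2

# maxmiss: drop rows whose percentage of missing values exceeds the threshold
# (assumption: rows exactly at the threshold are kept)
maxmiss <- 30
pct.missing <- 100 * rowMeans(is.na(X.toy))
X.kept <- X.toy[pct.missing <= maxmiss, , drop = FALSE]

# dis.method = "cor": assumed here to be 1 - Pearson correlation between samples
d.cor <- as.dist(1 - cor(X.kept, use = "pairwise.complete.obs"))

# link.method: base hclust() has called this linkage "ward.D" since R 3.1.0
hc.toy <- hclust(d.cor, method = "ward.D")
```

In this toy case only the first row is dropped, so the tree is built from 19 features and 12 samples.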

Value

hc1: HC tree derived from the data of the first group.
hc2: HC tree derived from the data of the second group.
partitions.hc1: A matrix of the partitions extracted from hc1. Rows represent partitions and columns represent samples.
partitions.hc2: A matrix of the partitions extracted from hc2. Rows represent partitions and columns represent samples.
best.hc1: Optimal partition found on hc1.
best.hc2: Optimal partition found on hc2.
score.hc1: A matrix with two columns. The first column contains the quality scores of partitions.hc1 calculated using the follow-up data. The second column contains the quality scores of partitions.hc1 calculated using X.
score.hc2: The same as score.hc1, but for partitions.hc2.
Assign: A matrix with three columns. The first column contains the index of the HC tree to which each test sample was assigned. The second column contains the index of the cluster in best.hc1 to which each test sample was most similar. The third column contains the index of the cluster in best.hc2 to which each test sample was most similar.
surv.time: The same as input
status: The same as input
index1: The same as input
index2: The same as input
new.X: The same as input
X: The same as input
method1: The same as input
method2: The same as input
minclus: The same as input
id1: Indices of the partitions obtained from hc1 in which the smallest cluster has at least minclus samples.
id2: Indices of the partitions obtained from hc2 in which the smallest cluster has at least minclus samples.
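
To make the layout of partitions.hc1 and the meaning of id1 more concrete, here is a small base R sketch. It uses fixed cuts with cutree() as a stand-in for the function's own partition extraction, so the partitions themselves will differ from what TwoHC_assign returns; all object names (X.toy, hc.toy, partitions.toy, id.toy) are hypothetical.

```r
set.seed(2)

# Toy data: 30 features (rows) x 10 samples (columns)
X.toy <- matrix(rnorm(30 * 10), nrow = 30)
hc.toy <- hclust(dist(t(X.toy)), method = "average")

# Candidate partitions: cut the tree into k = 2, ..., 5 clusters.
# t() puts partitions in rows and samples in columns, as in partitions.hc1.
partitions.toy <- t(cutree(hc.toy, k = 2:5))

# Keep the partitions whose smallest cluster has at least minclus samples
minclus <- 2
smallest <- apply(partitions.toy, 1, function(p) min(table(p)))
id.toy <- which(smallest >= minclus)
```

Raising minclus in this sketch can only shrink id.toy, which is the inverse relationship described for the minclus argument.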

Details

Suppose molecular profiles are available for two non-overlapping groups of patients who were treated with two different drugs, or with the same drugs in different combinations, and that follow-up information is available for both groups. When a new patient arrives for whom only a molecular profile is available, the question is to which group this patient should be assigned so that he or she benefits most from the treatment that group received.

This function is designed for this problem. It works as follows: first, two independent HC trees are derived from the given data; second, partitions are extracted from each HC tree and the optimal partition is selected for each tree separately; third, the new patient's molecular profile is compared with each cluster in each optimal partition to calculate the average similarity and to identify the most similar cluster in each of the two HC trees (the competing clusters); finally, the new sample is assigned to whichever of the two competing clusters has the better overall survival.
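
The four steps can be sketched in base R for a single new patient. This is a simplified illustration under stated assumptions, not the function's implementation: a fixed two-cluster cut stands in for the optimal partition, similarity is taken to be the Pearson correlation, and "better overall survival" is approximated by a longer median follow-up time that ignores censoring. All data and object names are made up.

```r
set.seed(3)

# Toy training data: 25 features x 8 patients per group (names illustrative)
X1 <- matrix(rnorm(25 * 8), nrow = 25)
X2 <- matrix(rnorm(25 * 8), nrow = 25)
time1 <- rexp(8, rate = 1 / 20)   # follow-up times, group 1
time2 <- rexp(8, rate = 1 / 30)   # follow-up times, group 2
new.x <- rnorm(25)                # molecular profile of one new patient

# Steps 1-2: one tree per group; a fixed two-cluster cut stands in for
# the optimal partition (assumption)
cor.dist <- function(X) as.dist(1 - cor(X))
best1 <- cutree(hclust(cor.dist(X1), method = "average"), k = 2)
best2 <- cutree(hclust(cor.dist(X2), method = "average"), k = 2)

# Step 3: average correlation between the new profile and each cluster
avg.sim <- function(X, part) tapply(as.vector(cor(new.x, X)), part, mean)
sim1 <- avg.sim(X1, best1)
sim2 <- avg.sim(X2, best2)
c1 <- as.integer(names(which.max(sim1)))   # competing cluster in tree 1
c2 <- as.integer(names(which.max(sim2)))   # competing cluster in tree 2

# Step 4: pick the competing cluster with the longer median follow-up
# (crude proxy for better overall survival; ignores censoring)
med1 <- median(time1[best1 == c1])
med2 <- median(time2[best2 == c2])
assign.toy <- c(tree = if (med1 >= med2) 1 else 2,
                cluster.hc1 = c1, cluster.hc2 = c2)
```

The three elements of assign.toy mirror the three columns of the Assign matrix: the chosen tree, then the most similar cluster in each tree.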

References

Harrell,F.E. et al., (1982). "Evaluating the yield of medical tests", JAMA, 247, 2543-2546.

Obulkasim,A. et al., (2011). "Stepwise classification of cancer samples using clinical and molecular data", BMC Bioinformatics, 12, 422.

Troyanskaya,O. et al., (2001). "Missing value estimation methods for DNA microarrays". Bioinformatics, 17, 520-525.

Obulkasim,A. et al., (2013). "Semi-supervised adaptive-height snipping of the Hierarchical Clustering tree", submitted.

Examples

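# Load the example data set and make its components visible
data(TcgaGBM)
attach(TcgaGBM)

# Patients grouped by the drug they received
id1 <- which(drugs == "Avastin")
id2 <- which(drugs == "Temodar")

# First 30 patients of each group build the trees; the next 30 are new samples
train <- c(id1[1:30], id2[1:30])
test <- c(id1[31:60], id2[31:60])
result <- TwoHC_assign(X = em[, train], index1 = 1:30, index2 = 31:60,
                       new.X = em[, test], minclus = 4,
                       surv.time = surv.time[train],
                       status = status[train])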
