CScluster: CScluster

Description

Compute connectivity scores for a clustering result with K clusters. More information can be found in the Details section below.

Usage

CScluster(data, clusterlabels, type = "CSmfa", WithinABS = TRUE,
  BetweenABS = TRUE, FactorABS = FALSE, verbose = FALSE, Within = NULL,
  Between = NULL, WithinSave = FALSE, BetweenSave = TRUE, ...)

Arguments

data

A gene expression matrix with the compounds in the columns.

clusterlabels

A vector of integers that represents the cluster grouping of the columns (compounds) in data. The labels should be integers running from 1 to the total number of clusters (e.g. the output of cutree).

type

Type of CS analysis (default="CSmfa"):

"CSmfa" (MFA or PCA)
"CSsmfa" (Sparse MFA or Sparse PCA)
"CSfabia" (Fabia)
"CSzhang" (Zhang and Gant)

In the first two options, either MFA or PCA is used depending on the cluster size. If the query set contains only a single compound, PCA is used. Also note that if a cluster contains only a single compound, no Within-CS can be computed.

WithinABS

Boolean value to take the mean of the absolute values in the final step of the Within-Cluster CS (default=TRUE).

BetweenABS

Boolean value to take the mean of the absolute values in the final step of the Between-Cluster CS (default=TRUE).

FactorABS

Boolean value to take the absolute value of the query loadings when determining the best factor (= factor with highest query loadings) in a CSanalysis application (default=FALSE). This option might be helpful if the `best factor` contains large positive and negative query loadings which would average to zero.

verbose

Boolean value to output warnings and information about which factor is chosen in a CS analysis, if applicable (default=FALSE).

Within

A vector of cluster numbers for which the Within-Cluster CS should be computed. By default (=NULL) all within-cluster scores are computed, but this might not be feasible for larger data in which a single CSanalysis run might already take a substantial amount of computation time.

Between

A vector of cluster numbers for which the Between-Cluster CS (with the cluster as a query set) should be computed. By default (=NULL) all between-cluster scores are computed, but this might not be feasible for larger data in which a single CSanalysis run might already take a substantial amount of computation time.

WithinSave

Boolean value to save the Within object in the Save slot of the returned list (default=FALSE).

BetweenSave

Boolean value to save the Between object in the Save slot of the returned list (default=TRUE).

...

Additional parameters given to CSanalysis specific to a certain type of CS analysis.
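
To illustrate why FactorABS can matter, the following is a minimal base R sketch and does not reproduce the internals of CSanalysis. The loadings matrix is made up, and the selection rule (the factor with the highest mean query loading) is an assumption chosen only to show how loadings of opposite sign can cancel out.

```r
# Hypothetical query loadings: 4 query compounds (rows) on 2 factors (columns).
query_loadings <- matrix(
  c(0.9, -0.8, 0.85, -0.9,   # Factor1: large loadings with mixed signs
    0.3,  0.2, 0.25,  0.3),  # Factor2: small but consistently positive
  nrow = 4,
  dimnames = list(paste0("query", 1:4), c("Factor1", "Factor2"))
)

# Assumed rule for illustration: choose the factor with the highest
# mean query loading, with or without taking absolute values first.
mean_signed   <- colMeans(query_loadings)
mean_absolute <- colMeans(abs(query_loadings))

best_signed   <- names(which.max(mean_signed))    # signed loadings cancel on Factor1
best_absolute <- names(which.max(mean_absolute))  # absolute loadings favour Factor1
```

In this toy case the signed means select Factor2, although Factor1 carries the larger loadings in magnitude; taking absolute values selects Factor1 instead.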

Value

A list object with components:

CSmatrix: A K\(\times\)K matrix containing the Within scores on the diagonal and the Between scores elsewhere with the rows being the query set clusters (e.g. \(m_{13}=\) Between CS between cluster 1 (as query set) and cluster 3).
CSRankmatrix: The same as CSmatrix, but with connectivity ranking scores (if applicable).
clusterlabels: The provided clusterlabels
Save: A list with components:
- Within: A list with a component for each cluster k that contains:
  - LeaveOneOutCS: Each leave-one-out connectivity score for cluster k.
  - LeaveOneOutCSRank: Each leave-one-out connectivity ranking score for cluster k (if applicable).
  - factorselect: A vector containing which factors/BCs were selected in each leave-one-out CS analysis (if applicable).
  - CS: A (columns (compounds) \(\times\) size of cluster k) matrix that contains all the connectivity scores in a leave-one-out CS analysis for each left out compound.
  - CSRank: The same as CS, but with connectivity ranking scores (if applicable).
- Between: A list with components:
  - DataBetweenCS: A (columns (compounds) \(\times\) clusters) matrix containing all compound connectivity scores for each query cluster set.
  - DataBetweenCSRank: The same as DataBetweenCS, but with connectivity ranking scores (if applicable).
  - queryindex: The column indices for each query set in all CS analyses.
  - factorselect: A vector containing which factors/BCs were selected in each CS analysis (if applicable).

Details

After cluster analysis has been applied to the additional data matrix (a data source other than the gene expression data), K clusters are obtained. Each cluster is treated as a potential query set (for CSanalysis) for which two connectivity score metrics can be computed: the Within-Cluster CS and the Between-Cluster CS.

Within-Cluster CS: This metric answers the question of whether the kth cluster is connected on a gene expression level (in addition to the samples being similar based on the other data source). The Within-Cluster CS for a cluster is computed as follows:

For each sample i in the kth cluster, apply CSMFA with:
- Query Set: All cluster samples excluding the ith sample.
- Reference: All samples including the ith sample of the kth cluster.
After each run, retrieve the CS of the ith sample in the cluster.
The Within-Cluster CS for cluster k is then defined as the average of all retrieved CS.

The idea behind this metric is to investigate the connectivity of each compound with the rest of its cluster. The average of the 'leave-one-out' connectivity scores, the Within-Cluster CS, gives an indication of the gene expression connectivity of this cluster. A high Within-Cluster CS implies that the cluster is similar both on the external data source and on the gene expression level. A low score indicates that the cluster does not share a similar latent gene profile structure.
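
The leave-one-out procedure can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: `toy_cs` is a hypothetical stand-in for the CSMFA score (a plain correlation with the mean query profile), `within_cs` and its `abs_mean` argument are made-up names (the latter only mirrors the idea of WithinABS), the data are simulated, and the reference set is taken to be all compounds outside the query set.

```r
set.seed(1)
# Simulated data: 50 genes (rows) by 12 compounds (columns), 3 clusters of 4.
expr <- matrix(rnorm(50 * 12), nrow = 50,
               dimnames = list(NULL, paste0("cmpd", 1:12)))
clusterlabels <- rep(1:3, each = 4)

# Stand-in score (NOT CSMFA): correlation of each reference compound
# with the mean expression profile of the query set.
toy_cs <- function(query, reference) {
  as.vector(cor(reference, rowMeans(query)))
}

within_cs <- function(expr, clusterlabels, k, abs_mean = TRUE) {
  members <- which(clusterlabels == k)
  # A single-compound cluster has no leave-one-out query set.
  if (length(members) < 2) return(NA_real_)
  loo <- vapply(members, function(i) {
    query_idx <- setdiff(members, i)                     # cluster without sample i
    ref_idx   <- setdiff(seq_len(ncol(expr)), query_idx) # everything else, incl. i
    scores <- toy_cs(expr[, query_idx, drop = FALSE],
                     expr[, ref_idx, drop = FALSE])
    scores[ref_idx == i]                                 # score of the left-out sample
  }, numeric(1))
  if (abs_mean) mean(abs(loo)) else mean(loo)
}

within_scores <- vapply(1:3, function(k) within_cs(expr, clusterlabels, k),
                        numeric(1))
```

With a real CS method in place of `toy_cs`, `within_scores` would correspond to the diagonal of CSmatrix.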

Between-Cluster CS: In this stage of the analysis, we focus on the lth cluster and use all compounds in this cluster as the query set. A CSMFA is performed in which all other clusters form the reference set. Next, the connectivity scores are calculated for all reference compounds and averaged per cluster (= the between connectivity score). A high Between-Cluster CS between the lth and jth clusters implies that, while the two clusters are not similar based on the other data source, they do share a latent structure when considering the gene expression data.
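
A comparable base R sketch for the Between-Cluster CS is given below. As before, `toy_cs` is a hypothetical correlation-based stand-in for CSMFA and the data are simulated; averaging absolute scores only mirrors the idea of BetweenABS = TRUE and is an assumption about the final step.

```r
set.seed(2)
# Simulated data: 50 genes (rows) by 12 compounds (columns), 3 clusters of 4.
expr <- matrix(rnorm(50 * 12), nrow = 50,
               dimnames = list(NULL, paste0("cmpd", 1:12)))
clusterlabels <- rep(1:3, each = 4)
K <- max(clusterlabels)

# Stand-in score (NOT CSMFA): correlation with the mean query profile.
toy_cs <- function(query, reference) {
  as.vector(cor(reference, rowMeans(query)))
}

# Rows are query clusters, columns are reference clusters.
between <- matrix(NA_real_, K, K,
                  dimnames = list(paste0("query", 1:K), paste0("cluster", 1:K)))
for (l in seq_len(K)) {
  query_idx <- which(clusterlabels == l)  # cluster l is the query set
  ref_idx   <- which(clusterlabels != l)  # all other clusters are the reference
  scores <- toy_cs(expr[, query_idx, drop = FALSE],
                   expr[, ref_idx, drop = FALSE])
  # Average the absolute scores of the reference compounds per cluster.
  per_cluster <- tapply(abs(scores), clusterlabels[ref_idx], mean)
  between[l, as.integer(names(per_cluster))] <- per_cluster
}
```

The diagonal stays NA here because it is reserved for the Within scores. The matrix is generally not symmetric, since the entry in row 1, column 3 uses cluster 1 as the query set while the entry in row 3, column 1 uses cluster 3.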

Examples

# NOT RUN {
  # Example Data Set
  data("dataSIM",package="CSFA")
  # Remove some no-connectivity compounds
  nosignal <- grepl("c-", colnames(dataSIM), fixed = TRUE)
  data <- dataSIM[,-which(nosignal)[1:250]]
  
  # Toy example with random cluster assignment:
  # Note: clusterlabels can be acquired through cutree(hclust(...))
  clusterlabels <- sample(1:10,size=ncol(data),replace=TRUE)
  
  result1 <- CScluster(data,clusterlabels,type="CSmfa")
  result2 <- CScluster(data,clusterlabels,type="CSzhang")
  
  result1$CSmatrix
  result1$CSRankmatrix
  
  result2$CSmatrix
# }
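
The toy example above assigns clusters at random. In practice, clusterlabels would typically come from clustering another data source, as the note on cutree(hclust(...)) suggests. The base R sketch below shows one way to do this; the `external` matrix is simulated and purely hypothetical, and the distance measure, linkage method and number of clusters are arbitrary choices.

```r
set.seed(3)
# Simulated external data source: 20 compounds (rows) by 5 features (columns).
# This stands in for a non-expression data source and is not part of CSFA.
external <- matrix(rnorm(20 * 5), nrow = 20,
                   dimnames = list(paste0("cmpd", 1:20), NULL))

# Hierarchical clustering of the compounds, cut into 4 clusters.
tree <- hclust(dist(external), method = "average")
clusterlabels <- cutree(tree, k = 4)  # integer labels 1..4, one per compound

table(clusterlabels)
```

The resulting vector must be in the same order as the columns (compounds) of the gene expression matrix passed as data.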
