
CSFA (version 1.2.0)

CScluster: CScluster

Description

Apply the connectivity scores to a clustering result with K clusters. More information can be found in the Details section below.

Usage

CScluster(data, clusterlabels, type = "CSmfa", WithinABS = TRUE,
  BetweenABS = TRUE, FactorABS = FALSE, verbose = FALSE, Within = NULL,
  Between = NULL, WithinSave = FALSE, BetweenSave = TRUE, ...)

Arguments

data

A gene expression matrix with the compounds in the columns.

clusterlabels

A vector of integers that represents the cluster grouping of the columns (compounds) in data. The labels should be integers starting from 1 to the total number of clusters. (e.g. the output of cutree)
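As a minimal sketch of how such labels could be produced with base R, the block below clusters a made-up side-information matrix with hclust and cuts the tree with cutree. The name sideinfo, its dimensions, and the choice of K = 4 are assumptions for illustration only; the rows of sideinfo are assumed to be in the same order as the columns of data.

```r
# Hypothetical side-information matrix: one row per compound
# (20 compounds, 5 features). Not part of CSFA.
set.seed(1)
sideinfo <- matrix(rnorm(20 * 5), nrow = 20,
                   dimnames = list(paste0("cmpd", 1:20), NULL))

# Hierarchical clustering of the compounds, cut into K clusters.
K <- 4
tree <- hclust(dist(sideinfo), method = "average")
clusterlabels <- cutree(tree, k = K)

# cutree() returns one integer label in 1..K per compound.
table(clusterlabels)
```

The resulting vector has one entry per compound and uses the labels 1 to K, which is the form described for the clusterlabels argument.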

type

Type of CS analysis (default="CSmfa"):

  • "CSmfa" (MFA or PCA)

  • "CSsmfa" (Sparse MFA or Sparse PCA)

  • "CSfabia" (Fabia)

  • "CSzhang" (Zhang and Gant)

In the first two options, either MFA or PCA is used depending on the cluster size. If the query set only contains a single compound, the latter is used. Also note that if a cluster only contains a single compound, no Within-CS can be computed.

WithinABS

Boolean value to take the mean of the absolute values in the final step of the Within-Cluster CS (default=TRUE).

BetweenABS

Boolean value to take the mean of the absolute values in the final step of the Between-Cluster CS (default=TRUE).

FactorABS

Boolean value to take the absolute value of the query loadings when determining the best factor (= factor with highest query loadings) in a CSanalysis application (default=FALSE). This option might be helpful if the 'best factor' contains large positive and negative query loadings, which would otherwise average to zero.
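The cancellation effect can be illustrated with a small base R sketch. It assumes, for illustration only, that the best factor is the one with the highest average query loading; the loading values and the names query_loadings, best_signed and best_abs are made up and are not CSFA objects.

```r
# Hypothetical query loadings: 4 query compounds (rows) on 2 factors (columns).
query_loadings <- cbind(Factor1 = c(0.9, -0.9, 0.8, -0.8),
                        Factor2 = c(0.3,  0.2, 0.3,  0.2))

# Signed averages: the large loadings on Factor1 cancel out,
# so Factor2 has the higher average.
best_signed <- which.max(colMeans(query_loadings))

# Averages of absolute values: Factor1 now clearly dominates.
best_abs <- which.max(colMeans(abs(query_loadings)))

names(best_signed)
names(best_abs)
```

In this toy case the signed average selects Factor2, while the absolute-value average selects Factor1, which is the situation the FactorABS option is meant to address.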

verbose

Boolean value to output warnings and information about which factor is chosen in a CS analysis (if applicable).

Within

A vector of cluster numbers for which the Within-Cluster CS should be computed. By default (=NULL) all within-cluster scores are computed, but this might not be feasible for larger data in which a single CSanalysis run might already take a substantial amount of computation time.

Between

A vector of cluster numbers for which the Between-Cluster CS (with the cluster as a query set) should be computed. By default (=NULL) all between-cluster scores are computed, but this might not be feasible for larger data in which a single CSanalysis run might already take a substantial amount of computation time.

WithinSave

Boolean value to save the Within object in the Save slot of the returned list (default=FALSE).

BetweenSave

Boolean value to save the Between object in the Save slot of the returned list (default=TRUE).

...

Additional parameters given to CSanalysis specific to a certain type of CS analysis.

Value

A list object with components:

  • CSmatrix: A K\(\times\)K matrix containing the Within scores on the diagonal and the Between scores elsewhere with the rows being the query set clusters (e.g. \(m_{13}=\) Between CS between cluster 1 (as query set) and cluster 3).

  • CSRankmatrix: The same as CSmatrix, but with connectivity ranking scores (if applicable).

  • clusterlabels: The provided clusterlabels

  • Save: A list with components:

    • Within: A list with a component for each cluster k that contains:

      • LeaveOneOutCS: Each leave-one-out connectivity score for cluster k.

      • LeaveOneOutCSRank: Each leave-one-out connectivity ranking score for cluster k (if applicable).

      • factorselect: A vector containing which factors/BCs were selected in each leave-one-out CS analysis (if applicable).

      • CS: A (columns (compounds) \(\times\) size of cluster k) matrix that contains all the connectivity scores in a leave-one-out CS analysis for each left out compound.

      • CSRank: The same as CS, but with connectivity ranking scores (if applicable).

    • Between: A list with components:

      • DataBetweenCS: A (columns (compounds) \(\times\) clusters) matrix containing all compound connectivity scores for each query cluster set.

      • DataBetweenCSRank: The same as DataBetweenCS, but with connectivity ranking scores (if applicable).

      • queryindex: The column indices for each query set in all CS analyses.

      • factorselect: A vector containing which factors/BCs were selected in each CS analysis (if applicable).
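The layout of CSmatrix described above can be read as in the following sketch. The matrix here is filled with placeholder numbers purely to show the indexing; a real matrix would come from the CSmatrix component of the returned list.

```r
# Placeholder K x K matrix (K = 3); the values are made up for illustration.
K <- 3
CSmatrix <- matrix(c(0.9, 0.1, 0.2,
                     0.3, 0.8, 0.1,
                     0.2, 0.4, 0.7), nrow = K, byrow = TRUE)

# Within-Cluster CS of each cluster: the diagonal.
within <- diag(CSmatrix)

# Between-Cluster CS with cluster 1 as query set and cluster 3 as reference.
m13 <- CSmatrix[1, 3]

# All Between-Cluster CS with cluster 1 as query set: row 1 without the diagonal.
between_from_1 <- CSmatrix[1, -1]
```

Because the rows are the query set clusters, the matrix need not be symmetric: the entry in row 1, column 3 and the entry in row 3, column 1 come from different CS analyses.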

Details

After applying cluster analysis on the additional data matrix, K clusters are obtained. Each cluster will be seen as a potential query set (for CSanalysis) for which 2 connectivity score metrics can be computed, the Within-Cluster CS and the Between-Cluster CS.

Within-Cluster CS

This metric answers the question whether the kth cluster is connected at the gene expression level (in addition to the samples being similar based on the other data source). The Within-Cluster CS for a cluster is computed as follows:

  1. Repeatedly for the ith sample in the kth cluster, apply CSMFA with:

    • Query Set: All cluster samples excluding the ith sample.

    • Reference: All samples including the ith sample of the kth cluster.

    • Retrieve the CS of the ith sample in the cluster.

  2. The Within-Cluster CS for cluster k is now defined as the average of all retrieved CS.

The concept of this metric is to investigate the connectivity of each compound with its cluster. The average of the 'leave-one-out' connectivity scores, the Within-Cluster CS, gives an indication of the gene expression connectivity of this cluster. A high Within-Cluster CS implies that the cluster is similar both on the external data source and at the gene expression level. A low score indicates that the cluster does not share a similar latent gene profile structure.
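The leave-one-out steps above can be sketched in base R as follows. This is not the CSFA implementation: the function toy_score is a hypothetical stand-in (a plain correlation with the mean query profile) used in place of a CSMFA run, and within_cluster_cs, abs_mean, toy_data and toy_labels are likewise names invented for this illustration.

```r
# Stand-in score (NOT CSMFA): correlation between the target compound's
# expression profile and the mean profile of the query compounds.
toy_score <- function(data, query, target) {
  query_profile <- rowMeans(data[, query, drop = FALSE])
  cor(query_profile, data[, target])
}

# Leave-one-out Within-Cluster score for cluster k.
within_cluster_cs <- function(data, clusterlabels, k, abs_mean = TRUE) {
  members <- which(clusterlabels == k)
  # A cluster with a single compound has no Within-CS.
  if (length(members) < 2) return(NA_real_)
  # For each member i: query set = cluster without i, then score compound i.
  loo <- sapply(members, function(i) {
    toy_score(data, query = setdiff(members, i), target = i)
  })
  # Final step: average the (absolute) leave-one-out scores.
  if (abs_mean) mean(abs(loo)) else mean(loo)
}

set.seed(1)
toy_data <- matrix(rnorm(50 * 12), nrow = 50)  # 50 genes x 12 compounds
toy_labels <- rep(1:3, each = 4)
toy_labels[12] <- 4                            # cluster 4 has a single member

within_scores <- sapply(1:4, function(k) {
  within_cluster_cs(toy_data, toy_labels, k)
})
within_scores
```

The abs_mean switch mirrors the role described for WithinABS, and the NA for the single-compound cluster reflects the note that no Within-CS can be computed in that case.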

Between-Cluster CS

In this stage of the analysis, we focus on the lth cluster and use all compounds in this cluster as the query set. A CSMFA is performed in which all other clusters are the reference set. Next, the connectivity scores are calculated for all reference compounds and averaged within each reference cluster (= the Between-Cluster CS). A high Between-Cluster CS between the lth and jth clusters implies that, while the two clusters are not similar based on the other data source, they do share a latent structure when considering the gene expression data.
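The per-cluster averaging described above can be sketched in base R as follows. As an assumption for illustration, a plain correlation with the mean query profile stands in for the CSMFA connectivity scores; toy_data, toy_labels, ref_scores and between_cs are made-up names, not CSFA objects.

```r
set.seed(2)
toy_data <- matrix(rnorm(50 * 12), nrow = 50)  # 50 genes x 12 compounds
toy_labels <- rep(1:3, each = 4)               # 3 clusters of 4 compounds

# Cluster l is the query set; all other compounds form the reference set.
l <- 1
query <- which(toy_labels == l)
reference <- which(toy_labels != l)

# Stand-in score (NOT CSMFA): correlation with the mean query profile.
query_profile <- rowMeans(toy_data[, query, drop = FALSE])
ref_scores <- apply(toy_data[, reference, drop = FALSE], 2,
                    function(x) cor(query_profile, x))

# Average the absolute reference scores within each reference cluster.
# This corresponds to the off-diagonal entries of row l in a K x K matrix.
between_cs <- tapply(abs(ref_scores), toy_labels[reference], mean)
between_cs
```

Taking absolute values before averaging plays the role described for BetweenABS; dropping abs() gives the signed average instead.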

Examples

# NOT RUN {
  # Example Data Set
  data("dataSIM",package="CSFA")
  # Remove some no-connectivity compounds
  nosignal <- sapply(colnames(dataSIM),FUN=function(x){grepl("c-",x)})
  data <- dataSIM[,-which(nosignal)[1:250]]
  
  # Toy example with random cluster assignment (seed fixed for reproducibility):
  # Note: clusterlabels can be acquired through cutree(hclust(...))
  set.seed(1)
  clusterlabels <- sample(1:10,size=ncol(data),replace=TRUE)
  
  result1 <- CScluster(data,clusterlabels,type="CSmfa")
  result2 <- CScluster(data,clusterlabels,type="CSzhang")
  
  result1$CSmatrix
  result1$CSRankmatrix
  
  result2$CSmatrix
# }
