gsnPareNetGenericHierarchic: gsnPareNetGenericHierarchic

Description

Method to perform hierarchical clustering and paring of gene set networks.

Usage

gsnPareNetGenericHierarchic(
  object,
  distance = NULL,
  extreme = NULL,
  cutoff = NULL,
  keepOrphans = TRUE,
  matrix_scaling_fun = NULL,
  lower_is_closer = NULL,
  k = NULL,
  h = NULL,
  method = "average"
)

Value

A GSNData copy of the original object argument containing a pared distance matrix for the specified distance metric.

Arguments

object: An object of type GSNData containing a distance matrix.
distance: (optional) character vector of length 1 naming the distance matrix to be pared. This defaults to the 'default_distance'.
extreme: (optional) Either min or max indicating whether low or high values are most significant, i.e. to be interpreted as the shortest distance for nearest neighbor paring. This defaults to the value set for the optimal_extreme field of the specified distance matrix.
cutoff: (optional) A cutoff specifying a maximal or minimal value that will be retained, dependent on the distance metric being used. It is not usually necessary to specify this for hierarchical clustering. (see details)
keepOrphans: A boolean indicating whether 'orphan' gene sets that have no nearest neighbors should be retained in the final network. (default TRUE)
matrix_scaling_fun: A function to perform transformation and scaling of the distance matrix. The default, distMat2UnitNormRank, converts the distance matrix to ranks and scales the resulting numbers to a range between 0 and 1. If set to NULL, the distances are not scaled or transformed. (see details)
lower_is_closer: Boolean indicating that lower values should be treated as closer for the sake of hierarchical clustering.
k: (optional) Parameter passed to cutree to determine the number of desired clusters. If both k and h are NULL, a value for k will be chosen. (see details)
h: (optional) Parameter passed to cutree to determine the cutting height for breaking the clusters into groups. (see details)
method: (optional) Parameter passed to hclust() to specify the hierarchical clustering method used. (default "average")
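
The k, h, and method arguments correspond to arguments of base R's cutree() and hclust(). The following is a minimal sketch of how those calls behave on a toy distance matrix. The matrix and the gene set names GS1 to GS5 are made up for illustration, and this is not the package's internal code:

```r
# Toy symmetric distance matrix for five hypothetical gene sets
# (lower values are closer).
m <- matrix( c( 0.00, 0.10, 0.20, 0.90, 0.80,
                0.10, 0.00, 0.15, 0.85, 0.90,
                0.20, 0.15, 0.00, 0.95, 0.90,
                0.90, 0.85, 0.95, 0.00, 0.10,
                0.80, 0.90, 0.90, 0.10, 0.00 ),
             nrow = 5, byrow = TRUE,
             dimnames = list( paste0( "GS", 1:5 ), paste0( "GS", 1:5 ) ) )

# 'method' is handed to hclust(), here with average linkage.
hc <- stats::hclust( stats::as.dist( m ), method = "average" )

# Cut the tree either by the number of clusters (k) ...
clusters_k <- stats::cutree( hc, k = 2 )

# ... or by a cutting height (h).
clusters_h <- stats::cutree( hc, h = 0.5 )
```

In this toy case both cuts assign GS1, GS2, and GS3 to one cluster and GS4 and GS5 to another.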

Details

This method performs hierarchical clustering, then joins the members of each cluster. This joining occurs as follows:

First, only the edges between gene sets belonging to the same hierarchical cluster are considered, and the edges within each cluster are ordered by distance.
The first edge is the edge defined by the shortest distance.
Each subsequent edge is added to the subnet by selecting the shortest of the edges that connect one joined gene set to one unjoined gene set.
This process is repeated until all gene sets in a cluster are joined as a subnet.

This joining method differs from nearest neighbor joining in that unjoined nodes are initially joined, not to their nearest neighbor necessarily, but to their nearest neighbor from among the nodes already joined together in a subnet. This method avoids bifurcation of subnets that could occur by regular nearest neighbor joining.
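
The joining steps described above can be sketched in base R as follows. This is an illustrative reading of the description, not the package's implementation: join_cluster() is a hypothetical helper, and it assumes a symmetric matrix with row and column names in which lower values are closer.

```r
# Hypothetical helper: join the members of one cluster into a single subnet.
# 'd' is a symmetric, named matrix in which lower values are closer;
# 'members' is a character vector naming the gene sets in the cluster.
join_cluster <- function( d, members ) {
  from <- character( 0 ); to <- character( 0 ); dist <- numeric( 0 )
  if ( length( members ) >= 2 ) {
    sub <- d[members, members, drop = FALSE]
    diag( sub ) <- Inf   # ignore self-distances

    # First edge: the shortest distance within the cluster.
    idx <- which( sub == min( sub ), arr.ind = TRUE )[1, ]
    joined <- members[ unname( idx ) ]
    from <- joined[1]; to <- joined[2]; dist <- sub[joined[1], joined[2]]
    unjoined <- setdiff( members, joined )

    # Subsequent edges: the shortest edge linking a joined gene set
    # to an unjoined one, repeated until nothing is left unjoined.
    while ( length( unjoined ) > 0 ) {
      cand <- sub[joined, unjoined, drop = FALSE]
      idx <- which( cand == min( cand ), arr.ind = TRUE )[1, ]
      a <- joined[ idx[[1]] ]
      b <- unjoined[ idx[[2]] ]
      from <- c( from, a ); to <- c( to, b ); dist <- c( dist, cand[a, b] )
      joined <- c( joined, b )
      unjoined <- setdiff( unjoined, b )
    }
  }
  data.frame( from = from, to = to, distance = dist )
}

# Toy cluster in which A-B and C-D are mutual nearest neighbors.
nodes <- c( "A", "B", "C", "D" )
d <- matrix( c( 0.0, 0.1, 0.9, 0.9,
                0.1, 0.0, 0.5, 0.9,
                0.9, 0.5, 0.0, 0.1,
                0.9, 0.9, 0.1, 0.0 ),
             nrow = 4, byrow = TRUE, dimnames = list( nodes, nodes ) )

edges <- join_cluster( d, nodes )
```

In this toy cluster, plain nearest neighbor joining would produce two disconnected pairs (A-B and C-D), whereas the sketch also adds the B-C edge, so all four gene sets end up in one subnet.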

NOTE: The matrix_scaling_fun argument is a function that takes the distance matrix and transforms it into scaled data appropriate for hierarchical clustering. As such, it should return data in which low values indicate closer gene sets, as opposed to a Jaccard index, where high values are closest. Because this function may transform the data from a scale where high values are close to one where low values are close, such functions should return a matrix with a lower_is_closer attribute set to TRUE to indicate that. If the lower_is_closer attribute is not set by matrix_scaling_fun, it is assumed to be the same as for the raw distance matrix, which may generate an error if the optimal_extreme of the distance matrix is not 'min'. This value will be used to set the corresponding $distances[[distance]]$pared_optimal_extreme field in the GSNData object.

In general, a scaling transformation is necessary because some potential distance metrics are in log-space and have skewed distributions and negative values (like log Fisher), or are actually similarity metrics, with higher values being closer. In this way they differ from standard distances and require transformation to be suitable for hierarchical clustering.

The default matrix_scaling_fun argument, distMat2UnitNormRank(), scales the data to a range between 0 and 1 and converts it to a uniform distribution. This may be a bit extreme for some purposes, but it allows the hierarchical clustering method to work with default values for most users, obviating the need to transform the data or adjust default parameters in many cases. Other values for this argument are identity() (which can be used when a transformation is not desired) and complement(), which for an input value $x$ returns $1 - x$ and is useful for transforming Jaccard indices and Szymkiewicz–Simpson overlap coefficients.

To produce a plot of the relationship between the raw and transformed/scaled pared distances, use gsnParedVsRawDistancePlot().
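
To make the contract for a scaling function concrete, here is a minimal sketch of two functions that follow the convention described above, namely returning a matrix that carries a lower_is_closer attribute. The names one_minus() and rank_to_unit() are hypothetical stand-ins written for illustration. They are not the package's complement() or distMat2UnitNormRank(), whose internals may differ, and the toy matrix with NA on the diagonal is likewise an assumption:

```r
# Hypothetical stand-in for a complement-style transform: turns a
# similarity in [0, 1] into a distance and flags the result as
# "lower is closer".
one_minus <- function( mat ) {
  out <- 1 - mat
  attr( out, "lower_is_closer" ) <- TRUE
  out
}

# Hypothetical stand-in for a rank-based scaler: replaces values by
# their ranks, rescaled to the interval [0, 1]. Ranking preserves the
# orientation of the input, so the caller states it explicitly.
rank_to_unit <- function( mat, lower_is_closer = TRUE ) {
  r <- rank( mat, na.last = "keep", ties.method = "average" )
  r_min <- min( r, na.rm = TRUE )
  r_max <- max( r, na.rm = TRUE )
  out <- mat
  out[] <- ( r - r_min ) / ( r_max - r_min )
  attr( out, "lower_is_closer" ) <- lower_is_closer
  out
}

# Toy Jaccard similarity matrix (higher values are closer).
gs <- c( "GS1", "GS2", "GS3" )
jaccard <- matrix( c(  NA, 0.8, 0.1,
                      0.8,  NA, 0.3,
                      0.1, 0.3,  NA ),
                   nrow = 3, byrow = TRUE, dimnames = list( gs, gs ) )

# Complement first, then rank-scale: the most similar pair (GS1, GS2)
# ends up with the smallest scaled distance.
scaled <- rank_to_unit( one_minus( jaccard ) )
```

A function written along these lines could in principle be supplied as matrix_scaling_fun, provided it accepts and returns a matrix in the form the package expects, which is not verified here.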

Examples

library(GSNA)

# In this example, we generate a gene set network from CERNO example
# data. We begin by subsetting the CERNO data for significant results:
sig_pathways.cerno <- subset( Bai_CiHep_DN.cerno, adj.P.Val <= 0.05 )

# Now create a gene set collection containing just the gene sets
# with significant CERNO results, by subsetting Bai_gsc.tmod using
# the gene set IDs as keys:
sig_pathways.tmod <- Bai_gsc.tmod[sig_pathways.cerno$ID]

# And obtain a background gene set from differential expression data:
background_genes <- toupper( rownames( Bai_CiHep_v_Fib2.de ) )

# Build a gene set network:
sig_pathways.GSN <-
   buildGeneSetNetworkJaccard(geneSetCollection = sig_pathways.tmod,
                              ref.background = background_genes )

# Now import the CERNO data:
sig_pathways.GSN <- gsnImportCERNO( sig_pathways.GSN,
                                    pathways_data = sig_pathways.cerno )

# Now we can pare the network. By default, the distances are complemented
# and converted into ranks for the sake of generating a network.
sig_pathways.GSN <- gsnPareNetGenericHierarchic( object = sig_pathways.GSN )

# However, for similarity metrics such as the Jaccard index or the
# Szymkiewicz-Simpson overlap coefficient, which range from 0 to 1 and in
# which higher values are "closer", complement() might be a good
# transformation as well.
sig_pathways.GSN <- gsnPareNetGenericHierarchic( object = sig_pathways.GSN,
                                           matrix_scaling_fun = complement )