blockwiseModules: Automatic network construction and module detection

Description

This function performs automatic network construction and module detection on large expression datasets in a block-wise manner.

Usage

blockwiseModules(
  # Input data

  datExpr, 

  # Data checking options

  checkMissingData = TRUE,

  # Options for splitting data into blocks

  blocks = NULL,
  maxBlockSize = 5000,
  randomSeed = 12345,

  # Network construction arguments: correlation options

  corType = "pearson",
  maxPOutliers = 1, 
  quickCor = 0,
  pearsonFallback = "individual",
  cosineCorrelation = FALSE,

  # Adjacency function options

  power = 6,
  networkType = "unsigned",

  # Topological overlap options

  TOMType = "signed",
  TOMDenom = "min",

  # Saving or returning TOM

  getTOMs = NULL,
  saveTOMs = FALSE, 
  saveTOMFileBase = "blockwiseTOM",

  # Basic tree cut options

  deepSplit = 2,
  detectCutHeight = 0.995, 
  minModuleSize = min(20, ncol(datExpr)/2 ),

  # Advanced tree cut options

  maxCoreScatter = NULL, minGap = NULL,
  maxAbsCoreScatter = NULL, minAbsGap = NULL,
  minSplitHeight = NULL, minAbsSplitHeight = NULL,

  useBranchEigennodeDissim = FALSE,
  minBranchEigennodeDissim = mergeCutHeight,

  pamStage = TRUE, pamRespectsDendro = TRUE,

  # Gene reassignment, module trimming, and module "significance" criteria

  reassignThreshold = 1e-6,
  minCoreKME = 0.5, 
  minCoreKMESize = minModuleSize/3,
  minKMEtoStay = 0.3,

  # Module merging options

  mergeCutHeight = 0.15, 
  impute = TRUE, 
  trapErrors = FALSE, 

  # Output options

  numericLabels = FALSE,

  # Options controlling behaviour

  nThreads = 0,
  verbose = 0, indent = 0,
  ...)
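
A minimal usage sketch follows. The simulated data, the gene names and the latent variables latentA and latentB are hypothetical and serve only to give the function something to cluster; the argument values are illustrative, not recommendations. The call is guarded so that the snippet also runs where WGCNA is not installed.

```r
# Simulate a small expression matrix: rows are samples, columns are genes.
set.seed(1)
nSamples <- 40
nGenes <- 200
latentA <- rnorm(nSamples)   # hypothetical driver of the first gene group
latentB <- rnorm(nSamples)   # hypothetical driver of the second gene group
datExpr <- matrix(rnorm(nSamples * nGenes), nSamples, nGenes)
datExpr[, 1:60] <- datExpr[, 1:60] + 2 * latentA      # added to every column
datExpr[, 61:120] <- datExpr[, 61:120] + 2 * latentB
datExpr <- as.data.frame(datExpr)
names(datExpr) <- paste0("gene", seq_len(nGenes))

# The module detection itself runs only when WGCNA is available.
if (requireNamespace("WGCNA", quietly = TRUE)) {
  net <- WGCNA::blockwiseModules(
    datExpr,
    power = 6,
    networkType = "unsigned",
    TOMType = "unsigned",
    minModuleSize = 20,
    mergeCutHeight = 0.15,
    numericLabels = TRUE,
    saveTOMs = FALSE,
    verbose = 0
  )
  print(table(net$colors))   # number of genes per module label
}
```

With numericLabels = TRUE the labels in net$colors are integers; the components of the returned list are described under Value.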

Arguments

datExpr

expression data. A data frame in which columns are genes and rows are samples. NAs are allowed, but not too many.

checkMissingData

logical: should data be checked for excessive numbers of missing entries in genes and samples, and for genes with zero variance? See details.

blocks

optional specification of blocks in which hierarchical clustering and module detection should be performed. If given, must be a numeric vector with one entry per column (gene) of datExpr giving the number of the block to which the corresp

maxBlockSize

integer giving maximum block size for module detection. Ignored if blocks above is non-NULL. Otherwise, if the number of genes in datExpr exceeds maxBlockSize, genes will be pre-clustered into blocks whose size sho

randomSeed

integer to be used as seed for the random number generator before the function starts. If a current seed exists, it is saved and restored upon exit. If NULL is given, the function will not save and restore the seed.

corType

character string specifying the correlation to be used. Allowed values are (unique abbreviations of) "pearson" and "bicor", corresponding to Pearson and biweight midcorrelation, respectively. Missing values are handled using t

maxPOutliers

only used for corType=="bicor". Specifies the maximum percentile of data that can be considered outliers on either side of the median separately. For each side of the median, if higher percentile than maxPOutliers is conside

quickCor

real number between 0 and 1 that controls the handling of missing data in the calculation of correlations. See details.
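
The tolerance rule described under Details can be sketched as follows. This is an illustration of the rule as worded there, not the package's internal code; the helper name needs_exact_recalc is hypothetical.

```r
# Hypothetical helper: decide whether the approximate calculation would be
# abandoned for a pair of columns, following the rule given under Details.
needs_exact_recalc <- function(x, y, quick) {
  stopifnot(length(x) == length(y), quick >= 0, quick <= 1)
  n_for_mean <- sum(!is.na(x))              # entries behind the mean/variance of x
  n_for_cov <- sum(!is.na(x) & !is.na(y))   # entries usable for the covariance
  (n_for_mean - n_for_cov) > quick * length(x)
}

x <- c(1, 2, NA, 4, 5, 6, 7, 8, 9, 10)
y <- c(2, NA, 3, NA, 6, 5, 8, 7, 10, 9)

strict <- needs_exact_recalc(x, y, quick = 0)     # TRUE: any mismatch triggers it
lenient <- needs_exact_recalc(x, y, quick = 0.5)  # FALSE: a mismatch of 2 is tolerated
```

Here x has 9 observed entries but only 7 are observed jointly with y, so the mismatch is 2 rows out of 10.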

pearsonFallback

Specifies whether the bicor calculation, if used, should revert to Pearson when median absolute deviation (mad) is zero. Recognized values are (abbreviations of) "none", "individual", "all". If set to "none", zero mad will r

cosineCorrelation

logical: should the cosine version of the correlation calculation be used? The cosine calculation differs from the standard one in that it does not subtract the mean.
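
The difference between the two calculations can be seen in a small base R sketch. The helper cosine_cor is hypothetical and is written out only for illustration; it is not a WGCNA function.

```r
# Cosine version: same formula as Pearson, but without subtracting the means.
cosine_cor <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

x <- c(1, 2, 3)
y <- c(3, 2, 1)

pearson <- cor(x, y)         # -1: perfectly anti-correlated after centering
cosine <- cosine_cor(x, y)   # 10/14: both vectors are positive, so it stays positive
```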

power

soft-thresholding power for network construction.

networkType

network type. Allowed values are (unique abbreviations of) "unsigned", "signed", "signed hybrid". See adjacency.
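
As a rough illustration of how power and networkType interact, the sketch below applies the usual soft-thresholding definitions to a correlation matrix. The function soft_adjacency is hypothetical and assumes the definitions documented for adjacency; consult that function for the authoritative behaviour.

```r
# Hypothetical re-implementation of soft thresholding, for illustration only.
soft_adjacency <- function(corMat, power = 6,
                           type = c("unsigned", "signed", "signed hybrid")) {
  type <- match.arg(type)
  switch(type,
    "unsigned" = abs(corMat)^power,              # sign of correlation is ignored
    "signed" = ((1 + corMat) / 2)^power,         # negative correlations map near 0
    "signed hybrid" = pmax(corMat, 0)^power      # negative correlations become 0
  )
}

corMat <- matrix(c(1, -0.5, -0.5, 1), 2, 2)
aUnsigned <- soft_adjacency(corMat, power = 2, type = "unsigned")
aSigned <- soft_adjacency(corMat, power = 2, type = "signed")
aHybrid <- soft_adjacency(corMat, power = 2, type = "signed hybrid")
```

For a correlation of -0.5 and power 2, the three types give 0.25, 0.0625 and 0 respectively.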

TOMType

one of "none", "unsigned", "signed". If "none", adjacency will be used for clustering. If "unsigned", the standard TOM will be used (more generally, TOM function will receive the adjacency

TOMDenom

a character string specifying the TOM variant to be used. Recognized values are "min" giving the standard TOM described in Zhang and Horvath (2005), and "mean" in which the min function in the denominator is repl
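
The two denominators can be compared in a small base R sketch of the unsigned topological overlap from Zhang and Horvath (2005). The function tom_sketch is hypothetical and ignores the signed variant and all performance considerations; it assumes that "mean" simply swaps the minimum of the two connectivities for their mean.

```r
# Hypothetical unsigned TOM for a symmetric adjacency matrix with entries in [0, 1].
tom_sketch <- function(adj, denom = c("min", "mean")) {
  denom <- match.arg(denom)
  diag(adj) <- 0
  k <- rowSums(adj)                      # connectivity of each node
  shared <- adj %*% adj                  # sum over u of a_iu * a_uj
  kk <- if (denom == "min") {
    outer(k, k, pmin)
  } else {
    outer(k, k, function(a, b) (a + b) / 2)
  }
  tom <- (shared + adj) / (kk + 1 - adj)
  diag(tom) <- 1
  tom
}

adj <- matrix(c(1.0, 0.8, 0.2,
                0.8, 1.0, 0.4,
                0.2, 0.4, 1.0), 3, 3, byrow = TRUE)
tomMin <- tom_sketch(adj, "min")
tomMean <- tom_sketch(adj, "mean")
```

For nodes 1 and 2 the numerator is 0.08 + 0.8 in both cases; the "min" denominator is 1.2 and the "mean" denominator is 1.3, so the "mean" variant yields the smaller overlap.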

getTOMs

deprecated, please use saveTOMs below.

saveTOMs

logical: should the consensus topological overlap matrices for each block be saved and returned?

saveTOMFileBase

character string containing the file name base for files containing the consensus topological overlaps. The full file names have "block.1.RData", "block.2.RData" etc. appended. These files are standard R data files and can be l

deepSplit

integer value between 0 and 4. Provides a simplified control over how sensitive module detection should be to module splitting, with 0 least and 4 most sensitive. See cutreeDynamic for

detectCutHeight

dendrogram cut height for module detection. See cutreeDynamic for more details.

minModuleSize

minimum module size for module detection. See cutreeDynamic for more details.

maxCoreScatter

maximum scatter of the core for a branch to be a cluster, given as the fraction of cutHeight relative to the 5th percentile of joining heights. See cutreeDynamic for more

minGap

minimum cluster gap given as the fraction of the difference between cutHeight and the 5th percentile of joining heights. See cutreeDynamic for more details.

maxAbsCoreScatter

maximum scatter of the core for a branch to be a cluster given as absolute heights. If given, overrides maxCoreScatter. See cutreeDynamic for more details.

minAbsGap

minimum cluster gap given as absolute height difference. If given, overrides minGap. See cutreeDynamic for more details.

minSplitHeight

Minimum split height given as the fraction of the difference between cutHeight and the 5th percentile of joining heights. Branches merging below this height will automatically be merged. Defaults to zero but is used only if minAbsSpli

minAbsSplitHeight

Minimum split height given as an absolute height. Branches merging below this height will automatically be merged. If not given (default), will be determined from minSplitHeight above.

useBranchEigennodeDissim

Logical: should branch eigennode (eigengene) dissimilarity be considered when merging branches in Dynamic Tree Cut?

minBranchEigennodeDissim

Minimum consensus branch eigennode (eigengene) dissimilarity for branches to be considered separate. The branch eigennode dissimilarity in individual sets is simply 1-correlation of the eigennodes; the consensus is defined as quantile with probability

pamStage

logical. If TRUE, the second (PAM-like) stage of module detection will be performed. See cutreeDynamic for more details.

pamRespectsDendro

Logical, only used when pamStage is TRUE. If TRUE, the PAM stage will respect the dendrogram in the sense that an object can be PAM-assigned only to clusters that lie below it on the branch that the object is merged i

minCoreKME

a number between 0 and 1. If a detected module does not have at least minCoreKMESize genes with eigengene connectivity at least minCoreKME, the module is disbanded (its genes are unlabeled and returned to the pool of genes wa

minCoreKMESize

see minCoreKME above.

minKMEtoStay

genes whose eigengene connectivity to their module eigengene is lower than minKMEtoStay are removed from the module.
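
The interplay of minKMEtoStay, minCoreKME and minCoreKMESize can be sketched in base R for a single candidate module. The data are hypothetical, and the eigengene is approximated here as the first left singular vector of the scaled expression matrix, which is an assumption of this sketch; see moduleEigengenes for the actual calculation.

```r
# Hypothetical module of four genes measured in eight samples.
z <- 1:8
a <- rep(c(1, -1), 4)
expr <- cbind(g1 = z,
              g2 = z + 0.5 * a,
              g3 = z - 0.5 * a,
              g4 = c(1, -1, -1, 1, 1, -1, -1, 1))   # uncorrelated with g1..g3

# Approximate module eigengene: first left singular vector of scaled data,
# oriented to correlate positively with the average scaled expression.
scaled <- scale(expr)
ME <- svd(scaled)$u[, 1]
if (cor(ME, rowMeans(scaled)) < 0) ME <- -ME

# Eigengene connectivity (kME) of each gene.
kME <- as.vector(cor(expr, ME))
names(kME) <- colnames(expr)

minKMEtoStay <- 0.3
minCoreKME <- 0.5
minCoreKMESize <- 2

stays <- kME >= minKMEtoStay                               # genes kept in the module
moduleSurvives <- sum(kME >= minCoreKME) >= minCoreKMESize # otherwise disbanded
```

In this example g1, g2 and g3 stay, g4 is removed, and the module survives because three genes exceed minCoreKME.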

reassignThreshold

p-value ratio threshold for reassigning genes between modules. See Details.

mergeCutHeight

dendrogram cut height for module merging.

impute

logical: should imputation be used for module eigengene calculation? See moduleEigengenes for more details.

trapErrors

logical: should errors in calculations be trapped?

numericLabels

logical: should the returned modules be labeled by colors (FALSE), or by numbers (TRUE)?

nThreads

non-negative integer specifying the number of parallel threads to be used by certain parts of correlation calculations. This option only has an effect on systems on which a POSIX thread library is available (which currently includes Linux and Mac OSX, b

verbose

integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.

indent

indentation for diagnostic messages. Zero means no indentation, each unit adds two spaces.

...

Other arguments.

Value

A list with the following components:

colors

a vector of color or numeric module labels for all genes.

unmergedColors

a vector of color or numeric module labels for all genes before module merging.

MEs

a data frame containing module eigengenes of the found modules (given by colors).

goodSamples

numeric vector giving indices of good samples, that is, samples that do not have too many missing entries.

goodGenes

numeric vector giving indices of good genes, that is, genes that do not have too many missing entries.

dendrograms

a list whose components contain hierarchical clustering dendrograms of genes in each block.

TOMFiles

if saveTOMs==TRUE, a vector of character strings, one string per block, giving the file names of files (relative to current directory) in which blockwise topological overlaps were saved.

blockGenes

a list whose components give the indices of genes in each block.

blocks

if input blocks was given, its copy; otherwise a vector of length equal to the number of genes giving the block label for each gene. Note that block labels are not necessarily sorted in the order in which the blocks were processed (since we do not require this for the input blocks). See blockOrder below.

blockOrder

a vector giving the order in which blocks were processed and in which blockGenes above is returned. For example, blockOrder[1] contains the label of the first-processed block.

MEsOK

logical indicating whether the module eigengenes were calculated without errors.
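
The relationship between blocks, blockOrder and blockGenes can be illustrated with a hand-built list that mimics part of the return value. The object below is a mock with hypothetical contents, not output of the function.

```r
# Mock object mimicking part of the documented return value (hypothetical data).
net <- list(
  colors = c("blue", "blue", "turquoise", "grey", "blue"),
  blocks = c(2, 2, 1, 1, 2),               # block label of each gene
  blockOrder = c(2, 1),                    # block labelled 2 was processed first
  blockGenes = list(c(1, 2, 5), c(3, 4))   # gene indices, in processing order
)

# Walk through the blocks in the order in which they were processed.
for (i in seq_along(net$blockGenes)) {
  idx <- net$blockGenes[[i]]
  cat("Processed block", i, "(label", net$blockOrder[i], "):",
      paste(net$colors[idx], collapse = ", "), "\n")
}

# Module labels of the genes in the first-processed block.
firstBlockColors <- net$colors[net$blockGenes[[1]]]
```

Indexing colors by blockGenes[[i]] in this way gives the labels that belong to dendrograms[[i]].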

Details

Before module detection starts, genes and samples are optionally checked for the presence of NAs. Genes and/or samples that have too many NAs are flagged as bad and removed from the analysis; bad genes will be automatically labeled as unassigned, while the returned eigengenes will have NA entries for all bad samples.

If blocks is not given and the number of genes exceeds maxBlockSize, genes are pre-clustered into blocks using the function projectiveKMeans; otherwise all genes are treated in a single block.

For each block of genes, the network is constructed and (if requested) topological overlap is calculated. If requested, the topological overlaps are returned as part of the return value list. Genes are then clustered using average linkage hierarchical clustering and modules are identified in the resulting dendrogram by the Dynamic Hybrid tree cut.

Found modules are trimmed of genes whose correlation with the module eigengene (KME) is less than minKMEtoStay. Modules in which fewer than minCoreKMESize genes have KME higher than minCoreKME are disbanded, i.e., their constituent genes are pronounced unassigned.

After all blocks have been processed, the function checks whether there are genes whose KME in the module they are assigned to is lower than their KME to another module. If p-values of the higher correlations are smaller than those of the native module by the factor reassignThreshold, the gene is re-assigned to the closer module.

In the last step, modules whose eigengenes are highly correlated are merged. This is achieved by clustering module eigengenes using the dissimilarity given by one minus their correlation, cutting the dendrogram at the height mergeCutHeight and merging all modules on each branch. The process is iterated until no modules are merged. See mergeCloseModules for more details on module merging.

The argument quickCor specifies the precision of handling of missing data in the correlation calculations. Zero will cause all calculations to be executed precisely, which may be significantly slower than calculations without missing data. Progressively higher values will speed up the calculations but introduce progressively larger errors.

Without missing data, all column means and variances can be pre-calculated before the covariances are calculated. When missing data are present, exact calculations require the column means and variances to be calculated for each covariance. The approximate calculation uses the pre-calculated mean and variance and simply ignores missing data in the covariance calculation. If the number of missing data is high, the pre-calculated means and variances may be very different from the actual ones, thus potentially introducing large errors.

The quickCor value times the number of rows specifies the maximum difference in the number of missing entries for mean and variance calculations on the one hand and covariance on the other hand that will be tolerated before a recalculation is triggered. The hope is that if only a few missing data are treated approximately, the error introduced will be small but the potential speedup can be significant.
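
A single pass of the module merging step can be sketched with base R clustering functions. The eigengenes below are hypothetical, and the sketch only shows which modules would fall on the same branch; the function itself recalculates eigengenes and iterates, as described above (see mergeCloseModules).

```r
# Hypothetical eigengenes for three modules over eight samples.
z <- 1:8
a <- rep(c(1, -1), 4)
MEs <- data.frame(
  MEblue = z,
  MEbrown = z + 0.5 * a,                      # nearly identical to MEblue
  MEgreen = c(1, -1, -1, 1, 1, -1, -1, 1)     # unrelated to the other two
)

mergeCutHeight <- 0.15

# Dissimilarity is one minus the eigengene correlation.
MEDiss <- 1 - cor(MEs)
tree <- hclust(as.dist(MEDiss), method = "average")

# Modules that share a group after the cut would be merged.
mergedGroup <- cutree(tree, h = mergeCutHeight)
```

Here MEblue and MEbrown have a dissimilarity of about 0.02 and end up in the same group, while MEgreen remains on its own.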

References

Bin Zhang and Steve Horvath (2005) "A General Framework for Weighted Gene Co-Expression Network Analysis", Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 17