consensusRepresentatives: Consensus selection of group representatives

Description

Given multiple data sets corresponding to the same variables and a grouping of variables into groups, the function selects a representative variable for each group using a variety of possible selection approaches. Typical uses include selecting a representative probe for each gene in microarray data.

Usage

consensusRepresentatives(
   mdx, 
   group, 
   colID, 
   consensusQuantile = 0, 
   method = "MaxMean", 
   useGroupHubs = TRUE, 
   calibration = c("none", "full quantile"), 
   selectionStatisticFnc = NULL, 
   connectivityPower = 1, 
   minProportionPresent = 1, 
   getRepresentativeData = TRUE, 
   statisticFncArguments = list(), 
   adjacencyArguments = list(), 
   verbose = 2, indent = 0)

Value

representatives: A named vector giving, for each group, the selected representative (the input colID or the variable (column) name in mdx). Names correspond to groups.
varSelected: A logical vector with one entry per variable (column) in input mdx (possibly after restriction to variables occurring in colID), TRUE if the column was selected as a representative.
representativeData: Only present if getRepresentativeData is TRUE; the input mdx restricted to the representative variables, with column names changed to the corresponding groups.

Arguments

mdx: A multiData structure. All sets must have the same columns.
group: Character vector whose components contain the group label (e.g. a character string) for each entry of colID. This vector must be of the same length as the vector colID. In gene expression applications, this vector could contain the gene symbol (or a co-expression module label).
colID: Character vector of column identifiers. This must include all the column names from mdx, but can include other values as well. Its entries must be unique (no duplicates) and no missing values are permitted.
consensusQuantile: A number between 0 and 1 giving the quantile probability for consensus calculation. 0 means the minimum value (true consensus) will be used.
method: Character string determining which method is used to choose the representative (when useGroupHubs is TRUE, this method is only used for groups with 2 variables). The following values can be used: "MaxMean" (default) or "MinMean" returns the variable with the highest or lowest mean value, respectively; "maxRowVariance" returns the variable with the highest variance; "absMaxMean" or "absMinMean" returns the variable with the highest or lowest mean absolute value, respectively; and "function" calls a user-supplied function (see the description of the argument selectionStatisticFnc). The built-in methods can be instructed to use robust analogs (median and median absolute deviation) by also specifying statisticFncArguments=list(robust = TRUE).
useGroupHubs: Logical: if TRUE, groups with 3 or more variables will be represented by the variable with the highest connectivity according to a signed weighted correlation network adjacency matrix among the variables in the group. The connectivity is defined as the row sum of the adjacency matrix. The signed weighted adjacency matrix is defined as A=(0.5+0.5*COR)^power, where power is determined by the argument connectivityPower and COR denotes the matrix of pairwise correlation coefficients among the variables in the group. Additional arguments to the underlying function adjacency can be specified using the argument adjacencyArguments below.
calibration: Character string describing the method of calibration of the selection statistic among the data sets. Recognized values are "none" (no calibration) and "full quantile" (quantile normalization).
selectionStatisticFnc: User-supplied function used to calculate the selection statistic when method above equals "function". The function must take the argument x (a matrix) and possibly other arguments that can be specified using statisticFncArguments below. The return value must be a vector with one component per column of x giving the selection statistic for each column.
connectivityPower: Positive number (typically integer) for specifying the soft-thresholding power used to construct the signed weighted adjacency matrix, see the description of useGroupHubs. This option is only used if useGroupHubs is TRUE.
minProportionPresent: A number between 0 and 1 specifying a filter of candidate variables. Specifically, for each group, the variable with the maximum consensus proportion of present data is found. Only variables whose consensus proportion of present data is at least minProportionPresent times the maximum consensus proportion are retained as candidates for being a representative.
getRepresentativeData: Logical: should the representative data, i.e., mdx restricted to the representative variables, be returned?
statisticFncArguments: A list giving further arguments to the selection statistic function. Can be used to supply additional arguments to the user-specified selectionStatisticFnc; the value list(robust = TRUE) can be used with the built-in functions to use their robust variants.
adjacencyArguments: Further arguments to the function adjacency, e.g. adjacencyArguments=list(corFnc = "bicor", corOptions = "use = 'p', maxPOutliers = 0.05") will select the robust correlation bicor with a good set of options. Note that the adjacency arguments type and power cannot be changed.
verbose: Level of verbosity; 0 means silent, larger values will cause progress messages to be printed.
indent: Indent for the diagnostic messages; each unit equals two spaces.
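
The hub selection described under useGroupHubs can be illustrated with a minimal sketch in base R. It uses cor directly rather than the adjacency function, and the toy data and the names x and hub are assumptions made for illustration; it is not the function's actual implementation.

```r
# Toy data: 10 samples (rows) by 4 variables (columns) from one hypothetical group.
set.seed(1)
x <- matrix(rnorm(40), nrow = 10, ncol = 4,
            dimnames = list(NULL, paste0("probe", 1:4)))

connectivityPower <- 1

# Signed weighted adjacency: A = (0.5 + 0.5 * COR)^power.
COR <- cor(x, use = "pairwise.complete.obs")
A <- (0.5 + 0.5 * COR)^connectivityPower

# Connectivity is the row sum of the adjacency matrix. The diagonal adds the
# same constant (1) to every variable, so it does not change the ranking.
connectivity <- rowSums(A)

# The hub is the variable with the highest connectivity.
hub <- names(which.max(connectivity))
```

Because the adjacency is signed, strongly negative correlations map to values near 0 and contribute little to connectivity.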

Author

Peter Langfelder, based on code by Jeremy Miller

Details

This function was inspired by collapseRows, but there are also important differences. This function focuses on selecting representatives; when summarization is more important, collapseRows provides more flexibility since it does not require that a single representative be selected.

This function and collapseRows use different input and output conventions; user-specified functions need to be tailored differently for collapseRows than for consensusRepresentatives.
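
As an illustration of the convention for consensusRepresentatives, here is a minimal sketch of a user-specified selection statistic that follows the contract stated for selectionStatisticFnc: it takes a matrix x and returns one value per column. The name iqrStatistic, the choice of the interquartile range, and the na.rm argument are assumptions made for this example.

```r
# Hypothetical selection statistic: per-column interquartile range.
# x is a matrix; the result has one component per column of x.
iqrStatistic <- function(x, na.rm = TRUE) {
  apply(x, 2, IQR, na.rm = na.rm)
}

# Toy input with three variables (columns).
x <- cbind(a = c(1, 2, 3, 4, 5),
           b = c(1, 1, 1, 1, 1),
           c = c(0, 10, 20, 30, NA))

stat <- iqrStatistic(x)
```

Such a function would be supplied as selectionStatisticFnc = iqrStatistic together with method = "function"; any extra arguments (here na.rm) would go into statisticFncArguments.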

Missing data are allowed and are treated as missing at random. If colID is NULL, it is replaced by the variable names in mdx.

All groups with a single variable are represented by that variable, unless the consensus proportion of present data in the variable is lower than minProportionPresent, in which case the variable and the group are excluded from the output.
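
The filter controlled by minProportionPresent can be sketched as follows for a single group. The toy data sets are made up, and the sketch assumes that the consensus proportion of present data is the minimum across sets (the case consensusQuantile = 0); the function's internal calculation may differ in detail.

```r
# Two toy data sets with the same 3 variables (one group); NA marks missing data.
set1 <- cbind(p1 = c(1, 2, 3, 4), p2 = c(1, NA, NA, 4), p3 = c(NA, NA, NA, 1))
set2 <- cbind(p1 = c(1, 2, NA, 4), p2 = c(1, 2, 3, 4), p3 = c(1, NA, 3, 4))

# Proportion of present (non-missing) data per variable in each set.
propPresent <- cbind(set1 = colMeans(!is.na(set1)),
                     set2 = colMeans(!is.na(set2)))

# Assumption: consensus across sets is the minimum (consensusQuantile = 0).
consensusPresent <- apply(propPresent, 1, min)

# Keep variables whose consensus proportion is at least minProportionPresent
# times the best consensus proportion in the group.
minProportionPresent <- 0.9
keep <- consensusPresent >= minProportionPresent * max(consensusPresent)
```

In this toy group only p1 remains a candidate, since its consensus proportion (0.75) is the group maximum and the other two fall below 0.9 times that value.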

For all variables belonging to groups with 2 variables (when useGroupHubs=TRUE) or with at least 2 variables (when useGroupHubs=FALSE), selection statistics are calculated in each set (e.g., the selection statistic may be the mean, variance, etc). This results in a matrix of selection statistics (one entry per variable per data set). The selection statistics are next optionally calibrated (normalized) between sets to make them comparable; currently the only implemented calibration method is quantile normalization.
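
To make the calibration step concrete, the following base R sketch applies a textbook quantile normalization to a toy matrix of selection statistics (variables in rows, data sets in columns). The helper quantileNormalize is hypothetical, assumes no missing values, and is not necessarily how the function implements calibration = "full quantile".

```r
# Toy matrix of selection statistics: one row per variable, one column per set.
stat <- matrix(c(5, 2, 3, 4,
                 40, 10, 30, 20),
               ncol = 2,
               dimnames = list(paste0("v", 1:4), c("set1", "set2")))

# Hypothetical helper: textbook quantile normalization (no missing values).
quantileNormalize <- function(m) {
  # Reference distribution: mean of the sorted columns.
  reference <- rowMeans(apply(m, 2, sort))
  # Replace each value by the reference value at its within-column rank.
  out <- apply(m, 2, function(col) reference[rank(col, ties.method = "first")])
  dimnames(out) <- dimnames(m)
  out
}

calibrated <- quantileNormalize(stat)
```

After calibration both columns contain the same set of values, so a statistic measured on very different scales in the two sets (here roughly 2 to 5 versus 10 to 40) becomes directly comparable.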

For each variable, the consensus selection statistic is calculated as the consensus of the (calibrated) selection statistics across the data sets. The 'consensus' of a vector (say 'x') is simply defined as the quantile with probability consensusQuantile of the vector x. Important exception: for the "MinMean" and "absMinMean" methods, the consensus is the quantile with probability 1-consensusQuantile, since the idea of the consensus is to select the worst (or close to worst) value across the data sets.
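
The consensus calculation can be sketched with base R's quantile function. The toy matrix and the variable names consensusHigh and consensusLow are assumptions made for illustration.

```r
# Toy calibrated selection statistics: variables in rows, data sets in columns.
calibrated <- matrix(c(3, 1, 2,
                       2, 3, 1),
                     ncol = 2,
                     dimnames = list(c("v1", "v2", "v3"), c("set1", "set2")))

consensusQuantile <- 0

# Methods where higher is better (e.g. "MaxMean"): quantile at consensusQuantile,
# i.e. the minimum across sets when consensusQuantile = 0.
consensusHigh <- apply(calibrated, 1, quantile,
                       probs = consensusQuantile, names = FALSE)

# Methods where lower is better ("MinMean", "absMinMean"): quantile at
# 1 - consensusQuantile, i.e. the maximum across sets when consensusQuantile = 0.
consensusLow <- apply(calibrated, 1, quantile,
                      probs = 1 - consensusQuantile, names = FALSE)
```

In both cases the consensus is the least favorable value across the data sets, so a variable scores well only if it scores well in every set.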

For each group, the representative is selected as the variable with the best (typically highest, but for "MinMean" and "absMinMean" methods the lowest) consensus selection statistic.
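
This final selection step can be sketched in base R as follows. The consensus statistics, the group labels, and the helper pickBest are made up for illustration and do not reproduce the function's internal code.

```r
# Toy consensus selection statistics and group labels for 5 variables.
consensusStat <- c(p1 = 2.0, p2 = 3.5, p3 = 1.0, p4 = 0.5, p5 = 4.0)
group <- c(p1 = "geneA", p2 = "geneA", p3 = "geneB", p4 = "geneB", p5 = "geneC")

# Hypothetical helper: pick the best variable in each group.
# lowest = TRUE corresponds to the "MinMean" and "absMinMean" methods.
pickBest <- function(stat, group, lowest = FALSE) {
  pick <- if (lowest) which.min else which.max
  sapply(split(stat, group), function(s) names(s)[pick(s)])
}

representatives <- pickBest(consensusStat, group)
representativesLowest <- pickBest(consensusStat, group, lowest = TRUE)
```

The result is a character vector named by group, mirroring the shape of the representatives component described under Value.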