filterRedundant: This function removes redundant features from a data.frame

Description

Before computing the proportion of overlap between ranked vectors of features, the redundant features must be removed. This can be accomplished using any of the methods implemented in the filterRedundant function, as explained below.

Usage

filterRedundant(object, method=c("maxORmin", "geoMean", "mean", "median","random"), idCol=1, byCol=2, absolute=TRUE, decreasing=TRUE, trim=0, ...)

Arguments

object

a data.frame from which redundant features (rows) must be removed.

method

character. The method used for removing redundancy. Currently available methods are: maxORmin (the default), geoMean, mean, median, and random (see Details below).

idCol

character or numeric. Name or index of the column containing redundant identifiers (e.g. ENTREZID, SYMBOLS, ...).

byCol

character or numeric. Name or index of the column containing the ranking statistics (used only with maxORmin method).

absolute

logical. Indicates whether the absolute values of the statistics in the byCol column should be used when reordering (used only with maxORmin method).

decreasing

logical. Indicates whether reordering should be decreasing or not (used only with maxORmin method).

trim

numeric. The amount of trimming to apply when computing the mean; the default of 0 gives the untrimmed mean (used only with mean method).

...

further arguments to be passed (not currently implemented).

Value

A data.frame with no more rows than the input, containing one row for each unique identifier in the column specified by the idCol argument.

Details

The maxORmin method removes redundant features by selecting the rows that correspond to the maximum or minimum value of a selected statistic. With this approach, redundant features are first ranked in increasing or decreasing order, as defined by the decreasing argument, using the ranking statistic defined by byCol, either on its original or absolute scale, as defined by the absolute argument. Subsequently, the data.frame rows corresponding to redundant identifiers are removed, after these have been identified in the column defined by idCol using the duplicated function.

The mean, median, geoMean, and random methods provide alternative ways of summarizing the numerical values corresponding to redundant features, as defined by the idCol argument: mean takes the average, median takes the median, geoMean takes the geometric mean, and random selects a random value.
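The maxORmin procedure described above can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy data frame and the sketchMaxORmin helper are hypothetical names introduced only for illustration, and the per-identifier mean via aggregate() is just one plausible way to mimic the summarizing methods.

```r
# Hypothetical toy data: identifiers "A" and "B" each appear twice.
toy <- data.frame(
  SYMBOL = c("A", "B", "A", "C", "B"),
  t      = c(1.2, -3.5, -2.8, 0.4, 2.1),
  stringsAsFactors = FALSE
)

# Illustrative helper (an assumption, not part of the package):
# rank the rows, then keep the top-ranked row for each identifier.
sketchMaxORmin <- function(object, idCol = 1, byCol = 2,
                           absolute = TRUE, decreasing = TRUE) {
  stat <- object[[byCol]]
  if (absolute) stat <- abs(stat)
  # Reorder the rows by the (optionally absolute) ranking statistic.
  ranked <- object[order(stat, decreasing = decreasing), , drop = FALSE]
  # duplicated() flags every occurrence of an identifier after the first,
  # so negating it keeps only the best-ranked row per identifier.
  ranked[!duplicated(ranked[[idCol]]), , drop = FALSE]
}

filtered <- sketchMaxORmin(toy, idCol = "SYMBOL", byCol = "t")

# A summarizing alternative: average the statistic within each identifier.
averaged <- aggregate(t ~ SYMBOL, data = toy, FUN = mean)
```

With these toy values, filtered keeps the rows for B (t = -3.5), A (t = -2.8), and C (t = 0.4), in that order, because the ranking uses the absolute value of t. Replacing FUN = mean with FUN = median in the aggregate() call gives the corresponding median summary.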

Examples

###load data
data(matchBoxExpression)

###check the number of rows before removing redundant identifiers
sapply(matchBoxExpression, nrow)

###the column name for the identifiers
idCol <- "SYMBOL"

###the column name for the ranking statistics
byCol <- "t"

###use lapply to remove redundancy from all data.frames
###default method is "maxORmin"
newMatchBoxExpression <- lapply(matchBoxExpression, filterRedundant, idCol=idCol, byCol=byCol)

###recheck number of rows
sapply(newMatchBoxExpression, nrow)