nsFilter: Filtering of Features in an ExpressionSet

Description

The function nsFilter tries to provide a one-stop shop for different options of filtering (removing) features from an ExpressionSet. Filtering features exhibiting little variation, or a consistently low signal, across samples can be advantageous for the subsequent data analysis (Bourgon et al.). Furthermore, one may decide that there is little value in considering features with insufficient annotation.

Usage

nsFilter(eset, require.entrez=TRUE,
    require.GOBP=FALSE, require.GOCC=FALSE,
    require.GOMF=FALSE, require.CytoBand=FALSE,
    remove.dupEntrez=TRUE, var.func=IQR,
    var.cutoff=0.5, var.filter=TRUE,
    filterByQuantile=TRUE, feature.exclude="^AFFX", ...)
varFilter(eset, var.func=IQR, var.cutoff=0.5, filterByQuantile=TRUE)
featureFilter(eset, require.entrez=TRUE,
    require.GOBP=FALSE, require.GOCC=FALSE,
    require.GOMF=FALSE, require.CytoBand=FALSE,
    remove.dupEntrez=TRUE, feature.exclude="^AFFX")

Arguments

eset

an ExpressionSet object

var.func

The function used as the per-feature filtering statistic. This function should return a numeric vector of length one when given a numeric vector as input.

var.filter

A logical indicating whether to perform filtering based on var.func.

filterByQuantile

A logical indicating whether var.cutoff is to be interpreted as a quantile of all var.func values (the default), or as an absolute value.

var.cutoff

A numeric value. If var.filter is TRUE, features whose value of var.func is less than either: the var.cutoff-quantile of all var.func values (if filterByQuantile is TRUE), or var.cutoff (if filterByQuantile is FALSE) will be removed.

require.entrez

If TRUE, filter out features without an Entrez Gene ID annotation. If using an annotation package where an identifier system other than Entrez Gene IDs is used as the central ID, then that ID will be required instead.

require.GOBP, require.GOCC, require.GOMF

If TRUE, filter out features whose target genes are not annotated to at least one GO term in the BP, CC or MF ontology, respectively.

require.CytoBand

If TRUE, filter out features whose target genes have no mapping to cytoband locations.

remove.dupEntrez

If TRUE and there are features mapping to the same Entrez Gene ID (or equivalent), then the feature with the largest value of var.func will be retained and the other(s) removed.

feature.exclude

A character vector of regular expressions. Feature identifiers (i.e. value of featureNames(eset)) that match one of the specified patterns will be filtered out. The default value is intended to filter out Affymetrix quality control probe sets.

...

Unused, but available for specializing methods.
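
The pattern matching described for feature.exclude can be illustrated with a minimal base R sketch. The feature identifiers below are made up for illustration, and the code only mimics the documented behaviour (dropping identifiers that match any of the patterns); it is not the package's internal implementation.

```r
# Hypothetical feature identifiers, invented for this illustration.
feature_ids <- c("AFFX-BioB-5_at", "1000_at", "1001_at",
                 "AFFX-CreX-3_at", "31307_at")

# Same value as the default of feature.exclude.
patterns <- c("^AFFX")

# TRUE for every identifier that matches at least one pattern.
matches_any <- Reduce(`|`, lapply(patterns, grepl, x = feature_ids))

# Identifiers that would survive this filtering step.
kept_ids <- feature_ids[!matches_any]
print(kept_ids)
```

Supplying several patterns in the character vector works the same way in this sketch, because the per-pattern matches are combined with a logical OR.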

Value

For nsFilter a list consisting of:

eset

the filtered ExpressionSet

filter.log

a list giving details of how many probe sets were removed for each filtering step performed.

For both varFilter and featureFilter the filtered ExpressionSet.

Details

This section explains the effect of filtering on type I error rate estimation and control in subsequent hypothesis testing. See also the paper by Bourgon et al.

Marginal type I errors: Filtering on the basis of a statistic that is independent of the test statistic used for detecting differential gene expression can increase the detection rate at the same marginal type I error. This is clearly the case for filter criteria that do not depend on the data, such as the annotation-based criteria provided by the nsFilter and featureFilter functions. However, marginal type I error can also be controlled for certain types of data-dependent criteria.

Call $U^I$ the stage 1 filter statistic: a function applied feature by feature, whose value determines whether the feature passes to stage 2, and which depends only on the data for that feature and not on any other feature. Call $U^{II}$ the stage 2 test statistic for differential expression. Sufficient conditions for marginal type I error control are:

$U^I$ is the overall (across all samples) variance or mean, $U^{II}$ is the t-statistic (or any other scale and location invariant statistic), and the data are normally distributed and exchangeable across samples.
$U^I$ is the overall mean, $U^{II}$ is the moderated t-statistic (as in limma's eBayes function), and the data are normally distributed and exchangeable.
$U^I$ is a function that is independent of the sample class labels (e.g. overall mean, median, variance, IQR), $U^{II}$ is the Wilcoxon rank sum statistic, and the data are exchangeable.
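
The first condition can be checked informally with a small simulation. The following base R sketch assumes normally distributed null data in two groups of five samples (sizes chosen arbitrarily for illustration), filters on the overall variance, and then compares the fraction of pooled t-test p-values below 0.05 before and after filtering. It is an illustration under these assumptions, not part of the package.

```r
set.seed(1)
n_features  <- 10000   # arbitrary size, chosen for illustration
n_per_group <- 5
group <- rep(c(1, 2), each = n_per_group)

# Null data: no feature is differentially expressed.
x <- matrix(rnorm(n_features * 2 * n_per_group), nrow = n_features)

# Stage 1: overall variance, computed without looking at the group labels.
overall_var <- apply(x, 1, var)
pass <- overall_var >= quantile(overall_var, probs = 0.5)

# Stage 2: pooled two-sample t-statistic (equal group sizes).
m1 <- rowMeans(x[, group == 1])
m2 <- rowMeans(x[, group == 2])
v1 <- apply(x[, group == 1], 1, var)
v2 <- apply(x[, group == 2], 1, var)
pooled_sd <- sqrt((v1 + v2) / 2)
t_stat <- (m1 - m2) / (pooled_sd * sqrt(2 / n_per_group))
p_val  <- 2 * pt(-abs(t_stat), df = 2 * n_per_group - 2)

# Fraction of false positives at the 5% level, before and after filtering.
rate_all      <- mean(p_val < 0.05)
rate_filtered <- mean(p_val[pass] < 0.05)
print(c(all = rate_all, filtered = rate_filtered))
```

Under the stated assumptions both rates should be close to the nominal 5% level, up to simulation noise.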

Experiment-wide type I error: The marginal type I error control provided by the conditions above is sufficient for control of the family-wise error rate (FWER). Note, however, that common false discovery rate (FDR) methods depend not only on the marginal behaviour of the test statistics under the null hypothesis, but also on their joint distribution. The joint distribution can be affected by filtering, even when this filtering leaves the marginal distributions of true-null test statistics unchanged. Filtering might, for example, change the correlation structure. In practice the effect is often negligible, but this depends on the dataset and the filter used, and assessing it is the responsibility of the data analyst.

Annotation Based Filtering: The arguments require.entrez, require.GOBP, require.GOCC, require.GOMF and require.CytoBand filter based on available annotation data. The annotation package is determined by calling annotation(eset).

Variance Based Filtering: The var.filter, var.func, var.cutoff and filterByQuantile arguments control numerical cutoff-based filtering. Probes for which var.func returns NA are removed. The default var.func is IQR, which we here define as rowQ(eset, ceiling(0.75 * ncol(eset))) - rowQ(eset, floor(0.25 * ncol(eset))); this choice is motivated by the observation that unexpressed genes are detected most reliably through low variability of their features across samples. Additionally, IQR is robust to outliers. The default var.cutoff is 0.5 and is motivated by a rule of thumb that in many tissues only 40% of genes are expressed. Please adapt this value to your data and question.
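
The order-statistic definition of IQR quoted above can be sketched for a single feature in base R. The helper name order_stat_iqr is hypothetical, and the sketch assumes at least four samples and no missing values; it uses sort() in place of rowQ so that it runs without the package.

```r
# Hypothetical helper: IQR of one feature as a difference of order statistics,
# following the ceiling(0.75 * n) and floor(0.25 * n) rule from the text.
# Assumes length(x) >= 4 and no NA values.
order_stat_iqr <- function(x) {
  n <- length(x)
  s <- sort(x)
  s[ceiling(0.75 * n)] - s[floor(0.25 * n)]
}

clean   <- c(1, 2, 3, 4, 5, 6, 7, 8)
outlier <- c(1, 2, 3, 4, 5, 6, 7, 1000)   # one extreme value

# The order-statistic IQR is unchanged by the outlier; the standard
# deviation is not.
print(c(iqr_clean = order_stat_iqr(clean), iqr_outlier = order_stat_iqr(outlier)))
print(c(sd_clean = sd(clean), sd_outlier = sd(outlier)))
```

Note that this definition generally differs from the value returned by stats::IQR, which interpolates between order statistics by default.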

By default the numerical-filter cutoff is interpreted as a quantile, so with the default settings, 50% of the genes are removed.
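
The difference between the quantile and the absolute interpretation of var.cutoff can be sketched on a plain numeric matrix. The data are simulated, the absolute threshold of 1.2 is an arbitrary value chosen for illustration, and the rule "remove features whose statistic is less than the cutoff" follows the wording of the var.cutoff argument; how the package treats values exactly at the cutoff may differ.

```r
set.seed(42)
# Simulated expression matrix: 200 features (rows) by 12 samples (columns).
expr <- matrix(rnorm(200 * 12), nrow = 200,
               dimnames = list(paste0("feature", 1:200), NULL))

# Per-feature filter statistic.
stat <- apply(expr, 1, IQR)

# Quantile interpretation: the cutoff is the 0.5 quantile of all statistics.
keep_quantile <- stat >= quantile(stat, probs = 0.5)

# Absolute interpretation: the cutoff is compared to the statistic directly.
keep_absolute <- stat >= 1.2

print(c(kept_by_quantile = sum(keep_quantile),
        kept_by_absolute = sum(keep_absolute)))
```

With a quantile cutoff the number of retained features is fixed by construction, whereas with an absolute cutoff it depends on the scale of the data.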

Variance filtering is performed last, so that (if filterByQuantile=TRUE and remove.dupEntrez=TRUE) the final set of genes excludes precisely the var.cutoff fraction of the unique genes remaining after all other filters were applied.

The stand-alone function varFilter does only var.func-based filtering (and no annotation based filtering). featureFilter does only annotation based filtering and duplicate removal; it always performs duplicate removal to retain the highest-IQR probe for each gene.
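
The duplicate removal rule (keep, for each gene, the feature with the largest statistic) can be sketched in base R. The probe names, gene identifiers and statistic values below are invented for illustration; in the package the mapping comes from the annotation data rather than from a hand-written vector.

```r
# Hypothetical per-feature statistics (e.g. IQR values).
stat <- c(probeA = 0.8, probeB = 1.5, probeC = 0.3, probeD = 2.1, probeE = 0.9)

# Hypothetical mapping from feature to gene identifier.
gene_id <- c(probeA = "100", probeB = "100", probeC = "200",
             probeD = "300", probeE = "300")

# Group the feature names by gene, then pick the feature with the
# largest statistic within each group.
by_gene <- split(names(stat), gene_id)
kept <- unname(vapply(by_gene,
                      function(ids) ids[which.max(stat[ids])],
                      character(1)))
print(kept)
```

After this step every gene identifier is represented by exactly one feature, which is what allows a subsequent quantile-based filter to remove a fixed fraction of unique genes.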

References

R. Bourgon, R. Gentleman, W. Huber, Independent filtering increases power for detecting differentially expressed genes, Technical Report.

Examples


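## load the filtering functions, the annotation package and the example data
library("genefilter")
library("hgu95av2.db")
library("Biobase")
data(sample.ExpressionSet)

## default filtering: annotation-based, duplicate removal and variance-based
ans <- nsFilter(sample.ExpressionSet)
ans$eset
ans$filter.log

## skip variance-based filtering
ans <- nsFilter(sample.ExpressionSet, var.filter=FALSE)

## stand-alone variance-based and annotation-based filtering
a1 <- varFilter(sample.ExpressionSet)
a2 <- featureFilter(sample.ExpressionSet)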
