varFilter: Variation-based Filtering of Features (CpG sites) in a MethyLumiSet or MethyLumiM object

Description

The function varFilter removes features exhibiting little variation across samples. Such non-specific filtering can be advantageous for downstream data analysis.

Usage

varFilter(eset, var.func=IQR, var.cutoff=0.5, filterByQuantile=TRUE, ...)

Arguments

eset

A MethyLumiSet or MethyLumiM object.

var.func

The function used to compute the per-feature filtering statistic.

var.cutoff

A numeric value giving the cutoff for variation. If filterByQuantile is TRUE, features whose var.func value is less than the var.cutoff quantile of all var.func values will be removed. If FALSE, features whose var.func value is less than var.cutoff will be removed.

filterByQuantile

A logical indicating whether var.cutoff is to be interpreted as a quantile of all var.func values (the default), or as an absolute value.

...

Unused, but available for specializing methods.

Value

A list with the following components:

eset: The filtered MethyLumiSet or MethyLumiM object.
filter.log: Shows how many low-variance features were removed.

Details

This function is a counterpart of the functions nsFilter and varFilter available from the genefilter package. See R. Bourgon et al. (2010) and nsFilter for details.

Non-specific filtering, in which the filtering criterion does not depend on sample class, has been shown to increase the number of discoveries (Bourgon 2010). An inappropriate choice of test statistic, however, can have adverse effects. limma's moderated $t$-statistic, for example, is based on an empirical Bayes approach that models the conjugate prior of the gene-level variance with an inverse $\chi^2$ distribution scaled by the observed global variance. Because variance-based filtering removes the set of genes with low variance, the scaled inverse $\chi^2$ distribution no longer provides a good fit to the data passing the filter, causing the limma algorithm to produce a posterior degrees-of-freedom estimate of infinity (Bourgon 2010). This has two consequences: (i) the gene-level variance estimates will be ignored, and (ii) the $p$-values will be overly optimistic (Bourgon 2010).
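
The cutoff rule described under Arguments can be illustrated on a plain numeric matrix. The following is a minimal base-R sketch, not the package's implementation: the helper name var_filter_sketch, its return structure, and the simulated data are all assumptions made for illustration only.

```r
# Hypothetical helper (not part of any package): mimics the documented
# cutoff rule on a plain features-by-samples numeric matrix.
var_filter_sketch <- function(x, var.func = IQR, var.cutoff = 0.5,
                              filterByQuantile = TRUE) {
  stopifnot(is.matrix(x), is.function(var.func))
  # One variation score per feature (row)
  vars <- apply(x, 1, var.func, na.rm = TRUE)
  # Interpret var.cutoff as a quantile of the scores, or as an absolute value
  threshold <- if (filterByQuantile) {
    quantile(vars, probs = var.cutoff, na.rm = TRUE, names = FALSE)
  } else {
    var.cutoff
  }
  # Features scoring below the threshold are removed
  keep <- !is.na(vars) & vars >= threshold
  list(x = x[keep, , drop = FALSE],
       filter.log = c(removed = sum(!keep), kept = sum(keep)))
}

set.seed(1)
# Simulated data: 200 features x 8 samples; the first 100 features vary little
m <- rbind(matrix(rnorm(100 * 8, sd = 0.05), nrow = 100),
           matrix(rnorm(100 * 8, sd = 1), nrow = 100))

# Quantile mode: drop the 25 percent of features with the smallest IQR
res <- var_filter_sketch(m, var.cutoff = 0.25)
res$filter.log
dim(res$x)

# Absolute mode: drop features whose IQR is below 0.5
res_abs <- var_filter_sketch(m, var.cutoff = 0.5, filterByQuantile = FALSE)
res_abs$filter.log
```

In quantile mode the number of features kept depends only on var.cutoff and the number of features, whereas in absolute mode it depends on the scale of the data, so the absolute cutoff has to be chosen with the measurement scale in mind.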

References

R. Bourgon, R. Gentleman, W. Huber. Independent filtering increases power for detecting differentially expressed genes. PNAS, vol. 107, no. 21, pp. 9546-9551, 2010.

Examples

  data(mldat)
  ## keep top 75 percent
  filt <- varFilter(mldat, var.cutoff=0.25)
  filt$filter.log
  dim(filt$eset)
