nsFilter provides a single entry point for
several ways of filtering (removing) features from an ExpressionSet.
Filtering features exhibiting little variation, or a consistently low
signal, across samples can be advantageous for
the subsequent data analysis (Bourgon et al.).
Furthermore, one may decide that there is little value in considering
features with insufficient annotation.

Usage

nsFilter(eset, require.entrez=TRUE,
         require.GOBP=FALSE, require.GOCC=FALSE,
         require.GOMF=FALSE, require.CytoBand=FALSE,
         remove.dupEntrez=TRUE, var.func=IQR,
         var.cutoff=0.5, var.filter=TRUE,
         filterByQuantile=TRUE, feature.exclude="^AFFX", ...)

varFilter(eset, var.func=IQR, var.cutoff=0.5, filterByQuantile=TRUE)

featureFilter(eset, require.entrez=TRUE,
              require.GOBP=FALSE, require.GOCC=FALSE,
              require.GOMF=FALSE, require.CytoBand=FALSE,
              remove.dupEntrez=TRUE, feature.exclude="^AFFX")
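
Arguments

eset: an ExpressionSet object.

var.func: the function used to compute the per-feature filtering statistic.

filterByQuantile: a logical indicating whether var.cutoff
is to be interpreted as a quantile of all var.func values
(the default), or as an absolute value.

var.filter: a logical indicating whether to perform filtering based on var.func.

var.cutoff: a numeric value. If var.filter is TRUE,
features whose value of var.func is less than either
the var.cutoff-quantile of all var.func values
(if filterByQuantile is TRUE), or
var.cutoff (if filterByQuantile is FALSE),
will be removed.

require.entrez: if TRUE, filter out features
without an Entrez Gene ID annotation. If using an annotation
package where an identifier system other than Entrez Gene IDs is
used as the central ID, then that ID will be required instead.

require.GOBP, require.GOCC, require.GOMF: if TRUE, filter out features
whose target genes are not annotated to at least one GO term in
the BP, CC or MF ontology, respectively.

require.CytoBand: if TRUE, filter out features
whose target genes have no mapping to cytoband locations.

remove.dupEntrez: if TRUE and there are features
mapping to the same Entrez Gene ID (or equivalent), then the feature with
the largest value of var.func will be retained and the
other(s) removed.

feature.exclude: a character vector of regular expressions.
Feature identifiers (i.e. the value of featureNames(eset))
that match one of the specified patterns will be filtered out.
The default value is intended to filter out Affymetrix quality control
probe sets.

Value

For nsFilter, a list consisting of:

eset: the filtered ExpressionSet.

filter.log: a list giving details of how many probe sets were removed
for each filtering step performed.

For varFilter and featureFilter, the filtered
ExpressionSet.

Details

Marginal type I errors:
Filter criteria that do not depend on the data, such as the
annotation based criteria provided by the nsFilter
and featureFilter functions, leave the marginal type I error
unaffected. However, marginal type I error can
also be controlled for certain types of data-dependent criteria.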
Call $U^I$ the stage 1 filter statistic: a function
that is applied feature by feature,
that depends only on the data for that feature
and not on any other feature,
and whose value decides whether the feature is accepted to
pass to stage 2. Call
$U^{II}$ the stage 2 test statistic for differential expression.
Sufficient conditions for marginal type-I error control are:
... eBayes function),
data normally distributed and exchangeable.

Experiment-wide type I error:
Marginal type-I error control provided by the conditions above
is sufficient for control of the family wise error rate (FWER).
Note, however, that common false discovery rate (FDR) methods depend
not only on the marginal behaviour of the test statistics under the
null hypothesis, but also on their joint distribution.
The joint distribution can be affected by filtering,
even when this filtering leaves the marginal distributions of
true-null test statistics unchanged. Filtering might, for example,
change the correlation structure. The
effect of this is negligible in many practical cases, but this
depends on the dataset and the filter used, and assessing it
is the responsibility of the data analyst.
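
The two-stage procedure described above can be sketched in base R. This is only an illustration of the mechanics on simulated null data, not a demonstration of error control: the matrix x, the group labels and the 50% cutoff are assumptions made for the example. Here $U^I$ is the overall variance, computed without the group labels, and $U^{II}$ is the equal-variance two-sample t-statistic.

```r
set.seed(1)
n_features <- 200
n_per_group <- 5

# Simulated expression matrix (features in rows); purely illustrative data
x <- matrix(rnorm(n_features * 2 * n_per_group), nrow = n_features)
group <- rep(c("a", "b"), each = n_per_group)

# Stage 1: U^I = overall variance per feature, ignoring the group labels
u1 <- apply(x, 1, var)
pass <- u1 >= quantile(u1, probs = 0.5)

# Stage 2: U^II = two-sample t-test, only for features that passed stage 1
p <- apply(x[pass, , drop = FALSE], 1, function(v) {
  t.test(v[group == "a"], v[group == "b"], var.equal = TRUE)$p.value
})

# Multiple-testing adjustment over the retained features only
p_fwer <- p.adjust(p, method = "holm")  # family-wise error rate
p_fdr  <- p.adjust(p, method = "BH")    # false discovery rate
```

Because the adjustment is applied only to the features that passed stage 1, fewer hypotheses are corrected for. Whether the FDR adjustment remains valid after filtering is the question raised in the paragraph above.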

Annotation Based Filtering: The arguments require.entrez,
require.GOBP, require.GOCC, require.GOMF and
require.CytoBand
filter based on available annotation data. The annotation
package is determined by calling annotation(eset).
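
The idea of annotation based filtering, combined with the feature.exclude patterns, can be sketched in base R as follows. The helper feature_filter_sketch, the probe identifiers and the Entrez IDs are all made up for illustration; the real functions look up annotation in the annotation package rather than taking a vector of IDs.

```r
# Hypothetical helper: keep features that have an Entrez ID and whose
# identifier matches none of the exclusion patterns.
feature_filter_sketch <- function(ids, entrez, feature.exclude = "^AFFX") {
  excluded <- rep(FALSE, length(ids))
  for (pattern in feature.exclude) {
    excluded <- excluded | grepl(pattern, ids)
  }
  ids[!is.na(entrez) & !excluded]
}

# Toy identifiers and toy Entrez IDs (NA means "no annotation")
ids    <- c("AFFX-ctrl_at", "probe_a", "probe_b", "probe_c")
entrez <- c(NA, "101", NA, "102")

kept <- feature_filter_sketch(ids, entrez)
kept
```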

Variance Based Filtering: The var.filter,
var.func, var.cutoff and filterByQuantile arguments
control numerical cutoff-based filtering.
Probes for which var.func returns NA are
removed.
The default var.func is IQR, which we here define as
rowQ(eset, ceiling(0.75 * ncol(eset))) - rowQ(eset, floor(0.25 * ncol(eset)));
this choice is motivated by the observation that unexpressed genes are
detected most reliably through low variability of their features
across samples.
Additionally, IQR is robust to outliers (see note below). The
default var.cutoff is 0.5 and is motivated by a rule of
thumb that in many tissues only 40% of genes are expressed.
Please adapt this value to your data and question.
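
To make the order-statistic definition of IQR above concrete, here is a minimal base R sketch of the same arithmetic on a plain numeric matrix. The helper row_iqr_sketch is hypothetical and stands in for the rowQ-based expression; it assumes at least four columns so that floor(0.25 * n) is a valid index.

```r
# Hypothetical helper: per-row difference between the order statistics at
# positions ceiling(0.75 * n) and floor(0.25 * n), where n = ncol(x).
row_iqr_sketch <- function(x) {
  n <- ncol(x)
  stopifnot(n >= 4)
  hi <- ceiling(0.75 * n)
  lo <- floor(0.25 * n)
  apply(x, 1, function(v) {
    s <- sort(v)
    s[hi] - s[lo]
  })
}

# Two toy features measured on eight samples
x <- rbind(flat   = c(5, 5, 5, 5, 5, 5, 5, 5),
           spread = c(1, 2, 3, 4, 5, 6, 7, 8))
iqr_values <- row_iqr_sketch(x)
iqr_values
```

With eight samples the positions are 6 and 2, so the constant feature gets 0 and the evenly spread feature gets 6 - 2 = 4.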
By default the numerical-filter cutoff is interpreted as a quantile, so with the default settings, 50% of the genes are filtered.
Variance filtering is performed last, so that
(if filterByQuantile=TRUE and remove.dupEntrez=TRUE) the
final set of genes does indeed exclude precisely the var.cutoff
fraction of unique genes remaining after all other filters were
passed.
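
The difference between a quantile cutoff and an absolute cutoff can be sketched in base R. The helper var_filter_sketch is hypothetical and works on a plain matrix rather than an ExpressionSet. It follows the wording above, dropping features whose statistic is below the threshold; how genefilter itself treats values exactly at the threshold is not stated here.

```r
# Hypothetical helper mirroring the var.func / var.cutoff / filterByQuantile idea
var_filter_sketch <- function(x, var.func = IQR, var.cutoff = 0.5,
                              filterByQuantile = TRUE) {
  stat <- apply(x, 1, var.func)
  threshold <- if (filterByQuantile) {
    unname(quantile(stat, probs = var.cutoff, na.rm = TRUE))
  } else {
    var.cutoff
  }
  # Features with an NA statistic, or a statistic below the threshold, are dropped
  keep <- !is.na(stat) & stat >= threshold
  x[keep, , drop = FALSE]
}

# Toy matrix: row i is i * (1:5), so its IQR is 2 * i
x <- outer(1:6, 1:5)
rownames(x) <- paste0("f", 1:6)

by_quantile <- var_filter_sketch(x)                       # median IQR as cutoff
by_value    <- var_filter_sketch(x, var.cutoff = 4,
                                 filterByQuantile = FALSE) # absolute cutoff of 4
rownames(by_quantile)
rownames(by_value)
```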
The stand-alone function varFilter does only
var.func-based filtering
(and no annotation based filtering).
featureFilter does only
annotation based filtering and duplicate removal; it always
performs duplicate removal to retain the highest-IQR
probe for each gene.
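
Duplicate removal as described above, keeping the feature with the largest statistic for each gene, can be sketched in base R. The helper dedup_sketch, the probe names and the gene IDs are made up for illustration and say nothing about how genefilter implements this step internally.

```r
# Hypothetical helper: for each gene ID, return the index of the feature
# with the largest statistic.
dedup_sketch <- function(stat, gene_id) {
  ord <- order(stat, decreasing = TRUE)     # best feature first
  keep <- ord[!duplicated(gene_id[ord])]    # first occurrence per gene ID
  sort(keep)                                # restore the original order
}

# Toy per-feature statistics and toy gene IDs; p1 and p2 share a gene
stat    <- c(p1 = 0.2, p2 = 1.5, p3 = 0.9, p4 = 0.4)
gene_id <- c("100", "100", "200", "300")

retained <- names(stat)[dedup_sketch(stat, gene_id)]
retained
```

Of the two features mapping to gene "100", only p2 is retained because it has the larger statistic.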
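
Examples

library("genefilter")   # provides nsFilter, varFilter and featureFilter
library("hgu95av2.db")  # annotation package for sample.ExpressionSet
library("Biobase")
data(sample.ExpressionSet)

## default settings: annotation based and variance based filtering
ans <- nsFilter(sample.ExpressionSet)
ans$eset        # the filtered ExpressionSet
ans$filter.log  # log of the filtering steps

## skip variance-based filtering
ans <- nsFilter(sample.ExpressionSet, var.filter=FALSE)

## stand-alone variance based and annotation based filtering
a1 <- varFilter(sample.ExpressionSet)
a2 <- featureFilter(sample.ExpressionSet)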