These functions create user-defined (srFilter) or built-in instances of SRFilter objects. Filters can be applied to objects from ShortRead, returning a logical vector to be used to subset the objects to include only those components satisfying the filter.
srFilter(fun, name = NA_character_, ...)
## S4 method for signature 'missing'
srFilter(fun, name=NA_character_, ...)
## S4 method for signature 'function'
srFilter(fun, name=NA_character_, ...)
compose(filt, ..., .name)
idFilter(regex=character(0), fixed=FALSE, exclude=FALSE, .name="idFilter")
occurrenceFilter(min=1L, max=1L, withSread=c(NA, TRUE, FALSE), duplicates=c("head", "tail", "sample", "none"), .name=.occurrenceName(min, max, withSread, duplicates))
nFilter(threshold=0L, .name="CleanNFilter")
polynFilter(threshold=0L, nuc=c("A", "C", "T", "G", "other"), .name="PolyNFilter")
dustyFilter(threshold=Inf, batchSize=NA, .name="DustyFilter")
srdistanceFilter(subject=character(0), threshold=0L, .name="SRDistanceFilter")
##
## legacy filters for ungapped alignments
##
chromosomeFilter(regex=character(0), fixed=FALSE, exclude=FALSE, .name="ChromosomeFilter")
positionFilter(min=-Inf, max=Inf, .name="PositionFilter")
strandFilter(strandLevels=character(0), .name="StrandFilter")
alignQualityFilter(threshold=0L, .name="AlignQualityFilter")
alignDataFilter(expr=expression(), .name="AlignDataFilter")
fun: A function to be used as a filter. fun must accept a single named argument x, and is expected to return a logical vector such that x[fun(x)] selects only those elements of x satisfying the conditions of fun.
name: A character(1) object to be used as the name of the filter. The name is useful for debugging and reference.
filt: An SRFilter object, to be used with additional arguments to create a composite filter.
.name: A character(1) object used to override the name applied to default filters.
regex: Either character(0) or a character(1) regular expression used as grep(regex, chromosome(x)) to filter based on chromosome. The default (character(0)) performs no filtering.
fixed: A logical(1) passed to grep, influencing how pattern matching occurs.
exclude: A logical(1) which, when TRUE, uses regex to exclude, rather than include, reads.
min, max: Each a numeric(1). For positionFilter, min and max define the closed interval in which position must be found: min <= position <= max. For occurrenceFilter, min and max define the minimum and maximum number of times a read occurs after the filter.
strandLevels: Either character(0) or character(1) containing strand levels to be selected. ShortRead objects have standard strand levels NA, "+", "-", "*", with NA meaning strand information not available and "*" meaning strand information not relevant.
withSread: A logical(1) indicating whether uniqueness includes the read sequence (withSread=TRUE), is based only on chromosome, position, and strand (withSread=FALSE), or only the read sequence (withSread=NA), as described for occurrenceFilter below.
duplicates: Either a character(1), a function name, or a function taking a single argument. Influences how duplicates are handled, as described for occurrenceFilter below.
threshold: A numeric(1) value representing a minimum (srdistanceFilter, alignQualityFilter) or maximum (nFilter, polynFilter, dustyFilter) criterion for the filter. The minima and maxima are closed-interval (i.e., x >= threshold, x <= threshold for some property x of the object being filtered).
nuc: A character vector containing IUPAC symbols for nucleotides or the value "other" corresponding to all non-nucleotide symbols, e.g., N.
batchSize: NA or an integer(1) vector indicating the number of DNA sequences to be processed simultaneously by dustyFilter. By default, all reads are processed simultaneously. Smaller values use less memory but are computationally less efficient.
subject: A character() of any length, to be used as the corresponding argument to srdistance.
expr: An expression to be evaluated with pData(alignData(x)).

srFilter returns an object of class SRFilter.

Built-in filters return a logical vector of length(x), with TRUE indicating components that pass the filter.

srFilter allows users to construct their own filters. The fun argument to srFilter must be a function accepting a single argument x and returning a logical vector that can be used to select elements of x satisfying the filter with x[fun(x)].

The signature(fun="missing") method creates a default filter that returns a vector of TRUE values with length equal to length(x).
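As a minimal sketch of the x[fun(x)] contract described above, the following builds a small in-memory ShortReadQ object and applies a user-defined filter to it. The toy sequences, qualities, and ids, and the filter name MinWidth8, are made up for illustration and are not part of the package.

```r
library(ShortRead)

# Three toy reads; sequences, qualities and ids are invented for illustration.
reads <- ShortReadQ(
    sread   = DNAStringSet(c("ACGTACGTAC", "ACGT", "TTGACCAGTA")),
    quality = BStringSet(c("IIIIIIIIII", "IIII", "IIIIIIIIII")),
    id      = BStringSet(c("read1", "read2", "read3"))
)

# Hypothetical custom filter: keep reads at least 8 nucleotides long.
# The function takes a single argument x and returns one logical per read.
minWidth <- srFilter(function(x) width(x) >= 8L, name = "MinWidth8")

keep <- minWidth(reads)     # logical result, one element per read
long_reads <- reads[keep]   # subset to the reads that pass the filter
```

The filter itself only computes the logical vector; subsetting is a separate step, so the same result can be reused or combined with other conditions before any reads are dropped.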
compose constructs a new filter from one or more existing filters. The result is a filter that returns a logical vector with indices corresponding to components of x that pass all filters. If not provided, the name of the filter consists of the names of all component filters, each separated by " o ".
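A short sketch of composing a built-in filter with a user-defined one may help here. The toy reads and the MinWidth8 filter are assumptions made for illustration; only compose, nFilter, and srFilter come from this page.

```r
library(ShortRead)

# Toy reads: the second contains an 'N', the third is short.
reads <- ShortReadQ(
    sread   = DNAStringSet(c("ACGTACGTAC", "ACGTNCGTAC", "ACGT")),
    quality = BStringSet(c("IIIIIIIIII", "IIIIIIIIII", "IIII")),
    id      = BStringSet(c("read1", "read2", "read3"))
)

noN      <- nFilter()  # default threshold = 0L, i.e. no 'N' symbols allowed
minWidth <- srFilter(function(x) width(x) >= 8L, name = "MinWidth8")  # hypothetical

# A read passes the composite filter only if it passes every component.
both   <- compose(noN, minWidth)
passed <- reads[both(reads)]
```

Only the first read should survive, since the second fails nFilter and the third fails the width condition.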
The remaining functions documented on this page are built-in filters that accept an argument x and return a logical vector of length(x) indicating which components of x satisfy the filter.
idFilter selects elements satisfying grep(regex, id(x), fixed=fixed).
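The sketch below illustrates idFilter with and without exclude on a small in-memory object. The read identifiers and the "^lane1" pattern are invented for the example.

```r
library(ShortRead)

# Toy reads whose ids encode a made-up lane prefix.
reads <- ShortReadQ(
    sread   = DNAStringSet(c("ACGTACGT", "TTGACCAG", "GGCATTCA")),
    quality = BStringSet(c("IIIIIIII", "IIIIIIII", "IIIIIIII")),
    id      = BStringSet(c("lane1_read1", "lane2_read2", "lane1_read3"))
)

lane1    <- idFilter("^lane1")                  # keep ids matching the regex
notLane1 <- idFilter("^lane1", exclude = TRUE)  # keep ids NOT matching it

lane1_reads <- reads[lane1(reads)]
other_reads <- reads[notLane1(reads)]
```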
chromosomeFilter selects elements satisfying grep(regex, chromosome(x), fixed=fixed).
positionFilter selects elements satisfying min <= position(x) <= max.
strandFilter selects elements satisfying match(strand(x), strandLevels, nomatch=0) > 0.
occurrenceFilter selects elements that occur >=min and <=max times. withSread determines how reads will be treated: TRUE to include the sread, chromosome, strand, and position when determining occurrence, FALSE to include chromosome, strand, and position, and NA to include only sread. The default is withSread=NA. duplicates determines how reads that occur more than max times are treated. head selects the first max reads of each set of duplicates, tail the last max reads, and sample a random sample of max reads. none removes all reads represented more than max times. The user can also provide a function (as used by tapply) of a single argument to select amongst reads.
nFilter selects elements with no more than threshold 'N' symbols in each element of sread(x).
polynFilter selects elements with no more than threshold copies of any nucleotide indicated by nuc.
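To make the threshold semantics concrete, here is a small sketch applying nFilter and polynFilter to toy reads. The sequences and the chosen thresholds are illustrative assumptions; the closed-interval reading (a read with exactly threshold 'N' symbols passes) follows the description of the threshold argument above.

```r
library(ShortRead)

# Toy reads with 0, 1, 3 and 0 'N' symbols, and 3, 2, 1 and 8 'A's.
reads <- ShortReadQ(
    sread   = DNAStringSet(c("ACGTACGTAC", "ACGTNCGTAC",
                             "NNGTNCGTAC", "AAAAAAAAGT")),
    quality = BStringSet(rep("IIIIIIIIII", 4)),
    id      = BStringSet(paste0("read", 1:4))
)

atMostOneN <- nFilter(threshold = 1L)                  # at most one 'N' per read
notPolyA   <- polynFilter(threshold = 5L, nuc = "A")   # at most five 'A's per read

n_ok <- atMostOneN(reads)   # third read has three 'N's and should fail
a_ok <- notPolyA(reads)     # fourth read has eight 'A's and should fail
```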
dustyFilter selects elements with high sequence complexity, as characterized by their dustyScore. This emulates the dust command from WindowMaker software. Calculations can be memory intensive; use batchSize to process the argument to dustyFilter in batches of the specified size.
srdistanceFilter selects elements at an edit distance greater than or equal to threshold from all sequences in subject.
alignQualityFilter selects elements with alignQuality(x) greater than or equal to threshold.
alignDataFilter selects elements with pData(alignData(x)) satisfying expr. expr should be formulated as though it were to be evaluated as eval(expr, pData(alignData(x))).
See also: SRFilter.
library(ShortRead)
sp <- SolexaPath(system.file("extdata", package="ShortRead"))
aln <- readAligned(sp, "s_2_export.txt") # Solexa export file, as example
# a 'chromosome 5' filter
filt <- chromosomeFilter("chr5.fa")
aln[filt(aln)]
# filter during input
readAligned(sp, "s_2_export.txt", filter=filt)
# x- and y- coordinates stored in alignData, when source is SolexaExport
xy <- alignDataFilter(expression(abs(x-500) > 200 & abs(y-500) > 200))
aln[xy(aln)]
# both filters as a single filter
chr5xy <- compose(filt, xy)
aln[chr5xy(aln)]
# both filters as a collection
filters <- c(filt, xy)
subsetByFilter(aln, filters)
summary(filters, aln)
# read, chromosome, strand, position tuples occurring exactly once
aln[occurrenceFilter(withSread=TRUE, duplicates="none")(aln)]
# reads occurring exactly once
aln[occurrenceFilter(withSread=NA, duplicates="none")(aln)]
# chromosome, strand, position tuples occurring exactly once
aln[occurrenceFilter(withSread=FALSE, duplicates="none")(aln)]
# custom filter: minimum calibrated base call quality >20
goodq <- srFilter(function(x) {
    apply(as(quality(x), "matrix"), 1, min, na.rm=TRUE) > 20
}, name="GoodQualityBases")
goodq
aln[goodq(aln)]