IntervalTree
from the
ranges in ranges
, an object coercible to
IntervalTree
, such as an IRanges
object.
as(from, "IRanges")
: Imports the ranges in
from
, an IntervalTree
, to an
IRanges
.as(from, "IntervalTree")
: Constructs an
IntervalTree
representing from
, a Ranges
object that is coercible to IRanges
.
length(x)
: Gets the number of ranges stored in the
tree. This is a fast operation that does not bring the ranges into
R.start(x)
: Get the starts of the ranges.end(x)
: Get the ends of the ranges.O(n*lg(n))
, which makes it about as fast as other types of
overlap query algorithms based on sorting. The good news is that the
tree need only be built once per subject; this is useful in situations
of frequent querying. Also, in this implementation the data is stored
outside of R, avoiding needless copying. Of course, external storage
is not always convenient, so it is possible to coerce the tree to an
instance of IRanges
(see the Coercion section). For the query operation, the running time is based on the query size
m
and the average number of hits per query k
. The output
size is then max(mk,m)
, but we abbreviate this as
mk
. Note that when the multiple
parameter is set to
FALSE
, k
is fixed to 1 and drops out of this
analysis. We also assume here that the query is sorted by start
position (the findOverlaps
function sorts the query if it is unsorted). An upper bound for finding overlaps is
O(min(mk*lg(n),n+mk))
. The fastest interval tree algorithm
known is bounded by O(min(m*lg(n),n)+mk)
but is a lot more
complicated and involves two auxillary trees. The lower bound is
Omega(lg(n)+mk)
, which is almost the same as for returning
the answer, Omega(mk)
. The average is of course somewhere in
between. This analysis informs the choice of which set of ranges to process
into a tree, i.e. assigning one to be the subject and the other to be
the query. Note that if m > n
, then the running time is
O(m)
, and the total operation of complexity O(n*lg(n) +
m)
is better than if m
and n
were exchanged. Thus, for
once-off operations, it is often most efficient to choose the smaller
set to become the tree (but k
also affects this). This is
reinforced by the realization that if mk
is about the same in
either direction, the running time depends only on n
, which
should be minimized. Even in cases where a tree has already been
constructed for one of the sets, it can be more efficient to build a
new tree when the existing tree of size n
is much larger than
the query set of size m
, roughly when n > m*lg(n)
. The simplest approach for finding overlaps is to call the
findOverlaps
function on a Ranges or other object
with range information. See the man page of findOverlaps
for how to use this and other related functions.
An IntervalTree
object is a derivative of Ranges and
stores its ranges as a tree that is optimized for overlap queries.
Thus, for repeated queries against the same subject, it is more
efficient to create an IntervalTree
once for the subject using
the constructor described below and then perform the queries against
the IntervalTree
instance.
findOverlaps
for finding/counting interval overlaps between
two "range-based" objects,
Ranges
, the parent of this class,
Hits
, set of hits between 2 vector-like objects.
query <- IRanges(c(1, 4, 9), c(5, 7, 10))
subject <- IRanges(c(2, 2, 10), c(2, 3, 12))
tree <- IntervalTree(subject)
findOverlaps(query, tree)
## query and subject are easily interchangeable
query <- IRanges(c(1, 4, 9), c(5, 7, 10))
subject <- IRanges(c(2, 2), c(5, 4))
tree <- IntervalTree(subject)
t(findOverlaps(query, tree))
# the same as:
findOverlaps(subject, query)
Run the code above in your browser using DataLab