A fast binary-search based overlap join of two data.table
s.
This is very much inspired by findOverlaps
function from the Bioconductor
package IRanges
(see link below under See Also
).
Usually, x
is a very large data.table with small interval ranges, and
y
is much smaller keyed data.table
with relatively larger
interval spans. For a usage in genomics
, see the examples section.
NOTE: This is still under development, meaning it is stable, but some features are yet to be implemented. Also, some arguments and/or the function name itself could be changed.
foverlaps(x, y, by.x = if (!is.null(key(x))) key(x) else key(y),
by.y = key(y), maxgap = 0L, minoverlap = 1L,
type = c("any", "within", "start", "end", "equal"),
mult = c("all", "first", "last"),
nomatch = getOption("datatable.nomatch", NA),
which = FALSE, verbose = getOption("datatable.verbose"))
data.table
s. y
needs to be keyed, but not necessarily
x
. See examples.
A vector of column names (or numbers) to compute the overlap
joins. The last two columns in both by.x
and by.y
should each
correspond to the start
and end
interval columns in x
and
y
respectively. And the start
column should always be <= end
column. If x
is keyed, by.x
is equal to key(x)
, else
key(y)
. by.y
defaults to key(y)
.
It should be a non-negative integer value, >= 0. Default is 0 (no
gap). For intervals [a,b]
and [c,d]
, where a<=b
and
c<=d
, when c > b
or d < a
, the two intervals don't overlap.
If the gap between these two intervals is <= maxgap
, these two intervals
are considered as overlapping. Note: This is not yet implemented.
It should be a positive integer value, > 0. Default is 1. For
intervals [a,b]
and [c,d]
, where a<=b
and c<=d
, when
c<=b
and d>=a
, the two intervals overlap. If the length of overlap
between these two intervals is >= minoverlap
, then these two intervals are
considered to be overlapping. Note: This is not yet implemented.
Default value is any
. Allowed values are any
,
within
, start
, end
and equal
.
The types shown here are identical in functionality to the function
findOverlaps
in the bioconductor package IRanges
. Let [a,b]
and [c,d]
be intervals in x
and y
with a<=b
and
c<=d
. For type="start"
, the intervals overlap iff a == c
.
For type="end"
, the intervals overlap iff b == d
. For
type="within"
, the intervals overlap iff a>=c and b<=d
. For
type="equal"
, the intervals overlap iff a==c and b==d
. For
type="any"
, as long as c<=b and d>=a
, they overlap. In addition
to these requirements, they also have to satisfy the minoverlap
argument
as explained above.
NB: maxgap
argument, when > 0, is to be interpreted according to the type
of the overlap. This will be updated once maxgap
is implemented.
When multiple rows in y
match to the row in x
,
mult=.
controls which values are returned - "all"
(default),
"first"
or "last"
.
When a row (with interval say, [a,b]
) in x
has no
match in y
, nomatch=NA
(default) means NA
is returned for
y
's non-by.y
columns for that row of x
. nomatch=NULL
(or 0
for backward compatibility) means no rows will be returned for that
row of x
. Use options(datatable.nomatch=NULL)
to change the default
value (used when nomatch
is not supplied).
When TRUE
, if mult="all"
returns a two column
data.table
with the first column corresponding to x
's row number
and the second corresponding to y
's. when nomatch=NA
, no matches
return NA
for y
, and if nomatch=NULL
, those rows where no
match is found will be skipped; if mult="first" or "last"
, a vector of
length equal to the number of rows in x
is returned, with no-match entries
filled with NA
or 0
corresponding to the nomatch
argument.
Default is FALSE
, which returns a join with the rows in y
.
TRUE
turns on status and information messages to the
console. Turn this on by default using options(datatable.verbose=TRUE)
.
The quantity and types of verbosity may be expanded in future.
A new data.table
by joining over the interval columns (along with other
additional identifier columns) specified in by.x
and by.y
.
NB: When which=TRUE
: a)
mult="first" or "last"
returns a
vector
of matching row numbers in y
, and b)
when
mult="all"
returns a data.table with two columns with the first
containing row numbers of x
and the second column with corresponding
row numbers of y
.
nomatch=NA or 0
also influences whether non-matching rows are returned
or not, as explained above.
Very briefly, foverlaps()
collapses the two-column interval in y
to one-column of unique values to generate a lookup
table, and
then performs the join depending on the type of overlap
, using the
already available binary search
feature of data.table
. The time
(and space) required to generate the lookup
is therefore proportional
to the number of unique values present in the interval columns of y
when combined together.
Overlap joins takes advantage of the fact that y
is sorted to speed-up
finding overlaps. Therefore y
has to be keyed (see ?setkey
)
prior to running foverlaps()
. A key on x
is not necessary,
although it might speed things further. The columns in by.x
argument should correspond to the columns specified in by.y
. The last
two columns should be the interval columns in both by.x
and
by.y
. The first interval column in by.x
should always be <= the
second interval column in by.x
, and likewise for by.y
. The
storage.mode
of the interval columns must be either double
or integer
. It therefore works with bit64::integer64
type as well.
The lookup
generation step could be quite time consuming if the number
of unique values in y
are too large (ex: in the order of tens of millions).
There might be improvements possible by constructing lookup using RLE, which is
a pending feature request. However most scenarios will not have too many unique
values for y
.
data.table
,
http://www.bioconductor.org/packages/release/bioc/html/IRanges.html,
setNumericRounding
# NOT RUN {
require(data.table)
## simple example:
x = data.table(start=c(5,31,22,16), end=c(8,50,25,18), val2 = 7:10)
y = data.table(start=c(10, 20, 30), end=c(15, 35, 45), val1 = 1:3)
setkey(y, start, end)
foverlaps(x, y, type="any", which=TRUE) ## return overlap indices
foverlaps(x, y, type="any") ## return overlap join
foverlaps(x, y, type="any", mult="first") ## returns only first match
foverlaps(x, y, type="within") ## matches iff 'x' is within 'y'
## with extra identifiers (ex: in genomics)
x = data.table(chr=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"),
start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60))
y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1),
end=c(4, 18, 55), geneid=letters[1:3])
setkey(y, chr, start, end)
foverlaps(x, y, type="any", which=TRUE)
foverlaps(x, y, type="any")
foverlaps(x, y, type="any", nomatch=NULL)
foverlaps(x, y, type="within", which=TRUE)
foverlaps(x, y, type="within")
foverlaps(x, y, type="start")
## x and y have different column names - specify by.x
x = data.table(seq=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"),
start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60))
y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1),
end=c(4, 18, 55), geneid=letters[1:3])
setkey(y, chr, start, end)
foverlaps(x, y, by.x=c("seq", "start", "end"),
type="any", which=TRUE)
# }
Run the code above in your browser using DataLab