# foverlaps

##### Fast overlap joins

A *fast* binary-search based *overlap join* of two `data.table`

s.
This is very much inspired by `findOverlaps`

function from the Bioconductor
package `IRanges`

(see link below under `See Also`

).

Usually, `x`

is a very large data.table with small interval ranges, and
`y`

is much smaller *keyed* `data.table`

with relatively larger
interval spans. For a usage in `genomics`

, see the examples section.

NOTE: This is still under development, meaning it is stable, but some features are yet to be implemented. Also, some arguments and/or the function name itself could be changed.

- Keywords
- data

##### Usage

```
foverlaps(x, y, by.x = if (!is.null(key(x))) key(x) else key(y),
by.y = key(y), maxgap = 0L, minoverlap = 1L,
type = c("any", "within", "start", "end", "equal"),
mult = c("all", "first", "last"),
nomatch = getOption("datatable.nomatch", NA),
which = FALSE, verbose = getOption("datatable.verbose"))
```

##### Arguments

- x, y
`data.table`

s.`y`

needs to be keyed, but not necessarily`x`

. See examples.- by.x, by.y
A vector of column names (or numbers) to compute the overlap joins. The last two columns in both

`by.x`

and`by.y`

should each correspond to the`start`

and`end`

interval columns in`x`

and`y`

respectively. And the`start`

column should always be <=`end`

column. If`x`

is keyed,`by.x`

is equal to`key(x)`

, else`key(y)`

.`by.y`

defaults to`key(y)`

.- maxgap
It should be a non-negative integer value, >= 0. Default is 0 (no gap). For intervals

`[a,b]`

and`[c,d]`

, where`a<=b`

and`c<=d`

, when`c > b`

or`d < a`

, the two intervals don't overlap. If the gap between these two intervals is`<= maxgap`

, these two intervals are considered as overlapping. Note: This is not yet implemented.- minoverlap
It should be a positive integer value, > 0. Default is 1. For intervals

`[a,b]`

and`[c,d]`

, where`a<=b`

and`c<=d`

, when`c<=b`

and`d>=a`

, the two intervals overlap. If the length of overlap between these two intervals is`>= minoverlap`

, then these two intervals are considered to be overlapping. Note: This is not yet implemented.- type
Default value is

`any`

. Allowed values are`any`

,`within`

,`start`

,`end`

and`equal`

.The types shown here are identical in functionality to the function

`findOverlaps`

in the bioconductor package`IRanges`

. Let`[a,b]`

and`[c,d]`

be intervals in`x`

and`y`

with`a<=b`

and`c<=d`

. For`type="start"`

, the intervals overlap iff`a == c`

. For`type="end"`

, the intervals overlap iff`b == d`

. For`type="within"`

, the intervals overlap iff`a>=c and b<=d`

. For`type="equal"`

, the intervals overlap iff`a==c and b==d`

. For`type="any"`

, as long as`c<=b and d>=a`

, they overlap. In addition to these requirements, they also have to satisfy the`minoverlap`

argument as explained above.NB:

`maxgap`

argument, when > 0, is to be interpreted according to the type of the overlap. This will be updated once`maxgap`

is implemented.- mult
When multiple rows in

`y`

match to the row in`x`

,`mult=.`

controls which values are returned -`"all"`

(default),`"first"`

or`"last"`

.- nomatch
When a row (with interval say,

`[a,b]`

) in`x`

has no match in`y`

,`nomatch=NA`

(default) means`NA`

is returned for`y`

's non-`by.y`

columns for that row of`x`

.`nomatch=NULL`

(or`0`

for backward compatibility) means no rows will be returned for that row of`x`

. Use`options(datatable.nomatch=NULL)`

to change the default value (used when`nomatch`

is not supplied).- which
When

`TRUE`

, if`mult="all"`

returns a two column`data.table`

with the first column corresponding to`x`

's row number and the second corresponding to`y`

's. when`nomatch=NA`

, no matches return`NA`

for`y`

, and if`nomatch=NULL`

, those rows where no match is found will be skipped; if`mult="first" or "last"`

, a vector of length equal to the number of rows in`x`

is returned, with no-match entries filled with`NA`

or`0`

corresponding to the`nomatch`

argument. Default is`FALSE`

, which returns a join with the rows in`y`

.- verbose
`TRUE`

turns on status and information messages to the console. Turn this on by default using`options(datatable.verbose=TRUE)`

. The quantity and types of verbosity may be expanded in future.

##### Details

Very briefly, `foverlaps()`

collapses the two-column interval in `y`

to one-column of *unique* values to generate a `lookup`

table, and
then performs the join depending on the type of `overlap`

, using the
already available `binary search`

feature of `data.table`

. The time
(and space) required to generate the `lookup`

is therefore proportional
to the number of unique values present in the interval columns of `y`

when combined together.

Overlap joins takes advantage of the fact that `y`

is sorted to speed-up
finding overlaps. Therefore `y`

has to be keyed (see `?setkey`

)
prior to running `foverlaps()`

. A key on `x`

is not necessary,
although it *might* speed things further. The columns in `by.x`

argument should correspond to the columns specified in `by.y`

. The last
two columns should be the *interval* columns in both `by.x`

and
`by.y`

. The first interval column in `by.x`

should always be <= the
second interval column in `by.x`

, and likewise for `by.y`

. The
`storage.mode`

of the interval columns must be either `double`

or `integer`

. It therefore works with `bit64::integer64`

type as well.

The `lookup`

generation step could be quite time consuming if the number
of unique values in `y`

are too large (ex: in the order of tens of millions).
There might be improvements possible by constructing lookup using RLE, which is
a pending feature request. However most scenarios will not have too many unique
values for `y`

.

##### Value

A new `data.table`

by joining over the interval columns (along with other
additional identifier columns) specified in `by.x`

and `by.y`

.

NB: When `which=TRUE`

: `a)`

`mult="first" or "last"`

returns a
`vector`

of matching row numbers in `y`

, and `b)`

when
`mult="all"`

returns a data.table with two columns with the first
containing row numbers of `x`

and the second column with corresponding
row numbers of `y`

.

`nomatch=NA or 0`

also influences whether non-matching rows are returned
or not, as explained above.

##### See Also

`data.table`

,
http://www.bioconductor.org/packages/release/bioc/html/IRanges.html,
`setNumericRounding`

##### Examples

```
# NOT RUN {
require(data.table)
## simple example:
x = data.table(start=c(5,31,22,16), end=c(8,50,25,18), val2 = 7:10)
y = data.table(start=c(10, 20, 30), end=c(15, 35, 45), val1 = 1:3)
setkey(y, start, end)
foverlaps(x, y, type="any", which=TRUE) ## return overlap indices
foverlaps(x, y, type="any") ## return overlap join
foverlaps(x, y, type="any", mult="first") ## returns only first match
foverlaps(x, y, type="within") ## matches iff 'x' is within 'y'
## with extra identifiers (ex: in genomics)
x = data.table(chr=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"),
start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60))
y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1),
end=c(4, 18, 55), geneid=letters[1:3])
setkey(y, chr, start, end)
foverlaps(x, y, type="any", which=TRUE)
foverlaps(x, y, type="any")
foverlaps(x, y, type="any", nomatch=NULL)
foverlaps(x, y, type="within", which=TRUE)
foverlaps(x, y, type="within")
foverlaps(x, y, type="start")
## x and y have different column names - specify by.x
x = data.table(seq=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"),
start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60))
y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1),
end=c(4, 18, 55), geneid=letters[1:3])
setkey(y, chr, start, end)
foverlaps(x, y, by.x=c("seq", "start", "end"),
type="any", which=TRUE)
# }
```

*Documentation reproduced from package data.table, version 1.12.6, License: MPL-2.0 | file LICENSE*