foverlaps: Fast overlap joins

Description

A fast binary-search based overlap join of two data.tables. It is heavily inspired by the findOverlaps function from the Bioconductor package IRanges (see link below under See Also). Usually, x is a very large data.table with small interval ranges, and y is a much smaller keyed data.table with relatively larger interval spans. For a usage example in genomics, see the Examples section. NOTE: This is still under development: it is stable, but some features are yet to be implemented. Also, some arguments and/or the function name itself could change.

Usage

foverlaps(x, y, by.x = if (!is.null(key(x))) key(x) else key(y), 
    by.y = key(y), maxgap = 0L, minoverlap = 1L, 
    type = c("any", "within", "start", "end", "equal"), 
    mult = c("all", "first", "last"), 
    nomatch = getOption("datatable.nomatch"), 
    which = FALSE, verbose = getOption("datatable.verbose"))

Arguments

x, y

data.tables. y needs to be keyed, but not necessarily x. See examples.

by.x, by.y

A vector of column names (or numbers) on which to compute the overlap joins. The last two columns in both by.x and by.y should each correspond to the start and end interval columns in x and y

maxgap

It should be a non-negative integer value, >= 0. Default is 0 (no gap). For intervals [a,b] and [c,d], where a<=b and c<=d, when c > b or d < a, the two intervals don't overlap.
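The non-overlap condition above can be written as a small base R predicate. This is only an illustrative sketch: intervals_overlap is a hypothetical helper, not part of data.table, and it assumes closed intervals with a<=b and c<=d.

```r
# Hypothetical helper (not a data.table function): TRUE when the closed
# intervals [a, b] and [c, d] share at least one point.
# Assumes a <= b and c <= d, as in the description above.
intervals_overlap <- function(a, b, c, d) {
  # The intervals fail to overlap exactly when c > b or d < a.
  !(c > b | d < a)
}

intervals_overlap(5, 8, 10, 15)    # [5,8] ends before [10,15] starts
intervals_overlap(22, 25, 20, 35)  # [22,25] lies inside [20,35]
intervals_overlap(1, 5, 5, 9)      # the intervals touch at the point 5
```

Because the intervals are closed, two intervals that merely touch at an endpoint count as overlapping under this definition.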

minoverlap

It should be a positive integer value, > 0. Default is 1. For intervals [a,b] and [c,d], where a<=b and c<=d, when c<=b and d>=a, the two intervals overlap. If the length of over

type

Default value is any. Allowed values are any, within, start, end and equal. Note: equal is not yet implemented. But this is just a normal join of the type y[x

mult

When multiple rows in y match a row in x, mult controls which values are returned: "all" (default), "first" or "last".
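As a brief illustration of mult, the sketch below reuses the small tables from the Examples section, in which the second row of x overlaps two rows of y. nomatch=NA is passed explicitly so the result does not depend on session options.

```r
library(data.table)

x <- data.table(start = c(5, 31, 22, 16), end = c(8, 50, 25, 18), val2 = 7:10)
y <- data.table(start = c(10, 20, 30), end = c(15, 35, 45), val1 = 1:3)
setkey(y, start, end)

# [31,50] in x overlaps both [20,35] and [30,45] in y.
all_rows   <- foverlaps(x, y, mult = "all",   nomatch = NA)  # one row per match
first_rows <- foverlaps(x, y, mult = "first", nomatch = NA)  # first matching y row per x row
last_rows  <- foverlaps(x, y, mult = "last",  nomatch = NA)  # last matching y row per x row

nrow(all_rows)
first_rows$val1
last_rows$val1
```

With mult="first" or mult="last" the result has exactly one row per row of x, whereas mult="all" repeats a row of x once for each matching row of y.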

nomatch

Same as nomatch in match. When a row (with interval say, [a,b]) in x has no match in y, nomatch=NA (default) means NA is returned

which

When TRUE and mult="all", returns a two-column data.table with the first column corresponding to x's row number and the second corresponding to y's. When nomatch=NA, no matches r

verbose

TRUE turns on status and information messages to the console. Turn this on by default using options(datatable.verbose=TRUE). The quantity and types of verbosity may be expanded in future.

Value

A new data.table obtained by joining over the interval columns (along with any additional identifier columns) specified in by.x and by.y. NB: When which=TRUE: a) mult="first" or "last" returns a vector of matching row numbers in y, and b) mult="all" returns a data.table with two columns, the first containing row numbers of x and the second the corresponding row numbers of y. nomatch=NA or 0 also determines whether non-matching rows are returned, as explained above.
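The different return shapes under which=TRUE can be seen on the small tables from the Examples section. This is a minimal sketch; the variable names are illustrative only.

```r
library(data.table)

x <- data.table(start = c(5, 31, 22, 16), end = c(8, 50, 25, 18), val2 = 7:10)
y <- data.table(start = c(10, 20, 30), end = c(15, 35, 45), val1 = 1:3)
setkey(y, start, end)

# mult = "all": a two-column data.table of (x row, y row) pairs.
idx_na   <- foverlaps(x, y, which = TRUE, mult = "all", nomatch = NA)
idx_drop <- foverlaps(x, y, which = TRUE, mult = "all", nomatch = 0L)

# mult = "first": a plain vector of y row numbers, one per row of x.
idx_first <- foverlaps(x, y, which = TRUE, mult = "first", nomatch = NA)

idx_na     # rows of x without a match appear with NA in the y column
idx_drop   # rows of x without a match are dropped
idx_first
```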

Details

Very briefly, foverlaps() collapses the two-column interval in y to one column of unique values to generate a lookup table, and then performs the join depending on the type of overlap, using the already available binary search feature of data.table. The time (and space) required to generate the lookup is therefore proportional to the number of unique values present in the interval columns of y when combined together.
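To make the semantics concrete, the sketch below cross-checks foverlaps() against a naive all-pairs comparison on the small tables from the Examples section. The naive version is not how foverlaps() is implemented (it does not build a lookup table or use binary search); it is only an assumed reference implementation for type="any" that compares every row of x with every row of y.

```r
library(data.table)

x <- data.table(start = c(5, 31, 22, 16), end = c(8, 50, 25, 18), val2 = 7:10)
y <- data.table(start = c(10, 20, 30), end = c(15, 35, 45), val1 = 1:3)
setkey(y, start, end)

# Naive reference: test every (x row, y row) pair for overlap.
ov <- outer(seq_len(nrow(x)), seq_len(nrow(y)),
            function(i, j) x$start[i] <= y$end[j] & x$end[i] >= y$start[j])
naive <- which(ov, arr.ind = TRUE)
naive <- naive[order(naive[, 1], naive[, 2]), , drop = FALSE]

# Overlap join, returning only the matching row-number pairs.
fast <- foverlaps(x, y, type = "any", which = TRUE, nomatch = 0L)

# Number of distinct interval endpoints in y, which drives the lookup size.
n_unique_y <- length(unique(c(y$start, y$end)))
```

The naive comparison costs on the order of nrow(x) * nrow(y) operations, which is what the lookup-plus-binary-search approach is designed to avoid.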

Overlap joins take advantage of the fact that y is sorted to speed up finding overlaps. Therefore y has to be keyed (see ?setkey) prior to running foverlaps(). A key on x is not necessary, although it might speed things up further. The columns in the by.x argument should correspond to the columns specified in by.y. The last two columns should be the interval columns in both by.x and by.y. The first interval column in by.x should always be <= the second interval column in by.x, and likewise for by.y. The storage.mode of the interval columns must be either double or integer. It therefore works with bit64::integer64 type as well.
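The keying requirement can be demonstrated with a short sketch. The tables here are made up for illustration; the error is caught with tryCatch() so the snippet runs to completion.

```r
library(data.table)

x <- data.table(start = c(5L, 31L), end = c(8L, 50L))
y <- data.table(start = c(30L, 10L), end = c(45L, 15L), val1 = 1:2)

# y has no key yet, so foverlaps() is expected to stop with an error.
unkeyed_result <- tryCatch(
  foverlaps(x, y, by.x = c("start", "end"), by.y = c("start", "end")),
  error = function(e) "error"
)

# Keying y sorts it by (start, end) and marks it as sorted.
setkey(y, start, end)
joined <- foverlaps(x, y, by.x = c("start", "end"), nomatch = 0L)
joined
```

Note that setkey() reorders y by reference, so row numbers returned with which=TRUE refer to the sorted table, not to the original row order.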

The lookup generation step can be quite time consuming, but only when the number of unique values in y is very large (e.g. on the order of millions). This overlap join was developed on the assumption that y will not have too many unique values in most scenarios.

Note that a range join is a special case of an overlap join (or interval join) where the start and end intervals for data.table x are exactly the same.
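A range join of this kind can be sketched by giving each point in x a zero-width interval, i.e. identical start and end columns. The table pts and its pos column are hypothetical names chosen for this illustration.

```r
library(data.table)

# Points to look up, expressed as intervals with start == end.
pts <- data.table(pos = c(3, 12, 33, 60))
pts[, `:=`(start = pos, end = pos)]

y <- data.table(start = c(10, 20, 30), end = c(15, 35, 45), val1 = 1:3)
setkey(y, start, end)

# Each point is matched to every interval in y that contains it.
hits <- foverlaps(pts, y, by.x = c("start", "end"), type = "any", nomatch = 0L)
hits
```

Here the point 33 falls inside two intervals of y and therefore appears twice in the result, while 3 and 60 fall in none and are dropped because nomatch=0L.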

NB: type="equal" is not yet implemented, but it's just a normal join as long as the maxgap and minoverlap arguments are not changed from their default values. The maxgap and minoverlap arguments are also not yet implemented. We hope to implement these by the next release.

Examples


require(data.table)
## simple example:
x = data.table(start=c(5,31,22,16), end=c(8,50,25,18), val2 = 7:10)
y = data.table(start=c(10, 20, 30), end=c(15, 35, 45), val1 = 1:3)
setkey(y, start, end)
foverlaps(x, y, type="any", which=TRUE) ## return overlap indices
foverlaps(x, y, type="any") ## return overlap join
foverlaps(x, y, type="any", mult="first") ## returns only first match
foverlaps(x, y, type="within") ## matches iff 'x' is within 'y'

## with extra identifiers (ex: in genomics)
x = data.table(chr=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"), 
               start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60))
y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1), 
               end=c(4, 18, 55), geneid=letters[1:3])
setkey(y, chr, start, end)
foverlaps(x, y, type="any", which=TRUE)
foverlaps(x, y, type="any")
foverlaps(x, y, type="any", nomatch=0L)
foverlaps(x, y, type="within", which=TRUE)
foverlaps(x, y, type="within")
foverlaps(x, y, type="start")

## x and y have different column names - specify by.x
x = data.table(seq=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"), 
               start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60))
y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1), 
               end=c(4, 18, 55), geneid=letters[1:3])
setkey(y, chr, start, end)
foverlaps(x, y, by.x=c("seq", "start", "end"), 
            type="any", which=TRUE)
