merge
Merge two data.tables
Fast merge of two data.table
s. The data.table
method behaves
very similarly to that of data.frame
s except that, by default, it attempts to merge
at first based on the shared key columns, and if there are none,
then based on key columns of the first argument
x
, and if there are none,then based on the common columns between the two
data.table
s.
Set the by
, or by.x
and by.y
arguments explicitly to override this default.
- Keywords
- data
Usage
# S3 method for data.table
merge(x, y, by = NULL, by.x = NULL, by.y = NULL, all = FALSE,
all.x = all, all.y = all, sort = TRUE, suffixes = c(".x", ".y"), no.dups = TRUE,
allow.cartesian=getOption("datatable.allow.cartesian"), # default FALSE
...)
Arguments
- x, y
data table
s.y
is coerced to adata.table
if it isn't one already.- by
A vector of shared column names in
x
andy
to merge on. This defaults to the shared key columns between the two tables. Ify
has no key columns, this defaults to the key ofx
.- by.x, by.y
Vectors of column names in
x
andy
to merge on.- all
logical;
all = TRUE
is shorthand to save setting bothall.x = TRUE
andall.y = TRUE
.- all.x
logical; if
TRUE
, then extra rows will be added to the output, one for each row inx
that has no matching row iny
. These rows will have 'NA's in those columns that are usually filled with values fromy
. The default isFALSE
, so that only rows with data from bothx
andy
are included in the output.- all.y
logical; analogous to
all.x
above.- sort
logical. If
TRUE
(default), the mergeddata.table
is sorted by setting the key to theby / by.x
columns. IfFALSE
, the result is not sorted.- suffixes
A
character(2)
specifying the suffixes to be used for making non-by
column names unique. The suffix behaviour works in a similar fashion as themerge.data.frame
method does.- no.dups
logical indicating that
suffixes
are also appended to non-by.y
column names iny
when they have the same column name as anyby.x
.- allow.cartesian
See
allow.cartesian
in[.data.table
.- …
Not used at this time.
Details
merge
is a generic function in base R. It dispatches to either the
merge.data.frame
method or merge.data.table
method depending on
the class of its first argument. Note that, unlike SQL
, NA
is
matched against NA
(and NaN
against NaN
) while merging.
In versions <= v1.9.4
, if the specified columns in by
were not the
key (or head of the key) of x
or y
, then a copy
is
first re-keyed prior to performing the merge. This was less performant as well as memory
inefficient. The concept of secondary keys (implemented in v1.9.4
) was
used to overcome this limitation from v1.9.6
+. No deep copies are made
any more, thereby improving performance and memory efficiency. Also, there is better
control for providing the columns to merge on with the help of the newly implemented
by.x
and by.y
arguments.
For a more data.table
-centric way of merging two data.table
s, see
[.data.table
; e.g., x[y, ...]
. See FAQ 1.12 for a detailed
comparison of merge
and x[y, ...]
.
If any column names provided to by.x
also occur in names(y)
but not in by.y
,
then this data.table
method will add the suffixes
to those column names. As of
R v3.4.3, the data.frame
method will not (leading to duplicate column names in the result) but a patch has
been proposed (see r-devel thread here)
which is looking likely to be accepted for a future version of R.
Value
A new data.table
based on the merged data table
s, and sorted by the
columns set (or inferred for) the by
argument if argument sort
is
set to TRUE
.
See Also
Examples
# NOT RUN {
(dt1 <- data.table(A = letters[1:10], X = 1:10, key = "A"))
(dt2 <- data.table(A = letters[5:14], Y = 1:10, key = "A"))
merge(dt1, dt2)
merge(dt1, dt2, all = TRUE)
(dt1 <- data.table(A = letters[rep(1:3, 2)], X = 1:6, key = "A"))
(dt2 <- data.table(A = letters[rep(2:4, 2)], Y = 6:1, key = "A"))
merge(dt1, dt2, allow.cartesian=TRUE)
(dt1 <- data.table(A = c(rep(1L, 5), 2L), B = letters[rep(1:3, 2)], X = 1:6, key = "A,B"))
(dt2 <- data.table(A = c(rep(1L, 5), 2L), B = letters[rep(2:4, 2)], Y = 6:1, key = "A,B"))
merge(dt1, dt2)
merge(dt1, dt2, by="B", allow.cartesian=TRUE)
# test it more:
d1 <- data.table(a=rep(1:2,each=3), b=1:6, key="a,b")
d2 <- data.table(a=0:1, bb=10:11, key="a")
d3 <- data.table(a=0:1, key="a")
d4 <- data.table(a=0:1, b=0:1, key="a,b")
merge(d1, d2)
merge(d2, d1)
merge(d1, d2, all=TRUE)
merge(d2, d1, all=TRUE)
merge(d3, d1)
merge(d1, d3)
merge(d1, d3, all=TRUE)
merge(d3, d1, all=TRUE)
merge(d1, d4)
merge(d1, d4, by="a", suffixes=c(".d1", ".d4"))
merge(d4, d1)
merge(d1, d4, all=TRUE)
merge(d4, d1, all=TRUE)
# new feature, no need to set keys anymore
set.seed(1L)
d1 <- data.table(a=sample(rep(1:3,each=2)), z=1:6)
d2 <- data.table(a=2:0, z=10:12)
merge(d1, d2, by="a")
merge(d1, d2, by="a", all=TRUE)
# new feature, using by.x and by.y arguments
setnames(d2, "a", "b")
merge(d1, d2, by.x="a", by.y="b")
merge(d1, d2, by.x="a", by.y="b", all=TRUE)
merge(d2, d1, by.x="b", by.y="a")
# }