data.tablepackage
Enhanced data.frame
data.table
inherits from data.frame
. It offers fast and nemory efficient: file reader and writer, aggregations, updates, equi, nonequi, rolling, range and interval joins, in a short and flexible syntax, for faster development.
It is inspired by A[B]
syntax in R where A
is a matrix and B
is a 2column matrix. Since a data.table
is a data.frame
, it is compatible with R functions and packages that accept only data.frame
s.
Type vignette(package="data.table")
to get started. The ../doc/datatableintro.html vignette introduces data.table
's x[i, j, by]
syntax and is a good place to start. If you have read the vignettes and the help page below, please feel free to ask questions on Stack Overflow http://stackoverflow.com/questions/tagged/data.table or on http://r.789695.n4.nabble.com/datatablehelpf2315188.html mailing list. To report a bug please type: bug.report(package = "data.table")
.
Please check the https://github.com/Rdatatable/data.table/wiki for up to the minute live NEWS.
Tip: one of the quickest ways to learn the features is to type example(data.table)
and study the output at the prompt.
 Keywords
 data
Usage
data.table(..., keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFactors=FALSE)# S3 method for data.table
[(x, i, j, by, keyby, with = TRUE,
nomatch = getOption("datatable.nomatch"), # default: NA_integer_
mult = "all",
roll = FALSE,
rollends = if (roll=="nearest") c(TRUE,TRUE)
else if (roll>=0) c(FALSE,TRUE)
else c(TRUE,FALSE),
which = FALSE,
.SDcols,
verbose = getOption("datatable.verbose"), # default: FALSE
allow.cartesian = getOption("datatable.allow.cartesian"), # default: FALSE
drop = NULL, on = NULL)
Arguments
 …
 Just as
…
indata.frame
. Usual recycling rules are applied to vectors of different lengths to create a list of equal length vectors.  keep.rownames
 If
…
is amatrix
ordata.frame
,TRUE
will retain the rownames of that object in a column namedrn
.  check.names
 Just as
check.names
indata.frame
.  key
 Character vector of one or more column names which is passed to
setkey
. It may be a single comma separated string such askey="x,y,z"
, or a vector of names such askey=c("x","y","z")
.  stringsAsFactors
 Logical (default is
FALSE
). Convert allcharacter
columns tofactor
s?  x
 A
data.table
.  i
 Integer, logical or character vector, single column numeric
matrix
, expression of column names,list
,data.frame
ordata.table
.integer
andlogical
vectors work the same way they do in[.data.frame
except logicalNA
s are treated as FALSE.expression
is evaluated within the frame of thedata.table
(i.e. it sees column names as if they are variables) and can evaluate to any of the other types.character
,list
anddata.frame
input toi
is converted into adata.table
internally usingas.data.table
.If
i
is adata.table
, the columns ini
to be matched againstx
can be specified using one of these ways:on
argument (see below). It allows for bothequi
and the newly implementednonequi
joins.
 If not,
x
must be keyed. Key can be set usingsetkey
. Ifi
is also keyed, then first key column ofi
is matched against first key column ofx
, second against second, etc..If
i
is not keyed, then first column ofi
is matched against first key column ofx
, second column ofi
against second key column ofx
, etc...This is summarised in code as
min(length(key(x)), if (haskey(i)) length(key(i)) else ncol(i))
.Using
on=
is recommended (even during keyed joins) as it helps understand the code better and also allows for nonequi joins. When the binary operator==
alone is used, an equi join is performed. In SQL terms,x[i]
then performs a right join by default.i
prefixed with!
signals a notjoin or notselect.Support for nonequi join was recently implemented, which allows for other binary operators
>=, >, <= and <
.See ../doc/datatablekeysfastsubset.html and ../doc/datatablesecondaryindicesandautoindexing.html.
Advanced: When
i
is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope.  j
 When
with=TRUE
(default),j
is evaluated within the frame of the data.table; i.e., it sees column names as if they are variables. This allows to not just select columns inj
, but alsocompute
on them e.g.,x[, a]
andx[, sum(a)]
returnsx$a
andsum(x$a)
as a vector respectively.x[, .(a, b)]
andx[, .(sa=sum(a), sb=sum(b))]
returns a two column data.table each, the first simply selecting columnsa, b
and the second computing their sums.The expression `.()` is a shorthand alias to
list()
; they both mean the same. As long asj
returns alist
, each element of the list becomes a column in the resultingdata.table
. This is the default enhanced mode.When
with=FALSE
,j
can only be a vector of column names or positions to select (as indata.frame
).Advanced:
j
also allows the use of special readonly symbols:.SD
,.N
,.I
,.GRP
,.BY
.Advanced: When
i
is adata.table
, the columns ofi
can be referred to inj
by using the prefixi.
, e.g.,X[Y, .(val, i.val)]
. Hereval
refers toX
's column andi.val
Y
's.Advanced: Columns of
x
can now be referred to using the prefixx.
and is particularly useful during joining to refer tox
's join columns as they are otherwise masked byi
's. For example,X[Y, .(x.ai.a, b), on="a"]
.See ../doc/datatableintro.html vignette and examples.
 by
 Column names are seen as if they are variables (as in
j
whenwith=TRUE
). Thedata.table
is then grouped by theby
andj
is evaluated within each group. The order of the rows within each group is preserved, as is the order of the groups.by
accepts: A single unquoted column name: e.g.,
DT[, .(sa=sum(a)), by=x]
 a
list()
of expressions of column names: e.g.,DT[, .(sa=sum(a)), by=.(x=x>0, y)]
 a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end): e.g.,
DT[, sum(a), by="x,y,z"]
 a character vector of column names: e.g.,
DT[, sum(a), by=c("x", "y")]
 or of the form
startcol:endcol
: e.g.,DT[, sum(a), by=x:z]
Advanced: When
i
is alist
(ordata.frame
ordata.table
),DT[i, j, by=.EACHI]
evaluatesj
for the groups in `DT` that each row ini
joins to. That is, you can join (ini
) and aggregate (inj
) simultaneously. We call this grouping by each i. See http://stackoverflow.com/a/27004566/559784 for a more detailed explanation until we https://github.com/Rdatatable/data.table/issues/944.Advanced: In the
X[Y, j]
form of grouping, thej
expression sees variables inX
first, thenY
. We call this join inherited scope. If the variable is not inX
orY
then the calling frame is searched, its calling frame, and so on in the usual way up to and including the global environment.  A single unquoted column name: e.g.,
 keyby
 Same as
by
, but with an additionalsetkey()
run on theby
columns of the result, for convenience. It is common practice to use `keyby=` routinely when you wish the result to be sorted.  with
 By default
with=TRUE
andj
is evaluated within the frame ofx
; column names can be used as variables.When
with=FALSE
j
is a character vector of column names, a numeric vector of column positions to select or of the formstartcol:endcol
, and the value returned is always adata.table
.with=FALSE
is often useful indata.table
to select columns dynamically. Note thatx[, cols, with=FALSE]
is equivalent tox[, .SD, .SDcols=cols]
.  nomatch
 Same as
nomatch
inmatch
. When a row ini
has no match tox
,nomatch=NA
(default) meansNA
is returned.0
means no rows will be returned for that row ofi
. Useoptions(datatable.nomatch=0)
to change the default value (used whennomatch
is not supplied).  mult
 When
i
is alist
(ordata.frame
ordata.table
) and multiple rows inx
match to the row ini
,mult
controls which are returned:"all"
(default),"first"
or"last"
.  roll
 When
i
is adata.table
and its row matches to all but the lastx
join column, and its value in the lasti
join column falls in a gap (including after the last observation inx
for that group), then:+Inf
(orTRUE
) rolls the prevailing value inx
forward. It is also known as last observation carried forward (LOCF).Inf
rolls backwards instead; i.e., next observation carried backward (NOCB). finite positive or negative number limits how far values are carried forward or backward.
 "nearest" rolls the nearest value instead.
Rolling joins apply to the last join column, generally a date but can be any variable. It is particularly fast using a modified binary search. A common idiom is to select a contemporaneous regular time series (
dts
) across a set of identifiers (ids
):DT[CJ(ids,dts),roll=TRUE]
whereDT
has a 2column key (id,date) andCJ
stands for cross join.  rollends
 A logical vector length 2 (a single logical is recycled) indicating whether values falling before the first value or after the last value for a group should be rolled as well.
 If
rollends[2]=TRUE
, it will roll the last value forward.TRUE
by default for LOCF andFALSE
for NOCB rolls.  If
rollends[1]=TRUE
, it will roll the first value backward.TRUE
by default for NOCB andFALSE
for LOCF rolls.
When
roll
is a finite number, that limit is also applied when rolling the ends.  If
 which
TRUE
returns the row numbers ofx
thati
matches to. IfNA
, returns the row numbers ofi
that have no match inx
. By defaultFALSE
and the rows inx
that match are returned. .SDcols
 Specifies the columns of
x
to be included in the special symbol.SD
which stands forSubset of data.table
. May be character column names or numeric positions. This is useful for speed when applying a function through a subset of (possible very many) columns; e.g.,DT[, lapply(.SD, sum), by="x,y", .SDcols=301:350]
. For convenient interactive use, the formstartcol:endcol
is also allowed (as inby
), e.g.,DT[, lapply(.SD, sum), by=x:y, .SDcols=a:f]
 verbose

TRUE
turns on status and information messages to the console. Turn this on by default usingoptions(datatable.verbose=TRUE)
. The quantity and types of verbosity may be expanded in future.  allow.cartesian

FALSE
prevents joins that would result in more thannrow(x)+nrow(i)
rows. This is usually caused by duplicate values ini
's join columns, each of which join to the same group in `x` over and over again: a misspecified join. Usually this was not intended and the join needs to be changed. The word 'cartesian' is used loosely in this context. The traditional cartesian join is (deliberately) difficult to achieve indata.table
: where every row ini
joins to every row inx
(anrow(x)*nrow(i)
row result). 'cartesian' is just meant in a 'large multiplicative' sense.  drop
 Never used by
data.table
. Do not use. It needs to be here becausedata.table
inherits fromdata.frame
. See ../doc/datatablefaq.html.  on
 Indicate which columns in
i
should be joined with columns inx
along with the type of binary operator to join with. When specified, this overrides the keys set onx
andi
. There are multiple ways of specifyingon
argument: As a character vector, e.g.,
X[Y, on=c("a", "b")]
. This assumes both these columns are present inX
andY
.  As a named character vector, e.g.,
X[Y, on=c(x="a", y="b")]
. This is useful when column names to join by are different between the two tables.NB:
X[Y, on=c("a", y="b")]
is also possible if column"a"
is common between the two tables.  For convenience during interactive scenarios, it is also possible to use
.()
syntax asX[Y, on=.(a, b)]
.  From v1.9.8, (nonequi) joins using binary operators
>=, >, <=, <
are also possible, e.g.,X[Y, on=c("x>=a", "y<=b")]
, or for interactive use asX[Y, on=.(x>=a, y<=b)]
.
See examples as well as ../doc/datatablesecondaryindicesandautoindexing.html.
 As a character vector, e.g.,
Details
data.table
builds on base R functionality to reduce 2 types of time:
 programming time (easier to write, read, debug and maintain), and
 compute time (fast and memory efficient).
The general form of data.table syntax is:
DT[ i, j, by ] # + extra arguments      > grouped by what?  > what to do? > on which rows?
The way to read this out loud is: "Take DT
, subset rows by i
, then compute j
grouped by by
. Here are some basic usage examples expanding on this definition. See the vignette (and examples) for working examples.
X[, a] # return col 'a' from X as vector. If not found, search in parent frame. X[, .(a)] # same as above, but return as a data.table. X[, sum(a)] # return sum(a) as a vector (with same scoping rules as above) X[, .(sum(a)), by=c] # get sum(a) grouped by 'c'. X[, sum(a), by=c] # same as above, .() can be ommitted in by on single expression for convenience X[, sum(a), by=c:f] # get sum(a) grouped by all columns in between 'c' and 'f' (both inclusive)
X[, sum(a), keyby=b] # get sum(a) grouped by 'b', and sort that result by the grouping column 'b' X[, sum(a), by=b][order(b)] # same order as above, but by chaining compound expressions X[c>1, sum(a), by=c] # get rows where c>1 is TRUE, and on those rows, get sum(a) grouped by 'c' X[Y, .(a, b), on="c"] # get rows where Y$c == X$c, and select columns 'X$a' and 'X$b' for those rows X[Y, .(a, i.a), on="c"] # get rows where Y$c == X$c, and then select 'X$a' and 'Y$a' (=i.a) X[Y, sum(a*i.a), on="c" by=.EACHI] # for *each* 'Y$c', get sum(a*i.a) on matching rows in 'X$c'
X[, plot(a, b), by=c] # j accepts any expression, generates plot for each group and returns no data # see ?assign to add/update/delete columns by reference using the same consistent interface
A data.table
is a list
of vectors, just like a data.frame
. However :
 it never has or uses rownames. Rownames based indexing can be done by setting a key of one or more columns or done adhoc using the
on
argument (now preferred).  it has enhanced functionality in
[.data.table
for fast joins of keyed tables, fast aggregation, fast last observation carried forward (LOCF) and fast add/modify/delete of columns by reference with no copy at all.
See the see also
section for the several other methods that are available for operating on data.tables efficiently.
Note
If keep.rownames
or check.names
are supplied they must be written in full because R does not allow partial argument names after `…
`. For example, data.table(DF, keep=TRUE)
will create a
column called "keep"
containing TRUE
and this is correct behaviour; data.table(DF, keep.rownames=TRUE)
was intended.
POSIXlt
is not supported as a column type because it uses 40 bytes to store a single datetime. They are implicitly converted to POSIXct
type with warning. You may also be interested in IDateTime
instead; it has methods to convert to and from POSIXlt
.
References
https://github.com/Rdatatable/data.table/wiki (data.table
homepage)
http://crantastic.org/packages/datatable (User reviews)
http://en.wikipedia.org/wiki/Binary_search
See Also
specialsymbols
, data.frame
, [.data.frame
, as.data.table
, setkey
, setorder
, setDT
, setDF
, J
, SJ
, CJ
, merge.data.table
, tables
, test.data.table
, IDateTime
, unique.data.table
, copy
, :=
, alloc.col
, truelength
, rbindlist
, setNumericRounding
, datatableoptimize
, fsetdiff
, funion
, fintersect
, fsetequal
, anyDuplicated
, uniqueN
, rowid
, rleid
, na.omit
, frank
Examples
## Not run: 
# example(data.table) # to run these examples at the prompt
## 
DF = data.frame(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
DF
DT
identical(dim(DT), dim(DF)) # TRUE
identical(DF$a, DT$a) # TRUE
is.list(DF) # TRUE
is.list(DT) # TRUE
is.data.frame(DT) # TRUE
tables()
# basic row subset operations
DT[2] # 2nd row
DT[3:2] # 3rd and 2nd row
DT[order(x)] # no need for order(DT$x)
DT[order(x), ] # same as above. The ',' is optional
DT[y>2] # all rows where DT$y > 2
DT[y>2 & v>5] # compound logical expressions
DT[!2:4] # all rows other than 2:4
DT[(2:4)] # same
# selectcompute columns data.table way
DT[, v] # v column (as vector)
DT[, list(v)] # v column (as data.table)
DT[, .(v)] # same as above, .() is a shorthand alias to list()
DT[, sum(v)] # sum of column v, returned as vector
DT[, .(sum(v))] # same, but return data.table (column autonamed V1)
DT[, .(sv=sum(v))] # same, but column named "sv"
DT[, .(v, v*2)] # return two column data.table, v and v*2
# subset rows and selectcompute data.table way
DT[2:3, sum(v)] # sum(v) over rows 2 and 3, return vector
DT[2:3, .(sum(v))] # same, but return data.table with column V1
DT[2:3, .(sv=sum(v))] # same, but return data.table with column sv
DT[2:5, cat(v, "\n")] # just for j's side effect
# select columns the data.frame way
DT[, 2, with=FALSE] # 2nd column, returns a data.table always
colNum = 2
DT[, colNum, with=FALSE] # same, equivalent to DT[, .SD, .SDcols=colNum]
DT[["v"]] # same as DT[, v] but much faster
# grouping operations  j and by
DT[, sum(v), by=x] # ad hoc by, order of groups preserved in result
DT[, sum(v), keyby=x] # same, but order the result on by cols
DT[, sum(v), by=x][order(x)] # same but by chaining expressions together
# fast ad hoc row subsets (subsets as joins)
DT["a", on="x"] # same as x == "a" but uses binary search (fast)
DT["a", on=.(x)] # same, for convenience, no need to quote every column
DT[.("a"), on="x"] # same
DT[x=="a"] # same, single "==" internally optimised to use binary search (fast)
DT[x!="b"  y!=3] # not yet optimized, currently vector scan subset
DT[.("b", 3), on=c("x", "y")] # join on columns x,y of DT; uses binary search (fast)
DT[.("b", 3), on=.(x, y)] # same, but using on=.()
DT[.("b", 1:2), on=c("x", "y")] # no match returns NA
DT[.("b", 1:2), on=.(x, y), nomatch=0] # no match row is not returned
DT[.("b", 1:2), on=c("x", "y"), roll=Inf] # locf, nomatch row gets rolled by previous row
DT[.("b", 1:2), on=.(x, y), roll=Inf] # nocb, nomatch row gets rolled by next row
DT["b", sum(v*y), on="x"] # on rows where DT$x=="b", calculate sum(v*y)
# all together now
DT[x!="a", sum(v), by=x] # get sum(v) by "x" for each i != "a"
DT[!"a", sum(v), by=.EACHI, on="x"] # same, but using subsetsasjoins
DT[c("b","c"), sum(v), by=.EACHI, on="x"] # same
DT[c("b","c"), sum(v), by=.EACHI, on=.(x)] # same, using on=.()
# joins as subsets
X = data.table(x=c("c","b"), v=8:7, foo=c(4,2))
X
DT[X, on="x"] # right join
X[DT, on="x"] # left join
DT[X, on="x", nomatch=0] # inner join
DT[!X, on="x"] # not join
DT[X, on=.(y<=foo)] # NEW nonequi join (v1.9.8+)
DT[X, on="y<=foo"] # same as above
DT[X, on=c("y<=foo")] # same as above
DT[X, on=.(y>=foo)] # NEW nonequi join (v1.9.8+)
DT[X, on=.(x, y<=foo)] # NEW nonequi join (v1.9.8+)
DT[X, .(x,y,x.y,v), on=.(x, y>=foo)] # Select x's join columns as well
DT[X, on="x", mult="first"] # first row of each group
DT[X, on="x", mult="last"] # last row of each group
DT[X, sum(v), by=.EACHI, on="x"] # join and eval j for each row in i
DT[X, sum(v)*foo, by=.EACHI, on="x"] # join inherited scope
DT[X, sum(v)*i.v, by=.EACHI, on="x"] # 'i,v' refers to X's v column
DT[X, on=.(x, v>=v), sum(y)*foo, by=.EACHI] # NEW nonequi join with by=.EACHI (v1.9.8+)
# setting keys
kDT = copy(DT) # (deep) copy DT to kDT to work with it.
setkey(kDT,x) # set a 1column key. No quotes, for convenience.
setkeyv(kDT,"x") # same (v in setkeyv stands for vector)
v="x"
setkeyv(kDT,v) # same
# key(kDT)<"x" # copies whole table, please use set* functions instead
haskey(kDT) # TRUE
key(kDT) # "x"
# fast *keyed* subsets
kDT["a"] # subsetasjoin on *key* column 'x'
kDT["a", on="x"] # same, being explicit using 'on=' (preferred)
# all together
kDT[!"a", sum(v), by=.EACHI] # get sum(v) for each i != "a"
# multicolumn key
setkey(kDT,x,y) # 2column key
setkeyv(kDT,c("x","y")) # same
# fast *keyed* subsets on multicolumn key
kDT["a"] # join to 1st column of key
kDT["a", on="x"] # on= is optional, but is preferred
kDT[.("a")] # same, .() is an alias for list()
kDT[list("a")] # same
kDT[.("a", 3)] # join to 2 columns
kDT[.("a", 3:6)] # join 4 rows (2 missing)
kDT[.("a", 3:6), nomatch=0] # remove missing
kDT[.("a", 3:6), roll=TRUE] # locf rolling join
kDT[.("a", 3:6), roll=Inf] # same as above
kDT[.("a", 3:6), roll=Inf] # nocb rolling join
kDT[!.("a")] # not join
kDT[!"a"] # same
# more on special symbols, see also ?"specialsymbols"
DT[.N] # last row
DT[, .N] # total number of rows in DT
DT[, .N, by=x] # number of rows in each group
DT[, .SD, .SDcols=x:y] # select columns 'x' and 'y'
DT[, .SD[1]] # first row of all columns
DT[, .SD[1], by=x] # first row of 'y' and 'v' for each group in 'x'
DT[, c(.N, lapply(.SD, sum)), by=x] # get rows *and* sum columns 'v' and 'y' by group
DT[, .I[1], by=x] # row number in DT corresponding to each group
DT[, grp := .GRP, by=x] # add a group counter column
X[, DT[.BY, y, on="x"], by=x] # join within each group
# add/update/delete by reference (see ?assign)
print(DT[, z:=42L]) # add new column by reference
print(DT[, z:=NULL]) # remove column by reference
print(DT["a", v:=42L, on="x"]) # subassign to existing v column by reference
print(DT["b", v2:=84L, on="x"]) # subassign to new column by reference (NA padded)
DT[, m:=mean(v), by=x][] # add new column by reference by group
# NB: postfix [] is shortcut to print()
# advanced usage
DT = data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6), a=1:9, b=9:1)
DT[, sum(v), by=.(y%%2)] # expressions in by
DT[, sum(v), by=.(bool = y%%2)] # same, using a named list to change by column name
DT[, .SD[2], by=x] # get 2nd row of each group
DT[, tail(.SD,2), by=x] # last 2 rows of each group
DT[, lapply(.SD, sum), by=x] # sum of all (other) columns for each group
DT[, .SD[which.min(v)], by=x] # nested query by group
DT[, list(MySum=sum(v),
MyMin=min(v),
MyMax=max(v)),
by=.(x, y%%2)] # by 2 expressions
DT[, .(a = .(a), b = .(b)), by=x] # list columns
DT[, .(seq = min(a):max(b)), by=x] # j is not limited to just aggregations
DT[, sum(v), by=x][V1<20] # compound query
DT[, sum(v), by=x][order(V1)] # ordering results
DT[, c(.N, lapply(.SD,sum)), by=x] # get number of observations and sum per group
DT[, {tmp < mean(y);
.(a = atmp, b = btmp)
}, by=x] # anonymous lambdain 'j', j accepts any valid
# expression. TO REMEMBER: every element of
# the list becomes a column in result.
pdf("new.pdf")
DT[, plot(a,b), by=x] # can also plot in 'j'
dev.off()
# using rleid, get max(y) and min of all cols in .SDcols for each consecutive run of 'v'
DT[, c(.(y=max(y)), lapply(.SD, min)), by=rleid(v), .SDcols=v:b]
# Follow rhelp posting guide, SUPPORT is here (*not* rhelp) :
# http://stackoverflow.com/questions/tagged/data.table
# or
# datatablehelp@lists.rforge.rproject.org
## Not run: 
# vignette("datatableintro")
# vignette("datatablereferencesemantics")
# vignette("datatablekeysfastsubset")
# vignette("datatablesecondaryindicesandautoindexing")
# vignette("datatablereshape")
# vignette("datatablefaq")
#
#
# test.data.table() # over 5700 low level tests
#
# # keep up to date with latest stable version on CRAN
# update.packages()
# # get the latest devel (needs Rtools for windows, xcode for mac)
# install.packages("data.table", repos = "https://Rdatatable.github.io/data.table", type = "source")
#
## 