data.table
Enhanced data.frame
data.table inherits from data.frame. It offers fast subset, fast grouping, fast update, fast ordered joins and list columns in a short and flexible syntax, for faster development. It is inspired by A[B] syntax in R where A is a matrix and B is a 2-column matrix. Since a data.table is a data.frame, it is compatible with R functions and packages that only accept data.frame.
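The A[B] idiom mentioned above is base R matrix indexing by a 2-column matrix, where each row of B holds one (row, column) pair. A minimal sketch in base R follows; the objects A and B and their values are made up here for illustration.

```r
# Base R: index a matrix A by a 2-column matrix B of (row, column) pairs
A <- matrix(1:12, nrow = 3)     # 3 x 4 integer matrix, filled column by column
B <- cbind(c(1, 3), c(2, 4))    # selects A[1, 2] and A[3, 4]
picked <- A[B]                  # one element of A per row of B
print(picked)
```

Each row of B acts as a lookup into A, which is the reading that the X[Y] join syntax later on this page appears to build on.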
The 10 minute quick start guide to data.table may be a good place to start: vignette("datatable-intro"). Or, the first section of FAQs is intended to be read from start to finish and is considered core documentation: vignette("datatable-faq"). If you have read and searched these documents and the help page below, please feel free to ask questions on the support channels listed at the end of the Examples. To report a bug, use bug.report(package="data.table").
Please run example(data.table) and study the output at the prompt.
- Keywords
- data
Usage
data.table(..., keep.rownames=FALSE, check.names=FALSE, key=NULL)
## S3 method for class 'data.table':
[(x, i, j, by, keyby, with=TRUE,
nomatch = getOption("datatable.nomatch"), # default: NA_integer_
mult = "all",
roll = FALSE, rollends = if (roll=="nearest") c(TRUE,TRUE) else { if (roll>=0) c(FALSE,TRUE) else c(TRUE,FALSE) },
which = FALSE,
.SDcols,
verbose=getOption("datatable.verbose"), # default: FALSE
allow.cartesian=getOption("datatable.allow.cartesian"), # default: FALSE
drop=NULL,
rolltolast = FALSE # deprecated
)
Arguments
- ...
- Just as ... in data.frame. Usual recycling rules are applied to vectors of different lengths to create a list of equal length vectors.
- keep.rownames
- If ... is a matrix or data.frame, TRUE will retain the rownames of that object in a column named rn.
- check.names
- Just as check.names in data.frame.
- key
- Character vector of one or more column names which is passed to setkey. It may be a single comma separated string such as key="x,y,z", or a vector of names such as key=c("x","y","z").
- x
- A data.table.
- i
- Integer, logical or character vector, expression of column names, list or data.table. Integer and logical vectors work the same way they do in [.data.frame. Other than
- j
- A single column name, single expression of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are l
- by
- A single unquoted column name, a list() of expressions of column names, a single character string containing comma separated column names, or a character vector of column names. The list() of expressions is evaluated within th
- keyby
- An ad hoc by just as by but with an additional setkey() on the by columns of the result, for convenience. Not to be confused with a keyed by as defined above.
- with
- By default with=TRUE and j is evaluated within the frame of x. The column names can be used as variables. When with=FALSE, j works as it does in [.data.frame.
- nomatch
- Same as nomatch in match. When a row in i has no match to x's key, nomatch=NA (default) means NA is returned for x's non-join colu
- mult
- When multiple rows in x match to the row in i, mult controls which are returned: "all" (default), "first" or "last".
- roll
- Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i's row matches to all but the last x join column, and its value in the last i
- rollends
- A logical vector length 2 (a single logical is recycled). When rolling forward (e.g. roll=TRUE) if a value is past the last observation within each group defined by the join columns, rollends[2]=TRUE will roll the last v
- which
- TRUE returns the row numbers of x that i matches to. NA returns the row numbers of i that have no match in x. By default FALSE and the rows in x that m
- .SDcols
- Advanced. Specifies the columns of x included in .SD. May be character column names or numeric positions. This is useful for speed when applying a function through a subset of (possibly very many) columns; e.g., DT[,lapply(
- verbose
- TRUE turns on status and information messages to the console. Turn this on by default using options(datatable.verbose=TRUE). The quantity and types of verbosity may be expanded in future.
- allow.cartesian
- FALSE prevents joins that would result in more than max(nrow(x),nrow(i)) rows. This is usually caused by duplicate values in i's join columns, each of which join to the same group in `x` over and over again: a mi
- drop
- Never used by data.table. Do not use. It needs to be here because data.table inherits from data.frame. See vignette("datatable-faq").
- rolltolast
- Deprecated. Setting rolltolast=TRUE is converted to roll=TRUE; rollends=FALSE for backwards compatibility.
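Several of the arguments above are easier to follow on a small table. The following is a hedged sketch, not part of the original help page: the tables and the column names x, y and v are invented for illustration, and behaviour may differ slightly between data.table versions.

```r
library(data.table)

# Constructor arguments: keep.rownames= and key=
DF <- data.frame(v = 1:2, row.names = c("r1", "r2"))
withRn <- data.table(DF, keep.rownames = TRUE)   # rownames kept in column "rn"

DT <- data.table(x = c("b", "a", "a"), y = c(1, 2, 3), v = 1:3,
                 key = c("x", "y"))              # key= is passed to setkey
key(DT)

# Arguments of [.data.table on the keyed table
firstA  <- DT[.("a"), mult = "first"]   # first of the rows of x matching "a"
rowsA   <- DT[.("a"), which = TRUE]     # row numbers of x that i matches to
noMatch <- DT[.("z"), nomatch = 0]      # unmatched rows of i are dropped
sums    <- DT[, lapply(.SD, sum), by = x, .SDcols = "v"]  # .SD holds only v
```

Because key= sorts the table by x then y, the rows for "a" come first, which is why which=TRUE reports rows 1 and 2 in this sketch.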
Details
data.table builds on base R functionality to reduce 2 types of time :
- programming time (easier to write, read, debug and maintain)
- compute time
It combines database like operations such as subset, with and by and provides similar joins that merge provides but faster. This is achieved by using R's column based ordered in-memory data.frame structure, eval within the environment of a list, the [.data.table mechanism to condense the features, and compiled C to make certain operations fast.
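To make the comparison with subset, with, by and merge concrete, here is a small side-by-side sketch. It is illustrative only: the data and the column names x, v and foo are assumptions made for this example, and the base R calls shown are just one of several ways to get the same results.

```r
library(data.table)

# Illustrative data
DF <- data.frame(x = c("a", "a", "b"), v = 1:3, stringsAsFactors = FALSE)
DT <- as.data.table(DF)

# subset + with: sum v where x == "a"
baseSum <- with(subset(DF, x == "a"), sum(v))
dtSum   <- DT[x == "a", sum(v)]

# by: sum v within each group of x
baseBy <- aggregate(v ~ x, data = DF, FUN = sum)
dtBy   <- DT[, list(v = sum(v)), by = x]

# join: look up rows of DT using a second table, as merge would
X <- data.table(x = "b", foo = 4)
setkey(DT, x)
joined <- DT[X]                       # columns x, v, foo
merged <- merge(DT, X, by = "x")
```

In each pair the data.table form expresses the row selection, the computation and the grouping inside a single pair of square brackets.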
The package can be used just for rapid programming (compact syntax). Largest compute time benefits are on 64bit platforms with plentiful RAM, or when smaller datasets are repeatedly queried within a loop, or when other methods use so much working memory that they fail with an out of memory error.
As with [.data.frame, compound queries can be concatenated on one line; e.g.,
DT[,sum(v),by=colA][V1<300][tail(order(V1))]
# sum(v) by colA then return the 6 largest which are under 300
The j expression does not have to return data; e.g.,
DT[,plot(colB,colC),by=colA]
# produce a set of plots (likely to pdf) returning no data
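The compound query can be unpacked one bracket at a time. The sketch below uses invented data (eight groups in colA with made-up values of v) to show what each step returns before the steps are chained.

```r
library(data.table)

# Illustrative data: 8 groups, two rows each; group sums are
# 30, 400, 70, 110, 150, 190, 230, 270
DT <- data.table(colA = rep(letters[1:8], each = 2),
                 v    = rep(c(15, 200, 35, 55, 75, 95, 115, 135), each = 2))

step1 <- DT[, sum(v), by = colA]   # one row per colA; the unnamed j result is V1
step2 <- step1[V1 < 300]           # keep the group sums under 300
step3 <- step2[tail(order(V1))]    # the 6 largest of those, in ascending order

# The same three steps concatenated on one line
chained <- DT[, sum(v), by = colA][V1 < 300][tail(order(V1))]
```

Each bracket operates on the data.table returned by the previous one, so the chained form and the step-by-step form should give the same rows.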
Multiple data.tables (e.g. X, Y and Z) can be joined in many ways; e.g.,
X[Y][Z]
X[Z][Y]
X[Y[Z]]
X[Z[Y]]
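As a rough illustration of two of these forms, the sketch below builds three small keyed tables and joins them both ways. The tables, the shared key column id and the value columns are all hypothetical. The key is set explicitly on the intermediate result so that the example does not rely on whether a join result keeps its key, which may vary between versions.

```r
library(data.table)

# Three illustrative tables sharing the key column "id"
X <- data.table(id = 1:3, xval = c("x1", "x2", "x3"), key = "id")
Y <- data.table(id = 1:3, yval = c("y1", "y2", "y3"), key = "id")
Z <- data.table(id = 1:3, zval = c("z1", "z2", "z3"), key = "id")

# Nested form: join Y to Z first, then use that result to look up rows of X
nested <- X[Y[Z]]

# Left-to-right form: join X to Y, then look up that result using Z
XY <- X[Y]
setkey(XY, id)        # explicit key on the intermediate result (see note above)
chained <- XY[Z]
```

With every id present in all three tables, both forms return the same rows and columns; the forms differ once some ids are missing from one table, because each bracket keeps the rows of its i argument.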
A data.table is a list of vectors, just like a data.frame. However :
- it has enhanced functionality in [.data.table for fast joins of keyed tables, fast aggregation, fast last observation carried forward (LOCF) and fast add/modify/delete of columns by reference with no copy at all.
Since a list is a vector, data.table columns may be type list. Columns of type list can contain mixed types. Each item in a column of type list may be different lengths. This is true of data.frame, too.
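A short sketch of a list column, with an invented column name payload, shows items of different types and lengths sitting in one column. The data.frame lines illustrate the "true of data.frame, too" remark.

```r
library(data.table)

# A list column whose items differ in both type and length
DT <- data.table(id = 1:3, payload = list(1:3, "a", c(TRUE, FALSE)))
itemLengths <- sapply(DT$payload, length)
itemClasses <- sapply(DT$payload, class)

# The same column type in a plain data.frame
DF <- data.frame(id = 1:3)
DF$payload <- list(1:3, "a", c(TRUE, FALSE))
```

Each row still holds exactly one item of the list, so the table keeps three rows even though the items have lengths 3, 1 and 2.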
Several methods are provided for data.table, including is.na, na.omit, t, rbind, cbind, merge and others.
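A few of these methods in use, as a minimal sketch on made-up tables (the names A, B, lookup and their columns are assumptions for this example):

```r
library(data.table)

A <- data.table(id = 1:2, v = c(10, NA))
B <- data.table(id = 3L, v = 30)

stacked  <- rbind(A, B)          # 3 rows, still a data.table
complete <- na.omit(stacked)     # drops the row where v is NA

lookup <- data.table(id = c(1L, 3L), label = c("one", "three"))
merged <- merge(complete, lookup, by = "id")   # dispatches to the data.table method
```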
Note
If keep.rownames or check.names are supplied they must be written in full because R does not allow partial argument names after `...`. For example, data.table(DF,keep=TRUE) will create a column called "keep" containing TRUE and this is correct behaviour; data.table(DF,keep.rownames=TRUE) was intended.
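The difference between the abbreviated and the full argument name can be seen directly. In this sketch DF is a made-up two-row data.frame with rownames r1 and r2.

```r
library(data.table)

DF <- data.frame(a = 1:2, row.names = c("r1", "r2"))

partial <- data.table(DF, keep = TRUE)            # "keep" is swallowed by ... and becomes a column
full    <- data.table(DF, keep.rownames = TRUE)   # rownames go to the column "rn"

names(partial)
names(full)
```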
POSIXlt is not supported as a column type because it uses 40 bytes to store a single datetime. Unexpected errors may occur if you manage to create a column of type POSIXlt. Please see IDateTime instead. IDateTime has methods to convert to and from POSIXlt.
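A minimal sketch of storing a datetime as separate integer-based date and time columns follows. The timestamp and the column names d and t are invented for illustration; see ?IDateTime for the authoritative description of these classes.

```r
library(data.table)

stamp <- as.POSIXct("2001-03-04 05:06:07", tz = "UTC")

# Store the date and the time of day in two columns
DT <- data.table(d = as.IDate(stamp), t = as.ITime(stamp))

secondsOfDay <- as.integer(DT$t)   # ITime is held as seconds since midnight
back <- as.POSIXlt(DT$d[1])        # convert back to POSIXlt when a function needs it
```

Keeping POSIXlt out of the table and converting only at the point of use is one way to follow the advice above.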
References
data.table
homepage:
See Also
data.frame, [.data.frame, as.data.table, setkey, J, SJ, CJ, merge.data.table, tables, test.data.table, IDateTime, unique.data.table, copy, :=, alloc.col, truelength, rbindlist
Examples
example(data.table) # to run these examples at the prompt
DF = data.frame(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DF
DT
identical(dim(DT),dim(DF)) # TRUE
identical(DF$a, DT$a) # TRUE
is.list(DF) # TRUE
is.list(DT) # TRUE
is.data.frame(DT) # TRUE
tables()
DT[2] # 2nd row
DT[,v] # v column (as vector)
DT[,list(v)] # v column (as data.table)
DT[2:3,sum(v)] # sum(v) over rows 2 and 3
DT[2:5,cat(v,"")] # just for j's side effect
DT[c(FALSE,TRUE)] # even rows (usual recycling)
DT[,2,with=FALSE] # 2nd column
colNum = 2
DT[,colNum,with=FALSE] # same
setkey(DT,x) # set a 1-column key. No quotes, for convenience.
setkeyv(DT,"x") # same (v in setkeyv stands for vector)
v="x"
setkeyv(DT,v) # same
# key(DT)<-"x" # copies whole table, please use set* functions instead
DT["a"] # binary search (fast)
DT[x=="a"] # vector scan (slow)
DT[,sum(v),by=x] # keyed by
DT[,sum(v),by=key(DT)] # same
DT[,sum(v),by=y] # ad hoc by
DT["a",sum(v)] # j for one group
DT[c("a","b"),sum(v)] # j for two groups
X = data.table(c("b","c"),foo=c(4,2))
X
DT[X] # join
DT[X,sum(v)] # join and eval j for each row in i
DT[X,mult="first"] # first row of each group
DT[X,mult="last"] # last row of each group
DT[X,sum(v)*foo] # join inherited scope
setkey(DT,x,y) # 2-column key
setkeyv(DT,c("x","y")) # same
DT["a"] # join to 1st column of key
DT[J("a")] # same. J() stands for Join, an alias for list()
DT[list("a")] # same
DT[.("a")] # same. In the style of package plyr.
DT[J("a",3)] # join to 2 columns
DT[.("a",3)] # same
DT[J("a",3:6)] # join 4 rows (2 missing)
DT[J("a",3:6),nomatch=0] # remove missing
DT[J("a",3:6),roll=TRUE] # rolling join (locf)
DT[,sum(v),by=list(y%%2)] # by expression
DT[,.SD[2],by=x] # 2nd row of each group
DT[,tail(.SD,2),by=x] # last 2 rows of each group
DT[,lapply(.SD,sum),by=x] # apply through columns by group
DT[,list(MySum=sum(v),
MyMin=min(v),
MyMax=max(v)),
by=list(x,y%%2)] # by 2 expressions
DT[,sum(v),x][V1<20] # compound query
DT[,sum(v),x][order(-V1)] # ordering results
print(DT[,z:=42L]) # add new column by reference
print(DT[,z:=NULL]) # remove column by reference
print(DT["a",v:=42L]) # subassign to existing v column by reference
print(DT["b",v2:=84L]) # subassign to new column by reference (NA padded)
DT[,m:=mean(v),by=x][] # add new column by reference by group
# NB: postfix [] is shortcut to print()
DT[,.SD[which.min(v)],by=x][] # nested query by group
DT[!J("a")] # not join
DT[!"a"] # same
DT[!2:4] # all rows other than 2:4
DT[x!="b" | y!=3] # multiple vector scanning approach, slow
DT[!J("b",3)] # same result but much faster
# Follow r-help posting guide, support is here (*not* r-help) :
# datatable-help@lists.r-forge.r-project.org
# or
# http://stackoverflow.com/questions/tagged/data.table
vignette("datatable-intro")
vignette("datatable-faq")
vignette("datatable-timings")
test.data.table() # over 700 low level tests
update.packages() # keep up to date