data.frame. It offers fast subset, fast grouping, fast update, fast ordered joins and list columns in a short and flexible syntax, for faster development. It is inspired by
A[B] syntax in R where
A is a matrix and
B is a 2-column matrix. Since a
data.table is a data.frame, it is compatible with R functions and packages that only accept
data.frame. The 10 minute quick start guide to
data.table may be a good place to start:
vignette("datatable-intro"). Or, the first section of FAQs is intended to be read from start to finish and is considered core documentation:
vignette("datatable-faq"). If you have read and searched these documents and the help page below, please feel free to ask questions on
bug.report(package="data.table"). Please check the
example(data.table) and study the output at the prompt. *NEW* : help page for
data.table(..., keep.rownames=FALSE, check.names=FALSE, key=NULL)
## S3 method for class 'data.table'
[(x, i, j, by, keyby, with=TRUE,
  nomatch = getOption("datatable.nomatch"),   # default: NA_integer_
  mult = "all", roll = FALSE, rolltolast = FALSE, which = FALSE, .SDcols,
  verbose=getOption("datatable.verbose"),     # default: FALSE
  drop=NULL)
data.frame. Usual recycling rules are applied to vectors of different lengths to create a list of equal length vectors.
TRUE will retain the rownames of that object in a column named
setkey. It may be a single comma separated string such as
key="x,y,z", or a vector of names such as
integer and logical vectors work the same way they do in
[.data.frame. Other than
list() of expressions of column names, an expression or function call that evaluates to
list() of expressions of column names, or a single character string containing comma separated column names, or a character vector of column names.
list() of expressions is evaluated within t
by but with an additional
by columns of the result, for convenience. Not to be confused with a keyed by as defined above.
j is evaluated within the frame of
x. The column names can be used as variables. When
j works as it does in
match. When a row in
i has no match to
NA is returned for
x's non-join colu
x match to the row in
mult controls which are returned:
i's row matches to all but the last
x join column, and its value in the last
roll but the data is not rolled forward past the last observation. The value of
i must fall in a gap in
x but not after the end of the data for that group defined by all but the last join column.
TRUE returns the integer row numbers of
.SD. May be character column names or numeric positions. This is useful for speed when applying a function through a subset of (possibly very many) columns; e.g.,
TRUE turns on status and information messages to the console. Turn this on by default using
options(datatable.verbose=TRUE). The quantity and types of verbosity may be expanded in future.
data.table. Do not use. It needs to be here because
data.table builds on base R functionality to reduce 2 types of time:
It combines database like operations such as
by and provides similar joins that
merge provides but faster. This is achieved by using R's column based ordered in-memory
eval within the environment of a
[.data.table mechanism to condense the features, and compiled C to make certain operations fast.
The package can be used just for rapid programming (compact syntax). Largest compute time benefits are on 64bit platforms with plentiful RAM, or when smaller datasets are repeatedly queried within a loop, or when other methods use so much working memory that they fail with an out of memory error.
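As a minimal sketch of the grouping and keyed join operations described above, the following uses small illustrative tables. The names DT and X and all values are assumptions made for this example only.

```r
library(data.table)

# Illustrative data: names and values are assumptions for this sketch.
DT <- data.table(x = rep(c("a", "b", "c"), each = 3), v = 1:9)
setkey(DT, x)                      # sort by x and mark it as the key

# Grouping: one result row per value of x (unnamed j becomes column V1).
totals <- DT[, sum(v), by = x]

# Join: look up the rows of DT whose key matches the first column of X.
X <- data.table(x = c("b", "c"), foo = c(4, 2))
joined <- DT[X]

print(totals)
print(joined)
```

Setting the key once lets later lookups such as DT[X] use the sorted order instead of scanning the whole column.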
[.data.frame, compound queries can be concatenated on one line; e.g.,
DT[,sum(v),by=colA][V1<300][tail(order(V1))]  # sum(v) by colA then return the 6 largest which are under 300
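The chained query can be unpacked step by step. The sketch below uses a hypothetical table with a colA column and invented values, and keeps the two largest sums rather than six because the example has only four groups.

```r
library(data.table)

# Hypothetical data: column names follow the one-liner, values are invented.
DT <- data.table(colA = rep(c("a", "b", "c", "d"), each = 2),
                 v    = c(100, 50, 200, 150, 10, 20, 120, 90))

agg   <- DT[, sum(v), by = colA]      # step 1: sum(v) by colA, result column is V1
small <- agg[V1 < 300]                # step 2: keep sums under 300
top2  <- small[tail(order(V1), 2)]    # step 3: the 2 largest of those, ascending

# The same three steps concatenated on one line.
chained <- DT[, sum(v), by = colA][V1 < 300][tail(order(V1), 2)]

print(top2)
```

Each pair of brackets operates on the data.table returned by the previous pair, which is why the unnamed result column V1 is available in the second and third steps.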
j expression does not have to return data; e.g.,
# produce a set of plots (likely to pdf) returning no data
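A small sketch of a j expression used only for its side effect. It uses cat() instead of a plotting call so that it runs without a graphics device; the table and its values are illustrative assumptions.

```r
library(data.table)

# Illustrative data: names and values are assumptions for this sketch.
DT <- data.table(x = rep(c("a", "b", "c"), each = 3), v = 1:9)

# j prints one line per group and returns NULL, so no data columns come back.
# capture.output() is used here only so the printed lines can be inspected.
printed <- capture.output(
  res <- DT[, cat(x, sum(v), "\n"), by = x]
)

print(printed)
print(nrow(res))
```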
Z) can be joined in many ways; e.g.,
data.table is a
list of vectors, just like a
data.frame. However :
[.data.table for fast joins of keyed tables, fast aggregation, and fast last observation carried forward (LOCF).
list is a
data.table columns may be type
list. Columns of type
list can contain mixed types. Each item in a column of type
list may be different lengths. This is true of
Several methods are provided for
merge and others.
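The point about list columns made above can be illustrated with a short sketch. The column names id and payload are hypothetical and chosen only for this example.

```r
library(data.table)

# Hypothetical table: the payload column has type list, so each item
# can have its own type and its own length.
DT <- data.table(id = 1:3,
                 payload = list(1:2, "text", c(TRUE, FALSE, NA)))

item_lengths <- lengths(DT$payload)                      # length of each item
item_types   <- vapply(DT$payload, class, character(1))  # class of each item

print(item_lengths)
print(item_types)
```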
example(data.table)  # to run these examples at the prompt

DF = data.frame(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DF
DT
identical(dim(DT),dim(DF)) # TRUE
identical(DF$a, DT$a)      # TRUE
is.list(DF)                # TRUE
is.list(DT)                # TRUE

is.data.frame(DT)          # TRUE

tables()

DT[2]                      # 2nd row
DT[,v]                     # v column (as vector)
DT[,list(v)]               # v column (as data.table)
DT[2:3,sum(v)]             # sum(v) over rows 2 and 3
DT[2:5,cat(v,"")]          # just for j's side effect
DT[c(FALSE,TRUE)]          # even rows (usual recycling)

DT[,2,with=FALSE]          # 2nd column
colNum = 2
DT[,colNum,with=FALSE]     # same

setkey(DT,x)               # set a 1-column key. No quotes, for convenience.
setkeyv(DT,"x")            # same (v in setkeyv stands for vector)
v="x"
setkeyv(DT,v)              # same
# key(DT)<-"x"             # copies whole table, please use set* functions instead

DT["a"]                    # binary search (fast)
DT[x=="a"]                 # vector scan (slow)

DT[,sum(v),by=x]           # keyed by
DT[,sum(v),by=key(DT)]     # same
DT[,sum(v),by=y]           # ad hoc by

DT["a",sum(v)]             # j for one group
DT[c("a","b"),sum(v)]      # j for two groups

X = data.table(c("b","c"),foo=c(4,2))
X

DT[X]                      # join
DT[X,sum(v)]               # join and eval j for each row in i
DT[X,mult="first"]         # first row of each group
DT[X,mult="last"]          # last row of each group
DT[X,sum(v)*foo]           # join inherited scope

J("a",2)                   # J() is alias for data.table()
data.table("a",2)          # same

setkey(DT,x,y)             # 2-column key
setkeyv(DT,c("x","y"))     # same

DT["a"]                    # join to 1st column of key
DT[J("a")]                 # same
DT[J("a",3)]               # join to 2 columns
DT[J("a",3:6)]             # join 4 rows (2 missing)
DT[J("a",3:6),nomatch=0]   # remove missing
DT[J("a",3:6),roll=TRUE]   # rolling join (locf)

DT[,sum(v),by=list(y%%2)]  # by expression
DT[,.SD[2],by=x]           # 2nd row of each group
DT[,tail(.SD,2),by=x]      # last 2 rows of each group
DT[,lapply(.SD,sum),by=x]  # applying through columns by group

DT[,list(MySum=sum(v),
         MyMin=min(v),
         MyMax=max(v)),
    by=list(x,y%%2)]       # by 2 expressions

DT[,sum(v),x][V1<20]       # compound query
DT[,sum(v),x][order(-V1)]  # ordering results

DT[,z:=42L]                # add new column by reference
DT[,z:=NULL]               # remove column
DT["a",v:=42L]             # subassign v by reference

DT[,transform(.SD,m=mean(v)),by=x]
DT[,.SD[which.min(v)],by=x]

# Follow posting guide, support is here (not r-help) :
maintainer("data.table")

vignette("datatable-intro")
vignette("datatable-faq")
vignette("datatable-timings")

test.data.table()          # over 300 low level tests

update.packages()          # keep up to date