data.table inherits from
data.frame. It offers fast subset, fast grouping, fast update, fast ordered joins and list columns in a short and flexible syntax, for faster development. It is inspired by
A[B] syntax in R where
A is a matrix and
B is a 2-column matrix. Since a
data.table is a
data.frame, it is compatible with R functions and packages that only accept data.frame.
The 10 minute quick start guide to
data.table may be a good place to start:
vignette("datatable-intro"). Or, the first section of FAQs is intended to be read from start to finish and is considered core documentation:
vignette("datatable-faq"). If you have read and searched these documents and the help page below, please feel free to ask questions on
Please check the
example(data.table) and study the output at the prompt.
data.table(..., keep.rownames=FALSE, check.names=FALSE, key=NULL)

## S3 method for class 'data.table':
[(x, i, j, by, keyby, with = TRUE,
  nomatch = getOption("datatable.nomatch"),                   # default: NA_integer_
  mult = "all",
  roll = FALSE,
  rollends = if (roll=="nearest") c(TRUE,TRUE)
             else if (roll>=0) c(FALSE,TRUE)
             else c(TRUE,FALSE),
  which = FALSE,
  .SDcols,
  verbose = getOption("datatable.verbose"),                   # default: FALSE
  allow.cartesian = getOption("datatable.allow.cartesian"),   # default: FALSE
  drop = NULL,
  rolltolast = FALSE   # deprecated
  )
- Just as
data.frame. Usual recycling rules are applied to vectors of different lengths to create a list of equal length vectors.
TRUE will retain the rownames of that object in a column named
- Just as
- Character vector of one or more column names which is passed to
setkey. It may be a single comma separated string such as
key="x,y,z", or a vector of names such as
- Integer, logical or character vector, expression of column names,
data.table. Integer and logical vectors work the same way they do in
[.data.frame. Other than
- A single column name, single expression of column names,
list() of expressions of column names, an expression or function call that evaluates to
- A single unquoted column name, a
list() of expressions of column names, a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end), or a char
- An ad hoc by just as
by but with an additional
by columns of the result, for convenience. Not to be confused with a keyed by as defined above.
- By default
j is evaluated within the frame of
x. The column names can be used as variables. When
j is a vector of names or positions to select.
- Same as
match. When a row in
i has no match to
NA is returned for
x's non-join colu
- When multiple rows in
x match to the row in
mult controls which are returned:
- Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If
i's row matches to all but the last
x join column, and its value in the last
- A logical vector length 2 (a single logical is recycled). When rolling forward (e.g.
roll=TRUE) if a value is past the last observation within each group defined by the join columns,
rollends=TRUE will roll the last v
TRUE returns the row numbers of
NA returns the row numbers of
i that have no match in
x. By default
FALSE and the rows in
- Advanced. Specifies the columns of
.SD. May be character column names or numeric positions. This is useful for speed when applying a function through a subset of (possibly very many) columns; e.g.,
TRUE turns on status and information messages to the console. Turn this on by default using
options(datatable.verbose=TRUE). The quantity and types of verbosity may be expanded in future.
FALSE prevents joins that would result in more than
max(nrow(x),nrow(i)) rows. This is usually caused by duplicate values in
i's join columns, each of which join to the same group in `x` over and over again: a mi
- Never used by
data.table. Do not use. It needs to be here because
- Deprecated. Setting
rolltolast=TRUE is converted to
roll=TRUE; rollends=FALSE for backwards compatibility.
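Several of the arguments described above are easier to follow on a small table. The following is a minimal sketch, not part of the original help page: the table DT and its columns id, day and v are hypothetical, and the comments describe the behaviour expected from the argument descriptions (ad hoc by versus keyby, with=FALSE, and roll together with rollends).

```r
library(data.table)

# Hypothetical toy data: ids, observation days, and a value column
DT <- data.table(id  = c("b", "a", "b", "a", "c"),
                 day = c(1L, 1L, 3L, 6L, 2L),
                 v   = 1:5)

# by vs keyby: same aggregation; keyby additionally keys (sorts) the result
res_by    <- DT[, sum(v), by = id]      # groups in order of first appearance
res_keyby <- DT[, sum(v), keyby = id]   # groups sorted; key(res_keyby) is "id"

# with = FALSE: j is treated as column names or positions, not an expression
cols   <- c("id", "v")
picked <- DT[, cols, with = FALSE]

# roll / rollends: join on (id, day), rolling on the last join column
setkey(DT, id, day)
exact  <- DT[.("a", 4L)]                                 # no row for day 4: v is NA
locf   <- DT[.("a", 4L), roll = TRUE]                    # day 1's value carried forward
past   <- DT[.("a", 9L), roll = TRUE]                    # past the last observation: rolled
capped <- DT[.("a", 9L), roll = TRUE, rollends = FALSE]  # past the last observation: NA
```

The last line is the modern spelling of the deprecated rolltolast=TRUE: values roll forward between observations but not beyond the final one in each group.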
data.table builds on base R functionality to reduce two types of time:
- programming time (easier to write, read, debug and maintain)
- compute time
It combines database like operations such as
by and provides similar joins that
merge provides but faster. This is achieved by using R's column based ordered in-memory
eval within the environment of a
[.data.table mechanism to condense the features, and compiled C to make certain operations fast.
The package can be used just for rapid programming (compact syntax). Largest compute time benefits are on 64bit platforms with plentiful RAM, or when smaller datasets are repeatedly queried within a loop, or when other methods use so much working memory that they fail with an out of memory error.
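As an illustration of the join syntax mentioned above, here is a small sketch comparing a keyed X[Y] join with the corresponding merge call. The tables X and Y and their columns are hypothetical; the comparison assumes the default behaviour in which rows of Y without a match in X are kept with NA in X's columns.

```r
library(data.table)

# Hypothetical keyed tables sharing the join column id
X <- data.table(id = c("a", "b", "c"), v = 1:3, key = "id")
Y <- data.table(id = c("b", "c", "d"), foo = c(4, 2, 9), key = "id")

# Keyed join: one result row per row of Y, looked up in X by binary search
joined <- X[Y]

# The same rows via merge(), keeping all rows of Y
merged <- merge(X, Y, by = "id", all.y = TRUE)
```

Both results contain ids b, c and d, with v set to NA for d because it has no match in X.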
[.data.frame, compound queries can be concatenated on one line; e.g.,
    DT[,sum(v),by=colA][V1<300][tail(order(V1))]
    # sum(v) by colA then return the 6 largest which are under 300
j expression does not have to return data; e.g.,
    DT[,plot(colB,colC),by=colA]  # produce a set of plots (likely to pdf) returning no data
Multiple
Z) can be joined in many ways; e.g.,
    X[Y][Z]
    X[Z][Y]
    X[Y[Z]]
    X[Z[Y]]
A
list of vectors, just like a
data.frame. However :
- it never has rownames. Instead it may have one key of one or more columns. This key can be used for row indexing instead of rownames.
- it has enhanced functionality in
[.data.table for fast joins of keyed tables, fast aggregation, fast last observation carried forward (LOCF) and fast add/modify/delete of columns by reference with no copy at all.
data.table columns may be type
list. Columns of type
list can contain mixed types. Each item in a column of type
list may be different lengths. This is true of
Several methods are provided for
check.names are supplied they must be written in full because R does not allow partial argument names after `
...`. For example,
data.table(DF,keep=TRUE) will create a
TRUE and this is correct behaviour;
data.table(DF,keep.rownames=TRUE) was intended.
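The difference can be seen on a small data.frame. This is a minimal sketch with a hypothetical DF; the column names in the comments are what the description above implies.

```r
library(data.table)

# Hypothetical data.frame with explicit rownames
DF <- data.frame(v = 1:3, row.names = c("r1", "r2", "r3"))

# Argument written in full: rownames are retained in a column named rn
DT1 <- data.table(DF, keep.rownames = TRUE)
names(DT1)   # "rn" "v"

# Abbreviated name: 'keep' is not matched to keep.rownames after '...',
# so it is treated as a new column called keep, recycled to TRUE
DT2 <- data.table(DF, keep = TRUE)
names(DT2)   # "v" "keep"
```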
POSIXlt is not supported as a column type because it uses 40 bytes to store a single datetime. Unexpected errors may occur if you manage to create a column of type POSIXlt. Please see
IDateTime instead. IDateTime has methods to convert to and from POSIXlt.
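A minimal sketch of storing a datetime as separate integer-based date and time columns instead of a POSIXlt column. The timestamp is a hypothetical value; as.IDate, as.ITime and IDateTime are the conversion functions documented on the IDateTime help page.

```r
library(data.table)

# Hypothetical timestamp held as POSIXlt (not usable as a column type)
stamp <- as.POSIXlt("2012-03-04 05:06:07", tz = "UTC")

# Split into an integer date and an integer time of day
d  <- as.IDate(stamp)
tm <- as.ITime(stamp)

# Store the two parts as ordinary columns
DT <- data.table(id = 1L, date = d, time = tm)

# IDateTime() does the split in one step, returning columns idate and itime
parts <- IDateTime(stamp)
```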
example(data.table)        # to run these examples at the prompt

DF = data.frame(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DF
DT
identical(dim(DT),dim(DF)) # TRUE
identical(DF$a, DT$a)      # TRUE
is.list(DF)                # TRUE
is.list(DT)                # TRUE

is.data.frame(DT)          # TRUE

tables()

DT[2]                      # 2nd row
DT[,v]                     # v column (as vector)
DT[,list(v)]               # v column (as data.table)
DT[2:3,sum(v)]             # sum(v) over rows 2 and 3
DT[2:5,cat(v,"")]          # just for j's side effect
DT[c(FALSE,TRUE)]          # even rows (usual recycling)

DT[,2,with=FALSE]          # 2nd column
colNum = 2
DT[,colNum,with=FALSE]     # same

setkey(DT,x)               # set a 1-column key. No quotes, for convenience.
setkeyv(DT,"x")            # same (v in setkeyv stands for vector)
v="x"
setkeyv(DT,v)              # same
# key(DT)<-"x"             # copies whole table, please use set* functions instead

DT["a"]                    # binary search (fast)
DT[x=="a"]                 # vector scan (slow)

DT[,sum(v),by=x]           # keyed by
DT[,sum(v),by=key(DT)]     # same
DT[,sum(v),by=y]           # ad hoc by

DT["a",sum(v)]             # j for one group
DT[c("a","b"),sum(v)]      # j for two groups

X = data.table(c("b","c"),foo=c(4,2))
X

DT[X]                      # join
DT[X,sum(v)]               # join and eval j for each row in i
DT[X,mult="first"]         # first row of each group
DT[X,mult="last"]          # last row of each group
DT[X,sum(v)*foo]           # join inherited scope

setkey(DT,x,y)             # 2-column key
setkeyv(DT,c("x","y"))     # same

DT["a"]                    # join to 1st column of key
DT[J("a")]                 # same. J() stands for Join, an alias for list()
DT[list("a")]              # same
DT[.("a")]                 # same. In the style of package plyr.
DT[J("a",3)]               # join to 2 columns
DT[.("a",3)]               # same
DT[J("a",3:6)]             # join 4 rows (2 missing)
DT[J("a",3:6),nomatch=0]   # remove missing
DT[J("a",3:6),roll=TRUE]   # rolling join (locf)

DT[,sum(v),by=list(y%%2)]  # by expression
DT[,.SD[2],by=x]           # 2nd row of each group
DT[,tail(.SD,2),by=x]      # last 2 rows of each group
DT[,lapply(.SD,sum),by=x]  # apply through columns by group

DT[,list(MySum=sum(v),
         MyMin=min(v),
         MyMax=max(v)),
    by=list(x,y%%2)]       # by 2 expressions

DT[,sum(v),x][V1<20]       # compound query
DT[,sum(v),x][order(-V1)]  # ordering results

print(DT[,z:=42L])         # add new column by reference
print(DT[,z:=NULL])        # remove column by reference
print(DT["a",v:=42L])      # subassign to existing v column by reference
print(DT["b",v2:=84L])     # subassign to new column by reference (NA padded)

DT[,m:=mean(v),by=x][]     # add new column by reference by group
                           # NB: postfix [] is shortcut to print()

DT[,.SD[which.min(v)],by=x][]  # nested query by group

DT[!J("a")]                # not join
DT[!"a"]                   # same
DT[!2:4]                   # all rows other than 2:4
DT[x!="b" | y!=3]          # multiple vector scanning approach, slow
DT[!J("b",3)]              # same result but much faster

# Follow r-help posting guide, support is here (*not* r-help) :
# firstname.lastname@example.org
# or
# http://stackoverflow.com/questions/tagged/data.table

vignette("datatable-intro")
vignette("datatable-faq")
vignette("datatable-timings")

test.data.table()          # over 700 low level tests

update.packages()          # keep up to date