data.table (version 1.6.2)

data.table: Enhanced data.frame

Description

data.table inherits from data.frame. It offers fast subset, fast grouping and fast ordered joins in a short and flexible syntax, for faster development. It was inspired by A[B] syntax in R where A is a matrix and B is a 2-column matrix. Since a data.table is a data.frame it is compatible with R functions and packages that only accept data.frame. The 10 minute quick start guide to data.table may be a good place to start; type vignette("datatable-intro").
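
The A[B] analogy can be sketched as follows. Base R looks up one matrix element per row of a 2-column matrix B; a keyed data.table looks up one row per row of i in a similar way. The object and column names (A, B, lookup, r, k, v) are illustrative assumptions, not part of the package.

```r
library(data.table)

# Base R: each row of the 2-column matrix B names one (row, column) of A
A <- matrix(1:9, nrow = 3,
            dimnames = list(c("a", "b", "c"), c("x", "y", "z")))
B <- cbind(c("a", "c"), c("y", "z"))
from_matrix <- A[B]          # elements A["a","y"] and A["c","z"]

# data.table: the same lookups against a table keyed on two columns
lookup <- data.table(r = c("a", "a", "b", "c"),
                     k = c("y", "z", "y", "z"),
                     v = c(4L, 7L, 5L, 9L))
setkey(lookup, r, k)
from_table <- lookup[J(c("a", "c"), c("y", "z"))]$v
```

Both lookups pick the same two values here, because the hypothetical lookup table stores the corresponding cells of A.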

Usage

data.table(..., keep.rownames=FALSE, check.names=TRUE, key=NULL)

## S3 method for class 'data.table':
[(x, i, j, by=NULL, with=TRUE, nomatch = NA, mult = "all", roll = FALSE, rolltolast = FALSE, which = FALSE, bysameorder = FALSE, verbose=getOption("datatable.verbose",FALSE), drop=NULL)

Arguments

...
Just as ... in data.frame. Usual recycling rules are applied to vectors of different lengths to create a list of equal length vectors.
keep.rownames
If ... is a matrix or data.frame, TRUE will retain the rownames of that object in a column named rn.
check.names
Just as check.names in data.frame.
key
Character vector of length 1 containing one or more column names separated by commas, which is passed to setkey.
x
A data.table.
i
Integer, logical or character vector, expression of column names, or data.table.

integer and logical vectors work the same way they do in [.data.frame. Other than NAs in lo

j
list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are lists, too), or (when with=FALSE) sam
by
list() of expressions of column names, or a single character string containing comma separated column names, or a character vector of column names.

The list() of expressions is evaluated within the frame of the data.table (i.e

with
By default with=TRUE and j is evaluated within the frame of x; the column names can be used as variables. When with=FALSE, j works as it does in [.data.frame.
nomatch
Same as nomatch in match. When a row in i has no match to x's key, nomatch=NA means NA is returned for x's non-join columns for th
mult
When multiple rows in x match to the row in i, mult controls which are returned: "all" (default), "first" or "last".
roll
Applies to the last column of x's key, which is generally a date but can be any ordered variable, with gaps. When roll=TRUE if i's row matches to all but the last column of x's key, and the value of the
rolltolast
Like roll but the data is not rolled forward past the last observation. The value of i must fall in a gap in x but not after the end of the data for that group defined by the length(key(x))-1 co
which
TRUE returns the integer row numbers of x that i matches to.
bysameorder
Advanced. TRUE tells [.data.table that the expressions in by are an order preserving map of x, allowing some efficiency gains. In most situations bysameorder is set to TRUE inte
verbose
TRUE turns on status and information messages to the console. Turn this on by default using options(datatable.verbose=TRUE). The quantity and types of verbosity may be expanded in future.
drop
Never used by data.table. Do not use. It needs to be here because data.table inherits from data.frame. See vignette("datatable-faq").
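
A minimal sketch of how several of the arguments above interact, assuming a release in which they behave as described on this page. The table and its columns (x, y, v) are hypothetical, and details may differ between versions.

```r
library(data.table)

DT <- data.table(x = c("a", "a", "b", "c"), y = c(1, 2, 1, 2), v = 1:4)
setkey(DT, x)

# by: three ways to name the same grouping columns
by_list   <- DT[, sum(v), by = list(x, y)]
by_string <- DT[, sum(v), by = "x,y"]
by_vector <- DT[, sum(v), by = c("x", "y")]

# with = FALSE: j selects columns as it does in [.data.frame
cols <- DT[, c("x", "v"), with = FALSE]

# mult: which of several matching rows to return
first_a <- DT[J("a"), mult = "first"]
last_a  <- DT[J("a"), mult = "last"]

# nomatch: "z" is not in the key; it yields an NA row unless nomatch = 0
with_na    <- DT[J(c("a", "z"))]
without_na <- DT[J(c("a", "z")), nomatch = 0]

# which: return row numbers of x rather than data
row_b <- DT[J("b"), which = TRUE]
```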

Details

data.table builds on base R functionality to reduce two types of time:
  1. programming time (easier to write, read, debug and maintain)
  2. compute time

It combines database-like operations such as subset, with and by, and provides joins similar to those of merge, but faster. This is achieved by using R's column-based, ordered, in-memory data.frame structure, eval within the environment of a list, the [.data.table mechanism to condense the features, and compiled C to make certain operations fast.
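
As a rough illustration of the operations named above, the sketch below runs a subset with an expression in j, then the same join written as DT[X] and as merge. The two small tables are hypothetical and both are keyed on x.

```r
library(data.table)

DT <- data.table(x = c("a", "b", "c"), v = 1:3)
X  <- data.table(x = c("b", "c"), foo = c(4, 2))
setkey(DT, x)
setkey(X, x)

# subset rows in i, evaluate an expression of column names in j
total <- DT[v > 1, sum(v)]

# join: rows of DT matching the key of X, with X's columns attached
joined <- DT[X]

# the same rows obtained through the merge method
merged <- merge(DT, X)
```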

The package can be used just for rapid programming (compact syntax). The largest compute time benefits are on 64-bit platforms with plentiful RAM, or when smaller datasets are repeatedly queried within a loop, or when other methods use so much working memory that they fail with an out of memory error.

As with [.data.frame, compound queries can be concatenated on one line, e.g.

  DT[,sum(v),by=colA][V1<300][tail(order(V1))]
  # sum(v) by colA then return the 6 largest which are under 300

The j expression does not have to return data, e.g.

  DT[,plot(colB,colC),by=colA]
  # produce a set of plots (likely to pdf) returning no data

Multiple data.tables (e.g. X, Y and Z) can be joined in many ways, e.g.:

  X[Y][Z]
  X[Z][Y]
  X[Y[Z]]
  X[Z[Y]]

A data.table is a list of vectors, just like a data.frame. However:

  1. it never has rownames. Instead it may have one key of one or more columns. This key can be used for row indexing instead of rownames.
  2. it has enhanced functionality in [.data.table for fast joins of keyed tables, fast aggregation, and fast last observation carried forward (LOCF).
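
Last observation carried forward can be sketched with a rolling join on a 2-column key. The price series below is hypothetical, with gaps on days 2, 3 and 5; roll applies to the last key column (day).

```r
library(data.table)

prices <- data.table(id = "a",
                     day = c(1L, 4L, 6L),
                     price = c(10, 11, 12))
setkey(prices, id, day)

# day 3 is not in the data: an exact join returns NA for price
exact <- prices[J("a", 3L)]

# roll = TRUE carries the prevailing observation (day 1) forward
rolled_3 <- prices[J("a", 3L), roll = TRUE]

# day 5 falls in the gap after day 4, so day 4's price is used
rolled_5 <- prices[J("a", 5L), roll = TRUE]
```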

Since a list is a vector, data.table columns may be type list. Columns of type list can contain mixed types. Each item in a column of type list may be different lengths. This is true of data.frame, too.
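
A small sketch of a column of type list, assuming the constructor accepts a list whose length equals the number of rows. The column names id and payload are illustrative.

```r
library(data.table)

DT <- data.table(id = 1:3,
                 payload = list(1:2, "text", c(TRUE, FALSE, NA)))

# each item keeps its own type and length
item_lengths <- sapply(DT$payload, length)
item_classes <- sapply(DT$payload, class)
```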

Several methods are provided for data.table, including is.na, na.omit, t, rbind, cbind, merge and others.
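
For instance, two of these methods can be combined as below. The tables A and B are hypothetical; the sketch assumes rbind and na.omit return a data.table, as in current releases.

```r
library(data.table)

A <- data.table(x = c("a", "b"), v = c(1, NA))
B <- data.table(x = "c", v = 3)

AB <- rbind(A, B)          # stack the rows of both tables
complete <- na.omit(AB)    # drop the row where v is NA
```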

References

data.table homepage: http://datatable.r-forge.r-project.org/
User reviews: http://crantastic.org/packages/data-table
http://en.wikipedia.org/wiki/Binary_search
http://en.wikipedia.org/wiki/Radix_sort

See Also

data.frame, [.data.frame, as.data.table, setkey, J, SJ, CJ, merge.data.table, tables, test.data.table, IDateTime

Examples

example(data.table)  # to run these examples at the prompt

DF = data.frame(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DF
DT
identical(dim(DT),dim(DF)) # TRUE
identical(DF$a, DT$a)      # TRUE (both NULL: neither has a column a)
is.list(DF)                # TRUE
is.list(DT)                # TRUE

is.data.frame(DT)          # TRUE

tables()

DT[2]                      # 2nd row
DT[,v]                     # v column (as vector)
DT[,list(v)]               # v column (as data.table)
DT[2:3,sum(v)]             # sum(v) over rows 2 and 3
DT[2:5,cat(v,"\n")]        # just for j's side effect
DT[c(FALSE,TRUE)]          # even rows (usual recycling)

DT[,2,with=FALSE]          # 2nd column
colNum = 2
DT[,colNum,with=FALSE]     # same

setkey(DT,x)               # set a 1-column key
key(DT) = "x"              # same

DT["a"]                    # binary search (fast)
DT[x=="a"]                 # vector scan (slow)

DT[,sum(v),by=x]           # keyed by
DT[,sum(v),by=key(DT)]     # same
DT[,sum(v),by=y]           # ad hoc by

DT["a",sum(v)]             # j for one group
DT[c("a","b"),sum(v)]      # j for two groups

X = data.table(c("b","c"),foo=c(4,2))
X

DT[X]                      # join
DT[X,sum(v)]               # join and eval j for each row in i
DT[X,mult="first"]         # first row of each group
DT[X,mult="last"]          # last row of each group
DT[X,sum(v)*foo]           # join inherited scope

J("a",2)                   # J() is alias for data.table()
data.table("a",2)          # same

setkey(DT,x,y)             # 2-column key
key(DT) = c("x","y")       # same

DT["a"]                    # join to 1st column of key
DT[J("a")]                 # same
DT[J("a",3)]               # join to 2 columns
DT[J("a",3:6)]             # join 4 rows (2 missing)
DT[J("a",3:6),nomatch=0]   # remove missing
DT[J("a",3:6),roll=TRUE]   # rolling join (locf)

DT[,sum(v),by=list(y%%2)]  # by expression
DT[,.SD[2],by=x]           # 2nd row of each group
DT[,tail(.SD,2),by=x]      # last 2 rows of each group
DT[,lapply(.SD,sum),by=x]  # applying through columns by group

DT[,list(MySum=sum(v),
         MyMin=min(v),
         MyMax=max(v)),
    by=list(x,y%%2)]       # by 2 expressions

DT[,sum(v),x][V1<20]       # compound query
DT[,sum(v),x][order(-V1)]  # ordering results

DT[,transform(.SD,m=mean(v)),by=x] 
DT[,.SD[which.min(v)],by=x]

# Follow posting guide, support is here (not r-help) :
maintainer("data.table")

vignette("datatable-intro")
vignette("datatable-faq")
vignette("datatable-timings")

test.data.table()          # over 200 low level tests

update.packages()          # keep up to date
