data.table: Enhanced data.frame

Description

data.table inherits from data.frame. It offers fast subset, fast grouping, fast update, fast ordered joins and list columns in a short and flexible syntax, for faster development. It is inspired by A[B] syntax in R where A is a matrix and B is a 2-column matrix. Since a data.table is a data.frame, it is compatible with R functions and packages that only accept data.frame.

The 10 minute quick start guide to data.table may be a good place to start: vignette("datatable-intro"). Or, the first section of FAQs is intended to be read from start to finish and is considered core documentation: vignette("datatable-faq"). If you have read and searched these documents and the help page below, please feel free to ask questions on datatable-help (http://r.789695.n4.nabble.com/datatable-help-f2315188.html) or the Stack Overflow data.table tag (http://stackoverflow.com/questions/tagged/data.table). To report a bug please type: bug.report(package="data.table"). Please check the homepage (http://datatable.r-forge.r-project.org/) for up to the minute news (http://r-forge.r-project.org/scm/viewvc.php/pkg/NEWS?view=markup&root=datatable).

Tip: one of the quickest ways to learn the features is to type example(data.table) and study the output at the prompt.

*NEW*:

help page for :=
keyby argument
character and numeric now allowed as key column types
:= by group

Usage

data.table(..., keep.rownames=FALSE, check.names=FALSE, key=NULL)
## S3 method for class 'data.table':
[(x, i, j, by, keyby, with=TRUE,
  nomatch = getOption("datatable.nomatch"),   # default: NA_integer_
  mult = "all",
  roll = FALSE, rollends = if (roll=="nearest") c(TRUE,TRUE) else { if (roll>=0) c(FALSE,TRUE) else c(TRUE,FALSE) },
  which = FALSE,
  .SDcols,
  verbose=getOption("datatable.verbose"),     # default: FALSE
  allow.cartesian=getOption("datatable.allow.cartesian"),   # default: FALSE
  drop=NULL,
  rolltolast = FALSE   # deprecated
  )

Arguments

...

Just as ... in data.frame. Usual recycling rules are applied to vectors of different lengths to create a list of equal length vectors.

keep.rownames

If ... is a matrix or data.frame, TRUE will retain the rownames of that object in a column named rn.
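
As a minimal sketch of keep.rownames (the data and the object names DF, DT_keep and DT_drop are illustrative assumptions, not part of the package), a data.frame with explicit row names can be passed with and without the argument:

```r
library(data.table)

# A small data.frame with explicit row names (illustrative data only)
DF = data.frame(a = 1:2, b = c("u", "v"), row.names = c("r1", "r2"))

DT_keep = data.table(DF, keep.rownames = TRUE)   # row names retained in a column named rn
DT_drop = data.table(DF)                         # default: row names are not retained

names(DT_keep)   # includes "rn" as the first column
names(DT_drop)   # only the original columns
DT_keep$rn       # the former row names, as a character vector
```

Since a data.table never has rownames, this is the usual way to carry them over as ordinary data.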

check.names

Just as check.names in data.frame.

key

Character vector of one or more column names which is passed to setkey. It may be a single comma separated string such as key="x,y,z", or a vector of names such as key=c("x","y","z")
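
The two spellings of key described above should give the same result. The following is a small sketch with made-up data (DT1 and DT2 are illustrative names):

```r
library(data.table)

# Same data, key given as one comma separated string ...
DT1 = data.table(x = c("b", "a", "b"), y = c(2L, 1L, 1L), v = 1:3, key = "x,y")
# ... and as a character vector of column names
DT2 = data.table(x = c("b", "a", "b"), y = c(2L, 1L, 1L), v = 1:3, key = c("x", "y"))

key(DT1)                          # the key columns, in order
identical(key(DT1), key(DT2))     # both forms set the same key
DT1                               # rows are sorted by x, then y
```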

x

A data.table.

i

Integer, logical or character vector, expression of column names, list or data.table.

integer and logical vectors work the same way they do in [.data.frame. Other than

j

A single column name, single expression of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are l

by

A single unquoted column name, a list() of expressions of column names, a single character string containing comma separated column names, or a character vector of column names.

The list() of expressions is evaluated within th

keyby

An ad hoc by just as by but with an additional setkey() on the by columns of the result, for convenience. Not to be confused with a keyed by as defined above.
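
The difference between by and keyby can be seen on a small unkeyed table. This is only a sketch with illustrative data and names (res_by, res_keyby):

```r
library(data.table)

DT = data.table(x = c("b", "a", "b", "a"), v = 1:4)   # no key set

res_by    = DT[, sum(v), by = x]      # groups in order of first appearance, no key
res_keyby = DT[, sum(v), keyby = x]   # same grouping, result additionally keyed by x

key(res_by)      # NULL
key(res_keyby)   # "x"
res_keyby        # rows ordered by x because of the setkey() on the result
```

keyby is a convenience only; calling setkey() on the result of an ordinary by should be equivalent.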

with

By default with=TRUE and j is evaluated within the frame of x. The column names can be used as variables. When with=FALSE, j works as it does in [.data.frame.

nomatch

Same as nomatch in match. When a row in i has no match to x's key, nomatch=NA (default) means NA is returned for x's non-join colu

mult

When multiple rows in x match to the row in i, mult controls which are returned: "all" (default), "first" or "last".

roll

Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i's row matches to all but the last x join column, and its value in the last i

rollends

A logical vector length 2 (a single logical is recycled). When rolling forward (e.g. roll=TRUE) if a value is past the last observation within each group defined by the join columns, rollends[2]=TRUE will roll the last v

which

TRUE returns the row numbers of x that i matches to. NA returns the row numbers of i that have no match in x. By default FALSE and the rows in x that m

.SDcols

Advanced. Specifies the columns of x included in .SD. May be character column names or numeric positions. This is useful for speed when applying a function through a subset of (possible very many) columns; e.g., DT[,lapply(

verbose

TRUE turns on status and information messages to the console. Turn this on by default using options(datatable.verbose=TRUE). The quantity and types of verbosity may be expanded in future.

allow.cartesian

FALSE prevents joins that would result in more than max(nrow(x),nrow(i)) rows. This is usually caused by duplicate values in i's join columns, each of which join to the same group in `x` over and over again: a mi

drop

Never used by data.table. Do not use. It needs to be here because data.table inherits from data.frame. See vignette("datatable-faq").

rolltolast

Deprecated. Setting rolltolast=TRUE is converted to roll=TRUE;rollends=FALSE for backwards compatibility.

Details

data.table builds on base R functionality to reduce two types of time:
programming time (easier to write, read, debug and maintain)
compute time
It combines database-like operations such as subset, with and by, and provides joins similar to those of merge, but faster. This is achieved by using R's column-based ordered in-memory data.frame structure, eval within the environment of a list, the [.data.table mechanism to condense the features, and compiled C to make certain operations fast.
The package can be used just for rapid programming (compact syntax). The largest compute time benefits are on 64-bit platforms with plentiful RAM, or when smaller datasets are repeatedly queried within a loop, or when other methods use so much working memory that they fail with an out of memory error.
As with [.data.frame, compound queries can be concatenated on one line; e.g.,
    DT[,sum(v),by=colA][V1<300][tail(order(V1))]
    # sum(v) by colA then return the 6 largest which are under 300
The j expression does not have to return data; e.g.,
    DT[,plot(colB,colC),by=colA]
    # produce a set of plots (likely to pdf) returning no data
Multiple data.tables (e.g. X, Y and Z) can be joined in many ways; e.g.,
    X[Y][Z]
    X[Z][Y]
    X[Y[Z]]
    X[Z[Y]]
A data.table is a list of vectors, just like a data.frame. However :
it never has rownames. Instead it may have one key of one or more columns. This key can be used for row indexing instead of rownames.
it has enhanced functionality in [.data.table for fast joins of keyed tables, fast aggregation, fast last observation carried forward (LOCF) and fast add/modify/delete of columns by reference with no copy at all.
Since a list is a vector, data.table columns may be type list. Columns of type list can contain mixed types. Each item in a column of type list may be different lengths. This is true of data.frame, too.
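
A column of type list might look like the following sketch. The column names id and payload and their contents are assumptions made for illustration:

```r
library(data.table)

# payload is a list column: one item per row, of mixed type and length
DT = data.table(id = 1:3,
                payload = list(1:2, "a", c(TRUE, FALSE, NA)))

is.list(DT$payload)                              # the column itself is a list
item_types   = vapply(DT$payload, class, "")     # type of each item
item_lengths = vapply(DT$payload, length, 0L)    # length of each item
item_types
item_lengths
```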
Several methods are provided for data.table, including is.na, na.omit, t, rbind, cbind, merge and others.

References

data.table homepage: http://datatable.r-forge.r-project.org/
User reviews: http://crantastic.org/packages/data-table
http://en.wikipedia.org/wiki/Binary_search
http://en.wikipedia.org/wiki/Radix_sort

See Also

data.frame, [.data.frame, as.data.table, setkey, J, SJ, CJ, merge.data.table, tables, test.data.table, IDateTime, unique.data.table, copy, :=, alloc.col, truelength, rbindlist

Examples

example(data.table)  # to run these examples at the prompt

DF = data.frame(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DF
DT
identical(dim(DT),dim(DF)) # TRUE
identical(DF$v, DT$v)      # TRUE
is.list(DF)                # TRUE
is.list(DT)                # TRUE

is.data.frame(DT)          # TRUE

tables()

DT[2]                      # 2nd row
DT[,v]                     # v column (as vector)
DT[,list(v)]               # v column (as data.table)
DT[2:3,sum(v)]             # sum(v) over rows 2 and 3
DT[2:5,cat(v,"\n")]        # just for j's side effect
DT[c(FALSE,TRUE)]          # even rows (usual recycling)

DT[,2,with=FALSE]          # 2nd column
colNum = 2
DT[,colNum,with=FALSE]     # same

setkey(DT,x)               # set a 1-column key. No quotes, for convenience.
setkeyv(DT,"x")            # same (v in setkeyv stands for vector)
v="x"
setkeyv(DT,v)              # same
# key(DT)<-"x"             # copies whole table, please use set* functions instead

DT["a"]                    # binary search (fast)
DT[x=="a"]                 # vector scan (slow)

DT[,sum(v),by=x]           # keyed by
DT[,sum(v),by=key(DT)]     # same
DT[,sum(v),by=y]           # ad hoc by

DT["a",sum(v)]             # j for one group
DT[c("a","b"),sum(v)]      # j for two groups

X = data.table(c("b","c"),foo=c(4,2))
X

DT[X]                      # join
DT[X,sum(v)]               # join and eval j for each row in i
DT[X,mult="first"]         # first row of each group
DT[X,mult="last"]          # last row of each group
DT[X,sum(v)*foo]           # join inherited scope

setkey(DT,x,y)             # 2-column key
setkeyv(DT,c("x","y"))     # same

DT["a"]                    # join to 1st column of key
DT[J("a")]                 # same. J() stands for Join, an alias for list()
DT[list("a")]              # same
DT[.("a")]                 # same. In the style of package plyr.
DT[J("a",3)]               # join to 2 columns
DT[.("a",3)]               # same
DT[J("a",3:6)]             # join 4 rows (2 missing)
DT[J("a",3:6),nomatch=0]   # remove missing
DT[J("a",3:6),roll=TRUE]   # rolling join (locf)

DT[,sum(v),by=list(y%%2)]  # by expression
DT[,.SD[2],by=x]           # 2nd row of each group
DT[,tail(.SD,2),by=x]      # last 2 rows of each group
DT[,lapply(.SD,sum),by=x]  # apply through columns by group

DT[,list(MySum=sum(v),
         MyMin=min(v),
         MyMax=max(v)),
    by=list(x,y%%2)]       # by 2 expressions

DT[,sum(v),x][V1<20]       # compound query
DT[,sum(v),x][order(-V1)]  # ordering results

print(DT[,z:=42L])         # add new column by reference
print(DT[,z:=NULL])        # remove column by reference
print(DT["a",v:=42L])      # subassign to existing v column by reference
print(DT["b",v2:=84L])     # subassign to new column by reference (NA padded)

DT[,m:=mean(v),by=x][]     # add new column by reference by group
                           # NB: postfix [] is shortcut to print()

DT[,.SD[which.min(v)],by=x][]  # nested query by group

DT[!J("a")]                # not join
DT[!"a"]                   # same
DT[!2:4]                   # all rows other than 2:4
DT[x!="b" | y!=3]          # multiple vector scanning approach, slow
DT[!J("b",3)]              # same result but much faster


# Follow r-help posting guide, support is here (*not* r-help) :
# datatable-help@lists.r-forge.r-project.org
# or
# http://stackoverflow.com/questions/tagged/data.table

vignette("datatable-intro")
vignette("datatable-faq")
vignette("datatable-timings")

test.data.table()          # over 700 low level tests

update.packages()          # keep up to date