data.table-package: Enhanced data.frame

Description

data.table inherits from data.frame. It offers fast subset, fast grouping, fast update, fast ordered joins and list columns in a short and flexible syntax, for faster development. It is inspired by A[B] syntax in R where A is a matrix and B is a 2-column matrix. Since a data.table is a data.frame, it is compatible with R functions and packages that only accept data.frame. The 10 minute quick start guide to data.table may be a good place to start: vignette("datatable-intro"). Or, the first section of FAQs is intended to be read from start to finish and is considered core documentation: vignette("datatable-faq"). If you have read and searched these documents and the help page below, please feel free to ask questions on datatable-help or the Stack Overflow data.table tag. To report a bug please type: bug.report(package="data.table"). Please check the homepage for up to the minute news. Tip: one of the quickest ways to learn the features is to type example(data.table) and study the output at the prompt.

Usage

data.table(..., keep.rownames=FALSE, check.names=FALSE, key=NULL)
"["(x, i, j, by, keyby, with = TRUE, nomatch = getOption("datatable.nomatch"),                   # default: NA_integer_ mult = "all", roll = FALSE, rollends = if (roll=="nearest") c(TRUE,TRUE) else if (roll>=0) c(FALSE,TRUE) else c(TRUE,FALSE), which = FALSE, .SDcols, verbose = getOption("datatable.verbose"),                   # default: FALSE allow.cartesian = getOption("datatable.allow.cartesian"),   # default: FALSE drop = NULL,  on = NULL # join without setting keys, new feature from v1.9.6+ )

Arguments

...

Just as ... in data.frame. Usual recycling rules are applied to vectors of different lengths to create a list of equal length vectors.

keep.rownames

If ... is a matrix or data.frame, TRUE will retain the rownames of that object in a column named rn.

check.names

Just as check.names in data.frame.

key

Character vector of one or more column names which is passed to setkey. It may be a single comma separated string such as key="x,y,z", or a vector of names such as key=c("x","y","z").

A data.table.

Integer, logical or character vector, single column numeric matrix, expression of column names, list or data.table.

integer and logical vectors work the same way they do in [.data.frame. Other than NAs in logical i are treated as FALSE and a single NA logical is not recycled to match the number of rows, as it is in [.data.frame.

character is matched to the first column of x's key.

expression is evaluated within the frame of the data.table (i.e. it sees column names as if they are variables) and can evaluate to any of the other types.

When i is a data.table, x must have a key. i is joined to x using x's key and the rows in x that match are returned. An equi-join is performed between each column in i to each column in x's key; i.e., column 1 of i is matched to the 1st column of x's key, column 2 to the second, etc. The match is a binary search in compiled C in O(log n) time. If i has fewer columns than x's key then not all of x's key columns will be joined to (a common use case) and many rows of x will (ordinarily) match to each row of i. If i has more columns than x's key, the columns of i not involved in the join are included in the result. If i also has a key, it is i's key columns that are used to match to x's key columns (column 1 of i's key is joined to column 1 of x's key, column 2 of i's key to column 2 of x's key, and so on for as long as the shorter key) and a binary merge of the two tables is carried out. In all joins the names of the columns are irrelevant; the columns of x's key are joined to in order, either from column 1 onwards of i when i is unkeyed, or from column 1 onwards of i's key. In code, the number of join columns is determined by min(length(key(x)),if (haskey(i)) length(key(i)) else ncol(i)). All types of `i` may be prefixed with !. This signals a not-join or not-select should be performed. Throughout data.table documentation, where we refer to the type of `i`, we mean the type of `i` after the `!`, if present. See examples.

Advanced: When i is an expression of column names that evaluates to data.table or list, a join is performed. We call this a self join.

Advanced: When i is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope.

A single column name, single expresson of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are lists, too), or (when with=FALSE) a vector of names or positions to select.

j is evaluated within the frame of the data.table; i.e., it sees column names as if they are variables. Use j=list(...) to return multiple columns and/or expressions of columns. A single column or single expression returns that type, usually a vector. See the examples.

A single unquoted column name, a list() of expressions of column names, a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end), or a character vector of column names.

The list() of expressions is evaluated within the frame of the data.table (i.e. it sees column names as if they are variables). The data.table is then grouped by the by and j is evaluated within each group. The order of the rows within each group is preserved, as is the order of the groups. j=list(...) may be omitted when there is just one expression, for convenience, typically a single expression such as sum(colB); e.g., DT[,sum(colB),by=colA].

When by contains the first n columns of x's key, we call this a keyed by. In a keyed by the groups appear contiguously in RAM and memory is copied in bulk internally, for extra speed. Otherwise, we call it an ad hoc by. Ad hoc by is still many times faster than tapply, for example, but just not as fast as keyed by when datasets are very large, in particular when the size of each group is large. Not to be confused with keyby= defined below.

Advanced: When i is a data.table, DT[i,j,by=.EACHI] evaluates j for the groups in `DT` that each row in i joins to. That is, you can join (in i) and aggregate (in j) simultaneously. We call this grouping by each i. It is particularly memory efficient as you don't have to materialise the join result only to aggregate later. Please refer to this Stackoverflow answer for a more detailed explanation until we roll out vignettes.

Advanced: When grouping, symbols .SD, .BY, .N, .I and .GRP may be used in the j expression, defined as follows. .SD is a data.table containing the Subset of x's Data for each group, excluding any columns used in by (or keyby). .BY is a list containing a length 1 vector for each item in by. This can be useful when by is not known in advance. The by variables are also available to j directly by name; useful for example for titles of graphs if j is a plot command, or to branch with if() depending on the value of a group variable. .N is an integer, length 1, containing the number of rows in the group. This may be useful when the column names are not known in advance and for convenience generally. When grouping by i, .N is the number of rows in x matched to, for each row of i, regardless of whether nomatch is NA or 0. It is renamed to N (no dot) in the result (otherwise a column called ".N" could conflict with the .N variable, see FAQ 4.6 for more details and example), unless it is explicity named; e.g., DT[,list(total=.N),by=a]. .I is an integer vector equal to seq_len(nrow(x)). While grouping, it holds for each item in the group, it's row location in x. This is useful to subset in j; e.g. DT[, .I[which.max(somecol)], by=grp]. .GRP is an integer, length 1, containing a simple group counter. 1 for the 1st group, 2 for the 2nd, etc. .SD, .BY, .N, .I and .GRP are read only. Their bindings are locked and attempting to assign to them will generate an error. If you wish to manipulate .SD before returning it, take a copy(.SD) first (see FAQ 4.5). Using := in the j of .SD is reserved for future use as a (tortuously) flexible way to update DT by reference by group (even when groups are not contiguous in an ad hoc by).

Advanced: In the X[Y,j] form of grouping, the j expression sees variables in X first, then Y. We call this join inherited scope. If the variable is not in X or Y then the calling frame is searched, its calling frame, and so on in the usual way up to and including the global environment.

keyby

An ad-hoc-by or keyed-by (just as by= defined above) but with an additional setkey() run on the by columns of the result afterwards, for convenience. It is common practice to use `keyby=` routinely when you wish the result to be sorted. Out loud we read keyby= as by= then setkey. Otherwise, `by=` can be relied on to return the groups in the order they appear in the data.

with

By default with=TRUE and j is evaluated within the frame of x; column names can be used as variables. When with=FALSE j is a character vector of column names or a numeric vector of column positions to select, and the value returned is always a data.table. with=FALSE is often useful in data.table to select columns dynamically.

nomatch

Same as nomatch in match. When a row in i has no match to x's key, nomatch=NA (default) means NA is returned for x's non-join columns for that row of i. 0 means no rows will be returned for that row of i. The default value (used when nomatch is not supplied) can be changed from NA to 0 using options(datatable.nomatch=0).

mult

When multiple rows in x match to the row in i, mult controls which are returned: "all" (default), "first" or "last".

roll

Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i's row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then the prevailing value in x is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF). Usually, there should be no duplicates in x's key, the last key column is a date (or time, or datetime) and all the columns of x's key are joined to. A common idiom is to select a contemporaneous regular time series (dts) across a set of identifiers (ids): DT[CJ(ids,dts),roll=TRUE] where DT has a 2-column key (id,date) and CJ stands for cross join. When roll is a positive number, this limits how far values are carried forward. roll=TRUE is equivalent to roll=+Inf. When roll is a negative number, values are rolled backwards; i.e., next observation carried backwards (NOCB). Use -Inf for unlimited roll back. When roll is "nearest", the nearest value is joined to.

rollends

A logical vector length 2 (a single logical is recycled). When rolling forward (e.g. roll=TRUE) if a value is past the last observation within each group defined by the join columns, rollends[2]=TRUE will roll the last value forwards. rollends[1]=TRUE will roll the first value backwards if the value is before it. If rollends=FALSE the value of i must fall in a gap in x but not after the end or before the beginning of the data, for that group defined by all but the last join column. When roll is a finite number, that limit is also applied when rolling the ends.

which

TRUE returns the row numbers of x that i matches to. NA returns the row numbers of i that have no match in x. By default FALSE and the rows in x that match are returned.

.SDcols

Advanced. Specifies the columns of x included in .SD. May be character column names or numeric positions. This is useful for speed when applying a function through a subset of (possible very many) columns; e.g., DT[,lapply(.SD,sum),by="x,y",.SDcols=301:350].

verbose

TRUE turns on status and information messages to the console. Turn this on by default using options(datatable.verbose=TRUE). The quantity and types of verbosity may be expanded in future.

allow.cartesian

FALSE prevents joins that would result in more than nrow(x)+nrow(i) rows. This is usually caused by duplicate values in i's join columns, each of which join to the same group in `x` over and over again: a misspecified join. Usually this was not intended and the join needs to be changed. The word 'cartesian' is used loosely in this context. The traditional cartesian join is (deliberately) difficult to achieve in data.table: where every row in i joins to every row in x (a nrow(x)*nrow(i) row result). 'cartesian' is just meant in a 'large multiplicative' sense.

drop

Never used by data.table. Do not use. It needs to be here because data.table inherits from data.frame. See vignette("datatable-faq").

A named atomic vector of column names indicating which columns in i should be joined to which columns in x. See Examples.

Details

data.table builds on base R functionality to reduce 2 types of time :

programming time (easier to write, read, debug and maintain)
compute time

It combines database like operations such as subset, with and by and provides similar joins that merge provides but faster. This is achieved by using R's column based ordered in-memory data.frame structure, eval within the environment of a list, the [.data.table mechanism to condense the features, and compiled C to make certain operations fast.

The package can be used just for rapid programming (compact syntax). Largest compute time benefits are on 64bit platforms with plentiful RAM, or when smaller datasets are repeatedly queried within a loop, or when other methods use so much working memory that they fail with an out of memory error.

As with [.data.frame, compound queries can be concatenated on one line; e.g.,

    DT[,sum(v),by=colA][V1<300][tail(order(v1))] 6="" 300="" #="" sum(v)="" by="" cola="" then="" return="" the="" largest="" which="" are="" under="" <="" pre="">
The j expression does not have to return data; e.g.,
    DT[,plot(colB,colC),by=colA]
    # produce a set of plots (likely to pdf) returning no data

Multiple data.tables (e.g. X, Y and Z) can be joined in many ways; e.g.,
    X[Y][Z]
    X[Z][Y]
    X[Y[Z]]
    X[Z[Y]]

A data.table is a list of vectors, just like a data.frame. However :

 it never has rownames. Instead it may have one key of one or more columns. This key can be used for row indexing instead of rownames.

 it has enhanced functionality in [.data.table for fast joins of keyed tables, fast aggregation, fast last observation carried forward (LOCF) and fast add/modify/delete of columns by reference with no copy at all.

Since a list is a vector, data.table columns may be type list. Columns of type list can contain mixed types. Each item in a column of type list may be different lengths. This is true of data.frame, too.

Several methods are provided for data.table, including is.na, na.omit, t, rbind, cbind, merge and others.

References

data.table homepage: http://datatable.r-forge.r-project.org/ User reviews: http://crantastic.org/packages/data-table http://en.wikipedia.org/wiki/Binary_search http://en.wikipedia.org/wiki/Radix_sort

Examples

Run this code

## Not run: 
# example(data.table)  # to run these examples at the prompt## End(Not run)

DF = data.frame(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DF
DT
identical(dim(DT),dim(DF)) # TRUE
identical(DF$a, DT$a)      # TRUE
is.list(DF)                # TRUE
is.list(DT)                # TRUE

is.data.frame(DT)          # TRUE

tables()

DT[2]                      # 2nd row
DT[,v]                     # v column (as vector)
DT[,list(v)]               # v column (as data.table)
DT[2:3,sum(v)]             # sum(v) over rows 2 and 3
DT[2:5,cat(v,"\n")]        # just for j's side effect
DT[c(FALSE,TRUE)]          # even rows (usual recycling)

DT[,2,with=FALSE]          # 2nd column
colNum = 2
DT[,colNum,with=FALSE]     # same

setkey(DT,x)               # set a 1-column key. No quotes, for convenience.
setkeyv(DT,"x")            # same (v in setkeyv stands for vector)
v="x"
setkeyv(DT,v)              # same
# key(DT)<-"x"             # copies whole table, please use set* functions instead

DT["a"]                    # binary search (fast)
DT[x=="a"]                 # same; i.e. binary search (fast)

DT[,sum(v),by=x]           # keyed by
DT[,sum(v),by=key(DT)]     # same
DT[,sum(v),by=y]           # ad hoc by

DT["a",sum(v)]                    # j for one group
DT[c("a","b"),sum(v),by=.EACHI]   # j for two groups

X = data.table(c("b","c"),foo=c(4,2))
X

DT[X]                      # join
DT[X,sum(v),by=.EACHI]     # join and eval j for each row in i
DT[X,mult="first"]         # first row of each group
DT[X,mult="last"]          # last row of each group
DT[X,sum(v)*foo,by=.EACHI] # join inherited scope

setkey(DT,x,y)             # 2-column key
setkeyv(DT,c("x","y"))     # same

DT["a"]                    # join to 1st column of key
DT[.("a")]                 # same, .() is an alias for list()
DT[list("a")]              # same
DT[.("a",3)]               # join to 2 columns
DT[.("a",3:6)]             # join 4 rows (2 missing)
DT[.("a",3:6),nomatch=0]   # remove missing
DT[.("a",3:6),roll=TRUE]   # rolling join (locf)

DT[,sum(v),by=.(y%%2)]   # by expression
DT[,.SD[2],by=x]           # 2nd row of each group
DT[,tail(.SD,2),by=x]      # last 2 rows of each group
DT[,lapply(.SD,sum),by=x]  # apply through columns by group

DT[,list(MySum=sum(v),
         MyMin=min(v),
         MyMax=max(v)),
    by=.(x,y%%2)]        # by 2 expressions

DT[,sum(v),x][V1<20]       # compound query
DT[,sum(v),x][order(-V1)]  # ordering results

print(DT[,z:=42L])         # add new column by reference
print(DT[,z:=NULL])        # remove column by reference
print(DT["a",v:=42L])      # subassign to existing v column by reference
print(DT["b",v2:=84L])     # subassign to new column by reference (NA padded)

DT[,m:=mean(v),by=x][]     # add new column by reference by group
                           # NB: postfix [] is shortcut to print()

DT[,.SD[which.min(v)],by=x][]  # nested query by group

DT[!.("a")]                # not join
DT[!"a"]                   # same
DT[!2:4]                   # all rows other than 2:4
DT[x!="b" | y!=3]          # not yet optimized, currently vector scans
DT[!.("b",3)]              # same result but much faster


# new feature: 'on' argument, from v1.9.6+
DT1 = data.table(x=c("c", "a", "b", "a", "b"), a=1:5)
DT2 = data.table(x=c("d", "c", "b"), mul=6:8)

DT1[DT2, on=c(x="x")] # join on columns 'x'
DT1[DT2, on="x"] # same as above
DT1[DT2, .(sum(a) * mul), by=.EACHI, on="x"] # using by=.EACHI
DT1[DT2, .(sum(a) * mul), by=.EACHI, on="x", nomatch=0L] # using by=.EACHI

# Follow r-help posting guide, support is here (*not* r-help) :
# http://stackoverflow.com/questions/tagged/data.table
# or
# datatable-help@lists.r-forge.r-project.org

## Not run: 
# vignette("datatable-intro")
# vignette("datatable-faq")
# 
# test.data.table()          # over 1,300 low level tests
# 
# update.packages()          # keep up to date
# ## End(Not run)

Run the code above in your browser using DataLab