Extract.data.table: Query a data table

Description

Like [.data.frame but i and j can be expressions of column names directly. i may also be a data.table and this invokes a fast table join using binary search in O(log n) time. Allowing i to be data.table is consistent with subsetting an n-dimension array by an n-column matrix in base R.

Usage

## S3 method for class 'data.table':
[(x, i, j, by=NULL, with=TRUE, nomatch = NA,
  mult = "first", roll = FALSE, rolltolast = FALSE,
  which = FALSE, bysameorder = FALSE,
  verbose=getOption("datatable.verbose",FALSE))

Arguments

A data.table

Optional. i may be integer, logical, expression, data.table or character.

integer and logical work in the same way as for data.frame row selection, see with below. However, i is generally either an expression or a data.table ...

When i

Optional. j is normally an expression of column names or list of expressions enclosed in list(). It may also be a single integer column position, a single character column name, or a function. To specify a vector of integer column positions or vector

Optional. An unquoted column name, or a list() of unquoted column names, or a list() of expressions of column names. If by is supplied, the table is split up by the by criteria and the j expression run on each group.

with

by default j is evaluated within the frame of x so the column names are seen as variables. See with. DT[,"ColA"] returns the column as expected, as does DT[,1]. However DT[,1:2], DT[,c(1,2)] and DT[,c("a","b")

nomatch

Same as nomatch in match. nomatch=NA is like an outer join in SQL i.e. rows containing NA are returned for rows in i not in x. nomatch=0 is like an inner join in SQL i.e. no rows are returned for rows in i no

mult

"first","last" or "all". Controls what to return when there is more than one row in x matching the row in i. "all" is analogous to inner join behaviour in SQL. SQL has no first and last concept since it's data is inherently unordered (note that select top

roll

By default an equi-join is performed on each column in turn. For example given x=data.table(id,datetime), where the datetime is irregular, we look for the exact datetime in the i table occurring in the x table. In R we often roll data on through missing p

rolltolast

Like roll but the time series data is not rolled past the last observation. In finance this is useful when a stock is delisted and we do not want to roll the last price forward creating a flat line. nomatch determines whether NA is returned,

which

By default the subset of rows in x matching i are returned. TRUE returns the row numbers only as a vector. These row numbers can be stored and passed in as the i directly in further data.table queries for efficiency, or used in further logic.

bysameorder

Advanced. Tells data.table that the by expression keeps the groups in the same order as the key, allowing some speed ups. Do not set if unsure.

verbose

TRUE turns on status and information mesages to the console.

Value

A data.table when j is missing, even when this is one row. Unlike data.frame, if any columns are factor, unused factor levels in the subset are not retained in the result to save memory. When j is provided, the type of j determines the type of the result. When j returns a data.table with grouping either using by or passing groups into i with mult="all", then simplify controls whether the groups of data.tables are bound into one data.table. When j is used for its side effects only, NULL is returned.

Details

Builds on base R functionality to reduce 2 types of time :

programming time (easier to write, read, debug and maintain)
compute time

when combining database like operations (subset, with and by) and provides similar joins that merge provides but faster. This is achieved by using R's column based ordered in-memory data.frame, eval within the environment of a list (i.e. with), the [.data.table mechanism to condense the features and compiled C to make certain operations fast.

The package can be used solely for development time benefits on small datasets. Main compute time benefits are on 64bit platforms with plentiful RAM, or by using the ff package, or both.

Like a data.frame, the comma is optional inside [] when j is missing. However unlike with a data.frame a single unnamed argument refers to i rather than j. For example DT[3] returns the 3rd row as a 1 row data.table rather than DF[3] which returns the 3rd column as a vector. DT[3] is identical to DT[3,] unlike data.frame's. The i argument of matrix, data.frame and data.table subsetting using '[' is analogous to the 'where' clause in SQL. In data.table's when long expressions of column names appear as the i, or a join expression, we do not have to remember the comma at the end of the line. In a data.table if the i is missing, thats when you have to remember the comma, but at the beginning, so that the argument aligns to the j (j is analogous to 'select' clause in SQL). data.table's can be treated as a list (since they are a list as are data.frame's) by column index using [[ just as a data.frame e.g. DT[[3]] is identical to DF[3] and DF[[3]].

As with a data.frame a 1 row subset returns a 1 row data.table. As with data.frame a 1 column subset (or in data.table an expression of column names returning a vector) returns a vector. However unlike data.frame there is no drop argument. The type of j's result determines the result e.g. DT[,b] returns a vector, DT[,list(b)] returns a 1-column data.table. When no j clause is present, a data.table subset will always return a data.table even if only one row is returned (unlike matrix subsetting, but like data.frame subsetting).

As with data.frame subsetting mutliple queries can be concatenated on one line e.g. DT[a>3][order(b)] is analogous to 'select * from DT where a>3 order by b' in SQL. Ordering is a select of all the rows, but in a different order. Another analogy is DT[,sum(b),by="c"][c=="foo"] compared to "select sum(b) from DT group by c having c=='foo'". However, as noted under 'by' above this is more efficiently implemented using setkey(DT,c);DT["foo",sum(b),mult="all"].

The j expression does not have to return data. Data.tables 'queries' can be used for their side effects. For example DT[,plot(colB,colC),by="colA"] produces a set of plots, perhaps to a PDF file, and returns no data. If the j expression returns data and has side-effects (e.g. hist()) but only the side-effects are required, the j expression can be wrapped with invisible().

The j expression 'sees' variables in the calling frame and above including .GlobalEnv, see the examples. This is base R functionality from eval() and with().

The j expression can also be a function allowing access to the full data.table as in Hadley's plyr package.

When i is a logical expression e.g. DT[a==3], R must first create new memory to hold a logical vector as long as the rows in DT to hold the result of a==3. This is then used to subset DT. We call this a vector scan since every value of column a must be read. For large datasets (or repetitive algorithms on small datasets) vector scans have poor performance. Instead join to DT using setkey(DT,a) then DT[J(3),mult="all"].

References

http://en.wikipedia.org/wiki/Binary_search

Examples

Run this code

DF = data.frame(a=1:5, b=6:10)
DT = data.table(a=1:5, b=6:10)

DT[2]             # select * from DT where row number = 2
DT[2:3,sum(b)]    # select sum(b) from DT where row number in (2,3)
DT[2:5,plot(a)]   # used for j's side effect only i.e. displaying the plot
DT[c(FALSE,TRUE)] # extract all even numbered rows via standard R recycling

flush.console()

tt = subset(DF,a==3)
ss = DT[a==3]
identical(as.data.table(tt), ss)

tt = subset(DF,a==3,b)[[1]]+1
ss = DT[a==3,b+1]
identical(tt, ss)

tt = with(subset(DF,a==3),a+b+1)
ss = DT[a==3,a+b+1]
identical(tt, ss)

Lkp=1:3
tt = DF[with(DF,a %in% Lkp),]
ss = DT[a %in% Lkp]
identical(as.data.table(tt), ss)

# Examples above all use vector scans (bad)
# Examples below all use binary search (good)

DT = data.table(a=letters[1:5], b=6:10)
setkey(DT,a)
identical(DT["d"],DT[4])
identical(DT[J("d")], DT[4])
identical(DT[c("c","d")], DT[J(c("c","d"))])


DT = data.table(id=rep(c("A","B"),each=3), date=c(20080501L,20080502L,20080506L), v=1:6)
setkey(DT,id,date)
DT
DT["A"]                                    # all 3 rows for A
DT[J("A",20080502L)]                       # date matches exactly
DT[J("A",20080505L)]                       # NA since 5 May missing (outer join)
DT[J("A",20080505L),nomatch=0]             # inner join
dts = c(20080501L, 20080502L, 20080505L, 20080506L, 20080507L, 20080508L)
DT[J("A",dts)]                             # 3 dates match exactly
DT[J("A",dts),roll=TRUE]                   # roll previous data forward
DT[J("A",dts),rolltolast=TRUE]             # roll all but last observation forward
DT[J("A",dts),rolltolast=TRUE,nomatch=0]   # remove time series after last
cbind(DT[J("A",dts),roll=TRUE],dts)        # joined to date from dts


dts = rev(seq(as.Date("2008-06-30"), by=-1, length=5000))
dts = as.integer(gsub("-","",dts))
ids = paste(rep(LETTERS,each=26),LETTERS,sep="")
DT = data.table(CJ(id=ids, date=dts), v=rnorm(length(ids)*length(dts)))
setkey(DT,id,date)
system.time(tt <<- DT[id=="FD"])  # vector scan.   user 1.16  system 0.11  elapsed 1.27
system.time(ss <<- DT["FD"])      # binary search. user 0.02  system 0.00  elapsed 0.02
identical({setkey(tt,id,date);tt}, ss)
tables()


tt = DT[,mean(v),by="id"][c("FD","FE")]   # select mean(v) from DT group by id having id in ('FD','FE')
ss = DT[c("FD","FE"),mean(v)]             # more efficient way to group for known subgroups
identical(tt, ss)

tt = DT[c("FD","FE")][,mean(v),by="id,month=as.integer(date/100)"]

dt <- data.table(A = rep(1:3, each=4), B = rep(1:4, each=3), C = rep(1:2, 6))
dt[, head(.SD,n=2), by="B"]
dt[, transform(.SD,D=mean(A)), by="B"]
dt[, subset(.SD,A==min(A)), by="B"]

# Ensure you have at least 2.7GB free disk space comfortably before running this
# See Notes section above regarding ff
require(ff)
n=180000000L
DT = data.table(id=ff(0L,length=n), date=ff(1L,length=n), val=ff(0,length=n))
DT$id[167000001L] = 20
DT$val[167000002L] = 3.14
DT$date[167000003L] = 42
DT[167000000:167000005]
physical(DT$id)
rm(DT)
gc()  # return memory to OS

# See over 150 further examples in \link{test.data.table}()
test.data.table()

vignette("datatable-intro")
vignette("datatable-faq")
vignette("datatable-timings")

Run the code above in your browser using DataLab