data.table: Just like a data.frame, but without row names

Description

Same as data.frame() but the result has no row names. In some cases (see the example), row names alone account for 90% of the memory used by a data.frame. Removing them can therefore use up to 10 times less memory, and make the object up to 10 times faster to create and to copy. Storing 1:nrow in character form is inefficient, since rows can be indexed by their integer position. For example, DF[6,] and DF["6",] both work, but the former is more efficient.
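As a small illustration of the indexing point, the following sketch uses only base R data.frame (not data.table) to check that an integer position and the matching character row name select the same row. The object DF and its columns here are hypothetical and chosen only for this example.

```r
# Hypothetical example data; not taken from the original help page.
DF <- data.frame(colA = 1:10, colB = 11:20)

# Default row names are exposed as the character strings "1", "2", ...
first_names <- rownames(DF)[1:3]

# Select the sixth row by integer position and by character row name.
by_position <- DF[6, ]
by_name     <- DF["6", ]

# Both selections carry the same column values.
same_values <- by_position$colA == by_name$colA &&
               by_position$colB == by_name$colB
same_values
```

The integer form indexes directly, whereas the character form first has to match "6" against the row names.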

Usage

data.table(..., keep.rownames = FALSE, check.names = TRUE)

Arguments

...

Just as ... in data.frame()

keep.rownames

If ... is itself a data.frame, TRUE retains its rownames in the first column

check.names

Just as in data.frame()

Value

Identical to the result of data.frame, but without the row.names attribute.

Details

This class really does very little. The only reason for its existence is that the white book specifies that a data.frame must have rownames. Most of the code is copied from base functions, with the code manipulating row.names removed.

A data.table is identical to a data.frame other than:

- it doesn't have rownames
- [,drop] is FALSE by default, so selecting a single row will always return a single-row data.table, not a vector
- the comma is optional inside [], so DT[3] returns the 3rd row as a 1-row data.table
- [] is like a call to subset()
- [,...] is like a call to with() (not yet implemented)

Motivation:

- up to 10 times less memory
- up to 10 times faster to create and copy
- simpler R code by allowing column name expressions within []
- the white book defines rownames, so data.frame itself can't be changed ... => new class
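To see what the drop default changes, the sketch below shows the base R data.frame behaviour that data.table departs from: with data.frame, some selections collapse to a plain vector unless drop = FALSE is passed explicitly. It uses base R only, and the objects DF and one_col are hypothetical names introduced for this illustration.

```r
# Hypothetical example data; not taken from the original help page.
DF      <- data.frame(colA = 1:5, colB = 6:10)
one_col <- data.frame(colA = 1:5)

# Selecting one column of a data.frame drops to a vector by default.
col_default <- DF[, "colA"]
col_kept    <- DF[, "colA", drop = FALSE]

# Selecting one row of a one-column data.frame also drops to a vector.
row_default <- one_col[3, ]
row_kept    <- one_col[3, , drop = FALSE]

is.data.frame(col_default)
is.data.frame(col_kept)
is.data.frame(row_default)
is.data.frame(row_kept)
```

With drop = FALSE as the default, as described for data.table above, the result type does not depend on how many rows or columns happen to be selected.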

References

http://tolstoy.newcastle.edu.au/R/devel/05/12/3439.html

Examples


nr = 1000000
D = rep(1:5,nr/5)
system.time(DF <<- data.frame(colA=D, colB=D))  # 2.08 
system.time(DT <<- data.table(colA=D, colB=D))  # 0.15  (over 10 times faster to create)
identical(as.data.table(DF), DT)
identical(dim(DT),dim(DF))
object.size(DF)/object.size(DT)                 # 10 times less memory

tt = subset(DF,colA>3)
ss = DT[colA>3]
identical(as.data.table(tt), ss)

mean(subset(DF,colA+colB>5,"colB")$colB)        # extract the column: mean() needs a vector, not a data.frame
mean(DT[colA+colB>5]$colB)

tt = with(subset(DF,colA>3),colA+colB)
ss = with(DT[colA>3],colA+colB)                 # but could be:  DT[colA>3,colA+colB]  (not yet implemented)
identical(tt, ss)

tt = DF[with(DF,tapply(1:nrow(DF),colB,last)),] # select last row grouping by colB
ss = DT[tapply(1:nrow(DT),colB,last)]           # but could be:  DT[last,group=colB]  (not yet implemented)
identical(as.data.table(tt), ss)

Lkp=1:3
tt = DF[with(DF,colA %in% Lkp),]              
ss = DT[colA %in% Lkp]                        # expressions inside the [] can see objects in the calling frame
identical(as.data.table(tt), ss)
