:=: Assignment by reference

Description

Fast add, remove and modify subsets of columns, by reference.

Usage

LHS := RHS         # in j only i.e. DT[i,LHS:=RHS]

Arguments

LHS

A single column name. Or, when with=FALSE, a vector of column names or numeric positions (or a variable that evaluates as such). If the column doesn't exist, it is added, by reference.

RHS

A vector of replacement values. It is recycled in the usual way to fill the number of rows satisfying i, if any. Or, when with=FALSE, a list of replacement vectors which are applied (the list is recycled

Value

DT is modified by reference and the new value is returned. If you require a copy, take a copy first (using DT2=copy(DT)). Recall that this package is for large data (of mixed column types, with multi-column keys) where updates by reference can be many orders of magnitude faster than copying the entire table.
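The difference between a second reference and a true copy can be sketched as follows. This is a minimal illustration, assuming the data.table package is installed; the names DT2 and DT3 and the column b are hypothetical and chosen only for the example.

```r
library(data.table)

DT  <- data.table(a = 1:3)
DT2 <- copy(DT)   # independent copy, taken before the update
DT3 <- DT         # a second name bound to the same object, not a copy

DT[, b := a * 2L] # add column b by reference

names(DT2)        # the copy is untouched
names(DT3)        # the second name sees the new column
```

Taking the copy before the update is what keeps DT2 unchanged; plain assignment with <- only binds another name to the same table.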

Details

:= is defined for use in j only. This syntax updates the column(s) by reference. It makes no copies of any part of memory at all. Typical usages are:

    DT[i,colname:=value]
    DT[i,"colname":=value,with=FALSE]
    DT[i,(3:6):=value,with=FALSE]
    DT[i,colnamevector:=value,with=FALSE]

The following all result in a friendly error (by design):

    x := 1L                    # friendly error
    DT[i,colname] := value     # friendly error
    DT[i]$colname := value     # friendly error
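As a rough sketch of the single-column form and of the error raised outside j, assuming the data.table package is installed (the table and its columns id, val and flag are made up for illustration; only the DT[i,colname:=value] form is shown, since the with=FALSE forms may behave differently across package versions):

```r
library(data.table)

DT <- data.table(id = 1:4, val = c(10L, 20L, 30L, 40L))

DT[id > 2, val := 0L]          # length-1 RHS recycled over the rows matching i
DT[, flag := id %% 2L == 0L]   # column does not exist yet, so it is added

# Using := outside j of a data.table is an error by design;
# tryCatch() is used here only to capture that error.
res <- tryCatch({ x := 1L; "no error" }, error = function(e) "error")
```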

:= in j can be combined with all types of i, such as binary search.

When the LHS is a factor column and the RHS is a character vector with items missing from the factor levels, the new level(s) are automatically added (by reference, efficiently), unlike base methods.

Unlike <- for data.frame, the (potentially large) LHS is not coerced to match the type of the (often small) RHS. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given (whether or not fractional data is truncated). The motivation for this is efficiency. It is best to get the column types correct up front and stick to them.

Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then plonked into that column slot and we call this plonk syntax, or replace column syntax if you prefer. By needing to construct a full length vector of a new type, you as the user are more aware of what is happening, and it's clearer to readers of your code that you really do intend to change the column type.

data.tables are not copied-on-write by setkey, key<- or :=. See copy.

Additional resources: search for ":=" in the FAQs vignette (../doc/datatable-faq.pdf; 3 FAQs mention :=), search Stack Overflow's data.table tag for "reference" (http://stackoverflow.com/search?q=%5Bdata.table%5D+reference; 6 questions) and search data.table's wiki (http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table).

Advanced (internals): it is easy to see how sub-assigning to columns is done internally. Removing columns by reference is also straightforward: only the vector of column pointers is modified (using memmove in C). How columns can be added by reference is harder to see: the list vector of column pointers is over-allocated, see truelength. By defining := in j we believe update syntax is natural, and scales, but also it bypasses [<- dispatch via *tmp* and allows := to update by reference with no copies of any part of memory at all.
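The factor-level, coercion and plonk rules described in Details can be sketched as below. This is an illustration only, assuming the data.table package is installed; the table and its columns f and n are hypothetical. suppressWarnings() is used because the text above says a warning accompanies the double-to-integer coercion, and whether one is actually raised may depend on the package version.

```r
library(data.table)

DT <- data.table(f = factor(c("a", "b", "a")), n = 1:3)

# Factor LHS, character RHS with an unseen value: the level is added.
DT[2, f := "z"]
levels(DT$f)

# Sub-assigning a double into an integer column: the RHS is coerced,
# the column keeps its type.
suppressWarnings(DT[1, n := 10])
is.integer(DT$n)

# Plonk syntax: supply a whole column as the RHS to change the type.
DT[, n := as.numeric(n)]
is.double(DT$n)
```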

Examples

    library(data.table)
    DT = data.table(a=LETTERS[c(1,1:3)],b=4:7,key="a")
    DT[,c:=8]        # add a numeric column, 8 for all rows
    DT[,d:=9L]       # add an integer column, 9L for all rows
    DT[,c:=NULL]     # remove the c column
    DT[2,d:=10L]     # subassign by reference to column d
    DT               # DT changed by reference

    DT[b>4,b:=d*2L]  # subassign to b using d, where b>4
    DT["A",b:=0L]    # binary search for group "A" and set column b

    # Not run: v and group are placeholder names, not columns of DT
    # DT[,newcol:=sum(v),by=group]  # like fast transform() by group (not yet implemented)
   
    # Speed example ...
        
    m = matrix(1,nrow=100000,ncol=100)
    DF = as.data.frame(m)
    DT = as.data.table(m)    
    system.time(for (i in 1:1000) DF[i,1] <- i)
    # 591 seconds        
    system.time(for (i in 1:1000) DT[i,V1:=i])
    # 1.16 seconds  ( 509 times faster )
