data.table (version 1.7.8)

setkey: Create key on a data table

Description

Sorts a data.table and marks it as sorted. The sorted columns are the key. The key can be any columns in any order. The columns are always sorted in ascending order. The table is changed by reference: no copy is made, other than temporary working memory as large as one column.

Usage

setkey(x, ..., verbose=getOption("datatable.verbose",FALSE))
setkeyv(x, cols, verbose=getOption("datatable.verbose",FALSE))
key(x)
key(x) <- value
haskey(x)
copy(x)
setattr(x,name,value)

Arguments

x
An unquoted name of a data.table.
...
The columns to sort by. Do not quote the column names. If ... is missing (i.e. setkey(DT)), all the columns are used.
cols
A character vector (only) of column names.
value
In (deprecated) key<-, a character vector (only) of column names. In setattr, the value to assign to the attribute; NULL removes the attribute, if present.
name
The character attribute name.
verbose
Output status and information.
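
The difference between the unquoted form, the character-vector form and the "all columns" form can be sketched as follows. This is a minimal illustration, not part of the original help page; the table contents and the name k_all are made up for the example.

```r
library(data.table)

# Illustrative table; column names and values are invented for this sketch
DT <- data.table(A = c(2L, 1L, 2L), B = c("y", "x", "x"))

setkey(DT)                  # no columns given: all columns form the key
k_all <- key(DT)            # remember the key set by setkey(DT)

setkeyv(DT, c("B", "A"))    # character vector of column names
key(DT)                     # the key is now B, then A
haskey(DT)                  # TRUE once a key has been set
```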

Value

  • The data.table is modified by reference, and returned so it can be used in compound statements; e.g., setkey(DT,a)[J("foo")]. If you require a copy, take a copy first (using DT2=copy(DT)). copy() may also sometimes be useful before := is used to subassign to a column by reference. See ?copy.
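
A short sketch of the compound statement mentioned above, assuming a small made-up table with a character column a. Because setkey returns the keyed table, the join can follow immediately; DT itself is also keyed as a side effect.

```r
library(data.table)

# Illustrative data; the values are invented for this sketch
DT <- data.table(a = c("foo", "bar", "foo"), v = 1:3)

# setkey() returns the keyed table, so a join on the key can be chained
res <- setkey(DT, a)[J("foo")]

key(DT)   # DT was keyed by reference as part of the same statement
res       # the rows of DT where a is "foo"
```

If the original row order of DT matters, take DT2 = copy(DT) first and set the key on DT2 instead.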

Details

The sort is attempted with the very fast "radix" method in sort.list. If that fails, the sort reverts to the default method in order. That logic is repeated column by column. The sort is stable; i.e., the order of ties (if any) is preserved. If v=NULL, the key is removed.

In v1.7.8, the key<- syntax was deprecated. The <- method copies the whole table and we know of no way to avoid that copy without a change in R itself. Please use the set* functions instead, which make no copy at all. setkey accepts unquoted column names for convenience, whilst setkeyv accepts a vector of column names.

The problem (for data.table) with the copy by key<- (other than being slower) is that R doesn't maintain the over-allocated truelength, but it looks as though it has. Adding a column by reference using := after a key<- was therefore a memory overwrite and eventually a seg fault; the over-allocated memory wasn't really there after key<-'s copy. data.tables have a new attribute .internal.selfref to catch and warn about such copies in future. This attribute has been implemented in a way that is friendly with identical() and object.size().
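
To illustrate the stable sort and the use of setkeyv in place of the deprecated key<-, here is a minimal sketch. The table and the id column (which records the original row order) are assumptions made for the example.

```r
library(data.table)

# id records the original row position so tie order can be inspected
DT <- data.table(g = c("b", "a", "b", "a"), id = 1:4)

# Use setkeyv(DT, "g") rather than key(DT) <- "g"; no copy of DT is made
setkeyv(DT, "g")

DT$g    # sorted ascending by the key column
DT$id   # within each value of g, the original relative order is kept
```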

For the same reason, please use setattr() rather than attr(x,name)<-value. And (TO DO) some set interface to names and colnames.

It isn't good programming practice, in general, to use column numbers rather than names. This is why setkey and setkeyv only accept column names. If you use column numbers, bugs (possibly silent) can more easily creep into your code as time progresses if changes are made elsewhere in your code; e.g., if you add, remove or reorder columns in a few months' time, a setkey by column number will then refer to a different column, possibly returning incorrect results with no warning. (A similar concept exists in SQL, where "select * from ..." is considered poor programming style when a robust, maintainable system is required.) If you wish to use column numbers, it's possible but a little harder; e.g., setkeyv(DT,colnames(DT)[1:2]).
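
Both points can be sketched together: setting and removing an attribute with setattr, and translating column positions into names before calling setkeyv. The attribute name "note", its value and the table contents are hypothetical, chosen only for this example.

```r
library(data.table)

# Illustrative table; names and values are invented for this sketch
DT <- data.table(id = 3:1, grp = c("c", "b", "a"), val = c(0.3, 0.2, 0.1))

# setattr() assigns the attribute by reference, instead of attr(DT, "note") <- ...
setattr(DT, "note", "demo table")
attr(DT, "note")

# Passing NULL as the value removes the attribute again
setattr(DT, "note", NULL)
is.null(attr(DT, "note"))

# Column positions are turned into names first, then passed to setkeyv()
setkeyv(DT, colnames(DT)[1:2])
key(DT)
```

Resolving positions to names at the call site keeps the key readable through key(DT), although the position lookup itself is still sensitive to later column reordering.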

References

http://en.wikipedia.org/wiki/Radix_sort http://en.wikipedia.org/wiki/Counting_sort

See Also

data.table, tables, J, sort.list, copy, :=

Examples

DT = data.table(A=5:1,B=letters[5:1])
DT # before
setkey(DT,B)  # re-orders table and marks it sorted.
DT # after
tables()      # KEY column reports the key'd columns
key(DT)
setkeyv(DT,c("A"))  # rather than key(DT)<-c("A")

DT = data.table(A=5:1,B=letters[5:1])
DT2 = DT              # not enough to copy
setkey(DT2,B)         # does not copy on write to DT2
identical(DT,DT2)     # TRUE. DT and DT2 are two names for the same keyed table

DT = data.table(A=5:1,B=letters[5:1])
DT2 = copy(DT)        # explicit copy is required for data.table
setkey(DT2,B)         # just changes DT2
identical(DT,DT2)     # FALSE. DT and DT2 are now different tables
