Create key on a data table
Sorts a data.table and marks it as sorted. The sorted columns are the key. The key can be any columns in any order. The columns are always sorted in ascending order. The table is changed by reference. No copy is made at all, other than temporary working memory as large as one column.
setkey(x, ..., verbose=getOption("datatable.verbose",FALSE))
setkeyv(x, cols, verbose=getOption("datatable.verbose",FALSE))
key(x)
haskey(x)
copy(x)
setattr(x,name,value)
setnames(x,old,new)
setcolorder(x,neworder)
key(x) <- value   # deprecated, please use setkey or setkeyv instead.
- ...: The columns to sort by. Do not quote the column names. If ... is missing (i.e. setkey(DT)), all the columns are used.
- cols: A character vector (only) of column names.
- value: In (deprecated) key<-, a character vector (only) of column names. In setattr, the value to assign to the attribute, or NULL to remove the attribute, if present.
- name: The character attribute name.
- verbose: Output status and information.
- old: When new is provided, character names or numeric positions of column names to change. When new is not provided, the new column names, which must be the same length as the number of columns. See examples.
- new: Optional. New column names, the same length as old.
- neworder: Character vector of the new column name ordering. May also be column numbers.
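As a minimal sketch of two of the arguments above, the following reorders columns with setcolorder and sets then removes an attribute with setattr. The table contents and the attribute name "note" are hypothetical, chosen only for illustration.

```r
library(data.table)

DT <- data.table(a = 1:2, b = 3:4, c = 5:6)

# neworder given as a character vector of column names
setcolorder(DT, c("c", "a", "b"))
names(DT)

# name is a character attribute name; "note" is a made-up example
setattr(DT, "note", "demo")
attr(DT, "note")

# value = NULL removes the attribute again
setattr(DT, "note", NULL)
is.null(attr(DT, "note"))
```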
The sort is attempted with the very fast "radix" method in sort.list. If that fails, the sort reverts to the default order. That logic is repeated column by column. The sort is stable; i.e., the order of ties (if any) is preserved.
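Stability can be checked on a small table. This is only a sketch; the column names g and id are hypothetical. Rows that tie on the key column should keep their original relative order.

```r
library(data.table)

# id records the original row order; g contains ties
DT <- data.table(g = c("b", "a", "b", "a"), id = 1:4)

setkey(DT, g)

# Within each value of g, id is still increasing:
# the "a" rows were originally rows 2 and 4, the "b" rows 1 and 3
DT$id
```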
v=NULL, the key is removed.
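Key removal can be observed with haskey() and key(). The sketch below assumes NULL is passed as the cols argument of setkeyv; the table itself is hypothetical.

```r
library(data.table)

DT <- data.table(A = 3:1, B = c("x", "y", "z"))

setkey(DT, A)
haskey(DT)         # the table is now keyed

# Assumption: NULL as the column vector removes the key
setkeyv(DT, NULL)
haskey(DT)
key(DT)
```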
In v1.7.8, the key<- syntax was deprecated. The <- method copies the whole table and we know of no way to avoid that copy without a change in R itself. Please use the set* functions instead, which make no copy at all.
setkey accepts unquoted column names for convenience, whilst setkeyv accepts one vector of column names.
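The two forms can be compared side by side. In this sketch (table and variable names are hypothetical) the same key is set once with unquoted names and once from a character vector, which is the form that suits programmatic use.

```r
library(data.table)

DT1 <- data.table(A = 3:1, B = c("x", "y", "z"))
DT2 <- copy(DT1)        # explicit copy so the two tables are independent

# Interactive form: unquoted column names
setkey(DT1, A, B)

# Programmatic form: column names held in a character vector
cols <- c("A", "B")
setkeyv(DT2, cols)

identical(key(DT1), key(DT2))
```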
The problem (for data.table) with the copy by key<- (other than being slower) is that R doesn't maintain the over-allocated truelength, but it looks as though it has. Adding a column by reference using := after a key<- was therefore a memory overwrite and eventually a seg fault; the over-allocated memory wasn't really there after the copy. data.tables have a new attribute .internal.selfref to catch and warn about such copies in future. This attribute has been implemented in a way that is friendly with
For the same reason, please use
setattr() rather than
setnames() rather than
setcolorder() rather than
It isn't good programming practice, in general, to use column numbers rather than names. This is why
setkeyv only accept column names, and why
setnames() is recommended to be names. If you use column numbers then bugs (possibly silent) can more easily creep into your code as time progresses if changes are made elsewhere in your code; e.g., if you add, remove or reorder columns in a few months' time, a
setkey by column number will then refer to a different column, possibly returning incorrect results with no warning. (A similar concept exists in SQL, where
"select * from ..." is considered poor programming style when a robust, maintainable system is required.) If you wish to use column numbers, it's possible but a little harder; e.g.,
data.table is modified by reference, and returned (invisibly) so it can be used in compound statements; e.g., setkey(DT,a)[J("foo")]. If you require a copy, take a copy first (using
copy() may also sometimes be useful before := is used to subassign to a column by reference. See
base::sort.list(x,method="radix") actually invokes a counting sort, not a radix sort. See do_radixsort in src/main/sort.c. A counting sort, however, is particularly suitable for sorting integers and factors, and we like it. Anyway, this is one reason data.table 'likes' integers and factors.
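To illustrate why a counting sort suits integers, here is a minimal sketch of the idea in plain R. The helper counting_sort is hypothetical and is not how R or data.table implement it; it assumes a non-empty vector of positive integers. The result is compared against the sort.list call mentioned above.

```r
# Hypothetical helper: count occurrences of each value, then emit each
# value as many times as it was seen. Assumes positive integers only.
counting_sort <- function(x) {
  stopifnot(is.integer(x), length(x) > 0, all(x >= 1L))
  counts <- tabulate(x, nbins = max(x))   # counts[v] = number of times v occurs
  rep(seq_along(counts), times = counts)  # values in ascending order
}

x <- c(3L, 1L, 2L, 3L, 1L)
sorted_by_count <- counting_sort(x)
sorted_by_base  <- x[sort.list(x, method = "radix")]

identical(sorted_by_count, sorted_by_base)
```

The work is proportional to the number of elements plus the range of values, with no comparisons between elements, which is why a narrow integer range (such as factor codes) is the favourable case.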
# Type 'example(setkey)' to run these at prompt and browse output

DT = data.table(A=5:1,B=letters[5:1])
DT # before
setkey(DT,B)          # re-orders table and marks it sorted.
DT # after
tables()              # KEY column reports the key'd columns
key(DT)
keycols = c("A","B")
setkeyv(DT,keycols)   # rather than key(DT)<-keycols (which copies entire table)

DT = data.table(A=5:1,B=letters[5:1])
DT2 = DT                        # does not copy
setkey(DT2,B)                   # does not copy-on-write to DT2
identical(DT,DT2)               # TRUE. DT and DT2 are two names for the same keyed table

DT = data.table(A=5:1,B=letters[5:1])
DT2 = copy(DT)                  # explicit copy() needed to copy a data.table
setkey(DT2,B)                   # now just changes DT2
identical(DT,DT2)               # FALSE. DT and DT2 are now different tables

DF = data.frame(a=1:2,b=3:4)       # base data.frame to demo copies
try(tracemem(DF))                  # try() for non-Windows where R is faster without memory profiling
colnames(DF)[1] <- "A"             # 4 copies of entire object
names(DF)[1] <- "A"                # 3 copies of entire object
`names<-`(DF,c("A","b"))           # 1 copy of entire object
x=`names<-`(DF,c("A","b"))         # still 1 copy (so not print method)
# What if DF is large, say 10GB in RAM. Copy 10GB just to change a column name?

DT = data.table(a=1:2,b=3:4,c=5:6)
try(tracemem(DT))
setnames(DT,"b","B")               # by name; no match() needed
setnames(DT,3,"C")                 # by position
setnames(DT,2:3,c("D","E"))        # multiple
setnames(DT,c("a","E"),c("A","F")) # multiple by name
setnames(DT,c("X","Y","Z"))        # replace all
# And, no copy of DT was made by setnames() at all