# setkey

by M Dowle

##### Create key on a data table

setkey() sorts a data.table and marks it as sorted. The sorted columns are the key. The key can be any columns in any order. The columns are always sorted in ascending order. The table is changed by reference: no copy is made at all, other than temporary working memory as large as one column. All set* functions similarly change their input by reference with no copy at all and are documented here, except set(), which is documented in :=.

Keywords
data
##### Usage
setkey(x, ..., verbose=getOption("datatable.verbose"))
setkeyv(x, cols, verbose=getOption("datatable.verbose"))
key(x)
copy(x)
setattr(x,name,value)
setnames(x,old,new)
setcolorder(x,neworder)
key(x) <- value   #  DEPRECATED, please use setkey or setkeyv instead.
##### Arguments
x
A data.table, except for setattr, which accepts any input (e.g., columns of a data.frame or data.table), and setnames, which also accepts a data.frame.
...
The columns to sort by. Do not quote the column names. If ... is missing (i.e. setkey(DT)), all the columns are used. NULL removes the key.
cols
A character vector (only) of column names.
value
In (deprecated) key<-, a character vector (only) of column names. In setattr, the value to assign to the attribute; NULL removes the attribute, if present.
name
The character attribute name.
verbose
Output status and information.
old
When new is provided, character names or numeric positions of column names to change. When new is not provided, the new column names, which must be the same length as the number of columns. See examples.
new
Optional. New column names, the same length as old.
neworder
Character vector of the new column name ordering. May also be column numbers.
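
The argument forms described above can be sketched as follows. This is a minimal illustration, assuming the data.table package is installed; the table DT and its columns are made up for the example.

```r
library(data.table)

# Hypothetical two-column table, deliberately unsorted.
DT <- data.table(b = c(2L, 1L, 2L), a = c("y", "x", "x"))

setkey(DT)                     # no columns given: key on all columns, in column order
key(DT)                        # "b" "a"

setkey(DT, NULL)               # NULL removes the key
key(DT)                        # NULL

setkeyv(DT, c("a", "b"))       # setkeyv takes a character vector of names
key(DT)                        # "a" "b"

setcolorder(DT, c("a", "b"))   # neworder given as column names
names(DT)                      # "a" "b"
```

Because setkeyv takes an ordinary character vector, it is the natural form to use when the key columns are held in a variable.
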
##### Details

The sort is attempted with the very fast "radix" method in sort.list. If that fails, the sort reverts to the default method in order. That logic is repeated column by column. The sort is stable; i.e., the order of ties (if any) is preserved.

In v1.7.8, the key<- syntax was deprecated. The <- method copies the whole table and we know of no way to avoid that copy without a change in R itself. Please use the set* functions instead, which make no copy at all. setkey accepts unquoted column names for convenience, whilst setkeyv accepts one vector of column names.

The problem (for data.table) with the copy by key<- (other than being slower) is that R doesn't maintain the over-allocated truelength, but it looks as though it has. Adding a column by reference using := after a key<- was therefore a memory overwrite and eventually a segfault; the over-allocated memory wasn't really there after key<-'s copy. data.tables now have an attribute .internal.selfref to catch and warn about such copies. This attribute has been implemented in a way that is friendly with identical() and object.size().

For the same reason, please use setattr() rather than attr(x,name)<-value, setnames() rather than names(x)<-value or colnames(x)<-value, and setcolorder() rather than DT<-DT[,neworder,with=FALSE]. In particular, setattr() is useful in many situations to set attributes by reference and can be used on any object or part of an object, not just data.tables.

It isn't good programming practice, in general, to use column numbers rather than names. This is why setkey and setkeyv only accept column names, and why old in setnames() is recommended to be names. If you use column numbers, bugs (possibly silent) can more easily creep into your code as changes are made elsewhere; e.g., if you add, remove or reorder columns in a few months' time, a setkey by column number will then refer to a different column, possibly returning incorrect results with no warning. (A similar concept exists in SQL, where "select * from ..." is considered poor programming style when a robust, maintainable system is required.) If you really wish to use column numbers, it's possible but deliberately a little harder; e.g., setkeyv(DT,colnames(DT)[1:2]).
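
The points in this section can be tried out with a small sketch, assuming the data.table package is installed. The table, the column names and the "units" attribute are hypothetical, chosen only for illustration.

```r
library(data.table)

# Stable sort: rows with equal key values keep their original relative order.
DT <- data.table(g = c("b", "a", "b", "a"), id = 1:4)
setkey(DT, g)
DT$id                                  # 2 4 1 3

# By-reference helpers: a second name for the same table sees every change.
alias <- DT                            # no copy, just another name
setnames(alias, "id", "row_id")
setcolorder(alias, c("row_id", "g"))
names(DT)                              # "row_id" "g"

# setattr() works on any object, not only data.tables.
x <- c(1.5, 2.5)
setattr(x, "units", "cm")
attr(x, "units")                       # "cm"

# Keying by position is possible, but only by converting positions to names.
setkeyv(DT, colnames(DT)[1:2])
key(DT)                                # "row_id" "g"
```

Converting positions to names at the call site keeps the risk visible: anyone reading setkeyv(DT, colnames(DT)[1:2]) can see that the key depends on the current column order.
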

##### Value

The input is modified by reference, and returned (invisibly) so it can be used in compound statements; e.g., setkey(DT,a)[J("foo")]. If you require a copy, take a copy first (using DT2=copy(DT)). copy() may also sometimes be useful before := is used to subassign to a column by reference. See ?copy.

Note that setattr is also in package bit. Both packages merely expose R's internal setAttrib function at C level, but differ in return value. bit::setattr returns NULL (invisibly) to remind you the function is used for its side effect. data.table::setattr returns the changed object (invisibly), for use in compound statements.
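
A short sketch of the invisible return value and of copy(), assuming the data.table package is installed; the table and its values are invented for the example.

```r
library(data.table)

DT <- data.table(a = c("foo", "bar", "foo"), v = 1:3)

# setkey() returns DT invisibly, so a keyed join can follow immediately.
res <- setkey(DT, a)[J("foo")]
res$v                  # 1 3

# copy() gives an independent table; re-keying the copy leaves DT alone.
DT2 <- copy(DT)
setkey(DT2, v)
key(DT)                # "a"
key(DT2)               # "v"
```
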

##### Note

Despite its name, base::sort.list(x,method="radix") actually invokes a counting sort in R, not a radix sort. See do_radixsort in src/main/sort.c. A counting sort, however, is particularly suitable for sorting integers and factors, and we like it. In fact we like it so much that data.table contains a counting sort algorithm for character vectors using R's internal global string cache. This is particularly fast for character vectors containing many duplicates, such as grouped data in a key column. This means that character is often preferred to factor. Factors are still fully supported, in particular ordered factors (where the levels are not in alphabetic order).
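
The difference between character and ordered factor keys can be seen in a small sketch, assuming the data.table package is installed. The size labels are hypothetical. It only shows the resulting order, not the sorting algorithm used internally.

```r
library(data.table)

sizes <- c("small", "large", "medium")

# Character key: rows are ordered by the strings themselves.
DT_chr <- data.table(size = sizes, n = 1:3)
setkey(DT_chr, size)
DT_chr$size                      # "large" "medium" "small"

# Ordered factor key: rows are ordered by level position, not alphabetically.
DT_fct <- data.table(
  size = factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE),
  n = 1:3
)
setkey(DT_fct, size)
as.character(DT_fct$size)        # "small" "medium" "large"
```
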

##### References

data.table, tables, J, sort.list, copy, :=

##### Aliases

• setkey
• setkeyv
• key
• key<-
• copy
• setattr
• setnames
• setcolorder
##### Examples
# Type 'example(setkey)' to run these at prompt and browse output

DT = data.table(A=5:1,B=letters[5:1])
DT # before
setkey(DT,B)          # re-orders table and marks it sorted.
DT # after
tables()              # KEY column reports the key'd columns
key(DT)
keycols = c("A","B")
setkeyv(DT,keycols)    # rather than key(DT)<-keycols (which copies entire table)

DT = data.table(A=5:1,B=letters[5:1])
DT2 = DT              # does not copy
setkey(DT2,B)         # does not copy-on-write to DT2
identical(DT,DT2)     # TRUE. DT and DT2 are two names for the same keyed table

DT = data.table(A=5:1,B=letters[5:1])
DT2 = copy(DT)        # explicit copy() needed to copy a data.table
setkey(DT2,B)         # now just changes DT2
identical(DT,DT2)     # FALSE. DT and DT2 are now different tables

DF = data.frame(a=1:2,b=3:4)       # base data.frame to demo copies, as of R 2.15.1
try(tracemem(DF))                  # try() for non-Windows where R is faster without memory profiling
colnames(DF)[1] <- "A"             # 4 copies of entire object
names(DF)[1] <- "A"                # 3 copies of entire object
names(DF) <- c("A", "b")           # 1 copy of entire object
`names<-`(DF,c("A","b"))           # 1 copy of entire object

# What if DF is large, say 10GB in RAM. Copy 10GB, even once, just to change a column name?

DT = data.table(a=1:2,b=3:4,c=5:6)
try(tracemem(DT))
setnames(DT,"b","B")               # by name; no match() needed
setnames(DT,3,"C")                 # by position
setnames(DT,2:3,c("D","E"))        # multiple
setnames(DT,c("a","E"),c("A","F")) # multiple by name
setnames(DT,c("X","Y","Z"))        # replace all

# And, no copy of DT was made by setnames() at all.
Documentation reproduced from package data.table, version 1.8.8, License: GPL (>= 2)
