setkey sorts a data.table and marks it as sorted with an
attribute "sorted". The sorted columns are the key. The key can be any
number of columns. The data is always sorted in ascending order, with NAs
(if any) always first. The table is changed by reference, and no
memory is used for the key (other than marking which columns the data is sorted by).
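A minimal sketch of the behaviour described above, assuming the data.table package is installed. The table DT and its columns id and v are made up for illustration.

```r
library(data.table)

# Hypothetical example table, deliberately unsorted
DT <- data.table(id = c("b", "a", "c", "a"), v = 1:4)

# Sorts DT in place; no assignment is needed because DT is changed by reference
setkey(DT, id)

DT$id               # rows are now in ascending order of id
attr(DT, "sorted")  # the key is recorded in the "sorted" attribute
```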
There are three reasons setkey is desirable:
binary search and joins are faster when they detect they can use an existing key
grouping by a leading subset of the key columns is faster because the groups are already gathered contiguously in RAM
simpler, shorter syntax; e.g. DT["id",] finds the group "id" in the first column of DT's key using binary search. It may be helpful to think of a key as super-charged rownames: multi-column and multi-type.
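The three uses listed above might look like the following in practice. This is only a sketch on a made-up table (DT, id, grp and v are illustrative names); it shows the syntax and makes no claim about timings.

```r
library(data.table)

# Hypothetical example table with a two-column key
DT <- data.table(id  = c("b", "a", "c", "a"),
                 grp = c(2L, 1L, 2L, 2L),
                 v   = 1:4)
setkey(DT, id, grp)

# Keyed lookup on the first key column
DT["a"]

# Keyed lookup on both key columns; .() is data.table's alias for list()
DT[.("a", 2L)]

# Grouping by a leading subset of the key columns
DT[, .(total = sum(v)), by = id]
```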
NAs are always first because:
An integer NA is internally INT_MIN (a large negative number) in R. Keys and indexes are always in increasing order, so if NAs are first, no special treatment or branch is needed in many data.table internals involving binary search. Placing NAs last is not offered as an option, for the sake of speed, simplicity and robustness of the internals at C level.
if any NAs are present then we believe it is better to display them up front (rather than hiding them at the end) to reduce the risk of not realizing NAs are present.
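A short illustration of the NA placement, using a made-up single-column table:

```r
library(data.table)

# Hypothetical integer column containing one NA
DT <- data.table(x = c(3L, NA, 1L, 2L))
setkey(DT, x)

DT$x  # the NA is sorted to the front, followed by 1, 2, 3
```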
In data.table parlance, all set* functions change their input
by reference. That is, no copy is made at all other than for temporary
working memory, which is as large as one column. The only other data.table
operator that modifies input by reference is :=. Check out the
See Also section below for other set* functions data.table
provides.
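One way to see the by-reference behaviour is to compare the table's memory address before and after a few modifications. The sketch below uses data.table's address() helper on a made-up table; the column names are illustrative, and the address comparison is expected to return TRUE when no copy has been made.

```r
library(data.table)

# Hypothetical example table
DT <- data.table(a = 1:3, b = c("x", "y", "z"))
before <- address(DT)

setnames(DT, "a", "A")             # rename a column in place
DT[, d := A * 2L]                  # add a column in place with :=
setcolorder(DT, c("d", "A", "b"))  # reorder columns in place

names(DT)                          # reflects all three changes
identical(address(DT), before)     # compares the address before and after
```

None of the calls above assigns a result back to DT, which is the practical consequence of modifying by reference.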
setindex creates an index for the provided columns. This index is simply an
ordering vector of the dataset's rows according to the provided columns. This order vector
is stored as an attribute of the data.table and the dataset retains the original order
of rows in memory. See vignette("datatable-secondary-indices-and-auto-indexing") for more details.
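As a rough sketch of the difference from setkey, the example below sets an index on a made-up table and checks that the row order is left alone. indices() is the data.table function that lists existing indices.

```r
library(data.table)

# Hypothetical example table, deliberately unsorted
DT <- data.table(id = c("b", "a", "c"), v = 1:3)
setindex(DT, id)

DT$id          # row order is unchanged
indices(DT)    # lists the index that was just created
key(DT)        # an index is not a key, so this is still NULL

DT[id == "a"]  # a subset of this form is able to make use of the index
```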
key returns the data.table's key if it exists; NULL if none exists.
haskey returns TRUE if the data.table has a key and FALSE otherwise.
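For example, on a small made-up table, the two functions behave as follows before and after a key is set:

```r
library(data.table)

# Hypothetical example table with no key yet
DT <- data.table(id = c("b", "a"), v = 1:2)

key(DT)     # NULL, because no key has been set
haskey(DT)  # FALSE

setkey(DT, id)

key(DT)     # "id"
haskey(DT)  # TRUE
```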