data.table (version 1.9.4)

duplicated: Determine Duplicate Rows

Description

duplicated returns a logical vector indicating which rows of a data.table are duplicates of a row with smaller subscripts (by key).

unique returns a data.table with duplicate rows removed, where duplicates are determined by the key columns or, when there is no key, by all columns.

anyDuplicated returns the index i of the first duplicated entry if there is one, and 0 otherwise.
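
For example, a minimal sketch of the three methods on a small keyed table (the table and column names are purely illustrative):

    library(data.table)
    dt <- data.table(x = c(1L, 1L, 2L), y = c("a", "a", "b"), key = "x")
    duplicated(dt)     # FALSE TRUE FALSE: row 2 repeats the key value of row 1
    unique(dt)         # rows 1 and 3 only
    anyDuplicated(dt)  # 2L, the index of the first duplicated row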

Usage

## S3 method for class 'data.table':
duplicated(x, incomparables=FALSE, fromLast=FALSE, by=key(x), ...)

## S3 method for class 'data.table':
unique(x, incomparables=FALSE, fromLast=FALSE, by=key(x), ...)

## S3 method for class 'data.table':
anyDuplicated(x, incomparables=FALSE, fromLast=FALSE, by=key(x), ...)

Arguments

x
A data.table.
...
Not used at this time.
incomparables
Not used. Here for S3 method consistency.
fromLast
logical indicating if duplication should be considered from the reverse side, i.e., the last (or rightmost) of identical elements would correspond to duplicated = FALSE.
by
character or integer vector indicating which combinations of columns from x to use for uniqueness checks. Defaults to key(x) which, by default, uses only the keyed columns. by=NULL uses all columns and behaves like the analogous data.frame methods.
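
A small sketch of how by changes which columns are compared (the table is purely illustrative):

    library(data.table)
    dt <- data.table(a = c(1L, 1L, 2L), b = c(1L, 2L, 1L), key = "a")
    duplicated(dt)             # by = key(dt) = "a":  FALSE  TRUE FALSE
    duplicated(dt, by = NULL)  # all columns, as for a data.frame:  FALSE FALSE FALSE
    duplicated(dt, by = "b")   # only column b:  FALSE FALSE  TRUE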

Value

  • duplicated returns a logical vector of length nrow(x) indicating which rows are duplicates.

  • unique returns a data.table with the duplicate rows removed.

  • anyDuplicated returns an integer value with the index of the first duplicate. If none exists, 0L is returned.

Details

Because data.tables are usually sorted by key, tests for duplication are especially quick when only the keyed columns are considered. Unlike unique.data.frame, paste is not used to ensure equality of floating point data. This is done directly (for speed) whilst still respecting tolerance in the same spirit as all.equal.
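
As a rough sketch of this tolerance behaviour (the exact threshold depends on the numeric rounding in effect, see setNumericRounding, so the result noted in the last comment is indicative rather than guaranteed):

    library(data.table)
    DTfp <- data.table(a = c(1, 1 + 1e-12))
    DTfp$a[1] == DTfp$a[2]   # FALSE: the two doubles differ at the bit level
    nrow(unique(DTfp))       # expected to be 1: both values fall within the default tolerance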

Any combination of columns can be used to test for uniqueness (not just the key columns); they are specified via the by argument. To get the analogous data.frame functionality for unique and duplicated, set by to NULL.

From v1.9.4, both the duplicated and unique methods also gain the logical argument fromLast, as in base R, which is FALSE by default. Conceptually, duplicated(x, by=cols, fromLast=TRUE) is equivalent to rev(duplicated(rev(x), by=cols)) but is much faster. rev(x) is used only to illustrate the concept, as it strictly applies to vectors; in the context of a data.table, rev(x) would mean rearranging the rows of all columns in reverse order.

v1.9.4 also implements an anyDuplicated method for data.table. It returns the index of the first duplicated row if one exists, and 0 otherwise. It is very similar to any(duplicated(DT)), which returns TRUE or FALSE rather than an index.
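
A small sketch of the fromLast and anyDuplicated behaviour described above, on a plain unkeyed table (the table and values are purely illustrative):

    library(data.table)
    DTx <- data.table(a = c(1L, 2L, 1L, 2L))
    duplicated(DTx, by = "a")                   # FALSE FALSE  TRUE  TRUE
    duplicated(DTx, by = "a", fromLast = TRUE)  #  TRUE  TRUE FALSE FALSE
    anyDuplicated(DTx, by = "a")                # 3L, the index of the first duplicate
    any(duplicated(DTx, by = "a"))              # TRUE, a logical rather than an index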

See Also

data.table, duplicated, unique, all.equal

Examples

    library(data.table)
    DT <- data.table(A = rep(1:3, each=4), B = rep(1:4, each=3), C = rep(1:2, 6), key = "A,B")
    duplicated(DT)
    unique(DT)

    duplicated(DT, by="B")
    unique(DT, by="B")

    duplicated(DT, by=c("A", "C"))
    unique(DT, by=c("A", "C"))

    DT = data.table(a=c(2L,1L,2L), b=c(1L,2L,1L))   # no key
    unique(DT)                   # rows 1 and 2 (row 3 is a duplicate of row 1)

    DT = data.table(a=c(3.142, 4.2, 4.2, 3.142, 1.223, 1.223), b=rep(1,6))
    unique(DT)                   # rows 1,2 and 5

    DT = data.table(a=tan(pi*(1/4 + 1:10)), b=rep(1,10))   # example from ?all.equal
    length(unique(DT$a))         # 10 strictly unique floating point values
    all.equal(DT$a,rep(1,10))    # TRUE, all within tolerance of 1.0
    DT[,which.min(a)]            # row 10, the strictly smallest floating point value
    identical(unique(DT),DT[1])  # TRUE, stable within tolerance
    identical(unique(DT),DT[10]) # FALSE

    # fromLast=TRUE
    DT <- data.table(A = rep(1:3, each=4), B = rep(1:4, each=3), C = rep(1:2, 6), key = "A,B")
    duplicated(DT, by="B", fromLast=TRUE)
    unique(DT, by="B", fromLast=TRUE)

    # anyDuplicated
    anyDuplicated(DT, by=c("A", "B"))    # 3L
    any(duplicated(DT, by=c("A", "B")))  # TRUE
