duplicated: Determine Duplicate Rows

Description

duplicated returns a logical vector indicating which rows of a data.table are duplicates of a row with a smaller subscript. Rows are compared by the key columns or, when there is no key, by all columns.

unique returns a data.table with duplicated rows (by key) removed, or (when no key) duplicated rows by all columns removed.

anyDuplicated returns the index i of the first duplicated entry if there is one, and 0 otherwise.

uniqueN is equivalent to length(unique(x)) when x is an atomic vector, and nrow(unique(x)) when x is a data.frame or data.table. The number of unique rows is computed directly, without materialising the intermediate unique data.table, and is therefore faster and more memory efficient.

Usage

## S3 method for class 'data.table'
duplicated(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), ...)
## S3 method for class 'data.table'
unique(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), ...)
## S3 method for class 'data.table'
anyDuplicated(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), ...)
uniqueN(x, by=if (is.list(x)) seq_along(x) else NULL, na.rm=FALSE)

Arguments

x

A data.table. uniqueN accepts atomic vectors and data.frames as well.

...

Not used at this time.

incomparables

Not used. Here for S3 method consistency.

fromLast

logical indicating if duplication should be considered from the reverse side, i.e., the last (or rightmost) of identical elements would correspond to duplicated = FALSE.

by

character or integer vector indicating which combinations of columns from x to use for uniqueness checks. Defaults to key(x), which only uses the keyed columns. by=NULL uses all columns and acts like the analogous data.frame methods.

na.rm

Logical (default is FALSE). Should missing values (including NaN) be removed?

Value

duplicated returns a logical vector of length nrow(x) indicating which rows are duplicates.

unique returns a data.table with duplicated rows removed.

anyDuplicated returns an integer value with the index of the first duplicate. If none exists, 0L is returned.

uniqueN returns the number of unique elements in the vector, data.frame or data.table.

Details

Because data.tables are usually sorted by key, tests for duplication are especially quick when only the keyed columns are considered. Unlike unique.data.frame, paste is not used to ensure equality of floating point data. Equality is instead tested directly, which is quite fast. data.table provides setNumericRounding to handle cases where limitations in floating point representation are undesirable.

v1.9.4 introduced an anyDuplicated method for data.tables, similar in functionality to the base function. It also implemented the logical argument fromLast for all three functions, with default value FALSE.

Any combination of columns (not just the key columns) can be used to test for uniqueness; they are specified via the by parameter. To get the analogous data.frame functionality, set by to NULL.

Examples

DT <- data.table(A = rep(1:3, each=4), B = rep(1:4, each=3),
                 C = rep(1:2, 6), key = c("A", "B"))
duplicated(DT)
unique(DT)

duplicated(DT, by="B")
unique(DT, by="B")

duplicated(DT, by=c("A", "C"))
unique(DT, by=c("A", "C"))

DT = data.table(a=c(2L,1L,2L), b=c(1L,2L,1L))   # no key
unique(DT)                   # rows 1 and 2 (row 3 is a duplicate of row 1)

DT = data.table(a=c(3.142, 4.2, 4.2, 3.142, 1.223, 1.223), b=rep(1,6))
unique(DT)                   # rows 1,2 and 5

DT = data.table(a=tan(pi*(1/4 + 1:10)), b=rep(1,10))   # example from ?all.equal
length(unique(DT$a))         # 10 strictly unique floating point values
all.equal(DT$a,rep(1,10))    # TRUE, all within tolerance of 1.0
DT[,which.min(a)]            # row 10, the strictly smallest floating point value
identical(unique(DT),DT[1])  # TRUE, stable within tolerance
identical(unique(DT),DT[10]) # FALSE
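
# Numeric rounding (see Details): a hedged sketch, not part of the original examples.
# The vector `fp` is made up for illustration; no expected output is stated because
# the counts depend on the rounding setting and on the data.table version in use.
library(data.table)
old <- getNumericRounding()     # remember the current setting
fp <- c(0.3, 0.1 + 0.2)         # two doubles that differ only in the last bits
setNumericRounding(0L)          # compare doubles exactly
n_exact <- uniqueN(fp)
setNumericRounding(2L)          # round off the last 2 bytes before comparing
n_rounded <- uniqueN(fp)
setNumericRounding(old)         # restore the previous setting
c(exact = n_exact, rounded = n_rounded)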

# fromLast=TRUE
DT <- data.table(A = rep(1:3, each=4), B = rep(1:4, each=3),
                 C = rep(1:2, 6), key = c("A", "B"))
duplicated(DT, by="B", fromLast=TRUE)
unique(DT, by="B", fromLast=TRUE)

# anyDuplicated
anyDuplicated(DT, by=c("A", "B"))    # 2L, row 2 repeats row 1 on (A, B)
any(duplicated(DT, by=c("A", "B")))  # TRUE
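
# How the four functions relate: a small sketch, not part of the original examples.
# `by` is always given explicitly so the result does not depend on its default;
# the table DT2 and the name `cols` are made up for illustration.
library(data.table)
DT2 <- data.table(A = c(1L, 1L, 2L, 2L, 2L), B = c("x", "x", "y", "z", "y"))
cols <- c("A", "B")
dup <- duplicated(DT2, by = cols)                  # one logical flag per row
anyDuplicated(DT2, by = cols) == which(dup)[1L]    # index of the first flagged row
nrow(unique(DT2, by = cols)) == sum(!dup)          # unique keeps the unflagged rows
uniqueN(DT2, by = cols) == sum(!dup)               # same count, without building the table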

# uniqueN, unique rows on key columns
uniqueN(DT)
# uniqueN, unique rows on all columns
uniqueN(DT, by=NULL)
# uniqueN while grouped by "A"
DT[, .(uN=uniqueN(.SD)), by=A]

# uniqueN's na.rm=TRUE
x = sample(c(NA, NaN, runif(3)), 10, TRUE)
uniqueN(x, na.rm = FALSE) # default; at most 5 (NA, NaN and 3 numbers), x is sampled at random
uniqueN(x, na.rm = TRUE)  # at most 3, NA and NaN are dropped
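
# A deterministic variant, added as a sketch; the vectors `v` and `y` are made up
# for illustration and are not part of the original examples.
library(data.table)
v <- c(3L, 1L, 3L, 2L, 1L)
uniqueN(v) == length(unique(v))    # TRUE, the equivalence given in the Description
y <- c(NA, 1, 1, NaN, 2)
uniqueN(y)                         # NA and NaN are counted along with 1 and 2
uniqueN(y, na.rm = TRUE)           # only 1 and 2 are counted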
