
fNdistinct is a generic function that (column-wise) computes the number of distinct values in x, (optionally) grouped by g. It is significantly faster than length(unique(x)). The TRA argument can further be used to transform x using its (grouped) distinct value count.
fNdistinct(x, ...)
# S3 method for default
fNdistinct(x, g = NULL, TRA = NULL, na.rm = TRUE,
use.g.names = TRUE, ...)
# S3 method for matrix
fNdistinct(x, g = NULL, TRA = NULL, na.rm = TRUE,
use.g.names = TRUE, drop = TRUE, ...)
# S3 method for data.frame
fNdistinct(x, g = NULL, TRA = NULL, na.rm = TRUE,
use.g.names = TRUE, drop = TRUE, ...)
# S3 method for grouped_df
fNdistinct(x, TRA = NULL, na.rm = TRUE,
use.g.names = FALSE, keep.group_vars = TRUE, ...)
x: a vector, matrix, data.frame or grouped tibble (dplyr::grouped_df).
TRA: an integer or quoted operator indicating the transformation to perform: 1 - "replace_fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See TRA.
na.rm: logical. TRUE: Skip missing values in x (faster computation). FALSE: Also consider 'NA' as one distinct value.
use.g.names: make group-names and add to the result as names (vector method) or row-names (matrix and data.frame method). No row-names are generated for data.tables and grouped tibbles.
drop: matrix and data.frame method: drop dimensions and return an atomic vector if g = NULL and TRA = NULL.
keep.group_vars: grouped_df method: Logical. FALSE removes grouping variables after computation.
...: arguments to be passed to or from other methods.
Integer. The number of distinct values in x, grouped by g, or (if TRA is used) x transformed by its distinct value count, grouped by g.
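To make "x transformed by its distinct value count" concrete, the following base-R sketch broadcasts each group's distinct count back onto the observations. It does not call fNdistinct; the variable names are hypothetical, and the mapping to TRA = "replace_fill" and TRA = "/" is an analogy based on the operator list above, not a statement about the package internals.

```r
# Hypothetical base-R sketch: transform x by its grouped distinct-value count.
x <- c(1, 1, 2, 3, 3, 3)
g <- c("a", "a", "a", "b", "b", "b")

# Distinct count of each observation's group, repeated to the length of x
n_by_obs <- ave(x, g, FUN = function(v) length(unique(v[!is.na(v)])))

x_fill <- n_by_obs      # analogous to TRA = "replace_fill"
x_div  <- x / n_by_obs  # analogous to TRA = "/"
```

Group "a" holds the values 1 and 2 and group "b" holds only 3, so every observation in "a" is paired with a count of 2 and every observation in "b" with a count of 1.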
fNdistinct implements a fast algorithm to find the number of distinct values utilizing index-hashing implemented in the Rcpp::sugar::IndexHash class.
If na.rm = TRUE (the default), missing values will be skipped, yielding substantial performance gains in data with many missing values. If na.rm = FALSE, missing values will simply be treated as any other value and read into the hash-map. Thus with the former, a numeric vector c(1.25,NaN,3.56,NA) will have a distinct value count of 2, whereas the latter will return a distinct value count of 4.
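The two counts for that example vector can be reproduced in base R. The helper ndistinct_ref below is a hypothetical name used only for illustration; it is a slow reference sketch of the semantics, not the hashing algorithm that fNdistinct uses.

```r
# Hypothetical base-R reference for the na.rm semantics (not the collapse implementation).
ndistinct_ref <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]  # is.na() is TRUE for both NA and NaN
  length(unique(x))             # unique() keeps NA and NaN as separate values
}

x <- c(1.25, NaN, 3.56, NA)
n_skip <- ndistinct_ref(x, na.rm = TRUE)   # counts only 1.25 and 3.56
n_keep <- ndistinct_ref(x, na.rm = FALSE)  # additionally counts NaN and NA
```

Skipping missing values leaves two distinct values, while keeping them adds NaN and NA as two further distinct values.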
Grouped computations are currently performed by mapping the data to a sparse-array directed by g and then hash-mapping each group. This is often not much slower than using a larger hash-map for the entire data when g = NULL.
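The result of a grouped computation can be pictured with a plain split-and-count in base R. This sketch swaps in split() and vapply() for the sparse-array and hash-map approach described above, so it shows the output only, not the performance characteristics; grouped_ndistinct_ref is a hypothetical helper name.

```r
# Hypothetical base-R sketch of a grouped distinct count (not how collapse computes it).
grouped_ndistinct_ref <- function(x, g, na.rm = TRUE) {
  vapply(split(x, g), function(v) {
    if (na.rm) v <- v[!is.na(v)]  # drop missing values within the group
    length(unique(v))             # distinct values remaining in the group
  }, integer(1))
}

x <- c(10, 10, 20, NA, 30, 30)
g <- c("a", "a", "a", "b", "b", "b")
res <- grouped_ndistinct_ref(x, g)  # named integer vector, one entry per group
```

Here group "a" contains 10 and 20, and group "b" contains only 30 once the NA is dropped.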
fNdistinct preserves all attributes of non-classed vectors / columns, and only the 'label' attribute (if available) of classed vectors / columns (i.e. dates or factors). When applied to data frames and matrices, the row-names are adjusted as necessary.
# NOT RUN {
library(collapse) # provides fNdistinct, qM and the wlddev dataset
## default vector method
fNdistinct(airquality$Solar.R) # Simple distinct value count
fNdistinct(airquality$Solar.R, airquality$Month) # Grouped distinct value count
## data.frame method
fNdistinct(airquality)
fNdistinct(airquality, airquality$Month)
fNdistinct(wlddev) # Works with data of all types!
head(fNdistinct(wlddev, wlddev$iso3c))
## matrix method
aqm <- qM(airquality)
fNdistinct(aqm) # Also works for character or logical matrices
fNdistinct(aqm, airquality$Month)
## method for grouped tibbles - for use with dplyr:
library(dplyr)
airquality %>% group_by(Month) %>% fNdistinct
wlddev %>% group_by(country) %>%
  select(PCGDP,LIFEEX,GINI,ODA) %>% fNdistinct
# }