distinct.data_request: Keep distinct/unique rows

Description

Keep only unique/distinct rows from a data frame. This is similar to unique.data.frame() but considerably faster. It is evaluated lazily.

Usage

# S3 method for data_request
distinct(.data, ..., .keep_all = FALSE)

Arguments

.data: A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details.
...: Variables to use when determining uniqueness. Unlike the dplyr implementation, this must be set for the function to have any effect, and only a single variable is used.
.keep_all: If TRUE, keep all variables in .data. Defaults to FALSE.

Details

This function has several potential uses. In its default mode, it simply shows the unique values for a supplied field:

galah_call() |>
  distinct(basisOfRecord) |> 
  collect()
# A tibble: 9 × 1
  basisOfRecord      
  <chr>              
1 HUMAN_OBSERVATION  
2 PRESERVED_SPECIMEN 
3 OCCURRENCE         
4 MACHINE_OBSERVATION
5 OBSERVATION        
6 MATERIAL_SAMPLE    
7 LIVING_SPECIMEN    
8 FOSSIL_SPECIMEN    
9 MATERIAL_CITATION

This is the same result as you would get using show_values():

search_all(fields, "basisOfRecord") |> 
  show_values()

Using distinct() is somewhat more reliable, however, as it doesn't rely on searching the tibble returned by show_all(fields). It is also more efficient, particularly when caching is turned off. If the goal is to retrieve the number of levels of a factor, use:

galah_call() |>
  distinct(basisOfRecord) |> 
  count() |>
  collect()
# A tibble: 1 × 1
  count
  <int>
1     9

When the variable passed to distinct() in the above example is speciesID, this is identical to calling:

atlas_counts(type = "species")

You can also add group_by() to count the distinct values of one variable within each level of a second variable:

galah_call() |>
  identify("Perameles") |>
  distinct(speciesID) |> 
  group_by(basisOfRecord) |>
  count() |>
  collect()
# A tibble: 8 × 2
  basisOfRecord       count
  <chr>               <int>
1 Human observation       7
2 Preserved specimen      9
3 Machine observation     2
4 Observation             3
5 Occurrence              3
6 Material Sample         4
7 Fossil specimen         1
8 Living specimen         1
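
As a rough local analogy, and assuming the query above counts distinct speciesID values within each basisOfRecord, the same idea can be sketched with dplyr on an ordinary tibble. The data below are invented for illustration and do not come from any atlas:

```r
library(dplyr)

# Hypothetical occurrence records (values are made up)
occurrences <- tibble(
  speciesID = c("sp_1", "sp_1", "sp_2", "sp_2", "sp_3"),
  basisOfRecord = c(
    "HUMAN_OBSERVATION", "HUMAN_OBSERVATION", "HUMAN_OBSERVATION",
    "PRESERVED_SPECIMEN", "PRESERVED_SPECIMEN"
  )
)

# Count how many distinct species occur within each basis of record
species_per_basis <- occurrences |>
  group_by(basisOfRecord) |>
  summarise(count = n_distinct(speciesID))

species_per_basis
```

Unlike this local version, the data_request pipeline is evaluated lazily and only returns results once collect() is called.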

By setting .keep_all = TRUE, we get more information on each record. Due to limits on the APIs this is not a perfect analogy for running dplyr::distinct() on raw occurrences; but it does allow us to generalise atlas_species() to use any taxonomic identifier. For example, we might choose to show data by family instead of species:

galah_call() |>
  identify("Coleoptera") |>
  distinct(familyID, .keep_all = TRUE) |> 
  collect()
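
For comparison, the effect of .keep_all in dplyr::distinct() on a local data frame can be sketched as follows. The tibble and its values are hypothetical, and as noted above the data_request method is only an approximate analogue of this behaviour:

```r
library(dplyr)

# Hypothetical occurrence records; values are invented for illustration only
occurrences <- tibble(
  recordID = c("a1", "a2", "a3", "a4"),
  familyID = c("fam_1", "fam_1", "fam_2", "fam_2"),
  year     = c(2019L, 2021L, 2020L, 2022L)
)

# Default: only the column used to determine uniqueness is returned
families_only <- occurrences |>
  distinct(familyID)

# .keep_all = TRUE: the first row found for each distinct value is kept,
# together with all of its other columns
families_all <- occurrences |>
  distinct(familyID, .keep_all = TRUE)

families_only
families_all
```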

Using group_by() is also valid:

galah_call() |>
  filter(year == 2024,
         genus == "Crinia") |>
  group_by(speciesID) |>
  distinct(.keep_all = TRUE) |>
  collapse()

In this case, collect() and atlas_species() are synonymous, with the exception that the latter does not require you to set the .keep_all argument to TRUE. So you could instead use:

galah_call() |>
  identify("Coleoptera") |>
  distinct(familyID) |> 
  atlas_species()

Examples

if (FALSE) {
galah_call() |>
  distinct(basisOfRecord) |>
  count() |>
  collect()
}
