NOT_CRAN <- identical(tolower(Sys.getenv("NOT_CRAN")), "true")
knitr::opts_chunk$set(
  comment = "#>",
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  purl = NOT_CRAN,
  eval = NOT_CRAN
)
rgbif now has the ability to clean data retrieved from GBIF based on GBIF issues. These issues are returned in data retrieved from GBIF, e.g., through the occ_search() function. Inspired by magrittr, we've set up a workflow for cleaning data using the %>% operator. You don't have to use it, but as we show below, it can make the process quite easy.
Note that you can also query based on issues, e.g., occ_search(taxonKey=1, issue='DEPTH_UNLIKELY'). However, we imagine it's more likely that you want to search for occurrences based on a taxonomic name or geographic area, not based on issues, so it makes sense to pull data down, then clean it as needed using the workflow below with occ_issues().
occ_issues() only affects the data element in the gbif class that is returned from a call to occ_search(). In a future version we may remove the associated records from the hierarchy and media elements as they are removed from the data element.
occ_issues() also works with data from
Install from CRAN
Or install the development version from GitHub
Get some data
Get taxon key for Helianthus annuus
(key <- name_suggest(q='Helianthus annuus', rank='species')$key)
Then pass the key to occ_search()
(res <- occ_search(taxonKey=key, limit=100))
The gbifissues dataset can be retrieved using the function gbif_issues(). Its first column, code, is the short code used by default in the results from occ_search(), while the second column, issue, is the full issue name given by GBIF. The third column is a full description of the issue.
You can query to get certain issues
gbif_issues()[ gbif_issues()$code %in% c('cdround','cudc','gass84','txmathi'), ]
The code cdround represents the GBIF issue COORDINATE_ROUNDED, which means that the original coordinate was modified by rounding to 5 decimals.
This information comes from https://gbif.github.io/gbif-api/apidocs/org/gbif/api/vocabulary/OccurrenceIssue.html
Parse data based on issues
Now that we know a bit about GBIF issues, you can parse your data based on issues. Using the data generated above and the %>% operator imported from magrittr, we can get only the data with the issue GEODETIC_DATUM_ASSUMED_WGS84 (note how the number of records returned goes down to 98 from the initial 100).
res %>% occ_issues(gass84)
Note also that we've set up
occ_issues() so that you can pass in issue names without having to quote them, thereby speeding up data cleaning.
Next, we can remove data with certain issues just as easily by putting a - sign in front of the variable. Here we remove data with the issues depunl and mdatunl:
res %>% occ_issues(-depunl, -mdatunl)
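To make the include and exclude semantics concrete, here is a minimal base-R sketch on mock data. It assumes that each record carries its issue codes as a comma-separated string in an issues column; has_issue() is a hypothetical helper written for this illustration, not part of rgbif, and the sketch is not a description of how occ_issues() is implemented.

```r
# Mock records; `issues` holds comma-separated issue codes (an assumption
# about the data layout, made for illustration only).
records <- data.frame(
  key = 1:5,
  issues = c("cdround,gass84", "gass84", "depunl", "mdatunl,gass84", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not an rgbif function): TRUE where a record has `code`.
has_issue <- function(issues, code) {
  vapply(
    strsplit(issues, ",", fixed = TRUE),
    function(codes) code %in% codes,
    logical(1)
  )
}

# Keep only records flagged with gass84 (analogous to occ_issues(gass84)).
with_gass84 <- records[has_issue(records$issues, "gass84"), ]

# Drop records flagged with depunl or mdatunl
# (analogous to occ_issues(-depunl, -mdatunl)).
cleaned <- records[
  !has_issue(records$issues, "depunl") & !has_issue(records$issues, "mdatunl"),
]
```

Records with no issues at all survive the exclusion step but are dropped by the inclusion step, which is worth keeping in mind when chaining filters.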
Expand issue codes to full names
Another thing we can do with
occ_issues() is go from issue codes to full issue names in case you want those in your dataset (here, showing only a few columns to see the data better for this demo):
out <- res %>% occ_issues(mutate = "expand")
head(out$data[,c(1,5)])
Sometimes you may want to have each type of issue as a separate column.
Split out each issue type into a separate column, with the number of columns equal to the number of issue types.
out <- res %>% occ_issues(mutate = "split")
head(out$data[,c(1,5:10)])
Expand and add columns
Or you can expand each issue type into its full name, and split each issue into a separate column.
out <- res %>% occ_issues(mutate = "split_expand")
head(out$data[,c(1,5:10)])
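The three mutate modes can be pictured with a small base-R sketch on mock data. Everything here is an illustration under stated assumptions: the issues column is assumed to hold comma-separated codes, the lookup vector covers only the two issues named in this document, and the logical columns are a choice of this sketch (the values rgbif writes into the split columns may differ).

```r
# Mock records with comma-separated issue codes (assumed layout).
records <- data.frame(
  key = 1:3,
  issues = c("cdround,gass84", "gass84", ""),
  stringsAsFactors = FALSE
)

# Code-to-name lookup for the two issues discussed in this document.
full_names <- c(
  cdround = "COORDINATE_ROUNDED",
  gass84  = "GEODETIC_DATUM_ASSUMED_WGS84"
)

code_list <- strsplit(records$issues, ",", fixed = TRUE)

# "expand": replace each short code with its full issue name.
expanded <- vapply(
  code_list,
  function(codes) paste(unname(full_names[codes]), collapse = ","),
  character(1)
)

# "split": one logical column per issue code found in the data.
all_codes <- sort(unique(unlist(code_list)))
flags <- sapply(all_codes, function(code) {
  vapply(code_list, function(codes) code %in% codes, logical(1))
})
split_out <- cbind(records["key"], flags)

# "split_expand": the same columns, named by full issue name.
split_expand_out <- split_out
names(split_expand_out)[-1] <- unname(full_names[all_codes])
```

The split form is convenient when you want to tabulate or filter on individual issues with ordinary data frame operations.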
We hope this helps users get just the data they want, and nothing more. Let us know if you have feedback on data cleaning functionality in
rgbif at firstname.lastname@example.org or at https://github.com/ropensci/rgbif/issues.