idig_search_records: Searching of iDigBio records

Description

Function to query the iDigBio API for specimen records

Usage

idig_search_records(rq, fields = FALSE, max_items = 1e+05, limit = 0, offset = 0, sort = FALSE, ...)

Arguments

rq

iDigBio record query in nested list format

fields

vector of fields to include in the data.frame; a limited set is returned by default, use "all" to get all indexed fields

max_items

maximum number of results allowed to be retrieved (fail-safe)

limit

maximum number of results returned

offset

number of results to skip before returning results

sort

vector of fields to use for sorting; UUID is always appended to make paging safe

...

additional parameters

Value

a data frame

Details

Wraps idig_search to provide defaults specific to searching specimen records. Using this function instead of calling idig_search directly is recommended.

Queries need to be specified as a nested list structure that will serialize to an iDigBio query object's JSON as expected by the iDigBio API: https://github.com/iDigBio/idigbio-search-api/wiki/Query-Format

As an example, the first sample query looks like this in JSON in the API documentation:

{
  "scientificname": {
    "type": "exists"
  },
  "family": "asteraceae"
}

To rewrite this in R for use as the rq parameter to idig_search_records or idig_search_media, it would look like this:

rq <- list("scientificname"=list("type"="exists"), 
           "family"="asteraceae"
           )

An example of a more complex JSON query with nested structures:

{
  "geopoint": {
    "type": "geo_bounding_box",
    "top_left": {
      "lat": 19.23,
      "lon": -130
    },
    "bottom_right": {
      "lat": -45.1119,
      "lon": 179.99999
    }
  }
}

To rewrite this in R for use as the rq parameter, use nested calls to the list() function:

rq <- list(geopoint=list(
                         type="geo_bounding_box", 
                         top_left=list(lat=19.23, lon=-130), 
                         bottom_right=list(lat=-45.1119, lon= 179.99999)
                        )
           )
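One way to check that a nested list serializes to the intended JSON is to run it through a JSON encoder before sending the query. The sketch below assumes the jsonlite package is installed; whether its output matches exactly how rq is serialized internally is an assumption, so treat it as a sanity check rather than a guarantee. auto_unbox = TRUE keeps length-one vectors as scalars, and digits = NA avoids rounding coordinates such as 179.99999.

```r
library(jsonlite)

rq <- list(geopoint = list(
  type = "geo_bounding_box",
  top_left = list(lat = 19.23, lon = -130),
  bottom_right = list(lat = -45.1119, lon = 179.99999)
))

# Serialize the nested list. auto_unbox keeps scalars as scalars instead of
# one-element arrays; digits = NA prints numbers at full precision.
rq_json <- toJSON(rq, auto_unbox = TRUE, pretty = TRUE, digits = NA)
cat(rq_json, "\n")

# Round-trip back to R to confirm the nesting survived.
rq_back <- fromJSON(rq_json)
str(rq_back)
```

If the printed JSON has the same keys and nesting as the query in the API documentation, the list is structured correctly.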

See the Examples section below for more samples of simpler and more complex queries. Please refer to the API documentation for the full functionality available in queries.

All matching results are returned up to the max_items cap (default 100,000). If more results are wanted, pass a higher max_items. Records are loaded over HTTP 5,000 at a time, so large result sets are slow: expect result sets over 50,000 records to take tens of minutes. You can use the idig_count_records or idig_count_media functions to find out how many records a query will return; these are fast.
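The count-then-fetch pattern described above can be sketched as follows. fetch_if_manageable is a hypothetical helper, not part of the package, and the 50,000 threshold is simply the figure quoted above. The count and search functions are passed in as arguments so the logic can be tried with stand-ins when no network is available.

```r
# Hypothetical helper (not part of ridigbio): count first, then fetch only
# if the result set is small enough to download in reasonable time.
fetch_if_manageable <- function(rq, threshold = 50000,
                                count_fn = ridigbio::idig_count_records,
                                search_fn = ridigbio::idig_search_records,
                                ...) {
  n <- count_fn(rq = rq)  # fast: no records are downloaded
  if (n > threshold) {
    message("Query matches ", n, " records; narrow it or raise the threshold.")
    return(NULL)
  }
  search_fn(rq = rq, ...)  # extra arguments (fields, sort, ...) pass through
}

# Usage (not run; requires network access):
# df <- fetch_if_manageable(list(genus = "acer"), fields = c("uuid", "genus"))
```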

The iDigBio API will only return 5,000 records at a time, but this function automatically pages through the results and returns them all. limit and offset are available if manual paging of results is needed, though the max_items cap still applies. The item count is taken from the results header, not from the number of records actually in the limit/offset window.
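For cases where manual paging is wanted, a loop over limit and offset might look like the sketch below. fetch_pages, page_size and n_pages are hypothetical names introduced for illustration. The search function is passed in as an argument so the loop can be tried offline with a stand-in.

```r
# Hypothetical helper (not part of ridigbio): page manually with limit/offset.
fetch_pages <- function(rq, page_size = 1000, n_pages = 3,
                        search_fn = ridigbio::idig_search_records, ...) {
  pages <- list()
  for (i in seq_len(n_pages)) {
    page <- search_fn(rq = rq, limit = page_size,
                      offset = (i - 1) * page_size, ...)
    if (nrow(page) == 0) break          # nothing left to fetch
    pages[[i]] <- page
    if (nrow(page) < page_size) break   # partial page: this was the last one
  }
  if (length(pages) == 0) return(data.frame())
  do.call(rbind, pages)                 # stack pages into one data.frame
}

# Usage (not run; requires network access):
# df <- fetch_pages(list(genus = "acer"), page_size = 500, n_pages = 4,
#                   fields = c("uuid", "genus"))
```

Passing the same fields to every call keeps the columns identical across pages, which rbind requires.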

The return value is a data.frame containing the requested fields (or the default fields). The columns in the data frame are untyped and no factors are pre-built. Attribution and other metadata are attached to the data.frame as attributes (i.e. attributes(df)).
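To see which metadata is attached, one option is to list the attributes that go beyond the three every data.frame carries. extra_attributes below is a hypothetical helper. The names of the attributes actually set by the package are not stated here, so the sketch discovers them rather than assuming them.

```r
# Hypothetical helper: names of attributes other than the standard
# data.frame ones (names, row.names, class).
extra_attributes <- function(df) {
  setdiff(names(attributes(df)), c("names", "row.names", "class"))
}

# Usage (not run; requires network access):
# df <- ridigbio::idig_search_records(rq = list(genus = "acer"), limit = 10)
# extra_attributes(df)                   # which metadata is present
# attributes(df)[extra_attributes(df)]   # the metadata itself
```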

Examples

## Not run: 
# # Simple example of retrieving records in a genus:
# idig_search_records(rq=list(genus="acer"), limit=10)
# 
# # This complex query shows that booleans passed to the API are represented
# # as strings in R, fields used in the query don't have to be returned, and
# # the syntax for accessing raw data fields:
# idig_search_records(rq=list("hasImage"="true", genus="acer"), 
#             fields=c("uuid", "data.dwc:verbatimLatitude"), limit=100)
# 
# # Searching inside a raw data field for a string. Note that raw data fields
# # are searched as full text; indexed fields are searched with exact matches:
# 
# idig_search_records(rq=list("data.dwc:dynamicProperties"="parasite"), 
#             fields=c("uuid", "data.dwc:dynamicProperties"), limit=100)
# 
# # Retrieving a data.frame for use with MaxEnt. Notice geopoint is expanded
# # to two columns in the data.frame: geopoint.lat and geopoint.lon:
# df <- idig_search_records(rq=list(genus="acer", geopoint=list(type="exists")), 
#           fields=c("uuid", "geopoint"), limit=10)
# write.csv(df[c("uuid", "geopoint.lon", "geopoint.lat")], 
#           file="acer_occurrences.csv", row.names=FALSE)
#           
# ## End(Not run)
