bold.fetch: Retrieve data from the BOLD database

Description

Retrieves public and private user data from BOLD using one of several identifier types (processids, sampleids, BIN URIs, dataset codes or project codes).

Usage

bold.fetch(
  get_by,
  identifiers,
  cols = NULL,
  export = NULL,
  na.rm = FALSE,
  filt_taxonomy = NULL,
  filt_geography = NULL,
  filt_latitude = NULL,
  filt_longitude = NULL,
  filt_shapefile = NULL,
  filt_institutes = NULL,
  filt_identified.by = NULL,
  filt_seq_source = NULL,
  filt_marker = NULL,
  filt_collection_period = NULL,
  filt_basecount = NULL,
  filt_altitude = NULL,
  filt_depth = NULL
)

Value

A data frame containing all the information related to the supplied identifiers (e.g., processids or sampleids), after any filters have been applied.

Arguments

get_by: A character string specifying the parameter used to fetch data ("processid", "sampleid", "bin_uris", "dataset_codes" or "project_codes").
identifiers: A vector (or a data frame column) containing the identifiers that correspond to the specified get_by parameter.
cols: A character vector of one or more column names to keep in the final data frame. Default value is NULL.
export: A character string specifying the path where the file should be saved locally, including the file name and its extension (csv or tsv). Default value is NULL.
na.rm: A logical value specifying whether NA values should be removed from the BCDM data frame. Default value is FALSE.
filt_taxonomy: A character vector of one or more taxonomic names at any hierarchical level. Default value is NULL.
filt_geography: A character vector of one or more country/province/state/region/sector/site names or codes. Default value is NULL.
filt_latitude: A single number or a vector of two numbers specifying the latitudinal range in decimal degrees. Values should be separated by a comma. Default value is NULL.
filt_longitude: A single number or a vector of two numbers specifying the longitudinal range in decimal degrees. Values should be separated by a comma. Default value is NULL.
filt_shapefile: A file path pointing to a shapefile. Default value is NULL.
filt_institutes: A character vector of one or more institute names. Default value is NULL.
filt_identified.by: A character vector of one or more names of the people responsible for identifying the organism. Default value is NULL.
filt_seq_source: A character vector of one or more data portals from which the (sequence) data was mined. Default value is NULL.
filt_marker: A character vector of one or more gene names. Default value is NULL.
filt_collection_period: A single date or a vector of two dates specifying the collection period range (start, end). Values should be separated by a comma. Default value is NULL.
filt_basecount: A single number or a vector of two numbers specifying the range of the number of base pairs. Values should be separated by a comma. Default value is NULL.
filt_altitude: A single number or a vector of two numbers specifying the altitude range in meters. Values should be separated by a comma. Default value is NULL.
filt_depth: A single number or a vector of two numbers specifying the depth range. Values should be separated by a comma. Default value is NULL.

Details

bold.fetch retrieves both public and private user data, where private data refers to data that the user has permission to access. The data is downloaded in the Barcode Core Data Model (BCDM) format. Bulk downloads are supported through the get_by argument, which accepts one of "processid", "sampleid", "bin_uris", "dataset_codes" or "project_codes". Only one of these parameters can be specified per call. Multi-parameter searches that combine fields (e.g., "processid" + "sampleid" + "bin_uris") are not supported, even when several kinds of identifier are available.

Identifiers are supplied through the identifiers argument, which takes a character vector of one or more values for the chosen parameter. A data frame column can be used as input via the '$' operator (e.g., df$column_name). The get_by and identifiers arguments must match each other, otherwise the call will produce errors.

The filt_ (filter) arguments narrow the result to a user-defined subset of the data. All filt_ arguments must be named explicitly to avoid errors (e.g., filt_institutes = "CBG" instead of just "CBG").

The cols argument selects specific columns for inclusion in the final data frame. If it is left as NULL, all columns are downloaded. Providing a path for the export argument saves the data locally. The path must include the name of the output file with the corresponding extension, csv or tsv (e.g., 'C:/Users/xyz/Desktop/fetch_data_output.csv' on Windows).

There is a hard limit of 1 million records that can be downloaded in a single instance. Download speeds for very large requests using bin_uris, dataset_codes and project_codes are throttled, so such requests take longer to complete. Download speed also depends on the user's internet connection and computer specifications.

Downloaded data includes information (wherever available) for the columns listed in the field column of the output of bold.fields.info(), in the BCDM format. Metadata on the downloaded columns can also be obtained using bold.fields.info().
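
The input rules described above (one get_by value per call, identifiers passed as a vector or data frame column) can be checked before a request is made. The following is a minimal sketch in base R, not part of the package: prepare_identifiers() and split_into_batches() are hypothetical helper names, and the sample ids are made up. Batching by identifier count is only a rough guard against the record limit, since it assumes each identifier returns about one record, which need not hold for BINs, datasets or projects.

```r
# Hypothetical helpers (not part of the package); base R only.

# Allowed get_by values, as listed in the Arguments section.
allowed_get_by <- c("processid", "sampleid", "bin_uris",
                    "dataset_codes", "project_codes")

# Validate get_by and clean a vector (or data frame column) of identifiers:
# coerce to character, trim whitespace, drop NA/empty values and duplicates.
prepare_identifiers <- function(get_by, identifiers) {
  get_by <- match.arg(get_by, allowed_get_by)
  ids <- trimws(as.character(identifiers))
  ids <- unique(ids[!is.na(ids) & nzchar(ids)])
  if (length(ids) == 0) stop("no usable identifiers supplied")
  list(get_by = get_by, identifiers = ids)
}

# Split identifiers into batches of at most `size` elements.
# `size` is chosen by the user; this sketch does not know how many
# records each identifier will return.
split_into_batches <- function(ids, size) {
  stopifnot(size >= 1)
  unname(split(ids, ceiling(seq_along(ids) / size)))
}

# Made-up ids, used only to illustrate the cleaning step.
df <- data.frame(processid = c("ID-001", " ID-002", "ID-001", NA, ""),
                 stringsAsFactors = FALSE)
req <- prepare_identifiers("processid", df$processid)
batches <- split_into_batches(req$identifiers, size = 1)
```

Each element of batches could then be passed as the identifiers argument of a separate bold.fetch call, with req$get_by as get_by.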

Important Note: bold.apikey() must be run before bold.fetch to set up the API key that bold.fetch requires.
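
One way to avoid typing the key into a script is to read it from an environment variable and pass it to bold.apikey(). This is a sketch under assumptions: read_api_key() is a hypothetical helper, and the variable name BOLD_API_KEY is chosen here for illustration only; the package does not define either.

```r
# Hypothetical helper: read an API key from an environment variable.
# The variable name "BOLD_API_KEY" is an assumption made for this sketch.
read_api_key <- function(var = "BOLD_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("environment variable ", var, " is not set")
  }
  key
}

if (FALSE) {
  # Requires the package that provides bold.apikey() and a valid key.
  bold.apikey(read_api_key())
}
```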

Examples


if (FALSE) {
# Test data with processids
data(test.data)

# Fetch the data using the ids.
# 1. An API key must be obtained from BOLD support before using `bold.fetch()`.
# 2. Use `bold.apikey()` to set the API key in the global environment.

bold.apikey('apikey')

# With processids
res <- bold.fetch(get_by = "processid",
                  identifiers = test.data$processid)


# With sampleids
res <- bold.fetch(get_by = "sampleid",
                  identifiers = test.data$sampleid)

# With datasets (publicly available dataset provided)
res <- bold.fetch(get_by = "dataset_codes",
                  identifiers = "DS-IBOLR24")

## Using filters

# Geography
res <- bold.fetch(get_by = "processid",
                  identifiers = test.data$processid,
                  filt_geography = "Churchill")

# Sequence length
res <- bold.fetch(get_by = "processid",
                  identifiers = test.data$processid,
                  filt_basecount = c(500, 600))

# Gene marker & sequence length
res <- bold.fetch(get_by = "processid",
                  identifiers = test.data$processid,
                  filt_marker = "COI-5P",
                  filt_basecount = c(500, 600))
}
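
The cols and export arguments are described in Details but not shown in the examples. The sketch below builds an export path with a csv or tsv extension and passes it to bold.fetch together with a column selection. make_export_path() is a hypothetical helper, the temporary directory is used only for illustration, and the two column names are assumed to exist in the BCDM output.

```r
# Hypothetical helper: build an export path and check its extension.
make_export_path <- function(dir, file_name) {
  ext <- tolower(tools::file_ext(file_name))
  if (!ext %in% c("csv", "tsv")) {
    stop("file name must end in .csv or .tsv")
  }
  file.path(dir, file_name)
}

# tempdir() is used here only so the sketch runs anywhere.
out_path <- make_export_path(tempdir(), "fetch_data_output.csv")

if (FALSE) {
  # Requires an API key set with bold.apikey().
  # The column names below are assumed to be valid BCDM fields.
  res <- bold.fetch(get_by = "processid",
                    identifiers = test.data$processid,
                    cols = c("processid", "sampleid"),
                    export = out_path)
}
```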
