BOLDconnectR (version 1.0.0)

bold.data.summarize: Generate specific summaries from the downloaded BCDM data

Description

This function generates different types of data summaries for BCDM data downloaded via the bold.fetch function.

Usage

bold.data.summarize(
  bold_df,
  summary_type = c("concise_summary", "detailed_taxon_counts", "barcode_summary",
    "data_completeness"),
  primer_f = NULL,
  primer_r = NULL,
  rem_na_bin = FALSE
)

Value

An output list containing:

  • A data frame containing the summary requested via summary_type.

  • A bar chart, in addition to the data frame, when summary_type = "data_completeness".

Arguments

bold_df

The data.frame retrieved by the bold.fetch() function.

summary_type

A character string specifying the type of summary required ('concise_summary', 'detailed_taxon_counts', 'barcode_summary', 'data_completeness', 'all').

primer_f

A character string specifying the forward primer. Default value is NULL.

primer_r

A character string specifying the reverse primer. Default value is NULL.

rem_na_bin

A logical value specifying whether NA BINs should be removed from the BCDM dataframe. Default value is FALSE.
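
As a rough illustration of what removing NA BINs means, the base R sketch below drops records without a BIN from a toy data frame. The toy data and the column name bin_uri are assumptions made for this example; this is not the package's internal code.

```r
# Toy stand-in for a BCDM data frame; the column names are assumptions.
toy_bcdm <- data.frame(
  processid = c("P1", "P2", "P3", "P4"),
  bin_uri   = c("BOLD:AAA0001", NA, "BOLD:AAA0002", NA),
  stringsAsFactors = FALSE
)

# Keep only the records that have a BIN assigned.
with_bin <- toy_bcdm[!is.na(toy_bcdm$bin_uri), , drop = FALSE]

nrow(toy_bcdm)  # number of records before filtering
nrow(with_bin)  # number of records after filtering
```

If every record lacks a BIN, the filtered data frame has zero rows, which is why summaries built on it can come back empty.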

Details

bold.data.summarize provides different types of data summaries for the downloaded BCDM dataset. Current options include:

  • concise_summary = A high-level overview of the downloaded data, including total records and counts of unique BINs, countries, institutes, etc.

  • data_completeness = A data profile that reports missing data and the proportion of complete cases for each field in the BCDM data, along with data-type-specific insights such as distribution, average and median values for numeric data. It also provides a bar chart visualizing the missing data and total records.

  • detailed_taxon_counts = Taxonomy-focused counts of total records with and without BINs, unique countries and institutes.

  • barcode_summary = BIN-focused summary of nucleotide base pair length, number of ambiguous base pairs (if present), and presence of primer sequences (forward and/or reverse) in the sequence, along with the processid, country and institute associated with the BIN. Setting rem_na_bin = TRUE removes all records that do not have a BIN (note that this can sometimes result in empty data frames because of the large amount of missing data). The forward or reverse primer also needs to be specified. Details on all or specific fields can be checked using bold.field.info().
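
The kind of information reported by concise_summary and data_completeness can be approximated in base R as follows. This is only a sketch on toy data: the column names (bin_uri, country.ocean, inst) and the output layout are assumptions, and the actual function returns richer summaries.

```r
# Toy stand-in for a BCDM data frame; the column names are assumptions.
toy_bcdm <- data.frame(
  processid     = c("P1", "P2", "P3", "P4"),
  bin_uri       = c("BOLD:AAA0001", NA, "BOLD:AAA0001", "BOLD:AAA0002"),
  country.ocean = c("Kenya", "Kenya", NA, "Egypt"),
  inst          = c("Inst A", "Inst B", "Inst A", NA),
  stringsAsFactors = FALSE
)

# Helper: number of distinct non-missing values in a vector.
n_unique <- function(x) length(unique(x[!is.na(x)]))

# High-level overview, in the spirit of concise_summary.
concise <- data.frame(
  total_records     = nrow(toy_bcdm),
  unique_bins       = n_unique(toy_bcdm$bin_uri),
  unique_countries  = n_unique(toy_bcdm$country.ocean),
  unique_institutes = n_unique(toy_bcdm$inst)
)

# Per-field missing counts and proportion of complete cases,
# in the spirit of data_completeness.
completeness <- data.frame(
  field         = names(toy_bcdm),
  missing       = unname(colSums(is.na(toy_bcdm))),
  prop_complete = unname(colMeans(!is.na(toy_bcdm))),
  stringsAsFactors = FALSE
)

concise
completeness
```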

Note: Users who want to generate the barcode_summary must install and load the Biostrings package before running this function. For the data in the nuc_basecount column of the barcode_summary, please refer to bold.field.info() for details.
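
To make the barcode_summary quantities concrete, the sketch below computes sequence length, a count of non-ACGT characters, and exact forward-primer presence using base R only. It is not how the package computes them (the function relies on Biostrings); the toy sequences, the helper count_ambiguous and the output column names are hypothetical, and exact string matching ignores IUPAC ambiguity codes in the primer.

```r
# Toy sequences keyed by processid (made up for illustration).
seqs <- c(
  P1 = "GGTCAACAAATCATAAAGATATTGGAACCTTNNACGT",
  P2 = "AACCTTGGACGTRACGT"
)

# Forward primer LCO1490, as used in the Examples section.
primer_f <- "GGTCAACAAATCATAAAGATATTGG"

# Hypothetical helper: count characters that are not A, C, G or T.
# Note that gap characters would also be counted by this simple rule.
count_ambiguous <- function(s) {
  nchar(gsub("[ACGT]", "", toupper(s)))
}

barcode_sketch <- data.frame(
  processid    = names(seqs),
  length_bp    = unname(nchar(seqs)),
  ambiguous_bp = unname(count_ambiguous(seqs)),
  has_primer_f = unname(grepl(primer_f, seqs, fixed = TRUE)),
  stringsAsFactors = FALSE
)

barcode_sketch
```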

Examples

if (FALSE) {
bold_data.ids <- bold.public.search(taxonomy = list("Oreochromis"))

# Fetch the data using the ids.
# 1. An API key must be obtained from BOLD support before using the `bold.fetch()` function.
# 2. Use the `bold.apikey()` function to set the API key in the global environment.

bold.apikey('apikey')

bold.data <- bold.fetch(get_by = "processid",
                        identifiers = bold_data.ids$processid)

#1. Generate a concise summary of the data

test.data.summary.concise <- bold.data.summarize(bold_df=bold.data,
                                                 summary_type = "concise_summary")
# Result
test.data.summary.concise$concise_summary


#2. Generate a detailed taxon counts summary

test.data.summary <- bold.data.summarize(bold_df=bold.data,
                                         summary_type = "detailed_taxon_counts")

# Result
test.data.summary$detailed_taxon_counts


#3. Generate data completeness profile

test.data.summary.completeness <- bold.data.summarize(bold_df=bold.data,
                                                      summary_type = "data_completeness")

# Results
# Summary
test.data.summary.completeness$completeness_summary

# Plot
test.data.summary.completeness$completeness_plot


#4. Barcode summary (forward primer LCO1490)

# Users need to first load the package `Biostrings`
library(Biostrings)

test.data.summary.barcode <- bold.data.summarize(bold_df=bold.data,
                                                 summary_type = "barcode_summary",
                                                 primer_f='GGTCAACAAATCATAAAGATATTGG')

# Results
test.data.summary.barcode$barcode_summary
}