bold.analyze.align: Transform and align the sequence data retrieved from BOLD

Description

Transforms and aligns the sequence data retrieved by bold.fetch().

Usage

bold.analyze.align(
  bold_df,
  marker = NULL,
  align_method = c("ClustalOmega", "Muscle"),
  cols_for_seq_names = NULL,
  ...
)

Value

bold_df.mod: A modified BCDM data frame with two additional columns ('aligned_seq' and 'msa.seq.name').

Arguments

bold_df: A data frame obtained from bold.fetch().
marker: A single character value specifying the gene marker for which the output is generated. Default is NULL (all data is used).
align_method: Character vector specifying the type of multiple sequence alignment algorithm to be used (ClustalOmega and Muscle available).
cols_for_seq_names: A character vector of one or more column headers used to name each sequence in the fasta file. Default is NULL, in which case only the processid is used as the name.
...: Additional arguments that can be passed to the msa::msa() function.
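
To illustrate how several fields given in cols_for_seq_names could be combined into one sequence name, here is a minimal base R sketch. The toy data frame, the helper make_seq_names() and the "|" separator are assumptions made for illustration only; the separator that bold.analyze.align actually uses is not stated on this page.

```r
# Toy stand-in for a BCDM data frame (all values are made up)
toy_df <- data.frame(
  processid = c("ABC001-20", "ABC002-20"),
  bin_uri   = c("BOLD:AAA0001", "BOLD:AAA0002"),
  species   = c("Oreochromis tanganicae", "Oreochromis karongae"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: join the chosen columns row by row,
# keeping the order in which the columns were requested
make_seq_names <- function(df, cols, sep = "|") {
  parts <- unname(as.list(df[cols]))
  do.call(paste, c(parts, list(sep = sep)))
}

seq_names <- make_seq_names(toy_df, c("bin_uri", "species"))
print(seq_names)
```

Reversing the order of the columns in the vector reverses the order of the fields in each name.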

Details

bold.analyze.align takes the sequence information obtained using the bold.fetch() function and performs a multiple sequence alignment. It uses the msa::msa() function with default settings, but additional arguments of msa::msa() can be passed through the ... argument. The alignment method can be specified using the align_method argument, with the options Muscle and ClustalOmega (available via the msa package). The provided marker name must match one of the standard marker names (e.g., COI-5P) available on the BOLD webpage (Ratnasingham et al. 2024, p. 404).

The name of each sequence in the output can be customized using the cols_for_seq_names argument. If multiple fields are specified, the sequence name follows the order of the fields given in the vector.

Performing a multiple sequence alignment on a large set of sequences can slow down or crash the system. Additionally, users are responsible for verifying sequence quality and integrity, as the function does not automatically check for issues such as STOP codons and indels in the data.
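
Because the function does not screen sequences itself, a quick check before aligning can be useful. The following base R sketch counts gap characters and non-ACGT symbols per sequence. The toy sequences and the helper count_chars() are assumptions made for illustration; the sketch does not detect STOP codons, which would require translating the sequences in the correct reading frame and genetic code.

```r
# Toy sequences (made up); with real data these would come from the
# sequence column of the data frame returned by bold.fetch()
seqs <- c(s1 = "ATGGCATTAGC", s2 = "ATG-CATNAGC", s3 = "ATGGCA--AGC")

# Hypothetical helper: number of matches of `pattern` in each string
count_chars <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x)))
}

qc <- data.frame(
  name      = names(seqs),
  length    = nchar(seqs),
  gaps      = count_chars(seqs, "-"),                  # possible indels
  ambiguous = count_chars(toupper(seqs), "[^ACGT-]"),  # e.g. N, R, Y
  row.names = NULL
)
print(qc)
```

Sequences with many gaps or ambiguous bases can then be inspected or dropped before calling bold.analyze.align.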

Note: Users must install the Biostrings, msa and muscle packages using BiocManager, and load them, before running this function.
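
A minimal sketch of how the required packages could be checked is shown below. It only reports which packages are missing and prints the usual BiocManager install command rather than installing anything; the variable names are illustrative.

```r
# Packages named in the note above
required <- c("Biostrings", "msa", "muscle")

# TRUE for each package that can be loaded in the current library
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  # Typical installation route for Bioconductor packages:
  #   install.packages("BiocManager")
  #   BiocManager::install(missing_pkgs)
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

Once installed, the packages still need to be loaded with library() before bold.analyze.align is called.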

References

Ratnasingham S, Wei C, Chan D, Agda J, Agda J, Ballesteros-Mejia L, Ait Boutou H, El Bastami ZM, Ma E, Manjunath R, Rea D, Ho C, Telfer A, McKeowan J, Rahulan M, Steinke C, Dorsheimer J, Milton M, Hebert PDN. "BOLD v4: A Centralized Bioinformatics Platform for DNA-Based Biodiversity Data." In DNA Barcoding: Methods and Protocols, pp. 403-441. Chapter 26. New York, NY: Springer US, 2024.

Examples

if (FALSE) {
# Search for ids
seq.data.ids <- bold.public.search(taxonomy = list("Oreochromis tanganicae",
                                                   "Oreochromis karongae"))

# Fetch the data using the ids.
# 1. An API key must be obtained from BOLD support before using `bold.fetch()`.
# 2. Use `bold.apikey()` to set the API key in the global environment.
bold.apikey('apikey')

seq.data <- bold.fetch(get_by = "processid",
                       identifiers = seq.data.ids$processid)

# The R packages `msa` and `Biostrings` are required for this function to run.
# For `align_method = "Muscle"`, the package `muscle` is required as well.
# All of these packages are installed using `BiocManager`.

# Align the data (using bin_uri as the name for each sequence)
seq.align <- bold.analyze.align(seq.data,
                                cols_for_seq_names = c("bin_uri"),
                                align_method = "ClustalOmega")

# Data frame of the aligned sequences with their corresponding names
head(seq.align[, c("aligned_seq", "msa.seq.name")])
 }
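
As a follow-up to the example above, the two added columns can be written out as a FASTA file with base R alone. This is a sketch under assumptions: the toy data frame stands in for the real output of bold.analyze.align, and its values are made up.

```r
# Toy stand-in for the returned data frame (values are made up)
aligned <- data.frame(
  msa.seq.name = c("BOLD:AAA0001", "BOLD:AAA0002"),
  aligned_seq  = c("ATGGCATTAGC", "ATG-CATTAGC"),
  stringsAsFactors = FALSE
)

# Interleave ">name" header lines with the aligned sequences
fasta_lines <- as.vector(rbind(paste0(">", aligned$msa.seq.name),
                               aligned$aligned_seq))

# Write to a temporary file and read it back to confirm the layout
out_file <- tempfile(fileext = ".fasta")
writeLines(fasta_lines, out_file)
fasta_check <- readLines(out_file)
print(fasta_check)
```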
