bold.analyze.tree: Analyze and visualize the multiple sequence alignment

Description

Calculates genetic distances and performs a Neighbor Joining (NJ) tree estimation of the multiple sequence alignment output obtained from bold.analyze.align().

Usage

bold.analyze.tree(
  bold_df,
  dist_model,
  clus_method = c("nj", "njs"),
  save_dist_mat = FALSE,
  newick_tree_export = NULL,
  tree_plot = FALSE,
  tree_plot_type,
  ...
)

Value

An 'output' list containing:

dist_mat = A distance matrix based on the model selected if save_dist_mat=TRUE.
base_freq = Overall base frequencies of the aligned sequences returned by bold.analyze.align().
plot = Neighbor Joining clustering visualization (if tree_plot=TRUE).
data_for_plot = A phylo object used for the plot.
NJ/NJS tree saved in Newick format (only if a file path is supplied to newick_tree_export).

Arguments

bold_df: A modified BCDM data frame obtained from bold.analyze.align().
dist_model: A character string specifying the model to generate the distances.
clus_method: A character string specifying the clustering algorithm: either "nj" (neighbor joining) or "njs" (neighbor joining that tolerates missing distances).
save_dist_mat: A logical value specifying whether the distance matrix should be saved in the output. Default value is FALSE.
newick_tree_export: A character string giving the full path, including the file name, where the Newick file should be saved. Default value is NULL (no file is written).
tree_plot: Logical value specifying if a neighbor joining plot should be generated. Default value is FALSE.
tree_plot_type: A character string specifying the layout of the tree. No default is set, so a value must be supplied.
...: Additional arguments passed to ape::dist.dna().
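
The difference between the two clus_method options can be seen directly with ape. The sketch below is illustrative only: it uses the woodmouse example alignment that ships with ape rather than BOLD data, and the object names (d_missing, nj_result, tree_njs) are hypothetical. It assumes that bold.analyze.tree hands its distance matrix to ape::nj() or ape::njs() unchanged.

```r
library(ape)

# Example alignment bundled with ape (15 sequences), standing in for BOLD data
data(woodmouse)

# Pairwise K80 distances, then blank out a few entries to mimic missing data
d <- dist.dna(woodmouse, model = "K80")
d_missing <- d
d_missing[c(1, 20, 50)] <- NA

# nj() refuses a distance matrix that contains NA values
nj_result <- tryCatch(
  nj(d_missing),
  error = function(e) "nj() stopped: missing distances"
)

# njs() is designed for incomplete distance matrices and still returns a tree
tree_njs <- njs(d_missing)
```

If the distances contain NA values (for instance when some sequence pairs share no comparable sites), "njs" is the option that can still return a tree.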

Details

bold.analyze.tree analyzes the multiple sequence alignment output of bold.analyze.align() and generates a distance matrix using the models available in ape::dist.dna(). dist_model takes any model name accepted by ape::dist.dna(), such as "K80" (Kimura 1980 model). Two forms of Neighbor Joining clustering are currently available: ape::nj() and ape::njs().

Setting save_dist_mat = TRUE stores the underlying distance matrix in the output; the default is deliberately kept at FALSE to avoid potential memory issues with large data. newick_tree_export saves the tree locally in Newick format; the path must include the file name (e.g. 'C:/Users/xyz/Desktop/newickoutput' on Windows).

Setting tree_plot = TRUE generates a basic visualization of the Neighbor Joining (NJ) tree from the ape::dist.dna() distance matrix using ape::plot.phylo(). tree_plot_type specifies the tree layout and accepts "phylogram", "cladogram", "fan", "unrooted", "radial" or "tidy", following the type argument of ape::plot.phylo(); the first letter can be used instead of the whole word.

Additional arguments for calculating distances, such as gamma, pairwise.deletion and base.freq, can be passed to ape::dist.dna() through the ... argument. The function also returns the base frequencies of the data.
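
The steps described above can be approximated with ape alone. This is a minimal sketch, not the function's actual implementation: it uses the woodmouse alignment bundled with ape in place of a bold.analyze.align() result, and it assumes the Newick export behaves like ape::write.tree(). The names dist_mat, tree, freqs and newick_file are chosen for illustration.

```r
library(ape)

# Example alignment bundled with ape, standing in for the aligned BOLD sequences
data(woodmouse)

# 1. Distance matrix under the K80 model; pairwise.deletion is the kind of
#    extra argument that would travel through `...`
dist_mat <- dist.dna(woodmouse, model = "K80", pairwise.deletion = TRUE)

# 2. Neighbor Joining tree (use njs() instead if distances contain NA)
tree <- nj(dist_mat)

# 3. Overall base frequencies of the alignment
freqs <- base.freq(woodmouse)

# 4. Basic plot; `type` takes the same values as tree_plot_type
plot.phylo(tree, type = "phylogram", cex = 0.7)

# 5. Newick export (assumed equivalent of newick_tree_export)
newick_file <- tempfile(fileext = ".nwk")
write.tree(tree, file = newick_file)
```

A full distance matrix holds n * (n - 1) / 2 values for n sequences, which is why keeping it out of the output by default is sensible for large downloads.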

Examples

if (FALSE) {
#Download the data ids
seq.data.ids <- bold.public.search(taxonomy = list("Oreochromis tanganicae",
"Oreochromis karongae"))

# Fetch the data using the ids.
#1. api_key must be obtained from BOLD support before using `bold.fetch()` function.
#2. Use the `bold.apikey()` function to set the apikey in the global env.

bold.apikey('apikey')

seq.data <- bold.fetch(get_by = "processid",
                       identifiers = seq.data.ids$processid,
                       filt_marker = "COI-5P")

# Remove rows without species name information
# (overwrite seq.data so the filtered rows are the ones that get aligned)
seq.data <- seq.data[seq.data$species != "", ]

# Align the data
# Users need to install and load packages `msa` and `Biostrings`.
# For `align_method` = "Muscle", package `muscle` is required as well.

seq.align<-bold.analyze.align(bold_df=seq.data,
                              marker="COI-5P",
                              align_method="ClustalOmega",
                              cols_for_seq_names = c("species","bin_uri"))

#Analyze the data to get a tree

seq.analysis<-bold.analyze.tree(bold_df=seq.align,
                                dist_model = "K80",
                                clus_method="nj",
                                tree_plot=TRUE,
                                tree_plot_type='p',
                                save_dist_mat = TRUE,
                                pairwise.deletion = TRUE)

# Output
# A ‘phylo’ object of the plot
seq.analysis$data_for_plot
# A distance matrix based on the distance model selected
seq.analysis$save_dist_mat
# Base frequencies of the sequences
seq.analysis$base_freq
}