Kmultparallel: Parallel implementation of Adams' Kmult with additional support for multiple datasets and tree sets

Description

Parallel implementation of Kmult, a measure of phylogenetic signal which is a multivariate equivalent of Blomberg's K. This version supports multiple datasets and tree sets, computing Kmult for all combinations.

Usage

Kmultparallel(data, trees, burninpercent = 0, iter = 0, verbose = TRUE)

Value

The function outputs a data.frame with classes "parallel_Kmult" and "data.frame" containing columns:

Kmult: Value of Kmult for each tree-dataset combination
p value: p value for the significance of the test (only if iter > 0)
treeset: Identifier for the tree set (name from list or number)
dataset: Identifier for the dataset (name from list or number)
tree_index: Index of the tree within its tree set

Arguments

data: Either a data.frame/matrix with continuous (multivariate) phenotypes, or a list where each element is a data.frame/matrix representing a separate dataset. Row names should match species names in the phylogenetic trees.
trees: Either a multiPhylo object containing a collection of trees (single tree set), or a list where each element is a multiPhylo object representing a separate tree set.
burninpercent: percentage of trees in each tree set to discard as burn-in (by default no tree is discarded)
iter: number of permutations to be used in the permutation test. This should normally be left at the default value of 0, because permutations slow down computation and are of doubtful utility when analyzing distributions of trees.
verbose: logical, whether to print progress information (default TRUE)
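
As an illustration of what burninpercent means in practice, the following base R sketch drops the first part of a tree set. The helper discard_burnin is hypothetical and is not part of the package. Rounding the number of discarded trees down is an assumption, since the rounding rule used by Kmultparallel is not stated here.

```r
# Hypothetical helper (not part of the package): drop the first
# `burninpercent` percent of the elements of a tree set.
# Assumption: the number of discarded trees is rounded down.
discard_burnin = function(trees, burninpercent = 0) {
  stopifnot(burninpercent >= 0, burninpercent < 100)
  n_discard = floor(length(trees) * burninpercent / 100)
  if (n_discard == 0) return(trees)
  trees[-seq_len(n_discard)]
}

# A plain list stands in for a multiPhylo object with 20 trees
toy_treeset = as.list(seq_len(20))
kept = discard_burnin(toy_treeset, burninpercent = 25)
length(kept)  # number of trees left after discarding the burn-in
```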

Parallelization

This function automatically uses parallel processing via the future framework when beneficial. The parallelization strategy is determined by the user's choice of future plan, which provides flexibility across different computing environments (local multicore, cluster, etc.). Parallelization is performed at the level of individual trees within each tree set, which is well suited to analyzing distributions of many trees. The user should set up the future plan with future::plan() before calling this function (see also the Examples).
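
The general pattern of per-tree parallelization under a user-chosen plan can be sketched with future.apply. This is only an illustration of the pattern and not the internal code of Kmultparallel. The function toy_statistic and the toy tree set are hypothetical stand-ins for the Kmult computation and for real trees.

```r
library(future)
library(future.apply)

# The caller chooses the plan; the mapping code below does not change
plan(sequential)

# Hypothetical per-tree statistic standing in for the Kmult computation
toy_statistic = function(tree) sum(tree$edge.length)

# Toy "trees": lists that only carry branch lengths
toy_treeset = list(
  list(edge.length = c(1, 2)),
  list(edge.length = c(3, 4))
)

# One task per tree; with a parallel plan these tasks are spread over workers
values = future_sapply(toy_treeset, toy_statistic)
```

Switching to, for example, plan(multisession, workers = 2) before the future_sapply() call would run the same code in parallel without any other change.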

Citation

If you use this function, please cite both Fruciano et al. 2017 (because you are using this parallelized function) and Adams 2014 (because the function computes Adams' Kmult).

S3 Methods

The returned object has specialized S3 methods:

print.parallel_Kmult: Provides a summary of Kmult ranges for each dataset-treeset combination
plot.parallel_Kmult: Creates density plots of Kmult values grouped by dataset-treeset combinations
summary.parallel_Kmult: Provides detailed summary statistics for the analysis results
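
Because the returned object is also a data.frame, it can be summarized with ordinary data.frame tools in addition to the S3 methods. The sketch below uses a small made-up data.frame with the documented columns in place of real output. The numbers are placeholders and not results of any analysis.

```r
# Made-up stand-in for the output of Kmultparallel (placeholder values only)
toy_result = data.frame(
  Kmult = c(0.2, 0.4, 0.9, 1.1),
  treeset = c("treeset1", "treeset1", "treeset2", "treeset2"),
  dataset = "brownian",
  tree_index = c(1, 2, 1, 2)
)

# Median Kmult for each dataset-treeset combination
medians = aggregate(Kmult ~ dataset + treeset, data = toy_result, FUN = median)
medians
```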

Details

This is an updated and improved version of the function included in Fruciano et al. 2017. It performs the computation of Adams' Kmult (Adams 2014) in parallel with the aim of facilitating computation on a distribution of trees rather than a single tree. This version uses cross-platform parallel processing that works on Windows, Mac, and Linux systems. Users who want to compute Kmult on a single tree are advised to use the version implemented in the package geomorph, which receives regular updates.

This function uses the future framework for parallel processing. Users should set up their preferred parallelization strategy using future::plan() before calling this function. For example:

future::plan(future::sequential) for sequential processing
future::plan(future::multisession, workers = 4) for parallel processing with 4 workers (works on most platforms, including Windows)
future::plan(future::multicore, workers = 4) for forked processes (Unix-like systems)
future::plan(future::cluster, workers = c("host1", "host2")) for cluster computing

If no plan is set, the function will use the default sequential processing.
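
One possible way to choose the number of workers and to check which plan is active is sketched below. Leaving one core free is a common convention and an assumption of this sketch, not a recommendation from the package authors.

```r
library(future)

# Leave one core free for the rest of the system (assumed convention)
n_workers = max(1, availableCores() - 1)
plan(multisession, workers = n_workers)

# Number of workers made available by the current plan
nbrOfWorkers()

# Reset to sequential processing when done, which closes the workers
plan(sequential)
workers_after_reset = nbrOfWorkers()
```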

References

Adams DC. 2014. A Generalized K Statistic for Estimating Phylogenetic Signal from Shape and Other High-Dimensional Multivariate Data. Systematic Biology 63:685-697.

Fruciano C, Celik MA, Butler K, Dooley T, Weisbecker V, Phillips MJ. 2017. Sharing is caring? Measurement error and the issues arising from combining 3D morphometric datasets. Ecology and Evolution 7:7034-7046.

Examples

# \donttest{
# Load required packages for data simulation
library(phytools)
library(MASS)
library(mvMORPH)
library(ape)  # for drop.tip function
library(future)
library(future.apply)

# Generate 20 random phylogenetic trees with 100 tips each
all_trees = replicate(20, pbtree(n = 100), simplify = FALSE)
class(all_trees) = "multiPhylo"
# Create a collection of 20 random trees

# Split trees into 2 tree sets
treeset1 = all_trees[1:5]
treeset2 = all_trees[6:20]
class(treeset1) = class(treeset2) = "multiPhylo"
# Split the 20 trees into 2 separate tree sets

# Get tip names from the first tree for consistent naming
tip_names = all_trees[[1]]$tip.label[1:40]
# Use first 40 tip names for consistent data generation

# Generate 1 random dataset using multivariate normal distribution
dataset_random = mvrnorm(n = 40, mu = rep(0, 5), Sigma = diag(5))
rownames(dataset_random) = tip_names
# Create one random dataset which should not display phylogenetic signal

# Generate 1 dataset using Brownian motion evolution on the first tree
tree_temp = treeset1[[1]]
# Get only the first 40 tips to match our data size
tips_to_keep = tree_temp$tip.label[1:40]
tree_pruned = ape::drop.tip(tree_temp,
                            setdiff(tree_temp$tip.label, tips_to_keep))

# Simulate data under Brownian motion
sim_data = mvSIM(tree = tree_pruned, nsim = 1, model = "BM1", 
                 param = list(sigma = diag(5), theta = rep(0, 5)))
# Convert to matrix and ensure proper row names
if (is.list(sim_data)) sim_data = sim_data[[1]]
dataset_bm = as.matrix(sim_data)
rownames(dataset_bm) = tree_pruned$tip.label
# Generate 1 dataset evolving under Brownian motion
# This dataset should display strong phylogenetic signal when combined
# with treeset1

# Example 1: Single dataset and single treeset analysis (sequential
# processing)
future::plan(future::sequential)  # Use sequential processing
result_single = Kmultparallel(dataset_bm, treeset1)
# Analyze BM dataset with first treeset (sequential processing)

# Use S3 methods to examine results
print(result_single)
# Display summary of Kmult values
# Notice how the range is very broad because we have high
# phylogenetic signal for the case in which the dataset has been
# simulated under Brownian motion with the first tree, but low
# phylogenetic signal when we use the other trees in the treeset.

plot(result_single)
# Create density plot of Kmult distribution
# Notice the bimodal distribution with low phylogenetic signal
# corresponding to a mismatch between the tree used and the true
# evolutionary history of the traits, and the high phylogenetic
# signal when the correct tree is used.

# Example 2: Multiple datasets and multiple treesets analysis with
# parallel processing
# Set up parallel processing with future
future::plan(future::multisession, workers = 4)

# Combine datasets into a list
all_datasets = list(random = dataset_random, brownian = dataset_bm)
# Combine random and BM datasets

# Combine treesets into a list
all_treesets = list(treeset1 = treeset1, treeset2 = treeset2)
# Create list of both tree sets

# Run comprehensive analysis on all combinations
result_multiple = Kmultparallel(all_datasets, all_treesets)
# Analyze all dataset-treeset combinations with parallel processing

# Examine results using S3 methods
print(result_multiple)
# Display summary showing ranges for each combination

plot(result_multiple)
# Create grouped density plots by combination
# Notice how the distribution of Kmult when we use the random dataset
# has a strong peak at small values (no phylogenetic signal, as
# expected)

# Custom plotting with different transparency
plot(result_multiple, alpha = 0.5,
     title = "Kmult Distribution Across All Combinations")
# Customize the plot appearance

# Example 3: Setting up parallel processing with future
future::plan(future::multisession, workers = 4)
result_parallel = Kmultparallel(dataset_bm, treeset1)
# Use 4 worker processes for parallel processing

# Clean up: Reset to sequential processing to close parallel workers
future::plan(future::sequential)
# }
