disparity_resample: Resampling-based estimates (bootstrap or rarefaction) of disparity / morphospace occupation

Description

Provides a unified interface to obtain resampled (bootstrap or rarefied) estimates of several disparity / morphospace occupation statistics.

Usage

disparity_resample(
  Data,
  group = NULL,
  n_resamples = 1000,
  statistic = "multivariate_variance",
  CI = 0.95,
  bootstrap_rarefaction = "bootstrap",
  sample_size = NULL
)

Value

A list containing:

chosen_statistic: Character vector of length 1 with the human-readable name of the statistic used.
results: A data frame with columns `group`, `average`, `CI_min`, `CI_max`. One row per group. When CI=1, `CI_min` and `CI_max` represent the minimum and maximum of resampled values rather than confidence intervals.
resampled_values: If a single group: numeric vector of length `n_resamples` with the resampled values. If multiple groups: a named list with one numeric vector (length `n_resamples`) per group.
CI_level: The CI level used (between 0 and 1). When CI=1, ranges are computed instead of confidence intervals.

The returned object has class "disparity_resample" and comes with associated S3 methods for convenient display and visualization:

print.disparity_resample: Prints a formatted summary of results including confidence interval overlap assessment for multiple groups
plot.disparity_resample: Creates a confidence interval plot using ggplot2

Arguments

Data: A data frame, matrix, vector, or 3D array. Observations (specimens) must be in rows (if a 3D array is supplied, the third dimension is assumed to index specimens).
group: A factor or a vector indicating group membership (will be coerced to factor). If `NULL` (default) a single analysis is performed on the full dataset.
n_resamples: Number of resampling replicates (default 1000).
statistic: Character string identifying the statistic to compute. One of `"multivariate_variance"`, `"mean_pairwise_euclidean_distance"`, `"convex_hull_volume"`, `"claramunt_proper_variance"`. Default is `"multivariate_variance"`. Ignored for univariate data (vector input).
CI: Desired two-sided confidence interval level (default 0.95) used for percentile confidence intervals. Use CI=1 to display the full range (minimum to maximum) of resampled values.
bootstrap_rarefaction: Either `"bootstrap"` (default) for resampling with replacement or `"rarefaction"` for resampling without replacement.
sample_size: Either `NULL` (default), a positive integer indicating the number of rows to sample, or the character `"smallest"` to use the size of the smallest group (all groups resampled to that size). For `bootstrap_rarefaction=="rarefaction"`, sampling is without replacement and this parameter is required (not `NULL`). For `bootstrap_rarefaction=="bootstrap"`, sampling is with replacement; if `NULL`, uses original group sizes, otherwise uses the specified sample size. If `"smallest"` is supplied but no groups are defined, an error is returned.

Parallelization

This function automatically uses parallel processing via the future framework, when the packages future and future.apply are installed. This is particularly useful for large datasets, large number of resamples or computationally intensive statistics (e.g., convex hull volume). The parallelization strategy is determined by the user's choice of future plan, providing flexibility across different computing environments (local multicore, cluster, etc.). The function performs parallelization at the level of individual bootstrap/rarefaction replicates within each group. The future plan should be set up by the user before calling this function using future::plan() (see examples). If no plan is set or the future packages are not available, the function will use sequential processing.

Average estimate

For bootstrap resampling, the average estimate reported is the mean of the bootstrap resampled values (for consistency across all bootstrap scenarios). For rarefaction, because the purpose is to make groups comparable at a common (smaller) sample size, the average estimate reported is the mean of the rarefied resampled values for that group (i.e., the mean across all rarefaction replicates).

Input data types

`Data` can be a data frame, a matrix, a vector, or a 3D array of landmark coordinates (e.g., p landmarks x k dimensions x n specimens). In the latter case, the array is converted internally to a 2D matrix with specimens in rows and (landmark * dimension) variables in columns.

Details

The function allows choosing among the following multivariate statistics:

Multivariate variance (trace of the covariance matrix; sum of variances)
Mean pairwise Euclidean distance
Convex hull volume (n-dimensional)
Claramunt proper variance (Claramunt 2010) based on a linear shrinkage covariance matrix

If the input `Data` is univariate (i.e., a vector), the analysis defaults to computing the (univariate) variance within each group (the attribute `statistic` is ignored).

If `bootstrap_rarefaction=="bootstrap"`, the function performs resampling with replacement (i.e., classical bootstrap) by sampling rows of the data matrix / data frame. Optionally, the user can specify a custom sample size via the `sample_size` argument. This allows comparison of bootstrap confidence intervals at the same sample size (essentially, this is rarefaction sampling with replacement), which can be useful to compare bootstrapped confidence intervals across different groups for statistics which are sensitive to sample size (at the expense of broader than necessary confidence intervals for groups that are larger).

If `bootstrap_rarefaction=="rarefaction"`, the function performs resampling without replacement at the sample size indicated in `sample_size` (numeric) or, if `sample_size=="smallest"`, at the size of the smallest group (all groups are resampled to that size). Rarefaction requires specifying a valute to the attribute `sample_size`; an error is returned otherwise.

This function uses the future framework for parallel processing, requiring packages future and future.apply. Users should set up their preferred parallelization strategy using future::plan() before calling this function. For example:

future::plan(future::sequential) for sequential processing
future::plan(future::multisession, workers = 4) for parallel processing with 4 workers (works on all platforms including Windows)
future::plan(future::multicore, workers = 4) for forked processes (Unix-like systems)
future::plan(future::cluster, workers = c("host1", "host2")) for cluster computing

If no plan is set or the future packages are not available, the function will use sequential processing.

References

Drake AG, Klingenberg CP. 2010. Large-scale diversification of skull shape in domestic dogs: disparity and modularity. American Naturalist 175:289-301.

Claramunt S. 2010. Discovering exceptional diversifications at continental scales: the case of the endemic families of Neotropical Suboscine passerines. Evolution 64:2004-2019.

Fruciano C, Tigano C, Ferrito V. 2012. Body shape variation and colour change during growth in a protogynous fish. Environmental Biology of Fishes 94:615-622.

Fruciano C, Pappalardo AM, Tigano C, Ferrito V. 2014. Phylogeographical relationships of Sicilian brown trout and the effects of genetic introgression on morphospace occupation. Biological Journal of the Linnean Society 112:387-398.

Fruciano C, Franchini P, Raffini F, Fan S, Meyer A. 2016. Are sympatrically speciating Midas cichlid fish special? Patterns of morphological and genetic variation in the closely related species Archocentrus centrarchus. Ecology and Evolution 6:4102-4114.

Examples

Run this code

set.seed(123)
# Simulate two groups with different means but same covariance
if (requireNamespace("MASS", quietly = TRUE)) {
  X1 = MASS::mvrnorm(20, mu=rep(0, 10), Sigma=diag(10))
  X2 = MASS::mvrnorm(35, mu=rep(2, 10), Sigma=diag(10))
  Data = rbind(X1, X2)
  grp = factor(c(rep("A", nrow(X1)), rep("B", nrow(X2))))

  # Sequential processing
  # future::plan(future::sequential)  # Default sequential processing
  
  # Parallel processing (uncomment to use)
  # future::plan(future::multisession, workers = 2)  # Use 2 workers

  # Bootstrap multivariate variance
  boot_res = disparity_resample(
    Data, group=grp, n_resamples=200,
    statistic="multivariate_variance",
    bootstrap_rarefaction="bootstrap"
  )
  # Direct access to results table
  boot_res$results
  
  # Using the print method for formatted output
  print(boot_res)
  
  # Using the plot method to visualize results
  # plot(boot_res)  # Uncomment to create confidence interval plot

  # Rarefaction (to the smallest group size) of mean pairwise
  # Euclidean distance
  rar_res = disparity_resample(
    Data, group=grp, n_resamples=200,
    statistic="mean_pairwise_euclidean_distance",
    bootstrap_rarefaction="rarefaction", sample_size="smallest"
  )
# Now simulate a third group with larger variance
 X3 = MASS::mvrnorm(15, mu=rep(0, 10), Sigma=diag(10)*1.5)
 grp2 = factor(
   c(rep("A", nrow(X1)), rep("B", nrow(X2)), rep("C", nrow(X3)))
 )
 boot_res2 = disparity_resample(
   Data=rbind(X1, X2, X3), group=grp2, n_resamples=1000,
   statistic="multivariate_variance",
   bootstrap_rarefaction="bootstrap"
 )
 print(boot_res2)
 # plot(boot_res2)
 # Plot of the obtained (95%) confidence intervals (uncomment to plot)
 
 # Reset to sequential processing when done (optional)
 # future::plan(future::sequential)
}

Run the code above in your browser using DataLab