ENphylo_modeling: Calculating species marginality and specialization via ENFA and phylogenetic imputation

Description

The function computes vectors of marginality and specialization according to Rinnan & Lawler (2019) via Environmental Niche Factor Analysis (ENFA) and phylogenetic imputation (Garland & Ives, 2000). It takes a list of Simple Features (sf) objects and a phylogenetic tree to train ENFA and/or ENphylo models. Both modeling techniques are calibrated and evaluated while accounting for phylogenetic uncertainty. Calibrations are made on a random subset of the data under a bootstrap cross-validation scheme. The predictive power of the different models is estimated using five different evaluation metrics.

Usage

ENphylo_modeling(input_data, tree, input_mask, obs_col, time_col = NULL,
  min_occ_enfa = 30, boot_test_perc = 20, boot_reps = 10,
  swap.args = list(nsim = 10, si = 0.2, si2 = 0.2),
  eval.args = list(eval_metric_for_imputation = "AUC",
                   eval_threshold = 0.7, output_options = "best"),
  clust = 0.5, output.dir)

Value

The function does not return the output into .GlobalEnv. Use the function getENphylo_results to collect results from local folders.

Arguments

input_data

a list of sf::data.frame objects containing species occurrence data in binary format (ones for presences, zeros for background points) along with the continuous explanatory variables to be used in ENFA or ENphylo. Each element of the list must be named after its target species. Alternatively, ENFA model outputs generated through ENphylo_modeling can be supplied as named elements of the input_data list.
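
The requirements above can be checked before calling the function. The sketch below is a hypothetical helper (check_input_data is not part of RRgeo) that relies on base R only. Because sf objects inherit from data.frame, plain data frames with invented species names and values stand in for real occurrence data. It covers the raw occurrence format only, not previously fitted ENFA outputs.

```r
# Hypothetical pre-flight check; not part of RRgeo.
check_input_data <- function(input_data, obs_col) {
  sp_names <- names(input_data)
  # Every list element must carry a species name
  if (is.null(sp_names) || any(is.na(sp_names)) || any(sp_names == "")) {
    stop("every element of input_data must be named after its species")
  }
  for (sp in sp_names) {
    x <- input_data[[sp]]
    if (!obs_col %in% names(x)) {
      stop(sprintf("column '%s' is missing for %s", obs_col, sp))
    }
    # Occurrences must be binary: 1 = presence, 0 = background
    if (!all(x[[obs_col]] %in% c(0, 1))) {
      stop(sprintf("column '%s' is not binary for %s", obs_col, sp))
    }
  }
  invisible(TRUE)
}

# Toy data: two invented species and one invented predictor
toy <- list(
  "Species a" = data.frame(OBS = c(1, 1, 0, 0), bio1 = c(10.2, 11.5, 3.1, 7.8)),
  "Species b" = data.frame(OBS = c(1, 0, 0, 0), bio1 = c(15.0, 9.4, 2.2, 6.3))
)
check_input_data(toy, obs_col = "OBS")
```

The helper stops at the first problem it finds, so a clean run suggests that the naming and binary-coding requirements are met; it does not check the predictors or the geometry.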

tree

an object of class phylo including all the species listed in input_data. The tree need not be ultrametric or fully dichotomous. Any species in the tree that does not match a species in input_data is automatically dropped from the tree.

input_mask

a SpatRaster object. It represents the geographical mask defining the spatial domain encompassing the background area enclosing all the species in input_data.

obs_col

character. Name of the input_data column containing the vector of species occurrence data in binary format.

time_col

character. Name of the input_data column containing the time intervals associated with each species presence and background point (optional).

min_occ_enfa

numeric. The minimum number of occurrences required for a species to be modeled with ENFA.

boot_test_perc

numeric. Percentage of data (ranging between 0 and 100) used to calibrate ENFA and/or ENphylo models within a bootstrap cross-validation scheme. The remaining percentage (100-boot_test_perc) will be used to evaluate model performances.

boot_reps

numeric. Number of evaluation runs performed within the bootstrap cross-validation scheme to evaluate ENFA and/or ENphylo models. If set to 0, model evaluation is skipped and the internal evaluation element returns NULL.

swap.args

list of ENphylo parameters. It includes:

nsim = number of alternative phylogenies generated by altering topology and branch lengths of the reference tree by means of swapONE. nsim must be greater than or equal to 1 (see details);
si,si2 = arguments passed to RRphylo::swapONE.

eval.args

list of evaluation model parameters. It includes:

eval_metric_for_imputation = the evaluation metric used to compare ENFA and ENphylo models and to rank the models derived from the swapped trees ("AUC" by default);
eval_threshold = the minimum evaluation score required to assess ENFA and ENphylo performance. ENFA models having eval_metric_for_imputation lower than eval_threshold are compared to ENphylo models to keep the one fitting best. Additionally, within ENphylo, models derived from the swapped trees having eval_metric_for_imputation lower than eval_threshold are excluded from the output;
output_options = the strategy adopted to return ENphylo models results (see details). The viable options are: "full", "weighted.mean", and "best".

clust

numeric. The proportion of cores used to train ENFA and ENphylo models. If NULL, parallel computing is disabled. It is set at 0.5 by default.

output.dir

the file path wherein ENphylo_modeling creates "ENphylo_enfa_models" and "ENphylo_imputed_models" folders to store modeling outputs (see details).

Author

Alessandro Mondanaro, Mirko Di Febbraro, Silvia Castiglione, Carmela Serio, Marina Melchionna, Pasquale Raia

Details

ENphylo_modeling automatically arranges input_data in a suitable format to run ENFA or ENphylo. The internal call of the function is "calibrated_enfa" for ENFA and "calibrated_imputed" for ENphylo, respectively.

Phylogenetic uncertainty

The function does not work with nsim < 1 since one of the strongest points of ENphylo_modeling is to test alternative phylogenies to provide the most accurate reconstruction of species environmental preferences. Similarly, setting nsim = 1 limits the power of the function, as it will use the original tree without generating alternative phylogenies.

Phylogenetic Imputation

ENphylo_modeling automatically switches from the ENFA to the ENphylo algorithm for any species having fewer than min_occ_enfa occurrences or an ENFA model accuracy below eval_threshold. In the latter case, the function fits both models and retains the one performing best according to eval_metric_for_imputation. Phylogenetic imputation is allowed for up to 30% of the species on the tree. If the number of species to impute exceeds 30%, ENphylo_modeling automatically splits the original tree into smaller subtrees, so that the maximum percentage of imputation is respected. Each subtree is designed to impute phylogenetically distant species and to retain species phylogenetically close to the taxa to be imputed (so that imputation is robust). In this case, the function prints the number of phylogenies used.
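
The switching rule described above can be sketched in base R. This is an illustration, not the package's internal code: assign_algorithm is a hypothetical helper, the occurrence counts and ENFA scores are invented, and the sketch assumes that only species with too few occurrences count towards the 30% ceiling.

```r
# Illustrative only; the thresholds mirror the documented defaults.
assign_algorithm <- function(n_occ, enfa_score, min_occ_enfa = 30,
                             eval_threshold = 0.7) {
  out <- rep("ENFA", length(n_occ))
  # Enough occurrences but low ENFA accuracy:
  # both models are fitted and the better one is kept
  out[n_occ >= min_occ_enfa & !is.na(enfa_score) &
        enfa_score < eval_threshold] <- "ENFA vs ENphylo"
  # Too few occurrences: the species is imputed directly
  out[n_occ < min_occ_enfa] <- "ENphylo"
  names(out) <- names(n_occ)
  out
}

n_occ <- c(sp1 = 120, sp2 = 12, sp3 = 45, sp4 = 8)           # invented counts
enfa_score <- c(sp1 = 0.85, sp2 = NA, sp3 = 0.62, sp4 = NA)  # invented scores
algo <- assign_algorithm(n_occ, enfa_score)

# Share of species to impute, compared against the 30% ceiling
n_tips <- 10                       # hypothetical number of tips in the tree
share <- sum(algo == "ENphylo") / n_tips
split_needed <- share > 0.30
```

With these toy values two of ten tips need imputation, so the tree would not have to be split under the stated assumption.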

Outputs

If ENphylo_modeling runs the ENphylo algorithm, the outputs depend on the strategy selected through the output_options argument. If output_options="full", the 'co' matrices and evaluation metrics for all the swapped trees tested are returned. Under output_options="weighted.mean", the output consists of the subset of 'co' matrices and evaluation metrics for those tree swapping iterations achieving a predictive accuracy, in terms of eval_metric_for_imputation, above eval_threshold. Finally, if output_options="best", a single 'co' matrix and evaluation scores list corresponding to the most accurate swapped tree is returned. If any tree swapping iteration under either "best" or "weighted.mean" results in accuracy below the threshold, the function automatically switches to the "full" strategy.
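
The three strategies can be pictured as a selection over the scores of the swapped trees. The sketch below is a hypothetical helper (select_swaps is not part of RRgeo) with invented AUC values. It assumes a metric for which higher is better, keeps scores equal to the threshold, and reads the fallback rule as "switch to full when no iteration reaches the threshold", which is only one possible reading of the paragraph above.

```r
# Hypothetical helper; returns the indices of the swapped trees to keep.
select_swaps <- function(scores, eval_threshold = 0.7,
                         output_options = c("best", "weighted.mean", "full")) {
  output_options <- match.arg(output_options)
  passing <- which(scores >= eval_threshold)
  # Assumed fallback: nothing reaches the threshold, so keep everything
  if (output_options == "full" || length(passing) == 0) {
    return(seq_along(scores))
  }
  if (output_options == "weighted.mean") {
    return(passing)
  }
  which.max(scores)  # "best": the single most accurate swapped tree
}

auc <- c(0.66, 0.74, 0.81, 0.69, 0.72)  # invented scores for nsim = 5
select_swaps(auc, output_options = "full")
select_swaps(auc, output_options = "weighted.mean")
select_swaps(auc, output_options = "best")
```

Returning indices rather than the models themselves keeps the sketch independent of how the 'co' matrices are stored.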

Finally, the function creates two new folders, "ENphylo_enfa_models" and "ENphylo_imputed_models", in output.dir. Within each of these folders, one named subfolder is created for each modeled species. Therein, model outputs and background area are saved as model_outputs.RData and study_area.tif, respectively. model_outputs.RData includes a list of three elements, regardless of whether ENFA or ENphylo is used:

$call a character specifying the algorithm used to model the species (i.e. ENFA or ENphylo).
$formatted_data a list of input data formatted to run either ENFA or ENphylo algorithms. Specifically, the list reports: the presence data points ($input_ones), the background points ($input_back), the names of the columns associated with the arguments obs_col and time_col (if specified), the name of the column containing the cell numbers (geoID_col), and the coordinates of presence data only ($one_coords).
$calibrated_model a list. The output objects are different depending on whether ENFA or ENphylo is used to model the species:

ENFA

$full_model: a list containing marginality and specialization factors, the 'co' matrix, the number of significant axes, and all the other objects generated by applying ENFA on the entire occurrence dataset (see Rinnan & Lawler 2019 for additional details).
$evaluation: a matrix containing the evaluation scores of the ENFA model assessed by all possible evaluation metrics (i.e. Area Under the Curve (AUC), True Skill Statistic (TSS), Boyce Index (CBI), Sorensen Index, and Omission Rate (OMR)) for each model evaluation run.

ENphylo

$co: a list of the 'co' matrices of length equal to the number of alternative phylogenies tested (i.e. nsim argument). The number of 'co' matrices also reflects the selected output_options strategy.
$evaluation: a data.frame containing the evaluation scores of the ENphylo model assessed by all possible evaluation metrics for each alternative phylogeny. The content of this object depends on the strategy selected through the output_options argument. Specifically, the function internally selects the model (or models) with the highest evaluation score according to the specified evaluation metric.
$output_options: a character vector including the output_options and eval_metric_for_imputation arguments set to run the ENphylo model.
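
The folder layout described above can also be inspected directly with base R. The sketch below is not a replacement for getENphylo_results: find_model_outputs and read_model_output are hypothetical helpers, and the name of the object stored inside model_outputs.RData is not assumed, which is why the file is loaded into a private environment and returned as a named list.

```r
# Hypothetical helpers; not part of RRgeo.
find_model_outputs <- function(output.dir) {
  folders <- file.path(output.dir,
                       c("ENphylo_enfa_models", "ENphylo_imputed_models"))
  folders <- folders[dir.exists(folders)]
  if (length(folders) == 0) return(character(0))
  files <- list.files(folders, pattern = "^model_outputs\\.RData$",
                      recursive = TRUE, full.names = TRUE)
  # Name each path after the species subfolder that contains it
  names(files) <- basename(dirname(files))
  files
}

read_model_output <- function(path) {
  # Load into a private environment so nothing in the workspace is overwritten
  env <- new.env()
  obj_names <- load(path, envir = env)
  mget(obj_names, envir = env)
}

# Usage, assuming ENphylo_modeling was run with output.dir = newwd:
# files <- find_model_outputs(newwd)
# res <- read_model_output(files[[1]])
# str(res, max.level = 2)
```

Inspecting the result with str() is a quick way to see the $call, $formatted_data and $calibrated_model elements for one species.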

References

Rinnan, D. S., & Lawler, J. (2019). Climate-niche factor analysis: a spatial approach to quantifying species vulnerability to climate change. Ecography, 42(9), 1494–1503. doi:10.1111/ecog.03937

Garland, T., & Ives, A. R. (2000). Using the past to predict the present: Confidence intervals for regression equations in phylogenetic comparative methods. American Naturalist, 155(3), 346–364. doi:10.1086/303327

Mondanaro, A., Di Febbraro, M., Castiglione, S., Melchionna, M., Serio, C., Girardi, G., Belfiore, A. M., & Raia, P. (2023). ENphylo: A new method to model the distribution of extremely rare species. Methods in Ecology and Evolution, 14, 911–922. doi:10.1111/2041-210X.14066

Examples

# \donttest{
library(ape)
library(terra)
library(sf)
library(RRgeo)

# Folder used for the downloads and for the model outputs
newwd <- tempdir()
# newwd <- "YOUR_DIRECTORY"

# Download and load the example occurrence data (object `dat`)
latesturl <- RRgeo:::get_latest_version("12734585")
curl::curl_download(url = paste0(latesturl, "/files/dat.Rda?download=1"),
                    destfile = file.path(newwd, "dat.Rda"), quiet = FALSE)
load(file.path(newwd, "dat.Rda"))

# Read the example tree and replace underscores with spaces in the tip labels
tree <- read.tree(system.file("exdata/Eucopdata_tree.txt", package = "RRgeo"))
tree$tip.label <- gsub("_", " ", tree$tip.label)

# Download the raster and project it to the CRS of the occurrence data
curl::curl_download(paste0(latesturl, "/files/X35kya.tif?download=1"),
                    destfile = file.path(newwd, "X35kya.tif"), quiet = FALSE)
map35 <- rast(file.path(newwd, "X35kya.tif"))
map <- project(map35, st_crs(dat[[1]])$proj4string, res = 50000)

ENphylo_modeling(input_data=dat[c(1,11)],
                 tree=tree,
                 input_mask=map[[1]],
                 obs_col="OBS",
                 time_col="age",
                 min_occ_enfa=15,
                 boot_test_perc=20,
                 boot_reps=10,
                 swap.args=list(nsim=5,si=0.2,si2=0.2),
                 eval.args=list(eval_metric_for_imputation="AUC",
                                eval_threshold=0.7,
                                output_options="best"),
                 clust=NULL,
                 output.dir=newwd)

# }
