BIOMOD_Modeling: Run a range of species distribution models

Description

This function calibrates and evaluates a range of modeling techniques for a given species distribution. The dataset can be split for independent calibration and validation, and the predictive power of the different models can be estimated using a range of evaluation metrics (see Details).

Usage

BIOMOD_Modeling(
  bm.format,
  modeling.id = as.character(format(Sys.time(), "%s")),
  models = c("GLM", "GBM", "GAM", "CTA", "ANN", "SRE", "FDA", "MARS", "RF",
    "MAXENT.Phillips", "MAXENT.Phillips.2"),
  bm.options = NULL,
  nb.rep = 1,
  data.split.perc = 100,
  data.split.table = NULL,
  do.full.models = TRUE,
  weights = NULL,
  prevalence = NULL,
  metric.eval = c("KAPPA", "TSS", "ROC"),
  var.import = 0,
  save.output = TRUE,
  scale.models = FALSE,
  nb.cpu = 1,
  seed.val = NULL,
  do.progress = TRUE
)

Value

A BIOMOD.models.out object containing model outputs, or links to saved outputs.

Model outputs are stored outside of R (to save memory) in two folders created in the current working directory:

- a models folder, named after the resp.name argument of BIOMOD_FormatingData, containing all calibrated models for each repetition and pseudo-absence run
- a hidden folder, named .BIOMOD_DATA, containing output-related files (original dataset, calibration lines, selected pseudo-absences, predictions, variable importance, evaluation values...) that can be retrieved with get_[...] or load functions, and used by other biomod2 functions such as BIOMOD_Projection or BIOMOD_EnsembleModeling

Arguments

bm.format: a BIOMOD.formated.data or BIOMOD.formated.data.PA object returned by the BIOMOD_FormatingData function
modeling.id: a character corresponding to the name (ID) of the simulation set (by default, the current system time in seconds, as returned by format(Sys.time(), "%s"))
models: a vector containing model names to be computed, must be among GLM, GBM, GAM, CTA, ANN, SRE, FDA, MARS, RF, MAXENT.Phillips, MAXENT.Phillips.2
bm.options: a BIOMOD.models.options object returned by the BIOMOD_ModelingOptions function
nb.rep: an integer corresponding to the number of repetitions to be done for calibration/validation splitting
data.split.perc: a numeric between 0 and 100 corresponding to the percentage of data used to calibrate the models (calibration/validation splitting)
data.split.table: (optional, default NULL)
A matrix or data.frame defining for each repetition (in columns) which observation lines should be used for models calibration (TRUE) and validation (FALSE) (see BIOMOD_CrossValidation)
(if specified, nb.rep, data.split.perc and do.full.models will be ignored)
do.full.models: (optional, default TRUE)
A logical value defining whether models calibrated and evaluated over the whole dataset should be computed or not
weights: (optional, default NULL)
A vector of numeric values corresponding to observation weights (one per observation, see Details)
prevalence: (optional, default NULL)
A numeric between 0 and 1 corresponding to the species prevalence to build 'weighted response weights' (see Details)
metric.eval: a vector containing evaluation metric names to be used, must be among ROC, TSS, KAPPA, ACCURACY, BIAS, POD, FAR, POFD, SR, CSI, ETS, HK, HSS, OR, ORSS
var.import: (optional, default 0)
An integer corresponding to the number of permutations to be done for each variable to estimate variable importance
save.output: (optional, default TRUE)
A logical value defining whether all outputs should be saved on hard drive or not (! strongly recommended !)
scale.models: (optional, default FALSE)
A logical value defining whether all models predictions should be scaled with a binomial GLM or not
nb.cpu: (optional, default 1)
An integer value corresponding to the number of computing resources to be used to parallelize the single models computation
seed.val: (optional, default NULL)
An integer value corresponding to the new seed value to be set
do.progress: (optional, default TRUE)
A logical value defining whether the progress bar is to be rendered or not

Author

Wilfried Thuiller, Damien Georges, Robin Engler

Details

bm.format

If you have decided to add pseudo-absences to your original dataset (see BIOMOD_FormatingData), PA.nb.rep * (nb.rep + 1) models will be created.
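
As a quick check of the formula above, the count can be computed directly. The values below are hypothetical and chosen only for illustration.

```r
# Hypothetical values, chosen only to illustrate the formula
PA.nb.rep <- 3   # pseudo-absence datasets (set in BIOMOD_FormatingData)
nb.rep    <- 2   # calibration/validation repetitions

# Number of models given by the formula in the text
n.models <- PA.nb.rep * (nb.rep + 1)
n.models
```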

models

The set of models to be calibrated on the data. 11 modeling techniques are currently available:

GLM : Generalized Linear Model (glm)
GAM : Generalized Additive Model (gam from the gam package, or gam or bam from the mgcv package)
(see BIOMOD_ModelingOptions for details on algorithm selection)
GBM : Generalized Boosting Model, more commonly called Boosted Regression Trees (gbm)
CTA : Classification Tree Analysis (rpart)
ANN : Artificial Neural Network (nnet)
SRE : Surface Range Envelope, usually called BIOCLIM
FDA : Flexible Discriminant Analysis (fda)
MARS : Multivariate Adaptive Regression Splines (earth)
RF : Random Forest (randomForest)
MAXENT.Phillips : Maximum Entropy (https://biodiversityinformatics.amnh.org/open_source/maxent/)
MAXENT.Phillips.2 : Maximum Entropy (maxnet)

nb.rep & data.split.perc

The simplest way to calibrate and evaluate a model is to split the original dataset in two: one part to calibrate the model and the other to evaluate it. The data.split.perc argument defines the percentage of data that will be randomly selected and used for calibration; the remaining data form the evaluation part. This process is repeated nb.rep times, so that results do not depend on a single random split.
Other validation methods are also available:
- an evaluation dataset can be given directly to the BIOMOD_FormatingData function
- the data.split.table argument can be used, with a table obtained from the BIOMOD_CrossValidation function
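
The repeated random split described above can be sketched in base R. This is an illustration only, not biomod2's internal code: the number of observations, the column labels and the plain (unstratified) sampling are all assumptions. The result has the shape described for data.split.table, with one column per repetition, TRUE for calibration rows and FALSE for validation rows.

```r
# Illustrative sketch only: not biomod2's internal splitting code
set.seed(42)              # fixed seed so the split is reproducible
n.obs <- 50               # hypothetical number of observations
nb.rep <- 3               # number of calibration/validation repetitions
data.split.perc <- 80     # percentage of rows used for calibration

n.calib <- round(n.obs * data.split.perc / 100)

# One column per repetition: TRUE = calibration row, FALSE = validation row
split.table <- sapply(seq_len(nb.rep), function(i) {
  calib.rows <- sample(n.obs, n.calib)
  seq_len(n.obs) %in% calib.rows
})

# Column labels are arbitrary here (assumption, not a biomod2 requirement)
colnames(split.table) <- paste0("RUN", seq_len(nb.rep))

colSums(split.table)      # number of calibration rows in each repetition
```

Each repetition draws a different random subset, so evaluation scores can be compared across columns to see how much they depend on the split.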

weights & prevalence

More or less weight can be given to some specific observations.

If weights = prevalence = NULL, each observation (presence or absence) will have the same weight, no matter the total number of presences and absences.
If prevalence = 0.5, presences and absences will be weighted equally (i.e. the weighted sum of presences equals the weighted sum of absences).
If prevalence is set below (above) 0.5, more weight will be given to absences (presences).
If weights is defined, prevalence argument will be ignored, and each observation will have its own weight.
If pseudo-absences have been generated (PA.nb.rep > 0 in BIOMOD_FormatingData), weights are by default calculated such that prevalence = 0.5. Automatically created weights will be integer values to prevent some modeling issues.
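
One way to picture the effect of prevalence is to compute weights by hand. The sketch below assumes that presences keep a weight of 1 and that absences are rescaled so that the weighted share of presences equals the requested prevalence. The data are hypothetical, and this is not necessarily how biomod2 derives its weights (in particular, the automatically created weights are integers, which this sketch does not reproduce).

```r
# Illustrative sketch only: hypothetical data, not biomod2's internal code
resp <- c(rep(1, 20), rep(0, 80))   # 20 presences, 80 absences
prevalence <- 0.5                   # target weighted prevalence

n.pres <- sum(resp == 1)
n.abs  <- sum(resp == 0)

# Presences keep weight 1; absences are rescaled so that
# n.pres / (n.pres + w.abs * n.abs) equals the target prevalence
w.abs <- n.pres * (1 - prevalence) / (prevalence * n.abs)
obs.weights <- ifelse(resp == 1, 1, w.abs)

# Weighted share of presences, to check against the target
weighted.prev <- sum(obs.weights[resp == 1]) / sum(obs.weights)
weighted.prev
```

With prevalence = 0.5, the weighted sum of presences equals the weighted sum of absences, as stated above.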

metric.eval

ROC : Relative Operating Characteristic
KAPPA : Cohen's Kappa (Heidke skill score)
TSS : True skill statistic (Hanssen and Kuipers discriminant, Peirce's skill score)
FAR : False alarm ratio
SR : Success ratio
ACCURACY : Accuracy (fraction correct)
BIAS : Bias score (frequency bias)
POD : Probability of detection (hit rate)
CSI : Critical success index (threat score)
ETS : Equitable threat score (Gilbert skill score)

The optimal value of each metric can be obtained with the get_optim_value function. Several evaluation metrics can be selected at once. Refer to the CAWRC website (section "Methods for dichotomous forecasts") for a detailed description of each metric.
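
To make two of these metrics concrete, the sketch below computes TSS and Cohen's Kappa from a 2 x 2 confusion matrix using their standard textbook definitions. The observed and predicted vectors are hypothetical and already binary. This is not biomod2's evaluation code, which may differ in how it handles thresholds on continuous predictions.

```r
# Illustrative sketch only: hypothetical binary observations and predictions
obs  <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
pred <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1)

# Cells of the confusion matrix
tp <- sum(pred == 1 & obs == 1)   # true positives
fn <- sum(pred == 0 & obs == 1)   # false negatives
fp <- sum(pred == 1 & obs == 0)   # false positives
tn <- sum(pred == 0 & obs == 0)   # true negatives
n  <- length(obs)

# TSS = sensitivity + specificity - 1
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
tss <- sensitivity + specificity - 1

# Cohen's Kappa = (observed agreement - chance agreement) / (1 - chance agreement)
accuracy <- (tp + tn) / n
chance   <- ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n^2
kappa    <- (accuracy - chance) / (1 - chance)

c(TSS = tss, KAPPA = kappa)
```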

save.output

If this argument is set to FALSE, it may prevent the evaluation of ensemble models (see BIOMOD_EnsembleModeling) in later steps. It is strongly recommended to keep save.output = TRUE, even if this requires some free space on the hard drive.

scale.models

This parameter is experimental and its use is not recommended, as it may reduce the amplitude of the projection scale. Some categorical models always have to be scaled (FDA, ANN), but it may be useful to scale all computed models to ensure comparable predictions (0-1000 range). This can be particularly helpful in ensemble forecasting, to remove the effect of prediction scale (the wider the range of a model's projections, the more it influences ensemble forecasting results).

do.full.models

Building models with all available information may be useful in some particular cases (e.g. rare species with few presence points). However, the calibration and evaluation datasets will then be the same, which may lead to over-optimistic evaluation scores.

Examples

# Load species occurrences (6 species available)
myFile <- system.file('external/species/mammals_table.csv', package = 'biomod2')
DataSpecies <- read.csv(myFile, row.names = 1)
head(DataSpecies)

# Select the name of the studied species
myRespName <- 'GuloGulo'

# Get corresponding presence/absence data
myResp <- as.numeric(DataSpecies[, myRespName])

# Get corresponding XY coordinates
myRespXY <- DataSpecies[, c('X_WGS84', 'Y_WGS84')]

# Load environmental variables extracted from BIOCLIM (bio_3, bio_4, bio_7, bio_11 & bio_12)
myFiles <- paste0('external/bioclim/current/bio', c(3, 4, 7, 11, 12), '.grd')
myExpl <- raster::stack(system.file(myFiles, package = 'biomod2'))

# Crop the environmental layers to a smaller extent
myExtent <- raster::extent(0, 30, 45, 70)
myExpl <- raster::stack(raster::crop(myExpl, myExtent))

# ---------------------------------------------------------------
# Format Data with true absences
myBiomodData <- BIOMOD_FormatingData(resp.var = myResp,
                                     expl.var = myExpl,
                                     resp.xy = myRespXY,
                                     resp.name = myRespName)

# Create default modeling options
myBiomodOptions <- BIOMOD_ModelingOptions()


# ---------------------------------------------------------------
# Model single models
myBiomodModelOut <- BIOMOD_Modeling(bm.format = myBiomodData,
                                    modeling.id = 'AllModels',
                                    models = c('RF', 'GLM'),
                                    bm.options = myBiomodOptions,
                                    nb.rep = 2,
                                    data.split.perc = 80,
                                    metric.eval = c('TSS','ROC'),
                                    var.import = 2,
                                    do.full.models = FALSE,
                                    seed.val = 42)
myBiomodModelOut

# Get evaluation scores & variables importance
get_evaluations(myBiomodModelOut)
get_variables_importance(myBiomodModelOut, as.data.frame = TRUE)

# Represent evaluation scores 
bm_PlotEvalMean(bm.out = myBiomodModelOut)
bm_PlotEvalBoxplot(bm.out = myBiomodModelOut, group.by = c('algo', 'run'))

# Represent variables importance 
# bm_PlotVarImpBoxplot(bm.out = myBiomodModelOut, group.by = c('expl.var', 'algo', 'algo'))
# bm_PlotVarImpBoxplot(bm.out = myBiomodModelOut, group.by = c('expl.var', 'algo', 'dataset'))
# bm_PlotVarImpBoxplot(bm.out = myBiomodModelOut, group.by = c('algo', 'expl.var', 'dataset'))

# Represent response curves 
# bm_PlotResponseCurves(bm.out = myBiomodModelOut, 
#                       models.chosen = get_built_models(myBiomodModelOut)[c(1:2)],
#                       fixed.var = 'median')
# bm_PlotResponseCurves(bm.out = myBiomodModelOut, 
#                       models.chosen = get_built_models(myBiomodModelOut)[c(1:2)],
#                       fixed.var = 'min')
# bm_PlotResponseCurves(bm.out = myBiomodModelOut, 
#                       models.chosen = get_built_models(myBiomodModelOut)[3],
#                       fixed.var = 'median',
#                       do.bivariate = TRUE)
