
BiodiversityR (version 2.0)

ensemble.test: Suitability mapping based on ensembles of modelling algorithms: comparison of different algorithms

Description

The basic function ensemble.test allows the user to evaluate different algorithms for (species) suitability modelling, including maximum entropy (MAXENT), boosted regression trees, random forests, generalized linear models (including stepwise selection of explanatory variables), generalized additive models (including stepwise selection of explanatory variables), multivariate adaptive regression splines, regression trees, artificial neural networks, flexible discriminant analysis, support vector machines, and the BIOCLIM, DOMAIN and Mahalanobis algorithms. These functions were developed in parallel with the biomod2 package, especially to include the maximum entropy algorithm, but also to allow more direct integration with the BiodiversityR package, more direct handling of model formulae and a greater focus on mapping. Researchers and students of species distribution are strongly encouraged to familiarize themselves with all the options of the BIOMOD and dismo packages.

Usage

ensemble.test(x, p, a = NULL, an = 1000, ext = NULL, k = 0, pt = NULL, at = NULL, PLOTS = TRUE, evaluations.keep = FALSE, 
    MAXENT = 1, GBM = 1, GBMSTEP = 1, RF = 1, GLM = 1, GLMSTEP = 1, 
    GAM = 1, GAMSTEP = 1, MGCV = 1, EARTH = 1, RPART = 1, NNET = 1, FDA = 1, SVM=1, BIOCLIM = 1, DOMAIN = 1, MAHAL=1, 
    Yweights = "BIOMOD", factors = NULL, formulae.defaults=TRUE,
    GBM.formula = NULL, GBM.n.trees = 3000,    
    GBMSTEP.gbm.x = 2:(1 + nlayers(x)), GBMSTEP.tree.complexity = 5, GBMSTEP.learning.rate = 0.01, GBMSTEP.bag.fraction = 0.5, 
    RF.formula = NULL, RF.ntree = 750, RF.mtry = log(nlayers(x)), 
    GLM.formula = NULL, GLM.family = binomial(link = "logit"), 
    GLMSTEP.steps = 1000, STEP.formula = NULL, GLMSTEP.scope = NULL, GLMSTEP.k = 2, 
    GAM.formula = NULL, GAM.family = binomial(link = "logit"), 
    GAMSTEP.steps = 1000, GAMSTEP.scope = NULL, 
    MGCV.formula = NULL, 
    EARTH.formula = NULL, EARTH.glm = list(family = binomial(link = "logit"), maxit=1000), 
    RPART.formula = NULL, RPART.xval = 50, 
    NNET.formula = NULL, NNET.size = 8, NNET.decay = 0.01, 
    FDA.formula = NULL, SVM.formula=NULL)

ensemble.test.splits(x, p, a = NULL, an = 1000, ext = NULL, k = 5, digits = 2, PLOTS = FALSE, SCRIPT = TRUE, 
    MAXENT = 1, GBM = 1, GBMSTEP = 1, RF = 1, GLM = 1, GLMSTEP = 1, GAM = 1, 
    GAMSTEP = 1, MGCV = 1, EARTH = 1, RPART = 1, NNET = 1, FDA = 1, SVM=1, BIOCLIM = 1, DOMAIN = 1, MAHAL=1, 
    Yweights = "BIOMOD", factors = NULL, formulae.defaults=TRUE,
    GBM.formula = NULL, GBM.n.trees = 3000,    
    GBMSTEP.gbm.x = 2:(1 + nlayers(x)), GBMSTEP.tree.complexity = 5, GBMSTEP.learning.rate = 0.01, GBMSTEP.bag.fraction = 0.5, 
    RF.formula = NULL, RF.ntree = 750, RF.mtry = log(nlayers(x)), 
    GLM.formula = NULL, GLM.family = binomial(link = "logit"), 
    GLMSTEP.steps = 1000, STEP.formula = NULL, GLMSTEP.scope = NULL, GLMSTEP.k = 2, 
    GAM.formula = NULL, GAM.family = binomial(link = "logit"), 
    GAMSTEP.steps = 1000, GAMSTEP.scope = NULL, 
    MGCV.formula = NULL, 
    EARTH.formula = NULL, EARTH.glm=list(family=binomial(link="logit"), maxit=1000), 
    RPART.formula = NULL, RPART.xval = 50, 
    NNET.formula = NULL, NNET.size = 8, NNET.decay = 0.01, 
    FDA.formula = NULL, SVM.formula=NULL)

ensemble.test.gbm(x, p, a = NULL, an = 1000, ext = NULL, k = 5, digits = 2, PLOTS = FALSE, Yweights = "BIOMOD", factors=NULL, 
    GBMSTEP.gbm.x = 2:(1 + nlayers(x)), complexity = c(3:6), learning = c(0.005, 0.002, 0.001), GBMSTEP.bag.fraction = 0.5)

ensemble.test.nnet(x, p, a = NULL, an = 1000, ext = NULL, k = 5, digits = 2, PLOTS = FALSE, Yweights = "BIOMOD", factors=NULL, 
    formulae.defaults=TRUE, NNET.formula = NULL, 
    size = c(2, 4, 6, 8), decay = c(0.1, 0.05, 0.01, 0.001))

ensemble.formulae(x, factors = NULL, dummy.vars = NULL)

Arguments

x
RasterStack object (stack) containing all layers that correspond to explanatory variables
p
presence points used for calibrating the suitability models, typically available in 2-column (lon, lat) dataframe; see also prepareData and extract
a
background points used for calibrating the suitability models, typically available in 2-column (lon, lat) dataframe; see also prepareData and extract
an
number of background points for calibration to be selected with randomPoints in case argument a is missing
ext
an Extent object to limit the selection of background points to a sub-region of x, typically provided as c(lonmin, lonmax, latmin, latmax); see also randomPoints
k
If larger than 1, the number of groups to split between calibration (k-1 groups) and evaluation (1 group) data sets (for example, k = 5 results in 4/5 of presence and background points being used for calibrating the models, and 1/5 of presence and background points being used for evaluating the models)
pt
presence points used for evaluating the suitability models, available in 2-column (lon, lat) dataframe; see also prepareData and extract
at
background points used for evaluating the suitability models, available in 2-column (lon, lat) dataframe; see also prepareData and extract
PLOTS
Plot the results of evaluation for the various suitability models (if TRUE). See also evaluate.
evaluations.keep
Keep the results of evaluations (if TRUE). See also evaluate.
MAXENT
number: if larger than 0, then a maximum entropy model (maxent) will be fitted as part of the ensemble
GBM
number: if larger than 0, then a boosted regression trees model (gbm) will be fitted as part of the ensemble
GBMSTEP
number: if larger than 0, then a stepwise boosted regression trees model (gbm.step) will be fitted as part of the ensemble
RF
number: if larger than 0, then a random forest model (randomForest) will be fitted as part of the ensemble
GLM
number: if larger than 0, then a generalized linear model (glm) will be fitted as part of the ensemble
GLMSTEP
number: if larger than 0, then a stepwise generalized linear model (stepAIC) will be fitted as part of the ensemble
GAM
number: if larger than 0, then a generalized additive model (gam) will be fitted as part of the ensemble
GAMSTEP
number: if larger than 0, then a stepwise generalized additive model (step.gam) will be fitted as part of the ensemble
MGCV
number: if larger than 0, then a generalized additive model (gam from the mgcv package) will be fitted as part of the ensemble
EARTH
number: if larger than 0, then a multivariate adaptive regression spline model (earth) will be fitted as part of the ensemble
RPART
number: if larger than 0, then a recursive partitioning and regression tree model (rpart) will be fitted as part of the ensemble
NNET
number: if larger than 0, then an artificial neural network model (nnet) will be fitted as part of the ensemble
FDA
number: if larger than 0, then a flexible discriminant analysis model (fda) will be fitted as part of the ensemble
SVM
number: if larger than 0, then a support vector machine model (ksvm) will be fitted as part of the ensemble
BIOCLIM
number: if larger than 0, then the BIOCLIM algorithm (bioclim) will be fitted as part of the ensemble
DOMAIN
number: if larger than 0, then the DOMAIN algorithm (domain) will be fitted as part of the ensemble
MAHAL
number: if larger than 0, then the Mahalanobis algorithm (mahal) will be fitted as part of the ensemble
Yweights
chooses how cases of presence and background (absence) are weighted; "BIOMOD" results in equal total weighting of all presence and all background cases, "equal" results in equal weighting of all cases. The user can supply a vector of weights.
factors
vector that indicates which variables are factors; see also prepareData
formulae.defaults
Suggest formulae for most of the models (if TRUE). See also ensemble.formulae.
GBM.formula
formula for the boosted regression trees algorithm; see also gbm
GBM.n.trees
total number of trees to fit for the boosted regression trees model; see also gbm
GBMSTEP.gbm.x
indices of column numbers with explanatory variables for stepwise boosted regression trees; see also gbm.step
GBMSTEP.tree.complexity
complexity of individual trees for stepwise boosted regression trees; see also gbm.step
GBMSTEP.learning.rate
weight applied to individual trees for stepwise boosted regression trees; see also gbm.step
GBMSTEP.bag.fraction
proportion of observations used in selecting variables for stepwise boosted regression trees; see also gbm.step
RF.formula
formula for random forest algorithm; see also randomForest
RF.ntree
number of trees to grow for random forest algorithm; see also randomForest
RF.mtry
number of variables randomly sampled as candidates at each split for random forest algorithm; see also randomForest
GLM.formula
formula for the generalized linear model; see also glm
GLM.family
description of the error distribution and link function for the generalized linear model; see also glm
GLMSTEP.steps
maximum number of steps to be considered for stepwise generalized linear model; see also stepAIC
STEP.formula
formula for the "starting model" to be considered for stepwise generalized linear model; see also stepAIC
GLMSTEP.scope
range of models examined in the stepwise search; see also stepAIC
GLMSTEP.k
multiple of the number of degrees of freedom used for the penalty (only k = 2 gives the genuine AIC); see also stepAIC
GAM.formula
formula for the generalized additive model; see also gam
GAM.family
description of the error distribution and link function for the generalized additive model; see also gam
GAMSTEP.steps
maximum number of steps to be considered in the stepwise generalized additive model; see also step.gam
GAMSTEP.scope
range of models examined in the stepwise search of the stepwise generalized additive model; see also step.gam
MGCV.formula
formula for the generalized additive model; see also gam
EARTH.formula
formula for the multivariate adaptive regression spline model; see also earth
EARTH.glm
list of arguments to pass on to glm; see also earth
RPART.formula
formula for the recursive partitioning and regression tree model; see also rpart
RPART.xval
number of cross-validations for the recursive partitioning and regression tree model; see also rpart.control
NNET.formula
formula for the artificial neural network model; see also nnet
NNET.size
number of units in the hidden layer for the artificial neural network model; see also nnet
NNET.decay
parameter of weight decay for the artificial neural network model; see also nnet
FDA.formula
formula for the flexible discriminant analysis model; see also fda
SVM.formula
formula for the support vector machine model; see also ksvm
digits
number of decimal places after multiplying results by 100 (percentages); see also round
SCRIPT
provide suggested parameters (weights) for the ensemble.raster function (if TRUE)
complexity
vector with values of complexity of individual trees (tree.complexity) for boosted regression trees; see also gbm.step
learning
vector with values of weights applied to individual trees (learning.rate) for boosted regression trees; see also gbm.step
size
vector with values of number of units in the hidden layer for the artificial neural network model; see also nnet
decay
vector with values of weight decay for the artificial neural network model; see also nnet
dummy.vars
vector that indicates which variables are dummy variables
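The "BIOMOD" option for Yweights described above can be sketched in base R: each class of cases (presence and background) ends up carrying the same total weight, so an excess of background points does not dominate model calibration. This is a minimal sketch of the behaviour as described, not the package's internal implementation, and the function name biomod.weights is hypothetical.

```r
# Sketch of "BIOMOD"-style case weighting (hypothetical helper; the
# internal implementation in the package may differ).
# pb is the response vector: 1 for presence, 0 for background.
biomod.weights <- function(pb) {
    np <- sum(pb == 1)        # number of presence points
    na <- sum(pb == 0)        # number of background points
    w <- numeric(length(pb))
    w[pb == 1] <- 1           # each presence case gets weight 1
    w[pb == 0] <- np / na     # background scaled so both classes sum to np
    w
}

pb <- c(rep(1, 100), rep(0, 1000))
w <- biomod.weights(pb)
sum(w[pb == 1]) == sum(w[pb == 0])   # both classes carry equal total weight
```

With 100 presences and 1000 background points, each background case receives weight 0.1, so both classes contribute a total weight of 100.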

Value

  • Function ensemble.test (potentially) returns a list with results from evaluations (via evaluate) of calibration and test runs of the individual suitability models. Function ensemble.test.splits returns a matrix with, for each individual suitability model, the AUC of each run and the average AUC over all runs. Models are sorted by their average AUC. The average AUC for each model can be used as input weights for the ensemble.raster function (these weights are provided when the SCRIPT option is selected). Functions ensemble.test.gbm and ensemble.test.nnet return a matrix with, for each combination of model parameters, the AUC of each run and the average AUC. Models are sorted by their average AUC.

Details

The basic function ensemble.test first calibrates the individual suitability models based on presence locations p and background locations a, then evaluates these suitability models based on presence locations pt and background locations at. While calibrating and testing individual models, results obtained via the evaluate function are shown in the GUI, and possibly plotted (PLOTS) or saved (evaluations.keep). Function ensemble.test.splits splits the presence and background locations into a user-defined (k) number of subsets, then sequentially calibrates the individual suitability models with (k-1) combined subsets and evaluates them with the remaining subset, whereby each subset is used once for evaluation in the user-defined number (k) of runs. For example, k = 5 results in splitting the locations into 5 subsets, then using each of these subsets in turn for evaluation (see also kfold). Function ensemble.test.gbm can be used to test various combinations of the parameters tree.complexity and learning.rate for the gbm.step model. Function ensemble.test.nnet can be used to test various combinations of the parameters size and decay for the nnet model. Function ensemble.formulae provides suggestions for formulae that can be used with ensemble.test and ensemble.raster.
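The k-fold splitting described above can be sketched in base R (the function actually used is kfold from the dismo package; kfold.sketch below is a hypothetical minimal version that assigns each row to one of k groups at random):

```r
# Minimal sketch of k-fold group assignment (the real function is dismo::kfold;
# kfold.sketch is a hypothetical stand-in for illustration only).
kfold.sketch <- function(n, k = 5) {
    sample(rep(seq_len(k), length.out = n))   # shuffle a balanced 1..k labelling
}

set.seed(1)
groups <- kfold.sketch(25, k = 5)
table(groups)                  # five groups of five rows each
calib <- which(groups != 1)    # rows for calibration (k-1 subsets combined)
evalu <- which(groups == 1)    # rows held out for evaluation (1 subset)
```

Repeating the calibration/evaluation step with each group held out in turn gives the k runs whose AUC values ensemble.test.splits averages.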

References

Buisson L, Thuiller W, Casajus N, Lek S and Grenouillet G. 2010. Uncertainty in ensemble forecasting of species distribution. Global Change Biology 16: 1145-1157

See Also

ensemble.raster

Examples

# based on examples in the dismo package
# simplified example of ensemble modelling with 4 modeling algorithms

# get predictor variables
library(dismo)
predictors <- stack(list.files(path=paste(system.file(package="dismo"), '/ex', sep=''), pattern='grd', full.names=TRUE ))
# predictors

# presence points
presence <- paste(system.file(package="dismo"), '/ex/bradypus.csv', sep='')
pres <- read.table(presence, header=TRUE, sep=',')[,-1]

# the kfold function randomly assigns data to groups; subdivide into calibration and evaluation data
groupp <- kfold(pres, 5)
pres_train <- pres[groupp != 1, ]
pres_test <- pres[groupp == 1, ]

# choose background points
ext <- extent(-90, -32, -33, 23)
background <- randomPoints(predictors, n=1000, ext=ext, extf=1.00)
colnames(background) <- c('lon', 'lat')
groupa <- kfold(background, 5)
backg_train <- background[groupa != 1, ]
backg_test <- background[groupa == 1, ]

# formulae for random forest and generalized linear model
# compare with: ensemble.formulae(predictors, factors=c("biome"))

rfformula <- as.formula(pb ~ bio1+bio5+bio6+bio7+bio8+bio12+bio16+bio17)

glmformula <- as.formula(pb ~ bio1 + I(bio1^2) + I(bio1^3) +
    bio5 + I(bio5^2) + I(bio5^3) + bio6 + I(bio6^2) + I(bio6^3) + 
    bio7 + I(bio7^2) + I(bio7^3) + bio8 + I(bio8^2) + I(bio8^3) +
    bio12 + I(bio12^2) + I(bio12^3) + bio16 + I(bio16^2) + I(bio16^3) + 
    bio17 + I(bio17^2) + I(bio17^3) )
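# The hand-written cubic formula above can also be built programmatically,
# roughly in the spirit of ensemble.formulae (a sketch only; the actual
# formulae suggested by ensemble.formulae may be formatted differently):

```r
# Build the cubic GLM formula from a vector of variable names
vars <- c("bio1", "bio5", "bio6", "bio7", "bio8", "bio12", "bio16", "bio17")
terms <- unlist(lapply(vars, function(v)
    c(v, paste0("I(", v, "^2)"), paste0("I(", v, "^3)"))))
glmformula2 <- as.formula(paste("pb ~", paste(terms, collapse = " + ")))
```

# glmformula2 contains the same 8 explanatory variables (plus the response pb)
# as the hand-written glmformula.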

# fit four ensemble models (RF, GLM, BIOCLIM, DOMAIN)
ensemble.nofactors <- ensemble.test(x=predictors, p=pres_train, a=backg_train, pt=pres_test, at=backg_test,
    MAXENT=0, GBM=0, GBMSTEP=0, RF=1, GLM=1, GLMSTEP=0, GAM=0, GAMSTEP=0, MGCV=0, 
    EARTH=0, RPART=0, NNET=0, FDA=0, SVM=0, BIOCLIM=1, DOMAIN=1, MAHAL=0,
    Yweights="BIOMOD", factors=NULL,
    PLOTS=FALSE, evaluations.keep=TRUE,
    RF.formula=rfformula,
    GLM.formula=glmformula)

# test performance of different suitability models; data are split in 4 subsets, each used once for evaluation
# GAMSTEP option is not available at the moment (problems with assignments)
ensemble.nofactors2 <- ensemble.test.splits(x=predictors, p=pres, a=background, k=4, 
    MAXENT=1, GBM=1, GBMSTEP=1, RF=1, GLM=1, GLMSTEP=1, GAM=1, GAMSTEP=0, MGCV=1, 
    EARTH=1, RPART=1, NNET=1, FDA=1, SVM=1, BIOCLIM=1, DOMAIN=1, MAHAL=1,
    Yweights="BIOMOD", factors=NULL,
    PLOTS=FALSE, formulae.defaults=TRUE,
    GBMSTEP.learning.rate=0.002)
ensemble.nofactors2
