
BiodiversityR (version 2.0)

ensemble.test: Suitability mapping based on ensembles of modelling algorithms: comparison of different algorithms

Description

The basic function ensemble.test allows the user to evaluate different algorithms for (species) suitability modelling, including maximum entropy (MAXENT), boosted regression trees, random forests, generalized linear models (including stepwise selection of explanatory variables), generalized additive models (including stepwise selection of explanatory variables), multivariate adaptive regression splines, regression trees, artificial neural networks, flexible discriminant analysis, support vector machines, and the BIOCLIM, DOMAIN and Mahalanobis algorithms. These functions were developed in parallel with the biomod2 package, especially to include the maximum entropy algorithm, but also to allow more direct integration with the BiodiversityR package, more direct handling of model formulae and a greater focus on mapping. Researchers and students of species distribution are strongly encouraged to familiarize themselves with all the options of the BIOMOD and dismo packages.

Usage

ensemble.test(x, p, a = NULL, an = 1000, ext = NULL, k = 0, pt = NULL, at = NULL, PLOTS = TRUE, evaluations.keep = FALSE, 
    MAXENT = 1, GBM = 1, GBMSTEP = 1, RF = 1, GLM = 1, GLMSTEP = 1, 
    GAM = 1, GAMSTEP = 1, MGCV = 1, EARTH = 1, RPART = 1, NNET = 1, FDA = 1, SVM=1, BIOCLIM = 1, DOMAIN = 1, MAHAL=1, 
    Yweights = "BIOMOD", factors = NULL, formulae.defaults=TRUE,
    GBM.formula = NULL, GBM.n.trees = 3000,    
    GBMSTEP.gbm.x = 2:(1 + nlayers(x)), GBMSTEP.tree.complexity = 5, GBMSTEP.learning.rate = 0.01, GBMSTEP.bag.fraction = 0.5, 
    RF.formula = NULL, RF.ntree = 750, RF.mtry = log(nlayers(x)), 
    GLM.formula = NULL, GLM.family = binomial(link = "logit"), 
    GLMSTEP.steps = 1000, STEP.formula = NULL, GLMSTEP.scope = NULL, GLMSTEP.k = 2, 
    GAM.formula = NULL, GAM.family = binomial(link = "logit"), 
    GAMSTEP.steps = 1000, GAMSTEP.scope = NULL, 
    MGCV.formula = NULL, 
    EARTH.formula = NULL, EARTH.glm = list(family = binomial(link = "logit"), maxit=1000), 
    RPART.formula = NULL, RPART.xval = 50, 
    NNET.formula = NULL, NNET.size = 8, NNET.decay = 0.01, 
    FDA.formula = NULL, SVM.formula=NULL)

ensemble.test.splits(x, p, a = NULL, an = 1000, ext = NULL, k = 5, digits = 2, PLOTS = FALSE, SCRIPT = TRUE, 
    MAXENT = 1, GBM = 1, GBMSTEP = 1, RF = 1, GLM = 1, GLMSTEP = 1, GAM = 1, 
    GAMSTEP = 1, MGCV = 1, EARTH = 1, RPART = 1, NNET = 1, FDA = 1, SVM=1, BIOCLIM = 1, DOMAIN = 1, MAHAL=1, 
    Yweights = "BIOMOD", factors = NULL, formulae.defaults=TRUE,
    GBM.formula = NULL, GBM.n.trees = 3000,    
    GBMSTEP.gbm.x = 2:(1 + nlayers(x)), GBMSTEP.tree.complexity = 5, GBMSTEP.learning.rate = 0.01, GBMSTEP.bag.fraction = 0.5, 
    RF.formula = NULL, RF.ntree = 750, RF.mtry = log(nlayers(x)), 
    GLM.formula = NULL, GLM.family = binomial(link = "logit"), 
    GLMSTEP.steps = 1000, STEP.formula = NULL, GLMSTEP.scope = NULL, GLMSTEP.k = 2, 
    GAM.formula = NULL, GAM.family = binomial(link = "logit"), 
    GAMSTEP.steps = 1000, GAMSTEP.scope = NULL, 
    MGCV.formula = NULL, 
    EARTH.formula = NULL, EARTH.glm=list(family=binomial(link="logit"), maxit=1000), 
    RPART.formula = NULL, RPART.xval = 50, 
    NNET.formula = NULL, NNET.size = 8, NNET.decay = 0.01, 
    FDA.formula = NULL, SVM.formula=NULL)

ensemble.test.gbm(x, p, a = NULL, an = 1000, ext = NULL, k = 5, digits = 2, PLOTS = FALSE, Yweights = "BIOMOD", factors=NULL, 
    GBMSTEP.gbm.x = 2:(1 + nlayers(x)), complexity = c(3:6), learning = c(0.005, 0.002, 0.001), GBMSTEP.bag.fraction = 0.5)

ensemble.test.nnet(x, p, a = NULL, an = 1000, ext = NULL, k = 5, digits = 2, PLOTS = FALSE, Yweights = "BIOMOD", factors=NULL, 
    formulae.defaults=TRUE, NNET.formula = NULL, 
    size = c(2, 4, 6, 8), decay = c(0.1, 0.05, 0.01, 0.001))

ensemble.formulae(x, factors = NULL, dummy.vars = NULL)

Arguments

x
RasterStack object (stack) containing all layers that correspond to explanatory variables
p
presence points used for calibrating the suitability models, typically available in 2-column (lon, lat) dataframe; see also prepareData and extract
a
background points used for calibrating the suitability models, typically available in 2-column (lon, lat) dataframe; see also prepareData and extract
an
number of background points for calibration to be selected with randomPoints in case argument a is missing
ext
an Extent object to limit the selection of background points to a sub-region of x, typically provided as c(lonmin, lonmax, latmin, latmax); see also randomPoints
k
If larger than 1, the number of groups to split between calibration (k-1 groups) and evaluation (1 group) data sets (for example, k = 5 results in 4/5 of presence and background points being used for calibrating the models, and 1/5 of presence and background points being used for evaluating the models)
pt
presence points used for evaluating the suitability models, available in 2-column (lon, lat) dataframe; see also prepareData and extract
at
background points used for evaluating the suitability models, available in 2-column (lon, lat) dataframe; see also prepareData and extract
PLOTS
Plot the results of evaluation for the various suitability models (if TRUE). See also evaluate.
evaluations.keep
Keep the results of evaluations (if TRUE). See also evaluate.
MAXENT
number: if larger than 0, then a maximum entropy model (maxent) will be fitted as part of the ensemble
GBM
number: if larger than 0, then a boosted regression trees model (gbm) will be fitted as part of the ensemble
GBMSTEP
number: if larger than 0, then a stepwise boosted regression trees model (gbm.step) will be fitted as part of the ensemble
RF
number: if larger than 0, then a random forest model (randomForest) will be fitted as part of the ensemble
GLM
number: if larger than 0, then a generalized linear model (glm) will be fitted as part of the ensemble
GLMSTEP
number: if larger than 0, then a stepwise generalized linear model (stepAIC) will be fitted as part of the ensemble
GAM
number: if larger than 0, then a generalized additive model (gam) will be fitted as part of the ensemble
GAMSTEP
number: if larger than 0, then a stepwise generalized additive model (step.gam) will be fitted as part of the ensemble
MGCV
number: if larger than 0, then a generalized additive model (gam from the mgcv package) will be fitted as part of the ensemble
EARTH
number: if larger than 0, then a multivariate adaptive regression spline model (earth) will be fitted as part of the ensemble
RPART
number: if larger than 0, then a recursive partitioning and regression tree model (rpart) will be fitted as part of the ensemble
NNET
number: if larger than 0, then an artificial neural network model (nnet) will be fitted as part of the ensemble
FDA
number: if larger than 0, then a flexible discriminant analysis model (fda) will be fitted as part of the ensemble
SVM
number: if larger than 0, then a support vector machine model (ksvm) will be fitted as part of the ensemble
BIOCLIM
number: if larger than 0, then the BIOCLIM algorithm (bioclim) will be fitted as part of the ensemble
DOMAIN
number: if larger than 0, then the DOMAIN algorithm (domain) will be fitted as part of the ensemble
MAHAL
number: if larger than 0, then the Mahalanobis algorithm (mahal) will be fitted as part of the ensemble
Yweights
chooses how cases of presence and background (absence) are weighted; "BIOMOD" results in equal total weighting of all presence and all background cases, "equal" results in equal weighting of all cases. The user can supply a vector of weights.
factors
vector that indicates which variables are factors; see also prepareData
formulae.defaults
Suggest formulae for most of the models (if TRUE). See also ensemble.formulae.
GBM.formula
formula for the boosted regression trees algorithm; see also gbm
GBM.n.trees
total number of trees to fit for the boosted regression trees model; see also gbm
GBMSTEP.gbm.x
indices of column numbers with explanatory variables for stepwise boosted regression trees; see also gbm.step
GBMSTEP.tree.complexity
complexity of individual trees for stepwise boosted regression trees; see also gbm.step
GBMSTEP.learning.rate
weight applied to individual trees for stepwise boosted regression trees; see also gbm.step
GBMSTEP.bag.fraction
proportion of observations used in selecting variables for stepwise boosted regression trees; see also gbm.step
RF.formula
formula for random forest algorithm; see also randomForest
RF.ntree
number of trees to grow for random forest algorithm; see also randomForest
RF.mtry
number of variables randomly sampled as candidates at each split for random forest algorithm; see also randomForest
GLM.formula
formula for the generalized linear model; see also glm
GLM.family
description of the error distribution and link function for the generalized linear model; see also glm
GLMSTEP.steps
maximum number of steps to be considered for stepwise generalized linear model; see also stepAIC
STEP.formula
formula for the "starting model" to be considered for stepwise generalized linear model; see also stepAIC
GLMSTEP.scope
range of models examined in the stepwise search; see also stepAIC
GLMSTEP.k
multiple of the number of degrees of freedom used for the penalty (only k = 2 gives the genuine AIC); see also stepAIC
GAM.formula
formula for the generalized additive model; see also gam
GAM.family
description of the error distribution and link function for the generalized additive model; see also gam
GAMSTEP.steps
maximum number of steps to be considered in the stepwise generalized additive model; see also step.gam
GAMSTEP.scope
range of models examined in the stepwise search of the stepwise generalized additive model; see also step.gam
MGCV.formula
formula for the generalized additive model; see also gam
EARTH.formula
formula for the multivariate adaptive regression spline model; see also earth
EARTH.glm
list of arguments to pass on to glm; see also earth
RPART.formula
formula for the recursive partitioning and regression tree model; see also rpart
RPART.xval
number of cross-validations for the recursive partitioning and regression tree model; see also rpart.control
NNET.formula
formula for the artificial neural network model; see also nnet
NNET.size
number of units in the hidden layer for the artificial neural network model; see also nnet
NNET.decay
parameter of weight decay for the artificial neural network model; see also nnet
FDA.formula
formula for the flexible discriminant analysis model; see also fda
SVM.formula
formula for the support vector machine model; see also ksvm
digits
number of decimal places after multiplying results by 100 (percentages); see also round
SCRIPT
provide suggested parameters (weights) for the ensemble.raster function (if TRUE)
complexity
vector with values of complexity of individual trees (tree.complexity) for boosted regression trees; see also gbm.step
learning
vector with values of weights applied to individual trees (learning.rate) for boosted regression trees; see also gbm.step
size
vector with values of number of units in the hidden layer for the artificial neural network model; see also nnet
decay
vector with values of weight decay for the artificial neural network model; see also nnet
dummy.vars
vector that indicates which variables are dummy variables
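The "BIOMOD" option for Yweights described above can be sketched in base R: each class of cases (presence and background) ends up carrying the same total weight, so an excess of background points does not dominate model calibration. This is a minimal sketch of the behaviour as described, not the package's internal implementation, and the function name biomod.weights is hypothetical.

```r
# Sketch of "BIOMOD"-style case weighting (hypothetical helper; the
# internal implementation in the package may differ).
# pb is the response vector: 1 for presence, 0 for background.
biomod.weights <- function(pb) {
    np <- sum(pb == 1)        # number of presence points
    na <- sum(pb == 0)        # number of background points
    w <- numeric(length(pb))
    w[pb == 1] <- 1           # each presence case gets weight 1
    w[pb == 0] <- np / na     # background scaled so both classes sum to np
    w
}

pb <- c(rep(1, 100), rep(0, 1000))
w <- biomod.weights(pb)
sum(w[pb == 1]) == sum(w[pb == 0])   # both classes carry equal total weight
```

With 100 presences and 1000 background points, each background case receives weight 0.1, so both classes contribute a total weight of 100.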

Value

  • Function ensemble.test (potentially) returns a list with results from evaluations (via evaluate) of calibration and test runs of the individual suitability models. Function ensemble.test.splits returns a matrix with, for each individual suitability model, the AUC of each run and the average AUC over all runs. Models are sorted by their average AUC. The average AUC for each model can be used as input weights for the ensemble.raster function (these weights are provided when the SCRIPT option is selected). Functions ensemble.test.gbm and ensemble.test.nnet return a matrix with, for each combination of model parameters, the AUC of each run and the average AUC. Models are sorted by their average AUC.

Details

The basic function ensemble.test first calibrates the individual suitability models based on presence locations p and background locations a, then evaluates these suitability models based on presence locations pt and background locations at. While calibrating and testing individual models, results obtained via the evaluate function are shown in the GUI, and possibly plotted (PLOTS) or saved (evaluations.keep). Function ensemble.test.splits splits the presence and background locations into a user-defined (k) number of subsets, then sequentially calibrates the individual suitability models with (k-1) combined subsets and evaluates them with the remaining subset, whereby each subset is used once for evaluation in the user-defined number (k) of runs. For example, k = 5 results in splitting the locations into 5 subsets, then using each of these subsets in turn for evaluation (see also kfold). Function ensemble.test.gbm can be used to test various combinations of the parameters tree.complexity and learning.rate for the gbm.step model. Function ensemble.test.nnet can be used to test various combinations of the parameters size and decay for the nnet model. Function ensemble.formulae provides suggestions for formulae that can be used with ensemble.test and ensemble.raster.
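The k-fold splitting described above can be sketched in base R (the function actually used is kfold from the dismo package; kfold.sketch below is a hypothetical minimal version that assigns each row to one of k groups at random):

```r
# Minimal sketch of k-fold group assignment (the real function is dismo::kfold;
# kfold.sketch is a hypothetical stand-in for illustration only).
kfold.sketch <- function(n, k = 5) {
    sample(rep(seq_len(k), length.out = n))   # shuffle a balanced 1..k labelling
}

set.seed(1)
groups <- kfold.sketch(25, k = 5)
table(groups)                  # five groups of five rows each
calib <- which(groups != 1)    # rows for calibration (k-1 subsets combined)
evalu <- which(groups == 1)    # rows held out for evaluation (1 subset)
```

Repeating the calibration/evaluation step with each group held out in turn gives the k runs whose AUC values ensemble.test.splits averages.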

References

Buisson L, Thuiller W, Casajus N, Lek S and Grenouillet G. 2010. Uncertainty in ensemble forecasting of species distribution. Global Change Biology 16: 1145-1157

See Also

ensemble.raster

Examples

# based on examples in the dismo package
# simplified example of ensemble modelling with 4 modeling algorithms

# get predictor variables
library(dismo)
predictors <- stack(list.files(path=paste(system.file(package="dismo"), '/ex', sep=''), pattern='grd', full.names=TRUE ))
# predictors

# presence points
presence <- paste(system.file(package="dismo"), '/ex/bradypus.csv', sep='')
pres <- read.table(presence, header=TRUE, sep=',')[,-1]

# the kfold function randomly assigns data to groups; subdivide into calibration and evaluation data
groupp <- kfold(pres, 5)
pres_train <- pres[groupp != 1, ]
pres_test <- pres[groupp == 1, ]

# choose background points
ext <- extent(-90, -32, -33, 23)
background <- randomPoints(predictors, n=1000, ext=ext, extf=1.00)
colnames(background) <- c('lon', 'lat')
groupa <- kfold(background, 5)
backg_train <- background[groupa != 1, ]
backg_test <- background[groupa == 1, ]

# formulae for random forest and generalized linear model
# compare with: ensemble.formulae(predictors, factors=c("biome"))

rfformula <- as.formula(pb ~ bio1+bio5+bio6+bio7+bio8+bio12+bio16+bio17)

glmformula <- as.formula(pb ~ bio1 + I(bio1^2) + I(bio1^3) +
    bio5 + I(bio5^2) + I(bio5^3) + bio6 + I(bio6^2) + I(bio6^3) + 
    bio7 + I(bio7^2) + I(bio7^3) + bio8 + I(bio8^2) + I(bio8^3) +
    bio12 + I(bio12^2) + I(bio12^3) + bio16 + I(bio16^2) + I(bio16^3) + 
    bio17 + I(bio17^2) + I(bio17^3) )
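# The hand-written cubic formula above can also be built programmatically,
# roughly in the spirit of ensemble.formulae (a sketch only; the actual
# formulae suggested by ensemble.formulae may be formatted differently):

```r
# Build the cubic GLM formula from a vector of variable names
vars <- c("bio1", "bio5", "bio6", "bio7", "bio8", "bio12", "bio16", "bio17")
terms <- unlist(lapply(vars, function(v)
    c(v, paste0("I(", v, "^2)"), paste0("I(", v, "^3)"))))
glmformula2 <- as.formula(paste("pb ~", paste(terms, collapse = " + ")))
```

# glmformula2 contains the same 8 explanatory variables (plus the response pb)
# as the hand-written glmformula.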

# fit four ensemble models (RF, GLM, BIOCLIM, DOMAIN)
ensemble.nofactors <- ensemble.test(x=predictors, p=pres_train, a=backg_train, pt=pres_test, at=backg_test,
    MAXENT=0, GBM=0, GBMSTEP=0, RF=1, GLM=1, GLMSTEP=0, GAM=0, GAMSTEP=0, MGCV=0, 
    EARTH=0, RPART=0, NNET=0, FDA=0, SVM=0, BIOCLIM=1, DOMAIN=1, MAHAL=0,
    Yweights="BIOMOD", factors=NULL,
    PLOTS=FALSE, evaluations.keep=TRUE,
    RF.formula=rfformula,
    GLM.formula=glmformula)

# test performance of different suitability models; data are split in 4 subsets, each used once for evaluation
# GAMSTEP option is not available at the moment (problems with assignments)
ensemble.nofactors2 <- ensemble.test.splits(x=predictors, p=pres, a=background, k=4, 
    MAXENT=1, GBM=1, GBMSTEP=1, RF=1, GLM=1, GLMSTEP=1, GAM=1, GAMSTEP=0, MGCV=1, 
    EARTH=1, RPART=1, NNET=1, FDA=1, SVM=1, BIOCLIM=1, DOMAIN=1, MAHAL=1,
    Yweights="BIOMOD", factors=NULL,
    PLOTS=FALSE, formulae.defaults=TRUE,
    GBMSTEP.learning.rate=0.002)
ensemble.nofactors2
