
BiodiversityR (version 2.7-2)

ensemble.batch: Suitability mapping based on ensembles of modelling algorithms: batch processing

Description

The main function allows for batch processing of different species and different environmental RasterStacks. The function makes internal calls to ensemble.test.splits, ensemble.test and ensemble.raster.

Usage

ensemble.batch(x = NULL, xn = c(x), ext = NULL,
    species.presence = NULL, species.absence = NULL,
    presence.min = 20,
    an = 1000, excludep = FALSE, CIRCLES.at = FALSE, CIRCLES.d = 100000,
    k.splits = 4, k.test = 0,
    n.ensembles = 1,
    SINK = FALSE,
    RASTER.format = "raster", RASTER.datatype = "INT2S", RASTER.NAflag = -32767,
    KML.out = FALSE, KML.maxpixels = 100000, KML.blur = 10,
    models.save = FALSE,
    threshold.method = "spec_sens", threshold.sensitivity = 0.9,
    threshold.PresenceAbsence = FALSE,
    ENSEMBLE.best = 0, ENSEMBLE.min = 0.7, ENSEMBLE.exponent = 1,
    ENSEMBLE.weight.min = 0.05,
    input.weights = NULL,
    MAXENT = 1, GBM = 1, GBMSTEP = 1, RF = 1, GLM = 1, GLMSTEP = 1, GAM = 1,
    GAMSTEP = 1, MGCV = 1, MGCVFIX = 0, EARTH = 1, RPART = 1, NNET = 1, FDA = 1,
    SVM = 1, SVME = 1, BIOCLIM = 1, DOMAIN = 1, MAHAL = 1,
    PROBIT = FALSE, AUC.weights = TRUE,
    Yweights = "BIOMOD",
    layer.drops = NULL, factors = NULL, dummy.vars = NULL,
    formulae.defaults = TRUE, maxit = 100,
    MAXENT.a = NULL, MAXENT.an = 10000, MAXENT.BackData = NULL,
    MAXENT.path = paste(getwd(), "/models/maxent", sep=""),
    GBM.formula = NULL, GBM.n.trees = 2001,
    GBMSTEP.gbm.x = 2:(1 + raster::nlayers(x)),
    GBMSTEP.tree.complexity = 5, GBMSTEP.learning.rate = 0.005,
    GBMSTEP.bag.fraction = 0.5, GBMSTEP.step.size = 100,
    RF.formula = NULL, RF.ntree = 751, RF.mtry = floor(sqrt(raster::nlayers(x))),
    GLM.formula = NULL, GLM.family = binomial(link = "logit"),
    GLMSTEP.steps = 1000, STEP.formula = NULL, GLMSTEP.scope = NULL, GLMSTEP.k = 2,
    GAM.formula = NULL, GAM.family = binomial(link = "logit"),
    GAMSTEP.steps = 1000, GAMSTEP.scope = NULL, GAMSTEP.pos = 1,
    MGCV.formula = NULL, MGCV.select = FALSE,
    MGCVFIX.formula = NULL,
    EARTH.formula = NULL,
    EARTH.glm = list(family = binomial(link = "logit"), maxit = maxit),
    RPART.formula = NULL, RPART.xval = 50,
    NNET.formula = NULL, NNET.size = 8, NNET.decay = 0.01,
    FDA.formula = NULL, SVM.formula = NULL, SVME.formula = NULL,
    MAHAL.shape = 1)

ensemble.mean(RASTER.species.name = "Species001", RASTER.stack.name = "base",
    positive.filters = c("grd", "_ENSEMBLE_"), negative.filters = c("xml"),
    RASTER.format = "raster", RASTER.datatype = "INT2S", RASTER.NAflag = -32767,
    KML.out = FALSE, KML.maxpixels = 100000, KML.blur = 10,
    p = NULL, a = NULL,
    pt = NULL, at = NULL,
    threshold = -1,
    threshold.method = "spec_sens", threshold.sensitivity = 0.9,
    threshold.PresenceAbsence = FALSE)

ensemble.plot(RASTER.species.name = "Species001", RASTER.stack.name = "base",
    plot.method = "suitability",
    dev.new.width = 7, dev.new.height = 7,
    main = paste(RASTER.species.name, " ", plot.method,
        " for ", RASTER.stack.name, sep=""),
    positive.filters = c("grd","_MEAN_"), negative.filters = c("xml"),
    p=NULL, a=NULL,
    threshold = -1,
    threshold.method = "spec_sens", threshold.sensitivity = 0.9,
    threshold.PresenceAbsence = FALSE,
    abs.breaks = 6, pres.breaks = 6,
    maptools.boundaries = TRUE, maptools.col = "dimgrey", ...)

Arguments

x
RasterStack object (stack) containing all layers to calibrate an ensemble.
xn
RasterStack object (stack) containing all layers that correspond to explanatory variables of an ensemble calibrated earlier with x. Several RasterStack objects can be provided in a format as c(stack1, stack2, stack3); these will be used sequentially. See also predict.
ext
an Extent object to limit the prediction to a sub-region of xn and the selection of background points to a sub-region of x, typically provided as c(lonmin, lonmax, latmin, latmax); see also predict, randomPoints and extent
species.presence
presence points used for calibrating the suitability models, available in 3-column (species, x, y) or (species, lon, lat) dataframe
species.absence
background points used for calibrating the suitability models, either available in a 3-column (species, x, y) or (species, lon, lat), or available in a 2-column (x, y) or (lon, lat) dataframe. In case of a 2-column dataframe, the same background locations will be used for all species.
presence.min
minimum number of presence locations for the organism (if smaller, no models are fitted).
an
number of background points for calibration to be selected with randomPoints in case argument a or species.absence is missing
excludep
parameter that indicates (if TRUE) that presence points will be excluded from the background points; see also randomPoints
CIRCLES.at
If TRUE, then new background points that will be used for evaluating the suitability models will be selected (randomPoints) in circular neighbourhoods (created with circles) around presence locations (p and pt).
CIRCLES.d
Radius in m of circular neighbourhoods (created with circles) around presence locations (p and pt).
k
If larger than 1, the number of groups to split between calibration (k-1) and evaluation (1) data sets (for example, k=5 results in 4/5 of presence and background points to be used for calibrating the models, and 1/5 of presence and background points to be used for evaluating the models). See also kfold.
k.splits
If larger than 1, the number of splits for the ensemble.test.splits step in batch processing. See also kfold.
k.test
If larger than 1, the number of groups to split between calibration (k-1) and evaluation (1) data sets when calibrating the final models (for example, k=5 results in 4/5 of presence and background points to be used for calibrating the models, and 1/5 of presence and background points to be used for evaluating the models). See also kfold.
n.ensembles
If larger than 1, the number of different ensembles generated per species in batch processing.
SINK
Append the results to a text file in subfolder 'outputs' (if TRUE). The name of file is based on species names. In case a file already exists, then results are appended. See also sink.
RASTER.format
Format of the raster files that will be generated. See writeFormats and writeRaster.
RASTER.datatype
Data type of the raster files that will be generated. See dataType and writeRaster.
RASTER.NAflag
Value that is used to store missing data. See writeRaster.
KML.out
if FALSE, then no kml layers (layers that can be shown in Google Earth) are produced. If TRUE, then kml files will be saved in a subfolder 'kml'.
KML.maxpixels
Maximum number of pixels for the PNG image that will be displayed in Google Earth. See also KML.
KML.blur
Integer that results in increasing the size of the PNG image by KML.blur^2, which may help avoid blurring of isolated pixels. See also KML.
models.save
Save the list with model details to a file (if TRUE). The filename will be species.name with extension .models; this file will be saved in subfolder 'models'. When loading this file, model results will be available as ensemble.models.
threshold.method
Method to calculate the threshold between predicted absence and presence; possibilities include spec_sens (highest sum of the true positive rate and the true negative rate), kappa (highest kappa value), no_omission (highest threshold that corresponds to no omission), prevalence (modeled prevalence is closest to observed prevalence) and equal_sens_spec (equal true positive rate and true negative rate). See threshold. Options specific to the BiodiversityR implementation are: threshold.mean (resulting in calculating the mean value of spec_sens, equal_sens_spec and prevalence) and threshold.min (resulting in calculating the minimum value of spec_sens, equal_sens_spec and prevalence).
threshold.sensitivity
Sensitivity value for threshold.method = 'sensitivity'. See threshold.
threshold.PresenceAbsence
If TRUE calculate thresholds with the PresenceAbsence package. See optimal.thresholds.
ENSEMBLE.best
The number of individual suitability models to be used in the consensus suitability map (based on a weighted average). In case this parameter is smaller than 1 or larger than the number of positive input weights of individual models, then all individual suitability models with positive input weights are included in the consensus suitability map. In case a vector is provided, ensemble.strategy is called internally to determine weights for the ensemble model.
ENSEMBLE.min
The minimum input weight (typically corresponding to AUC values) for a model to be included in the ensemble. In case a vector is provided, function ensemble.strategy is called internally to determine weights for the ensemble model.
ENSEMBLE.exponent
Exponent applied to AUC values to convert AUC values into weights (for example, an exponent of 2 converts input weights of 0.7, 0.8 and 0.9 into 0.7^2=0.49, 0.8^2=0.64 and 0.9^2=0.81). See details.
ENSEMBLE.weight.min
The minimum output weight for models included in the ensemble, applying to weights that sum to one. Note that ENSEMBLE.min typically refers to input AUC values.
input.weights
array with numeric values for the different modelling algorithms; if NULL then values provided by parameters such as MAXENT and GBM will be used. As an alternative, the output from ensemble.test.splits can be used.
MAXENT
Input weight for a maximum entropy model (maxent). (Only weights > 0 will be used.)
GBM
Input weight for a boosted regression trees model (gbm). (Only weights > 0 will be used.)
GBMSTEP
Input weight for a stepwise boosted regression trees model (gbm.step). (Only weights > 0 will be used.)
RF
Input weight for a random forest model (randomForest). (Only weights > 0 will be used.)
GLM
Input weight for a generalized linear model (glm). (Only weights > 0 will be used.)
GLMSTEP
Input weight for a stepwise generalized linear model (stepAIC). (Only weights > 0 will be used.)
GAM
Input weight for a generalized additive model (gam). (Only weights > 0 will be used.)
GAMSTEP
Input weight for a stepwise generalized additive model (step.gam). (Only weights > 0 will be used.)
MGCV
Input weight for a generalized additive model (gam). (Only weights > 0 will be used.)
MGCVFIX
number: if larger than 0, then a generalized additive model with fixed d.f. regression splines (gam) will be fitted as part of the ensemble
EARTH
Input weight for a multivariate adaptive regression spline model (earth). (Only weights > 0 will be used.)
RPART
Input weight for a recursive partitioning and regression tree model (rpart). (Only weights > 0 will be used.)
NNET
Input weight for an artificial neural network model (nnet). (Only weights > 0 will be used.)
FDA
Input weight for a flexible discriminant analysis model (fda). (Only weights > 0 will be used.)
SVM
Input weight for a support vector machine model (ksvm). (Only weights > 0 will be used.)
SVME
Input weight for a support vector machine model (svm). (Only weights > 0 will be used.)
BIOCLIM
Input weight for the BIOCLIM algorithm (bioclim). (Only weights > 0 will be used.)
DOMAIN
Input weight for the DOMAIN algorithm (domain). (Only weights > 0 will be used.)
MAHAL
Input weight for the Mahalanobis algorithm (mahal). (Only weights > 0 will be used.)
PROBIT
If TRUE, then after each individual algorithm (e.g. maximum entropy or GAM) has been fitted, a generalized linear model (glm) with probit link, family=binomial(link="probit"), is fitted to transform the predictions, using the previous predictions as the explanatory variable. As a result of this transformation, all model predictions are probability estimates.
AUC.weights
If TRUE, then use the average of the AUC for the different submodels in the different crossvalidation runs as weights for the 'full' ensemble model. See ensemble.test.splits for details.
Yweights
chooses how cases of presence and background (absence) are weighted; "BIOMOD" results in equal weighting of all presence and all background cases, "equal" results in equal weighting of all cases. The user can also supply a vector of weights with one value for each case in the calibration data set.
layer.drops
vector that indicates which layers should be removed from RasterStack x. See also addLayer.
factors
vector that indicates which variables are factors; see also prepareData
dummy.vars
vector that indicates which variables are dummy variables (influences formulae suggestions)
formulae.defaults
Suggest formulae for most of the models (if TRUE). See also ensemble.formulae.
maxit
Maximum number of iterations for some of the models. See also glm.control, gam.control and nnet.
MAXENT.a
background points used for calibrating the maximum entropy model (maxent), typically available in 2-column (lon, lat) dataframe; see also prepareData and extract. Ignored if MAXENT.BackData is provided.
MAXENT.an
number of background points for calibration to be selected with randomPoints in case argument MAXENT.a is missing. When used with the ensemble.batch function, the same background locations will be used for each of the species runs; this implies that for each species, presence locations are not excluded from the background data for this function.
MAXENT.BackData
dataframe containing explanatory variables for the background locations. This information will be used for calibrating the maximum entropy model (maxent). When used with the ensemble.batch function, the same background locations will be used for each of the cross-validation runs; this is based on the assumption that a large number (~10000) of background locations are used.
MAXENT.path
path to the directory where output files of the maximum entropy model are stored; see also maxent
GBM.formula
formula for the boosted regression trees algorithm; see also gbm
GBM.n.trees
total number of trees to fit for the boosted regression trees model; see also gbm
GBMSTEP.gbm.x
indices of column numbers with explanatory variables for stepwise boosted regression trees; see also gbm.step
GBMSTEP.tree.complexity
complexity of individual trees for stepwise boosted regression trees; see also gbm.step
GBMSTEP.learning.rate
weight applied to individual trees for stepwise boosted regression trees; see also gbm.step
GBMSTEP.bag.fraction
proportion of observations used in selecting variables for stepwise boosted regression trees; see also gbm.step
GBMSTEP.step.size
number of trees to add at each cycle for stepwise boosted regression trees (should be small enough to result in a smaller holdout deviance than the initial number of trees [50]); see also gbm.step
RF.formula
formula for the random forest algorithm; see also randomForest
RF.ntree
number of trees to grow for random forest algorithm; see also randomForest
RF.mtry
number of variables randomly sampled as candidates at each split for random forest algorithm; see also randomForest
GLM.formula
formula for the generalized linear model; see also glm
GLM.family
description of the error distribution and link function for the generalized linear model; see also glm
GLMSTEP.steps
maximum number of steps to be considered for stepwise generalized linear model; see also stepAIC
STEP.formula
formula for the "starting model" to be considered for stepwise generalized linear model; see also stepAIC
GLMSTEP.scope
range of models examined in the stepwise search; see also stepAIC
GLMSTEP.k
multiple of the number of degrees of freedom used for the penalty (only k = 2 gives the genuine AIC); see also stepAIC
GAM.formula
formula for the generalized additive model; see also gam
GAM.family
description of the error distribution and link function for the generalized additive model; see also gam
GAMSTEP.steps
maximum number of steps to be considered in the stepwise generalized additive model; see also step.gam
GAMSTEP.scope
range of models examined in the stepwise search in the stepwise generalized additive model; see also step.gam
GAMSTEP.pos
parameter expected to be set to 1 to allow for fitting of the stepwise generalized additive model
MGCV.formula
formula for the generalized additive model; see also gam
MGCV.select
if TRUE, then the smoothing parameter estimation that is part of fitting can completely remove terms from the model; see also gam
MGCVFIX.formula
formula for the generalized additive model with fixed d.f. regression splines; see also gam (the default formulae sets "s(..., fx=TRUE, ...)"; see also s)
EARTH.formula
formula for the multivariate adaptive regression spline model; see also earth
EARTH.glm
list of arguments to pass on to glm; see also earth
RPART.formula
formula for the recursive partitioning and regression tree model; see also rpart
RPART.xval
number of cross-validations for the recursive partitioning and regression tree model; see also rpart.control
NNET.formula
formula for the artificial neural network model; see also nnet
NNET.size
number of units in the hidden layer for the artificial neural network model; see also nnet
NNET.decay
parameter of weight decay for the artificial neural network model; see also nnet
FDA.formula
formula for the flexible discriminant analysis model; see also fda
SVM.formula
formula for the support vector machine model; see also ksvm
SVME.formula
formula for the support vector machine model; see also svm
MAHAL.shape
parameter that influences the transformation of output values of mahal. See details section.
RASTER.species.name
First part of the names of the raster files, expected to identify the modelled species (or organism).
RASTER.stack.name
Last part of the names of the raster files, expected to identify the predictor stack used.
positive.filters
vector that indicates parts of filenames for files that will be included in the calculation of the mean probability values
negative.filters
vector that indicates parts of filenames for files that will not be included in the calculation of the mean probability values
p
presence points used for calibrating the suitability models, typically available in 2-column (x, y) or (lon, lat) dataframe; see also prepareData and extract
a
background points used for calibrating the suitability models, typically available in 2-column (x, y) or (lon, lat) dataframe; see also prepareData and extract
pt
presence points used for evaluating the suitability models, typically available in 2-column (lon, lat) dataframe; see also prepareData
at
background points used for evaluating the suitability models, typically available in 2-column (lon, lat) dataframe; see also prepareData and extract
threshold
Threshold value that will be used to distinguish between presence and absence. If < 0, then a threshold value will be calculated from the provided presence p and absence a locations.
plot.method
Choice of maps to be plotted: suitability plots suitability maps, presence plots presence-absence maps and count plots count maps (count of number of algorithms or number of ensembles predicting presence).
dev.new.width
Width for new graphics device (dev.new). If < 0, then no new graphics device is opened.
dev.new.height
Height for new graphics device (dev.new). If < 0, then no new graphics device is opened.
main
main title for the plots.
abs.breaks
Number of breaks in the colouring scheme for absence (only applies to suitability mapping).
pres.breaks
Number of breaks in the colouring scheme for presence (only applies to suitability mapping).
maptools.boundaries
If TRUE, then plot approximate country boundaries (wrld_simpl)
maptools.col
Colour for approximate country boundaries plotted via wrld_simpl
...
Other items passed to function plot.
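
The way ENSEMBLE.min, ENSEMBLE.exponent and ENSEMBLE.weight.min could interact can be illustrated with a small base R sketch. The helper name sketch.weights and the exact order of its steps are assumptions made for illustration only; the package's own weighting code may differ in detail.

```r
# Hypothetical helper (not part of BiodiversityR): convert input weights
# (typically AUC values) into output weights that sum to one.
sketch.weights <- function(auc, min.input = 0.7, exponent = 1, min.output = 0.05) {
    w <- auc
    w[w < min.input] <- 0             # exclude models below the minimum input weight
    w <- w^exponent                   # larger exponents favour the better models
    if (sum(w) == 0) return(w)        # nothing left to combine
    w <- w / sum(w)                   # scale so that the weights sum to one
    w[w < min.output] <- 0            # exclude models with a negligible output weight
    if (sum(w) > 0) w <- w / sum(w)   # rescale after the exclusion
    w
}

auc <- c(MAXENT = 0.9, GBM = 0.8, GLM = 0.7, BIOCLIM = 0.6)
weights <- sketch.weights(auc, min.input = 0.7, exponent = 2)
print(round(weights, 3))
```

In this sketch BIOCLIM is excluded because its input weight is below 0.7, and the remaining squared weights (0.81, 0.64 and 0.49) are rescaled to sum to one.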

Value

The function ultimately produces ensemble raster layers for each species, including the fitted values for the ensemble model, the estimated presence-absence and the count of the number of submodels predicting presence and absence.
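
As an illustration of how a presence-absence layer can follow from suitability values, the base R sketch below picks the threshold with the highest sum of the true positive rate and the true negative rate, which is how threshold.method = "spec_sens" is described under Arguments. The helper sketch.spec.sens and the suitability values are hypothetical; the package itself refers to threshold for this calculation.

```r
# Hypothetical helper: threshold that maximises sensitivity + specificity
sketch.spec.sens <- function(pres.values, abs.values) {
    candidates <- sort(unique(c(pres.values, abs.values)))
    score <- sapply(candidates, function(t) {
        tpr <- mean(pres.values >= t)   # true positive rate (sensitivity)
        tnr <- mean(abs.values < t)     # true negative rate (specificity)
        tpr + tnr
    })
    candidates[which.max(score)]
}

# made-up suitability values at presence and background locations
pres.values <- c(0.9, 0.8, 0.75, 0.6, 0.4)
abs.values <- c(0.7, 0.5, 0.3, 0.2, 0.1, 0.05)

threshold <- sketch.spec.sens(pres.values, abs.values)
predicted.presence <- as.integer(c(pres.values, abs.values) >= threshold)
print(threshold)
print(predicted.presence)
```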

Details

This function allows for batch processing of different species and different environmental RasterStacks. The function makes internal calls to ensemble.test.splits, ensemble.test and ensemble.raster.

ensemble.test.splits runs a cross-validation procedure whereby the data set is split into calibration and testing subsets and the best weights for the ensemble model are determined (including the possibility of weights = 0).
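
The idea behind this cross-validation step can be sketched in base R for a single algorithm. The simulated data, the variable x1 and the helper sketch.auc are assumptions made for illustration; the group assignment only imitates what kfold does, and a plain glm stands in for the full set of algorithms.

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) statistic
sketch.auc <- function(pres.values, abs.values) {
    r <- rank(c(pres.values, abs.values))
    n.p <- length(pres.values)
    n.a <- length(abs.values)
    (sum(r[seq_len(n.p)]) - n.p * (n.p + 1) / 2) / (n.p * n.a)
}

set.seed(1)
n.p <- 40    # presence cases
n.a <- 200   # background cases
k <- 4       # number of splits (compare k.splits)
dat <- data.frame(pb = c(rep(1, n.p), rep(0, n.a)),
                  x1 = c(rnorm(n.p, mean = 1), rnorm(n.a, mean = 0)))

# assign presence and background cases separately to k groups of equal size
dat$group <- NA
dat$group[dat$pb == 1] <- sample(rep(seq_len(k), length.out = n.p))
dat$group[dat$pb == 0] <- sample(rep(seq_len(k), length.out = n.a))

auc <- numeric(k)
for (i in seq_len(k)) {
    train <- dat[dat$group != i, ]   # calibration data: k-1 groups
    test <- dat[dat$group == i, ]    # evaluation data: 1 group
    fit <- glm(pb ~ x1, family = binomial(link = "logit"), data = train)
    pred <- predict(fit, newdata = test, type = "response")
    auc[i] <- sketch.auc(pred[test$pb == 1], pred[test$pb == 0])
}

mean.auc <- mean(auc)   # average AUC, a candidate input weight for this algorithm
print(round(auc, 3))
```

Repeating this for every algorithm gives one average AUC per algorithm, which is the kind of value that AUC.weights = TRUE uses as weights for the final ensemble.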

ensemble.test is the step whereby models are calibrated using all the available presence data.
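
When PROBIT = TRUE, the predictions of each calibrated algorithm are transformed with a probit model, as described under Arguments. A minimal sketch of that transformation in base R, using simulated values (the objects pb and raw are hypothetical stand-ins for the presence-background response and the untransformed output of an algorithm):

```r
set.seed(2)
n <- 300
pb <- rbinom(n, size = 1, prob = 0.3)   # 1 = presence, 0 = background

# stand-in for the raw output of an individual algorithm (not a probability)
raw <- 2 * pb + rnorm(n)

# probit model with the previous predictions as the only explanatory variable
probit.fit <- glm(pb ~ raw, family = binomial(link = "probit"))
prob <- predict(probit.fit, type = "response")

print(range(raw))    # raw output is not restricted to [0, 1]
print(range(prob))   # transformed predictions are probability estimates
```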

ensemble.raster is the final step whereby raster layers are produced for the ensemble model.

Function ensemble.mean results in raster layers that are based on the summary of several ensemble layers: the new ensemble has probability values that are the mean of the probabilities of the different raster layers, the presence-absence threshold is derived for this new ensemble layer, whereas the count reflects the number of ensemble layers where presence was predicted. Note the assumption that input probabilities are scaled between 0 and 1000 (as the output from ensemble.raster), whereas thresholds are based on actual probabilities (scaled between 0 and 1).
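
The scaling issue noted above can be made concrete with a base R sketch in which small matrices stand in for raster layers. The layer values and the threshold are made up, and applying one common threshold to every layer when counting is a simplifying assumption rather than a description of the package's code.

```r
# three hypothetical ensemble layers, stored on the 0-1000 scale
layer1 <- matrix(c(900, 650, 200, 100), nrow = 2)
layer2 <- matrix(c(800, 450, 300,  50), nrow = 2)
layer3 <- matrix(c(700, 700, 100, 150), nrow = 2)
layers <- list(layer1, layer2, layer3)

# mean suitability, still on the 0-1000 scale
mean.layer <- Reduce(`+`, layers) / length(layers)

# the threshold is a probability (0-1), so rescale it before comparing
threshold <- 0.55
presence <- (mean.layer >= 1000 * threshold) * 1

# number of layers that predict presence in each cell
count <- Reduce(`+`, lapply(layers, function(l) (l >= 1000 * threshold) * 1))

print(mean.layer)
print(presence)
print(count)
```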

Function ensemble.plot plots suitability, presence-absence or count maps. In the case of suitability maps, the presence-absence threshold needs to be provided, as suitabilities smaller than the threshold will be coloured red to orange, whereas suitabilities larger than the threshold will be coloured light blue to dark blue.
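
A colouring scheme of this kind can be sketched with base R graphics utilities. The equally spaced breaks and the specific colour names below are assumptions chosen to match the description; they are not taken from the package's code.

```r
threshold <- 0.5
abs.breaks <- 4    # number of colour classes below the threshold
pres.breaks <- 6   # number of colour classes above the threshold

# class limits: abs.breaks classes up to the threshold, pres.breaks above it
breaks <- c(seq(0, threshold, length.out = abs.breaks + 1),
            seq(threshold, 1, length.out = pres.breaks + 1)[-1])

# red to orange for predicted absence, light blue to dark blue for presence
cols <- c(colorRampPalette(c("red", "orange"))(abs.breaks),
          colorRampPalette(c("lightblue", "darkblue"))(pres.breaks))

# look up the colour class for a few example suitability values
suitability <- c(0.05, 0.45, 0.55, 0.95)
class.index <- cut(suitability, breaks = breaks,
                   include.lowest = TRUE, labels = FALSE)
print(data.frame(suitability, class.index, colour = cols[class.index]))
```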

References

Buisson L, Thuiller W, Casajus N, Lek S and Grenouillet G. 2010. Uncertainty in ensemble forecasting of species distribution. Global Change Biology 16: 1145-1157

See Also

ensemble.test.splits, ensemble.test, ensemble.raster

Examples

## Not run: 
# # based on examples in the dismo package
# 
# # get predictor variables
# library(dismo)
# predictor.files <- list.files(path=paste(system.file(package="dismo"), '/ex', sep=''),
#     pattern='grd', full.names=TRUE)
# predictors <- stack(predictor.files)
# # subset based on Variance Inflation Factors
# predictors <- subset(predictors, subset=c("bio5", "bio6", 
#     "bio16", "bio17", "biome"))
# predictors
# predictors@title <- "base"
# 
# # presence points
# presence_file <- paste(system.file(package="dismo"), '/ex/bradypus.csv', sep='')
# pres <- read.table(presence_file, header=TRUE, sep=',')
# pres[,1] <- rep("Bradypus", nrow(pres))
# 
# # choose background points
# ext <- extent(-90, -32, -33, 23)
# background <- randomPoints(predictors, n=1000, ext=ext, extf = 1.00)
# 
# # north and south for new predictions (as if new climates)
# ext2 <- extent(-90, -32, 0, 23)
# predictors2 <- crop(predictors, y=ext2)
# predictors2@title <- "north"
# 
# ext3 <- extent(-90, -32, -33, 0)
# predictors3 <- crop(predictors, y=ext3)
# predictors3@title <- "south"
# 
# # fit 3 ensembles with batch processing, choosing the best ensemble model based on the 
# # average AUC of 4-fold split of calibration and testing data
# # final models use all available presence data and average weights determined by the 
# # ensemble.test.splits function (called internally)
# # batch processing can handle several species by using 3-column species.presence and 
# # species.absence data sets
# # note that these calculations can take a while
# 
# ensemble.nofactors <- ensemble.batch(x=predictors, ext=ext,
#     xn=c(predictors2, predictors3),
#     species.presence=pres, 
#     species.absence=background, 
#     k.splits=4, k.test=0, 
#     n.ensembles=3, 
#     SINK=TRUE,
#     layer.drops=c("biome"),
#     ENSEMBLE.best=0, ENSEMBLE.exponent=c(1, 2, 4, 6, 8), 
#     ENSEMBLE.min=0.7,
#     MAXENT=1, GBM=1, GBMSTEP=0, RF=1, GLM=1, GLMSTEP=1, GAM=1, GAMSTEP=0, MGCV=1, 
#     EARTH=1, RPART=1, NNET=1, FDA=1, SVM=1, SVME=1, BIOCLIM=1, DOMAIN=1, MAHAL=0,
#     Yweights="BIOMOD",
#     formulae.defaults=TRUE)
# 
# # summaries for the 3 ensembles for the species
# # summaries are based on files in folders ensemble, ensemble/presence and 
# # ensemble/count
# 
# ensemble.mean(RASTER.species.name="Bradypus", RASTER.stack.name="base",
#     p=pres, a=background, 
#     KML.out=TRUE)
# 
# # plot mean suitability 
# plot1 <- ensemble.plot(RASTER.species.name="Bradypus", RASTER.stack.name="base",
#     plot.method="suitability",
#     p=pres, a=background, abs.breaks=4, pres.breaks=9)
# plot1
# 
# ## End(Not run)
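
The example above models a single species. For several species, the input formats described under Arguments (3-column species.presence, 2-column species.absence shared by all species) could be prepared as in the following base R sketch. The species names and coordinates are made up, and the presence.min check only imitates the documented rule that no models are fitted when fewer presence locations are available.

```r
set.seed(3)

# 3-column presence data (species, x, y) for two hypothetical species
species.presence <- data.frame(
    species = c(rep("SpeciesA", 25), rep("SpeciesB", 12)),
    x = runif(37, min = -90, max = -32),
    y = runif(37, min = -33, max = 23))

# 2-column background data (x, y): the same locations serve all species
species.absence <- data.frame(
    x = runif(1000, min = -90, max = -32),
    y = runif(1000, min = -33, max = 23))

# species with fewer presence locations than presence.min would be skipped
presence.min <- 20
n.presence <- table(species.presence$species)
modelled <- names(n.presence)[n.presence >= presence.min]
print(n.presence)
print(modelled)
```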
