BIOMOD_CrossValidation: Custom models cross-validation procedure

Description

This function creates a DataSplitTable that can be passed to the BIOMOD_Modeling function to evaluate models with repeated k-fold or stratified cross-validation (CV) instead of repeated split samples.

Usage

BIOMOD_CrossValidation(
  bm.format,
  k = 5,
  nb.rep = 5,
  do.stratification = FALSE,
  method = "both",
  balance = "presences",
  do.full.models = TRUE
)

Value

A DataSplitTable matrix with k * nb.rep (+ 1 if do.full.models = TRUE) columns that can be given as a parameter to the BIOMOD_Modeling function.
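
The shape of such a table can be illustrated with base R alone. The following is a minimal sketch, not the package's implementation: the helper make_split_table is hypothetical, the column names are chosen for illustration only, and it assumes that TRUE marks a row used for calibration and FALSE a row held out for validation.

```r
# Hypothetical helper: build a logical split table for repeated k-fold CV.
# Assumption: TRUE = calibration row, FALSE = validation row.
make_split_table <- function(n, k = 5, nb.rep = 5, do.full.models = TRUE) {
  cols <- list()
  for (r in seq_len(nb.rep)) {
    # Randomly assign each of the n rows to one of k folds of near-equal size
    fold <- sample(rep(seq_len(k), length.out = n))
    for (f in seq_len(k)) {
      # Rows of fold f are held out; all other rows are used for calibration
      cols[[length(cols) + 1]] <- fold != f
    }
  }
  if (do.full.models) {
    # Extra column that uses every row for calibration
    cols[[length(cols) + 1]] <- rep(TRUE, n)
  }
  tab <- do.call(cbind, cols)
  colnames(tab) <- paste0("RUN", seq_len(ncol(tab)))  # illustrative names only
  tab
}

set.seed(42)
split_tab <- make_split_table(n = 100, k = 5, nb.rep = 2)
dim(split_tab)  # 100 rows, 11 columns (5 * 2 + 1)
```

Within each repetition every row is held out exactly once, which is what distinguishes k-fold CV from repeated random split samples.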

Arguments

bm.format: a BIOMOD.formated.data-class or BIOMOD.formated.data.PA-class object returned by the BIOMOD_FormatingData function
k: an integer corresponding to the number of bins/partitions for k-fold CV
nb.rep: an integer corresponding to the number of repetitions of k-fold CV (set to 1 if do.stratification = TRUE)
do.stratification: a logical defining whether stratified CV should be run
method: a character corresponding to the CV stratification method (if do.stratification = TRUE), must be x, y, both, block or the name of a predictor for environmental stratified CV
balance: a character defining whether partitions should be balanced for presences or absences (resp. pseudo-absences or background)
do.full.models: (optional, default TRUE) a logical value defining whether models should also be calibrated and validated over the whole dataset

Author

Frank Breiner

Details

Stratified cross-validation may be used to test for model overfitting and to assess transferability in geographic and environmental space:

x and y stratification was described in Wenger and Olden 2012 (see References). While y stratification uses k partitions along the y-gradient, x stratification does the same for the x-gradient, and both combines them.
block stratification was described in Muscarella et al. 2014 (see References). Four bins of equal size are partitioned (bottom-left, bottom-right, top-left and top-right).
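
To make the geometry concrete, here is a rough base R sketch of y and block stratification on simulated coordinates. It is an illustration under assumptions, not the code used by BIOMOD_CrossValidation: the helpers strata_1d and strata_block are hypothetical, bin breaks are taken from quantiles of all points, and the block split (median on y first, then median on x within each half) is just one way to obtain four bins of equal size.

```r
# Hypothetical helper: cut a coordinate into k bins holding roughly equal numbers of points
strata_1d <- function(coord, k) {
  breaks <- quantile(coord, probs = seq(0, 1, length.out = k + 1))
  cut(coord, breaks = unique(breaks), include.lowest = TRUE, labels = FALSE)
}

# Hypothetical helper: four equal-sized blocks
strata_block <- function(x, y) {
  top <- y > median(y)
  right <- logical(length(x))
  # Split each horizontal half at its own median x so that block counts stay equal
  right[!top] <- x[!top] > median(x[!top])
  right[top] <- x[top] > median(x[top])
  # 1 = bottom-left, 2 = bottom-right, 3 = top-left, 4 = top-right
  1 + right + 2 * top
}

set.seed(1)
x <- runif(200, 0, 30)   # simulated longitudes
y <- runif(200, 45, 70)  # simulated latitudes

y_bins <- strata_1d(y, k = 4)
blocks <- strata_block(x, y)

table(y_bins)
table(blocks)

# One column per bin: FALSE marks the held-out bin (assumed convention)
split_y <- sapply(sort(unique(y_bins)), function(b) y_bins != b)
dim(split_y)
```

Replacing y with x in the call to strata_1d gives the x-gradient equivalent; stratifying on a predictor column instead of a coordinate would follow the same pattern.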

If balance = 'presences', presences are divided (balanced) equally over the partitions (e.g. Fig. 1b in Muscarella et al. 2014). Pseudo-absences will, however, be unbalanced over the partitions, especially if the presences are clumped on an edge of the study area.

If balance = 'absences', absences (resp. pseudo-absences or background) are divided (balanced) as equally as possible between the partitions (geographically balanced bins, given that absences are spread equally over the study area; an approach similar to Fig. 1 in Wenger and Olden 2012). Presences will, however, be unbalanced over the partitions, especially if the presences are clumped on an edge of the study area.
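
The trade-off between the two settings can be sketched with simulated data in base R. This is a hedged illustration rather than the package's code: it assumes that balancing amounts to taking the bin breaks from the quantiles of the chosen group, and bins_from and all object names are hypothetical.

```r
set.seed(7)
# Simulated latitudes: absences spread evenly, presences clumped near the northern edge
y_abs <- runif(300, 45, 70)
y_pres <- runif(60, 65, 70)
y_all <- c(y_pres, y_abs)
is_pres <- rep(c(TRUE, FALSE), times = c(60, 300))

# Hypothetical helper: assign all points to k bins whose breaks come from a reference group
bins_from <- function(coord, ref, k) {
  inner <- unname(quantile(ref, probs = seq(0, 1, length.out = k + 1))[2:k])
  findInterval(coord, inner) + 1
}

k <- 3
bins_pres <- bins_from(y_all, ref = y_pres, k = k)  # analogous to balance = "presences"
bins_abs <- bins_from(y_all, ref = y_abs, k = k)    # analogous to balance = "absences"

# Counts of absences (FALSE) and presences (TRUE) per bin under each choice
table(bin = bins_pres, presence = is_pres)
table(bin = bins_abs, presence = is_pres)
```

In the first table the presences are spread evenly over the bins while most absences pile up in the southernmost bin; in the second the absences are even and the clumped presences concentrate in the northernmost bin.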

References

Muscarella, R., Galante, P.J., Soley-Guardia, M., Boria, R.A., Kass, J.M., Uriarte, M. & Anderson, R.P. (2014). ENMeval: An R package for conducting spatially independent evaluations and estimating optimal model complexity for Maxent ecological niche models. Methods in Ecology and Evolution, 5, 1198-1205.
Wenger, S.J. & Olden, J.D. (2012). Assessing transferability of ecological models: an underappreciated aspect of statistical validation. Methods in Ecology and Evolution, 3, 260-267.

Examples

# Load species occurrences (6 species available)
myFile <- system.file('external/species/mammals_table.csv', package = 'biomod2')
DataSpecies <- read.csv(myFile, row.names = 1)
head(DataSpecies)

# Select the name of the studied species
myRespName <- 'GuloGulo'

# Get corresponding presence/absence data
myResp <- as.numeric(DataSpecies[, myRespName])

# Get corresponding XY coordinates
myRespXY <- DataSpecies[, c('X_WGS84', 'Y_WGS84')]

# Load environmental variables extracted from BIOCLIM (bio_3, bio_4, bio_7, bio_11 & bio_12)
myFiles <- paste0('external/bioclim/current/bio', c(3, 4, 7, 11, 12), '.grd')
myExpl <- raster::stack(system.file(myFiles, package = 'biomod2'))

# Crop the predictors to a smaller extent
myExtent <- raster::extent(0, 30, 45, 70)
myExpl <- raster::stack(raster::crop(myExpl, myExtent))

# ---------------------------------------------------------------
# Format Data with true absences
myBiomodData <- BIOMOD_FormatingData(resp.var = myResp,
                                     expl.var = myExpl,
                                     resp.xy = myRespXY,
                                     resp.name = myRespName)

# Create default modeling options
myBiomodOptions <- BIOMOD_ModelingOptions()

 
# ---------------------------------------------------------------
# Create the different validation datasets
myBiomodCV <- BIOMOD_CrossValidation(bm.format = myBiomodData)
head(myBiomodCV)

# Several validation strategies can be combined
DataSplitTable.b <- BIOMOD_CrossValidation(bm.format = myBiomodData,
                                           k = 5,
                                           nb.rep = 2,
                                           do.full.models = FALSE)
DataSplitTable.y <- BIOMOD_CrossValidation(bm.format = myBiomodData,
                                           k = 2,
                                           do.stratification = TRUE,
                                           method = "y")
colnames(DataSplitTable.y)[1:2] <- c("RUN11", "RUN12")
myBiomodCV <- cbind(DataSplitTable.b, DataSplitTable.y)
head(myBiomodCV)

# Model single models
myBiomodModelOut <- BIOMOD_Modeling(bm.format = myBiomodData,
                                    modeling.id = 'mod.CV',
                                    models = c('RF', 'GLM'),
                                    bm.options = myBiomodOptions,
                                    nb.rep = 2,
                                    data.split.table = myBiomodCV,
                                    metric.eval = c('TSS','ROC'),
                                    var.import = 3,
                                    do.full.models = FALSE,
                                    seed.val = 42)

# Get evaluation scores & variables importance
myEval <- get_evaluations(myBiomodModelOut, as.data.frame = TRUE)
myEval$CV.strategy <- "Random"
myEval$CV.strategy[grepl("13", myEval$Model.name)] <- "Full"
myEval$CV.strategy[grepl("11|12", myEval$Model.name)] <- "Stratified"
head(myEval)

boxplot(myEval$Testing.data ~ interaction(myEval$Algo, myEval$CV.strategy),
        xlab = "", ylab = "ROC AUC", col = rep(c("brown", "cadetblue"), 3))
