cv_similarity: Compute similarity measures to evaluate possible extrapolation in testing folds

Description

This function evaluates environmental similarity between training and testing folds, helping to detect potential extrapolation in the testing data. It supports three similarity measures: Multivariate Environmental Similarity Surface (MESS), Manhattan distance (L1), and Euclidean distance (L2).

Usage

cv_similarity(
  cv,
  x,
  r,
  num_plot = seq_along(cv$folds_list),
  method = "MESS",
  num_sample = 10000L,
  jitter_width = 0.1,
  points_size = 2,
  points_alpha = 0.7,
  points_colors = NULL,
  progress = TRUE
)

Value

a ggplot object

Arguments

cv: a blockCV cv_* object; one of cv_spatial, cv_cluster, cv_buffer or cv_nndm.
x: a simple features (sf) or SpatialPoints object of the spatial sample data used for creating the cv object.
r: a terra SpatRaster object of the environmental predictors that are going to be used for modelling. This is used to calculate similarity between the training and testing points.
num_plot: a vector of indices of folds.
method: the similarity method; one of MESS, L1 or L2. See the Details section.
num_sample: number of random samples from raster to calculate similarity distances (only for L1 and L2).
jitter_width: numeric; the width of jitter points.
points_size: numeric; the size of points.
points_alpha: numeric; the opacity of points.
points_colors: character; a character vector of colours for points.
progress: logical; whether to show a progress bar for random fold selection.

Details

The MESS is calculated as described in Elith et al. (2010). MESS represents how similar a point in a testing fold is to a training fold (as a reference set of points), with respect to a set of predictor variables in r. Negative values mark sites where at least one variable has a value outside the range of environments in the reference set, i.e. novel environments.
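
To make the sign convention concrete, here is a minimal base R sketch of the per-variable rule from Elith et al. (2010). It is an illustration only, not the code blockCV runs internally; the helper names mess_one_var and mess_point are hypothetical, and the sketch assumes every reference variable has a non-zero range.

```r
# Hypothetical helpers for illustration; not part of blockCV.

# Similarity of one value `p` to the reference values `ref` of one variable.
mess_one_var <- function(p, ref) {
  f <- 100 * mean(ref < p)        # percentage of reference values below p
  rng <- max(ref) - min(ref)      # assumed to be non-zero
  if (f == 0) {
    (p - min(ref)) / rng * 100    # at or below the reference minimum
  } else if (f <= 50) {
    2 * f
  } else if (f < 100) {
    2 * (100 - f)
  } else {
    (max(ref) - p) / rng * 100    # above the reference maximum (negative)
  }
}

# MESS of a point is the minimum similarity over all variables.
# `point` is a numeric vector; `ref` is a matrix with one column per variable.
mess_point <- function(point, ref) {
  min(vapply(seq_along(point),
             function(j) mess_one_var(point[j], ref[, j]),
             numeric(1)))
}

# Toy reference set with two variables
ref <- cbind(temp = 1:10, rain = seq(10, 100, by = 10))
mess_point(c(5.5, 55), ref)  # both variables inside the reference range
mess_point(c(12, 55), ref)   # temp exceeds the reference maximum: negative
```

Because the score is the minimum across variables, a single out-of-range predictor is enough to make the whole point negative.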

When using the L1 (Manhattan) or L2 (Euclidean) distance options (experimental), the function performs the following steps for each test sample:

1. Calculates the minimum distance between each test sample and all training samples in the same fold using the selected metric (L1 or L2).
2. Calculates a baseline distance: the average of the minimum distances between a set of random background samples (defined by num_sample) from the raster and all training/test samples combined.
3. Computes a similarity score by subtracting the test sample’s minimum distance from the baseline average. A higher score indicates the test sample is more similar to the training data, while lower or negative scores indicate novelty.

This provides a simple, distance-based novelty metric, useful for assessing extrapolation or dissimilarity in prediction scenarios. Note that this approach is experimental.
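
The three steps above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's implementation: the helpers min_dist and similarity_score are hypothetical, the inputs are plain numeric matrices of predictor values (one row per sample), and the predictors are assumed to be on comparable scales.

```r
# Hypothetical helpers for illustration; not part of blockCV.

# Minimum distance from each row of `from` to any row of `to`.
min_dist <- function(from, to, method = c("L1", "L2")) {
  method <- match.arg(method)
  apply(from, 1, function(p) {
    d <- sweep(to, 2, p)                 # subtract p from every row of `to`
    if (method == "L1") {
      min(rowSums(abs(d)))               # Manhattan distance
    } else {
      min(sqrt(rowSums(d^2)))            # Euclidean distance
    }
  })
}

# Score = baseline average minus each test sample's minimum distance.
similarity_score <- function(test, train, background, method = "L2") {
  test_min <- min_dist(test, train, method)                           # step 1
  baseline <- mean(min_dist(background, rbind(train, test), method))  # step 2
  baseline - test_min                                                 # step 3
}

# Toy data: two predictors, one test sample far from the training data
train      <- matrix(c(0, 0,  1, 1),   ncol = 2, byrow = TRUE)
test       <- matrix(c(0, 0, 10, 10),  ncol = 2, byrow = TRUE)
background <- matrix(c(0, 1,  1, 0),   ncol = 2, byrow = TRUE)

similarity_score(test, train, background, method = "L1")
```

In this toy case the first test sample coincides with a training sample and scores positive, while the distant second sample scores negative, matching the interpretation given in step 3.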

References

Elith, J., Kearney, M., & Phillips, S. (2010). The art of modelling range-shifting species. Methods in Ecology and Evolution, 1(4), 330–342.

Examples


# \donttest{
library(blockCV)

# import presence-absence species data
points <- read.csv(system.file("extdata/", "species.csv", package = "blockCV"))
# make an sf object from data.frame
pa_data <- sf::st_as_sf(points, coords = c("x", "y"), crs = 7845)

# load raster data
path <- system.file("extdata/au/", package = "blockCV")
files <- list.files(path, full.names = TRUE)
covars <- terra::rast(files)

# hexagonal spatial blocking by specified size and random assignment
sb <- cv_spatial(x = pa_data,
                 column = "occ",
                 size = 450000,
                 k = 5,
                 iteration = 1)

# compute extrapolation
cv_similarity(cv = sb, r = covars, x = pa_data)

# }
