cv_spatial: Use spatial blocks to separate train and test folds

Description

This function creates spatially separated folds based on a distance to number of row and/or column. It assigns blocks to the training and testing folds randomly, systematically or in a checkerboard pattern. The distance (size) should be in metres, regardless of the unit of the reference system of the input data (for more information see the details section). By default, the function creates blocks according to the extent and shape of the spatial sample data (x e.g. the species occurrence), Alternatively, blocks can be created based on r assuming that the user has considered the landscape for the given species and case study. Blocks can also be offset so the origin is not at the outer corner of the rasters. Instead of providing a distance, the blocks can also be created by specifying a number of rows and/or columns and divide the study area into vertical or horizontal bins, as presented in Wenger & Olden (2012) and Bahn & McGill (2012). Finally, the blocks can be specified by a user-defined spatial polygon layer.

Usage

cv_spatial(
  x,
  column = NULL,
  r = NULL,
  k = 5L,
  hexagon = TRUE,
  flat_top = FALSE,
  size = NULL,
  rows_cols = c(10, 10),
  selection = "random",
  iteration = 100L,
  user_blocks = NULL,
  folds_column = NULL,
  deg_to_metre = 111325,
  biomod2 = TRUE,
  offset = c(0, 0),
  extend = 0,
  seed = NULL,
  progress = TRUE,
  report = TRUE,
  plot = TRUE,
  ...
)

Value

An object of class S3. A list of objects including:

folds_list - a list containing the folds. Each fold has two vectors with the training (first) and testing (second) indices
folds_ids - a vector of values indicating the number of the fold for each observation (each number corresponds to the same point in species data)
biomod_table - a matrix with the folds to be used in biomod2 package
k - number of the folds
size - input size, if not null
column - the name of the column if provided
blocks - spatial polygon of the blocks
records - a table with the number of points in each category of training and testing

Arguments

x: a simple features (sf) or SpatialPoints object of spatial sample data (e.g., species data or ground truth sample for image classification).
column: character (optional). Indicating the name of the column in which response variable (e.g. species data as a binary response i.e. 0s and 1s) is stored to find balanced records in cross-validation folds. If column = NULL the response variable classes will be treated the same and only training and testing records will be counted. This is used for binary (e.g. presence-absence/background) or multi-class responses (e.g. land cover classes for remote sensing image classification), and you can ignore it when the response variable is continuous or count data.
r: a terra SpatRaster object (optional). If provided, its extent will be used to specify the blocks. It also supports stars, raster, or path to a raster file on disk.
k: integer value. The number of desired folds for cross-validation. The default is k = 5.
hexagon: logical. Creates hexagonal (default) spatial blocks. If FALSE, square blocks is created.
flat_top: logical. Creating hexagonal blocks with topped flat.
size: numeric value of the specified range by which blocks are created and training/testing data are separated. This distance should be in metres. The range could be explored by cv_spatial_autocor and cv_block_size functions.
rows_cols: integer vector. Two integers to define the blocks based on row and column e.g. c(10, 10) or c(5, 1). Hexagonal blocks uses only the first one. This option is ignored when size is provided.
selection: type of assignment of blocks into folds. Can be random (default), systematic, checkerboard, or predefined. The checkerboard does not work with hexagonal and user-defined spatial blocks. If the selection = 'predefined', user-defined blocks and folds_column must be supplied.
iteration: integer value. The number of attempts to create folds with balanced records. Only works when selection = "random".
user_blocks: an sf or SpatialPolygons object to be used as the blocks (optional). This can be a user defined polygon and it must cover all the species (response) points. If selection = 'predefined', this argument and folds_column must be supplied.
folds_column: character. Indicating the name of the column (in user_blocks) in which the associated folds are stored. This argument is necessary if you choose the 'predefined' selection.
deg_to_metre: integer. The conversion rate of metres to degree. See the details section for more information.
biomod2: logical. Creates a matrix of folds that can be directly used in the biomod2 package as a CV.user.table for cross-validation.
offset: two number between 0 and 1 to shift blocks by that proportion of block size. This option only works when size is provided.
extend: numeric; This parameter specifies the percentage by which the map's extent is expanded to increase the size of the square spatial blocks, ensuring that all points fall within a block. The value should be a numeric between 0 and 5.
seed: integer; a random seed for reproducibility (although an external seed should also work).
progress: logical; whether to shows a progress bar for random fold selection.
report: logical; whether to print the report of the records per fold.
plot: logical; whether to plot the final blocks with fold numbers in ggplot. You can re-create this with cv_plot.
...: additional option for cv_plot.

Details

To maintain consistency, all functions in this package use meters as their unit of measurement. However, when the input map has a geographic coordinate system (in decimal degrees), the block size is calculated by dividing the size parameter by deg_to_metre (which defaults to 111325 meters, the standard distance of one degree of latitude on the Equator). In reality, this value varies by a factor of the cosine of the latitude. So, an alternative sensible value could be cos(mean(sf::st_bbox(x)[c(2,4)]) * pi/180) * 111325.

The offset can be used to change the spatial position of the blocks. It can also be used to assess the sensitivity of analysis results to shifting in the blocking arrangements. These options are available when size is defined. By default the region is located in the middle of the blocks and by setting the offsets, the blocks will shift.

Roberts et. al. (2017) suggest that blocks should be substantially bigger than the range of spatial autocorrelation (in model residual) to obtain realistic error estimates, while a buffer with the size of the spatial autocorrelation range would result in a good estimation of error. This is because of the so-called edge effect (O'Sullivan & Unwin, 2014), whereby points located on the edges of the blocks of opposite sets are not separated spatially. Blocking with a buffering strategy overcomes this issue (see cv_buffer).

References

Bahn, V., & McGill, B. J. (2012). Testing the predictive performance of distribution models. Oikos, 122(3), 321-331.

O'Sullivan, D., Unwin, D.J., (2010). Geographic Information Analysis, 2nd ed. John Wiley & Sons.

Roberts et al., (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography. 40: 913-929.

Wenger, S.J., Olden, J.D., (2012). Assessing transferability of ecological models: an underappreciated aspect of statistical validation. Methods Ecol. Evol. 3, 260-267.

Examples

Run this code

# \donttest{
library(blockCV)

# import presence-absence species data
points <- read.csv(system.file("extdata/", "species.csv", package = "blockCV"))
# make an sf object from data.frame
pa_data <- sf::st_as_sf(points, coords = c("x", "y"), crs = 7845)

# hexagonal spatial blocking by specified size and random assignment
sb1 <- cv_spatial(x = pa_data,
                  column = "occ",
                  size = 450000,
                  k = 5,
                  selection = "random",
                  iteration = 50)

# spatial blocking by row/column and systematic fold assignment
sb2 <- cv_spatial(x = pa_data,
                  column = "occ",
                  rows_cols = c(8, 10),
                  k = 5,
                  hexagon = FALSE,
                  selection = "systematic")

# }

Run the code above in your browser using DataLab