This function creates spatially separated folds based on a specified distance or a number of rows and/or columns.
It assigns blocks to the training and testing folds randomly, systematically or
in a checkerboard pattern. The distance (size) should be in metres, regardless of the unit of the reference system of
the input data (for more information see the details section). By default,
the function creates blocks according to the extent and shape of the spatial sample data (x, e.g.
the species occurrence). Alternatively, blocks can be created based on r, assuming that the
user has considered the landscape for the given species and case study.
Blocks can also be offset so the origin is not at the outer corner of the rasters.
Instead of providing a distance, the blocks can also be created by specifying a number of rows and/or
columns, dividing the study area into vertical or horizontal bins, as presented in Wenger & Olden (2012)
and Bahn & McGill (2012). Finally, the blocks can be specified by a user-defined spatial polygon layer.
cv_spatial(
x,
column = NULL,
r = NULL,
k = 5L,
hexagon = TRUE,
flat_top = FALSE,
size = NULL,
rows_cols = c(10, 10),
selection = "random",
iteration = 100L,
user_blocks = NULL,
folds_column = NULL,
deg_to_metre = 111325,
biomod2 = TRUE,
offset = c(0, 0),
extend = 0,
seed = NULL,
progress = TRUE,
report = TRUE,
plot = TRUE,
...
)
An object of class S3. A list of objects including:
folds_list - a list containing the folds. Each fold has two vectors with the training (first) and testing (second) indices
folds_ids - a vector of values indicating the number of the fold for each observation (each number corresponds to the same point in species data)
biomod_table - a matrix with the folds to be used in biomod2 package
k - number of the folds
size - input size, if not null
column - the name of the column if provided
blocks - spatial polygon of the blocks
records - a table with the number of points in each category of training and testing
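The layout of the returned folds can be illustrated without running the function. The sketch below builds a mock object with the same folds_list and folds_ids layout described above and loops over the folds as one would when fitting models. The fold assignments are made up for illustration and are not output of cv_spatial.

```r
# Mock fold IDs for 10 observations and k = 3 (illustrative values only)
folds_ids <- c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1)
k <- 3

# Rebuild a folds_list-like structure:
# element 1 = training indices, element 2 = testing indices
folds_list <- lapply(seq_len(k), function(i) {
  list(which(folds_ids != i), which(folds_ids == i))
})

# Typical use: iterate over folds and pull out the row indices
for (i in seq_len(k)) {
  train_idx <- folds_list[[i]][[1]]
  test_idx <- folds_list[[i]][[2]]
  cat("fold", i, ": train =", length(train_idx),
      ", test =", length(test_idx), "\n")
}
```

With a real result, such as sb1 from the examples, the same loop would run over sb1$folds_list.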
x - a simple features (sf) or SpatialPoints object of spatial sample data (e.g., species data or ground truth sample for image classification).
column - character (optional). The name of the column in which the response variable (e.g. species data as a binary
response, i.e. 0s and 1s) is stored, used to find balanced records in cross-validation folds. If column = NULL,
the response variable classes will be treated the same and only training and testing records will be counted.
This is used for binary (e.g. presence-absence/background) or multi-class responses (e.g. land cover classes for
remote sensing image classification), and you can ignore it when the response variable is
continuous or count data.
r - a terra SpatRaster object (optional). If provided, its extent will be used to specify the blocks. It also supports stars, raster, or path to a raster file on disk.
k - integer value. The number of desired folds for cross-validation. The default is k = 5.
hexagon - logical. Creates hexagonal (default) spatial blocks. If FALSE, square blocks are created.
flat_top - logical. Creates flat-topped hexagonal blocks.
size - numeric value of the specified range by which blocks are created and training/testing data are separated.
This distance should be in metres. The range could be explored by the cv_spatial_autocor and cv_block_size functions.
rows_cols - integer vector. Two integers to define the blocks based on row and
column, e.g. c(10, 10) or c(5, 1). Hexagonal blocks use only the first one. This
option is ignored when size is provided.
selection - type of assignment of blocks into folds. Can be random (default), systematic, checkerboard, or predefined.
The checkerboard does not work with hexagonal and user-defined spatial blocks. If selection = 'predefined', user-defined
blocks and folds_column must be supplied.
iteration - integer value. The number of attempts to create folds with balanced records. Only works when selection = "random".
user_blocks - an sf or SpatialPolygons object to be used as the blocks (optional). This can be a user-defined polygon and it must cover all
the species (response) points. If selection = 'predefined', this argument and folds_column must be supplied.
folds_column - character. Indicating the name of the column (in user_blocks) in which the associated folds are stored.
This argument is necessary if you choose the 'predefined' selection.
deg_to_metre - integer. The conversion rate of metres to degree. See the details section for more information.
biomod2 - logical. Creates a matrix of folds that can be directly used in the biomod2 package as a CV.user.table for cross-validation.
offset - two numbers between 0 and 1 to shift blocks by that proportion of block size.
This option only works when size is provided.
extend - numeric; this parameter specifies the percentage by which the map's extent is expanded to increase the size of the square spatial blocks, ensuring that all points fall within a block. The value should be a numeric between 0 and 5.
seed - integer; a random seed for reproducibility (although an external seed should also work).
progress - logical; whether to show a progress bar for random fold selection.
report - logical; whether to print the report of the records per fold.
plot - logical; whether to plot the final blocks with fold numbers in ggplot.
You can re-create this with cv_plot.
... - additional options for cv_plot.
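The fold-assignment modes of the selection argument can be pictured on a small grid. The sketch below is a simplified illustration in base R, not the package's internal code: the grid dimensions are hypothetical, the systematic and checkerboard rules are assumptions about how such assignments are commonly made, and the random mode here is a single shuffle without the balancing attempts controlled by iteration.

```r
# Hypothetical grid of blocks: 4 rows x 5 columns, numbered row by row
n_rows <- 4
n_cols <- 5
k <- 5
blocks <- expand.grid(col = seq_len(n_cols), row = seq_len(n_rows))
blocks$id <- seq_len(nrow(blocks))

# Systematic (assumed rule): cycle through fold numbers 1..k in block order
blocks$systematic <- rep(seq_len(k), length.out = nrow(blocks))

# Checkerboard (assumed rule): two folds alternating like a chessboard
blocks$checkerboard <- (blocks$row + blocks$col) %% 2 + 1

# Random: shuffle the fold labels (seed fixed for reproducibility)
set.seed(42)
blocks$random <- sample(blocks$systematic)

# Show the checkerboard pattern as a rows x columns matrix
print(matrix(blocks$checkerboard, nrow = n_rows, byrow = TRUE))
```

Note that a checkerboard pattern yields only two folds, whatever value k takes in this sketch.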
To maintain consistency, all functions in this package use meters as their unit of
measurement. However, when the input map has a geographic coordinate system (in decimal degrees),
the block size is calculated by dividing the size parameter by deg_to_metre (which
defaults to 111325 meters, the standard distance of one degree of latitude on the Equator).
In reality, this value varies by a factor of the cosine of the latitude. So, an alternative sensible
value could be cos(mean(sf::st_bbox(x)[c(2,4)]) * pi/180) * 111325.
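As a worked example of this conversion, the sketch below computes the block size in decimal degrees for a size of 450000 metres, first with the default factor and then with the latitude-adjusted factor. The latitude range is a hypothetical study area given as plain numbers, so the snippet runs in base R; with an sf object x, the range would come from sf::st_bbox(x)[c(2, 4)] as in the expression above.

```r
size <- 450000          # block size in metres
deg_to_metre <- 111325  # default conversion factor

# Block size in decimal degrees under the default conversion
size_deg <- size / deg_to_metre

# Latitude-adjusted factor for a hypothetical study area
# spanning 40 to 30 degrees south
lat_range <- c(-40, -30)
adj <- cos(mean(lat_range) * pi / 180) * 111325

# The adjusted factor is smaller, so the same distance spans more degrees
size_deg_adj <- size / adj

print(c(default = size_deg, adjusted = size_deg_adj))
```

The adjusted value could then be passed as the deg_to_metre argument.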
The offset can be used to change the spatial position of the blocks. It can also be used to
assess the sensitivity of analysis results to shifting in the blocking arrangements.
These options are available when size is defined. By default the region is
located in the middle of the blocks and by setting the offsets, the blocks will shift.
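The effect of an offset can be sketched as a shift of the grid origin by a proportion of the block size. The example below is a base R illustration under stated assumptions: the grid corner, block size, point coordinates, and the direction of the shift are all hypothetical, and the package may position its blocks differently.

```r
size <- 100            # block size in map units (illustrative)
offset <- c(0.5, 0)    # shift by half a block in x, none in y
corner <- c(0, 0)      # hypothetical lower-left corner of the grid

# Shifted origin: each offset is a proportion of the block size
origin <- corner + offset * size

# Column/row index of the block containing a point, before and after the shift
pt <- c(x = 120, y = 30)
block_orig <- floor((pt - corner) / size)
block_shift <- floor((pt - origin) / size)

print(rbind(original = block_orig, shifted = block_shift))
```

A point can change block when the grid shifts, which is why repeating the analysis with several offsets gives a sense of how sensitive the results are to the blocking arrangement.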
Roberts et al. (2017) suggest that blocks should be substantially bigger than the range of spatial
autocorrelation (in model residual) to obtain realistic error estimates, while a buffer with the size of
the spatial autocorrelation range would result in a good estimation of error. This is because of the so-called
edge effect (O'Sullivan & Unwin, 2010), whereby points located on the edges of the blocks of opposite sets are
not separated spatially. Blocking with a buffering strategy overcomes this issue (see cv_buffer).
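The edge effect can be made concrete with a few points. The sketch below flags training points that lie closer to a testing point than an assumed autocorrelation range. It is a base R illustration only: the coordinates, fold labels, and range are hypothetical, and this is not how cv_buffer is implemented.

```r
# Hypothetical projected coordinates (metres) and fold labels
pts <- data.frame(x = c(0, 900, 1100, 3000),
                  y = c(0, 0, 0, 0),
                  fold = c(1, 1, 2, 2))
range_m <- 500  # assumed spatial autocorrelation range in metres

# Treat fold 2 as the testing set
test <- pts$fold == 2

# Pairwise Euclidean distances between all points
d <- as.matrix(dist(pts[, c("x", "y")]))

# Distance from each training point to its nearest testing point
nearest_test <- apply(d[!test, test, drop = FALSE], 1, min)

# Training points closer than the range are not spatially separated
too_close <- nearest_test < range_m

print(data.frame(pts[!test, ], nearest_test, too_close))
```

In this toy layout the second training point sits near the boundary between the two folds; a buffering strategy would exclude such points from training.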
Bahn, V., & McGill, B. J. (2012). Testing the predictive performance of distribution models. Oikos, 122(3), 321-331.
O'Sullivan, D., & Unwin, D. J. (2010). Geographic Information Analysis, 2nd ed. John Wiley & Sons.
Roberts et al. (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography, 40, 913-929.
Wenger, S. J., & Olden, J. D. (2012). Assessing transferability of ecological models: an underappreciated aspect of statistical validation. Methods Ecol. Evol., 3, 260-267.
cv_buffer and cv_cluster; cv_spatial_autocor and cv_block_size for selecting block size
For CV.user.table see BIOMOD_Modeling in biomod2 package
library(blockCV)
# import presence-absence species data
points <- read.csv(system.file("extdata/", "species.csv", package = "blockCV"))
# make an sf object from data.frame
pa_data <- sf::st_as_sf(points, coords = c("x", "y"), crs = 7845)
# hexagonal spatial blocking by specified size and random assignment
sb1 <- cv_spatial(x = pa_data,
                  column = "occ",
                  size = 450000,
                  k = 5,
                  selection = "random",
                  iteration = 50)
# spatial blocking by row/column and systematic fold assignment
sb2 <- cv_spatial(x = pa_data,
                  column = "occ",
                  rows_cols = c(8, 10),
                  k = 5,
                  hexagon = FALSE,
                  selection = "systematic")