optimCLHS: Optimization of sample configurations for spatial trend identification and estimation (IV)

Description

Optimize a sample configuration for spatial trend identification and estimation using the method proposed by Minasny and McBratney (2006), known as conditioned Latin hypercube sampling (CLHS). A utility function U is defined so that the sample reproduces the marginal distribution and correlation matrix of the numeric covariates, and the class proportions of the factor covariates. The utility function is obtained by aggregating three objective functions: O1, O2, and O3.

Usage

optimCLHS(points, candi, covars, use.coords = FALSE,
  clhs.version = c("paper", "fortran", "update"),
  schedule = scheduleSPSANN(), plotit = FALSE, track = FALSE,
  boundary, progress = "txt", verbose = FALSE, weights)
objCLHS(points, candi, covars, use.coords = FALSE,
  clhs.version = c("paper", "fortran", "update"), weights)

Arguments

points

Integer value, integer vector, data frame or matrix, or list.

Integer value. The number of points. These points will be randomly sampled from candi to form the starting sample configuration.
Integer vector. The row indexes of candi that correspond to the points that form the starting sample configuration. The length of the vector defines the number of points.
Data frame or matrix. An object with three columns in the following order: [, "id"], the row indexes of candi that correspond to each point, [, "x"], the projected x-coordinates, and [, "y"], the projected y-coordinates.
List. An object with two named sub-arguments: fixed, a data frame or matrix with the projected x- and y-coordinates of the existing sample configuration -- kept fixed during the optimization --, and free, an integer value defining the number of points that should be added to the existing sample configuration -- free to move during the optimization.
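
The four accepted forms of points can be sketched in base R as follows. This is only an illustration of the object shapes described above: the candidate grid and all object names (points_n, points_idx, points_df, points_list) are hypothetical, and no spsann function is called.

```r
# Hypothetical candidate locations on a regular 10 x 10 grid
candi <- expand.grid(x = seq(0, 90, by = 10), y = seq(0, 90, by = 10))

# 1. Integer value: number of points to sample at random from candi
points_n <- 10L

# 2. Integer vector: row indexes of candi forming the starting configuration
points_idx <- c(1L, 25L, 50L, 75L, 100L)

# 3. Data frame with columns "id", "x", "y", in this order
points_df <- data.frame(
  id = points_idx,
  x = candi$x[points_idx],
  y = candi$y[points_idx])

# 4. List: existing points kept fixed, plus the number of free points to add
points_list <- list(
  fixed = candi[c(1, 100), c("x", "y")],
  free = 8L)
```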

candi

Data frame or matrix with the candidate locations for the jittered points. candi must have two columns in the following order: [, "x"], the projected x-coordinates, and [, "y"], the projected y-coordinates.

covars

Data frame or matrix with the covariates in the columns.

use.coords

(Optional) Logical value. Should the spatial x- and y-coordinates be used as covariates? Defaults to use.coords = FALSE.

clhs.version

(Optional) Character value setting the CLHS version that should be used. Available options are: "paper", for the formulations of O1, O2, and O3 as presented in the original paper by Minasny and McBratney (2006); "fortran", for the formulations of O1 and O3 that include a scaling factor as implemented in the later Fortran code by Budiman Minasny (ca. 2015); and "update", for formulations of O1, O2, and O3 that include the modifications proposed by the authors of this package in 2018 (see below). Defaults to clhs.version = "paper".

schedule

List with 11 named sub-arguments defining the control parameters of the cooling schedule. See scheduleSPSANN.

plotit

(Optional) Logical value. Should the optimization results be plotted? The plots include a) the progress of the objective function, and b) the starting (gray circles) and current (black dots) sample configurations, and the maximum jitter in the x- and y-coordinates. The plots are updated every 10 jitters. When adding points to an existing sample configuration, fixed points are indicated using black crosses. Defaults to plotit = FALSE.

track

(Optional) Logical value. Should the evolution of the energy state be recorded and returned along with the result? If track = FALSE (the default), only the starting and ending energy states are returned along with the results.

boundary

(Optional) SpatialPolygon defining the boundary of the spatial domain. If missing and plotit = TRUE, boundary is estimated from candi.

progress

(Optional) Type of progress bar that should be used, with options "txt", for a text progress bar in the R console, "tk", to put up a Tk progress bar widget, and NULL to omit the progress bar. A Tk progress bar widget is useful when using parallel processors. Defaults to progress = "txt".

verbose

(Optional) Logical for printing messages about the progress of the optimization. Defaults to verbose = FALSE.

weights

List with named sub-arguments. The weights assigned to each one of the objective functions that form the multi-objective combinatorial optimization problem. They must be named after the respective objective function to which they apply. The weights must be equal to or larger than 0 and sum to 1.

Value

optimCLHS returns an object of class OptimizedSampleConfiguration: the optimized sample configuration with details about the optimization.

objCLHS returns a numeric value: the energy state of the sample configuration -- the objective function value.

Details

Details about the mechanism used to generate a new sample configuration out of the current sample configuration by randomly perturbing the coordinates of a sample point are available in the help page of spJitter.

Marginal sampling strata

Reproducing the marginal distribution of the numeric covariates depends upon the definition of marginal sampling strata. Equal-area marginal sampling strata are defined using the sample quantiles estimated with quantile using a continuous function (type = 7), that is, a function that interpolates between existing covariate values to estimate the sample quantiles. This is the procedure implemented in the original method of Minasny and McBratney (2006), which creates breakpoints that do not occur in the population of existing covariate values. Depending on the level of discretization of the covariate values, that is, how many significant digits they have, this can create repeated breakpoints, resulting in empty marginal sampling strata. The number of empty marginal sampling strata will ultimately depend on the frequency distribution of the covariate and on the number of sampling points. The effect of these features on the spatial modelling outcome is still poorly understood.
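
The behaviour described above can be reproduced with a small base R sketch. The covariate z is made up for illustration (a coarsely discretized variable with only three distinct values), and the object names are hypothetical; the code is not taken from the spsann source.

```r
# Hypothetical covariate: 100 values, only three distinct levels
z <- rep(c(1, 2, 3), times = c(80, 15, 5))

# One marginal stratum per sampling point: n_pts strata need n_pts + 1 breaks
n_pts <- 10
probs <- seq(0, 1, length.out = n_pts + 1)

# Continuous (interpolating) sample quantiles, as with type = 7
breaks <- quantile(z, probs = probs, type = 7, names = FALSE)

# Repeated breakpoints delimit strata that no covariate value can fall into
n_repeated <- sum(duplicated(breaks))

# Interpolation can also produce breakpoints absent from the population
not_in_population <- breaks[!breaks %in% z]

print(breaks)
print(n_repeated)
print(not_in_population)
```

With a skewed, coarsely discretized covariate such as this one, most breakpoints coincide with the most frequent value, so several of the requested strata are empty.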

Correlation between numeric covariates

The correlation between two numeric covariates is measured using the sample Pearson's r, a descriptive statistic that ranges from -1 to +1. This statistic is also known as the sample linear correlation coefficient. The effect of ignoring the correlation among factor covariates and between factor and numeric covariates on the spatial modelling outcome is still poorly understood.
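
As a minimal sketch, the correlation matrices of the population and of a sample can be compared in base R as shown below. The data are simulated, the object names are hypothetical, and the summed absolute difference is only one plausible way of measuring the mismatch; the exact formulation used by each clhs.version is not reproduced here.

```r
set.seed(42)

# Simulated population with two correlated numeric covariates
pop <- data.frame(a = rnorm(500))
pop$b <- 0.7 * pop$a + rnorm(500, sd = 0.5)

# A random sample of 20 population units
idx <- sample(nrow(pop), 20)

# Sample Pearson's r for the population and for the sample
r_pop <- cor(pop, method = "pearson")
r_sam <- cor(pop[idx, ], method = "pearson")

# Mismatch between the two correlation matrices (assumed measure)
mismatch <- sum(abs(r_pop - r_sam))
print(mismatch)
```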

Multi-objective combinatorial optimization

A method of solving a multi-objective combinatorial optimization problem (MOCOP) is to aggregate the objective functions into a single utility function U. In the spsann package, as in the original implementation of the CLHS by Minasny and McBratney (2006), the aggregation is performed using the weighted sum method, which uses weights to incorporate the a priori preferences of the user about the relative importance of each objective function. When the user has no preference, the objective functions receive equal weights.

The weighted sum method is affected by the relative magnitude of the different objective function values. The objective functions implemented in optimCLHS have different units and orders of magnitude. Consequently, the objective function with the largest values, generally O1, may numerically dominate the optimization. In other words, the weights may not express the true preferences of the user, and the meaning of the utility function becomes unclear because the optimization will likely favour the numerically dominant objective function.
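
The following base R sketch illustrates numerical dominance under the weighted sum method. The helper weighted_sum is hypothetical (it is not a spsann function) and the objective function values are made up; only the weights follow the naming and constraints described for the weights argument.

```r
# Hypothetical helper: aggregate objective values into a single utility U
weighted_sum <- function(obj, weights) {
  w <- unlist(weights)
  o <- unlist(obj)
  # Weights must be non-negative, sum to 1, and be named after the objectives
  stopifnot(all(w >= 0), isTRUE(all.equal(sum(w), 1)))
  stopifnot(setequal(names(o), names(w)))
  sum(o[names(w)] * w)
}

# Made-up objective values with very different orders of magnitude
obj <- list(O1 = 120, O3 = 0.4)
w <- list(O1 = 0.5, O3 = 0.5)

U <- weighted_sum(obj, w)

# Share of U contributed by each objective despite the equal weights
share <- unlist(obj) * unlist(w) / U
print(U)
print(round(share, 4))
```

Although both objectives receive the same weight, almost all of U comes from O1, so changes in O3 have little influence on which sample configurations are accepted.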

An efficient solution to avoid numerical dominance is to scale the objective functions so that they are constrained to the same approximate range of values, at least at the end of the optimization. With the original implementation of the CLHS by Minasny and McBratney (2006), clhs.version = "paper", optimCLHS uses the naive aggregation method, which ignores that the three objective functions have different units and orders of magnitude. In a 2015 Fortran implementation of the CLHS, clhs.version = "fortran", scaling factors were included to make the values of the three objective functions more comparable. The effect of ignoring the need to scale the objective functions, or of using arbitrary scaling factors, on the spatial modelling outcome is still poorly understood. Thus, an updated version of O1, O2, and O3 has been implemented in the spsann package, clhs.version = "update". The new formulations aim at making the values returned by the objective functions more comparable among themselves without having to resort to arbitrary scaling factors. The effect of using these new formulations has not been tested yet.

References

Minasny, B.; McBratney, A. B. A conditioned Latin hypercube method for sampling in the presence of ancillary information. Computers & Geosciences, v. 32, p. 1378-1388, 2006.

Minasny, B.; McBratney, A. B. Conditioned Latin Hypercube Sampling for calibrating soil sensor data to soil properties. Chapter 9. Viscarra Rossel, R. A.; McBratney, A. B.; Minasny, B. (Eds.) Proximal Soil Sensing. Amsterdam: Springer, p. 111-119, 2010.

Roudier, P.; Beaudette, D.; Hewitt, A. A conditioned Latin hypercube sampling algorithm incorporating operational constraints. 5th Global Workshop on Digital Soil Mapping. Sydney, p. 227-231, 2012.

Examples

# NOT RUN {
data(meuse.grid, package = "sp")
candi <- meuse.grid[1:1000, 1:2]
covars <- meuse.grid[1:1000, 5]
schedule <- scheduleSPSANN(
  chains = 1, initial.temperature = 20, x.max = 1540, y.max = 2060, 
  x.min = 0, y.min = 0, cellsize = 40)
set.seed(2001)
res <- optimCLHS(
  points = 10, candi = candi, covars = covars, use.coords = TRUE,
  clhs.version = "fortran", weights = list(O1 = 0.5, O3 = 0.5), schedule = schedule)
objSPSANN(res) - objCLHS(
  points = res, candi = candi, covars = covars, use.coords = TRUE, 
  clhs.version = "fortran", weights = list(O1 = 0.5, O3 = 0.5))
# }
