
spsann (version 1.0-2)

optimCLHS: Optimization of sample configurations for spatial trend identification and estimation (IV)

Description

Optimize a sample configuration for spatial trend identification and estimation using the conditioned Latin hypercube sampling (CLHS) method proposed by Minasny and McBratney (2006). A utility function U is defined so that the sample reproduces the marginal distribution and correlation matrix of the numeric covariates, and the class proportions of the factor covariates. The utility function is obtained by aggregating three objective functions: O1, O2, and O3.

Usage

optimCLHS(points, candi, iterations = 50000, covars, use.coords = FALSE,
  x.max, x.min, y.max, y.min, acceptance = list(initial = 0.8, cooling =
  iterations/10), stopping = list(max.count = iterations/10),
  plotit = FALSE, track = FALSE, boundary, progress = TRUE,
  verbose = FALSE, greedy = FALSE, weights = list(O1 = 1/3, O2 = 1/3, O3 =
  1/3))

objCLHS(points, candi, covars, use.coords = FALSE, weights = list(O1 = 1/3, O2 = 1/3, O3 = 1/3))

Arguments

points
Integer value, integer vector, data frame, or matrix. If points is an integer value, it defines the number of points that should be randomly sampled from candi to form the starting sample configuration. If points is an integer vector, it contains the row indexes of candi that correspond to the points that form the starting sample configuration. If points is a data frame or matrix, it must have three columns in the following order: [, "id"] the row indexes of candi that correspond to each point, [, "x"] the projected x-coordinates, and [, "y"] the projected y-coordinates.
candi
Data frame or matrix with the candidate locations for the perturbed points. candi must have two columns in the following order: [, "x"] the projected x-coordinates, and [, "y"] the projected y-coordinates.
iterations
Integer. The maximum number of iterations that should be used for the optimization. Defaults to iterations = 50000.
covars
Data frame or matrix with the covariates in the columns.
use.coords
Logical value. Should the geographic coordinates be used as covariates? Defaults to use.coords = FALSE.
x.max,x.min,y.max,y.min
Numeric value. The minimum and maximum quantity of random noise to be added to the projected x- and y-coordinates. The minimum quantity should be equal to, at least, the minimum distance between two neighbouring candidate locations. The units are the same as those of the projected x- and y-coordinates. If missing, they are estimated from candi.
acceptance
List with two named sub-arguments: initial -- numeric value between 0 and 1 defining the initial acceptance probability, and cooling -- a numeric value defining the exponential factor by which the acceptance probability decreases at each iteration. Defaults to acceptance = list(initial = 0.8, cooling = iterations / 10).
stopping
List with one named sub-argument: max.count -- integer value defining the maximum allowable number of iterations without improvement of the objective function value. Defaults to stopping = list(max.count = iterations / 10).
plotit
Logical for plotting the optimization results. This includes a) the progress of the objective function values and acceptance probabilities, and b) the original points, the perturbed points, and the progress of the maximum perturbation in the x- and y-coordinates. Defaults to plotit = FALSE.
track
Logical value. Should the evolution of the energy state and acceptance probability be recorded and returned with the result? If track = FALSE (the default), only the starting and ending energy state values are returned with the result.
boundary
SpatialPolygon. The boundary of the spatial domain. If missing, it is estimated from candi.
progress
Logical for printing a progress bar. Defaults to progress = TRUE.
verbose
Logical for printing messages about the progress of the optimization. Defaults to verbose = FALSE.
greedy
Logical value. Should the optimization be done using a greedy algorithm, that is, accepting only better system configurations? Defaults to greedy = FALSE. (experimental)
weights
List with named sub-arguments. The weights assigned to each one of the objective functions that form the multi-objective optimization problem (MOOP). They must be named after the respective objective function to which they apply. The weights must sum to unity. Defaults to weights = list(O1 = 1/3, O2 = 1/3, O3 = 1/3).

Value

  • optimCLHS returns a matrix: the optimized sample configuration.

  • objCLHS returns a numeric value: the energy state of the sample configuration, i.e. the objective function value.

Jittering methods

There are two ways of jittering the coordinates. They differ on how the set of candidate locations is defined. The first method uses an infinite set of candidate locations, that is, any point in the spatial domain can be selected as the new location of a perturbed point. All that this method needs is a polygon indicating the boundary of the spatial domain. This method is not implemented in the spsann package (yet) because it is computationally demanding: every time a point is jittered, it is necessary to check if it is inside the spatial domain.

The second method consists of using a finite set of candidate locations for the perturbed points. A finite set of candidate locations is created by discretizing the spatial domain, that is, creating a fine grid of points that serve as candidate locations for the perturbed points. This is the only method currently implemented in the spsann package because it is one of the least computationally demanding.

Using a finite set of candidate locations has one important inconvenience. When a point is selected to be jittered, its new location may already be occupied by another point. If this happens, another location is sought iteratively, for as many times as there are points in points. The more points there are in points, the more likely it is that the new location is already occupied by another point. If a solution is not found, the point selected to be jittered is kept in its original location.

A more elegant method can be defined using a finite set of candidate locations coupled with a form of two-stage random sampling as implemented in spsample. Because the candidate locations are placed on a finite regular grid, they can be seen as being the centre nodes of a finite set of grid cells (or pixels of a raster image). In the first stage, one of the grid cells is selected with replacement, i.e. independently of already being occupied by another sample point. The new location for the point chosen to be jittered is selected within that grid cell by simple random sampling. This method guarantees that any location in the spatial domain can be a candidate location. It also discards the need to check if the new location already is occupied by another point. This method is not implemented (yet) in the spsann package.
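The two-stage method described above can be sketched in a few lines of R. This is an illustrative sketch, not spsann code: the function name jitter_two_stage is hypothetical, candi is assumed to be a data frame of grid-cell centre coordinates, and res is the grid resolution.

```r
# Illustrative sketch of two-stage jittering (not part of spsann):
# 1) select a grid cell at random, ignoring whether it is occupied;
# 2) select the new location within that cell by simple random sampling.
jitter_two_stage <- function(candi, res) {
  cell <- candi[sample(nrow(candi), 1), ]            # stage 1: pick a grid cell
  x <- runif(1, cell$x - res / 2, cell$x + res / 2)  # stage 2: uniform draw
  y <- runif(1, cell$y - res / 2, cell$y + res / 2)  #          inside the cell
  c(x = x, y = y)
}

# Toy 10 x 10 grid of candidate cell centres with unit resolution
candi <- expand.grid(x = seq(0.5, 9.5, 1), y = seq(0.5, 9.5, 1))
set.seed(1)
jitter_two_stage(candi, res = 1)
```

Because the draw in stage 2 is continuous, any location in the spatial domain can be returned, which is what makes this scheme more flexible than jittering onto the grid nodes themselves.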

Distance between two points

The distance between two points is computed as the Euclidean distance between them. This computation assumes that the optimization is operating in the two-dimensional Euclidean space, i.e. the coordinates of the sample points and candidate locations should not be provided as latitude/longitude. Package spsann has no mechanism to check if the coordinates are projected, and the user is responsible for making sure that this requirement is attained.
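For projected coordinates this is simply sqrt((x1 - x2)^2 + (y1 - y2)^2), which base R computes with dist(); the small check below uses made-up coordinates:

```r
# Euclidean distance between two projected points (same units as the projection)
pts <- rbind(c(x = 0, y = 0), c(x = 3, y = 4))
dist(pts)                           # 5
sqrt(sum((pts[1, ] - pts[2, ])^2))  # same value, computed by hand
```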

Marginal sampling strata

Reproducing the marginal distribution of the numeric covariates depends upon the definition of marginal sampling strata. Equal-area marginal sampling strata are defined using the sample quantiles estimated with quantile using a continuous function (type = 7), that is, a function that interpolates between existing covariate values to estimate the sample quantiles -- this is the procedure implemented in the method of Minasny and McBratney (2006), which creates breakpoints that do not occur in the population of existing covariate values. Depending on the level of discretization of the covariate values, that is, how many significant digits they have, this can create repeated breakpoints, resulting in empty marginal sampling strata. The number of empty marginal sampling strata will ultimately depend on the frequency distribution of the covariate, and on the number of sampling points.
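The effect described above is easy to reproduce with quantile(). With a heavily discretized covariate (the values below are made up for illustration), the continuous type = 7 quantile function returns repeated breakpoints, leaving some marginal sampling strata empty:

```r
# A covariate recorded with few distinct values (heavily discretized)
covar <- c(rep(1, 40), rep(2, 5), rep(3, 5))
n <- 5  # n equal-area strata need n + 1 breakpoints
breaks <- quantile(covar, probs = seq(0, 1, length.out = n + 1), type = 7)
breaks
# Repeated breakpoints collapse strata: fewer unique breaks than n + 1
length(unique(breaks)) < n + 1  # TRUE -> some strata are empty
```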

Correlation between numeric covariates

The correlation between two numeric covariates is measured using the sample Pearson's r, a descriptive statistic that ranges from -1 to +1. This statistic is also known as the sample linear correlation coefficient.

Multi-objective optimization

A method of solving a multi-objective optimization problem (MOOP) is to aggregate the objective functions into a single utility function U. In the spsann package, as in the original CLHS, the aggregation is performed using the weighted sum method, which uses weights to incorporate the preferences of the user about the relative importance of each objective function. When the user has no preference, the objective functions receive equal weights.

The weighted sum method is affected by the relative magnitude of the different objective function values. The objective functions implemented in optimCLHS have different units and orders of magnitude. The consequence is that the objective function with the largest values, generally O1, may have a numerical dominance during the optimization. In other words, the weights will not express the true preferences of the user, and the meaning of the utility function becomes unclear -- the optimization will favour the objective function which is numerically dominant.

An efficient solution to avoid numerical dominance is to transform the objective functions so that they are constrained to the same approximate range of values, at least in the end of the optimization. However, as in the original CLHS, optimCLHS uses the naive aggregation method, which ignores that the three objective functions have different units and orders of magnitude. The same aggregation procedure is implemented in the clhs package.
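The numerical dominance described above can be seen with made-up objective function values: even with equal weights, the objective function with the largest values decides which configuration "wins" under the naive weighted-sum aggregation.

```r
# Hypothetical (made-up) objective function values at two configurations
obj_a <- c(O1 = 120, O2 = 0.8, O3 = 0.3)
obj_b <- c(O1 = 110, O2 = 2.0, O3 = 0.9)
w <- c(O1 = 1/3, O2 = 1/3, O3 = 1/3)  # equal weights: no stated preference

U <- function(obj, w) sum(w * obj)    # naive weighted-sum utility
U(obj_a, w)  # ~40.37
U(obj_b, w)  # ~37.63 -> b is preferred although it is worse on O2 and O3
# O1 dominates: a 10-unit change in O1 outweighs any change in O2 or O3
```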

concept

simulated annealing

spatial trend

References

Edzer Pebesma, Jon Skoien with contributions from Olivier Baume, A. Chorti, D.T. Hristopulos, S.J. Melles and G. Spiliopoulos (2013). intamapInteractive: procedures for automated interpolation - methods only to be used interactively, not included in intamap package. R package version 1.1-10.

van Groenigen, J.-W. Constrained optimization of spatial sampling: a geostatistical approach. Wageningen: Wageningen University, p. 148, 1999.

Minasny, B.; McBratney, A. B. A conditioned Latin hypercube method for sampling in the presence of ancillary information. Computers & Geosciences, v. 32, p. 1378-1388, 2006.

Minasny, B.; McBratney, A. B. Conditioned Latin Hypercube Sampling for calibrating soil sensor data to soil properties. Chapter 9. Viscarra Rossel, R. A.; McBratney, A. B.; Minasny, B. (Eds.) Proximal Soil Sensing. Amsterdam: Springer, p. 111-119, 2010.

Roudier, P.; Beaudette, D.; Hewitt, A. A conditioned Latin hypercube sampling algorithm incorporating operational constraints. 5th Global Workshop on Digital Soil Mapping. Sydney, p. 227-231, 2012.

See Also

clhs, optimACDC

Examples

require(spsann)
require(sp)
data(meuse.grid)
candi <- meuse.grid[, 1:2]
covars <- meuse.grid[, 5]
weights <- list(O1 = 0.5, O3 = 0.5)
set.seed(2001)
res <- optimCLHS(points = 100, candi = candi, covars = covars,
                 use.coords = TRUE, weights = weights, iterations = 100)
objSPSANN(res) # 106.0691
objCLHS(points = res, candi = candi, covars = covars, use.coords = TRUE,
        weights = weights)

# MARGINAL DISTRIBUTION
par(mfrow = c(3, 3))
# Covariates
i <- sample(1:nrow(candi), 100)
hist(candi[, 1], breaks = 10)
hist(candi[, 2], breaks = 10)
hist(covars, breaks = 10)
# Optimized sample
hist(candi[res[, 1], 1], breaks = 10)
hist(candi[res[, 1], 2], breaks = 10)
hist(covars[res[, 1]], breaks = 10)
# Random sample
hist(candi[i, 1], breaks = 10)
hist(candi[i, 2], breaks = 10)
hist(covars[i], breaks = 10)

# LINEAR CORRELATION
# Covariates
cor(cbind(candi[, 1], candi[, 2], covars))
# Optimized sample
cor(cbind(candi[res[, 1], 1], candi[res[, 1], 2], covars[res[, 1]]))
# Random sample
cor(cbind(candi[i, 1], candi[i, 2], covars[i]))
