optimPPL: Optimization of sample configurations for variogram identification and estimation

Description

Optimize a sample configuration for variogram identification and estimation. A criterion is defined so that the optimized sample configuration has a given number of points or point-pairs contributing to each lag-distance class (PPL).

Usage

optimPPL(points, candi, iterations = 100, lags = 7,
  lags.type = "exponential", lags.base = 2, cutoff,
  criterion = "distribution", distri, pairs = FALSE, x.max, x.min, y.max,
  y.min, acceptance = list(initial = 0.99, cooling = iterations/10),
  stopping = list(max.count = iterations/10), plotit = FALSE,
  track = FALSE, boundary, progress = TRUE, verbose = FALSE,
  greedy = FALSE, weights = NULL, nadir = NULL, utopia = NULL)
objPPL(points, candi, lags = 7, lags.type = "exponential", lags.base = 2,
  cutoff, distri, criterion = "distribution", pairs = FALSE, x.max, x.min,
  y.max, y.min)
countPPL(points, candi, lags = 7, lags.type = "exponential",
  lags.base = 2, cutoff, pairs = FALSE, x.max, x.min, y.max, y.min)

Arguments

points

Integer value, integer vector, data frame or matrix. If points is an integer value, it defines the number of points that should be randomly sampled from candi to form the starting system configuration. If points is a

candi

Data frame or matrix with the candidate locations for the perturbed points. candi must have two columns in the following order: [, "x"] the projected x-coordinates, and [, "y"] the projected y-coordinates.

iterations

Integer. The maximum number of iterations that should be used for the optimization. Defaults to iterations = 100.

lags

Integer value. The number of lag-distance classes. Alternatively, a vector of numeric values with the lower and upper limits of each lag-distance class. The lowest value must be larger than zero. Defaults to lags = 7.

lags.type

Character value. The type of lag-distance classes, with options "equidistant" and "exponential". Defaults to lags.type = "exponential".

lags.base

Numeric value. Base of the exponential expression used to create exponentially spaced lag-distance classes. Used only when lags.type = "exponential". Defaults to lags.base = 2.

cutoff

Numeric value. The maximum distance up to which lag-distance classes are created. Used only when lags is an integer value.

criterion

Character value. The feature used to describe the energy state of the system configuration, with options "minimum" and "distribution". Defaults to objective = "distribution".

distri

Numeric vector. The distribution of points or point-pairs per lag-distance class that should be attained at the end of the optimization. Used only when criterion = "distribution". Defaults to a uniform distribution.

pairs

Logical value. Should the sample configuration be optimized regarding the number of point-pairs per lag-distance class? Defaults to pairs = FALSE.

x.max,x.min,y.max,y.min

Numeric value. The minimum and maximum quantity of random noise to be added to the projected x- and y-coordinates. The minimum quantity should be equal to, at least, the minimum distance between two neighbouring candidate locations. The units are the same

acceptance

List with two named sub-arguments: initial -- numeric value between 0 and 1 defining the initial acceptance probability, and cooling -- a numeric value defining the exponential factor by which the acceptance probability decreases

stopping

List with one named sub-argument: max.count -- integer value defining the maximum allowable number of iterations without improvement of the objective function value. Defaults to stopping = list(max.count = iterations / 10).

plotit

Logical for plotting the optimization results. This includes a) the progress of the objective function values and acceptance probabilities, and b) the original points, the perturbed points and the progress of the maximum perturbation in the x- and y-coord

track

Logical value. Should the evolution of the energy state and acceptance probability be recorded and returned with the result? If track = FALSE (the default), only the starting and ending energy state values are returned with the result.

boundary

SpatialPolygon. The boundary of the spatial domain. If missing, it is estimated from candi.

progress

Logical for printing a progress bar. Defaults to progress = TRUE.

verbose

Logical for printing messages about the progress of the optimization. Defaults to verbose = FALSE.

greedy

Logical value. Should the optimization be done using a greedy algorithm, that is, accepting only better system configurations? Defaults to greedy = FALSE. (experimental)

weights

List with named sub-arguments. The weights assigned to each one of the objective functions that form the multi-objective optimization problem (MOOP). They must be named after the respective objective function to which they apply. The weights must be equal

nadir

List with named sub-arguments. Three options are available: 1) sim -- the number of simulations that should be used to estimate the nadir point, and seeds -- vector defining the random seeds for each simulation; 2) user

utopia

List with named sub-arguments. Two options are available: 1) user -- a list of user-defined values named after the respective objective function to which they apply; 2) abs -- logical for calculating the utopia point internally (

Value

optimPPL returns a matrix: the optimized sample configuration.
objPPL returns a numeric value: the energy state of the sample configuration - the objective function value.
countPPL returns a data.frame with three columns: a) the lower and b) upper limits of each lag-distance class, and c) the number of points or point-pairs per lag-distance class.

Jittering methods

There are two ways of jittering the coordinates. They differ on how the set of candidate locations is defined. The first method uses an infinite set of candidate locations, that is, any point in the spatial domain can be selected as the new location of a perturbed point. All that this method needs is a polygon indicating the boundary of the spatial domain. This method is not implemented in the spsann package (yet) because it is computationally demanding: every time a point is jittered, it is necessary to check if it is inside the spatial domain.

The second method consists of using a finite set of candidate locations for the perturbed points. A finite set of candidate locations is created by discretizing the spatial domain, that is, creating a fine grid of points that serve as candidate locations for the perturbed points. This is the only method currently implemented in the spsann package because it is one of the least computationally demanding.

Using a finite set of candidate locations has one important inconvenience. When a point is selected to be jittered, it may be that the new location already is occupied by another point. If this happens, another location is iteratively sought for as many times as there are points in points. Because the more points there are in points, the more likely it is that the new location already is occupied by another point. If a solution is not found, the point selected to be jittered point is kept in its original location.

A more elegant method can be defined using a finite set of candidate locations coupled with a form of two-stage random sampling as implemented in spsample. Because the candidate locations are placed on a finite regular grid, they can be seen as being the centre nodes of a finite set of grid cells (or pixels of a raster image). In the first stage, one of the grid cells is selected with replacement, i.e. independently of already being occupied by another sample point. The new location for the point chosen to be jittered is selected within that grid cell by simple random sampling. This method guarantees that any location in the spatial domain can be a candidate location. It also discards the need to check if the new location already is occupied by another point. This method is not implemented (yet) in the spsann package.

Distance between two points

The distance between two points is computed as the Euclidean distance between them. This computation assumes that the optimization is operating in the two-dimensional Euclidean space, i.e. the coordinates of the sample points and candidate locations should not be provided as latitude/longitude. Package spsann has no mechanism to check if the coordinates are projected, and the user is responsible for making sure that this requirement is attained.

Multi-objective optimization

A method of solving a multi-objective optimization problem is to aggregate the objective functions into a single utility function. In the spsann package, the aggregation is performed using the weighted sum method, which incorporates in the weights the preferences of the user regarding the relative importance of each objective function.

The weighted sum method is affected by the relative magnitude of the different function values. The objective functions implemented in the spsann package have different units and orders of magnitude. The consequence is that the objective function with the largest values will have a numerical dominance in the optimization. In other words, the weights will not express the true preferences of the user, and the meaning of the utility function becomes unclear.

A solution to avoid the numerical dominance is to transform the objective functions so that they are constrained to the same approximate range of values. Several function-transformation methods can be used and the spsann offers a few of them. The upper-lower-bound approach requires the user to inform the maximum (nadir point) and minimum (utopia point) absolute function values. The resulting function values will always range between 0 and 1.

Using the upper-bound approach requires the user to inform only the nadir point, while the utopia point is set to zero. The upper-bound approach for transformation aims at equalizing only the upper bounds of the objective functions. The resulting function values will always be smaller than or equal to 1.

Sometimes, the absolute maximum and minimum values of an objective function can be calculated exactly. This seems not to be the case of the objective functions implemented in the spsann package. If the user is uncomfortable with informing the nadir and utopia points, there is the option for using numerical simulations. It consists in computing the function value for many random sample configurations. The mean function value is used to set the nadir point, while the the utopia point is set to zero. This approach is similar to the upper-bound approach, but the function values will have the same orders of magnitude only at the starting point of the optimization. Function values larger than one are likely to occur during the optimization. We recommend the user to avoid this approach whenever possible because the effect of the starting point on the optimization as a whole usually is insignificant or arbitrary.

The upper-lower-bound approach with the Pareto maximum and minimum values is the most elegant solution to transform the objective functions. However, it is the most time consuming. It works as follows:

Optimize a sample configuration with respect to each objective function that composes the MOOP;
Compute the function value of every objective function for every optimized sample configuration;
Record the maximum and minimum absolute function values computed for each objective function--these are the Pareto maximum and minimum.

For example, consider that a MOOP is composed of two objective functions (A and B). The minimum absolute value for function A is obtained when the sample configuration is optimized with respect to function A. This is the Pareto minimum of function A. Consequently, the maximum absolute value for function A is obtained when the sample configuration is optimized regarding function B. This is the Pareto maximum of function A. The same logic applies for function B.

Lag-distance classes

Two types of lag-distance classes can be created by default. The first are evenly spaced lags (lags.type = "equidistant"). They are created by simply dividing the distance interval from 0.0001 to cutoff by the required number of lags. The minimum value of 0.0001 guarantees that a point does not form a pair with itself. The second type of lags is defined by exponential spacings (lags.type = "exponential"). The spacings are defined by the base $b$ of the exponential expression $b^n$, where $n$ is the required number of lags. The base is defined using the argument lags.base.

Using the default uniform distribution means that the number of point-pairs per lag-distance class (pairs = TRUE) is equal to $n \times (n - 1) / (2 \times lag)$, where $n$ is the total number of points and $lag$ is the number of lags. If pairs = FALSE, then it means that the number of points per lag is equal to the total number of points. This is the same as expecting that each point contributes to every lag. Distributions other than the available options can be easily implemented changing the arguments lags and distri.

There are two optimizing criteria implemented. The first is called using criterion = "distribution" and is used to minimize the sum of the absolute differences between a pre-specified distribution and the observed distribution of points or point-pairs per lag-distance class. The second criterion is called using criterion = "minimum". It corresponds to maximizing the minimum number of points or point-pairs observed over all lag-distance classes.

concept

simulated annealing

variogram

References

Edzer Pebesma, Jon Skoien with contributions from Olivier Baume, A. Chorti, D.T. Hristopulos, S.J. Melles and G. Spiliopoulos (2013). intamapInteractive: procedures for automated interpolation - methods only to be used interactively, not included in intamap package. R package version 1.1-10.

van Groenigen, J.-W. Constrained optimization of spatial sampling: a geostatistical approach. Wageningen: Wageningen University, p. 148, 1999.

Arora, J. Introduction to optimum design. Waltham: Academic Press, p. 896, 2011.

Marler, R. T.; Arora, J. S. Survey of multi-objective optimization methods for engineering. Structural and Multidisciplinary Optimization, v. 26, p. 369-395, 2004.

Marler, R. T.; Arora, J. S. Function-transformation methods for multi-objective optimization. Engineering Optimization, v. 37, p. 551-570, 2005.

Marler, R. T.; Arora, J. S. The weighted sum method for multi-objective optimization: new insights. Structural and Multidisciplinary Optimization, v. 41, p. 853-862, 2009.

Bresler, E.; Green, R. E. Soil parameters and sampling scheme for characterizing soil hydraulic properties of a watershed. Honolulu: University of Hawaii at Manoa, p. 42, 1982.

Pettitt, A. N. & McBratney, A. B. Sampling designs for estimating spatial variance components. Applied Statistics. v. 42, p. 185, 1993.

Russo, D. Design of an optimal sampling network for estimating the variogram. Soil Science Society of America Journal. v. 48, p. 708-716, 1984.

Truong, P. N.; Heuvelink, G. B. M.; Gosling, J. P. Web-based tool for expert elicitation of the variogram. Computers and Geosciences. v. 51, p. 390-399, 2013.

Warrick, A. W.; Myers, D. E. Optimization of sampling locations for variogram calculations. Water Resources Research. v. 23, p. 496-500, 1987.

Examples

Run this code

# This example takes more than 5 seconds to run!
require(sp)
data(meuse.grid)
candi <- meuse.grid[, 1:2]
set.seed(2001)
res <- optimPPL(points = 100, candi = candi)
objSPSANN(res) # 160
objPPL(points = res, candi = candi)
countPPL(points = res, candi = candi)

Run the code above in your browser using DataLab