
prospectr (version 0.1.3)

kenStone: Kennard-Stone algorithm for calibration sampling

Description

Select calibration samples from a large multivariate dataset using the Kennard-Stone algorithm.

Usage

kenStone(X, k, metric = 'mahal', pc, group, .center = TRUE, .scale = FALSE)

Arguments

X
a numeric matrix
k
number of desired calibration samples
metric
distance metric to be used: 'euclid' (Euclidean distance) or 'mahal' (Mahalanobis distance, default).
pc
optional. If not specified, distances are computed in the Euclidean space. Alternatively, distances are computed in the principal component score space, and pc is the number of principal components retained. If pc < 1, the number of components kept corresponds to the number of components explaining at least pc * 100 percent of the total variance.
group
an optional factor (or vector that can be coerced to a factor by as.factor) of length equal to nrow(X), giving the identifier of related observations (e.g. samples of the same batch).
.center
logical value indicating whether the input matrix should be centered before Principal Component Analysis. Default set to TRUE.
.scale
logical value indicating whether the input matrix should be scaled before Principal Component Analysis. Default set to FALSE.

Value

  • a list with components:
    • 'model': numeric vector giving the row indices of the input data selected for calibration
    • 'test': numeric vector giving the row indices of the remaining observations
    • 'pc': if the pc argument is specified, a numeric matrix of the scaled PC scores
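
For instance, the returned indices can be used to split the input data into calibration and test sets (a minimal sketch; X is assumed to be the numeric matrix passed to kenStone):

sel <- kenStone(X, k = 30, metric = 'euclid')
calib <- X[sel$model, ]  # samples selected for calibration
test  <- X[sel$test, ]   # remaining observations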

Details

The Kennard-Stone algorithm selects samples with a uniform distribution over the predictor space (Kennard and Stone, 1969). It starts by selecting the pair of points that are farthest apart; these are assigned to the calibration set and removed from the list of candidate points. The procedure then adds the remaining points one at a time by computing the distance between each unassigned point $i_0$ and each selected point $i$, and choosing the unassigned point for which $$d_{selected} = \max\limits_{i_0}(\min\limits_{i}(d_{i,i_{0}}))$$ i.e. the point $i_0$ that is farthest from its closest neighbor $i$ in the calibration set.

The algorithm uses the Euclidean distance to select the points. However, the Mahalanobis distance can also be used. This is achieved by performing a PCA on the input data and computing the Euclidean distance on the truncated score matrix, according to the following definition of the Mahalanobis $H$ distance: $$H^{2}_{ij} = \sum\limits_{a=1}^{A}{(\hat{t}_{ia}-\hat{t}_{ja})^{2}/\hat{\lambda}_{a}}$$ where $\hat{t}_{ia}$ is the $a$th principal component score of point $i$, $\hat{t}_{ja}$ is the corresponding value for point $j$, $\hat{\lambda}_a$ is the eigenvalue of principal component $a$, and $A$ is the number of principal components included in the computation.
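
The max-min rule above can be written as a short loop. Below is a minimal, illustrative sketch in plain R; ks_sketch is a hypothetical helper, not part of prospectr (kenStone itself is the optimized implementation). The last lines show how the Mahalanobis $H$ distance reduces to the Euclidean distance on PC scores divided by the square roots of the eigenvalues.

# Minimal sketch of Kennard-Stone max-min selection (Euclidean distances)
ks_sketch <- function(X, k) {
  D <- as.matrix(dist(X))                             # pairwise Euclidean distances
  sel <- as.integer(which(D == max(D), arr.ind = TRUE)[1, ])  # two farthest points
  while (length(sel) < k) {
    cand <- setdiff(seq_len(nrow(X)), sel)
    dmin <- apply(D[cand, sel, drop = FALSE], 1, min) # distance to closest selected point
    sel <- c(sel, cand[which.max(dmin)])              # add the max-min point
  }
  sel
}
X <- matrix(rnorm(200), ncol = 2)                     # toy data
ks_sketch(X, k = 10)

# Mahalanobis H distance: Euclidean distance on eigenvalue-scaled scores
pca <- prcomp(X)                                      # pca$sdev = sqrt(eigenvalues)
scores <- sweep(pca$x[, 1:2], 2, pca$sdev[1:2], '/')  # here A = 2 components
ks_sketch(scores, k = 10)                             # max-min selection in H-distance space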

References

Kennard, R.W., and Stone, L.A., 1969. Computer aided design of experiments. Technometrics 11, 137-148.

See Also

duplex, shenkWest, naes, honigs

Examples

library(prospectr)
data(NIRsoil)
sel <- kenStone(NIRsoil$spc, k = 30, pc = .99)
plot(sel$pc[, 1:2], xlab = 'PC1', ylab = 'PC2')
points(sel$pc[sel$model, 1:2], pch = 19, col = 2)  # points selected for calibration
# Test on artificial data
X <- as.matrix(expand.grid(1:20, 1:20)) + rnorm(800, 0, .1)  # jittered 20 x 20 grid
plot(X, xlab = 'VAR1', ylab = 'VAR2')
sel <- kenStone(X, k = 25, metric = 'euclid')
points(X[sel$model, ], pch = 19, col = 2)
