laGP (version 1.5-5)

# blhs: Bootstrapped block Latin hypercube subsampling

## Description

Provides bootstrapped block Latin hypercube subsampling under a given data set to aid in consistent estimation of a global separable lengthscale parameter

## Usage

blhs(y, X, m)
blhs.loop(y, X, m, K, da, g = 1e-3, maxit = 100, verb = 0, plot.it = FALSE)

## Arguments

y

a vector of responses/dependent values with length(y) = nrow(X)

X

a matrix or data.frame containing the full (large) design matrix of input locations

m

a positive scalar integer giving the number of divisions on each coordinate of input space defining the block structure

K

a positive scalar integer specifying the number of Bootstrap replicates desired

da

a lengthscale prior, say as generated by darg

g

a positive scalar giving the fixed nugget value of the nugget parameter; by default g = 1e-3

maxit

a positive scalar integer giving the maximum number of iterations for MLE calculations via "L-BFGS-B"; see mleGPsep for more details

verb

a non-negative integer specifying the verbosity level; verb = 0 (by default) is quiet, and larger values cause more progress information to be printed to the screen

plot.it

plot.it = FALSE by default; if plot.it = TRUE, then each of the K lengthscale estimates from bootstrap iterations will be shown via boxplot

## Value

blhs returns

xs

the subsampled input space

ys

the subsampled responses, length(ys) = nrow(xs)

blhs.loop returns

that

the lengthscale estimate (median), length(that) = ncol(X)

ly

the subsampled data size (median)

xm

the subsampled input space (median)

ym

the subsampled responses (median)

## Details

Bootstrapped block Latin hypercube subsampling (BLHS) yields a global lengthscale estimator which is asymptotically consistent with the MLE calculated on the full data set. However, since it works on data subsets, it comes at a much reduced computational cost. Intuitively, the BLHS guarantees a good mix of short and long pairwise distances. A single bootstrap LH subsample may be obtained by dividing each dimension of the input space equally into m intervals, yielding $$m^d$$ mutually exclusive hypercubes. It is easy to show that the average number of observations in each hypercube is $$Nm^{-d}$$ if there are $$N$$ samples in the original design. From each of these hypercubes, m blocks are randomly selected following the LH paradigm, i.e., so that only one interval is chosen from each of the m segments. The average number of observations in the subsample, combining the m randomly selected blocks, is $$Nm^{-d+1}$$.

Ensuring a subsample size of at least one requires having $$m\leq N^{\frac{1}{d-1}}$$, thereby linking the parameter m to computational effort. Smaller m is preferred so long as GP inference on data of that size remains tractable. Since the blocks follow an LH structure, the resulting sub-design inherits the usual LHS properties, e.g., retaining marginal properties like univariate stratification modulo features present in the original, large N, design.

For more details, see Liu (2014), Zhao, et al. (2017) and Sun, et al. (2019).

blhs returns the subsampled input space and the corresponding responses.

blhs.loop returns the median of the K lengthscale maximum likelihood estimates, the subsampled data size to which that corresponds, and the subsampled data, including the input space and the responses, from the bootstrap iterations

## References

F. Sun, R.B. Gramacy, B. Haaland, E. Lawrence, and A. Walker (2019). Emulating satellite drag from large simulation experiments, SIAM/ASA Journal on Uncertainty Quantification, 7(2), pp. 720-759; preprint on arXiv:1712.00182; http://arxiv.org/abs/1712.00182

Y. Zhao, Y. Hung, and Y. Amemiya (2017). Efficient Gaussian Process Modeling using Experimental Design-Based Subagging, Statistica Sinica, to appear;

Yufan Liu (2014) Recent Advances in Computer Experiment Modeling. Ph.D. Thesis at Rutgers, The State University of New Jersey. http://dx.doi.org/doi:10.7282/T38G8J1H

## Examples

# NOT RUN {
# input space based on latin-hypercube sampling (not required)
library(lhs)
## two dimensional example with N=216 sized sample
X <- randomLHS(216, 2)
# pseudo responses, not important for visualizing design
Y <- runif(216)

## BLHS sample with m=6 divisions in each coordinate
sub <- blhs(y=Y, X=X, m=6)
Xsub <- sub$xs # the bootstrapped subsample # visualization plot(X, xaxt="n", yaxt="n", xlim=c(0,1), ylim=c(0,1), xlab="factor 1", ylab="factor 2", col="cyan", main="BLHS") b <- seq(0, 1, by=1/6) abline(h=b, v=b, col="black", lty=2) axis(1, at=seq (0, 1, by=1/6), cex.axis=0.8, labels=expression(0, 1/6, 2/6, 3/6, 4/6, 5/6, 1)) axis(2, at=seq (0, 1, by=1/6), cex.axis=0.8, labels=expression(0, 1/6, 2/6, 3/6, 4/6, 5/6, 1), las=1) points(Xsub, col="red", pch=19, cex=1.25) ## Comparing global lengthscale MLE based on BLHS and random subsampling # } # NOT RUN { # famous borehole function borehole <- function(x){ rw <- x[1] * (0.15 - 0.05) + 0.05 r <- x[2] * (50000 - 100) + 100 Tu <- x[3] * (115600 - 63070) + 63070 Tl <- x[5] * (116 - 63.1) + 63.1 Hu <- x[4] * (1110 - 990) + 990 Hl <- x[6] * (820 - 700) + 700 L <- x[7] * (1680 - 1120) + 1120 Kw <- x[8] * (12045 - 9855) + 9855 m1 <- 2 * pi * Tu * (Hu - Hl) m2 <- log(r / rw) m3 <- 1 + 2*L*Tu/(m2*rw^2*Kw) + Tu/Tl return(m1/m2/m3) } N <- 100000 # number of observations xt <- randomLHS(N, 8) # input space yt <- apply(xt, 1, borehole) # response colnames(xt) <- c("rw", "r", "Tu", "Tl", "Hu", "Hl", "L", "Kw") ## prior on the GP lengthscale parameter da <- darg(list(mle=TRUE, max=100), xt) ## make space for two sets of boxplots par(mfrow=c(1,2)) # BLHS calculating with visualization of the K MLE lengthscale estimates K <- 10 # number of Bootstrap samples; Sun, et al (2017) uses K <- 31 sub_blhs <- blhs.loop(y=yt, X=xt, K=K, m=2, da=da, maxit=200, plot.it=TRUE) # a random subsampling analog for comparison sn <- sub_blhs$ly # extract a size that is consistent with the BLHS
that.rand <- matrix(NA, ncol=8, nrow=K)
for(i in 1:K){
sub <- sample(1:nrow(xt), sn)
gpsepi <- newGPsep(xt[sub,], yt[sub], d=da$start, g=1e-3, dK=TRUE) mle <- mleGPsep(gpsepi, tmin=da$min, tmax=10*da$max, ab=da$ab, maxit=200)
deleteGPsep(gpsepi)
that.rand[i,] <- mle\$d
}

## put random boxplots next to BLHS ones
boxplot(that.rand, xlab="input", ylab="theta-hat", col=2,
main="random subsampling")

# }