Provides bootstrapped block Latin hypercube subsampling under a given data set to aid in consistent estimation of a global separable lengthscale parameter

```
blhs(y, X, m)
blhs.loop(y, X, m, K, da, g = 1e-3, maxit = 100, verb = 0, plot.it = FALSE)
```

y

a vector of responses/dependent values with `length(y) = nrow(X)`

X

a `matrix`

or `data.frame`

containing the full (large) design matrix of input locations

m

a positive scalar integer giving the number of divisions on each coordinate of input space defining the block structure

K

a positive scalar integer specifying the number of Bootstrap replicates desired

da

a lengthscale prior, say as generated by `darg`

g

a positive scalar giving the fixed nugget value of the nugget parameter; by default `g = 1e-3`

maxit

a positive scalar integer giving the maximum number of iterations for MLE calculations via `"L-BFGS-B"`

;
see `mleGPsep`

for more details

verb

a non-negative integer specifying the verbosity level; `verb = 0`

(by default) is quiet,
and larger values cause more progress information to be printed to the screen

plot.it

`plot.it = FALSE`

by default; if `plot.it = TRUE`

, then each of the `K`

lengthscale estimates from bootstrap iterations will be shown
via `boxplot`

`blhs`

returns

`xs`

the subsampled input space

`ys`

the subsampled responses, `length(ys) = nrow(xs)`

blhs.loop returns

`that`

the lengthscale estimate (median), `length(that) = ncol(X)`

`ly`

the subsampled data size (median)

`xm`

the subsampled input space (median)

`ym`

the subsampled responses (median)

Bootstrapped block Latin hypercube subsampling (BLHS) yields a global lengthscale estimator
which is asymptotically consistent with the MLE calculated on the full data set. However, since it
works on data subsets, it comes at a much reduced computational cost. Intuitively, the BLHS
guarantees a good mix of short and long pairwise distances. A single bootstrap LH subsample
may be obtained by dividing each dimension of the input space equally into `m`

intervals, yielding \(m^d\) mutually exclusive hypercubes. It is easy to show
that the average number of observations in each hypercube is \(Nm^{-d}\)
if there are \(N\) samples in the original design. From each of these hypercubes,
`m`

`blocks`

are randomly selected following the LH paradigm, i.e., so that
only one interval is chosen from each of the `m`

segments. The average number of
observations in the subsample, combining the `m`

randomly selected blocks,
is \(Nm^{-d+1}\).

Ensuring a subsample size of at least `one`

requires having \(m\leq N^{\frac{1}{d-1}}\),
thereby linking the parameter `m`

to computational effort. Smaller `m`

is preferred so long
as GP inference on data of that size remains tractable. Since the blocks follow
an LH structure, the resulting sub-design inherits the usual LHS properties,
e.g., retaining marginal properties like univariate stratification modulo features present in the original,
large `N`

, design.

For more details, see Liu (2014), Zhao, et al. (2017) and Sun, et al. (2019).

`blhs`

returns the subsampled input space and the corresponding responses.

`blhs.loop`

returns the median of the `K`

lengthscale maximum likelihood estimates, the subsampled data size to which
that corresponds, and the subsampled data, including the input space and the responses, from the bootstrap iterations

F. Sun, R.B. Gramacy, B. Haaland, E. Lawrence, and A. Walker (2019).
*Emulating satellite drag from large simulation experiments*,
SIAM/ASA Journal on Uncertainty Quantification, 7(2), pp. 720-759;
preprint on arXiv:1712.00182;
http://arxiv.org/abs/1712.00182

Y. Zhao, Y. Hung, and Y. Amemiya (2017).
*Efficient Gaussian Process Modeling using Experimental Design-Based Subagging*,
Statistica Sinica, to appear;

Yufan Liu (2014)
*Recent Advances in Computer Experiment Modeling*.
Ph.D. Thesis at Rutgers, The State University of New Jersey.
http://dx.doi.org/doi:10.7282/T38G8J1H

# NOT RUN { # input space based on latin-hypercube sampling (not required) library(lhs) ## two dimensional example with N=216 sized sample X <- randomLHS(216, 2) # pseudo responses, not important for visualizing design Y <- runif(216) ## BLHS sample with m=6 divisions in each coordinate sub <- blhs(y=Y, X=X, m=6) Xsub <- sub$xs # the bootstrapped subsample # visualization plot(X, xaxt="n", yaxt="n", xlim=c(0,1), ylim=c(0,1), xlab="factor 1", ylab="factor 2", col="cyan", main="BLHS") b <- seq(0, 1, by=1/6) abline(h=b, v=b, col="black", lty=2) axis(1, at=seq (0, 1, by=1/6), cex.axis=0.8, labels=expression(0, 1/6, 2/6, 3/6, 4/6, 5/6, 1)) axis(2, at=seq (0, 1, by=1/6), cex.axis=0.8, labels=expression(0, 1/6, 2/6, 3/6, 4/6, 5/6, 1), las=1) points(Xsub, col="red", pch=19, cex=1.25) ## Comparing global lengthscale MLE based on BLHS and random subsampling # } # NOT RUN { # famous borehole function borehole <- function(x){ rw <- x[1] * (0.15 - 0.05) + 0.05 r <- x[2] * (50000 - 100) + 100 Tu <- x[3] * (115600 - 63070) + 63070 Tl <- x[5] * (116 - 63.1) + 63.1 Hu <- x[4] * (1110 - 990) + 990 Hl <- x[6] * (820 - 700) + 700 L <- x[7] * (1680 - 1120) + 1120 Kw <- x[8] * (12045 - 9855) + 9855 m1 <- 2 * pi * Tu * (Hu - Hl) m2 <- log(r / rw) m3 <- 1 + 2*L*Tu/(m2*rw^2*Kw) + Tu/Tl return(m1/m2/m3) } N <- 100000 # number of observations xt <- randomLHS(N, 8) # input space yt <- apply(xt, 1, borehole) # response colnames(xt) <- c("rw", "r", "Tu", "Tl", "Hu", "Hl", "L", "Kw") ## prior on the GP lengthscale parameter da <- darg(list(mle=TRUE, max=100), xt) ## make space for two sets of boxplots par(mfrow=c(1,2)) # BLHS calculating with visualization of the K MLE lengthscale estimates K <- 10 # number of Bootstrap samples; Sun, et al (2017) uses K <- 31 sub_blhs <- blhs.loop(y=yt, X=xt, K=K, m=2, da=da, maxit=200, plot.it=TRUE) # a random subsampling analog for comparison sn <- sub_blhs$ly # extract a size that is consistent with the BLHS that.rand <- matrix(NA, ncol=8, nrow=K) for(i in 1:K){ sub <- sample(1:nrow(xt), sn) gpsepi <- newGPsep(xt[sub,], yt[sub], d=da$start, g=1e-3, dK=TRUE) mle <- mleGPsep(gpsepi, tmin=da$min, tmax=10*da$max, ab=da$ab, maxit=200) deleteGPsep(gpsepi) that.rand[i,] <- mle$d } ## put random boxplots next to BLHS ones boxplot(that.rand, xlab="input", ylab="theta-hat", col=2, main="random subsampling") # }