mleGP: Inference for GP correlation parameters

Description

Maximum likelihood/a posteriori inference for (isotropic and separable) Gaussian lengthscale and nugget parameters, marginally or jointly, for Gaussian process regression

Usage

mleGP(gpi, param = c("d", "g"), tmin=sqrt(.Machine$double.eps), 
  tmax = -1, ab = c(0, 0), verb = 0)
mleGPsep(gpsepi, param=c("d", "g", "both"), tmin=rep(sqrt(.Machine$double.eps), 2),
  tmax=c(-1,1), ab=rep(0,4), maxit=100, verb=0)
mleGPsep.R(gpsepi, param=c("d", "g"), tmin=sqrt(.Machine$double.eps), 
  tmax=-1, ab=c(0,0), maxit=100, verb=0)
jmleGP(gpi, drange=c(sqrt(.Machine$double.eps),10), 
  grange=c(sqrt(.Machine$double.eps), 1), dab=c(0,0), gab=c(0,0), verb=0)
jmleGP.R(gpi, N=100, drange=c(sqrt(.Machine$double.eps),10), 
  grange=c(sqrt(.Machine$double.eps), 1), dab=c(0,0), gab=c(0,0), verb=0)
jmleGPsep(gpsepi, drange=c(sqrt(.Machine$double.eps),10), 
  grange=c(sqrt(.Machine$double.eps), 1), dab=c(0,0), gab=c(0,0), 
  maxit=100, verb=0)
jmleGPsep.R(gpsepi, N=100, drange=c(sqrt(.Machine$double.eps),10), 
  grange=c(sqrt(.Machine$double.eps), 1), dab=c(0,0), gab=c(0,0), 
  maxit=100, mleGPsep=mleGPsep.R, verb=0)

Arguments

gpi

a C-side GP object identifier (positive integer); e.g., as returned by newGP

gpsepi

similar to gpi but indicating a separable GP object, as returned by newGPsep

for jmleGP.R, the maximum number of times the pair of margins should be iterated over before determining failed convergence; note that (at this time) jmleGP uses a hard-coded N=100 in its C implementation

param

for mleGP, indicating whether to work on the lengthscale (d) or nugget (g) margin

tmin

for mleGP, smallest value considered for the parameter (param)

tmax

for mleGP, largest value considered for the parameter (param)

drange

for jmleGP, these are c(tmin, tmax) values for the lengthscale parameter; the default values are reasonable for 1-d inputs in the unit interval

grange

for jmleGP, these are c(tmin, tmax) values for the nugget parameter; the default values are reasonable for responses with a range of one

for mleGP, a non-negative 2-vector describing shape and rate parameters to a Gamma prior for the parameter (param); a zero-setting for either value results in no-prior being used (MLE inference); otherwise MAP inference is performed

maxit

for mleGPsep this is passed as control=list(trace=maxit) to optim's L-BFGS-B method for optimizing the likelihood/posterior of a separable GP representation; this argument is not used for isotropic GP versions, nor for optimizing the nugget

dab

for jmleGP, this is ab for the lengthscale parameter

gab

for jmleGP, this is ab for the nugget parameter

mleGPsep

function for internal MLE calculation of the separable lengthscale parameter; one of either mleGPsep.R based on method="L-BFGS-B" using optim; or mleGPsep using the C entry point lbfgsb. Both options use a C backend for the nugget

verb

a verbosity argument indicating how much information about the optimization steps should be printed to the screen; verb <= 0 is quiet; for jmleGP, a verb - 1 value is passed to the mleGP or mleGPsep subroutine(s)

Value

A self-explanatory data.frame is returned containing the values inferred and the number of iterations used. The jmleGP.R and jmleGPsep.R functions will also show progress details (the values obtained after each iteration over the marginals).

However, the most important “output” is the modified GP object which retains the setting of the parameters reported on output as a side effect.

mleGPsep and jmleGPsep provide an output field/column called conv indicating convergence (when 0), or alternately a value agreeing with a non-convergence code provided on output by optim

Details

mleGP and mleGPsep perform marginal (or profile) inference for the specified param, either the lengthscale or the nugget. mleGPsep can perform simultaneous lengthscale and nugget inference via a common gradient with param = "both". More details are provided below.

For the lengthscale, mleGP uses a Newton-like scheme with analytic first and second derivatives (more below) to find the scalar parameter for the isotropic Gaussian correlation function, with hard-coded 100-max iterations threshold and a sqrt(.Machine$double.eps) tolerance for determining convergence; mleGPsep.R uses L-BFGS-B via optim for the vectorized parameter of the separable Gaussian correlation, with a user-supplied maximum number of iterations (maxit) passed to optim. When maxit is reached the output conv = 1 is returned, subsequent identical calls to mleGPsep.R can be used to continue with further iterations. mleGPsep is similar, but uses the C entry point lbfgsb.

For the nugget, both mleGP and mleGPsep utilize a (nearly identical) Newton-like scheme leveraging first and second derivatives.

jmleGP and jmleGPsep provide joint inference by iterating over the marginals of lengthscale and nugget. The jmleGP.R function is an R-only wrapper around mleGP (which is primarily in C), whereas jmleGP is primarily in C but with reduced output and with hard-coded N=100. The same is true for jmleGPsep.

mleGPsep provides a param = "both" alternative to jmleGPsep leveraging a common gradient. It can be helpful to supply a larger maxit argument compared to jmleGPsep as the latter may do up to 100 outer iterations, cycling over lengthscale and nugget. mleGPsep usually requires many fewer total iterations, unless one of the lengthscale or nugget is already converged. In anticipation of param = "both" the mleGPsep function has longer default values for its bounds and prior arguments. These longer arguments are ignored when param != "both". At this time mleGP does not have a param = "both" option.

All methods are initialized at the value of the parameter(s) currently stored by the C-side object referenced by gpi or gpsepi. It is highly recommended that sensible range values (tmin, tmax or drange, grange) be provided. The defaults provided are too loose for most applications. As illustrated in the examples below, the darg and garg functions can be used to set appropriate ranges from the distributions of inputs and output data respectively.

The Newton-like method implemented for the isotropic lengthscale and for the nugget offers very fast convergence to local maxima, but sometimes it fails to converge (for any of the usual reasons). The implementation detects this, and in such cases it invokes a Brent_fmin call instead - this is the method behind the optimize function.

Note that the gpi or gpsepi object(s) must have been allocated with dK=TRUE; alternatively, one can call buildKGP or buildKGPsep - however, this is not in the NAMESPACE at this time

References

For standard GP inference, refer to any graduate text, e.g., Rasmussen & Williams Gaussian Processes for Machine Learning;

Examples

Run this code

# NOT RUN {
## a simple example with estimated nugget
library(MASS)

## motorcycle data and predictive locations
X <- matrix(mcycle[,1], ncol=1)
Z <- mcycle[,2]

## get sensible ranges
d <- darg(NULL, X)
g <- garg(list(mle=TRUE), Z)

## initialize the model
gpi <- newGP(X, Z, d=d$start, g=g$start, dK=TRUE)

## separate marginal inference (not necessary - for illustration only)
print(mleGP(gpi, "d", d$min, d$max))
print(mleGP(gpi, "g", g$min, g$max))

## joint inference (could skip straight to here)
print(jmleGP(gpi, drange=c(d$min, d$max), grange=c(g$min, g$max)))

## plot the resulting predictive surface
N <- 100
XX <- matrix(seq(min(X), max(X), length=N), ncol=1)
p <- predGP(gpi, XX, lite=TRUE)
plot(X, Z, main="stationary GP fit to motorcycle data")
lines(XX, p$mean, lwd=2)
lines(XX, p$mean+1.96*sqrt(p$s2*p$df/(p$df-2)), col=2, lty=2)
lines(XX, p$mean-1.96*sqrt(p$s2*p$df/(p$df-2)), col=2, lty=2)

## clean up
deleteGP(gpi)

## 
## with a separable correlation function 
##

## 2D Example: GoldPrice Function, mimicking GP_fit package
f <- function(x) 
{
  x1 <- 4*x[,1] - 2
  x2 <- 4*x[,2] - 2;
  t1 <- 1 + (x1 + x2 + 1)^2*(19 - 14*x1 + 3*x1^2 - 14*x2 + 6*x1*x2 + 3*x2^2);
  t2 <- 30 + (2*x1 -3*x2)^2*(18 - 32*x1 + 12*x1^2 + 48*x2 - 36*x1*x2 + 27*x2^2);
  y <- t1*t2;
  return(y)
}

## build design
library(tgp)
n <- 50 ## change to 100 or 1000 for more interesting demo
B <- rbind(c(0,1), c(0,1))
X <- dopt.gp(n, Xcand=lhs(10*n, B))$XX
## this differs from GP_fit in that we use the log response
Y <- log(f(X))

## get sensible ranges
d <- darg(NULL, X)
g <- garg(list(mle=TRUE), Y)

## build GP and jointly optimize via profile mehtods
gpisep <- newGPsep(X, Y, d=rep(d$start, 2), g=g$start, dK=TRUE)
jmleGPsep(gpisep, drange=c(d$min, d$max), grange=c(g$min, g$max))

## clean up
deleteGPsep(gpisep)

## alternatively, we can optimize via a combined gradient
gpisep <- newGPsep(X, Y, d=rep(d$start, 2), g=g$start, dK=TRUE)
mleGPsep(gpisep, param="both", tmin=c(d$min, g$min), tmax=c(d$max, d$max))
deleteGPsep(gpisep)
# }

Run the code above in your browser using DataLab