gp_cv: gp_cv

Description

Performs cross-validation to select an optimal tuning parameter for penalized MLE of the lengthscale parameter in Gaussian processes.

Usage

gp_cv(
  y,
  x,
  lambda = NULL,
  sep = TRUE,
  mu = FALSE,
  g = FALSE,
  fixed_g = NULL,
  profile = TRUE,
  initialvals = NULL,
  n_init = 10,
  scad = FALSE,
  k = NULL,
  theta_upper = 1000,
  theta_lower = 0.001,
  metric = "dpe",
  ncores = 1
)

Value

A list containing y, x, the selected lambda values, and the settings used:

y: A copy of y.
x: A copy of x.
lambda.dpe.min: Returned when k is specified and metric="dpe"; the lambda value that minimizes the dpe across the folds.
lambda.dpe.1se: Returned when k is specified and metric="dpe"; the lambda value selected using the one-standard-error rule.
lambda.min: Returned when k is not specified or metric="mse"; the lambda value that minimizes mean squared error across the folds.
lambda.1se: Returned when k is not specified or metric="mse"; the lambda value selected using the one-standard-error rule.
lambda.score.max: Returned when k is specified and metric="score"; the lambda value that maximizes the score across the folds.
lambda.score.1se: Returned when k is specified and metric="score"; the lambda value selected using the one-standard-error rule.
lambda.md.min: Returned when k is specified and metric="md"; the lambda value that minimizes the md across the folds.
lambda.md.1se: Returned when k is specified and metric="md"; the lambda value selected using the one-standard-error rule.
initialvals: A vector or matrix of initial values used in optim.
n_init: A copy of n_init: the number of randomly generated initial value sets.
d: The dimensionality of the lengthscale parameter. If sep=TRUE, d is equal to the number of columns in x. Otherwise it is set to 1 for isotropic kernels.
profile: A copy of the logical indicator for profile likelihood optimization.
mu: A copy of the logical indicator for mean estimation.
g: A copy of the logical indicator for nugget estimation.
fixed_g: The fixed nugget value used when g = FALSE. If NULL, the nugget is set to 1.490116e-08 in the mle_penalty function.
metric: A copy of the evaluation metric used in CV.
scad: A copy of the logical indicator for SCAD penalty usage.
theta_upper: A copy of theta_upper for optimization.
theta_lower: A copy of theta_lower for optimization.
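
The names of the returned lambda elements depend on k and metric, as listed above. A minimal sketch of that naming rule follows. The helper selected_lambda_names and the mock list cv_fit are hypothetical and are not part of the package; the numbers are made up for illustration.

```r
# Hypothetical helper (not part of the package): map the CV settings to the
# names of the list elements that hold the selected lambda values.
selected_lambda_names <- function(metric = "dpe", k = NULL) {
  # Leave-one-out CV (k = NULL) always uses the mse metric.
  if (is.null(k) || metric == "mse") {
    return(c("lambda.min", "lambda.1se"))
  }
  switch(metric,
    dpe   = c("lambda.dpe.min", "lambda.dpe.1se"),
    score = c("lambda.score.max", "lambda.score.1se"),
    md    = c("lambda.md.min", "lambda.md.1se"),
    stop("unknown metric: ", metric)
  )
}

# Mock list standing in for the output of gp_cv(y, x, k = 4);
# the values are invented.
cv_fit <- list(lambda.dpe.min = 0.5, lambda.dpe.1se = 1, metric = "dpe")

# Extract both selected lambda values by name.
picked <- unlist(cv_fit[selected_lambda_names(cv_fit$metric, k = 4)])
print(picked)
```

With a real fit, the same lookup would be applied to the list returned by gp_cv rather than to the mock list.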

Arguments

y: A numeric vector of the response variable.
x: A numeric vector or matrix of the input variables.
lambda: A tuning parameter. Default is NULL. Users may specify one or more lambda values to be evaluated. When NULL, 41 lambda values ranging from 0 to 7.389 will be automatically evaluated.
sep: Logical indicator for using a separable kernel function (sep=TRUE) or an isotropic kernel function (sep=FALSE). Default is TRUE.
mu: Logical indicator for assuming zero mean (mu=FALSE) or estimating the mean (mu=TRUE). Default is FALSE (assumes the data is centered beforehand).
g: Logical indicator for fixing the nugget value to a small constant (g=FALSE) or estimating the nugget (g=TRUE). Default is FALSE.
fixed_g: Nugget value to fix when g=FALSE. Default is fixed_g=NULL. If NULL, the nugget is fixed to 1.490116e-08.
profile: Logical indicator for optimizing the profile log-likelihood (profile=TRUE). When TRUE, the log-likelihood is a function of the lengthscale and nugget only, and the scale and mean parameters are obtained from their closed-form solutions. When FALSE, the full log-likelihood is optimized (lengthscale, scale, mean, and nugget are estimated together). Default is TRUE.
initialvals: A numeric vector or matrix of initial values for optimization. The length should match the number of parameters to estimate. Default is NULL. If NULL, 10 sets of initial values are randomly generated; the number of sets can be changed with n_init.
n_init: An integer indicating the number of randomly generated initial value sets to evaluate when initialvals is not provided. Default is 10.
scad: Logical indicator for the penalty type: lasso penalty (scad=FALSE) or SCAD penalty (scad=TRUE). Default is FALSE (lasso penalty).
k: The number of folds for k-fold CV. Default is NULL. When NULL, leave-one-out CV using the mean squared error metric is performed. To conduct k-fold CV, users must specify a value for k.
theta_upper: Upper bound for theta in optim. Default is 1000.
theta_lower: Lower bound for theta in optim. Default is 0.001.
metric: The evaluation metric used in CV. Default is "dpe". The available metrics are "dpe", "md", "score", and "mse". The dpe, md, and score metrics are only available when k is specified.
ncores: The number of cores for parallel computing with optim. Default is 1 (no parallelization). Make sure your system supports the specified number of cores. Parallelization is recommended to improve performance.
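
To make the role of k concrete, the following base R sketch shows one common way to assign n observations to k folds, with leave-one-out as the special case k = n. It is an illustration only; the fold assignment used inside gp_cv may differ, and the variable names are assumptions.

```r
set.seed(1)  # for a reproducible split

n <- 8  # number of observations
k <- 4  # number of folds

# Balanced random fold labels: each label 1..k appears n / k times.
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  test_idx  <- which(fold_id == fold)            # held-out observations
  train_idx <- setdiff(seq_len(n), test_idx)     # observations used for fitting
  # A model would be fitted on train_idx and evaluated on test_idx here.
  cat("fold", fold, "holds out observations", test_idx, "\n")
}

# Leave-one-out CV corresponds to k = n: every observation is its own fold.
loo_id <- seq_len(n)
```

With n = 8 and k = 4, each fold holds out two observations, which is why metrics that use the correlation among held-out points need k-fold rather than leave-one-out CV.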

Details

This function supports both leave-one-out and k-fold cross-validation for selecting a suitable tuning parameter value in penalized likelihood estimation. Users can choose among several evaluation metrics to guide the selection: decorrelated prediction error (dpe), Mahalanobis distance (md), score, and mean squared error (mse). The dpe, md, and score metrics are available only with k-fold cross-validation, as these metrics account for correlation structure. For leave-one-out cross-validation, only the mse metric can be used. For the dpe, md, and mse metrics, the lambda corresponding to the minimum value across the k folds is selected as optimal. For the score metric, the lambda with the maximum value is selected. The function returns the optimal lambda value along with the lambda selected using the one-standard-error rule.
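
The selection step can be sketched in base R as follows. The per-fold error matrix is invented for illustration, and the sketch assumes the common convention that the one-standard-error rule picks the largest lambda whose mean error is within one standard error of the minimum; the rule as implemented in gp_cv may differ in detail.

```r
# Hypothetical per-fold errors: rows are folds, columns are lambda values.
lambda <- c(0, 0.1, 0.5, 1, 2)
cv_err <- rbind(
  c(0.30, 0.22, 0.18, 0.22, 0.40),
  c(0.34, 0.26, 0.24, 0.21, 0.44),
  c(0.28, 0.24, 0.21, 0.23, 0.38),
  c(0.32, 0.20, 0.21, 0.22, 0.42)
)
k <- nrow(cv_err)

# Mean error and its standard error across folds, per lambda.
mean_err <- colMeans(cv_err)
se_err   <- apply(cv_err, 2, sd) / sqrt(k)

# Lambda minimizing the mean error (as for dpe, md, and mse).
i_min      <- which.min(mean_err)
lambda_min <- lambda[i_min]

# One-standard-error rule (assumed convention): largest lambda whose
# mean error does not exceed the minimum plus one standard error.
threshold  <- mean_err[i_min] + se_err[i_min]
lambda_1se <- max(lambda[mean_err <= threshold])

print(c(lambda_min = lambda_min, lambda_1se = lambda_1se))
```

For a metric that is maximized, such as the score, the same logic applies with which.max and a threshold of the maximum minus one standard error.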

Examples

# \donttest{
### training data ###
n <- 8

### test function ###
f_x <- function(x) {
  return(sin(2*pi*x) + x^2)
}

### generate x ###
x <- runif(n, 0, 1)
y <- f_x(x)

### k-fold cross validation ###
cv.lambda <- gp_cv(y, x, k=4)

### mse metric ###
cv.lambda <- gp_cv(y, x, k=4, metric="mse")

### leave-one-out cross validation ###
cv.lambda <- gp_cv(y, x)


### specify the number of randomly generated initial value sets to be evaluated ###
cv.lambda <- gp_cv(y, x, n_init=5)

# }
