GPModel: Create a `GPModel` object

Description

Create a GPModel that contains a Gaussian process and / or a mixed effects model with grouped random effects

Usage

GPModel(likelihood = "gaussian", group_data = NULL,
  group_rand_coef_data = NULL, ind_effect_group_rand_coef = NULL,
  drop_intercept_group_rand_effect = NULL, gp_coords = NULL,
  gp_rand_coef_data = NULL, cov_function = "matern", cov_fct_shape = 1.5,
  gp_approx = "none", num_parallel_threads = NULL,
  matrix_inversion_method = "default", weights = NULL,
  likelihood_learning_rate = 1, cov_fct_taper_range = 1,
  cov_fct_taper_shape = 1, num_neighbors = NULL,
  vecchia_ordering = "random", ind_points_selection = "kmeans++",
  num_ind_points = NULL, cover_tree_radius = 1, seed = 0L,
  cluster_ids = NULL, likelihood_additional_param = NULL,
  free_raw_data = FALSE, vecchia_approx = NULL, vecchia_pred_type = NULL,
  num_neighbors_pred = NULL)

Value

A GPModel containing a Gaussian process and / or a mixed effects model with grouped random effects

Arguments

likelihood

A string specifying the likelihood function (distribution) of the response variable. Available options:

"gaussian"
"bernoulli_logit": Bernoulli likelihood with a logit link function for binary classification. Aliases: "binary", "binary_logit"
"bernoulli_probit": Bernoulli likelihood with a probit link function for binary classification. Aliases: "binary_probit"
"binomial_logit": Binomial likelihood with a logit link function. The response variable y needs to contain proportions of successes / trials, and the weights parameter needs to contain the numbers of trials. Aliases: "binomial"
"binomial_probit": Binomial likelihood with a probit link function. The response variable y needs to contain proportions of successes / trials, and the weights parameter needs to contain the numbers of trials
"beta_binomial": Beta-binomial likelihood with a logit link function. The response variable y needs to contain proportions of successes / trials, and the weights parameter needs to contain the numbers of trials. Aliases: "betabinomial", "beta-binomial"
"poisson": Poisson likelihood with a log link function
"negative_binomial": negative binomial likelihood with a log link function (aka "nbinom2", "negative_binomial_2"). The variance is mu * (mu + r) / r, mu = mean, r = shape, with this parametrization
"negative_binomial_1": Negative binomial 1 (aka "nbinom1") likelihood with a log link function. The variance is mu * (1 + phi), mu = mean, phi = dispersion, with this parametrization
"gamma": Gamma likelihood with a log link function
"lognormal": Log-normal likelihood with a log link function
"beta" : Beta likelihood with a logit link function (parametrization of Ferrari and Cribari-Neto, 2004)
"t": t-distribution (e.g., for robust regression)
"t_fix_df": t-distribution with the degrees-of-freedom (df) held fixed and not estimated. The df can be set via the likelihood_additional_param parameter
"zero_inflated_gamma": Zero-inflated gamma likelihood. The log-transformed mean of the response variable equals the sum of fixed and random effects, E(y) = mu = exp(F(X) + Zb), and the rate parameter equals (1-p0) * gamma / mu, where p0 is the zero-inflation probability and gamma the shape parameter. I.e., the rate parameter depends on F(X) + Zb, and p0 and gamma are (univariate auxiliary) parameters that are estimated. Note that E(y) = mu above refers the the mean of the entire distribution and not just the positive part
"zero_censored_power_transformed_normal": Likelihood of a censored and power-transformed normal variable for modeling data with a point mass at 0 and a continuous distribution for y > 0. The model used is Y = max(0,X)^lambda, X ~ N(mu, sigma^2), where mu = F(X) + Zb, and sigma and lambda are (auxiliary) parameters that are estimated. For more details on this model, see Sigrist et al. (2012, AOAS) "A dynamic nonstationary spatio-temporal model for short term prediction of precipitation"
"gaussian_heteroscedastic": Gaussian likelihood where both the mean and the variance are related to fixed and random effects. This is currently only implemented for GPs with a 'vecchia' approximation
Note: the first lines in the likelihoods source file contain additional comments on the specific parametrizations used
Note: other likelihoods can be implemented upon request
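
The two negative binomial parametrizations listed above differ in how the variance grows with the mean. The following base R sketch only evaluates the two variance formulas stated above; the helper names nb2_var and nb1_var are hypothetical and not part of gpboost.

```r
# Variance functions of the two negative binomial parametrizations
# (hypothetical helper names, for illustration only)
nb2_var <- function(mu, r) mu * (mu + r) / r    # "negative_binomial": quadratic in mu
nb1_var <- function(mu, phi) mu * (1 + phi)     # "negative_binomial_1": linear in mu

mu <- c(0.5, 2, 10)
v2 <- nb2_var(mu, r = 2)
v1 <- nb1_var(mu, phi = 0.5)
print(cbind(mu = mu, nbinom2 = v2, nbinom1 = v1))
```

For large means the "negative_binomial" variance is dominated by the mu^2 / r term, whereas the "negative_binomial_1" variance stays proportional to the mean.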

group_data

A vector or matrix whose columns are categorical grouping variables. The elements are the group levels that define the grouped random effects. The elements of 'group_data' can be integer, double, or character. The number of columns corresponds to the number of grouped (intercept) random effects

group_rand_coef_data

A vector or matrix with numeric covariate data for grouped random coefficients

ind_effect_group_rand_coef

A vector with integer indices that indicate the corresponding categorical grouping variable (=columns) in 'group_data' for every covariate in 'group_rand_coef_data'. Counting starts at 1. The length of this index vector must equal the number of covariates in 'group_rand_coef_data'. For instance, c(1,1,2) means that the first two covariates (=first two columns) in 'group_rand_coef_data' have random coefficients corresponding to the first categorical grouping variable (=first column) in 'group_data', and the third covariate (=third column) in 'group_rand_coef_data' has a random coefficient corresponding to the second grouping variable (=second column) in 'group_data'
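
As a minimal sketch of how 'group_rand_coef_data' and 'ind_effect_group_rand_coef' fit together, the following adds one random slope to a model with two grouped random effects. It assumes that the example data set GPBoost_data provides the objects group_data (two columns) and X (covariate in the second column), as used in the package examples.

```r
library(gpboost)
data(GPBoost_data, package = "gpboost")

# Two grouped random intercepts (one per column of group_data) plus a
# random slope for the covariate X[, 2]. The index 1 ties this slope
# to the first grouping variable (first column of group_data).
gp_model <- GPModel(group_data = group_data,
                    group_rand_coef_data = X[, 2],
                    ind_effect_group_rand_coef = 1,
                    likelihood = "gaussian")
```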

drop_intercept_group_rand_effect

A vector of type logical (boolean). Indicates whether intercept random effects are dropped (only for random coefficients). If drop_intercept_group_rand_effect[k] is TRUE, the intercept random effect number k is dropped / not included. Only random effects with random slopes can be dropped.

gp_coords

A matrix with numeric coordinates (= inputs / features) for defining Gaussian processes

gp_rand_coef_data

A vector or matrix with numeric covariate data for Gaussian process random coefficients

cov_function

A string specifying the covariance function for the Gaussian process. Available options:

"matern": Matern covariance function with the smoothness specified by the cov_fct_shape parameter (using the parametrization of Rasmussen and Williams, 2006)
"matern_estimate_shape": same as "matern" but the smoothness parameter is also estimated
"matern_space_time": Spatio-temporal Matern covariance function with different range parameters for space and time. Note that the first column in gp_coords must correspond to the time dimension
"space_time_gneiting": Spatio-temporal covariance function given in Eq. (16) of Gneiting (2002). Note that the first column in gp_coords must correspond to the time dimension. This covariance has seven parameters (in the following order: sigma2, a, c, alpha, nu, beta, delta) which are all estimated by default. You can disable the estimation of some of these parameter using the 'estimate_cov_par_index' argument of the params argument in either the fit function of a gp_model object or the set_optim_params function prior to estimation.
"matern_ard": anisotropic Matern covariance function with Automatic Relevance Determination (ARD), i.e., with a different range parameter for every coordinate dimension / column of gp_coords
"matern_ard_estimate_shape": same as "matern_ard" but the smoothness parameter is also estimated
"exponential": Exponential covariance function (using the parametrization of Diggle and Ribeiro, 2007)
"gaussian": Gaussian, aka squared exponential, covariance function (using the parametrization of Diggle and Ribeiro, 2007)
"gaussian_ard": anisotropic Gaussian, aka squared exponential, covariance function with Automatic Relevance Determination (ARD), i.e., with a different range parameter for every coordinate dimension / column of gp_coords
"powered_exponential": powered exponential covariance function with the exponent specified by the cov_fct_shape parameter (using the parametrization of Diggle and Ribeiro, 2007)
"wendland": Compactly supported Wendland covariance function (using the parametrization of Bevilacqua et al., 2019, AOS)
"linear": linear covariance function. This corresponds to a Bayesian linear regression model with a Gaussian prior on the coefficients with a constant variance diagonal prior covariance, and the prior variance is estimated using empirical Bayes.

cov_fct_shape

A numeric specifying the shape parameter of the covariance function (e.g., smoothness parameter for Matern and Wendland covariance). This parameter is irrelevant for some covariance functions such as the exponential or Gaussian

gp_approx

A string specifying the large data approximation for Gaussian processes. Available options:

"none": No approximation
"vecchia": Vecchia approximation; see Sigrist (2022, JMLR) for more details
"full_scale_vecchia": Vecchia-inducing points full-scale (VIF) approximation; see Gyger, Furrer, and Sigrist (2025) for more details
"tapering": The covariance function is multiplied by a compactly supported Wendland correlation function
"fitc": Fully Independent Training Conditional approximation aka modified predictive process approximation; see Gyger, Furrer, and Sigrist (2024) for more details
"full_scale_tapering": Full-scale approximation combining an inducing point / predictive process approximation with tapering on the residual process; see Gyger, Furrer, and Sigrist (2024) for more details
"vecchia_latent": similar as "vecchia" but a Vecchia approximation is applied to the latent Gaussian process for likelihood == "gaussian". For likelihood != "gaussian", "vecchia" and "vecchia_latent" are equivalent

num_parallel_threads

An integer specifying the number of parallel threads for OMP. If num_parallel_threads = NULL, all available threads are used

matrix_inversion_method

A string specifying the method used for inverting covariance matrices. Available options:

"default": iterative methods where possible, otherwise Cholesky factorization
"cholesky": Cholesky factorization
"iterative": iterative methods. A combination of the conjugate gradient, the Lanczos algorithm, and other methods. This is currently only supported for the following cases:
- grouped random effects with more than one level
- likelihood != "gaussian" and gp_approx == "vecchia" (non-Gaussian likelihoods with a Vecchia-Laplace approximation)
- likelihood != "gaussian" and gp_approx == "full_scale_vecchia" (non-Gaussian likelihoods with a VIF approximation)
- likelihood == "gaussian" and gp_approx == "full_scale_tapering" (Gaussian likelihood with a full-scale tapering approximation)
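
A minimal sketch of requesting iterative methods explicitly, using one of the supported cases listed above (a non-Gaussian likelihood with a Vecchia approximation). It assumes that GPBoost_data provides the coordinate matrix coords, as in the examples at the end of this page.

```r
library(gpboost)
data(GPBoost_data, package = "gpboost")

# Non-Gaussian likelihood + Vecchia approximation: one of the cases
# for which matrix_inversion_method = "iterative" is supported
gp_model <- GPModel(gp_coords = coords, cov_function = "matern", cov_fct_shape = 1.5,
                    likelihood = "bernoulli_logit", gp_approx = "vecchia",
                    matrix_inversion_method = "iterative")
```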

weights

A vector with sample weights

likelihood_learning_rate

A numeric with a learning rate for the likelihood for generalized Bayesian inference (only non-Gaussian likelihoods)

cov_fct_taper_range

A numeric specifying the range parameter of the Wendland covariance function and Wendland correlation taper function. We follow the notation of Bevilacqua et al. (2019, AOS)

cov_fct_taper_shape

A numeric specifying the shape (=smoothness) parameter of the Wendland covariance function and Wendland correlation taper function. We follow the notation of Bevilacqua et al. (2019, AOS)

num_neighbors

An integer specifying the number of neighbors for the Vecchia and VIF approximations. Internal default values if NULL:

20 for gp_approx = "vecchia"
30 for gp_approx = "full_scale_vecchia"

Note: for prediction, the number of neighbors can be set through the 'num_neighbors_pred' parameter in the 'set_prediction_data' function. By default, num_neighbors_pred = 2 * num_neighbors. Further, the type of Vecchia approximation used for making predictions is set through the 'vecchia_pred_type' parameter in the 'set_prediction_data' function

vecchia_ordering

A string specifying the ordering used in the Vecchia approximation. Available options:

"none": the default ordering in the data is used
"random": a random ordering
"time": ordering accorrding to time (only for space-time models)
"time_random_space": ordering according to time and randomly for all spatial points with the same time points (only for space-time models)

ind_points_selection

A string specifying the method for choosing inducing points. Available options:

"kmeans++: the k-means++ algorithm
"cover_tree": the cover tree algorithm
"random": random selection from data points

num_ind_points

An integer specifying the number of inducing points / knots for FITC, full_scale_tapering, and VIF approximations. Internal default values if NULL:

500 for gp_approx = "fitc" and gp_approx = "full_scale_tapering"
200 for gp_approx = "full_scale_vecchia"
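
A sketch of an inducing point approximation with an explicit number of inducing points. The value 50 is an arbitrary illustration chosen to be well below the number of observations in the example data; coords is assumed to come from GPBoost_data.

```r
library(gpboost)
data(GPBoost_data, package = "gpboost")

# FITC approximation with 50 inducing points chosen by k-means++
gp_model <- GPModel(gp_coords = coords, cov_function = "matern", cov_fct_shape = 1.5,
                    likelihood = "gaussian", gp_approx = "fitc",
                    num_ind_points = 50, ind_points_selection = "kmeans++")
```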

cover_tree_radius

A numeric specifying the radius (= "spatial resolution") for the cover tree algorithm

seed

An integer specifying the seed used for model creation (e.g., random ordering in Vecchia approximation)

cluster_ids

A vector with elements indicating independent realizations of random effects / Gaussian processes (same values = same process realization). The elements of 'cluster_ids' can be integer, double, or character.
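
As an illustration, the following sketch splits the observations into two clusters that are modeled as independent realizations of the same Gaussian process. The alternating assignment is arbitrary and only serves as an example; coords is assumed to come from GPBoost_data.

```r
library(gpboost)
data(GPBoost_data, package = "gpboost")

# Observations with the same cluster id share one process realization;
# observations with different ids are treated as independent
cluster_ids <- rep(c(1, 2), length.out = nrow(coords))
gp_model <- GPModel(gp_coords = coords, cov_function = "exponential",
                    cluster_ids = cluster_ids, likelihood = "gaussian")
```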

likelihood_additional_param

A numeric specifying an additional parameter for the likelihood which cannot be estimated for this likelihood (e.g., degrees of freedom for likelihood = "t_fix_df"). This is not to be confused with any auxiliary parameters that can be estimated and accessed through the function get_aux_pars after estimation. Note that this likelihood_additional_param parameter is irrelevant for many likelihoods. If likelihood_additional_param = NULL, the following internal default values are used:

df = 2 for likelihood = "t_fix_df"
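
A minimal sketch of fixing the degrees of freedom of a t-likelihood instead of relying on the internal default. The value 5 is an arbitrary illustration, and group_data is assumed to come from GPBoost_data.

```r
library(gpboost)
data(GPBoost_data, package = "gpboost")

# t-likelihood with the degrees of freedom held fixed at 5
gp_model <- GPModel(group_data = group_data[, 1], likelihood = "t_fix_df",
                    likelihood_additional_param = 5)
```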

free_raw_data

A boolean. If TRUE, the data (groups, coordinates, covariate data for random coefficients) is freed in R after initialization

vecchia_approx

Discontinued. Use the argument gp_approx instead

vecchia_pred_type

A string specifying the type of Vecchia approximation used for making predictions. This is discontinued here. Use the function 'set_prediction_data' to specify this

num_neighbors_pred

An integer specifying the number of neighbors for making predictions. This is discontinued here. Use the function 'set_prediction_data' to specify this

Author

Fabio Sigrist

Examples

# See https://github.com/fabsig/GPBoost/tree/master/R-package for more examples

library(gpboost)
data(GPBoost_data, package = "gpboost")

#--------------------Grouped random effects model: single-level random effect----------------
gp_model <- GPModel(group_data = group_data[,1], likelihood="gaussian")

#--------------------Gaussian process model----------------
gp_model <- GPModel(gp_coords = coords, cov_function = "matern", cov_fct_shape = 1.5,
                    likelihood="gaussian")

#--------------------Combine Gaussian process with grouped random effects----------------
gp_model <- GPModel(group_data = group_data,
                    gp_coords = coords, cov_function = "matern", cov_fct_shape = 1.5,
                    likelihood="gaussian")