This distribution implements the variational Gaussian process (VGP), as
described in Titsias (2009) and Hensman (2013). The VGP is an
inducing point-based approximation of an exact GP posterior.
Ultimately, this Distribution class represents a marginal distribution over function values at a collection of index_points. It is parameterized by
a kernel function,
a mean function,
the (scalar) observation noise variance of the normal likelihood,
a set of index points,
a set of inducing index points, and
the parameters of the (full-rank, Gaussian) variational posterior distribution over function values at the inducing points, conditional on some observations.
tfd_variational_gaussian_process(
  kernel,
  index_points,
  inducing_index_points,
  variational_inducing_observations_loc,
  variational_inducing_observations_scale,
  mean_fn = NULL,
  observation_noise_variance = 0,
  predictive_noise_variance = 0,
  jitter = 1e-06,
  validate_args = FALSE,
  allow_nan_stats = FALSE,
  name = "VariationalGaussianProcess"
)
kernel: PositiveSemidefiniteKernel-like instance representing the GP's covariance function.
index_points: float Tensor representing finite (batch of) vector(s) of points in the index set over which the VGP is defined. Shape has the form [b1, ..., bB, e1, f1, ..., fF] where F is the number of feature dimensions and must equal kernel$feature_ndims and e1 is the number (size) of index points in each batch (we denote it e1 to distinguish it from the number of inducing index points, denoted e2 below). Ultimately the VariationalGaussianProcess distribution corresponds to an e1-dimensional multivariate normal. The batch shape must be broadcastable with kernel$batch_shape, the batch shape of inducing_index_points, and any batch dims yielded by mean_fn.
inducing_index_points: float Tensor of locations of inducing points in the index set. Shape has the form [b1, ..., bB, e2, f1, ..., fF], just like index_points. The batch shape components needn't be identical to those of index_points, but must be broadcast compatible with them.
variational_inducing_observations_loc: float Tensor; the mean of the (full-rank Gaussian) variational posterior over function values at the inducing points, conditional on observed data. Shape has the form [b1, ..., bB, e2], where b1, ..., bB is broadcast compatible with other parameters' batch shapes, and e2 is the number of inducing points.
variational_inducing_observations_scale: float Tensor; the scale matrix of the (full-rank Gaussian) variational posterior over function values at the inducing points, conditional on observed data. Shape has the form [b1, ..., bB, e2, e2], where b1, ..., bB is broadcast compatible with other parameters and e2 is the number of inducing points.
mean_fn: function that acts on index points to produce a (batch of) vector(s) of mean values at those index points. Takes a Tensor of shape [b1, ..., bB, f1, ..., fF] and returns a Tensor whose shape is (broadcastable with) [b1, ..., bB]. Default value: NULL implies constant zero function.
observation_noise_variance: float Tensor representing the variance of the noise in the Normal likelihood distribution of the model. May be batched, in which case the batch shape must be broadcastable with the shapes of all other batched parameters (kernel$batch_shape, index_points, etc.). Default value: 0.
predictive_noise_variance: float Tensor representing additional variance in the posterior predictive model. If NULL, we simply re-use observation_noise_variance for the posterior predictive noise. If set explicitly, however, we use the given value. This allows us, for example, to omit predictive noise variance (by setting this to zero) to obtain noiseless posterior predictions of function values, conditioned on noisy observations.
jitter: float scalar Tensor added to the diagonal of the covariance matrix to ensure positive definiteness of the covariance matrix. Default value: 1e-6.
validate_args: Logical. When TRUE, distribution parameters are checked for validity despite possibly degrading runtime performance. When FALSE, invalid inputs may silently render incorrect outputs. Default value: FALSE.
allow_nan_stats: Logical. When TRUE, statistics (e.g., mean, mode, variance) use the value NaN to indicate the result is undefined. When FALSE, an exception is raised if one or more of the statistic's batch members are undefined. Default value: FALSE.
name: name prefixed to Ops created by this class.
a distribution instance.
A VGP is "trained" by selecting any kernel parameters, the locations of the inducing index points, and the variational parameters. Titsias (2009) and Hensman (2013) describe a variational lower bound on the marginal log likelihood of observed data, which this class offers through the variational_loss method (this is the negative lower bound, for convenience when plugging into a TF Optimizer's minimize function).
Training may be done in minibatches.
Titsias (2009) describes a closed form for the optimal variational parameters, in the case of sufficiently small observational data (i.e., small enough to fit in memory but big enough to warrant approximating the GP posterior). A method to compute these optimal parameters in terms of the full observational data set is provided as a static method, optimal_variational_posterior. It returns a MultivariateNormalLinearOperator instance with optimal location and scale parameters.
Mathematical Details
Notation
We will in general be concerned about three collections of index points, and it'll be good to give them names:
x[1], ..., x[N]: observation index points -- locations of our observed data.
z[1], ..., z[M]: inducing index points -- locations of the "summarizing" inducing points.
t[1], ..., t[P]: predictive index points -- locations where we are making posterior predictions based on observations and the variational parameters.
To lighten notation, we'll use X, Z, T to denote the above collections. Similarly, we'll denote by f(X) the collection of function values at each of the x[i], and by Y, the collection of (noisy) observed data at each x[i]. We'll denote kernel matrices generated from pairs of index points as K_tt, K_xt, K_tz, etc., e.g.,
K_tz = | k(t[1], z[1])    k(t[1], z[2])    ...    k(t[1], z[M]) |
       | k(t[2], z[1])    k(t[2], z[2])    ...    k(t[2], z[M]) |
       |      ...              ...                     ...      |
       | k(t[P], z[1])    k(t[P], z[2])    ...    k(t[P], z[M]) |
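As a concrete illustration of this notation, the following base R sketch builds K_tz for scalar index points. The exponentiated quadratic kernel, its parameter names (amplitude, length_scale), and the helper kernel_matrix are assumptions made for the example; they are not part of this package's API.

```r
# Exponentiated quadratic kernel on scalar inputs (an illustrative choice).
eq_kernel <- function(a, b, amplitude = 1, length_scale = 1) {
  amplitude^2 * exp(-(a - b)^2 / (2 * length_scale^2))
}

# Hypothetical helper: returns the matrix whose (i, j) entry is
# kernel(rows[i], cols[j]), e.g. K_tz for rows = T and cols = Z.
kernel_matrix <- function(rows, cols, kernel = eq_kernel) {
  outer(rows, cols, kernel)
}

t_points <- c(0, 0.5, 1)     # P = 3 predictive index points
z_points <- c(0.25, 0.75)    # M = 2 inducing index points

K_tz <- kernel_matrix(t_points, z_points)   # P x M
K_zt <- kernel_matrix(z_points, t_points)   # M x P, the transpose of K_tz
```

For a symmetric kernel, swapping the two subscripts transposes the matrix, so K_zt equals t(K_tz).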
Preliminaries
A Gaussian process is an indexed collection of random variables, any finite
collection of which are jointly Gaussian. Typically, the index set is some
finite-dimensional, real vector space, and indeed we make this assumption in
what follows. The GP may then be thought of as a distribution over functions
on the index set. Samples from the GP are functions on the whole index set;
these can't be represented in finite compute memory, so one typically works
with the marginals at a finite collection of index points. The properties of
the GP are entirely determined by its mean function m and covariance function k. The generative process, assuming a mean-zero normal likelihood with stddev sigma, is
f ~ GP(m, k)
Y[i] | f(x[i]) ~ Normal(f(x[i]), sigma),   i = 1, ..., N
In finite terms (i.e., marginalizing out all but a finite number of function values f(X)), we can write
f(X) ~ MVN(loc=m(X), cov=K_xx)
Y[i] | f(x[i]) ~ Normal(f(x[i]), sigma),   i = 1, ..., N
Posterior inference is possible in analytical closed form but becomes intractable as data sizes get large. See Rasmussen (2006) for details.
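The generative process and the closed-form posterior can be sketched in base R as follows. This is a minimal sketch that assumes a zero mean function, scalar index points, and an exponentiated quadratic kernel; the variable names and the jitter value are chosen for the example only.

```r
set.seed(1)
eq_kernel <- function(a, b) exp(-(a - b)^2 / 2)   # illustrative kernel choice

x <- seq(-2, 2, length.out = 20)   # N = 20 observation index points
N <- length(x)
sigma <- 0.1                       # stddev of the normal likelihood

# Finite marginal of the prior: f(X) ~ MVN(loc = 0, cov = K_xx).
K_xx <- outer(x, x, eq_kernel)
jitter <- 1e-6                     # small diagonal term for numerical stability
L <- t(chol(K_xx + jitter * diag(N)))   # lower Cholesky factor
f_x <- as.vector(L %*% rnorm(N))

# Observations: Y[i] | f(x[i]) ~ Normal(f(x[i]), sigma).
y <- f_x + sigma * rnorm(N)

# Exact posterior mean of f(X) given Y. The N x N linear solve is the step
# whose cost grows cubically with the number of observations.
post_mean <- as.vector(K_xx %*% solve(K_xx + sigma^2 * diag(N), y))
```

The cubic cost of the solve is what motivates summarizing the data through a smaller set of inducing points.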
The VGP
The VGP is an inducing point-based approximation of an exact GP posterior, where two approximating assumptions have been made:
function values at non-inducing points are mutually independent conditioned on function values at the inducing points,
the (expensive) posterior over function values at inducing points conditional on observations is replaced with an arbitrary (learnable) full-rank Gaussian distribution,
q(f(Z)) = MVN(loc=m, scale=S),
where m and S are parameters to be chosen by optimizing an evidence lower bound (ELBO).
The posterior predictive distribution becomes
q(f(T)) = integral df(Z) p(f(T) | f(Z)) q(f(Z))
        = MVN(loc = A @ m, scale = B^(1/2))
where
A = K_tz @ K_zz^-1
B = K_tt - A @ (K_zz - S @ S^T) @ A^T
The approximate posterior predictive distribution q(f(T)) is what the VariationalGaussianProcess class represents.
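The predictive location A @ m and covariance B can be computed directly for a small example. The base R sketch below assumes a zero mean function, scalar index points, and an exponentiated quadratic kernel; vgp_predictive is a hypothetical helper written for this illustration, not a function of this package.

```r
eq_kernel <- function(a, b) exp(-(a - b)^2 / 2)   # illustrative kernel choice

z <- c(-1, 0, 1)         # M = 3 inducing index points
t_pts <- c(-0.5, 0.5)    # P = 2 predictive index points
jitter <- 1e-6

K_zz <- outer(z, z, eq_kernel) + jitter * diag(length(z))
K_tz <- outer(t_pts, z, eq_kernel)
K_tt <- outer(t_pts, t_pts, eq_kernel)

# Hypothetical helper: location and covariance of q(f(T)) given the
# variational location m and scale matrix S (covariance S S^T).
vgp_predictive <- function(m, S) {
  A <- K_tz %*% solve(K_zz)                          # P x M
  B <- K_tt - A %*% (K_zz - S %*% t(S)) %*% t(A)     # P x P covariance
  list(loc = as.vector(A %*% m), covariance = B)
}

# If q(f(Z)) equals the prior, i.e. m = 0 and S S^T = K_zz, the predictive
# distribution reduces to the prior at T.
S_prior <- t(chol(K_zz))
pred_prior <- vgp_predictive(m = rep(0, 3), S = S_prior)

# With S = 0 the inducing values are treated as known, so B becomes the
# conditional covariance K_tt - K_tz K_zz^-1 K_zt.
pred_fixed <- vgp_predictive(m = c(1, 2, 3), S = matrix(0, 3, 3))
```

These two limiting cases are a convenient sanity check when experimenting with the formulas by hand.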
Model selection in this framework entails choosing the kernel parameters, inducing point locations, and variational parameters. We do this by optimizing a variational lower bound on the marginal log likelihood of observed data. The lower bound takes the following form (see Titsias (2009) and Hensman (2013) for details on the derivation):
L(Z, m, S, Y) =
  MVN(loc=(K_xz @ K_zz^-1) @ m, scale_diag=sigma).log_prob(Y)
  - (Tr(K_xx - K_xz @ K_zz^-1 @ K_zx) + Tr(S @ S^T @ K_zz^-1 @ K_zx @ K_xz @ K_zz^-1)) / (2 * sigma^2)
  - KL(q(f(Z)) || p(f(Z)))
where in the final KL term, p(f(Z)) is the GP prior on inducing point function values. This variational lower bound can be computed on minibatches of the full data set (X, Y). A method to compute the negative variational lower bound is implemented as VariationalGaussianProcess$variational_loss.
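To make the three terms of the lower bound concrete, here is a base R sketch that evaluates it on a full (non-minibatched) data set. It assumes a zero mean function, scalar index points, an exponentiated quadratic kernel, and a nonsingular S. Here K_xz denotes the N x M matrix with entries k(x[i], z[j]) and K_zx its transpose. The function variational_lower_bound is a hypothetical illustration and is not how variational_loss is implemented.

```r
eq_kernel <- function(a, b) exp(-(a - b)^2 / 2)   # illustrative kernel choice

variational_lower_bound <- function(x, y, z, m, S, sigma, jitter = 1e-6) {
  M <- length(z)
  K_zz <- outer(z, z, eq_kernel) + jitter * diag(M)
  K_xz <- outer(x, z, eq_kernel)       # N x M
  K_zz_inv <- solve(K_zz)
  A <- K_xz %*% K_zz_inv               # N x M
  cov_q <- S %*% t(S)                  # covariance of q(f(Z))

  # MVN(loc = A m, scale_diag = sigma).log_prob(Y)
  log_lik <- sum(dnorm(y, mean = as.vector(A %*% m), sd = sigma, log = TRUE))

  # Tr(K_xx - K_xz K_zz^-1 K_zx); only the diagonal of K_xx is needed.
  trace_prior <- sum(eq_kernel(x, x)) - sum(A * K_xz)
  # Tr(S S^T K_zz^-1 K_zx K_xz K_zz^-1) = Tr(A S S^T A^T)
  trace_q <- sum((A %*% cov_q) * A)

  # KL(MVN(m, S S^T) || MVN(0, K_zz))
  log_det <- function(mat) as.numeric(determinant(mat, logarithm = TRUE)$modulus)
  kl <- 0.5 * (sum(K_zz_inv * cov_q) + sum(m * as.vector(K_zz_inv %*% m)) - M +
               log_det(K_zz) - log_det(cov_q))

  log_lik - (trace_prior + trace_q) / (2 * sigma^2) - kl
}

x <- c(-2, -1, 0, 1, 2)
y <- sin(x)
z <- c(-1.5, 0, 1.5)
sigma <- 0.5
elbo <- variational_lower_bound(x, y, z, m = rep(0, 3), S = 0.5 * diag(3), sigma)
```

For any choice of m and S the value should not exceed the exact log marginal likelihood log MVN(Y; 0, K_xx + sigma^2 I), which gives a simple check on an implementation.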
Optimal variational parameters
As described in Titsias (2009), a closed form optimum for the variational location and scale parameters, m and S, can be computed when the observational data are not prohibitively voluminous. The optimal_variational_posterior function computes the optimal variational posterior distribution over inducing point function values in terms of the GP parameters (mean and kernel functions), inducing point locations, observation index points, and observations. Note that the inducing index point locations must still be optimized even when these parameters are known functions of the inducing index points. The optimal parameters are computed as follows:
C = (K_zz + sigma^-2 K_zx @ K_xz)^-1
optimal Gaussian covariance: K_zz @ C @ K_zz
optimal Gaussian location: sigma^-2 K_zz @ C @ K_zx @ Y
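The closed-form optimum can be sketched in base R under the same simplifying assumptions as before (zero mean function, scalar index points, exponentiated quadratic kernel). The sketch places the sigma^-2 factor inside the inverse, C = (K_zz + sigma^-2 K_zx @ K_xz)^-1, which is the form given in Titsias (2009). The function optimal_variational_parameters is a hypothetical illustration, not the package's optimal_variational_posterior.

```r
eq_kernel <- function(a, b) exp(-(a - b)^2 / 2)   # illustrative kernel choice

optimal_variational_parameters <- function(x, y, z, sigma, jitter = 1e-6) {
  K_zz <- outer(z, z, eq_kernel) + jitter * diag(length(z))
  K_zx <- outer(z, x, eq_kernel)                    # M x N
  C <- solve(K_zz + K_zx %*% t(K_zx) / sigma^2)     # M x M
  list(
    loc = as.vector(K_zz %*% C %*% K_zx %*% y) / sigma^2,
    covariance = K_zz %*% C %*% K_zz
  )
}

x <- c(-2, -1, 0, 1, 2)
y <- sin(x)
sigma <- 0.5

# Placing the inducing points at the observation points should recover the
# exact GP posterior over f(X), up to the effect of the jitter term.
opt <- optimal_variational_parameters(x, y, z = x, sigma = sigma)
```

Choosing Z = X removes the sparse approximation, so comparing against the exact posterior mean K_xx (K_xx + sigma^2 I)^-1 Y is a useful correctness check.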
For usage examples see e.g. tfd_sample(), tfd_log_prob(), tfd_mean().
Other distributions:
tfd_autoregressive(), tfd_batch_reshape(), tfd_bates(), tfd_bernoulli(), tfd_beta_binomial(), tfd_beta(), tfd_binomial(), tfd_categorical(), tfd_cauchy(), tfd_chi2(), tfd_chi(), tfd_cholesky_lkj(), tfd_continuous_bernoulli(), tfd_deterministic(), tfd_dirichlet_multinomial(), tfd_dirichlet(), tfd_empirical(), tfd_exp_gamma(), tfd_exp_inverse_gamma(), tfd_exponential(), tfd_gamma_gamma(), tfd_gamma(), tfd_gaussian_process_regression_model(), tfd_gaussian_process(), tfd_generalized_normal(), tfd_geometric(), tfd_gumbel(), tfd_half_cauchy(), tfd_half_normal(), tfd_hidden_markov_model(), tfd_horseshoe(), tfd_independent(), tfd_inverse_gamma(), tfd_inverse_gaussian(), tfd_johnson_s_u(), tfd_joint_distribution_named_auto_batched(), tfd_joint_distribution_named(), tfd_joint_distribution_sequential_auto_batched(), tfd_joint_distribution_sequential(), tfd_kumaraswamy(), tfd_laplace(), tfd_linear_gaussian_state_space_model(), tfd_lkj(), tfd_log_logistic(), tfd_log_normal(), tfd_logistic(), tfd_mixture_same_family(), tfd_mixture(), tfd_multinomial(), tfd_multivariate_normal_diag_plus_low_rank(), tfd_multivariate_normal_diag(), tfd_multivariate_normal_full_covariance(), tfd_multivariate_normal_linear_operator(), tfd_multivariate_normal_tri_l(), tfd_multivariate_student_t_linear_operator(), tfd_negative_binomial(), tfd_normal(), tfd_one_hot_categorical(), tfd_pareto(), tfd_pixel_cnn(), tfd_poisson_log_normal_quadrature_compound(), tfd_poisson(), tfd_power_spherical(), tfd_probit_bernoulli(), tfd_quantized(), tfd_relaxed_bernoulli(), tfd_relaxed_one_hot_categorical(), tfd_sample_distribution(), tfd_sinh_arcsinh(), tfd_skellam(), tfd_spherical_uniform(), tfd_student_t_process(), tfd_student_t(), tfd_transformed_distribution(), tfd_triangular(), tfd_truncated_cauchy(), tfd_truncated_normal(), tfd_uniform(), tfd_vector_diffeomixture(), tfd_vector_exponential_diag(), tfd_vector_exponential_linear_operator(), tfd_vector_laplace_diag(), tfd_vector_laplace_linear_operator(), tfd_vector_sinh_arcsinh_diag(), tfd_von_mises_fisher(), tfd_von_mises(), tfd_weibull(), tfd_wishart_linear_operator(), tfd_wishart_tri_l(), tfd_wishart(), tfd_zipf()