The VariationalBayes function is a numerical approximation method that deterministically estimates the marginal posterior distributions (the target distributions) of a Bayesian model with approximating distributions, by minimizing the Kullback-Leibler Divergence (KLD) between each target distribution and its approximation.
VariationalBayes(Model, parm, Data, Covar=NULL, Interval=1.0E-6,
Iterations=1000, Method="Salimans2", Samples=1000, sir=TRUE,
Stop.Tolerance=1.0E-5, CPUs=1, Type="PSOCK")
Model: This required argument receives the model from a user-defined function. The user-defined function is where the model is specified. VariationalBayes passes two arguments to the model function, parm and Data. For more information, see the LaplacesDemon function and the "LaplacesDemon Tutorial" vignette.
parm: This argument requires a vector of initial values equal in length to the number of parameters. VariationalBayes will attempt to optimize these initial values for the parameters, where the optimized values are the posterior means, for later use with the IterativeQuadrature, LaplacesDemon, or PMC function. The GIV function may be used to randomly generate initial values. Parameters must be continuous.
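For example, initial values may be generated randomly with GIV, assuming Model and MyData have been defined (this mirrors a commented line in the example at the end of this page):

Initial.Values <- GIV(Model, MyData, PGF=TRUE)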
Data: This required argument accepts a list of data. The list of data must include mon.names, which contains monitored variable names, and parm.names, which contains parameter names. VariationalBayes must be able to determine the sample size of the data, and will look for a scalar sample size variable n or N. If not found, it will look for variable y or Y, and attempt to take its number of rows as the sample size. VariationalBayes needs to determine the sample size due to the asymptotic nature of this method, and the sample size should be sufficiently large.
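For illustration, a minimal sketch of a Data list with the required components follows; the data and parameter names are hypothetical, and the scalar N supplies the sample size explicitly:

N <- 10
y <- rnorm(N)
X <- cbind(1, rnorm(N))
MyData <- list(N=N, X=X, y=y, mon.names="LP",
     parm.names=c("beta[1]", "beta[2]", "sigma"))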
Covar: This argument defaults to NULL, but may otherwise accept a K x K covariance matrix of the parameters, where K is the number of parameters. When the model is updated for the first time and the prior covariance is unknown, Covar=NULL should be used. Once VariationalBayes has finished updating, it may be desired to continue updating where it left off, in which case the covariance matrix from the last run can be input into the next run.
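A hedged sketch of continuing from a previous run follows, written in the commented style of the example at the end of this page; it assumes a converged object Fit of class vb, with Fit$Summary1[,1] (the posterior means) supplying the new initial values:

#Fit2 <- VariationalBayes(Model, Fit$Summary1[,1], Data=MyData,
#     Covar=Fit$Covar, Iterations=1000, Method="Salimans2")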
Interval: This argument receives the interval used for estimating approximate gradients. The logarithm of the unnormalized joint posterior density of the Bayesian model is evaluated at the current parameter value, and again at the current parameter value plus this interval.
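As an illustration only (the helper approx.grad is hypothetical, not part of the package), a forward-difference gradient of the log posterior could be approximated as follows, relying on the fact that the user's Model function returns a list with component LP:

approx.grad <- function(Model, parm, Data, Interval=1e-6) {
     f0 <- Model(parm, Data)$LP
     sapply(seq_along(parm), function(j) {
          p <- parm
          p[j] <- p[j] + Interval
          (Model(p, Data)$LP - f0) / Interval
     })
}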
Iterations: This argument accepts an integer that determines the number of iterations that VariationalBayes will attempt to maximize the logarithm of the unnormalized joint posterior density. Iterations defaults to 1000. VariationalBayes will stop before this number of iterations if the tolerance is less than or equal to the Stop.Tolerance criterion. The required amount of computer memory increases with Iterations. If computer memory is exceeded, then all will be lost.
Method: This optional argument currently accepts only Method="Salimans2", which is the second algorithm in Salimans and Knowles (2013).
Samples: This argument indicates the number of posterior samples to be taken with sampling importance resampling via the SIR function, which occurs only when sir=TRUE. Note that the number of samples should increase with the number and intercorrelations of the parameters.
sir: This logical argument indicates whether or not Sampling Importance Resampling (SIR) is conducted via the SIR function to draw independent posterior samples. This argument defaults to TRUE. Even when TRUE, posterior samples are drawn only when VariationalBayes has converged. Posterior samples are required for many other functions, including plot.vb and predict.vb. The only time that sir=FALSE is recommended is when VariationalBayes is used to help obtain initial values for IterativeQuadrature, LaplacesDemon, or PMC, and it is unnecessary to spend time on sampling. Less time can be spent on sampling by increasing CPUs, which parallelizes the sampling.
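For example (a hedged sketch in the commented style of the example at the end of this page, assuming Model, Initial.Values, and MyData are defined), SIR can be skipped when the fit is used only to obtain initial values:

#Fit <- VariationalBayes(Model, Initial.Values, Data=MyData, sir=FALSE)
#Initial.Values <- Fit$Summary1[,1]   #posterior means, used as initial values elsewhere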
Stop.Tolerance: This argument accepts any positive number and defaults to 1.0E-5. Tolerance is calculated each iteration, and the criterion varies by algorithm. The algorithm is considered to have converged to the user-specified Stop.Tolerance when the tolerance is less than or equal to the value of Stop.Tolerance, and the algorithm terminates at the end of the current iteration. Often, multiple criteria are used, in which case the maximum of all criteria becomes the tolerance. For example, a common criterion is the Euclidean norm of the partial derivatives, and another common criterion is the Euclidean norm of the differences between the current and previous parameter values. Several algorithms have other, specific tolerances.
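A minimal sketch of such a convergence check follows; the gradient and parameter vectors are hypothetical stand-ins:

grad <- c(1e-6, -2e-6)          #partial derivatives (hypothetical)
parm <- c(0.501, 1.200)         #current parameter values (hypothetical)
parm.old <- c(0.500, 1.201)     #previous parameter values (hypothetical)
tolerance <- max(sqrt(sum(grad^2)), sqrt(sum((parm - parm.old)^2)))
converged <- tolerance <= 1e-5  #compare against Stop.Tolerance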
CPUs: This argument accepts an integer that specifies the number of central processing units (CPUs) of the multicore computer or computer cluster. This argument defaults to CPUs=1, in which case parallel processing does not occur. Parallelization occurs only for sampling with SIR when sir=TRUE.
Type: This argument specifies the type of parallel processing to perform, accepting either Type="PSOCK" or Type="MPI".
VariationalBayes
returns an object of class vb
that is a list with the following components:
Call: This is the matched call of VariationalBayes.
Converged: This is a logical indicator of whether or not VariationalBayes converged within the specified Iterations according to the supplied Stop.Tolerance criterion. Convergence does not indicate that the global maximum has been found, but only that the tolerance was less than or equal to the Stop.Tolerance criterion.
Covar: This is the estimated covariance matrix. The Covar matrix may be scaled and input into the Covar argument of the LaplacesDemon or PMC function for further estimation, or the diagonal of this matrix may be used to represent the posterior variance of the parameters, provided the algorithm converged and matrix inversion was successful. To scale this matrix for use with Laplace's Demon or PMC, multiply it by 2.38^2/d, where d is the number of parameters.
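A minimal sketch of this scaling, using a hypothetical 2 x 2 matrix in place of the estimated covariance matrix:

Covar <- matrix(c(0.04, 0.01, 0.01, 0.09), 2, 2)  #stand-in for the estimated covariance matrix
d <- nrow(Covar)                                  #number of parameters
Covar.scaled <- Covar * 2.38^2 / d                #suitable for the Covar argument of LaplacesDemon or PMC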
Deviance: This is a vector of the iterative history of the deviance in the VariationalBayes function, as it sought convergence.
History: This is an array of the iterative history of the parameters in the VariationalBayes function, as it sought convergence. The first matrix is for means and the second matrix is for variances.
Initial.Values: This is the vector of initial values that was originally given to VariationalBayes in the parm argument.
LML: This is an approximation of the logarithm of the marginal likelihood of the data (see the LML function for more information). When the model has converged and sir=TRUE, the NSIS method is used. When the model has converged and sir=FALSE, the LME method is used. This is the logarithmic form of equation 4 in Lewis and Raftery (1997). As a rough guideline from Kass and Raftery (1995), the LME-based LML is worrisome when the sample size of the data is less than five times the number of parameters, and LML should be adequate in most problems when the sample size of the data exceeds twenty times the number of parameters (p. 778). The LME is inappropriate with hierarchical models. However LML is estimated, it is useful for comparing multiple models with the BayesFactor function.
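For example, because a Bayes factor is a ratio of marginal likelihoods, the log Bayes factor of one model over another is the difference of their LML values; the numbers below are hypothetical stand-ins for the LML components of two converged fits:

LML1 <- -85.2             #hypothetical LML of model 1
LML2 <- -90.7             #hypothetical LML of model 2
log.BF <- LML1 - LML2     #log Bayes factor; > 0 favors model 1
BF <- exp(log.BF)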
LP.Final: This reports the final scalar value for the logarithm of the unnormalized joint posterior density.
LP.Initial: This reports the initial scalar value for the logarithm of the unnormalized joint posterior density.
Minutes: This is the number of minutes that VariationalBayes was running, and this includes the initial checks as well as drawing posterior samples and creating summaries.
Monitor: When sir=TRUE, a number of independent posterior samples equal to Samples is taken, and the draws are stored here as a matrix. The rows of the matrix are the samples, and the columns are the monitored variables.
Posterior: When sir=TRUE, a number of independent posterior samples equal to Samples is taken, and the draws are stored here as a matrix. The rows of the matrix are the samples, and the columns are the parameters.
Step.Size.Final: This is the final, scalar Step.Size value at the end of the VariationalBayes algorithm.
Step.Size.Initial: This is the initial, scalar Step.Size.
Summary1: This is a summary matrix that summarizes the point-estimated posterior means and variances. Uncertainty around the posterior means is estimated from the estimated covariance matrix. Rows are parameters. The following columns are included: Mean, SD (Standard Deviation), LB (Lower Bound), and UB (Upper Bound). The bounds constitute a 95% probability interval.
Summary2: This is a summary matrix that summarizes the posterior samples drawn with sampling importance resampling (SIR) when sir=TRUE, given the point-estimated posterior means and covariance matrix. Rows are parameters. The following columns are included: Mean, SD (Standard Deviation), LB (Lower Bound), and UB (Upper Bound). The bounds constitute a 95% probability interval.
Tolerance.Final: This is the last Tolerance of the VariationalBayes algorithm.
Tolerance.Stop: This is the Stop.Tolerance criterion.
Variational Bayes (VB) is a family of numerical approximation algorithms that is a subset of variational inference algorithms, or variational methods. Some examples of variational methods include the mean-field approximation, loopy belief propagation, tree-reweighted belief propagation, and expectation propagation (EP).
Variational inference for probabilistic models was introduced in the field of machine learning, influenced by statistical physics literature (Saul et al., 1996; Saul and Jordan, 1996; Jaakkola, 1997). The mean-field methods in Neal and Hinton (1999) led to variational algorithms.
Variational inference algorithms were later generalized for conjugate exponential-family models (Attias, 1999, 2000; Wiegerinck, 2000; Ghahramani and Beal, 2001; Xing et al., 2003). These algorithms still require different designs for different model forms. Salimans and Knowles (2013) introduced general-purpose VB algorithms for Gaussian posteriors.
A VB algorithm deterministically estimates the marginal posterior distributions (target distributions) in a Bayesian model with approximating distributions by minimizing the Kullback-Leibler Divergence (KLD) between the target and its approximation. The complicated posterior distribution is approximated with a simpler distribution. The simpler, approximating distribution is called the variational approximation, or approximation distribution, of the posterior. The term variational is derived from the calculus of variations, and refers to optimization algorithms that select the best function (which is a distribution in VB), rather than merely selecting the best parameters.
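As a small illustration of the quantity being minimized (illustrative only, not the objective computed internally by VariationalBayes), the KLD between two univariate Gaussian distributions has a closed form:

kld.gaussian <- function(mu1, sd1, mu2, sd2) {
     #KLD( N(mu1, sd1^2) || N(mu2, sd2^2) ), in nats
     log(sd2/sd1) + (sd1^2 + (mu1 - mu2)^2) / (2 * sd2^2) - 0.5
}
kld.gaussian(mu1=0, sd1=1, mu2=2, sd2=0.5)   #divergence of one Gaussian from another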
VB algorithms often use Gaussian distributions as approximating distributions. In this case, both the mean and variance of the parameters are estimated.
Usually, a VB algorithm converges more slowly than a Laplace Approximation algorithm, and more quickly than a Monte Carlo algorithm such as Markov chain Monte Carlo (MCMC). VB often provides solutions with accuracy comparable to MCMC in less time. Though Monte Carlo algorithms provide a numerical approximation to the exact posterior using a set of samples, VB provides a locally-optimal, exact analytical solution to an approximation of the posterior. VB is often more applicable than MCMC to big data or large-dimensional models.
Since VB is deterministic, it is asymptotic and subject to the same limitations with respect to sample size as Laplace Approximation. However, VB estimates more parameters than Laplace Approximation: for example, Laplace Approximation optimizes the posterior mode of a Gaussian distribution, while VB optimizes both the Gaussian mean and variance.
Traditionally, VB algorithms required customized equations. The
VariationalBayes
function uses general-purpose algorithms. A
general-purpose VB algorithm is less efficient than an algorithm
custom designed for the model form. However, a general-purpose
algorithm is applied consistently and easily to numerous model forms.
When Method="Salimans2", the second algorithm of Salimans and Knowles (2013) is used. This algorithm requires the gradient and Hessian, and is more efficient with a small number of parameters, provided the posterior is twice differentiable. The step size is constant. This algorithm is suitable for marginal posterior distributions that are Gaussian and unimodal. A stochastic approximation algorithm is used in the context of fixed-form VB, inspired by considering fixed-form VB to be equivalent to performing a linear regression with the sufficient statistics of the approximation as independent variables and the unnormalized logarithm of the joint posterior density as the dependent variable. The number of requested iterations should be large, because the step size decreases as more iterations are requested, and a smaller step size has better convergence properties; one may still hope for convergence before the requested iterations are exhausted. However, convergence is checked only in the last half of the iterations, after the algorithm begins to average the mean and variance from the samples of the stochastic approximation. The history of stochastic samples is returned.
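To illustrate the regression interpretation (a toy sketch only, not the Salimans2 implementation inside VariationalBayes; all names are hypothetical), the code below fits a univariate Gaussian approximation by repeatedly regressing the log posterior on the sufficient statistics theta and theta^2 and converting the regression coefficients into the natural parameters of the Gaussian:

log.post <- function(theta) dnorm(theta, mean=2, sd=0.5, log=TRUE)  #toy log posterior
mu <- 0; sigma2 <- 1                       #initial Gaussian approximation
for (i in 1:100) {
     theta <- rnorm(500, mu, sqrt(sigma2)) #draw from the current approximation
     fit <- lm(log.post(theta) ~ theta + I(theta^2))  #regress on sufficient statistics
     eta1 <- coef(fit)[["theta"]]          #natural parameters of the Gaussian
     eta2 <- coef(fit)[["I(theta^2)"]]
     sigma2 <- 0.9*sigma2 + 0.1*(-1 / (2*eta2))       #damped updates for stability
     mu <- 0.9*mu + 0.1*(-eta1 / (2*eta2))
}
c(mu=mu, sigma2=sigma2)   #approaches mean 2 and variance 0.25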
Attias, H. (1999). "Inferring Parameters and Structure of Latent Variable Models by Variational Bayes". In Uncertainty in Artificial Intelligence.
Attias, H. (2000). "A Variational Bayesian Framework for Graphical Models". In Neural Information Processing Systems.
Ghahramani, Z. and Beal, M. (2001). "Propagation Algorithms for Variational Bayesian Learning". In Neural Information Processing Systems, p. 507--513.
Jaakkola, T. (1997). "Variational Methods for Inference and Estimation in Graphical Models". PhD thesis, Massachusetts Institute of Technology.
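Kass, R.E. and Raftery, A.E. (1995). "Bayes Factors". Journal of the American Statistical Association, 90(430), p. 773--795.
Lewis, S.M. and Raftery, A.E. (1997). "Estimating Bayes Factors via Posterior Simulation with the Laplace-Metropolis Estimator". Journal of the American Statistical Association, 92, p. 648--655.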
Salimans, T. and Knowles, D.A. (2013). "Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression". Bayesian Analysis, 8(4), p. 837--882.
Neal, R. and Hinton, G. (1999). "A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants". In Learning in Graphical Models, p. 355--368. MIT Press, 1999.
Saul, L. and Jordan, M. (1996). "Exploiting Tractable Substructures in Intractable Networks". Neural Information Processing Systems.
Saul, L., Jaakkola, T., and Jordan, M. (1996). "Mean Field Theory for Sigmoid Belief Networks". Journal of Artificial Intelligence Research, 4, p. 61--76.
Wiegerinck, W. (2000). "Variational Approximations Between Mean Field Theory and the Junction Tree Algorithm". In Uncertainty in Artificial Intelligence.
Xing, E., Jordan, M., and Russell, S. (2003). "A Generalized Mean Field Algorithm for Variational Inference in Exponential Families". In Uncertainty in Artificial Intelligence.
BayesFactor, IterativeQuadrature, LaplaceApproximation, LaplacesDemon, GIV, LML, PMC, and SIR.
# The accompanying Examples vignette is a compendium of examples.
#################### Load the LaplacesDemon Library #####################
library(LaplacesDemon)
############################## Demon Data ###############################
data(demonsnacks)
y <- log(demonsnacks$Calories)
X <- cbind(1, as.matrix(log(demonsnacks[,10]+1)))
J <- ncol(X)
for (j in 2:J) X[,j] <- CenterScale(X[,j])
######################### Data List Preparation #########################
mon.names <- "mu[1]"
parm.names <- as.parm.names(list(beta=rep(0,J), sigma=0))
pos.beta <- grep("beta", parm.names)
pos.sigma <- grep("sigma", parm.names)
PGF <- function(Data) {
     beta <- rnorm(Data$J)
     sigma <- runif(1)
     return(c(beta, sigma))
}
MyData <- list(J=J, PGF=PGF, X=X, mon.names=mon.names,
     parm.names=parm.names, pos.beta=pos.beta, pos.sigma=pos.sigma, y=y)
########################## Model Specification ##########################
Model <- function(parm, Data)
{
     ### Parameters
     beta <- parm[Data$pos.beta]
     sigma <- interval(parm[Data$pos.sigma], 1e-100, Inf)
     parm[Data$pos.sigma] <- sigma
     ### Log-Prior
     beta.prior <- sum(dnormv(beta, 0, 1000, log=TRUE))
     sigma.prior <- dhalfcauchy(sigma, 25, log=TRUE)
     ### Log-Likelihood
     mu <- tcrossprod(Data$X, t(beta))
     LL <- sum(dnorm(Data$y, mu, sigma, log=TRUE))
     ### Log-Posterior
     LP <- LL + beta.prior + sigma.prior
     Modelout <- list(LP=LP, Dev=-2*LL, Monitor=mu[1],
          yhat=rnorm(length(mu), mu, sigma), parm=parm)
     return(Modelout)
}
############################ Initial Values #############################
#Initial.Values <- GIV(Model, MyData, PGF=TRUE)
Initial.Values <- rep(0,J+1)
#Fit <- VariationalBayes(Model, Initial.Values, Data=MyData, Covar=NULL,
# Iterations=1000, Method="Salimans2", Stop.Tolerance=1e-3, CPUs=1)
#Fit
#print(Fit)
#PosteriorChecks(Fit)
#caterpillar.plot(Fit, Parms="beta")
#plot(Fit, MyData, PDF=FALSE)
#Pred <- predict(Fit, Model, MyData, CPUs=1)
#summary(Pred, Discrep="Chi-Square")
#plot(Pred, Style="Covariates", Data=MyData)
#plot(Pred, Style="Density", Rows=1:9)
#plot(Pred, Style="Fitted")
#plot(Pred, Style="Jarque-Bera")
#plot(Pred, Style="Predictive Quantiles")
#plot(Pred, Style="Residual Density")
#plot(Pred, Style="Residuals")
#Levene.Test(Pred)
#Importance(Fit, Model, MyData, Discrep="Chi-Square")
#Fit$Covar is scaled (2.38^2/d) and submitted to LaplacesDemon as Covar.
#Fit$Summary1[,1] is submitted to LaplacesDemon as Initial.Values.
#End