Vehtari, A., Gelman, A., and Gabry, J. (2015). Efficient implementation of leave-one-out cross-validation and WAIC for evaluating fitted Bayesian models.
The package documentation is largely based on the contents of the paper.
Exact cross-validation requires re-fitting the model with different training sets. Approximate leave-one-out cross-validation (LOO) can be computed easily using importance sampling (Gelfand, Dey, and Chang, 1992; Gelfand, 1996), but the resulting estimate is noisy, as the variance of the importance weights can be large or even infinite (Peruggia, 1997; Epifani et al., 2008). Here we propose a novel approach that provides a more accurate and reliable estimate using importance weights that are smoothed using a Pareto distribution fit to the upper tail of the distribution of importance weights.
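To fix ideas, plain (unsmoothed) importance-sampling LOO can be sketched as below. This is a minimal NumPy illustration, not the package's implementation; the function name and array layout are assumptions. With draws from the full posterior, the importance ratio for point $i$ and draw $s$ is $r_i^s = 1/p(y_i|\theta^s)$, and the self-normalized estimate of $p(y_i|y_{-i})$ reduces to the harmonic mean of the likelihood values:

```python
import numpy as np

def loo_is(log_lik):
    """Plain importance-sampling LOO estimate (no smoothing).

    log_lik : (S, n) array of pointwise log-likelihoods log p(y_i | theta^s)
              over S posterior draws.

    The log importance ratio is -log_lik, and the self-normalized estimate
    sum_s r_i^s p(y_i|theta^s) / sum_s r_i^s simplifies to S / sum_s r_i^s,
    i.e. the harmonic mean of the likelihood draws for point i.
    """
    S = log_lik.shape[0]
    neg = -log_lik
    # logsumexp of the log ratios, computed stably
    m = neg.max(axis=0)
    log_sum_r = m + np.log(np.exp(neg - m).sum(axis=0))
    elpd_i = np.log(S) - log_sum_r          # pointwise log predictive density
    return elpd_i.sum(), elpd_i
```

As the surrounding text notes, this harmonic-mean-style estimator is exactly the one whose weights can have large or infinite variance, which is what motivates the smoothing described below.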
WAIC (the widely applicable or Watanabe-Akaike information criterion; Watanabe, 2010) can be viewed as an improvement on the deviance information criterion (DIC) for Bayesian models. DIC has gained popularity in recent years in part through its implementation in the graphical modeling package BUGS (Spiegelhalter, Best, et al., 2002; Spiegelhalter, Thomas, et al., 1994, 2003), but it is known to have some problems, arising in part from it not being fully Bayesian in that it is based on a point estimate (van der Linde, 2005; Plummer, 2008). For example, DIC can produce negative estimates of the effective number of parameters in a model and it is not defined for singular models. WAIC is fully Bayesian and closely approximates Bayesian cross-validation. Unlike DIC, WAIC is invariant to parametrization and also works for singular models. WAIC is asymptotically equal to LOO, and can thus be used as an approximation of LOO. With finite data, WAIC often gives estimates similar to LOO, but for influential observations WAIC underestimates the effect of leaving out one observation.
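The WAIC computation can be sketched directly from the posterior log-likelihood draws. This is an illustrative NumPy sketch of the standard formulas (the function name and array layout are assumptions, not the package's API): the pointwise log predictive density uses the posterior mean of the likelihood, and the effective number of parameters is the posterior variance of the log-likelihood:

```python
import numpy as np

def waic(log_lik):
    """WAIC from an (S, n) array of pointwise log-likelihood draws.

    lppd_i    = log( (1/S) sum_s p(y_i | theta^s) )
    p_waic_i  = Var_s[ log p(y_i | theta^s) ]   (effective number of parameters)
    elpd_waic = sum_i (lppd_i - p_waic_i)
    """
    # log of the posterior mean likelihood, computed stably (logsumexp form)
    m = log_lik.max(axis=0)
    lppd = m + np.log(np.exp(log_lik - m).mean(axis=0))
    # posterior variance of the log-likelihood, pointwise
    p_waic = log_lik.var(axis=0, ddof=1)
    elpd = lppd - p_waic
    return elpd.sum(), p_waic.sum()
```

Like LOO above, this requires only the $S \times n$ matrix of pointwise log-likelihood evaluations, which is what makes the computation feasible directly from posterior simulations.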
One advantage of AIC and DIC has been their computational simplicity. In this package we present fast and stable computations for LOO and WAIC that can be performed directly on posterior simulations, thus allowing these newer tools to enter routine statistical practice. We compute LOO using very good importance sampling (VGIS), a new procedure for regularizing importance weights (Vehtari and Gelman, 2015). As a byproduct of our calculations, we also obtain approximate standard errors for estimated predictive errors and for comparisons of predictive errors between two models.
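The standard errors mentioned here follow from treating the pointwise elpd contributions as $n$ approximately independent terms; for model comparison, the pointwise differences are paired before computing the variance. A minimal sketch under those assumptions (function names are illustrative):

```python
import numpy as np

def elpd_se(elpd_i):
    """Approximate standard error of a total elpd estimate: the pointwise
    contributions are treated as n independent terms, so the variance of
    their sum is n times the sample variance."""
    elpd_i = np.asarray(elpd_i)
    n = elpd_i.size
    return np.sqrt(n * np.var(elpd_i, ddof=1))

def elpd_diff(elpd_i_a, elpd_i_b):
    """elpd difference between two models and its standard error, using
    the paired pointwise differences (pairing reduces the variance when
    the two models' pointwise elpds are correlated)."""
    d = np.asarray(elpd_i_a) - np.asarray(elpd_i_b)
    return d.sum(), np.sqrt(d.size * np.var(d, ddof=1))
```

Pairing matters: computing separate standard errors for each model and combining them would overstate the uncertainty of the difference whenever the two models make similar predictions for most observations.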
The distribution of the importance weights used in LOO may have a long right tail. We use the empirical Bayes estimate of Zhang and Stephens (2009) to fit a generalized Pareto distribution (gPd) to the tail (the 20% largest importance ratios). By examining the shape parameter $k$ of the fitted gPd, we are able to obtain a sample-based estimate of the existence of the moments (Koopman et al., 2009). This extends the diagnostic approach of Peruggia (1997) and Epifani et al. (2008) to be used routinely with (importance sampling) LOO for any model with a factorizing likelihood. Epifani et al. (2008) show that when estimating the leave-one-out predictive density, the central limit theorem holds if the variance of the weight distribution is finite. These results can be extended using the generalized central limit theorem for stable distributions. Thus, even if the variance of the importance weight distribution is infinite, if the mean exists the estimate's accuracy improves when additional draws are obtained. When the tail of the weight distribution is long, a direct use of importance sampling is sensitive to the one (or several) largest value(s). By fitting a gPd to the upper tail of the importance weights we smooth these values. The procedure goes as follows:

1. Fit the generalized Pareto distribution to the 20% largest importance ratios $r_{i}^{s}$. The computation is done separately for each held-out data point $i$.

2. Stabilize the importance weights by replacing the $M$ largest ratios by the expected values of the order statistics of the fitted gPd, $F^{-1}\left((z - 1/2)/M\right)$, $z = 1,...,M$, where $F^{-1}$ is the inverse cumulative distribution function of the fitted gPd.

3. To guarantee finite variance of the estimate, truncate each vector of weights at $S^{3/4}\bar{w}$, where $\bar{w}$ is the average of the smoothed weights.
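The procedure can be sketched for a single data point as below. This is an illustrative sketch only: it uses SciPy's maximum-likelihood gPd fit as a stand-in for the Zhang-Stephens empirical Bayes fit used by the package, and the function name and tail-fraction parameter are assumptions:

```python
import numpy as np
from scipy import stats

def smooth_weights(log_ratios, tail_frac=0.2):
    """Pareto-smooth one vector of S log importance ratios (a sketch).

    Fits a generalized Pareto distribution to the largest `tail_frac` of
    the ratios (here by ML via scipy, as a stand-in for the Zhang-Stephens
    empirical Bayes estimate), replaces the tail ratios with expected order
    statistics of the fit, and truncates. Returns the smoothed weights and
    the estimated shape parameter k.
    """
    # subtract the max in log space; self-normalized IS is scale-free
    r = np.exp(log_ratios - np.max(log_ratios))
    S = r.size
    M = int(np.ceil(tail_frac * S))
    order = np.argsort(r)
    tail_idx = order[-M:]                  # indices of the M largest ratios
    mu = r[order[-M - 1]]                  # threshold: largest non-tail ratio
    exceedances = r[tail_idx] - mu
    k, _, sigma = stats.genpareto.fit(exceedances, floc=0.0)
    # replace the tail by expected order statistics F^{-1}((z - 1/2) / M)
    z = (np.arange(1, M + 1) - 0.5) / M
    smoothed = r.copy()
    smoothed[tail_idx] = mu + stats.genpareto.ppf(z, k, loc=0.0, scale=sigma)
    # truncate at S^{3/4} times the mean weight to guarantee finite variance
    smoothed = np.minimum(smoothed, S ** 0.75 * smoothed.mean())
    return smoothed, k
```

Because `tail_idx` lists the tail draws in increasing order of their raw ratios, assigning the increasing order-statistic quantiles preserves the ranking of the weights while pulling in the extreme values.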
The above steps must be performed for each data point $i$, thus resulting in a vector of weights $w_{i}^{s}, s = 1,...,S$, for each $i$, which in general should be better behaved than the raw importance ratios $r_{i}^{s}$ from which they were constructed.
The results are then combined to compute the desired LOO estimates. The reliability of the estimates can be assessed using the estimates for the shape parameter $k$ of the gPd.
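The combining step can be sketched as follows. This is a minimal NumPy illustration of the weighted estimate (the function name is an assumption, and the direct exponentiation of the log-likelihoods stands in for the numerically stabilized computation a real implementation would use):

```python
import numpy as np

def loo_from_weights(log_lik, weights):
    """Combine smoothed importance weights into the LOO estimate.

    log_lik : (S, n) pointwise log-likelihood draws log p(y_i | theta^s)
    weights : (S, n) smoothed importance weights w_i^s

    Pointwise:  elpd_loo_i = log( sum_s w_i^s p(y_i|theta^s) / sum_s w_i^s )
    """
    num = (weights * np.exp(log_lik)).sum(axis=0)
    den = weights.sum(axis=0)
    elpd_i = np.log(num / den)
    return elpd_i.sum(), elpd_i
```

The vector of per-point shape estimates $k$ returned alongside the smoothed weights is then the diagnostic described next.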
If the estimated tail shape parameter $k \ge 1/2$, the user should be warned. Even though the VGIS estimate itself has a finite variance, the user should consider sampling directly from $p(\theta^s | y_{-i})$ for the problematic $i$, using $k$-fold cross-validation, or using a more robust model.
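A warning check of this kind can be sketched as below; the function name, message wording, and default threshold are illustrative, not the package's interface:

```python
import numpy as np

def check_khat(khats, threshold=0.5):
    """Flag observations whose estimated Pareto shape k meets or exceeds
    the threshold, and return their indices."""
    khats = np.asarray(khats)
    bad = np.flatnonzero(khats >= threshold)
    for i in bad:
        print(f"Warning: k = {khats[i]:.2f} >= {threshold} for observation {i}; "
              "consider refitting without y_i, k-fold CV, or a more robust model.")
    return bad
```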
Importance sampling is likely to work less well when the full posterior $p(\theta^s | y)$ and the LOO posterior $p(\theta^s | y_{-i})$ differ substantially, which is more likely to happen with a non-robust model and highly influential observations. A robust model may reduce the sensitivity to highly influential observations.
Gelfand, A. E. (1996). Model determination using sampling-based methods. In Markov Chain Monte Carlo in Practice, ed. W. R. Gilks, S. Richardson, D. J. Spiegelhalter, 145-162. London: Chapman and Hall.
Gelfand, A. E., Dey, D. K., and Chang, H. (1992). Model determination using predictive distributions with implementation via sampling-based methods. In Bayesian Statistics 4, ed. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, 147-167. Oxford University Press.
Gelman, A., Hwang, J., and Vehtari, A. (2013). Understanding predictive information criteria for Bayesian models. Statistics and Computing.
Ionides, E. L. (2008). Truncated importance sampling. Journal of Computational and Graphical Statistics 17, 295-311.
Koopman, S. J., Shephard, N., and Creal, D. (2009). Testing the assumptions behind importance sampling. Journal of Econometrics 149, 2-11.
Peruggia, M. (1997). On the variability of case-deletion importance sampling weights in the Bayesian linear model. Journal of the American Statistical Association 92, 199-207.
Stan Development Team (2014a). Stan: A C++ library for probability and sampling, version 2.6.
Stan Development Team (2014b). RStan, version 2.6.
Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research 11, 3571-3594.
Zhang, J., and Stephens, M. A. (2009). A new and efficient estimation method for the generalized Pareto distribution. Technometrics 51, 316-325.