Vehtari, A., Gelman, A., and Gabry, J. (2016). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. arXiv preprint.
The package documentation is largely based on excerpts from the paper.
Exact cross-validation requires re-fitting the model with different training sets. Approximate leave-one-out cross-validation (LOO) can be computed easily using importance sampling (Gelfand, Dey, and Chang, 1992; Gelfand, 1996), but the resulting estimate is noisy, as the variance of the importance weights can be large or even infinite (Peruggia, 1997; Epifani et al., 2008). Here we propose a novel approach that provides a more accurate and reliable estimate using importance weights that are smoothed by fitting a generalized Pareto distribution (gPd) to the upper tail of the distribution of the importance weights.
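As a minimal illustration of plain (unsmoothed) importance-sampling LOO, the following Python sketch estimates $\log p(y_i \mid y_{-i})$ from per-draw log-likelihoods using the raw ratios $r_i^s = 1/p(y_i \mid \theta^s)$, which reduces to a harmonic mean. This is purely illustrative: the package itself is written in R, and the function name here is hypothetical.

```python
import math

def isloo_pointwise(loglik_i):
    # loglik_i: log p(y_i | theta^s) for S posterior draws s = 1..S.
    # Raw importance ratio: r_i^s = 1 / p(y_i | theta^s), so the
    # importance-sampling estimate of p(y_i | y_-i) is
    #   sum_s r_i^s p(y_i | theta^s) / sum_s r_i^s = S / sum_s exp(-loglik_i^s),
    # a harmonic mean of the per-draw likelihoods.
    S = len(loglik_i)
    # log-sum-exp of the negated log-likelihoods, for numerical stability
    m = max(-l for l in loglik_i)
    lse = m + math.log(sum(math.exp(-l - m) for l in loglik_i))
    return math.log(S) - lse  # log p(y_i | y_-i)
```

Because this estimate is dominated by the smallest likelihood values, a few extreme ratios can make it very noisy, which is exactly the problem the smoothing below addresses.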
WAIC (the widely applicable or Watanabe-Akaike information criterion; Watanabe, 2010) can be viewed as an improvement on the deviance information criterion (DIC) for Bayesian models. DIC has gained popularity in recent years in part through its implementation in the graphical modeling package BUGS (Spiegelhalter, Best, et al., 2002; Spiegelhalter, Thomas, et al., 1994, 2003), but it is known to have some problems, which arise in part from its not being fully Bayesian: it is based on a point estimate (van der Linde, 2005; Plummer, 2008). For example, DIC can produce negative estimates of the effective number of parameters in a model, and it is not defined for singular models. WAIC is fully Bayesian and closely approximates Bayesian cross-validation. Unlike DIC, WAIC is invariant to parametrization and also works for singular models. WAIC is asymptotically equal to LOO and can thus be used as an approximation to LOO. With finite data, WAIC and LOO often give very similar estimates, but for influential observations WAIC underestimates the effect of leaving out one observation.
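For concreteness, the standard WAIC computation from a matrix of pointwise log-likelihood draws can be sketched as follows (illustrative Python only; the package does this in R, and the function name is hypothetical). It uses the usual form elpd_waic = lppd - p_waic, with the effective number of parameters estimated from the per-point variance of the log-likelihood over draws.

```python
import math

def elpd_waic(loglik):
    # loglik[i][s] = log p(y_i | theta^s) over S posterior draws.
    # lppd_i  = log( mean_s p(y_i | theta^s) )
    # p_waic_i = Var_s( log p(y_i | theta^s) )
    # elpd_waic = sum_i (lppd_i - p_waic_i)
    elpd = 0.0
    for ll_i in loglik:
        S = len(ll_i)
        m = max(ll_i)
        lppd_i = m + math.log(sum(math.exp(l - m) for l in ll_i) / S)
        mean_l = sum(ll_i) / S
        p_i = sum((l - mean_l) ** 2 for l in ll_i) / (S - 1)
        elpd += lppd_i - p_i
    return elpd
```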
One advantage of AIC and DIC has been their computational simplicity. In this package we present fast and stable computations for LOO and WAIC that can be performed directly on posterior simulations, thus allowing these newer tools to enter routine statistical practice. As a byproduct of our calculations, we also obtain approximate standard errors for estimated predictive errors and for the comparison of predictive errors between two models.
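The standard-error calculation mentioned above can be sketched as follows: the SE of a summed pointwise quantity is $\sqrt{n}$ times the sample standard deviation of the pointwise values, and a comparison of two models uses the paired pointwise differences. This is an illustrative Python sketch with hypothetical names; the package's own R functions differ in detail.

```python
import math

def se_elpd(pointwise):
    # SE of the sum of pointwise values: sqrt(n * sample variance)
    n = len(pointwise)
    m = sum(pointwise) / n
    v = sum((x - m) ** 2 for x in pointwise) / (n - 1)
    return math.sqrt(n * v)

def compare(pointwise_a, pointwise_b):
    # Difference in elpd between two models and its SE, computed from
    # the paired pointwise differences (pairing reduces the variance).
    diff = [a - b for a, b in zip(pointwise_a, pointwise_b)]
    return sum(diff), se_elpd(diff)
```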
Epifani et al. (2008) show that when estimating the leave-one-out predictive density, the central limit theorem holds if the variance of the weight distribution is finite. These results can be extended using the generalized central limit theorem for stable distributions: even if the variance of the importance weight distribution is infinite, as long as the mean exists the accuracy of the estimate improves as additional draws are obtained. When the tail of the weight distribution is long, direct importance sampling is sensitive to one or a few of the largest ratios. By fitting a generalized Pareto distribution to the upper tail of the importance weights we smooth these values; the detailed procedure is given in Vehtari and Gelman (2015).
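As a rough illustration of the tail-smoothing idea only: the sketch below fits a gPd by the method of moments (not the Zhang and Stephens, 2009, estimator used in the actual procedure), replaces the largest raw ratios by expected order statistics of the fitted distribution, and truncates at the raw maximum. The tail fraction is an arbitrary choice here.

```python
import math

def psis_smooth(ratios, tail_frac=0.2):
    # Smooth the upper tail of the raw importance ratios (sketch only).
    S = len(ratios)
    M = max(5, int(tail_frac * S))           # tail size (heuristic)
    order = sorted(range(S), key=lambda s: ratios[s])
    tail_idx = order[S - M:]                 # indices of the M largest ratios
    u = ratios[order[S - M - 1]]             # threshold: largest non-tail ratio
    exceed = [ratios[s] - u for s in tail_idx]
    # Method-of-moments fit of the gPd to the exceedances over u
    m = sum(exceed) / M
    v = sum((x - m) ** 2 for x in exceed) / (M - 1)
    k = 0.5 * (1.0 - m * m / v)              # shape estimate
    sigma = 0.5 * m * (m * m / v + 1.0)      # scale estimate
    smoothed = list(ratios)
    rmax = max(ratios)
    for z, s in enumerate(sorted(tail_idx, key=lambda s: ratios[s]), start=1):
        p = (z - 0.5) / M                    # plotting position in (0, 1)
        if abs(k) < 1e-12:
            q = u - sigma * math.log(1.0 - p)          # exponential limit
        else:
            q = u + sigma / k * ((1.0 - p) ** (-k) - 1.0)  # gPd quantile
        smoothed[s] = min(q, rmax)           # truncate at the raw maximum
    return smoothed, k
```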
The smoothing must be performed separately for each data point $i$. The result is a vector of weights $w_{i}^{s}$, $s = 1,\ldots,S$, for each $i$, which in general should be better behaved than the raw importance ratios $r_{i}^{s}$ from which they were constructed.
The results are then combined to compute the desired LOO estimates. The reliability of the estimates can be assessed using the estimates for the shape parameter $k$ of the gPd.
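Putting the pieces together, the LOO expected log pointwise predictive density is a weighted average of the per-draw likelihoods under the smoothed weights. Again an illustrative Python sketch with hypothetical names:

```python
import math

def elpd_loo(loglik, weights):
    # loglik[i][s]  = log p(y_i | theta^s)
    # weights[i][s] = smoothed importance weight w_i^s
    # elpd_loo = sum_i log( sum_s w_i^s p(y_i | theta^s) / sum_s w_i^s )
    total = 0.0
    pointwise = []
    for ll_i, w_i in zip(loglik, weights):
        num = sum(w * math.exp(l) for w, l in zip(w_i, ll_i))
        den = sum(w_i)
        lpd = math.log(num / den)            # log p(y_i | y_-i) estimate
        pointwise.append(lpd)
        total += lpd
    return total, pointwise
```

The pointwise values are what feed the standard-error computations described earlier.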
If the estimated tail shape parameter $k \ge 1/2$, the user should be warned. Even if the PSIS estimate has a finite variance, for the problematic $i$ the user should consider sampling directly from $p(\theta^s | y_{-i})$, using $k$-fold cross-validation, or using a more robust model.
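A minimal diagnostic helper along these lines might look as follows (hypothetical name and output; the actual package reports these diagnostics differently):

```python
def check_pareto_k(k_values, threshold=0.5):
    # Flag data points whose estimated gPd shape parameter k meets or
    # exceeds the threshold; for these, the raw importance weights have
    # very heavy tails and the PSIS-LOO estimate may be unreliable.
    flagged = [i for i, k in enumerate(k_values) if k >= threshold]
    if flagged:
        print(f"Warning: {len(flagged)} observation(s) with k >= {threshold}: {flagged}")
    return flagged
```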
Importance sampling is likely to work less well when the full posterior $p(\theta^s | y)$ and the LOO posterior $p(\theta^s | y_{-i})$ differ substantially, which is more likely to happen with a non-robust model and highly influential observations. A robust model may reduce the sensitivity to such observations.
Epifani, I., MacEachern, S. N., and Peruggia, M. (2008). Case-deletion importance sampling estimators: Central limit theorems and related results. Electronic Journal of Statistics 2, 774-806.
Gelfand, A. E. (1996). Model determination using sampling-based methods. In Markov Chain Monte Carlo in Practice, ed. W. R. Gilks, S. Richardson, D. J. Spiegelhalter, 145-162. London: Chapman and Hall.
Gelfand, A. E., Dey, D. K., and Chang, H. (1992). Model determination using predictive distributions with implementation via sampling-based methods. In Bayesian Statistics 4, ed. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, 147-167. Oxford University Press.
Gelman, A., Hwang, J., and Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing 24, 997-1016.
Ionides, E. L. (2008). Truncated importance sampling. Journal of Computational and Graphical Statistics 17, 295-311.
Koopman, S. J., Shephard, N., and Creal, D. (2009). Testing the assumptions behind importance sampling. Journal of Econometrics 149, 2-11.
Peruggia, M. (1997). On the variability of case-deletion importance sampling weights in the Bayesian linear model. Journal of the American Statistical Association 92, 199-207.
Stan Development Team (2016). Stan: A C++ library for probability and sampling, version 2.9.
Stan Development Team (2016). RStan, version 2.9.
Vehtari, A., and Gelman, A. (2015). Pareto smoothed importance sampling. arXiv:1507.02646.
Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research 11, 3571-3594.
Zhang, J., and Stephens, M. A. (2009). A new and efficient estimation method for the generalized Pareto distribution. Technometrics 51, 316-325.