pp_check: Graphical posterior predictive checks

Description

Various plots comparing the observed outcome variable $y$ to simulated datasets $yrep$ from the posterior predictive distribution.

Usage

pp_check(object, check = "distributions", nreps = NULL, seed = NULL, overlay = TRUE, test = "mean", ...)

Arguments

object

A fitted model object returned by one of the rstanarm modeling functions. See stanreg-objects.

check

The type of plot (possibly abbreviated) to show. One of "distributions", "residuals", "scatter", "test". See Details for descriptions.

nreps

The number of $yrep$ datasets to generate from the posterior predictive distribution (posterior_predict) and show in the plots. The default is nreps=3 for check="residuals" and nreps=8 for check="distributions". If check="test", nreps is ignored and the number of simulated datasets is the number of post-warmup draws from the posterior distribution. If check="scatter", nreps is not ignored but defaults to the number of post-warmup draws.

seed

An optional seed to pass to posterior_predict.

overlay

For check="distributions" only, should distributions be plotted as density estimates overlaid in a single plot (TRUE, the default) or as separate histograms (FALSE)?

test

For check="test" only, a character vector (of length 1 or 2) naming a single function or a pair of functions. The function(s) should take a vector input and return a scalar test statistic. See Details and Examples.

...

Optional arguments to geoms to control features of the plots (e.g. binwidth if the plot is a histogram).

Value

A ggplot object that can be further customized using the ggplot2 package.

Details

Descriptions of the plots corresponding to the different values of check:

distributions: The distributions of $y$ and nreps simulated $yrep$ datasets.
residuals: The distributions of residuals computed from $y$ and each of nreps simulated datasets. For binomial data, binned residual plots are generated (similar to binnedplot).
scatter: If nreps is NULL then $y$ is plotted against the average values of $yrep$, i.e., the points $(y_n, mean(yrep_n)), n = 1,...,N$, where each $yrep_n$ is a vector of length equal to the number of posterior draws. If nreps is a (preferably small) integer, then only nreps $yrep$ datasets are simulated and they are each plotted separately against $y$.
test: The distribution of a single test statistic $T(yrep)$ or a pair of test statistics over the nreps simulated datasets. If the test argument specifies only one function then the resulting plot is a histogram of $T(yrep)$ and the value of the test statistic in the observed data, $T(y)$, is shown in the plot as a vertical line. If two functions are specified then the plot is a scatterplot and $T(y)$ is shown as a large point.

References

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian Data Analysis. Chapman & Hall/CRC Press, London, third edition. (Ch. 6)

Examples

Run this code

if (!exists("example_model")) example(example_model)
# Compare distribution of y to distributions of yrep
(pp_dist <- pp_check(example_model, check = "dist", overlay = TRUE))
pp_dist + 
 ggplot2::scale_color_manual(values = c("red", "black")) + # change colors
 ggplot2::scale_size_manual(values = c(0.5, 3)) + # change line sizes 
 ggplot2::scale_fill_manual(values = c(NA, NA)) # remove fill

# Check residuals
pp_check(example_model, check = "resid", nreps = 6)

# Check histograms of test statistics
test_mean <- pp_check(example_model, check = "test", test = "mean")
test_sd <- pp_check(example_model, check = "test", test = "sd")
gridExtra::grid.arrange(test_mean, test_sd, ncol = 2)

# Scatterplot of two test statistics
pp_check(example_model, check = "test", test = c("mean", "sd"))

## Not run: 
# # Scatterplots of y vs. yrep
# fit <- stan_glm(mpg ~ wt, data = mtcars)
# pp_check(fit, check = "scatter") # y vs. average yrep
# pp_check(fit, check = "scatter", nreps = 3) # y vs. a few different yrep datasets 
# 
# 
# # Defining a function to compute test statistic 
# roaches$roach100 <- roaches$roach1 / 100
# fit_pois <- stan_glm(y ~ treatment + roach100 + senior, offset = log(exposure2), 
#                      family = "poisson", data = roaches)
# fit_nb <- update(fit_pois, family = "neg_binomial_2")
# 
# prop0 <- function(y) mean(y == 0) # function to compute proportion of zeros
# pp_check(fit_pois, check = "test", test = "prop0") # looks bad 
# pp_check(fit_nb, check = "test", test = "prop0")   # much better
# ## End(Not run)

Run the code above in your browser using DataLab