Usage

normal(location = 0, scale = NULL)
student_t(df = 1, location = 0, scale = NULL)
cauchy(location = 0, scale = NULL)
hs(df = 3)
hs_plus(df1 = 3, df2 = 3)
decov(regularization = 1, concentration = 1, shape = 1, scale = 1)
dirichlet(concentration = 1)
R2(location = NULL, what = c("mode", "mean", "median", "log"))
prior_options(prior_scale_for_dispersion = 5, min_prior_scale = 1e-12, scaled = TRUE)
Arguments

location: The prior location. For normal and student_t (provided that df > 1) this is the prior mean. For cauchy (which is equivalent to student_t with df = 1), the mean does not exist and location is the prior median. The default value is $0$, except for R2, which has no default value for location. For R2, location pertains to the prior location of the $R^2$ under a Beta distribution, but the interpretation of the location parameter depends on the specified value of the what argument (see the "R2 family" section in Details).

df, df1, df2: The prior degrees of freedom. The default is $1$ for student_t, in which case it is equivalent to cauchy. For the hierarchical shrinkage priors (hs and hs_plus) the degrees of freedom parameter(s) default to $3$.

regularization: Exponent for an LKJ prior on the correlation matrix in the decov prior. The default is $1$, implying a joint uniform prior.

shape, scale: Shape and scale parameters for a gamma prior on the scale parameter in the decov prior. If shape and scale are both $1$ (the default) then the gamma prior simplifies to the unit-exponential distribution.

what: A character string among 'mode' (the default), 'mean', 'median', or 'log' indicating how the location parameter is interpreted in the LKJ case. If 'log', then location is interpreted as the expected logarithm of the $R^2$ under a Beta distribution. Otherwise, location is interpreted as the what of the $R^2$ under a Beta distribution. If the number of predictors is less than or equal to two, the mode of this Beta distribution does not exist and an error will prompt the user to specify another choice for what.

scaled: A logical scalar, defaulting to TRUE. If TRUE, the prior scale is further scaled by the range of the predictor if the predictor has exactly two unique values, and by twice the standard deviation of the predictor if it has more than two unique values.

Details

Family members:
normal(location, scale)
student_t(df, location, scale)
cauchy(location, scale)
For the prior distribution for the intercept, location, scale, and df
should be scalars. For the prior for the other
coefficients they can either be vectors of length equal to the number of
coefficients (not including the intercept), or they can be scalars, in
which case they will be recycled to the appropriate length. As the
degrees of freedom approach infinity, the Student t distribution
approaches the normal distribution, and if the degrees of freedom equal
one, the Student t distribution is the Cauchy distribution.
If scale
is not specified it will default to 10 for the intercept
and 2.5 for the other coefficients, unless the probit link function is
used, in which case these defaults are scaled by a factor of
dnorm(0)/dlogis(0)
, which is roughly 1.6.
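For example, a minimal sketch (the formula and scale values are purely illustrative) of supplying one prior scale per coefficient, along with the probit rescaling factor mentioned above:

dnorm(0) / dlogis(0)  # the probit rescaling factor, approximately 1.6
fit <- stan_glm(mpg ~ wt + am, data = mtcars,
                prior = normal(location = c(0, 0), scale = c(2.5, 5)),
                prior_intercept = normal(0, 10))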
Hierarchical shrinkage family
Family members:
hs(df)
hs_plus(df1, df2)
The hierarchical shrinkage priors are normal with a mean of zero and a
standard deviation that is also a random variable. The traditional
hierarchical shrinkage prior utilizes a standard deviation that is
distributed half Cauchy with a median of zero and a scale parameter that is
also half Cauchy. This is called the "horseshoe prior". The hierarchical
shrinkage (hs
) prior in the rstanarm package instead utilizes
a half Student t distribution for the standard deviation (with 3 degrees of
freedom by default), scaled by a half Cauchy parameter, as described by
Piironen and Vehtari (2015). It is possible to change the df
argument, the prior degrees of freedom, to obtain less or more shrinkage.
The hierarchical shrinkage plus (hs_plus
) prior is a normal with a
mean of zero and a standard deviation that is distributed as the product of
two independent half Student t parameters (both with 3 degrees of freedom
(df1
, df2
) by default) that are each scaled by the same
square root of a half Cauchy parameter.
These hierarchical shrinkage priors have very tall modes and very fat
tails. Consequently, they tend to produce posterior distributions that are
very concentrated near zero, unless the predictor has a strong influence on
the outcome, in which case the prior has little influence. Hierarchical
shrinkage priors often require you to increase the
adapt_delta
tuning parameter in order to diminish the number
of divergent transitions. For more details on tuning parameters and
divergent transitions see the Troubleshooting section of the
How to Use the rstanarm Package vignette.
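As a sketch (the logistic regression for the binary am indicator in mtcars is purely illustrative), a hierarchical shrinkage prior might be paired with a larger adapt_delta:

fit_hs <- stan_glm(am ~ wt + qsec + drat, data = mtcars,
                   family = binomial(),
                   prior = hs(df = 3),
                   adapt_delta = 0.99)  # raised to reduce divergent transitions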
Dirichlet family
Family members:
dirichlet(concentration)
The Dirichlet distribution is a multivariate generalization of the beta
distribution. It is perhaps the easiest prior distribution to specify
because the concentration parameters can be interpreted as prior counts
(although they need not be integers) of a multinomial random variable.
The Dirichlet distribution is used in stan_polr
for an
implicit prior on the cutpoints in an ordinal regression model. More
specifically, the Dirichlet prior pertains to the prior probability of
observing each category of the ordinal outcome when the predictors are at
their sample means. Given these prior probabilities, it is straightforward
to add them to form cumulative probabilities and then use an inverse CDF
transformation of the cumulative probabilities to define the cutpoints.
If a scalar is passed to the concentration
argument of the
dirichlet
function, then it is replicated to the appropriate length
and the Dirichlet distribution is symmetric. If concentration
is a
vector and all elements are $1$, then the Dirichlet distribution is
jointly uniform. If all concentration parameters are equal but greater than
$1$ then the prior mode is that the categories are equiprobable, and
the larger the value of the identical concentration parameters, the more
sharply peaked the distribution is at the mode. The elements in
concentration
can also be given different values to represent that
not all outcome categories are a priori equiprobable.
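For instance, a minimal sketch (the esoph model and prior values are illustrative) of passing a Dirichlet prior to stan_polr through its prior_counts argument:

fit_polr <- stan_polr(tobgp ~ agegp, data = esoph,
                      prior = R2(0.25),            # prior mode for R^2
                      prior_counts = dirichlet(1)) # jointly uniform over category probabilities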
Covariance matrices
Family members:
decov(regularization, concentration, shape, scale)
Covariance matrices are decomposed into correlation matrices and
variances. The variances are in turn decomposed into the product of a
simplex vector and the trace of the matrix. Finally, the trace is the
product of the order of the matrix and the square of a scale parameter.
This prior on a covariance matrix is represented by the decov
function.
The prior for a correlation matrix is called LKJ whose density is
proportional to the determinant of the correlation matrix raised to the
power of a positive regularization parameter minus one. If
regularization = 1
(the default), then this prior is jointly
uniform over all correlation matrices of that size. If
regularization > 1
, then the identity matrix is the mode and in the
unlikely case that regularization < 1
, the identity matrix is the
trough.
The trace of a covariance matrix is equal to the sum of the variances. We
set the trace equal to the product of the order of the covariance matrix
and the square of a positive scale parameter. The particular
variances are set equal to the product of a simplex vector --- which is
non-negative and sums to $1$ --- and the scalar trace. In other words,
each element of the simplex vector represents the proportion of the trace
attributable to the corresponding variable.
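A small numeric sketch of this decomposition (all values are hypothetical):

J <- 3                        # order of the covariance matrix
tau <- 2                      # positive scale parameter
trace <- J * tau^2            # trace = order * scale^2 = 12
simplex <- c(0.5, 0.3, 0.2)   # non-negative and sums to 1
variances <- simplex * trace  # 6.0, 3.6, 2.4, which sum to the trace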
A symmetric Dirichlet prior is used for the simplex vector, which has a
single (positive) concentration
parameter, which defaults to
$1$ and implies that the prior is jointly uniform over the space of
simplex vectors of that size. If concentration > 1
, then the prior
mode corresponds to all variables having the same (proportion of total)
variance, which can be used to ensure that the posterior variances are not
zero. As the concentration
parameter approaches infinity, this
mode becomes more pronounced. In the unlikely case that
concentration < 1
, the variances are more polarized.
If all the variables were multiplied by a number, the trace of their
covariance matrix would increase by that number squared. Thus, it is
reasonable to use a scale-invariant prior distribution for the positive
scale parameter, and in this case we utilize a Gamma distribution, whose
shape
and scale
are both $1$ by default, implying a
unit-exponential distribution. Set the shape
hyperparameter to some
value greater than $1$ to ensure that the posterior trace is not zero.
If regularization, concentration, shape, and/or scale are positive
scalars, then they are recycled to the
appropriate length. Otherwise, each can be a positive vector of the
appropriate length, but the appropriate length depends on the number of
covariance matrices in the model and their sizes. A one-by-one covariance
matrix is just a variance and thus does not have regularization or
concentration parameters, but does have shape and scale parameters for
the prior standard deviation of that variable.
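For example, a sketch (the sleepstudy model is illustrative and assumes the lme4 package is installed for its data) of a non-default decov prior in a linear mixed model:

data("sleepstudy", package = "lme4")  # assumes lme4 is installed
fit_lmer <- stan_lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy,
                      prior_covariance = decov(regularization = 2,
                                               shape = 2, scale = 2))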
R2 family
Family members:
R2(location, what)
The stan_lm, stan_aov, and stan_polr functions allow the user to
utilize a function called R2 to convey prior information about all
the parameters.
This prior hinges on prior beliefs about the location of $R^2$, the
proportion of variance in the outcome attributable to the predictors,
which has a Beta
prior with first shape
hyperparameter equal to half the number of predictors and second shape
hyperparameter free. By specifying what
to be the prior mode (the
default), mean, median, or expected log of $R^2$, the second shape
parameter for this Beta distribution is determined internally. If
what = 'log'
, location should be a negative scalar; otherwise it
should be a scalar on the $(0,1)$ interval.
For example, if $R^2 = 0.5$, then the mode, mean, and median of
the Beta
distribution are all the same and thus the
second shape parameter is also equal to half the number of predictors.
The second shape parameter of the Beta
distribution
is actually the same as the shape parameter in the LKJ prior for a
correlation matrix described in the previous subsection. Thus, the smaller
is $R^2$, the larger is the shape parameter, the smaller are the
prior correlations among the outcome and predictor variables, and the more
concentrated near zero is the prior density for the regression
coefficients. Hence, the prior on the coefficients is regularizing and
should yield a posterior distribution with good out-of-sample predictions
if the prior location of $R^2$ is specified in a reasonable
fashion.
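For example, a minimal sketch (the model and prior location are illustrative) of conveying a belief that the most probable value of $R^2$ is 0.5:

fit_R2 <- stan_lm(mpg ~ wt + qsec + am, data = mtcars,
                  prior = R2(location = 0.5, what = "mode"))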
References

Gelman, A., Jakulin, A., Pittau, M. G., and Su, Y.-S. (2008). A weakly informative default prior distribution for logistic and other regression models. Annals of Applied Statistics. 2(4), 1360--1383.
Piironen, J., and Vehtari, A. (2015). Projection predictive variable selection using Stan+R. http://arxiv.org/abs/1508.02502
Stan Development Team. (2015). Stan Modeling Language Users Guide and Reference Manual. http://mc-stan.org/documentation/
Examples

fmla <- mpg ~ wt + qsec + drat + am
# Draw from prior predictive distribution (by setting prior_PD = TRUE)
prior_pred_fit <- stan_glm(fmla, data = mtcars, chains = 1, prior_PD = TRUE,
prior = student_t(df = 4, 0, 2.5),
prior_intercept = cauchy(0,10),
prior_ops = prior_options(prior_scale_for_dispersion = 2))
## Not run:
# # Can assign priors to names
# N05 <- normal(0, 5)
# fit <- stan_glm(fmla, data = mtcars, prior = N05, prior_intercept = N05)
## End(Not run)
# Visually compare normal, student_t, and cauchy
library(ggplot2)
compare_priors <- function(scale = 1, df_t = 2, xlim = c(-10, 10)) {
dt_loc_scale <- function(x, df, location, scale) {
# t distribution with location & scale parameters
1 / scale * dt((x - location) / scale, df)
}
ggplot(data.frame(x = xlim), aes(x)) +
stat_function(fun = dnorm,
args = list(mean = 0, sd = scale),
color = "purple", size = .75) +
stat_function(fun = dt_loc_scale,
args = list(df = df_t, location = 0, scale = scale),
color = "orange", size = .75) +
stat_function(fun = dcauchy,
args = list(location = 0, scale = scale),
color = "skyblue", size = .75, linetype = 2) +
ggtitle("normal (purple) vs student_t (orange) vs cauchy (blue)")
}
# Cauchy has fattest tails, then student_t, then normal
compare_priors()
# The student_t with df = 1 is the same as the cauchy
compare_priors(df_t = 1)
# Even a scale of 5 is somewhat large. It gives plausibility to rather
# extreme values
compare_priors(scale = 5, xlim = c(-20,20))
# If you use a prior like normal(0, 1000) to be "non-informative" you are
# actually saying that a coefficient value of e.g. -500 is quite plausible
compare_priors(scale = 1000, xlim = c(-1000,1000))