gather_draws: Extract draws of variables in a Bayesian model fit into a tidy data format

Description

Extract draws from a Bayesian model for one or more variables (possibly with named dimensions) into one of two types of long-format data frames.

Usage

gather_draws(model, ..., regex = FALSE, sep = "[, ]", n = NULL, seed = NULL)
spread_draws(model, ..., regex = FALSE, sep = "[, ]", n = NULL, seed = NULL)

Arguments

model

A supported Bayesian model fit. Tidybayes supports a variety of model objects; for a full list of supported models, see tidybayes-models.

...

Expressions in the form of variable_name[dimension_1, dimension_2, ...] | wide_dimension. See Details.

regex

If TRUE, variable names are treated as regular expressions and all columns matching the regular expression and number of dimensions are included in the output. Default FALSE.

sep

Separator used to separate dimensions in variable names, as a regular expression.

n

The number of draws to return, or NULL to return all draws.

seed

A seed to use when subsampling draws (i.e. when n is not NULL).
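
As a rough illustration of how n and seed interact, the base R sketch below subsamples a toy data frame of draws. The helper subsample_draws() is hypothetical and is not part of tidybayes; it assumes sampling without replacement, which may differ from what the package does internally.

```r
# Toy draws: 1000 draws of a single variable x (made-up values).
draws <- data.frame(.draw = 1:1000, x = seq(0, 1, length.out = 1000))

# Hypothetical helper (not a tidybayes function): keep n rows chosen at
# random, using seed so that the same rows are chosen on every call.
subsample_draws <- function(d, n = NULL, seed = NULL) {
  if (is.null(n)) return(d)            # n = NULL keeps all draws
  if (!is.null(seed)) set.seed(seed)   # fix the RNG state for reproducibility
  d[sort(sample(nrow(d), n)), , drop = FALSE]
}

s1 <- subsample_draws(draws, n = 10, seed = 42)
s2 <- subsample_draws(draws, n = 10, seed = 42)
identical(s1, s2)  # same seed, same subset
```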

Value

A data frame.

Details

Imagine a JAGS or Stan fit named fit. The model may contain a variable named b[i,v] (in the JAGS or Stan language) with dimension i in 1:100 and dimension v in 1:3. However, the default format for draws returned from JAGS or Stan in R does not reflect this indexing structure; instead, the draws have multiple columns with names like "b[1,1]", "b[2,1]", etc.
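
To make the problem concrete, here is a minimal base R sketch that builds a toy matrix of draws with such flat column names and recovers the indices by hand. The sizes (4 draws, i in 1:2, v in 1:3) and the regular expression are assumptions chosen for illustration; this is not how tidybayes is implemented.

```r
# Toy draws matrix with flat column names "b[1,1]", "b[2,1]", ...
set.seed(1)
col_names <- paste0("b[", rep(1:2, times = 3), ",", rep(1:3, each = 2), "]")
draws <- matrix(rnorm(4 * 6), nrow = 4, dimnames = list(NULL, col_names))

# Recover the two indices from each column name with a regular expression.
pattern <- "^b\\[([0-9]+),([0-9]+)\\]$"
i <- as.integer(sub(pattern, "\\1", col_names))
v <- as.integer(sub(pattern, "\\2", col_names))

# Stack the columns into a long data frame: one row per draw and (i, v) pair.
# as.vector() reads the matrix column by column, so indices repeat per column.
n_draws <- nrow(draws)
long <- data.frame(
  .draw = rep(seq_len(n_draws), times = ncol(draws)),
  i     = rep(i, each = n_draws),
  v     = rep(v, each = n_draws),
  b     = as.vector(draws)
)
head(long)
```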

spread_draws and gather_draws provide a straightforward syntax to translate these columns back into properly indexed variables in two different tidy data frame formats, optionally recovering dimension types (e.g. factor levels) as they do so.

spread_draws and gather_draws return data frames already grouped by all dimensions used on the variables you specify.

The difference between spread_draws and gather_draws is that spread_draws spreads the names of variables in the model across the data frame as column names, whereas gather_draws gathers variables into a single column named ".variable" and places their values into a column named ".value". To use naming schemes from other packages (such as broom), consider passing results through functions like to_broom_names() or to_ggmcmc_names().

For example, spread_draws(fit, a[i], b[i,v]) might return a grouped data frame (grouped by i and v), with:

column ".chain": the chain number. NA if not applicable to the model type; this is typically only applicable to MCMC algorithms.
column ".iteration": the iteration number. Guaranteed to be unique within-chain only. NA if not applicable to the model type; this is typically only applicable to MCMC algorithms.
column ".draw": a unique number for each draw from the posterior. Order is not guaranteed to be meaningful.
column "i": value in 1:100
column "v": value in 1:3
column "a": value of "a[i]" for draw ".draw"
column "b": value of "b[i,v]" for draw ".draw"

gather_draws(fit, a[i], b[i,v]) on the same fit would return a grouped data frame (grouped by i and v), with:

column ".chain": the chain number
column ".iteration": the iteration number
column ".draw": the draw number
column "i": value in 1:100
column "v": value in 1:3, or NA if ".variable" is "a".
column ".variable": value in c("a", "b").
column ".value": value of "a[i]" (when ".variable" is "a") or "b[i,v]" (when ".variable" is "b") for draw ".draw"
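
The two layouts described above can be mimicked on toy data with base R. This is only a sketch of the resulting shapes under made-up sizes (2 draws, i in 1:2, v in 1:2), not the package's implementation; the .chain and .iteration columns and the grouping are left out for brevity.

```r
# Toy long-format draws for a[i] and b[i,v] (values are made up).
a_long <- expand.grid(.draw = 1:2, i = 1:2)
a_long$a <- c(0.1, 0.2, 0.3, 0.4)
b_long <- expand.grid(.draw = 1:2, i = 1:2, v = 1:2)
b_long$b <- seq(1, 8)

# "Spread" layout: one column per variable, joined on the shared columns.
spread_fmt <- merge(a_long, b_long, by = c(".draw", "i"))

# "Gather" layout: variable names go in .variable and values in .value;
# v is NA for rows belonging to a, which has no v dimension.
gather_fmt <- rbind(
  data.frame(.draw = a_long$.draw, i = a_long$i, v = NA_integer_,
             .variable = "a", .value = a_long$a),
  data.frame(.draw = b_long$.draw, i = b_long$i, v = b_long$v,
             .variable = "b", .value = b_long$b)
)

names(spread_fmt)  # ".draw" "i" "a" "v" "b"
names(gather_fmt)  # ".draw" "i" "v" ".variable" ".value"
```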

spread_draws and gather_draws can use type information applied to the fit object by recover_types() to convert columns back into their original types. This is particularly helpful if some of the dimensions in your model were originally factors. For example, if the v dimension in the original data frame data was a factor with levels c("a","b","c"), then we could use recover_types before spread_draws:

fit %>%
 recover_types(data) %>%
 spread_draws(b[i,v])

Which would return the same data frame as above, except the "v" column would be a value in c("a","b","c") instead of 1:3.

For variables that do not share the same subscripts (or share some but not all subscripts), we can supply their specifications separately. For example, if we have a variable d[i] with the same i subscript as b[i,v], and a variable x with no subscripts, we could do this:

spread_draws(fit, x, d[i], b[i,v])

Which is roughly equivalent to this:

spread_draws(fit, x) %>%
 inner_join(spread_draws(fit, d[i])) %>%
 inner_join(spread_draws(fit, b[i,v])) %>%
 group_by(i,v)

Similarly, this:

gather_draws(fit, x, d[i], b[i,v])

Is roughly equivalent to this:

bind_rows(
 gather_draws(fit, x),
 gather_draws(fit, d[i]),
 gather_draws(fit, b[i,v])
)

The c and cbind functions can be used to combine multiple variable names that have the same dimensions. For example, if we have several variables with the same subscripts i and v, we could do either of these:

spread_draws(fit, c(w, x, y, z)[i,v])

spread_draws(fit, cbind(w, x, y, z)[i,v])  # equivalent

Each of which is roughly equivalent to this:

spread_draws(fit, w[i,v], x[i,v], y[i,v], z[i,v])

Besides being more compact, the c()-style syntax is currently also faster (though that may change).

Dimensions can be omitted from the resulting data frame by leaving their names blank; e.g. spread_draws(fit, b[,v]) will omit the first dimension of b from the output. This is useful if a dimension is known to contain the same value throughout a given model.

The shorthand .. can be used to specify one column that should be put into a wide format and whose names will be the base variable name, plus a dot ("."), plus the value of the dimension at the position of the .. shorthand. For example:

spread_draws(fit, b[i,..]) would return a grouped data frame (grouped by i), with:

column ".chain": the chain number
column ".iteration": the iteration number
column ".draw": the draw number
column "i": value in 1:100
column "b.1": value of "b[i,1]" for draw ".draw"
column "b.2": value of "b[i,2]" for draw ".draw"
column "b.3": value of "b[i,3]" for draw ".draw"

An optional clause in the form | wide_dimension can also be used to put the data frame into a wide format based on wide_dimension. For example, this:

spread_draws(fit, b[i,v] | v)

is roughly equivalent to this:

spread_draws(fit, b[i,v]) %>% spread(v,b)

The main difference between the | syntax and the .. syntax is that the | syntax respects prototypes applied to dimensions with recover_types(), and thus can be used to get columns with nicer names. For example:

fit %>% recover_types(data) %>% spread_draws(b[i,v] | v)

would return a grouped data frame (grouped by i), with:

column ".chain": the chain number
column ".iteration": the iteration number
column ".draw": the draw number
column "i": value in 1:100
column "a": value of "b[i,1]" for draw ".draw"
column "b": value of "b[i,2]" for draw ".draw"
column "c": value of "b[i,3]" for draw ".draw"

The shorthand . can be used to specify columns that should be nested into vectors, matrices, or n-dimensional arrays (depending on how many dimensions are specified with .).

For example, spread_draws(fit, a[.], b[.,.]) might return a data frame, with:

column ".chain": the chain number.
column ".iteration": the iteration number.
column ".draw": a unique number for each draw from the posterior.
column "a": a list column of vectors.
column "b": a list column of matrices.

Ragged arrays are turned into non-ragged arrays with missing entries given the value NA.
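
The NA padding of ragged arrays can be pictured with a small base R sketch. The indices and values below are made up, and the approach (allocate a full matrix of NA, then fill the observed cells) is only one plausible way to get this result, not necessarily what tidybayes does.

```r
# Indices and values for a single draw of a ragged b[i,v]:
# row i = 1 has two entries, row i = 2 has three (toy numbers).
idx <- data.frame(
  i = c(1, 1, 2, 2, 2),
  v = c(1, 2, 1, 2, 3),
  b = c(0.5, 0.7, 1.1, 1.3, 1.9)
)

# Allocate a full matrix of NAs, then fill only the cells that exist.
m <- matrix(NA_real_, nrow = max(idx$i), ncol = max(idx$v))
m[cbind(idx$i, idx$v)] <- idx$b
m  # the missing cell b[1,3] stays NA
```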

Finally, variable names can be regular expressions by setting regex = TRUE; e.g.:

spread_draws(fit, `b_.*`[i], regex = TRUE)

Would return a tidy data frame with variables starting with b_ and having one dimension.
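
To see which columns such a specification would pick up, the matching can be approximated in base R. The column names below are hypothetical, and the way the pattern is anchored and combined with the index brackets is an assumption made for this sketch rather than the package's exact rule.

```r
# Hypothetical flat column names from a fit.
col_names <- c("b_age[1]", "b_age[2]", "b_sex[1]", "b_int[1,2]", "sigma", "ab_x[1]")

# Anchor the variable pattern to the whole name and require exactly one
# index inside the brackets (no "," or " " separator allowed).
var_regex <- "b_.*"
one_dim <- paste0("^(", var_regex, ")\\[([^], ]+)\\]$")

matched <- grep(one_dim, col_names, value = TRUE)
matched
```

Here "b_int[1,2]" is skipped because it has two dimensions, and "sigma" and "ab_x[1]" are skipped because their names do not start with b_.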

Examples

# NOT RUN {
library(dplyr)
library(ggplot2)
library(tidybayes)  # provides spread_draws(), gather_draws(), and median_qi()

data(RankCorr, package = "tidybayes")

RankCorr %>%
  spread_draws(b[i, j])

RankCorr %>%
  spread_draws(b[i, j], tau[i], u_tau[i])


RankCorr %>%
  gather_draws(b[i, j], tau[i], u_tau[i])

RankCorr %>%
  gather_draws(tau[i], typical_r) %>%
  median_qi()

# }