spread_samples: Extract samples of parameters in a Bayesian model fit into a tidy data format

Description

Extract samples from a Bayesian/MCMC sampler for a variable with the given named indices into one of two types of long-format data frames.

Usage

spread_samples(model, ..., regex = FALSE, sep = "[, ]")
gather_samples(model, ..., regex = FALSE, sep = "[, ]")

Arguments

model

A supported Bayesian model fit / MCMC object. Tidybayes supports a variety of model objects; for a full list of supported models, see tidybayes-models.

...

Expressions in the form of variable_name[index_1, index_2, ...] | wide_index. See `Details`.

regex

If TRUE, parameter names are treated as regular expressions, and all columns matching the regular expression and the number of indices are included in the output. Default FALSE.

sep

Separator used to separate indices in parameter names, as a regular expression.

Value

A data frame.

Details

Imagine a JAGS or Stan fit named fit. The model may contain a parameter named b[i,v] (in the JAGS or Stan language) with i in 1:100 and v in 1:3. However, samples returned from JAGS or Stan in R will not reflect this indexing structure; instead, they will have multiple columns with names like "b[1,1]", "b[2,1]", etc.
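
To make the problem concrete, here is a minimal base R sketch of recovering indices from such column names. It illustrates the idea only and is not how tidybayes implements it; the matrix samples and its values are made up for the example.

```r
# Toy stand-in for sampler output: one row per draw, one column per element
# of b. The column names and values are invented for illustration.
samples <- matrix(
  1:8, nrow = 2,
  dimnames = list(NULL, c("b[1,1]", "b[2,1]", "b[1,2]", "b[2,2]"))
)

# Split each column name into a variable name and its index string
col_names <- colnames(samples)
variable  <- sub("\\[.*$", "", col_names)             # "b" for every column
index_str <- sub("^.*\\[(.*)\\]$", "\\1", col_names)  # "1,1", "2,1", ...

# Split the index string using the same pattern as the default `sep`
indices <- do.call(rbind, strsplit(index_str, "[, ]"))

# Long format: one row per (draw, i, v) combination
long <- data.frame(
  .iteration = rep(seq_len(nrow(samples)), times = ncol(samples)),
  i = rep(as.integer(indices[, 1]), each = nrow(samples)),
  v = rep(as.integer(indices[, 2]), each = nrow(samples)),
  b = as.vector(samples)
)
long
```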

spread_samples and gather_samples provide a straightforward syntax to translate these columns back into properly-indexed variables in two different tidy data frame formats, optionally recovering index types (e.g. factor levels) as they do so.

spread_samples and gather_samples return data frames already grouped by all indices used on the variables you specify.

The difference between the two is that spread_samples spreads the names of parameters in the model across the data frame as column names, whereas gather_samples gathers terms into a single column named "term" and places estimates of terms into a column named "estimate". The "term" and "estimate" naming scheme is consistent with output from the tidy function in the broom package, which makes it easier to use tidybayes with broom for model comparison.

For example, spread_samples(fit, a[i], b[i,v]) might return a grouped data frame (grouped by i and v), with:

column ".chain": the chain number
column ".iteration": the iteration number
column "i": value in 1:100
column "v": value in 1:3
column "a": value of "a[i]" for iteration number ".iteration" on chain number ".chain"
column "b": value of "b[i,v]" for iteration number ".iteration" on chain number ".chain"

gather_samples(fit, a[i], b[i,v]) on the same fit would return a grouped data frame (grouped by i and v), with:

column ".chain": the chain number
column ".iteration": the iteration number
column "i": value in 1:100
column "v": value in 1:3, or NA if "term" is "a".
column "term": value in c("a", "b").
column "estimate": value of "a[i]" (when "term" is "a") or "b[i,v]" (when "term" is "b") for iteration number ".iteration" on chain number ".chain"
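
The two layouts can be compared on a toy example. The sketch below builds both by hand in base R from made-up values for a[i] and b[i,v]; the data frames a_long and b_long are hypothetical stand-ins, and the result is ungrouped, unlike the output of the real functions.

```r
# Hypothetical draws for a[i] and b[i,v] (i in 1:2, v in 1:2), one draw only
a_long <- data.frame(.chain = 1L, .iteration = 1L, i = 1:2, a = c(0.1, 0.2))
b_long <- data.frame(.chain = 1L, .iteration = 1L,
                     i = c(1L, 2L, 1L, 2L), v = c(1L, 1L, 2L, 2L),
                     b = c(1.1, 2.1, 1.2, 2.2))

# "Spread" layout: one column per variable, joined on the shared columns
spread_layout <- merge(a_long, b_long, by = c(".chain", ".iteration", "i"))

# "Gather" layout: variable names go into "term", values into "estimate";
# v is NA for rows of a, which has no v index
gather_layout <- rbind(
  data.frame(a_long[c(".chain", ".iteration", "i")], v = NA_integer_,
             term = "a", estimate = a_long$a),
  data.frame(b_long[c(".chain", ".iteration", "i", "v")],
             term = "b", estimate = b_long$b)
)

spread_layout
gather_layout
```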

spread_samples and gather_samples can use type information applied to the fit object by recover_types to convert columns back into their original types. This is particularly helpful if some of the indices in your model were originally factors. For example, if the v index in the original data frame data was a factor with levels c("a","b","c"), then we could use recover_types before spread_samples:

fit %>%
 recover_types(data) %>%
 spread_samples(b[i,v])

Which would return the same data frame as above, except the "v" column would be a value in c("a","b","c") instead of 1:3.
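
Conceptually, recovering a factor index amounts to mapping each integer index back to the level it stood for. The following base R sketch shows that mapping only; it is not the recover_types implementation, and data and v_index are made-up examples.

```r
# Hypothetical original data: v was a factor with three levels
data <- data.frame(v = factor(c("a", "b", "c", "a")))

# Index values as they might come back from the sampler
v_index <- c(1L, 2L, 3L, 2L)

# Map each integer index back to the corresponding factor level
v_recovered <- factor(levels(data$v)[v_index], levels = levels(data$v))
v_recovered
```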

For variables that do not share the same subscripts (or share some but not all subscripts), we can supply their specifications separately. For example, if we have a variable d[i] with the same i subscript as b[i,v], and a variable x with no subscripts, we could do this:

spread_samples(fit, x, d[i], b[i,v])

Which is roughly equivalent to this:

spread_samples(fit, x) %>%
 inner_join(spread_samples(fit, d[i])) %>%
 inner_join(spread_samples(fit, b[i,v])) %>%
 group_by(i,v)

Similarly, this:

gather_samples(fit, x, d[i], b[i,v])

Is roughly equivalent to this:

bind_rows(
 gather_samples(fit, x),
 gather_samples(fit, d[i]),
 gather_samples(fit, b[i,v])
)

The c and cbind functions can be used to combine multiple variable names that have the same indices. For example, if we have several variables with the same subscripts i and v, we could do either of these:

spread_samples(fit, c(w, x, y, z)[i,v])

spread_samples(fit, cbind(w, x, y, z)[i,v])  # equivalent

Each of which is roughly equivalent to this:

spread_samples(fit, w[i,v], x[i,v], y[i,v], z[i,v])

Besides being more compact, the c()-style syntax is currently also faster (though that may change).

Indices can be omitted from the resulting data frame by leaving their names blank; e.g. spread_samples(fit, b[,v]) will omit the first index of b from the output. This is useful if an index is known to take the same value throughout a given model.

The shorthand .. can be used to specify one index that should be put into a wide format. The resulting column names are the base variable name, plus a dot ("."), plus the value of the index in the position of "..". For example:

spread_samples(fit, b[i,..]) would return a grouped data frame (grouped by i), with:

column ".chain": the chain number
column ".iteration": the iteration number
column "i": value in 1:100
column "b.1": value of "b[i,1]" for iteration number ".iteration" on chain number ".chain"
column "b.2": value of "b[i,2]" for iteration number ".iteration" on chain number ".chain"
column "b.3": value of "b[i,3]" for iteration number ".iteration" on chain number ".chain"
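
The naming scheme can be reproduced on a toy long-format data frame with base R's reshape, which happens to build wide column names the same way (variable name, dot, index value). This is only an illustration of the resulting layout, not how tidybayes produces it; the data frame long and its values are made up.

```r
# Hypothetical long-format draws of b[i,v] (i in 1:2, v in 1:3), one draw
long <- data.frame(
  .chain = 1L, .iteration = 1L,
  i = rep(1:2, times = 3),
  v = rep(1:3, each = 2),
  b = c(1.1, 2.1, 1.2, 2.2, 1.3, 2.3)
)

# Widen on v: each value of v becomes its own column of b
wide <- reshape(long, idvar = c(".chain", ".iteration", "i"),
                timevar = "v", direction = "wide")
names(wide)
```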

An optional clause in the form | wide_index can also be used to put the data frame into a wide format based on wide_index. For example, this:

spread_samples(fit, b[i,v] | v)

is roughly equivalent to this:

spread_samples(fit, b[i,v]) %>% spread(v,b)

The main difference between the | syntax and the .. syntax is that the | syntax respects prototypes applied to indices with recover_types, and thus can be used to get columns with nicer names. For example:

fit %>% recover_types(data) %>% spread_samples(b[i,v] | v) would return a grouped data frame (grouped by i), with:

column ".chain": the chain number
column ".iteration": the iteration number
column "i": value in 1:100
column "a": value of "b[i,1]" for iteration number ".iteration" on chain number ".chain"
column "b": value of "b[i,2]" for iteration number ".iteration" on chain number ".chain"
column "c": value of "b[i,3]" for iteration number ".iteration" on chain number ".chain"

Finally, parameter names can be regular expressions by setting regex = TRUE; e.g.:

spread_samples(fit, `b_.*`[i], regex = TRUE)

Would return a tidy data frame with parameters starting with `b_` and having one index.
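
As a rough illustration of what "matching the regular expression and number of indices" means, the base R sketch below selects column names that start with b_ and carry exactly one index. The column names and the pattern are assumptions made for the example; the pattern tidybayes builds internally may differ.

```r
# Hypothetical column names from a fit
col_names <- c("b_age[1]", "b_age[2]", "b_sex[1]", "b_int[1,2]", "sigma")

# The variable pattern followed by exactly one index (no separator inside
# the brackets)
pattern <- "^(b_.*)\\[([^, ]+)\\]$"

# Keeps the three one-index b_ columns; drops "b_int[1,2]" and "sigma"
matched <- grepl(pattern, col_names)
col_names[matched]
```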

Examples

# NOT RUN {
library(magrittr)
library(tidybayes)  # provides spread_samples, gather_samples, and mean_qi
library(ggplot2)

data(RankCorr, package = "tidybayes")

RankCorr %>%
  spread_samples(b[i, j])

RankCorr %>%
  spread_samples(b[i, j], tau[i], u_tau[i])


RankCorr %>%
  gather_samples(b[i, j], tau[i], u_tau[i])

RankCorr %>%
  gather_samples(tau[i], typical_r) %>%
  mean_qi()

# }