Extract samples from a Bayesian/MCMC sampler for a variable with the given named indices into one of two types of long-format data frames.
spread_samples(model, ..., regex = FALSE, sep = "[, ]")
gather_samples(model, ..., regex = FALSE, sep = "[, ]")
model: A supported Bayesian model fit / MCMC object. Tidybayes supports a variety of model objects; for a full list of supported models, see tidybayes-models.
...: Expressions in the form of variable_name[index_1, index_2, ...] | wide_index. See `Details`.
regex: If TRUE, parameter names are treated as regular expressions and all columns matching the regular expression and number of indices are included in the output. Default FALSE.
sep: Separator used to separate indices in parameter names, as a regular expression.
A data frame.
Imagine a JAGS or Stan fit named fit. The model may contain a parameter named b[i,v] (in the JAGS or Stan language) with i in 1:100 and v in 1:3. However, samples returned from JAGS or Stan in R will not reflect this indexing structure; instead, they will have multiple columns with names like "b[1,1]", "b[2,1]", etc. spread_samples and gather_samples provide a straightforward syntax to translate these columns back into properly-indexed variables in two different tidy data frame formats, optionally recovering index types (e.g. factor levels) as they do so.
spread_samples and gather_samples return data frames already grouped by all indices used on the variables you specify.
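To make the flat column names concrete, here is a minimal base R sketch of the kind of translation described above. It is not how tidybayes does it internally; the helper parse_indexed_name and the example column names are made up for illustration, and its sep argument only mimics the role of the sep argument documented above.

```r
# Flat column names as a sampler might return them (made-up example).
col_names <- c("b[1,1]", "b[2,1]", "b[1,2]", "b[2,2]")

# Hypothetical helper: split "name[i,v]" into the variable name and its
# two indices. `sep` is a regular expression separating the indices.
parse_indexed_name <- function(x, sep = "[, ]") {
  bracket_pos <- regexpr("[", x, fixed = TRUE)        # position of "["
  variable <- substr(x, 1, bracket_pos - 1)           # text before "["
  inside <- substr(x, bracket_pos + 1, nchar(x) - 1)  # text between "[" and "]"
  indices <- strsplit(inside, sep)                    # e.g. "2,1" -> c("2", "1")
  index_matrix <- do.call(rbind, lapply(indices, as.integer))
  data.frame(
    variable = variable,
    i = index_matrix[, 1],
    v = index_matrix[, 2],
    stringsAsFactors = FALSE
  )
}

parsed <- parse_indexed_name(col_names)
print(parsed)
```

Each flat name becomes one row with separate i and v columns, which is the indexing structure that the sampler output loses.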
The difference between the two is that with spread_samples, names of parameters in the model will be spread across the data frame as column names, whereas gather_samples will gather terms into a single column named "term" and place estimates of terms into a column named "estimate". The "term" and "estimate" naming scheme is used in order to be consistent with output from the tidy function in the broom package, to make it easier to use tidybayes with broom for model comparison.
For example, spread_samples(fit, a[i], b[i,v]) might return a grouped data frame (grouped by i and v), with:
column ".chain": the chain number
column ".iteration": the iteration number
column "i": value in 1:100
column "v": value in 1:3
column "a": value of "a[i]" for iteration number ".iteration" on chain number ".chain"
column "b": value of "b[i,v]" for iteration number ".iteration" on chain number ".chain"
gather_samples(fit, a[i], b[i,v]) on the same fit would return a grouped data frame (grouped by i and v), with:
column ".chain": the chain number
column ".iteration": the iteration number
column "i": value in 1:100
column "v": value in 1:3, or NA if "term" is "a"
column "term": value in c("a", "b")
column "estimate": value of "a[i]" (when "term" is "a") or "b[i,v]" (when "term" is "b") for iteration number ".iteration" on chain number ".chain"
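The relationship between the two layouts can be sketched in base R on a small made-up data frame. This only illustrates the shapes of the outputs and is not the tidybayes implementation. For simplicity it assumes two hypothetical variables, a and d, that share the single index i, so no NA index values arise.

```r
# Made-up samples in the "spread" layout: one column per variable.
spread_format <- data.frame(
  .chain = 1L,
  .iteration = c(1L, 1L, 2L, 2L),
  i = c(1L, 2L, 1L, 2L),
  a = c(0.1, 0.2, 0.3, 0.4),
  d = c(1.1, 1.2, 1.3, 1.4)
)

id_cols <- c(".chain", ".iteration", "i")
value_cols <- setdiff(names(spread_format), id_cols)  # "a" and "d"

# Build the "gather" layout: stack each variable under "term"/"estimate".
gather_format <- do.call(rbind, lapply(value_cols, function(term) {
  data.frame(
    spread_format[id_cols],
    term = term,
    estimate = spread_format[[term]],
    stringsAsFactors = FALSE,
    check.names = FALSE
  )
}))

print(gather_format)
```

The spread layout has one row per sample and index value with a column per variable; the gather layout repeats those rows once per variable and identifies the variable in the "term" column.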
spread_samples and gather_samples can use type information applied to the fit object by recover_types to convert columns back into their original types. This is particularly helpful if some of the indices in your model were originally factors. For example, if the v index in the original data frame data was a factor with levels c("a","b","c"), then we could use recover_types before spread_samples:
fit %>% recover_types(data) %>% spread_samples(b[i,v])
Which would return the same data frame as above, except the "v" column would be a value in c("a","b","c") instead of 1:3.
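The idea behind recovering a factor index can be sketched in base R as follows. This is an illustration of the mapping only, under the assumption that index value k corresponds to the k-th factor level; it does not show what recover_types does internally, and the data values are made up.

```r
# Hypothetical original data: `v` was a factor with three levels.
data <- data.frame(v = factor(c("a", "b", "c", "a")))

# Integer index values as they would appear in the sampler output.
v_index <- c(1L, 2L, 3L, 3L, 1L)

# Map each integer back to the factor level at that position.
v_recovered <- factor(levels(data$v)[v_index], levels = levels(data$v))
print(v_recovered)
```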
For variables that do not share the same subscripts (or share some but not all subscripts), we can supply their specifications separately. For example, if we have a variable d[i] with the same i subscript as b[i,v], and a variable x with no subscripts, we could do this:
spread_samples(fit, x, d[i], b[i,v])
Which is roughly equivalent to this:
spread_samples(fit, x) %>%
  inner_join(spread_samples(fit, d[i])) %>%
  inner_join(spread_samples(fit, b[i,v])) %>%
  group_by(i,v)
Similarly, this:
gather_samples(fit, x, d[i], b[i,v])
Is roughly equivalent to this:
bind_rows(
  gather_samples(fit, x),
  gather_samples(fit, d[i]),
  gather_samples(fit, b[i,v])
)
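What the join does to variables with different subscripts can be seen on small made-up data frames. The sketch below uses base R merge as a stand-in for inner_join (both join on the shared column names by default); the sample values are invented and the frames only imitate the shape of spread_samples output.

```r
# Made-up samples of an unindexed variable x: one row per iteration.
x_samples <- data.frame(
  .chain = 1L,
  .iteration = 1:2,
  x = c(10, 20)
)

# Made-up samples of d[i]: one row per iteration and value of i.
d_samples <- data.frame(
  .chain = 1L,
  .iteration = rep(1:2, each = 2),
  i = rep(1:2, times = 2),
  d = c(0.1, 0.2, 0.3, 0.4)
)

# Join on the shared columns (.chain and .iteration): the value of x
# is repeated for every value of i within the same iteration.
joined <- merge(x_samples, d_samples)
print(joined)
```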
The c and cbind functions can be used to combine multiple variable names that have the same indices. For example, if we have several variables with the same subscripts i and v, we could do either of these:
spread_samples(fit, c(w, x, y, z)[i,v])
spread_samples(fit, cbind(w, x, y, z)[i,v])  # equivalent
Each of which is roughly equivalent to this:
spread_samples(fit, w[i,v], x[i,v], y[i,v], z[i,v])
Besides being more compact, the c()-style syntax is currently also faster (though that may change).
Indices can be omitted from the resulting data frame by leaving their names blank; e.g. spread_samples(fit, b[,v]) will omit the first index of b from the output. This is useful if an index is known to take the same value throughout a given model.
The shorthand .. can be used to specify one column that should be put into a wide format and whose names will be the base variable name, plus a dot ("."), plus the value of the index at the position of "..". For example:
spread_samples(fit, b[i,..])
would return a grouped data frame (grouped by i), with:
column ".chain": the chain number
column ".iteration": the iteration number
column "i": value in 1:100
column "b.1": value of "b[i,1]" for iteration number ".iteration" on chain number ".chain"
column "b.2": value of "b[i,2]" for iteration number ".iteration" on chain number ".chain"
column "b.3": value of "b[i,3]" for iteration number ".iteration" on chain number ".chain"
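The wide layout with "b.1", "b.2", "b.3" columns can be imitated in base R with stats::reshape, which happens to use the same name-dot-value naming for wide columns. This is only a sketch of the resulting shape on made-up values, not the tidybayes implementation.

```r
# Made-up long-format samples of b[i,v] with i in 1:2 and v in 1:3.
long <- data.frame(
  .chain = 1L,
  .iteration = rep(1:2, each = 6),
  i = rep(rep(1:2, each = 3), times = 2),
  v = rep(1:3, times = 4),
  b = (1:12) / 10
)

# Spread the values of b across columns named "b.<v>", keeping one row
# per combination of .chain, .iteration and i.
wide <- reshape(
  long,
  idvar = c(".chain", ".iteration", "i"),
  timevar = "v",
  direction = "wide"
)

print(wide)
```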
An optional clause in the form | wide_index can also be used to put the data frame into a wide format based on wide_index. For example, this:
spread_samples(fit, b[i,v] | v)
is roughly equivalent to this:
spread_samples(fit, b[i,v]) %>% spread(v,b)
The main difference between the | syntax and the .. syntax is that the | syntax respects prototypes applied to indices with recover_types, and thus can be used to get columns with nicer names. For example:
fit %>% recover_types(data) %>% spread_samples(b[i,v] | v)
would return a grouped data frame (grouped by i), with:
column ".chain": the chain number
column ".iteration": the iteration number
column "i": value in 1:100
column "a": value of "b[i,1]" for iteration number ".iteration" on chain number ".chain"
column "b": value of "b[i,2]" for iteration number ".iteration" on chain number ".chain"
column "c": value of "b[i,3]" for iteration number ".iteration" on chain number ".chain"
Finally, parameter names can be regular expressions by setting regex = TRUE; e.g.:
spread_samples(fit, `b_.*`[i], regex = TRUE)
would return a tidy data frame with all parameters whose names start with `b_` and that have exactly one index.
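The selection rule (name matches the regular expression, and the number of indices matches) can be illustrated with base R pattern matching on made-up column names. The way the pattern is assembled here is an assumption for illustration; tidybayes may construct its pattern differently.

```r
# Made-up flat column names from a sampler.
col_names <- c("b_age[1]", "b_age[2]", "b_sex[1]", "b_int[1,2]", "sigma")

# The user-supplied name pattern, as in `b_.*`[i].
name_regex <- "b_.*"

# Anchor the pattern and require exactly one index inside the brackets:
# one or more characters that are neither "]" nor ",".
full_regex <- paste0("^(", name_regex, ")\\[[^],]+\\]$")

matched <- grepl(full_regex, col_names)
print(col_names[matched])
```

Here "b_int[1,2]" is excluded because it has two indices, and "sigma" is excluded because its name does not match the pattern.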
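library(magrittr)
library(ggplot2)
library(tidybayes)

data(RankCorr, package = "tidybayes")

# One variable, with its indices as columns
RankCorr %>%
  spread_samples(b[i, j])

# Several variables at once, one column per variable
RankCorr %>%
  spread_samples(b[i, j], tau[i], u_tau[i])

# The same variables in long format ("term" and "estimate" columns)
RankCorr %>%
  gather_samples(b[i, j], tau[i], u_tau[i])

# Mean and quantile interval of the gathered samples
RankCorr %>%
  gather_samples(tau[i], typical_r) %>%
  mean_qi()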