baggr: Bayesian aggregate treatment effects model

Description

Bayesian inference on parameters of an average treatment effects model that's appropriate to the supplied individual- or group-level data, using Hamiltonian Monte Carlo in Stan. (For overall package help file see baggr-package)

Usage

baggr(
  data,
  model = NULL,
  pooling = c("partial", "none", "full"),
  effect_label = NULL,
  covariates = c(),
  prior_hypermean = NULL,
  prior_hypersd = NULL,
  prior_hypercor = NULL,
  prior_beta = NULL,
  prior_cluster = NULL,
  prior_control = NULL,
  prior_control_sd = NULL,
  prior_selection = NULL,
  prior_sigma = NULL,
  prior = NULL,
  ppd = FALSE,
  pooling_control = c("none", "partial", "remove"),
  test_data = NULL,
  quantiles = seq(0.05, 0.95, 0.1),
  outcome = "outcome",
  group = "group",
  treatment = "treatment",
  cluster = NULL,
  selection = NULL,
  silent = FALSE,
  warn = TRUE,
  ...
)

Value

baggr class structure: a list including Stan model fit alongside input data, pooling metrics, various model properties. If test data is used, mean value of -2*lpd is reported as mean_lpd

Arguments

data: data frame with summary or individual level data to meta-analyse; see Details section for how to format your data
model: if NULL, detected automatically from input data otherwise choose from rubin, rubin_full, logit, mutau (see Details).
pooling: Type of pooling; choose from "none", "partial" (default) and "full". If you are not familiar with the terms, consult the vignette; "partial" can be understood as random effects and "full" as fixed effects
effect_label: How to label the effect(s). These labels are used in various print and plot outputs. Will default to "mean" in most models, "log OR" in logistic model etc. If you plan on comparing models (see baggr_compare), use the same labels.
covariates: Character vector with column names in data. The corresponding columns are used as covariates (fixed effects) in the meta-regression model (in case of aggregate data). In the case of individual level data the model does not differentiate between group-level variables (same values of the covariate for all rows related to a given group) and individual-level covariates.
prior_hypermean: prior distribution for hypermean; you can use "plain text" notation like prior_hypermean=normal(0,100) or uniform(-10, 10). See Details:Priors section below for more possible specifications. If unspecified, the priors will be derived automatically based on data (and printed out in the console).
prior_hypersd: prior for hyper-standard deviation, used by Rubin and "mutau" models; same rules apply as for _hypermean;
prior_hypercor: prior for hypercorrelation matrix, used by the "mutau" model
prior_beta: prior for regression coefficients if covariates are specified; will default to experimental normal(0, 10^2) distribution
prior_cluster: priors for SDs of cluster random effects in each study (i.e. assuming normal(0, sigma_k^2), with different sigma in each group)
prior_control: prior for the mean in the control arm (baseline), currently used in "logit" model only; if pooling_control = "partial", the prior is hyperprior for all baselines, if "none", then it is an independent prior for all baselines
prior_control_sd: prior for the SD in the control arm (baseline), currently used in "logit" model only; this can only be used if pooling_control = "partial"
prior_selection: prior for the log of relative publication probability (see Selection section below and argument selection)
prior_sigma: prior for error terms in linear regression models ("rubin_full" or "mutau_full")
prior: alternative way to specify all of the priors above as a single named list, with hypermean, hypersd, hypercor, beta etc., with names dropping prior_ prefix, e.g. prior = list(hypermean = normal(0,10), beta = uniform(-5, 5))
ppd: logical; use prior predictive distribution? (p.p.d.) If ppd=TRUE, Stan model will sample from the prior distribution(s) and ignore data in inference. However, data argument might still be used to infer the correct model (if model=NULL) and to set the default priors, therefore you must specify it.
pooling_control: Pooling for group-specific control mean terms in models using individual-level data. Typically we use either "none" or "partial", but if you want to remove the group-specific intercept altogether, set this to "remove".
test_data: data for cross-validation; NULL for no validation, otherwise a data frame with the same columns as data argument. See "Cross-validation" section below.
quantiles: if model = "quantiles", a vector indicating which quantiles of data to use (with values between 0 and 1)
outcome: column name in data (used in individual-level only) with outcome variable values
group: column name in data with grouping factor; it's necessary for individual-level data, for summarised data it will be used as labels for groups when displaying results
treatment: column name in (individual-level) data with treatment factor;
cluster: optional; column name in (individual-level) data; if defined, random cluster effects will be fitted in each study
selection: optional vector of z-values which implements symmetrical selection (relative probability of publication) model; most commonly you'd set this to 1.96 (p=0.05) of c(1.96, 2.58) (p=0.05 and p=0.01)
silent: Whether to silence messages about prior settings and about other automatic behaviour.
warn: print an additional warning if Rhat exceeds 1.05
...: extra options passed to Stan function, e.g. control = list(adapt_delta = 0.99), number of iterations etc.

Details

Below we briefly discuss 1/ data preparation, 2/ choice of model, 3/ choice of priors. All three are discussed in more depth in the package vignette, vignette("baggr").

Data. For aggregate data models you need a data frame with columns tau and se (Rubin model) or tau, mu, se.tau, se.mu ("mu & tau" model). An additional column can be used to provide labels for each group (by default column group is used if available, but this can be customised -- see the example below). For individual level data three columns are needed: outcome, treatment, group. These are identified by using the outcome, treatment and group arguments.

Many data preparation steps can be done through a helper function prepare_ma. It can convert individual to summary-level data, calculate odds/risk ratios (with/without corrections) in binary data, standardise variables and more. Using it will automatically format data inputs to work with baggr().

Models. Available models are:

for the continuous variable means: "rubin" model for average treatment effect (using summary data), "mutau" version which takes into account means of control groups (also using summary data), "rubin_full", which is the same model as "rubin" but works with individual-level data
for binary data: "logit" model can be used on individual-level data; you can also analyse continuous statistics such as log odds ratios and logs risk ratios using the models listed above; see vignette("baggr_binary") for tutorial with examples

If no model is specified, the function tries to infer the appropriate model automatically. Additionally, the user must specify type of pooling. The default is always partial pooling.

Covariates. Both aggregate and individual-level data can include extra columns, given by covariates argument (specified as a character vector of column names) to be used in regression models. We also refer to impact of these covariates as fixed effects.

Two types of covariates may be present in your data:

In "rubin" and "mutau" models, covariates that change according to group unit. In that case, the model accounting for the group covariates is a meta-regression model. It can be modelled on summary-level data.
In "logit" and "rubin_full" models, covariates that change according to individual unit. Then, such a model is often referred to as a mixed model. It has to be fitted to individual-level data. Note that meta-regression is a special case of a mixed model for individual-level data.

Priors. It is optional to specify priors yourself, as the package will try propose an appropriate prior for the input data if you do not pass a prior argument. To set the priors yourself, use prior_ arguments. For specifying many priors at once (or re-using between models), a single prior = list(...) argument can be used instead. Meaning of the prior parameters may slightly change from model to model. Details and examples are given in vignette("baggr"). Setting ppd=TRUE can be used to obtain prior predictive distributions, which is useful for understanding the prior assumptions, especially useful in conjunction with effect_plot. You can also baggr_compare different priors by setting baggr_compare(..., compare="prior").

Cross-validation. When test_data are specified, an extra parameter, the log predictive density, will be returned by the model. (The fitted model itself is the same regardless of whether there are test_data.) To understand this parameter, see documentation of loocv, a function that can be used to assess out of sample prediction of the model using all available data. If using individual-level data model, test_data should only include treatment arms of the groups of interest. (This is because in cross-validation we are not typically interested in the model's ability to fit heterogeneity in control arms, but only heterogeneity in treatment arms.) For using aggregate level data, there is no such restriction.

Selection model. If the selection argument is not NULL, baggr() fits a symmetric selection-on-z-values model (currently only in the "rubin" summary-data model). The values in selection are cut-points on |z| = |tau / se|; for example, selection = c(1.96, 2.58) gives three intervals, [0, 1.96), [1.96, 2.58), and [2.58, Inf). Note how inequality works: 1.96 is "already" treated as more significant than 1.95. (You should also consider defining it to more significant digits in large datasets.) Each interval has its own relative publication probability (weight), with the highest-|z| interval normalised to 1, and these weights are estimated jointly with the usual Rubin parameters.

The selection model assumes that publication depends only on |z| (not on the sign or other study features). Inference can be very sensitive to the choice of cut-points and priors on the selection weights, which you should set manually. For more complex cases you should consider using other methods.

Outputs. By default, some outputs are printed. There is also a plot method for baggr objects which you can access via baggr_plot (or simply plot()). Other standard functions for working with baggr object are

treatment_effect for distribution of hyperparameters
group_effects for distributions of group-specific parameters (alias: study_effects, we use the two interchangeably)
fixed_effects for coefficients in (meta-)regression
effect_draw and effect_plot for posterior predictive distributions
baggr_compare for comparing multiple baggr models
loocv for cross-validation

Examples

Run this code

df_pooled <- data.frame("tau" = c(1, -1, .5, -.5, .7, -.7, 1.3, -1.3),
                        "se" = rep(1, 8),
                        "state" = datasets::state.name[1:8])
baggr(df_pooled) #baggr automatically detects the input data
# same model, but with correct labels,
# different pooling & passing some options to Stan
baggr(df_pooled, group = "state", pooling = "full", iter = 500)
# model with non-default (and very informative) priors

baggr(df_pooled, prior_hypersd = normal(0, 2))

# \donttest{
# "mu & tau" model, using a built-in dataset
# prepare_ma() can summarise individual-level data
ms <- microcredit_simplified
microcredit_summary_data <- prepare_ma(ms, outcome = "consumption")
baggr(microcredit_summary_data, model = "mutau",
      iter = 500, #this is just for illustration -- don't set it this low normally!
      pooling = "partial", prior_hypercor = lkj(1),
      prior_hypersd = normal(0,10),
      prior_hypermean = multinormal(c(0,0),matrix(c(10,3,3,10),2,2)))
# }

Run the code above in your browser using DataLab