infer_data_and_decision_model: Sample the Posterior of the data and decision model and generate point estimates

Description

Function to sample the posterior of the Bayesian data and decision model. It first produces the needed inputs for Stan's sampling for each peptide (or protein, PTM, etc.). It then runs the sampling for the data and decision model. From the posterior, it then collects estimates and sampling statistics from the posterior of data model and integrates the decision distribution D. It then returns a tibble with all the information for each peptide's posterior (see Value below). There are major time gains to be made by running this procedure in parallel. infer_data_and_decision_model has an efficient wrapper around multidplyr. This will let you to evenly distribute all peptides evenly to each worker. E.g., two workers will each run half of the peptides in parallel.

Usage

infer_data_and_decision_model(
  data,
  id_col,
  design_matrix,
  contrast_matrix,
  uncertainty_matrix,
  stan_model = empirical_bayes,
  clusters = 1,
  h_not = 0,
  auxiliary_columns = c(),
  ...
)

Value

A tibble or data.frame() annotated with statistics of the posterior and p(error). err is the probability of error, i.e., the two tail-density supporting the null-hypothesis, lfc is the estimated $\log_2$-fold change, sigma is the common variance, and lp is the log-posterior. Columns without suffix shows the mean estimate from the posterior, while the suffixes _025, _50, and _975, are the 2.5, 50.0, and 97.5, percentiles, respectively. The suffixes _eff and _rhat are the diagnostic variables returned by Stan. In general, a larger _eff

indicates effective sample size and _rhat indicates the mixing within chains and between the chains and should be smaller than 1.05 (please see the Stan manual for more details). In addition it returns a column warnings with potential warnings given by the sampler.

Arguments

data: A tibble or data.frame with alpha,beta priors annotated
id_col: A character for the name of the column containing the name of the features in data (e.g., peptides, proteins, etc.)
design_matrix: A design matrix for the data. For the empirical_bayes prior only mean models are allowed (see example). For the weakly_informative prior more complicated design can be used.
contrast_matrix: A contrast matrix of the decisions to test. Columns should sum to 0 and only mean comparisons are allowed. That is, the absolute value of the positive and negative values in each column has to sum to 2. E.g., a column can be [0.5, 0.5, -1]$^T$ but not [1, 1, -1]$^T$ or [0.5, 0.5, -2]$^T$. That is, sum(abs(x))=2 where x is a column in the contrast matrix.
uncertainty_matrix: A matrix of observation specific uncertainties
stan_model: Which Bayesian model to use. Defaults to empirical_bayes but also allows weakly_informative, or an user supplied function.
clusters: The number of parallel threads/workers to run on.
h_not: The value of the null hypothesis for the difference in means
auxiliary_columns: Names of columns in the design matrix that does not have corresponding data in the data set. For example, this can be co-founding variables such as bashes.
...: Additional arguments passed to sampling. Note that verbose will always be forced to FALSE to avoid console flooding.

Details

The data model of Baldur assumes that the observations of a peptide, $\boldsymbol{Y}$, is a normally distributed with a standard deviation, $\sigma$, common to all measurements. In addition, it assumes that each measurement has a unique uncertainty $u$. It then models all measurements in the same condition with a common mean $\mu$. It then assumes that the common variation of the peptide is caused by the variation in the $\mu$ As such, it models $\mu$ with the common variance $\sigma$ and a non-centered parametrization for condition level effects. $$ \boldsymbol{Y}\sim\mathcal{N}(\boldsymbol{X}\boldsymbol{\mu},\sigma\boldsymbol{u})\quad \boldsymbol{\mu}\sim\mathcal{N}(\mu_0+\boldsymbol{\eta}\sigma,\sigma) $$ It then assumes $\sigma$ to be gamma distributed with hyperparameters infered from either a gamma regression fit_gamma_regression or a latent gamma mixture regression fit_lgmr. $$\sigma\sim\Gamma(\alpha,\beta)$$ For details on the two priors for $\mu_0$ see empirical_bayes or weakly_informative. Baldur then builds a posterior distribution of the difference(s) in means for contrasts of interest. In addition, Baldur increases the precision of the decision as the number of measurements increase. This is done by weighting the sample size with the contrast matrix. To this end, Baldur limits the possible contrast of interest such that each column must sum to zero, and the absolute value of each column must sum to two. That is, only mean comparisons are allowed. $$ \boldsymbol{D}\sim\mathcal{N}(\boldsymbol{\mu}^\text{T}\boldsymbol{K},\sigma\boldsymbol{\xi}),\quad \xi_{i}=\sqrt{\sum_{c=1}^{C}|k_{cm}|n_c^{-1}} $$ where $\boldsymbol{K}$ is a contrast matrix of interest and $k_{cm}$ is the contrast of the c:th condition in the m:th contrast of interest, and $n_c$ is the number of measurements in the c:th condition. Baldur then integrates the tails of $\boldsymbol{D}$ to determine the probability of error. $$P(\text{\textbf{error}})=2\Phi(-\left|\boldsymbol{\mu}_{\boldsymbol{D}}-H_0\right|\odot\boldsymbol{\tau}_{\boldsymbol{D}})$$ where $H_0$ is the null hypothesis for the difference in means, $\Phi$ is the CDF of the standard normal, $\boldsymbol{\mu}_{\boldsymbol{D}}$ is the posterior mean of $\boldsymbol{D}$, $\boldsymbol{\tau}_{\boldsymbol{D}}$ is the posterior precision of $\boldsymbol{D}$, and $\odot$ is the Hadamard product.

Examples

Run this code

# (Please see the vignette for a tutorial)
# Setup model matrix
design <- model.matrix(~ 0 + factor(rep(1:2, each = 3)))
colnames(design) <- paste0("ng", c(50, 100))

yeast_norm <- yeast %>%
# Remove missing data
  tidyr::drop_na() %>%
  # Normalize data
  psrn('identifier') %>%
  # Add mean-variance trends
  calculate_mean_sd_trends(design)

# Fit the gamma regression
gam <- fit_gamma_regression(yeast_norm, sd ~ mean)

# Estimate each data point's uncertainty
unc <- estimate_uncertainty(gam, yeast_norm, 'identifier', design)

yeast_norm <- gam %>%
   # Add hyper-priors for sigma
   estimate_gamma_hyperparameters(yeast_norm)
# Setup contrast matrix
contrast <- matrix(c(-1, 1), 2)
# \donttest{
yeast_norm %>%
  head() %>% # Just running a few for the example
  infer_data_and_decision_model(
    'identifier',
    design,
    contrast,
    unc,
    clusters = 1 # I highly recommend increasing the number of parallel workers/clusters
                 # this will greatly reduce the time of running baldur
  )
# }

Run the code above in your browser using DataLab