Baldur
Baldur is a hierarchical Bayesian model for the analysis of proteomics data. By leveraging empirical Bayes methods, Baldur estimates hyperparameters for variance and measurement-specific uncertainty. It then computes the posterior difference in means between conditions for each peptide, protein, or PTM, and integrates the posterior to estimate error probabilities.
Features
- Hierarchical Bayesian modeling of proteomics data
- Empirical Bayes estimation of variance and uncertainty
- Posterior probability calculations for differential analysis
- Supports peptide, protein, and PTM data
Installation
Install the stable release from CRAN:
install.packages('baldur')
Or, install the development version from GitHub (after installing rstan):
Follow the instructions for installing rstan:
https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started
Then:
devtools::install_github('PhilipBerg/baldur', build_vignettes = TRUE)
Note:
- On Ubuntu, pandoc may be needed to build vignettes.
- On Windows, the development version of rstan is sometimes required.
Usage
For detailed examples, see the package vignettes:
vignette('baldur_yeast_tutorial')
vignette('baldur_ups_tutorial')
Main Modeling Work and Equations
Baldur implements a hierarchical Bayesian framework for label-free proteomics quantification, designed to robustly estimate differential abundance while accounting for the mean-variance relationship in mass spectrometry data. For exact details please see the original paper.
1. Observation Model
For each feature (peptide, protein, or PTM) $i$ in sample $j$, the observed intensity $y_{ij}$ is modeled as:
$$y_{ij}\sim\text{Normal}(\mu_{j},\sigma u_{ij})$$
$$\mu_{j}\sim\text{Normal}(\mu_{0j}+\eta_j\sigma,\sigma)$$
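The two sampling statements above can be read as a generative recipe. The following is a minimal sketch for a single feature in one condition, written in Python with only the standard library and independent of the package's R interface. The function name and all numeric values are hypothetical, and $\eta$ is drawn from the $\text{Normal}(0,1)$ prior listed under Priors.

```python
import random

def simulate_feature(mu0, sigma, u, seed=0):
    """Draw intensities for one feature in one condition (illustrative only).

    mu0   : location of the prior on the condition mean (made-up value)
    sigma : common standard deviation of the feature
    u     : per-measurement uncertainty multipliers, one per replicate
    """
    rng = random.Random(seed)
    eta = rng.gauss(0.0, 1.0)                      # eta ~ Normal(0, 1)
    mu = rng.gauss(mu0 + eta * sigma, sigma)       # mu ~ Normal(mu0 + eta*sigma, sigma)
    y = [rng.gauss(mu, sigma * u_i) for u_i in u]  # y_i ~ Normal(mu, sigma*u_i)
    return mu, y

mu, y = simulate_feature(mu0=20.0, sigma=0.3, u=[1.0, 1.5, 0.8])
```

Measurements with a larger $u_{ij}$ are drawn with a wider spread around the condition mean, which is how measurement-specific uncertainty enters the model.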
2. Mean-Variance Modeling
Gamma Regression
The measurement standard deviation $s_{j}$ is not constant, but depends on the mean intensity. This relationship is modeled with gamma regression: $$s_{j} \sim \Gamma(\alpha, \frac{\alpha}{\beta(\bar{y}_j)})$$ where:
- $\alpha$: shape parameter (estimated empirically)
- $\beta(\bar{y}_j)$: expected standard deviation as a function of peptide/protein mean intensity; the gamma rate is $\alpha/\beta(\bar{y}_j)$
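As a short worked step, assuming the second argument of $\Gamma(\cdot,\cdot)$ is a rate (the shape-rate convention), the moments of $s_j$ follow directly from the standard gamma formulas:

$$\mathbb{E}[s_j]=\frac{\alpha}{\alpha/\beta(\bar{y}_j)}=\beta(\bar{y}_j),\qquad \text{Var}[s_j]=\frac{\alpha}{\left(\alpha/\beta(\bar{y}_j)\right)^2}=\frac{\beta(\bar{y}_j)^2}{\alpha}$$

Under that assumption the regression function sets the expected standard deviation, while $\alpha$ alone controls the relative spread around the trend, since the coefficient of variation is $1/\sqrt{\alpha}$.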
Latent Gamma Mixture Regression
Regression Function
For each observation, the expected mean-variance relationship is modeled as:
$$\beta_i=\kappa\cdot\exp(\theta_i\cdot(I_L-S_Lx_i))+\exp(I-S\bar{y}_i)$$
where:
- $S, S_L$: slope parameters (common and latent)
- $I, I_L$: intercepts (common and latent)
- $\bar{y}_i$: mean intensity of feature $i$
- $\theta_i$: feature-specific mixture parameter
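To make the shape of the regression function concrete, here is a minimal Python sketch (standard library only, not the package's code). The formula writes $x_i$ in the latent term and $\bar{y}_i$ in the common term, so the sketch keeps them as separate arguments. The value of $\kappa$ is not given here and is passed in as an input; every number in the usage lines is made up.

```python
import math

def expected_sd(x, y_bar, theta, I, S, I_L, S_L, kappa):
    """Evaluate beta_i = kappa*exp(theta*(I_L - S_L*x)) + exp(I - S*y_bar)."""
    latent = kappa * math.exp(theta * (I_L - S_L * x))  # feature-specific latent trend
    common = math.exp(I - S * y_bar)                    # trend shared by all features
    return latent + common

# Hypothetical parameter values, chosen only to show the effect of theta.
params = dict(x=0.5, y_bar=0.5, I=-1.0, S=0.4, I_L=2.0, S_L=0.3, kappa=0.01)
beta_low = expected_sd(theta=0.0, **params)   # latent term reduces to kappa
beta_high = expected_sd(theta=1.0, **params)  # latent term fully switched on
```

With $\theta_i=0$ the latent term collapses to the constant $\kappa$, so $\theta_i$ acts as a dial that moves a feature between the common trend and the latent trend.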
Likelihood
Given the expected mean-variance, the observed standard deviation $\sigma_i$ is modeled as: $$\sigma_i\sim\Gamma(\alpha,\frac{\alpha}{\beta_i})$$ where:
- $\alpha$: gamma shape parameter
- $\beta_i$: expected mean-variance for observation $i$
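The likelihood above can be sketched as a plain log-density, assuming the gamma is parameterized by shape and rate. This Python version uses only the standard library; the function names are hypothetical and it is not how the package evaluates the model.

```python
import math

def gamma_logpdf(s, alpha, beta):
    """Log density of Gamma(shape=alpha, rate=alpha/beta) at s > 0."""
    rate = alpha / beta
    return (alpha * math.log(rate) - math.lgamma(alpha)
            + (alpha - 1.0) * math.log(s) - rate * s)

def sd_log_likelihood(sds, betas, alpha):
    """Sum the log densities of observed standard deviations given their trend values."""
    return sum(gamma_logpdf(s, alpha, b) for s, b in zip(sds, betas))
```

For a fixed $\alpha$, the log-likelihood rewards trend values $\beta_i$ that sit close to the observed standard deviations, which is what drives the fit of the regression parameters.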
Priors
- $\alpha\sim\text{Cauchy}(0,25)$
- $\eta\sim\text{Normal}(0,1)$
- $I_L\sim\text{SkewNormal}(2,15,35)$
- $\theta_i\sim\text{Uniform}(0,1)$
NRMSE (Model Fit Metric)
The normalized root-mean-square error (NRMSE) is calculated for model diagnostics.
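The normalization used is not spelled out here, so the following Python sketch assumes one common convention: the root-mean-square error divided by the mean of the observed values. Other conventions (for example, dividing by the range) would give different numbers; the function name is hypothetical.

```python
import math

def nrmse(observed, predicted):
    """RMSE between observed and predicted values, divided by the observed mean.

    ASSUMPTION: normalization by the mean; this is not confirmed by the text above.
    """
    n = len(observed)
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    return rmse / (sum(observed) / n)
```

A value near zero means the fitted trend reproduces the observed standard deviations closely relative to their typical size.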
3. Hierarchical Modeling of Condition Means
Empirical Bayes Prior
- What is it?
A prior that is estimated from your actual data, rather than set by hand.
- How does it work?
- Baldur looks at the spread and center of observed means across features.
- It sets the mean hyper-prior for each group to match the average observed mean, and its uncertainty to match the variability in your data.
- Why use it?
- Strength: Provides "shrinkage" toward realistic values, reducing noise and false positives, especially with small sample sizes.
- Mathematical form:
$$\mu_{0j}\sim\text{Normal}(\bar{y}_j,\sigma n_R)$$
- Here, $\bar{y}_j$ is the estimated mean for group $j$, and $\sigma n_R$ is the estimated standard deviation for the prior.
Weakly Informative Prior
- What is it?
A broad, generic prior that doesn't make strong assumptions; it's like saying "I have no idea what the mean should be, but it's probably not infinite."
- How does it work?
- The prior mean is set to zero for all groups.
- The uncertainty (standard deviation) is set to a large value (e.g., $10$), meaning the model treats almost any value as plausible.
- Why use it?
- Strength: Maximizes flexibility and lets the data speak for itself, at the cost of potentially adding more noise or less stability if data is limited.
- Mathematical form:
$$\mu_{0j}\sim\text{Normal}(0,10)$$
- For group $j$, the mean is $0$ and the standard deviation is $10$.
4. Differential Abundance
For differential analysis, Baldur estimates the posterior distribution of the difference in means between conditions: $$\boldsymbol{D}\sim\mathcal{N}(\boldsymbol{\mu}^\text{T}\boldsymbol{K},\sigma\boldsymbol{\xi}),\quad \xi_{m}=\sqrt{\sum_{i=1}^{C}\frac{|k_{im}|}{n_i}}$$
where:
- $\boldsymbol{K}$: contrast matrix
- $k_{im}$: contrast coefficient for condition $i$ in contrast $m$
- $n_i$: number of samples in condition $i$
- $\boldsymbol{\xi}$: scaling factor for each contrast
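The scaling factor $\xi_m$ is simple to compute by hand. A minimal Python sketch (standard library only) follows; the contrast matrix layout, with one row per condition and one column per contrast, and the sample sizes are hypothetical choices for illustration.

```python
import math

def contrast_scaling(K, n):
    """Return xi_m = sqrt(sum_i |k_im| / n_i) for every contrast m.

    K : list of rows, K[i][m] is the coefficient of condition i in contrast m
    n : number of samples in each condition
    """
    n_contrasts = len(K[0])
    return [
        math.sqrt(sum(abs(K[i][m]) / n[i] for i in range(len(n))))
        for m in range(n_contrasts)
    ]

# Three conditions with 3, 3 and 4 samples; two contrasts: 1 vs 2 and 1 vs 3.
K = [[1, 1],
     [-1, 0],
     [0, -1]]
xi = contrast_scaling(K, n=[3, 3, 4])
# xi[0] = sqrt(1/3 + 1/3), xi[1] = sqrt(1/3 + 1/4)
```

Conditions with more samples contribute less to $\xi_m$, so contrasts between well-replicated conditions get a tighter posterior for the difference.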
The probability of error for contrast $c$ is then:
$$P(\mathrm{error}) = 2\Phi(-|\mu_{D_c} - \mu_{h_0}| \odot \tau_{D_c})$$
where:
- $\Phi$: cumulative distribution function (CDF) of the standard normal
- $\mu_{h_0}$: null hypothesis mean (often zero)
- $\boldsymbol{\tau}_{\boldsymbol{D}}$: precision (inverse standard deviation) for each contrast
- $\odot$: element-wise multiplication
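The error probability is a two-sided normal tail area and can be evaluated with the complementary error function. This Python sketch uses only the standard library; the function names are hypothetical and the inputs stand in for posterior summaries that the fitted model would provide.

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF, Phi(z), via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def error_probability(mu_d, tau_d, mu_h0=0.0):
    """Element-wise 2 * Phi(-|mu_D - mu_h0| * tau_D) over contrasts."""
    return [
        2.0 * std_normal_cdf(-abs(m - mu_h0) * t)
        for m, t in zip(mu_d, tau_d)
    ]

# Made-up posterior means and precisions (inverse standard deviations).
p = error_probability(mu_d=[0.0, 1.2, -0.4], tau_d=[2.0, 3.0, 1.0])
```

A posterior difference centered exactly on the null value gives a probability of 1, and the probability shrinks toward 0 as the difference grows relative to its posterior standard deviation.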
Summary:
Baldur combines hierarchical modeling, mean-variance trend estimation via gamma regression, and empirical Bayes to robustly quantify differential abundance and propagate uncertainty from individual measurements to protein/PTM level, outputting interpretable error probabilities for each feature.
For full details, see the reference publication.
Reference
Berg, Philip, and George Popescu.
“Baldur: Bayesian Hierarchical Modeling for Label-Free Proteomics with Gamma Regressing Mean-Variance Trends.”
Molecular & Cellular Proteomics (2023): 2023-12.
https://doi.org/10.1016/j.mcpro.2023.100658