Baldur has the option to use the Latent Gamma Regression Model (LGMR). This model attempts to estimate local Mean-Variance (M-V) trend for each peptide (or protein, PTM, etc.). To this end is describes the M-V trend as a regression with a common link function and a latent one. While there may not be a latent trend in real data, it allows for a mathematical description that can estimate the local trends of each peptide. The model describes the standard deviation (S) as a gamma random variable with mean dependency described by the sample mean (\(\bar{y}\)): $$\boldsymbol{S}\sim\Gamma(\alpha,\beta),\quad\boldsymbol{\beta}=\frac{\alpha}{\boldsymbol{\mu}}$$ $$\boldsymbol{\mu}=\exp(\gamma_0-\gamma_{\bar{y}}\,f(\boldsymbol{\bar{y}})) + \kappa \exp(\boldsymbol{\theta}(\gamma_{0L}-\gamma_{\bar{y}L}\,f(\boldsymbol{\bar{y}})),\quad f(\boldsymbol{x})=\frac{\boldsymbol{x}-\mu_{\bar{y}}}{\sigma_{\bar{y}}}$$ Here, \(\exp(\gamma_0-\gamma_{\bar{y}}f(\boldsymbol{\bar{y}}))\) is the common trend of the regression. Then, the mixing variable, \(\boldsymbol{\theta}\), describes the proportion of the mixture that each peptide has, and \(\kappa\) is just some small constant such that when \(\theta\) is zero the latent term is small.
A stanmodel that can be used in fit_lgmr.
The Stan code for this model is given by:
lgmr_model
S4 class stanmodel 'lgmr_model' coded as follows:
functions {
// mu
vector reg_function(
vector x,
vector theta,
real I,
real I_L,
real S,
real S_L,
int N
) {
vector[N] exp_beta = .001*exp(theta .* (I_L - S_L*x));
exp_beta += exp(I - S*x);
return exp_beta;
}
}
data {
int<lower=0> N; // Number of observations
vector<lower=0>[N] y; // sd
vector[N] x; // mean
}
transformed data {
real v_y = variance(y); // for NRMSE
vector[N] x_star = (x - mean(x))/sd(x); // f(y_bar)
vector[3] lb = [0, 0, negative_infinity()]'; // Boundaries normal coefs.
}
parameters {
real<lower = 0> alpha; // Shape
vector<lower = lb>[3] eta; // For S, S_L, I
vector<lower = 0, upper = 1>[N] theta; // Mixture parameter
real I_L; // Intercept Latent
}
transformed parameters {
real<lower = 0> S = eta[1]; // Slope common
real<lower = 0> S_L = eta[2]; // Slope Latent
real I = eta[3]; // Intercep common
}
model {
//Priors
alpha ~ cauchy(0, 25);
eta ~ std_normal();
I_L ~ skew_normal(2, 15, 35);
theta ~ beta(1, 1);
{
vector[N] exp_beta = reg_function(x_star, theta, I, I_L, S, S_L, N);
// Likelihood
y ~ gamma(alpha, alpha ./ exp_beta);
}
}
generated quantities {
// NRMSE calculations
real nrmse;
{
vector[N] se = reg_function(x_star, theta, I, I_L, S, S_L, N);
se -= y;
se = square(se);
nrmse = mean(se) / v_y;
}
nrmse = sqrt(nrmse);
}
Next we will describe the hierarchical prior setup for the
regression variables.
For \(\alpha\) baldur uses a half-Cauchy as a weakly informative prior:
$$\alpha\sim\text{Half-Cauchy}(25)$$
For the latent contribution to the i:th peptide, \(\theta_i\), has an uniform distribution:
$$\theta_i\sim\mathcal{U}(0, 1)$$ Further, it can be seen that Baldur assumes
that S always decreases (on average) with the sample mean. To this end, both
slopes (\(\gamma_{\bar{y}},\gamma_{\bar{y}L}\)) are defined on the real
positive line. Hence, we used Half-Normal (HN)
distributions to describe the slope parameters: $$\gamma_{\bar{y}}\sim
HN(1)$$ $$\gamma_{\bar{y}L}\sim HN(1)$$ For the intercepts, we assume a
standard normal prior for the common intercept. Then, we use a skew-normal to
model the latent intercept. The reason for this is two-fold, one,
\(\kappa\) will decrease the value of the latent term and, two, to push the
latent trend upwards. The latter reason is so that the latent intercept is
larger than the common and so that the priors prioritize a shift in intercept
over a increase in slope.
For the intercepts, Baldur uses a standard normal prior for the common intercept.
$$\gamma_0\sim\mathcal{N}(0,1)$$
While for the latent trend, it uses a skew-normal (SN) to push up the second
trend and to counteract the shrinkage of \(\kappa\).
$$\gamma_{0L}\sim\mathcal{SN}(2,15,35)$$