Let \(y_i\), \(i=1,\dots,n\) denote observations.
A general mixture of \(K\) distributions from the same
parametric family is given by:
$$y_i \sim \sum_{k=1}^{K}\pi_k p(\cdot|\theta_k)$$
with \(\sum_{k=1}^{K}\pi_k=1\) and \(\pi_k\geq 0\), \(k=1,\dots,K\).
The exact number of components does not have to be known a priori
when using an SFM MCMC approach. Rather, an upper bound is specified for the
number of components and the weights of superfluous components are shrunk
towards zero during estimation. Following Malsiner-Walli et al. (2016),
a symmetric Dirichlet prior is used for the mixture weights:
$$\pi_k \sim \text{Dirichlet}(e_0,\dots,e_0),$$
where a Gamma hyperprior is used on the concentration parameter \(e_0\):
$$e_0 \sim \text{Gamma}\left(a_0, A_0\right).$$
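To illustrate the shrinkage this prior induces, the following sketch (not the package's sampler; the values of \(K\), \(e_0\) and the 1% threshold are arbitrary choices) draws weight vectors from a symmetric Dirichlet and counts how many components receive non-negligible weight:

```python
# Sketch: a small concentration parameter e0 pushes most mixture weights
# towards zero, effectively emptying superfluous components.
import numpy as np

rng = np.random.default_rng(0)
K = 10          # upper bound on the number of components
e0 = 0.01       # small concentration favours sparse weight vectors

pi = rng.dirichlet(np.full(K, e0), size=1000)   # 1000 prior draws
active = (pi > 0.01).sum(axis=1)                # components with weight > 1%

print("average number of 'active' components:", active.mean())
```

With \(e_0\) this small, most prior draws concentrate essentially all weight on one or two components, which is the mechanism that deactivates superfluous components during estimation.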
Mixture of Normal distributions
Normal components take the form:
$$p(y_i|\mu_k,\sigma_k) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left(-\frac{1}{2}\left(\frac{y_i - \mu_k}{\sigma_k}\right)^2\right).$$
Independent conjugate priors are used for \(\mu_k\) and \(\sigma^2_k\)
(see for instance Malsiner-Walli et al. 2016):
$$\mu_k \sim \text{Normal}( \text{b}_0, \text{B}_0),$$
$$\sigma^{-2}_k \sim \text{Gamma}( \text{c}_0, \text{C}_0),$$
$$C_0 \sim \text{Gamma}( \text{g}_0, \text{G}_0).$$
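For concreteness, here is a minimal sketch of the resulting mixture density (the component weights, means and standard deviations below are illustrative values, not defaults of the package):

```python
# Sketch: evaluate the density of a K-component Normal mixture on a grid.
import numpy as np
from scipy.stats import norm

def normal_mixture_pdf(y, pi, mu, sigma):
    """Density of sum_k pi_k * N(mu_k, sigma_k^2) evaluated at y."""
    y = np.asarray(y)[..., None]                 # broadcast over components
    return (pi * norm.pdf(y, loc=mu, scale=sigma)).sum(axis=-1)

pi = np.array([0.5, 0.5])
mu = np.array([-2.0, 2.0])
sigma = np.array([1.0, 1.0])

grid = np.linspace(-6, 6, 1001)
dens = normal_mixture_pdf(grid, pi, mu, sigma)

# Riemann sum of the density over a wide grid, close to 1
print(dens.sum() * (grid[1] - grid[0]))
```

With well-separated means this mixture is bimodal, which is exactly the kind of shape that mode inference targets.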
Mixture of skew-Normal distributions
We use the skew-Normal of Azzalini (1985), which takes the form:
$$p(y_i|\xi_k,\omega_k,\alpha_k) = \frac{1}{\omega_k\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{y_i - \xi_k}{\omega_k}\right)^2\right)\left(1 + \text{erf}\left(\alpha_k \frac{y_i - \xi_k}{\omega_k\sqrt{2}}\right)\right),$$
where \(\xi_k\) is a location parameter, \(\omega_k\) a scale parameter and \(\alpha_k\)
the shape parameter introducing skewness. For Bayesian estimation, we adopt the approach of
Frühwirth-Schnatter and Pyne (2010) and use the following reparameterised random-effect model:
$$z_i \sim TN_{[0,\infty)}(0, 1),$$
$$y_i|(S_i = k) = \xi_k + \psi_k z_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2_k),$$
where \(S_i\) is the latent variable allocating observation \(i\) to a component, and the parameters of the skew-Normal are recovered with
$$\alpha_k = \frac{\psi_k}{\sigma_k}, \qquad \omega^2_k = \sigma^2_k + \psi^2_k.$$
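These recovery formulas, \(\alpha_k = \psi_k/\sigma_k\) and \(\omega^2_k = \sigma^2_k + \psi^2_k\), can be checked by simulating from the random-effect representation; a sketch with arbitrary parameter values, using scipy's skewnorm as the reference:

```python
# Sketch: simulate y = xi + psi*z + eps with z ~ N(0,1) truncated to
# [0, inf) and eps ~ N(0, sigma^2), then compare moments with the
# skew-Normal implied by alpha = psi/sigma and omega^2 = sigma^2 + psi^2.
import numpy as np
from scipy.stats import skewnorm

rng = np.random.default_rng(1)
xi, psi, sigma = 0.0, 2.0, 1.0
n = 200_000

z = np.abs(rng.standard_normal(n))              # draw from TN_[0,inf)(0, 1)
y = xi + psi * z + sigma * rng.standard_normal(n)

alpha = psi / sigma
omega = np.sqrt(sigma**2 + psi**2)

print(y.mean(), skewnorm.mean(a=alpha, loc=xi, scale=omega))
print(y.std(), skewnorm.std(a=alpha, loc=xi, scale=omega))
```

The simulated mean and standard deviation match the analytical moments of the implied skew-Normal up to Monte Carlo error.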
By defining a regressor \(x_i = (1, z_i)'\), the skew-Normal mixture can be treated as a
random-effects model and sampled using standard techniques. We therefore use priors similar to
the Normal mixture model:
$$(\xi_k, \psi_k)' \sim \text{Normal}(\text{b}_0, \text{B}_0),$$
$$\sigma^{-2}_k \sim \text{Gamma}(\text{c}_0, \text{C}_0),$$
$$\text{C}_0 \sim \text{Gamma}( \text{g}_0, \text{G}_0).$$
We set \(\text{b}_0 = (\text{median}(y), 0)'\) and \(\text{B}_0 = \text{diag}(D_\xi, D_\psi)\) with \(D_\xi = D_\psi = 1\).
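As a numerical sanity check, the erf form of the skew-Normal density given at the start of this subsection agrees with scipy.stats.skewnorm (the parameter values below are arbitrary):

```python
# Sketch: the erf form of the skew-Normal density equals scipy's
# skewnorm.pdf, i.e. 2/omega * phi((y-xi)/omega) * Phi(alpha*(y-xi)/omega).
import numpy as np
from scipy.special import erf
from scipy.stats import skewnorm

def skewnorm_pdf_erf(y, xi, omega, alpha):
    z = (y - xi) / omega
    return (np.exp(-0.5 * z**2) / (omega * np.sqrt(2 * np.pi))
            * (1 + erf(alpha * z / np.sqrt(2))))

y = np.linspace(-5, 5, 11)
ours = skewnorm_pdf_erf(y, xi=0.5, omega=1.3, alpha=4.0)
ref = skewnorm.pdf(y, a=4.0, loc=0.5, scale=1.3)

print(np.max(np.abs(ours - ref)))   # agreement to floating-point precision
```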
Mixture of Poisson distributions
Poisson components take the form:
$$p(y_i|\lambda_k) = \frac{1}{y_i!} \, \lambda^{y_i}_k \,\exp(-\lambda_k).$$
The prior for \(\lambda_k\) follows Viallefont et al. (2002):
$$\lambda_k \sim \text{Gamma}(\text{l}_0,\text{L}_0).$$
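A minimal sketch of the resulting Poisson mixture pmf (the weights and rates below are illustrative values chosen to produce two well-separated modes):

```python
# Sketch: pmf of a two-component Poisson mixture evaluated on its support.
import numpy as np
from scipy.stats import poisson

def poisson_mixture_pmf(y, pi, lam):
    y = np.asarray(y)[..., None]                 # broadcast over components
    return (pi * poisson.pmf(y, lam)).sum(axis=-1)

pi = np.array([0.4, 0.6])
lam = np.array([2.0, 15.0])

y = np.arange(0, 40)
pmf = poisson_mixture_pmf(y, pi, lam)

print(pmf.sum())   # close to 1 over a wide enough support
```

With rates 2 and 15 the mixture has modes near each component mean, separated by a trough of low probability.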
Mixture of shifted-Poisson distributions
Shifted-Poisson components take the form
$$p(y_i |\lambda_k, \kappa_k) = \frac{1}{(y_i - \kappa_k)!} \,
\lambda^{y_i - \kappa_k}_k \,\exp(-\lambda_k),$$
where \(\kappa_k\) is a location or shift parameter with a uniform prior; see Cross et al. (2024).
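A sketch of the shifted-Poisson pmf, i.e. the distribution of \(\kappa_k\) plus a Poisson(\(\lambda_k\)) variable (the values of \(\lambda\) and \(\kappa\) below are illustrative):

```python
# Sketch: pmf of a shifted Poisson; all mass lies at or above kappa,
# with the mode near kappa + lambda.
import numpy as np
from scipy.stats import poisson

def shifted_poisson_pmf(y, lam, kappa):
    y = np.asarray(y)
    return np.where(y >= kappa, poisson.pmf(y - kappa, lam), 0.0)

y = np.arange(0, 30)
pmf = shifted_poisson_pmf(y, lam=3.5, kappa=5)

print(pmf.sum(), pmf.argmax())
```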