# Introduction

Image segmentation can be viewed as the task of labelling the observed pixels $\mathbf{y}$ according to a finite set of discrete states $\mathbf{z} \in { 1, \dots, k }$. The hidden \citet{Potts1952} model allows for spatial correlation between neighbouring labels in the form of a Markov random field. The latent labels follow a Gibbs distribution, which is specified in terms of its conditional probabilities: \label{eq:Potts} p(zi | z{\setminus i}, \beta) = \frac{\exp\left{\beta\sum_{i \sim \ell}\delta(zi,z\ell)\right}}{\sum{j=1}^k \exp\left{\beta\sum{i \sim \ell}\delta(j,z\ell)\right}} where $\beta$ is the inverse temperature, $z{\setminus i}$ represents all of the labels except $zi$, $i \sim \ell$ are the neighbouring pixels of $i$, and $\delta(u,v)$ is the Kronecker delta function. Thus, $\sum{i \sim \ell}\delta(zi,z\ell)$ is a count of the neighbours that share the same label.

The observation equation links the latent labels to the corresponding pixel values: \label{eq:obs} p(\mathbf{y} | \mathbf{z}, \boldsymbol\theta) = \prod_{i=1}^n p(y_i | zi, \theta{zi}) where $\theta{j}$ are the parameters that govern the distribution of the pixel values with label $j$. The hidden Potts model can thus be viewed as a spatially-correlated generalisation of the finite mixture model \citep{r}. We assume that the pixels with label $j$ share a common mean $\mu_j$ corrupted by additive Gaussian noise with variance $\sigma_j^2$: $$\label{eq:obs2} y_i | z_i=j, \mu_j, \sigma^2_j \;\sim\; \mathcal{N}\left( \mu_j, \sigma^2_j \right)$$

The Gibbs distribution is a member of the exponential family and so there is a sufficient statistic for this model, as noted by \citet{Grelaud2009}: \label{eq:pottsstat} \mathrm{S}(\mathbf{z}) = \sum{i \sim \ell \in \mathcal{E}} \delta(zi,z\ell) This statistic represents the total number of like neighbour pairs in the image. The likelihood $p(\mathbf{y},\mathbf{z} | \boldsymbol\theta, \beta)$ can therefore be factorised into $p(\mathbf{y} | \mathbf{z}, \boldsymbol\theta) p(\mathrm{S}(\mathbf{z}) | \beta)$, where the second factor does not depend on the observed data, but only on the sufficient statistic. The joint posterior is then: $$\label{eq:joint_post} p(\boldsymbol\theta, \beta, \mathbf{z} | \mathbf{y}) \propto p(\mathbf{y} | \mathbf{z}, \boldsymbol\theta) \pi(\boldsymbol\theta) p(\mathrm{S}(\mathbf{z}) | \beta) \pi(\beta)$$ The conditional distributions $p(\boldsymbol\theta | \mathbf{z}, \mathbf{y})$ and $p(zi | z{\setminus i}, \beta, yi, \boldsymbol\theta{z_i})$ can be simulated using Gibbs sampling, but $p(\beta | \mathbf{y}, \mathbf{z}, \boldsymbol\theta)$ involves an intractable normalising constant $\mathcal{C}(\beta)$: \begin{eqnarray} \label{eq:betapost} p(\beta \mid \mathbf{y}, \mathbf{z}, \boldsymbol\theta) &=& \frac{p(\mathrm{S}(\mathbf{z}) | \beta) \pi(\beta)}{\int\beta p(\mathrm{S}(\mathbf{z}) | \beta) \pi(d \beta)} \ \label{eq:beta} &\propto& \frac{\exp\left{ \beta\, \mathrm{S}(\mathbf{z}) \right}}{\mathcal{C}(\beta)} \pi(\beta) \end{eqnarray} The normalising constant is also known as a partition function in statistical physics. It has computational complexity of $\mathcal{O}(n k^n)$, since it involves a sum over all possible combinations of the labels $\mathbf{z} \in \mathcal{Z}$: $$\label{eq:norm} \mathcal{C}(\beta) = \sum_{\mathbf{z} \in \mathcal{Z}} \exp\left{\beta\, \mathrm{S}(\mathbf{z})\right}$$ It is infeasible to calculate this value exactly for nontrivial images, thus computational approximations are required.

This paper describes the \proglang{r}.

There are a number of contributed packages available for \proglang{r}

# Informative Priors

It can be useful to place informative priors on the parameters of the mixture components

The external field prior \citep{MHHM}

# Algorithms for Intractable Likelihoods

## Pseudolikelihood and Composite Likelihood

Pseudolikelihood is the simplest of the methods that we have implemented and also the fastest. \citet{r}): \label{eq:pseudo} p(\mathrm{S}(\mathbf{z}) | \beta) \approx \prod_{i=1}^n p(zi | z{\setminus i}, \beta) This enables updates for the inverse temperature at iteration $t$ to be simulated using a Metropolis-Hastings (M-H) step, with acceptance ratio: \label{mh:ratio} \rho = \min\left( 1, \frac{p(\mathbf{z}|\beta') \pi(\beta') q(\beta{t-1} | \beta')}{p(\mathbf{z}|\beta{t-1}) \pi(\beta{t-1}) q(\beta' | \beta{t-1})} \right)

The M-H proposal density $q(\beta'|\beta{t-1})$ can be any distribution such that $\int q(\beta'|\beta{t-1})\, d\beta' = 1$. However, there is a tradeoff between exploring the full state space and ensuring that the probability of acceptance is sufficiently high. We use the adaptive random walk (RWMH) algorithm of \citet{Garthwaite2010}, which automatically tunes the bandwidth of the proposal density to target a given M-H acceptance rate. When a symmetric proposal density is used, $q(\beta'|\beta{t-1}) = q(\beta{t-1}|\beta')$ and so this term cancels out in the M-H ratio \eqref{mh:ratio}. Likewise, under a uniform prior for the inverse temperature, $\pi(\beta') = \pi(\beta_{t-1}) = 1$. The natural logarithm of $\rho$ is used in practice to improve numerical stability.

\begin{figure} \centering \begin{subfigure}{0.65\textwidth} \includegraphics[width=\textwidth]{pl_exp_n12k3.eps} \caption{Expectation.} \label{f:pl_exp} \end{subfigure}% \qquad \begin{subfigure}{0.65\textwidth} \includegraphics[width=\textwidth]{pl_sd_n12k3.eps} \caption{Standard deviation.} \label{f:plsd} \end{subfigure}% \caption{Approximation error of pseudolikelihood for $n=12,\,k=3$ in comparison to the exact likelihood calculated using a brute force method: (a) $\sum{\mathbf{z} \in \mathcal{Z}} \mathrm{S}(\mathbf{z}) p(\mathrm{S}(\mathbf{z}) | \beta)$ using either Equation~(\ref{eq:beta}) or (\ref{eq:pseudo}); (b) $\sqrt{\sum{\mathbf{z} \in \mathcal{Z}} \left( \mathrm{S}(\mathbf{z}) - \mathbb{E}{\mathbf{z}|\beta}[\mathrm{S}(\mathbf{z})] \right)^2 p(\mathrm{S}(\mathbf{z}) | \beta)}$} \label{f:pl} \end{figure} Pseudolikelihood is exact when $\beta=0$ and provides a reasonable approximation for small values of the inverse temperature. However, the approximation error increases rapidly for $\beta \ge \beta_{crit}$, as illustrated by Figure \ref{f:pl}. This is due to long-range dependence between the labels, which is inadequately modelled by the local approximation. The implications of this inaccuracy for posterior inference will be demonstrated in Section \ref{s:results}.

\citet{r} is modified to be the product of the blocks: p(\mathbf{z}|\beta) \approx \prod_{i=1}^{NB} p(\mathbf{z}{Bi} | \mathbf{z}{\setminus B_i}, \beta) \label{eq:pl_comp} where $NB$ is the number of blocks, $\mathbf{z}{B_i}$ are the labels of the pixels in block $Bi$, and $\mathbf{z}{\setminus Bi}$ are all of the labels except for $\mathbf{z}{B_i}$. This is a form of composite likelihood, where the likelihood function is approximated as a product of simplified factors \citep{Varin2011}. \citet{Friel2012} compared point pseudolikelihood to composite likelihood with blocks of $3 \times 3$, $4 \times 4$, $5 \times 5$, and $6 \times 6$ pixels. \citeauthor{Friel2012} showed that (\ref{eq:plcomp}) outperformed (\ref{eq:pseudo}) for the Ising ($k=2$) model with $\beta < \beta{crit}$. \citet{Okabayashi2011} discuss composite likelihood for the Potts model with $k > 2$ and have provided an open source implementation in the \proglang{r}.

Evaluating the conditional likelihood in (\ref{eq:plcomp}) involves the normalising constant for $\mathbf{z}{Bi}$, which is a sum over all of the possible configurations $\mathcal{Z}{B_i}$. This is a limiting factor on the size of blocks that can be used. The brute-force method that was used to compute Figure \ref{f:pl} is too computationally intensive for this purpose. \citet{Pettitt2003} showed that the normalising constant can be calculated exactly for a cylindrical lattice by computing eigenvalues of a $k^r \times k^r$ matrix, where $r$ is the smaller of the number of rows or columns. The value of (\ref{eq:norm}) for a free boundary lattice can then be approximated using path sampling. \citet{Friel2004} extended this method to larger lattices using a composite likelihood approach.

The reduced dependence approximation (RDA) is another form of composite likelihood. \citet{r} divided the image lattice into sub-lattices of size $r1 < r$, then approximated the normalising constant of the full lattice using RDA: \mathcal{C}(\beta) \approx \frac{\mathcal{C}{r} \label{eq:rda} \citet{McGrory2009} compared RDA to pseudolikelihood and the exact method of \citet{Moeller2006}, reporting similar computational cost to pseudolikelihood but with improved accuracy in estimating $\beta$. \citet{Ogden2017} \dots

Source code for RDA is available in the online supplementary material for \citet{McGrory2012}.

# Examples

\label{s:results}

# Conclusion

\section*{Acknowledgements}

MTM was supported by the UK EPSRC as part of the i-Like programme grant (ref: EP/K014463/1). KLM was supported by an ARC Laureate Fellowship. KLM and ANP received funding from the ARC Centre of Excellence in Mathematical and Statistical Frontiers (ACEMS). Landsat imagery courtesy of NASA Goddard Space Flight Center and U.S. Geological Survey. Computational resources and services used in this work were provided by the HPC and Research Support Group, Queensland University of Technology, Brisbane, Australia.