boral
1) With covariates and no latent variables, boral fits independent column GLMs such that the columns of $y$ are assumed to be independent; 2) with no covariates, boral fits a purely latent variable model to perform model-based unconstrained ordination (Hui et al., 2014); 3) with both covariates and latent variables, boral fits correlated column GLMs, with latent variables accounting for correlations between the columns of $y$. Model selection is performed using either information criteria, or via Stochastic Search Variable Selection (SSVS; George and McCulloch, 1993).

Usage

boral(y, ...)

## S3 method for class 'default':
boral(y, X = NULL, family, trial.size = NULL, num.lv = 0,
    row.eff = FALSE, n.burnin = 5000, n.iteration = 35000, n.thin = 5,
    save.model = FALSE, seed = 123, calc.ics = TRUE,
    hypparams = c(100,100,100,100), ssvs.index = -1, ...)

## S3 method for class 'boral':
print(x, ...)
Arguments

X: Model matrix of covariates. Defaults to NULL, in which case no model matrix is used. No intercept column should be included in X.

n.burnin, n.iteration, n.thin: Control the length of the burn-in, the total number of MCMC iterations, and the thinning rate. With the default values of n.burnin, n.iteration and n.thin, this leads to a total of (35000 - 5000)/5 = 6000 MCMC samples.

save.model: If TRUE, the raw jags model is returned, so that its MCMC samples can be examined using the coda package.

seed: Seed for the sampler. In boral, a set.seed(seed) command is run immediately before starting the MCMC sampler. Defaults to the value 123.

calc.ics: Should various information criteria values be calculated for the model (see get.measures)? Defaults to TRUE.

hypparams: Vector of four hyperparameters used in setting up the prior distributions (see the section on Estimation below).

ssvs.index: Indices for performing SSVS on the coefficients relating to the covariates in X. Each element can take values of -1 (no SSVS is performed on the corresponding covariate of X), 0 (SSVS is performed on each individual coefficient of the corresponding covariate in X), or any positive integer (SSVS is performed jointly on the group of coefficients of the corresponding covariate in X).

Value

An object of class "boral", containing (among other things) the dimensions of y, the response and model matrix used, distributional assumptions and trial sizes, the number of latent variables, the number of covariates, whether information criteria values were calculated, the hyperparameters used in the Bayesian estimation, the indices for SSVS, and the number of levels for ordinal responses. Components include:

hpdintervals: Highest posterior density intervals for the model parameters; see get.hpdintervals for more details.

ics: If calc.ics = TRUE, a list of different information criteria values for the model, calculated using get.measures. Please see the help file for get.measures regarding details on the criteria. Also, please note the ics returned are based on get.measures with more.measures = FALSE.

jags.model: If save.model = TRUE, the raw jags model fitted is returned. This can be quite large!

Details

Column-specific intercepts are estimated automatically and given by the first column of lv.coefs. Note num.lv > 1 is required to allow flexible modeling of the residual correlation matrix (see the section on correlated column models below). Be warned that MCMC estimation can take a long time if y is large! The calculation of information criteria (calc.ics = TRUE) can also take a while. If save.model = TRUE, the raw jags model is also returned; this can be very memory-consuming, since it indirectly saves all the MCMC samples.

Independent column models: boral
allows explanatory variables to be entered into the model via a model matrix X
. This model matrix can contain anything the user wants, provided factors have been parameterized as dummy variables. It should not include an intercept column.
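For instance, such a model matrix can be built with model.matrix and the intercept column dropped (a minimal sketch; the data frame env and its covariate names are purely illustrative):

## Hypothetical site-level covariates: one continuous, one factor
env <- data.frame(temp = rnorm(28), soil = factor(rep(c("clay","sand"), 14)))
## model.matrix() parameterizes the factor as dummy variables;
## [, -1] drops the intercept column, since boral estimates intercepts itself
X <- model.matrix(~ temp + soil, data = env)[, -1]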
Without latent variables, i.e., num.lv = 0
, boral
fits separate GLMs to each column of $y$, where the columns are assumed to be independent. That is,
$$g(\mu_{ij}) = \alpha_i + \theta_{0j} + \bm{x}'_i\bm{\beta}_j; \quad i = 1,\ldots,n; j = 1,\ldots,p,$$
where the mean response for element $(i,j)$, denoted as $\mu_{ij}$, is regressed against the covariates $\bm{x}_i$ via a link function $g(\cdot)$. $\theta_{0j}$ and $\bm{\beta}_j$ are the column-specific intercepts and coefficients respectively, while $\alpha_i$ is an optional row effect (included if row.eff = TRUE). Fitting the above type of model is thus more or less a Bayesian analog of the manyglm function in the mvabund package (Wang et al., 2013). Unlike manyglm though, different distributional assumptions are allowed for each column of $y$, and row effects can be added as a type of "row-standardization".
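As a point of reference, the corresponding maximum likelihood fit via manyglm might look as follows (a sketch, using the spider dataset from mvabund as in the examples below):

library(mvabund)
data(spider)
## Separate negative binomial GLMs fitted to each species by maximum likelihood
spider.ml <- manyglm(mvabund(spider$abun) ~ spider$x, family = "negative.binomial")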
A not-so-brief tangent on distributions: In the event different responses are collected for different columns, e.g., some columns of $y$ are counts while other columns are presence-absence, one can specify different distributions for each column. Aspects such as variable selection, residual analysis, and plotting of the latent variables are not affected by having different distributions. Naturally though, one has to be more careful with the interpretation of the row effects $\alpha_i$ and latent variables $\bm{z}_i$, given the different scalings arising from the different link functions used. A situation where different distributions may prove useful is when $y$ is a species-traits matrix, where each row is a species and each column is a trait such as specific leaf area, presence of thorns, and so on. In this case, traits could be of different response types, and the goal perhaps is to perform unconstrained ordination to look for patterns between species on an underlying trait surface, e.g., a defense index for a species (Moles et al., 2013; see the discussion below on how to perform model-based unconstrained ordination).
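A mixed-response fit of this kind might be specified by supplying a vector of families, one per column (a sketch with simulated data; the short chains mirror the examples below and are purely for illustration):

## Hypothetical traits matrix: 3 count traits and 2 presence-absence traits
y.mix <- cbind(matrix(rpois(60, lambda = 3), 20, 3),
               matrix(rbinom(40, size = 1, prob = 0.5), 20, 2))
## Binomial columns need a trial size; trial.size = 1 gives presence-absence
fit.mix <- boral(y.mix, family = c(rep("poisson", 3), rep("binomial", 2)),
                 trial.size = 1, num.lv = 2, row.eff = FALSE,
                 n.burnin = 10, n.iteration = 100, n.thin = 1, calc.ics = FALSE)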
For multivariate abundance data in ecology (also known as community composition data; Legendre and Gallagher, 2001), species counts are often overdispersed. Using a negative binomial distribution (family = "negative.binomial"
) to model the counts usually helps to account for this overdispersion. For non-negative continuous data such as biomass, the lognormal and tweedie distributions may be used (Foster and Bravington, 2013). Note however that a common power parameter is used for tweedie columns -- there is almost always insufficient information to model column-specific power parameters. Normal responses are also implemented, just in case you encounter some normal stuff in ecology (pun intended)!
The beta distribution can be used to model data with values between, but not including, 0 and 1. In principle, this would make it quite useful for percent cover data in ecology, if it were not for the fact that percent cover data is commonly characterized by having lots of zeros (which are not permitted in beta regression). An ad-hoc fix would be to add a very small value to pull the data away from exact zeros and ones. This is however quite heuristic, and pulls the model towards producing conservative results (see Smithson and Verkuilen, 2006, for a detailed discussion of beta regression, and Korhonen et al., 2007, for an example of an application to forest canopy cover data). Note the parameterization of the beta distribution used here is directly in terms of the mean $\mu$ and the sample size (referred to here as the dispersion parameter) $\phi$. In terms of the two shape parameters, this is equivalent to $shape1 = a = \mu\phi$ and $shape2 = b = (1-\mu)\phi$.
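The correspondence between the two parameterizations can be checked directly in R (a minimal sketch):

## Mean-dispersion parameterization: shape1 = mu*phi, shape2 = (1-mu)*phi
mu <- 0.3; phi <- 10
x <- rbeta(100000, shape1 = mu * phi, shape2 = (1 - mu) * phi)
mean(x) ## approximately 0.3 = mu
## Ad-hoc fix for percent cover: squeeze exact 0s and 1s into (0,1) before
## fitting, where y.prop is a hypothetical matrix of proportions
## y.prop <- pmin(pmax(y.prop, 1e-3), 1 - 1e-3)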
For ordinal response columns, cumulative logit regression is used (also known as proportional odds regression, Agresti, 2010). boral
assumes all ordinal columns are measured using the same scale i.e., all columns have the same number of theoretical levels. The number of levels is then assumed to be given by the maximum value from all the ordinal columns of $y$. Thus, the parameterization of proportional odds regression is such that all ordinal columns have the same cutoff points, $\bm{\tau}$, while the column-specific intercept effect, $\theta_{0j}$, allows for deviations away from these common cutoffs. That is,
$$logit(P(y_{ij} \le k)) = \tau_k + \theta_{0j} + \ldots,$$
where $logit(\cdot)$ is the logit function, $P(y_{ij} \le k)$ is the cumulative probability of element $y_{ij}$ being less than or equal to level $k$, $\tau_k$ is the cutoff linking levels $k$ and $k+1$ (and which are increasing in $k$), $\theta_{0j}$ are the column effects, and "$\ldots$" generically represents what else is included in the model e.g., latent variables and related coefficients. A sum-to-zero constraint is imposed on the $\theta_{0j}$'s of all ordinal columns, to ensure model identifiability.
The parameterization above is useful for modeling multivariate abundance data in ecology. When ordinal responses are recorded, usually the same ordinal scale is applied to all species e.g., level 1 = not there, level 2 = a bit there, level 3 = lots there, level 4 = everywhere! The quantity $\tau_k$ can be interpreted as this common ordinal scale, while $\theta_{0j}$ allows for deviations away from these to account for differences in species prevalence. Admittedly, the current implementation of boral
for ordinal data can be quite slow.
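A sketch of an ordinal fit, assuming counts have been collapsed onto a common 1-4 scale (the cut points below are arbitrary, and the short chains are purely for illustration):

library(mvabund)
data(spider)
## Collapse counts onto a common ordinal scale: 1 = absent, ..., 4 = abundant
y.ord <- matrix(cut(as.matrix(spider$abun), breaks = c(-Inf, 0, 5, 20, Inf),
                    labels = FALSE), nrow(spider$abun), ncol(spider$abun))
fit.ord <- boral(y.ord, family = "ordinal", num.lv = 2, row.eff = FALSE,
                 n.burnin = 10, n.iteration = 100, n.thin = 1, calc.ics = FALSE)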
Purely latent variable models: If no explanatory variables are included and num.lv
> 0, boral
fits a purely latent variable model to perform model-based unconstrained ordination (Hui et al., 2014),
$$g(\mu_{ij}) = \alpha_i + \theta_{0j} + \bm{z}'_i\bm{\theta}_j,$$
where instead of measured covariates, we now have a vector of latent variables $\bm{z}_i$ with $\bm{\theta}_j$ being the column-specific coefficients relating to these latent variables. Unconstrained ordination is used for visualizing multivariate data in a low-dimensional space, without reference to explanatory variables (see Chapter 9, Legendre and Legendre, 2012). Typically, num.lv
= 1 or 2 latent variables are used, allowing the latent variables to be plotted (using lvsplot
, for instance). The resulting plot can be interpreted in the same manner as plots from Nonmetric Multi-dimensional Scaling (NMDS; Kruskal, 1964) and Correspondence Analysis (CA; Hill, 1974). For instance, with multivariate abundance data, unconstrained ordination is used to visualize the relationships between sites in terms of species abundance or composition. The column-specific intercept, $\theta_{0j}$, accounts for differences in species prevalence, while the row effect, $\alpha_i$, is included to account for differences in site total abundance. The inclusion of the latter ensures the ordination is in terms of species composition. If $\alpha_i$ is omitted from the model, i.e., row.eff = FALSE
, the ordination will be in terms of relative species abundance.
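To illustrate, the two flavors of ordination differ only in row.eff (a sketch using the spider data, with short chains as in the examples below):

library(mvabund)
data(spider)
## Row effects included: ordination of species composition
fit.comp <- boral(spider$abun, family = "negative.binomial", num.lv = 2,
                  row.eff = TRUE, n.burnin = 10, n.iteration = 100,
                  n.thin = 1, calc.ics = FALSE)
## Row effects omitted: ordination of relative species abundance
fit.abund <- boral(spider$abun, family = "negative.binomial", num.lv = 2,
                   row.eff = FALSE, n.burnin = 10, n.iteration = 100,
                   n.thin = 1, calc.ics = FALSE)
lvsplot(fit.comp); lvsplot(fit.abund)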
Correlated column models: When one or more latent variables are included in conjunction with covariates i.e., X
is non-empty and num.lv
> 0, boral
fits separate GLMs to each column of $y$ while allowing for any potential correlation between columns via the latent variables. This is useful for multivariate abundance data in ecology, where a separate GLM is fitted to each species (modeling its response against environmental covariates, say), while accounting for the fact that species at a site are correlated. The model fitted takes the following form,
$$g(\mu_{ij}) = \theta_{0j} + \bm{x}'_i\bm{\beta}_j + \bm{z}'_i\bm{\theta}_j,$$
which we can see is a mash-up of the first two types of models. The linear predictor $\bm{z}'_i\bm{\theta}_j$ induces a "residual covariance/correlation" between the columns of $y$, which is of rank num.lv
. For multivariate abundance data, this leads to a parsimonious method of accounting for correlation between species that is not due to shared environmental responses. After fitting the model, the residual correlation matrix can be obtained using the get.residual.cor
function. Note num.lv > 1
is necessary in order to flexibly model the residual correlations (see also Pollock et al., 2014, for residual correlation matrices in the context of Joint Species Distribution Models).
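A sketch of obtaining the residual correlation matrix (assuming get.residual.cor accepts the fitted object; depending on the package version it may instead require the saved MCMC samples, hence save.model = TRUE here):

library(mvabund)
data(spider)
## Correlated column model: covariates plus two latent variables
fit.cc <- boral(spider$abun, X = spider$x, family = "negative.binomial",
                num.lv = 2, row.eff = FALSE, save.model = TRUE,
                n.burnin = 10, n.iteration = 100, n.thin = 1, calc.ics = FALSE)
## Rank-2 residual correlation matrix between species
res.cor <- get.residual.cor(fit.cc)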
Estimation: For boral
models, estimation is performed using Bayesian Markov Chain Monte Carlo (MCMC) methods via JAGS (Plummer, 2003). By default, uninformative priors are used for all parameters. That is, normal priors with mean zero and variance given by hypparams[1]
are assigned to all intercepts, coefficients relating to latent variables, and cutoffs for ordinal responses. Normal priors with mean zero and variance given by hypparams[2]
are assigned to all coefficients relating to covariates in X
. For the negative binomial, normal, lognormal, and tweedie distributions, uniform priors from 0 to hypparams[3] are used on the dispersion parameters.
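For example, tighter priors can be requested by changing hypparams (a sketch; the values here are arbitrary):

library(mvabund)
data(spider)
## Normal priors with variance 20 on intercepts and coefficients, and
## Uniform(0, 20) priors on the dispersion parameters
fit.tight <- boral(spider$abun, family = "negative.binomial", num.lv = 2,
                   hypparams = c(20, 20, 20, 20), n.burnin = 10,
                   n.iteration = 100, n.thin = 1, calc.ics = FALSE)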
Using information criteria at your own risk: Using information criteria from calc.ics = TRUE for model selection should be done with extreme caution, for two reasons: 1) the implementation of some of these criteria is heuristic and experimental; 2) deciding what model to fit should also be driven by the science. For example, it may be the case that a criterion suggests a model with 3 or 4 latent variables is more appropriate. However, if we are interested in visualizing the data for ordination purposes, then models with 1 or 2 latent variables are more appropriate. As another example, whether or not we include row effects when ordinating multivariate abundance data depends on whether we are interested in differences between sites in terms of relative species abundance (row.eff = FALSE) or species composition (row.eff = TRUE).
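With those caveats in mind, a sketch of comparing the number of latent variables via information criteria (the ics component is as described in the Value section above; far longer chains than these would be needed in practice):

library(mvabund)
data(spider)
fit.lv1 <- boral(spider$abun, family = "negative.binomial", num.lv = 1,
                 n.burnin = 10, n.iteration = 100, n.thin = 1, calc.ics = TRUE)
fit.lv2 <- boral(spider$abun, family = "negative.binomial", num.lv = 2,
                 n.burnin = 10, n.iteration = 100, n.thin = 1, calc.ics = TRUE)
## Compare the criteria, but let the science have the final say
fit.lv1$ics; fit.lv2$ics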
SSVS: As an alternative to using information criterion for model selection, Stochastic Search Variable Selection (SSVS, George and McCulloch, 1993) is also implemented for the column-specific coefficients relating to the model matrix X
, $\bm{\beta}_j$. Basically, SSVS works by placing spike-and-slab priors on these coefficients, such that the spike is a narrow normal distribution concentrated around zero and the slab is a normal distribution with a large variance:
$$\rho(\beta) = I_{\beta = 1}\times\mathcal{N}(0,\sigma^2) + (1-I_{\beta = 1})\times \mathcal{N}(0,0.0001\sigma^2),$$
where $\sigma^2$ is determined by hypparams[2]
(see the section on Estimation above), and $I_{\beta = 1}$ is an indicator function representing whether the coefficient is included in the model; it is given a Bernoulli prior with probability of inclusion 0.5. After fitting, the posterior probability of $\beta$ being included in the model is returned, based on the posterior mean of the indicator function $I_{\beta = 1}$. Note this is NOT the same as a p-value as seen in maximum likelihood estimation: a p-value provides an indication of how much evidence there is against the null hypothesis of $\beta = 0$, while the posterior probability provides a (direct) measure of how likely it is that $\beta \ne 0$ given the data.
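The shapes of the slab and spike under the default hypparams[2] = 100 can be visualized directly (a minimal sketch):

## Slab: N(0, sigma^2); spike: N(0, 0.0001 * sigma^2), with sigma^2 = 100
sigma2 <- 100
curve(dnorm(x, mean = 0, sd = sqrt(sigma2)), from = -5, to = 5,
      ylab = "Prior density") ## slab: nearly flat over this range
curve(dnorm(x, mean = 0, sd = sqrt(1e-4 * sigma2)), add = TRUE, lty = 2) ## spike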
In boral
, SSVS can be applied at a grouped or individual coefficient level, and this is governed by ssvs.index
. For elements of ssvs.index
equal to -1, SSVS is not applied on the corresponding covariate of the model matrix X
. For elements equal to 0, SSVS is applied to each individual coefficient of the corresponding covariate in X
. That is, the fitted model will return $p$ posterior probabilities for this covariate, one for each column of $\bm{y}$. For elements taking positive integers {1,2,...}, SSVS is applied to each group of coefficients of the corresponding covariate in X
. That is, the fitted model will return a single posterior probability for this covariate, indicating whether this covariate should be included for all columns of $\bm{y}$. Please see O'Hara and Sillanpaa (2009) for a discussion of Bayesian variable selection methods.
Note the last application of SSVS allows multiple covariates to be tested simultaneously. For example, suppose X
consists of five columns -- the first two columns are environmental covariates, while the last three correspond to quadratic terms of the two covariates as well as their interaction. If we want to "test" whether any quadratic terms are required, then we can set ssvs.index = c(-1,-1,1,1,1)
, so a single posterior probability of inclusion is returned for the last three columns of X
.
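A sketch of this grouped test, with a hypothetical model matrix built from two covariates plus their quadratic and interaction terms (short chains purely for illustration):

library(mvabund)
data(spider)
x1 <- rnorm(nrow(spider$abun)); x2 <- rnorm(nrow(spider$abun)) ## hypothetical covariates
X.quad <- cbind(x1, x2, x1^2, x2^2, x1 * x2)
## A single posterior probability of inclusion for the last three columns
fit.quad <- boral(spider$abun, X = X.quad, family = "negative.binomial",
                  num.lv = 0, row.eff = FALSE, ssvs.index = c(-1, -1, 1, 1, 1),
                  n.burnin = 10, n.iteration = 100, n.thin = 1, calc.ics = FALSE)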
Finally, note that the use of information criteria (and possibly residual analysis) in the presence of variable selection performed via SSVS is questionable, and so it may be advisable to separate their use, e.g., choose the explanatory variables first using SSVS, and then use information criteria to select the number of latent variables.
See Also

lvsplot for a scatter plot of the latent variables when num.lv = 1 or 2, summary.boral for a summary of the fitted boral model, get.measures and get.more.measures for information criteria from the fitted boral model, and get.residual.cor for calculating the residual correlation matrix.

Examples

library(mvabund) ## Load a dataset from the mvabund package
data(spider)
y <- spider$abun
n <- nrow(y); p <- ncol(y);
## Example 1 - model with two latent variables, site effects,
## and no environmental covariates
spider.fit.nb <- boral(y, family = "negative.binomial", num.lv = 2,
row.eff = TRUE, n.burnin = 10, n.iteration = 100,
n.thin = 1, calc.ics = FALSE)
summary(spider.fit.nb)
plot(spider.fit.nb) ## Plots used in residual analysis,
## Used to check if assumptions such as the mean-variance relationship
## are adequately satisfied.
lvsplot(spider.fit.nb) ## Plots a 2-D scatterplot of the latent variables,
## which can be interpreted in the same manner as an ordination plot.
## Example 2 - model where the first six columns are assumed to be Poisson,
## and the second six columns are assumed to be negative binomial...why not?
spider.fit.mix <- boral(y, family = rep(c("poisson","negative.binomial"),each=p/2),
num.lv = 2, row.eff = TRUE, n.burnin = 10, n.iteration = 100,
n.thin = 1, calc.ics = FALSE)
summary(spider.fit.mix)
## Example 3 - model with no latent variables, no site effects,
## and environmental covariates
spider.fit.nb <- boral(y, X = spider$x, family = "negative.binomial", num.lv = 0,
row.eff = FALSE, n.burnin = 10, n.iteration = 100, n.thin = 1,
calc.ics = FALSE)
summary(spider.fit.nb)
## The results can be compared with the default example from
## the manyglm() function in mvabund. Hopefully they are similar =D
## Example 4 - extend example 3 with grouped covariate selection
spider.fit.nb2 <- boral(y, X = spider$x, family = "negative.binomial", num.lv = 0,
row.eff = FALSE, n.burnin = 10, n.iteration = 100, n.thin = 1,
calc.ics = FALSE, ssvs.index = c(-1,-1,-1,1,2,3))
summary(spider.fit.nb2)
## Example 5 - extend example 3 with individual and grouped covariate selection
spider.fit.nb2 <- boral(y, X = spider$x, family = "negative.binomial", num.lv = 0,
row.eff = FALSE, n.burnin = 10, n.iteration = 100, n.thin = 1,
calc.ics = FALSE, ssvs.index = c(-1,-1,-1,1,0,0))