boral

Three "types" of models may be fitted: 1) With covariates and no latent variables, boral fits independent column GLMs such that the columns of $y$ are assumed to be independent; 2) With no covariates, boral fits a purely latent variable model to perform model-based unconstrained ordination (Hui et al., 2014); 3) With covariates and latent variables, boral fits correlated column GLMs, with latent variables accounting for correlations between the columns of $y$.

boral(y, ...)

## S3 method for class 'default':
boral(y, X = NULL, family, trial.size = NULL, num.lv = 0, site.eff = FALSE,
    n.burnin = 4000, n.iteration = 24000, n.thin = 5, save.model = FALSE,
    seed = 123, calc.ics = TRUE, hypparams = c(100,100), ...)

## S3 method for class 'boral':
print(x, ...)
Arguments: X is the model matrix of covariates and defaults to NULL, in which case no model matrix is used. No intercept column should be included in X. With the default values of n.burnin, n.iteration and n.thin, this leads to a total of (n.iteration - n.burnin)/n.thin = (24000 - 4000)/5 = 4000 MCMC samples. For seed, a set.seed(seed) command is run immediately before starting the MCMC sampler; defaults to the value 123. calc.ics controls whether information criteria are calculated for the model (see get.measures); defaults to TRUE.

Value: An object of class "boral" containing, among other things: highest posterior density intervals for the parameters (see get.hpdintervals for more details); if calc.ics = TRUE, a vector of different information criteria values for the model (please see get.measures for details on the criteria); if save.model = TRUE, the raw jags model fitted, including the MCMC samples as objects from the coda package (this can be quite large!); as well as the response y and model matrix used, distributional assumptions and trial sizes, number of latent variables, the number of covariates, whether information criteria values were calculated, hyper-parameters used in the Bayesian estimation, and the number of levels for ordinal responses.

Warnings: No intercept column should be included in X; column-specific intercepts are estimated automatically and given by the first column of lv.coefs. num.lv > 1 is needed to allow flexible modeling of the residual correlation matrix. The MCMC sampling can take a long time if y is large! The calculation of information criteria (calc.ics = TRUE) can also take a while. If save.model = TRUE, the raw jags model is also returned; this can be very memory-consuming, since it indirectly saves all the MCMC samples.

Independent column models: boral
also allows explanatory variables to be entered into the model via a model matrix X
. This model matrix can contain anything the user wants, provided factors have been parameterized as dummy variables. It should not include an intercept column.
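As a minimal sketch of what such a model matrix might look like (the covariate names here, soil.dry and substrate, are made up purely for illustration), model.matrix() handles the dummy-variable coding and the intercept column can simply be dropped afterwards:

## Hypothetical covariates: one numeric variable and one factor
covX <- data.frame(soil.dry = rnorm(10),
    substrate = factor(rep(c("bare", "litter"), 5)))
## model.matrix() expands the factor into dummy variables;
## drop the first (intercept) column before passing it to boral as X
X <- model.matrix(~ soil.dry + substrate, data = covX)[, -1]
head(X)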
Without latent variables i.e., num.lv = 0
, boral
fits separate GLMs to each column of $y$, where the columns are assumed to be independent. That is,
$$g(\mu_{ij}) = \alpha_i + \theta_{0j} + \bm{x}'_i\bm{\beta}_j; \quad i = 1,\ldots,n; j = 1,\ldots,p,$$
where the mean response for element $(i,j)$, denoted as $\mu_{ij}$, is regressed against the covariates $\bm{x}_i$ via a link function $g(\cdot)$. $\theta_{0j}$ and $\bm{\beta}_j$ are the column-specific intercepts and regression coefficients respectively, while $\alpha_i$ is an optional row effect (if site.eff = TRUE). Fitting the above type of model is thus more or less a Bayesian analog of the manyglm function in the mvabund package (Wang et al., 2013). Unlike manyglm though, different distributional assumptions are allowed for each column of $y$, and row effects can be added as a type of "row-standardization".
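As a rough sketch of this analogy (using the spider dataset from mvabund, with MCMC settings kept deliberately tiny for illustration; see also Example 3 below), the two fits might be set up as follows:

library(mvabund)
data(spider)
## Bayesian independent column GLMs: covariates, no latent variables, no row effects
fit.ind <- boral(spider$abun, X = spider$x, family = "negative.binomial",
    num.lv = 0, site.eff = FALSE, n.burnin = 10, n.iteration = 100,
    n.thin = 1, calc.ics = FALSE)
## Frequentist counterpart via manyglm() in mvabund
fit.glm <- manyglm(mvabund(spider$abun) ~ ., data = data.frame(spider$x),
    family = "negative.binomial")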
A not-so-brief tangent on distributions: In the event different response types are collected for different columns, e.g., some columns of $y$ are counts while other columns are presence-absence, one can specify different distributions for each column. Aspects such as variable selection, residual analysis, and plotting of the latent variables are not affected by having different distributions. Naturally though, one has to be more careful with interpretation of the row effects $\alpha_i$ and latent variables $\bm{z}_i$, due to the different scalings implied by the different link functions used. A situation where different distributions may prove useful is when $y$ is a species-traits matrix, where each row is a species and each column is a trait such as specific leaf area, presence of thorns, and so on. In this case, traits could be of different response types, and the goal perhaps is to perform unconstrained ordination to look for patterns between species on an underlying trait surface, e.g., a defense index for a species (Moles et al., 2013; see the discussion below on how to perform model-based unconstrained ordination).
For multivariate abundance data in ecology (also known as community composition data, Legendre and Gallagher, 2001), species counts are often overdispersed. Using a negative binomial distribution (family = "negative.binomial"
) to model the counts usually helps to account for this overdispersion. For non-negative continuous data such as biomass, the lognormal and tweedie distributions may be used (Foster and Bravington, 2013). Note though that a common power parameter is used for all tweedie columns -- there is almost always insufficient information to model column-specific power parameters. Normal responses are also implemented, just in case you encounter some normal stuff in ecology (pun intended)!
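As a hedged sketch of mixing response types (the data here are simulated, the split into count and presence-absence columns is made up purely for illustration, and trial.size = 1 is assumed to flag presence-absence for the binomial columns), the family argument simply takes one entry per column of y:

## Simulated y: first 4 columns are counts, last 4 are presence-absence
set.seed(1)
y.mix <- cbind(matrix(rpois(40, lambda = 3), 10, 4),
    matrix(rbinom(40, size = 1, prob = 0.5), 10, 4))
fit.mix2 <- boral(y.mix,
    family = c(rep("negative.binomial", 4), rep("binomial", 4)),
    trial.size = 1, num.lv = 2, site.eff = FALSE,
    n.burnin = 10, n.iteration = 100, n.thin = 1, calc.ics = FALSE)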
For ordinal response columns, cumulative logit regression is used (also known as proportional odds regression, Agresti, 2010). boral
assumes all ordinal columns are measured using the same scale i.e., all columns have the same number of theoretical levels. The number of levels is then assumed to be given by the maximum value from all the ordinal columns of $y$. Thus, the parameterization of proportional odds regression is such that all ordinal columns have the same cutoff points, $\bm{\tau}$, while the column-specific intercept effect, $\theta_{0j}$, allows for deviations away from these common cutoffs. That is,
$$logit(P(y_{ij} \le k)) = \tau_k + \theta_{0j} + \ldots,$$
where $logit(\cdot)$ is the logit function, $P(y_{ij} \le k)$ is the cumulative probability of element $y_{ij}$ being less than or equal to level $k$, $\tau_k$ is the cutoff linking levels $k$ and $k+1$ (and which are increasing in $k$), $\theta_{0j}$ are the column effects, and "$\ldots$" generically represents what else is included in the model e.g., latent variables and related coefficients. A sum-to-zero constraint is imposed on the $\theta_{0j}$'s of all ordinal columns, to ensure model identifiability.
The parameterization above is useful for modeling multivariate abundance data in ecology. When ordinal responses are recorded, usually the same ordinal scale is applied to all species e.g., level 1 = not there, level 2 = a bit there, level 3 = lots there, level 4 = everywhere! The quantity $\tau_k$ can be interpreted as this common ordinal scale, while $\theta_{0j}$ allows for deviations away from these to account for differences in species prevalence. Admittedly, the current implementation of boral
for ordinal data can be rather slow.
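To make the cutoff parameterization above concrete, here is a small numerical sketch (the cutoffs and column effect are made-up values) showing how the common cutoffs $\tau_k$ and a column-specific effect $\theta_{0j}$ translate into cumulative and level probabilities:

## Made-up common cutoffs for a 4-level ordinal scale (increasing in k), and one column effect
tau <- c(-1, 0.5, 2)
theta0 <- 0.3
cumprob <- plogis(tau + theta0)     ## P(y_ij <= k) for k = 1, 2, 3
levprob <- diff(c(0, cumprob, 1))   ## P(y_ij = k) for k = 1, ..., 4
round(levprob, 3)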
Purely latent variable models: If no explanatory variables are included and num.lv
> 0, boral
fits a purely latent variable model to perform model-based unconstrained ordination (Hui et al., 2014),
$$g(\mu_{ij}) = \alpha_i + \theta_{0j} + \bm{z}'_i\bm{\theta}_j,$$
where instead of measured covariates, we now have a vector of latent variables $\bm{z}_i$ with $\bm{\theta}_j$ being the column-specific coefficients relating to these latent variables. Unconstrained ordination is used for visualizing multivariate data in a low-dimensional space, without reference to explanatory variables (see Chapter 9, Legendre and Legendre, 2012). Typically, num.lv
= 1 or 2 latent variables are used, allowing the latent variables to be plotted (using lvsplot
, for instance). The resulting plot can be interpreted in the same manner as plots from Nonmetric Multi-dimensional Scaling (NMDS, Kruskal, 1964) and Correspondence Analysis (CA, Hill, 1974). With multivariate abundance data, for instance, unconstrained ordination is used to visualize the relationships between sites in terms of species abundance or composition. The column-specific intercept, $\theta_{0j}$, accounts for differences in species prevalence, while the row effect, $\alpha_i$, is included to account for differences in site total abundance. The inclusion of the latter ensures the ordination is in terms of species composition. If $\alpha_i$ is omitted from the model i.e., site.eff = FALSE
, the ordination will be in terms of species abundance.
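A short sketch of these two types of ordination (species composition with row effects included, versus species abundance with them omitted), again on the spider data and with deliberately short MCMC settings:

library(mvabund)
data(spider)
## Ordination in terms of species composition: row effects absorb differences in site totals
fit.comp <- boral(spider$abun, family = "negative.binomial", num.lv = 2,
    site.eff = TRUE, n.burnin = 10, n.iteration = 100, n.thin = 1, calc.ics = FALSE)
## Ordination in terms of species abundance: no row effects
fit.abund <- boral(spider$abun, family = "negative.binomial", num.lv = 2,
    site.eff = FALSE, n.burnin = 10, n.iteration = 100, n.thin = 1, calc.ics = FALSE)
lvsplot(fit.comp)    ## compare the two ordination plots
lvsplot(fit.abund)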
Correlated column models: When one or more latent variables are included in conjunction with covariates i.e., X
is non-empty and num.lv
> 0, boral
fits separate GLMs to each column of $y$ while allowing for any potential correlation between columns via the latent variables. This is useful for multivariate abundance data in ecology, where a separate GLM is fitted to each species (modeling its response against environmental covariates, say), while accounting for the fact that species at a site are correlated. The model fitted takes the following form,
$$g(\mu_{ij}) = \theta_{0j} + \bm{x}'_i\bm{\beta}_j + \bm{z}'_i\bm{\theta}_j,$$
which we can see is a mash of the first two types of models. The linear predictor $\bm{z}'_i\bm{\theta}_j$ induces a "residual covariance/correlation" between the columns of $y$, which is of rank num.lv
. For multivariate abundance data, this leads to a parsimonious method of accounting for correlation between species not due to the shared environmental responses. After fitting the model, the residual correlation matrix can be obtained using the get.residual.cor
function. Note that num.lv > 1 is necessary in order to flexibly model the residual correlations (see also Pollock et al., 2014, for residual correlation matrices in the context of Joint Species Distribution Models).
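As a hedged sketch of this workflow (assuming get.residual.cor() accepts the fitted boral object directly; MCMC settings again far too short for genuine use):

library(mvabund)
data(spider)
fit.cor <- boral(spider$abun, X = spider$x, family = "negative.binomial",
    num.lv = 2, site.eff = TRUE, n.burnin = 10, n.iteration = 100,
    n.thin = 1, calc.ics = FALSE)
## Residual correlations between species induced by the latent variables
rescor <- get.residual.cor(fit.cor)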
Using information criteria at your own risk: Using information criteria from calc.ics = TRUE
for model selection should be done with extreme caution, for two reasons: 1) The implementation of some of these criteria is heuristic and experimental. 2) Deciding what model to fit should also be driven by the science. For example, it may be the case that the criteria suggest a model with 3 or 4 latent variables is more appropriate. However, if we are interested in visualizing the data for ordination purposes, then models with 1 or 2 latent variables are more appropriate. As another example, whether or not we include row effects when ordinating multivariate abundance data depends on whether we are interested in differences between sites in terms of species abundance (site.eff = FALSE
) or species composition (site.eff = TRUE
).
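If information criteria are requested anyway, a sketch of accessing them might look like the following (the name of the returned component, ics, is an assumption based on the Value description above):

library(mvabund)
data(spider)
fit.ics <- boral(spider$abun, family = "negative.binomial", num.lv = 2,
    site.eff = TRUE, n.burnin = 10, n.iteration = 100, n.thin = 1,
    calc.ics = TRUE)
fit.ics$ics   ## assumed component holding the vector of information criteria values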
Estimation for boral
is performed using Bayesian Markov Chain Monte Carlo (MCMC) methods via JAGS (Plummer, 2003). Uninformative priors are used for all parameters i.e., normal priors with mean zero and variance given by hypparams[1]
for all coefficients, row and column effects, and cutoffs for ordinal responses. For the negative binomial, normal, and tweedie distributions, uniform priors from 0 to $1/hypparams[2]$ are used on $\phi$.
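For example, the hyper-parameters can be varied through the hypparams argument when calling boral; a brief sketch (the values here are arbitrary, chosen only to illustrate passing something other than the defaults):

library(mvabund)
data(spider)
## hypparams[1]: prior variance for coefficients, row/column effects and cutoffs;
## hypparams[2]: enters the uniform priors on the dispersion parameters
fit.priors <- boral(spider$abun, family = "negative.binomial", num.lv = 2,
    site.eff = TRUE, hypparams = c(20, 20), n.burnin = 10, n.iteration = 100,
    n.thin = 1, calc.ics = FALSE)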
lvsplot
for a scatter plot of the latent variables when num.lv = 1
or 2
, summary.boral
for a summary of the fitted boral model, get.measures
and get.more.measures
for information criteria from the fitted boral model, get.residual.cor
for calculating the residual correlation matrix.

library(mvabund) ## Load a dataset from the mvabund package
data(spider)
y <- spider$abun
n <- nrow(y); p <- ncol(y);
## Example 1 - model with two latent variables, site effects,
## and no environmental covariates
spider.fit.nb <- boral(y, family = "negative.binomial", num.lv = 2,
site.eff = TRUE, n.burnin = 10, n.iteration = 100,
n.thin = 1, calc.ics = FALSE)
summary(spider.fit.nb)
plot(spider.fit.nb) ## Plots used in residual analysis,
## Used to check if assumptions such as the mean-variance relationship
## are adequately satisfied.
lvsplot(spider.fit.nb) ## Plots a 2-D scatterplot of the latent variables,
## which can be interpreted in the same manner as an ordination plot.
## Example 2 - model where the first six columns are assumed to be Poisson,
## and the second six columns are assumed to be negative binomial...why not?
spider.fit.mix <- boral(y, family = rep(c("poisson","negative.binomial"),each=p/2),
num.lv = 2, site.eff = TRUE, n.burnin = 10, n.iteration = 100,
n.thin = 1, calc.ics = FALSE)
summary(spider.fit.mix)
## Example 3 - model with no latent variables, no site effects,
## and environmental covariates
spider.fit.nb <- boral(y, X = spider$x, family = "negative.binomial", num.lv = 0,
site.eff = FALSE, n.burnin = 10, n.iteration = 100, n.thin = 1,
calc.ics = FALSE)
summary(spider.fit.nb)
## The results can be compared with the default example from
## the manyglm() function in mvabund. Hopefully they are similar =D