DImulti: DImulti

Description

A function to fit Diversity-Interactions models to data with multiple ecosystem functions and/or multiple time points

Usage

DImulti(
  y,
  eco_func = c("NA", "NA"),
  time = c("NA", "NA"),
  unit_IDs,
  prop,
  data,
  DImodel,
  FG = NULL,
  ID = NULL,
  extra_fixed = NULL,
  estimate_theta = FALSE,
  theta = 1,
  method,
  method_theta = "univariate"
)

Value

DImulti - a custom class object containing the gls model fit with additional DI model attributes

Arguments

y: If the dataset is in a wide format, this argument is a vector of k column names identifying the ecosystem function values recorded from each experimental unit in the dataset. For example, if the ecosystem function columns are labelled Y1 to Y4, then y = c("Y1","Y2","Y3","Y4"). Alternatively, the column numbers can be specified, for example, y = 8:11, where ecosystem function values are in the 8th to 11th columns. If the dataset is in the long/stacked format, this argument is the column name of the response value vector, for example, y = "yield", alternatively, the column number can be supplied, y = 8
eco_func: A vector of size two, with the first index holding the column name containing the character or factor indicator of which response was recorded (for use when stacked/long data passed through parameter y), otherwise it is the string "NA", case insensitive. The second index should contain a string referring to the autocorrelation structure of the responses (for use in multivariate data case), options include "un" for unstructured/general and "cs" for compound symmetry. For example, eco_func = c("Function", "cs"), pertaining to multivariate data in a stacked format with a compound symmetry structure, or eco_func = c("na", "un"), meaning either univariate or wide multivariate data with an unstructured/general variance covariance matrix
time: A vector of size two, with the first index holding the column name containing the repeated measures identifier (i.e., indicating which time point the corresponding response was recorded at), otherwise it is the string "NA", case insensitive. The second index should contain a string referring to the autocorrelation structure of the repeated measures, options include "un" for unstructured/general, "cs" for compound symmetry, and "ar1" for an autoregressive model of order 1 (AR(1)). For example, time = c("reading", "ar1"), pertaining to repeated measures data with multiple readings on each unit with an AR(1) structure, or time = c("na", "un"), meaning multivariate data with no repeated measures (the correlation structure will be ignored)
unit_IDs: A vector of columns names/indices containing identifiers for the experimental units from which the observations (either multiple readings of the same response or a single reading of multiple responses) are taken, e.g., unit_IDs = "Plot"
prop: A vector of S column names identifying the species proportions in each community in the dataset. For example, if the species proportions columns are labelled p1 to p4, then prop = c("p1","p2","p3","p4"). Alternatively, the column numbers can be specified, for example, prop = 4:7, where species proportions are in the 4th to 7th columns
data: A dataframe or tibble containing all previously input columns
DImodel: A string, referring to the interaction structure of model to be fit from the full list: "STR", "ID", "FULL", "E", "AV", "ADD", "FG"
FG: If species are classified by g functional groups, this argument takes a string vector (of length S) of the functional group to which each species referenced in the prop argument belongs. For example, for four grassland species with two grasses and two legumes: FG could be FG = c("G","G","L","L"), where G stands for grass and L stands for legume. This argument is optional but must be supplied if the "FG" interaction structure is specified in the DImodel argument
ID: A text list (of length s) describing groupings for the identity of the effects of the species. These groupings will constrain some of the identity effects to be equal. For example, if there are four species and you wish to have two identity effect groups where species 1 and 3 and species 2 and 4 are grouped together: ID could be ID = c("ID1", "ID2", "ID1", "ID2"), where "ID1" and "ID2" are the names of the ID groups. This changes the identity component of the model from 'beta_1p_1 + beta_2p_2 + beta_3p_3 + beta_4p_4' to 'beta_a(p_1 + p_3) + beta_b(p_2 + p_4)'. This grouping does not affect the interaction terms.
extra_fixed: A formula expression for any additional fixed effect terms. For example, extra_fixed = ~ Treatment or extra_fixed = ~ Treatment + Density. Interactions across the regular formula can easily be included by beginning this argument with 1* or 1:, e.g. extra_fixed = ~ 1:Density. Any interactions included through this parameter will not be affected by theta. Any terms included will automatically be crossed with the column(s) specified through eco_func and/or time, therefore these interactions do not need to be specified here.
estimate_theta: By default, $\theta$ (the power parameter on all p_i * p_j components of each interaction variable in the model) is set equal to one. Specify estimate_theta = TRUE to include the estimation of $\theta$ in the specified model. A value of $\theta$ will be estimated for each ecosystem function present in the data, possibly changing the fixed effects across functions.
theta: Users may specify a value of $\theta$ different than 1 to fit the DI model. Note that if estimate_theta = TRUE, then $\theta$ will be estimated via maximum profile log-likelihood and the value specified for theta will be overridden. Specify a vector of positive non-zero numerical values indicating the value for the non-linear parameter of the model for each ecosystem function (in alphabetical order, use sort() to find this) present in the dataset, changing the fixed effects across functions, or a single value to be used for all. For example, theta = 0.5 or theta = c(1, 0.5, 1).
method: The string "REML" or "ML", referring to the estimation method to be used. "REML" is suitable for model comparisons only if the fixed effects are held constant but provides unbiased estimates. If fixed effects are going to be tested between models, use "ML" for comparisons and then refit the chosen model using "REML".
method_theta: The string "single", "joint", or "univariate", referring to the profiling method to be used. This refers to whether each theta value in a multivariate model is profiled separately (over a line), all at once (over a surface), or using a series of univariate DI models. Defaults to "univariate", which has been currently found to be the more efficient/parsimonious method via simulation study.

Author

Laura Byrne [aut, cre], Rishabh Vishwakarma [aut], Rafael de Andrade Moral [aut], Caroline Brophy [aut]
Maintainer: Laura Byrne byrnel54@tcd.ie

Details

Data

This package is intended for use with data containing multiple ecosystem function responses and/or time points from a biodiversity and ecosystem function (BEF) relationship study. The dataset should contain a column for each species' proportion, so that each row of these columns sum to one. Each row of the data should also contain an identifier for the experimental unit being referenced, these identifiers must be unique to each experimental unit, but remain consistent over time and across functions. For each experimental unit included there must be recordings of either:
- a single community level ecosystem function response variable taken at multiple time points (repeated measures),
- multiple community level ecosystem function responses (multivariate), or
- multiple community level ecosystem function responses taken at multiple time points (multivariate repeated measures).
Ecosystem functions may be stored in a long or wide format while repeated measures must be stored in a long format.
Additionally, by calling the function DImultiApp(), you can interact with the provided R Shiny app, which aids in fitting and visualising multivariate or repeated measures models through a user-friendly interface.

Introduction to multivariate & repeated measures Diversity-Interactions models

Diversity-Interactions (DI) models are a regression based approach to modelling the biodiversity ecosystem function (BEF) relationship which assumes that the main driver behind changes in ecosystem functioning is the initial relative abundance (or proportions) of the species present. These models can be estimated using least squares estimation methods.
An example of a univariate DI model can be seen below,
$$ y = \sum^{S}_{i=1}{\beta_{i} p_{i}} + \sum^{S}_{\substack{i,j=1 \\ i<j}}{\delta_{ij}(p_{i}p_{j})^{\theta} + \alpha A + \epsilon }$$ where the response $y$ represents the recorded ecosystem function, $p_{i}$ represents the initial proportion of the $i^{th}$ species, therefore the $p$ values sum to 1 and form a simplex space, and scales the ID effect of the species, $\beta_{i}$; if no species interactions or treatments are required in the model, the response $y$ is the weighted average of the species identity effects. $S$ represents the number of unique species present in the study. Similarly to the ID effect, the interaction effect, $\delta$, between species is scaled by some combination of the products of species proportions, which depends on the interaction structure chosen. The example above shows the full pairwise structure, which has a unique interaction term, $\delta_{ij}$, per pair of species $i$ & $j$. The nonlinear term $\theta$ (Connolly et al., 2013; Vishwakarma et al., 2023) is included in the model to allow the shape of the BEF relationship to change. This parameter can be estimated using profile log-likelihood optimisation (Brent, 1973) or can be assigned a set value based on an a priori assumption/knowledge. $A$ may include blocks or treatment terms, and $\alpha$ is a vector of the corresponding effect coefficients.
For further details of univariate DI modelling, see ?DImodels, Kirwan et al., 2009, and Moral et al., 2023.

The multivariate DI model (Dooley et al., 2015) extends the DI modelling framework to allow for the estimation of multiple ecosystem functions simultaneously, accounting for any existing covariance between functions through the error term. These models can be further extended through the introduction of repeated measures over multiple time points.
The structure for such models is: $$ y_{kmt} = \sum^{S}_{i=1}{\beta_{ikt} p_{im}} + \sum^{S}_{\substack{i,j=1 \\ i<j}}{\delta_{ijkt}(p_{im}p_{jm})^{\theta_{k}}} + \alpha_{kt}A + \epsilon_{kmt} $$

where $y_{kmt}$ refers to the value of the $k^{th}$ ecosystem function from the $m^{th}$ experimental unit at a time point $t$. For an experimental unit $m$, $\beta_{ikt}$ scaled by $p_{im}$ is the expected contribution of the $i^{th}$ species to the $k^{th}$ response at time point $t$ and is referred to as the $i^{th}$ species' ID effect. The value of the nonlinear parameter $\theta$ is allowed to vary between ecosystem functions, in turn allowing the fixed effect structure to change across functions, in recognition that the nature of the species interactions could change between ecosystem functions.

In the case that a dataset contains only a single ecosystem function, the corresponding subscript $k$ can simply be removed from the equation, the same can be said for the removal of the subscript $t$ in the instance that a dataset contains a single time point.

The structure of the error term

For a univariate DI model, the error term is assumed to follow a normal distribution with mean $0$ and variance $\sigma^{2}$. $$ \epsilon \sim N(0, \sigma^{2})$$

When the model is extended to fit multivariate ($k>1$) and/or repeated measures ($t>1$) data, the error term is now assumed to follow a multivariate normal distribution with mean $0$ and variance $\Sigma^{*}$. $$ \epsilon \sim MVN(0, \Sigma^{*}) $$ $\Sigma^{*}$ is a block diagonal matrix, with one $kt$ x $kt$ block, $\Sigma$, for each experimental unit $m$. We refer to $\Sigma$ as the variance covariance matrix for our ecosystem functions and time points. Typically, it includes a unique variance per combination of ecosystem functions and time points along the diagonal and a unique covariance between each pair of combinations on the off-diagonal. Autocorrelation structures may be implemented on the matrix $\Sigma$, either to simplify the estimation process or based on a priori knowledge. One structure is chosen for the ecosystem functions and another for repeated measures/time points, the two matrices are then estimated independently and combined using the Kronecker product ($\otimes$), a matrix multiplication method. In the case that the data is only multivariate ($k>1$ & $t=1$) or only has repeated measures ($k=1$ & $t>1$), only one autocorrelation structure needs to be chosen, with no multiplication necessary.
Three such structures are currently available in this package for repeated measures responses, and two are available for multivariate responses:

UN: When each element of $\Sigma$ is set to estimate independently, it is said to be unstructured or follow the general structure and is the preferred option in the case that there is no a priori information on the nature of these relationships. corSymm
CS: A simpler structure is compound symmetry, where it is assumed that each ecosystem function or time point has the same variance value $\sigma^{2}$ and each pair has the same covariance value $\sigma^{2}\rho$. This structure is not preferred for use with multiple ecosystem functions as it provides no meaningful interpretation, however it is allowed in this package if the model requires simplification. corCompSymm
AR(1): An autocorrelation structure exclusive to repeated measures data is an autoregressive model of order one, which assumes that, each time point has the same variance, $\sigma^{2}$, and as the distance in pairs of time points increases, the covariance between them changes by a factor of $\rho$. corAR1

References

Vishwakarma, R., Byrne, L., Connolly, J., de Andrade Moral, R. and Brophy, C., 2023.
Estimation of the non-linear parameter in Generalised Diversity-Interactions models is unaffected by change in structure of the interaction terms.
Environmental and Ecological Statistics, 30(3), pp.555-574.

Moral, R.A., Vishwakarma, R., Connolly, J., Byrne, L., Hurley, C., Finn, J.A. and Brophy, C., 2023.
Going beyond richness: Modelling the BEF relationship using species identity, evenness, richness and species interactions via the DImodels R package.
Methods in Ecology and Evolution, 14(9), pp.2250-2258.

Dooley, A., Isbell, F., Kirwan, L., Connolly, J., Finn, J.A. and Brophy, C., 2015.
Testing the effects of diversity on ecosystem multifunctionality using a multivariate model.
Ecology Letters, 18(11), pp.1242-1251.

Finn, J.A., Kirwan, L., Connolly, J., Sebastia, M.T., Helgadottir, A., Baadshaug, O.H., Belanger, G., Black, A., Brophy, C., Collins, R.P. and Cop, J., 2013.
Ecosystem function enhanced by combining four functional types of plant species in intensively managed grassland mixtures: a 3-year continental-scale field experiment.
Journal of Applied Ecology, 50(2), pp.365-375 .

Connolly, J., Bell, T., Bolger, T., Brophy, C., Carnus, T., Finn, J.A., Kirwan, L., Isbell, F., Levine, J., Luscher, A. and Picasso, V., 2013.
An improved model to predict the effects of changing biodiversity levels on ecosystem function.
Journal of Ecology, 101(2), pp.344-355.

Kirwan, L., Connolly, J., Finn, J.A., Brophy, C., Luscher, A., Nyfeler, D. and Sebastia, M.T., 2009.
Diversity-interaction modeling: estimating contributions of species identities and interactions to ecosystem function.
Ecology, 90(8), pp.2032-2038.

Brent, R.P., 1973.
Some efficient algorithms for solving systems of nonlinear equations.
SIAM Journal on Numerical Analysis, 10(2), pp.327-344.

Examples

Run this code


#################################################################################################
#################################################################################################
#################################################################################################
# \dontshow{
## Set up for R markdown for crayon and cli output if user has packages installed

if(requireNamespace("fansi", quietly = TRUE) &
   requireNamespace("crayon", quietly = TRUE) &
   requireNamespace("cli", quietly = TRUE))
{
  options(crayon.enabled = TRUE)
  ansi_aware_handler <- function(x, options)
  {
    paste0(
      "",
      fansi::sgr_to_html(x = x, warn = FALSE, term.cap = "256"),
      ""
    )
  }
  old_hooks <- fansi::set_knit_hooks(knitr::knit_hooks,
                                     which = c("output", "message", "error", "warning"))
  knitr::knit_hooks$set(
    output = ansi_aware_handler,
    message = ansi_aware_handler,
    warning = ansi_aware_handler,
    error = ansi_aware_handler
  )
}
# }

## Modelling Examples

# For a more thorough example of the workflow of this package, please see vignette
# DImulti_workflow using the following code:

# \donttest{
vignette("DImulti_workflow")
# }

## The simMVRM dataset
#
# This simulated dataset contains multiple ecosystem functions (k=3) and multiple time points
# (t=2). The dataset was
# simulated using unstructured Sigma matrices.
# The true values can be found in the help file for the data, ?simMVRM

data(simMVRM)
head(simMVRM)

# We will start the analysis with a call to the package's main function, DImulti().
# We begin the call by specifying the column indices holding the initial species proportions
# (p_i) through 'prop' and the columns which hold the ecosystem response values through 'y'.
# Since our data is multivariate, we include the 'eco_func' parameter, specifying "na" as our
# data is in a wide format (multiple columns in 'y') and "un" to fit an unstructured Sigma
# for our ecosystem functions.
# The data also contains repeated measures, so we include the 'time' parameter, specifying "time"
# as the column containing the time point indicator for each row and "ar1" to fit an AR(1)
# autocorrelation structure for our time points.
# Next, we specify that the experimental unit identifier is in column 1 through 'unit_IDs'. We
# indicate that we do not want to estimate the non-linear parameter theta, but do not provide
# any values, opting for the default value of 1.
# We specify a full pairwise (FULL) interaction structure through 'DImodel' and estimate the
# model using maximum likelihood (ML) as we may compare models.
# Finally, we provide the data object simMVRM through 'data'.

simModel_FULL <- DImulti(prop = 2:5, y = 6:8, eco_func = c("na", "un"), time = c("time", "ar1"),
                         unit_IDs = 1, estimate_theta = FALSE, DImodel = "FULL", method = "ML",
                         data = simMVRM)

# simModel_FULL is an object of custom class DImulti, which can be used with a number of S3
# methods and any method compatible with gls objects. We use summary() to examine the model fit,
# including fixed effect coefficients and the variance covariance matrix Sigma.

print(simModel_FULL)

# From this summary, we can see that there are many coefficients, a number of which are not
# statistically significant at an
# alpha level of 0.05, therefore this model may not be ideal for our data. We refit the model
# using an average interaction structure instead as it reduces the number of interaction terms
# to 1 per ecosystem function and time point.

simModel_AV <- DImulti(prop = 2:5, y = 6:8, eco_func = c("na", "un"), time = c("time", "ar1"),
                       unit_IDs = 1, estimate_theta = FALSE, DImodel = "AV", method = "ML",
                       data = simMVRM)
print(simModel_AV)

# This model looks like it is a better fit, with less insignificant terms, but we can test this
# formally using a likelihood ratio test, as the DImodels interaction structures are nested in
# nature (with the exception of "ADD" and "FG", which are on the same level in the nesting
# hierarchy).
# This will test the null hypothesis that the likelihood of the two models do not significantly
# differ, in other words, the added parameters of the more complex model are not worth it.

anova(simModel_FULL, simModel_AV)

# As the p-value is greater than our selected alpha value (0.05), we fail to reject the null
# hypothesis and so continue with the simpler model, simModel_AV.
# We can confirm this result by using information criteria such as AIC or BIC.
# We select the model with the lower value as it indicates a better fit.
# We use the second order versions of AIC and BIC (AICc and BICc) as we have a large number of
# terms, which can cause AIC and BIC to favour more complex models.

AICc(simModel_FULL); AICc(simModel_AV)
BICc(simModel_FULL); BICc(simModel_AV)

# We refit our chosen model using the REML estimation method to have unbiased estimates.

simModel_AV <- DImulti(prop = 2:5, y = 6:8, eco_func = c("na", "un"), time = c("time", "ar1"),
                       unit_IDs = 1, estimate_theta = FALSE, DImodel = "AV", method = "REML",
                       data = simMVRM)

# We can now examine the variance covariance matrix, Sigma, and the fixed effect coefficients,
# which can be retrieved from our initial summary() check or individually.

coef(simModel_AV)
simModel_AV$vcov

# An example of what we can infer from this is that ecosystem functions Y1 and Y2 have a positive
# covariance at time 1, while Y3 has a negative covariance with both Y1 and Y2 at the same time
# point. This means that maximising Y1 would not negatively impact Y2 but it would be at the cost
# of Y3. However, as our interaction term is positive and significant at this time point for all
# three ecosystem functions, we should be able to include a mixture of species that have positive
# ID effects for each response to help mitigate this trade-off.
# We can also predict from the model if we want to compare responses from different conditions,
# even those not included in the original data.

predict(simModel_AV, newdata = simMVRM[which(simMVRM$plot == 1:2), ])


#################################################################################################
## The Belgium dataset
#
# This real world dataset contains multiple ecosystem functions (k=3) at a single time point
# (t=1). The dataset also contains seeding density as a treatment in the form of a factor with
# two levels (1, -1). More detail can be found at ?dataBEL.

data(dataBEL)
head(dataBEL)

# We begin with the main function DImulti(), passing the initial species proportions column names
# through 'prop' and the response value column name through 'y'.
# As the data is in a long or stacked format, we specify the ecosystem function type through the
# first index of the 'eco_func' parameter, along with specifying that we want an unstructured
# Sigma for these response types.
# The experimental unit ID is stored in the column "Plot" and so we pass this to 'unit_IDs'.
# Rather than estimating the nonlinear parameter theta, we opt to provide a value for each
# ecosystem function type through the parameter 'theta', which will be applied to the
# interaction terms as a power. In this case, we use functional group (FG) interactions, which
# requires an additional argument FG to be provided, specifying which group each species present
# in 'prop' belongs to. In this case, we group the grasses and the legumes.
# We include the treatment Density as an additional fixed effect.
# We opt to use the REML estimation method as we will not be doing any model comparisons.
# Finally, we specify the data object, dataBEL, through 'data'.

belModel <- DImulti(prop = c("G1", "G2", "L1", "L2"), y = "Y", eco_func = c("Var", "un"),
                    unit_IDs = "Plot", theta = c(0.5, 1, 1.2), DImodel = "FG",
                    FG = c("Grass", "Grass", "Legume", "Legume"), extra_fixed = ~ Density,
                    method = "REML", data = dataBEL)

# We can now print the output from our model, stored in belModel with class DImulti.

print(belModel)




#################################################################################################
## The Sweden dataset
#
# This real-world dataset contains a single ecosystem function read at multiple time points
# (k=1 & t=3). The data contains two treatments, TREAT and DENS, each with two levels
# 1 & 2 and High & Low, respectively.
# More details can be found at ?dataSWE

data(dataSWE)
head(dataSWE)

# We transform the "YEARN" column to factors to better act as groups in our models, giving us a
# coefficient per year number as opposed to acting as a continuous scaling factor.

dataSWE$YEARN <- as.factor(dataSWE$YEARN)

# We use the DImulti() function to fit a repeated measures DI model to this data.
# We specify the column indices 5 through 8 for our initial species proportions and the response
# value column name "YIELD".
# As there are multiple time points (repeated measures), we use the parameter 'time', providing
# the column name containing the time identifier through the first index and the desired Sigma
# structure (compound symmetry) through the second.
# The experimental unit ID is stored in column index two, which is passed through 'unit_IDs'.
# The interaction structure chosen is average interaction term, "AV".
# We include both of the treatment terms, TREAT and DENS, as extra fixed effects crossed with
# each ID effect, the interaction effect, and with each other.
# We opt to use the REML estimation method for the model as we will not be doing any model
# comparisons.
# Finally, we specify the data object, dataSWE, through 'data'

SWEmodel <- DImulti(prop = 5:8, y = c("YIELD"), time = c("YEARN", "CS"),
                    unit_IDs = 2, DImodel = "AV", extra_fixed = ~ 1:(TREAT:DENS),
                    method = "REML", data = dataSWE)

print(SWEmodel)

Run the code above in your browser using DataLab