# cqo

##### Fitting Constrained Quadratic Ordination (CQO)

A *constrained quadratic ordination* (CQO; formerly called
*canonical Gaussian ordination* or CGO) model is fitted using
the *quadratic reduced-rank vector generalized linear model*
(QRR-VGLM) framework.

- Keywords
- models, regression

##### Usage

```
cqo(formula, family = stop("argument 'family' needs to be assigned"),
data = list(), weights = NULL, subset = NULL,
na.action = na.fail, etastart = NULL, mustart = NULL,
coefstart = NULL, control = qrrvglm.control(...), offset = NULL,
method = "cqo.fit", model = FALSE, x.arg = TRUE, y.arg = TRUE,
contrasts = NULL, constraints = NULL, extra = NULL,
smart = TRUE, ...)
```

##### Arguments

- formula
a symbolic description of the model to be fit. The RHS of the formula is applied to each linear predictor. Different variables in each linear predictor can be chosen by specifying constraint matrices.

- family
a function of class

`"vglmff"`

(see`vglmff-class`

) describing what statistical model is to be fitted. This is called a ``VGAM family function''. See`CommonVGAMffArguments`

for general information about many types of arguments found in this type of function. Currently the following families are supported:`poissonff`

,`binomialff`

(`logit`

and`cloglog`

links available),`negbinomial`

,`gamma2`

,`gaussianff`

. Sometimes special arguments are required for`cqo()`

, e.g.,`binomialff(multiple.responses = TRUE)`

. Also,`quasipoissonff`

and`quasibinomialff`

may or may not work.- data
an optional data frame containing the variables in the model. By default the variables are taken from

`environment(formula)`

, typically the environment from which`cqo`

is called.- weights
an optional vector or matrix of (prior) weights to be used in the fitting process. Currently, this argument should not be used.

- subset
an optional logical vector specifying a subset of observations to be used in the fitting process.

- na.action
a function which indicates what should happen when the data contain

`NA`

s. The default is set by the`na.action`

setting of`options`

, and is`na.fail`

if that is unset. The ``factory-fresh'' default is`na.omit`

.- etastart
starting values for the linear predictors. It is a \(M\)-column matrix. If \(M = 1\) then it may be a vector. Currently, this argument probably should not be used.

- mustart
starting values for the fitted values. It can be a vector or a matrix. Some family functions do not make use of this argument. Currently, this argument probably should not be used.

- coefstart
starting values for the coefficient vector. Currently, this argument probably should not be used.

- control
a list of parameters for controlling the fitting process. See

`qrrvglm.control`

for details.- offset
This argument must not be used.

- method
the method to be used in fitting the model. The default (and presently only) method

`cqo.fit`

uses*iteratively reweighted least squares*(IRLS).- model
a logical value indicating whether the

*model frame*should be assigned in the`model`

slot.- x.arg, y.arg
logical values indicating whether the model matrix and response matrix used in the fitting process should be assigned in the

`x`

and`y`

slots. Note the model matrix is the LM model matrix.- contrasts
an optional list. See the

`contrasts.arg`

of`model.matrix.default`

.- constraints
an optional list of constraint matrices. The components of the list must be named with the term it corresponds to (and it must match in character format). Each constraint matrix must have \(M\) rows, and be of full-column rank. By default, constraint matrices are the \(M\) by \(M\) identity matrix unless arguments in the family function itself override these values. If

`constraints`

is used it must contain*all*the terms; an incomplete list is not accepted. Constraint matrices for \(x_2\) variables are taken as the identity matrix.- extra
an optional list with any extra information that might be needed by the family function.

- smart
logical value indicating whether smart prediction (

`smartpred`

) will be used.- …
further arguments passed into

`qrrvglm.control`

.

##### Details

QRR-VGLMs or *constrained quadratic ordination* (CQO) models
are estimated here by maximum likelihood estimation. Optimal linear
combinations of the environmental variables are computed, called
*latent variables* (these appear as `latvar`

for \(R=1\)
else `latvar1`

, `latvar2`

, etc. in the output). Here, \(R\)
is the *rank* or the number of ordination axes. Each species'
response is then a regression of these latent variables using quadratic
polynomials on a transformed scale (e.g., log for Poisson counts, logit
for presence/absence responses). The solution is obtained iteratively
in order to maximize the log-likelihood function, or equivalently,
minimize the deviance.

The central formula (for Poisson and binomial species data) is
given by
$$\eta = B_1^T x_1 + A \nu +
\sum_{m=1}^M (\nu^T D_m \nu) e_m$$
where \(x_1\) is a vector (usually just a 1 for an intercept),
\(x_2\) is a vector of environmental variables, \(\nu=C^T
x_2\) is a \(R\)-vector of latent variables, \(e_m\) is
a vector of 0s but with a 1 in the \(m\)th position.
The \(\eta\) are a vector of linear/additive predictors,
e.g., the \(m\)th element is \(\eta_m = \log(E[Y_m])\) for the \(m\)th species. The matrices \(B_1\),
\(A\), \(C\) and \(D_m\) are estimated from the data, i.e.,
contain the regression coefficients. The tolerance matrices
satisfy \(T_s = -\frac12 D_s^{-1}\).
Many important CQO details are directly related to arguments
in `qrrvglm.control`

, e.g., the argument `noRRR`

specifies which variables comprise \(x_1\).

Theoretically, the four most popular VGAM family functions
to be used with `cqo`

correspond to the Poisson, binomial,
normal, and negative binomial distributions. The latter is a
2-parameter model. All of these are implemented, as well as the
2-parameter gamma. The Poisson is or should be catered for by
`quasipoissonff`

and `poissonff`

, and the
binomial by `quasibinomialff`

and `binomialff`

.
Those beginning with `"quasi"`

have dispersion parameters that
are estimated for each species.

For initial values, the function `.Init.Poisson.QO`

should
work reasonably well if the data is Poisson with species having equal
tolerances. It can be quite good on binary data too. Otherwise the
`Cinit`

argument in `qrrvglm.control`

can be used.

It is possible to relax the quadratic form to an additive model. The
result is a data-driven approach rather than a model-driven approach,
so that CQO is extended to *constrained additive ordination*
(CAO) when \(R=1\). See `cao`

for more details.

In this documentation, \(M\) is the number of linear predictors, \(S\) is the number of responses (species). Then \(M=S\) for Poisson and binomial species data, and \(M=2S\) for negative binomial and gamma distributed species data.

Incidentally,
*Unconstrained quadratic ordination* (UQO)
may be performed by, e.g., fitting a Goodman's RC association model;
see `uqo`

and the Yee and Hadi (2014) referenced there.
For UQO, the response is the usual site-by-species matrix and
there are no environmental variables;
the site scores are free parameters.
UQO can be performed under the assumption that all species
have the same tolerance matrices.

##### Value

An object of class `"qrrvglm"`

.

##### Note

The input requires care, preparation and
thought---*a lot more* than other ordination methods.
Here is a partial **checklist**.

- (1)
The number of species should be kept reasonably low, e.g., 12 max. Feeding in 100+ species wholesale is a recipe for failure. Choose a few species carefully. Using 10 well-chosen species is better than 100+ species thrown in willy-nilly.

- (2)
Each species should be screened individually first, e.g., for presence/absence is the species totally absent or totally present at all sites? For presence/absence data

`sort(colMeans(data))`

can help avoid such species.- (3)
The number of explanatory variables should be kept low, e.g., 7 max.

- (4)
Each explanatory variable should be screened individually first, e.g., is it heavily skewed or are there outliers? They should be plotted and then transformed where needed. They should not be too highly correlated with each other.

- (5)
Each explanatory variable should be scaled, e.g., to mean 0 and unit variance. This is especially needed for

`I.tolerance = TRUE`

.- (6)
Keep the rank low. Only if the data is very good should a rank-2 model be attempted. Usually a rank-1 model is all that is practically possible even after a lot of work. The rank-1 model should always be attempted first. Then might be clever and try use this for initial values for a rank-2 model.

- (7)
If the number of sites is large then choose a random sample of them. For example, choose a maximum of 500 sites. This will reduce the memory and time expense of the computations.

- (8)
Try

`I.tolerance = TRUE`

or`eq.tolerance = FALSE`

if the inputted data set is large, so as to reduce the computational expense. That's because the default,`I.tolerance = FALSE`

and`eq.tolerance = TRUE`

, is very memory hungry.

By default, a rank-1 equal-tolerances QRR-VGLM model is fitted
(see `qrrvglm.control`

for the default control
parameters).
If `Rank > 1`

then the latent variables are always transformed
so that they are uncorrelated.
By default, the argument `trace`

is `TRUE`

meaning a running
log is printed out while the computations are taking place. This is
because the algorithm is computationally expensive, therefore users
might think that their computers have frozen if `trace = FALSE`

!

The argument `Bestof`

in `qrrvglm.control`

controls
the number of models fitted (each uses different starting values) to
the data. This argument is important because convergence may be to a
*local* solution rather than the *global* solution. Using
more starting values increases the chances of finding the global
solution. Always plot an ordination diagram (use the generic function
`lvplot`

) and see if it looks sensible. Local solutions
arise because the optimization problem is highly nonlinear, and this is
particularly true for CAO.

Many of the arguments applicable to `cqo`

are common to
`vglm`

and `rrvglm.control`

.
The most important arguments are
`Rank`

,
`noRRR`

,
`Bestof`

,
`I.tolerances`

,
`eq.tolerances`

,
`isd.latvar`

, and
`MUXfactor`

.

When fitting a 2-parameter model such as the negative binomial
or gamma, it pays to have `eq.tolerances = TRUE`

and
`I.tolerances = FALSE`

. This is because numerical problems can
occur when fitting the model far away from the global solution when
`I.tolerances = TRUE`

. Setting the two arguments as described will
slow down the computation considerably, however it is numerically
more stable.

In Example 1 below, an unequal-tolerances rank-1 QRR-VGLM is fitted to the
hunting spiders dataset, and
Example 2 is the equal-tolerances version. The latter is less likely to
have convergence problems compared to the unequal-tolerances model.
In Example 3 below, an equal-tolerances rank-2 QRR-VGLM is fitted to the
hunting spiders dataset.
The numerical difficulties encountered in fitting the rank-2 model
suggests a rank-1 model is probably preferable.
In Example 4 below, constrained binary quadratic ordination (in old
nomenclature, constrained Gaussian logit ordination) is fitted to some
simulated data coming from a species packing model.
With multivariate binary responses, one must
use `multiple.responses = TRUE`

to
indicate that the response (matrix) is multivariate. Otherwise, it is
interpreted as a single binary response variable.
In Example 5 below, the deviance residuals are plotted for each species.
This is useful as a diagnostic plot.
This is done by (re)regressing each species separately against the latent
variable.

Sometime in the future, this function might handle input of the form
`cqo(x, y)`

, where `x`

and `y`

are matrices containing
the environmental and species data respectively.

##### Warning

Local solutions are not uncommon when fitting CQO models. To increase
the chances of obtaining the global solution, increase the value
of the argument `Bestof`

in `qrrvglm.control`

.
For reproducibility of the results, it pays to set a different
random number seed before calling `cqo`

(the function
`set.seed`

does this). The function `cqo`

chooses initial values for **C** using `.Init.Poisson.QO()`

if `Use.Init.Poisson.QO = TRUE`

, else random numbers.

Unless `I.tolerances = TRUE`

or `eq.tolerances = FALSE`

,
CQO is computationally expensive with memory and time.
It pays to keep the rank down to 1
or 2. If `eq.tolerances = TRUE`

and `I.tolerances = FALSE`

then
the cost grows quickly with the number of species and sites (in terms of
memory requirements and time). The data needs to conform quite closely
to the statistical model, and the environmental range of the data should
be wide in order for the quadratics to fit the data well (bell-shaped
response surfaces). If not, RR-VGLMs will be more appropriate because
the response is linear on the transformed scale (e.g., log or logit)
and the ordination is called *constrained linear ordination* or CLO.

Like many regression models, CQO is sensitive to outliers (in the environmental and species data), sparse data, high leverage points, multicollinearity etc. For these reasons, it is necessary to examine the data carefully for these features and take corrective action (e.g., omitting certain species, sites, environmental variables from the analysis, transforming certain environmental variables, etc.). Any optimum lying outside the convex hull of the site scores should not be trusted. Fitting a CAO is recommended first, then upon transformations etc., possibly a CQO can be fitted.

For binary data, it is necessary to have `enough' data. In general,
the number of sites \(n\) ought to be much larger than the number of
species *S*, e.g., at least 100 sites for two species. Compared
to count (Poisson) data, numerical problems occur more frequently
with presence/absence (binary) data. For example, if `Rank = 1`

and if the response data for each species is a string of all absences,
then all presences, then all absences (when enumerated along the latent
variable) then infinite parameter estimates will occur. In general,
setting `I.tolerances = TRUE`

may help.

This function was formerly called `cgo`

. It has been renamed to
reinforce a new nomenclature described in Yee (2006).

##### References

Yee, T. W. (2004)
A new technique for maximum-likelihood
canonical Gaussian ordination.
*Ecological Monographs*,
**74**, 685--701.

ter Braak, C. J. F. and Prentice, I. C. (1988)
A theory of gradient analysis.
*Advances in Ecological Research*,
**18**, 271--317.

Yee, T. W. (2006)
Constrained additive ordination.
*Ecology*, **87**, 203--213.

##### See Also

`qrrvglm.control`

,
`Coef.qrrvglm`

,
`predictqrrvglm`

,
`rcqo`

,
`cao`

,
`uqo`

,
`rrvglm`

,
`poissonff`

,
`binomialff`

,
`negbinomial`

,
`gamma2`

,
`lvplot.qrrvglm`

,
`perspqrrvglm`

,
`trplot.qrrvglm`

,
`vglm`

,
`set.seed`

,
`hspider`

,
`trapO`

.

##### Examples

```
# NOT RUN {
# Example 1; Fit an unequal tolerances model to the hunting spiders data
hspider[,1:6] <- scale(hspider[,1:6]) # Standardized environmental variables
set.seed(1234) # For reproducibility of the results
p1ut <- cqo(cbind(Alopacce, Alopcune, Alopfabr, Arctlute, Arctperi,
Auloalbi, Pardlugu, Pardmont, Pardnigr, Pardpull,
Trocterr, Zoraspin) ~
WaterCon + BareSand + FallTwig + CoveMoss + CoveHerb + ReflLux,
fam = poissonff, data = hspider, Crow1positive = FALSE,
eq.tol = FALSE)
sort(deviance(p1ut, history = TRUE)) # A history of all the iterations
if (deviance(p1ut) > 1177) warning("suboptimal fit obtained")
S <- ncol(depvar(p1ut)) # Number of species
clr <- (1:(S+1))[-7] # Omits yellow
lvplot(p1ut, y = TRUE, lcol = clr, pch = 1:S, pcol = clr,
las = 1) # Ordination diagram
legend("topright", leg = colnames(depvar(p1ut)), col = clr,
pch = 1:S, merge = TRUE, bty = "n", lty = 1:S, lwd = 2)
(cp <- Coef(p1ut))
(a <- latvar(cp)[cp@latvar.order]) # Ordered site scores along the gradient
# Names of the ordered sites along the gradient:
rownames(latvar(cp))[cp@latvar.order]
(aa <- Opt(cp)[, cp@Optimum.order]) # Ordered optimums along the gradient
aa <- aa[!is.na(aa)] # Delete the species that is not unimodal
names(aa) # Names of the ordered optimums along the gradient
trplot(p1ut, which.species = 1:3, log = "xy", type = "b", lty = 1, lwd = 2,
col = c("blue","red","green"), label = TRUE) -> ii # Trajectory plot
legend(0.00005, 0.3, paste(ii$species[, 1], ii$species[, 2], sep = " and "),
lwd = 2, lty = 1, col = c("blue", "red", "green"))
abline(a = 0, b = 1, lty = "dashed")
S <- ncol(depvar(p1ut)) # Number of species
clr <- (1:(S+1))[-7] # Omits yellow
persp(p1ut, col = clr, label = TRUE, las = 1) # Perspective plot
# Example 2; Fit an equal tolerances model. Less numerically fraught.
set.seed(1234)
p1et <- cqo(cbind(Alopacce, Alopcune, Alopfabr, Arctlute, Arctperi,
Auloalbi, Pardlugu, Pardmont, Pardnigr, Pardpull,
Trocterr, Zoraspin) ~
WaterCon + BareSand + FallTwig + CoveMoss + CoveHerb + ReflLux,
poissonff, data = hspider, Crow1positive = FALSE)
sort(deviance(p1et, history = TRUE)) # A history of all the iterations
if (deviance(p1et) > 1586) warning("suboptimal fit obtained")
S <- ncol(depvar(p1et)) # Number of species
clr <- (1:(S+1))[-7] # Omits yellow
persp(p1et, col = clr, label = TRUE, las = 1)
# Example 3: A rank-2 equal tolerances CQO model with Poisson data
# This example is numerically fraught... need I.toler = TRUE too.
set.seed(555)
p2 <- cqo(cbind(Alopacce, Alopcune, Alopfabr, Arctlute, Arctperi,
Auloalbi, Pardlugu, Pardmont, Pardnigr, Pardpull,
Trocterr, Zoraspin) ~
WaterCon + BareSand + FallTwig + CoveMoss + CoveHerb + ReflLux,
poissonff, data = hspider, Crow1positive = FALSE,
I.toler = TRUE, Rank = 2, Bestof = 3, isd.latvar = c(2.1, 0.9))
sort(deviance(p2, history = TRUE)) # A history of all the iterations
if (deviance(p2) > 1127) warning("suboptimal fit obtained")
lvplot(p2, ellips = FALSE, label = TRUE, xlim = c(-3,4),
C = TRUE, Ccol = "brown", sites = TRUE, scol = "grey",
pcol = "blue", pch = "+", chull = TRUE, ccol = "grey")
# Example 4: species packing model with presence/absence data
set.seed(2345)
n <- 200; p <- 5; S <- 5
mydata <- rcqo(n, p, S, fam = "binomial", hi.abundance = 4,
eq.tol = TRUE, es.opt = TRUE, eq.max = TRUE)
myform <- attr(mydata, "formula")
set.seed(1234)
b1et <- cqo(myform, binomialff(multiple.responses = TRUE, link = "cloglog"),
data = mydata)
sort(deviance(b1et, history = TRUE)) # A history of all the iterations
lvplot(b1et, y = TRUE, lcol = 1:S, pch = 1:S, pcol = 1:S, las = 1)
Coef(b1et)
# Compare the fitted model with the 'truth'
cbind(truth = attr(mydata, "concoefficients"), fitted = concoef(b1et))
# Example 5: Plot the deviance residuals for diagnostic purposes
set.seed(1234)
p1et <- cqo(cbind(Alopacce, Alopcune, Alopfabr, Arctlute, Arctperi,
Auloalbi, Pardlugu, Pardmont, Pardnigr, Pardpull,
Trocterr, Zoraspin) ~
WaterCon + BareSand + FallTwig + CoveMoss + CoveHerb + ReflLux,
poissonff, data = hspider, eq.tol = TRUE, trace = FALSE)
sort(deviance(p1et, history = TRUE)) # A history of all the iterations
if (deviance(p1et) > 1586) warning("suboptimal fit obtained")
S <- ncol(depvar(p1et))
par(mfrow = c(3, 4))
for (ii in 1:S) {
tempdata <- data.frame(latvar1 = c(latvar(p1et)),
sppCounts = depvar(p1et)[, ii])
tempdata <- transform(tempdata, myOffset = -0.5 * latvar1^2)
# For species ii, refit the model to get the deviance residuals
fit1 <- vglm(sppCounts ~ offset(myOffset) + latvar1, poissonff,
data = tempdata, trace = FALSE)
# For checking: this should be 0
# print("max(abs(c(Coef(p1et)@B1[1,ii],Coef(p1et)@A[ii,1])-coef(fit1)))")
# print( max(abs(c(Coef(p1et)@B1[1,ii],Coef(p1et)@A[ii,1])-coef(fit1))) )
# Plot the deviance residuals
devresid <- resid(fit1, type = "deviance")
predvalues <- predict(fit1) + fit1@offset
ooo <- with(tempdata, order(latvar1))
plot(predvalues + devresid ~ latvar1, data = tempdata, col = "red",
xlab = "latvar1", ylab = "", main = colnames(depvar(p1et))[ii])
with(tempdata, lines(latvar1[ooo], predvalues[ooo], col = "blue"))
}
# }
```

*Documentation reproduced from package VGAM, version 1.0-4, License: GPL-3*