manyglm
is used to fit generalized linear models to high-dimensional data, such as multivariate abundance data in ecology. This is the base model-fitting function - see plot.manyglm
for assumption checking, and anova.manyglm
or summary.manyglm
for significance testing.manyglm(formula, family="negative.binomial", K=1, data=NULL, subset=NULL,
na.action=options("na.action"), theta.method = "PHI", model = FALSE,
x = TRUE, y = TRUE, qr = TRUE, cor.type= "I", shrink.param=NULL,
tol=sqrt(.Machine$double.eps), maxiter=25, maxiter2=10,
show.coef=FALSE, show.fitted=FALSE, show.residuals=FALSE,
show.warning=FALSE, offset, ...)
"formula"
(or one that
can be coerced to that class): a symbolic description of the
model to be fitted. The details of model specification are given
under Details.as.data.frame
to a data frame) containing
the variables in the model. If not found in data
, the variables
NA
s. The default is set by
the na.action
setting of options
, and is
TRUE
the corresponding
components of the fit (the model frame, the model matrix, the model
matrix, the response, the QR decomposition of the model matrix) are
returned.manyglm
, and
will be used as the default value for cor.
cor.type="shrink"
. If a numerical
value is not supplied, it will be estimated from the data by cross
validation-penalised normal likelihood as in Warton (2008). The parameter
value is stored as anmanyglm
returns an object inheriting from "manyglm"
,
"manylm"
and "mglm".
The function summary
(i.e. summary.manyglm
) can be used to obtain or print a summary of the results and the function
anova
(i.e. anova.manyglm
) to produce an
analysis of variance table.
The generic accessor functions coefficients
,
fitted.values
and residuals
can be used to
extract various useful features of the value returned by manyglm
.
An object of class "manyglm"
is a list containing at least the
following components:family
argument supplied.theta.method
argument supplied.cor.type
argument supplied.terms
object used.model
is TRUE
, this is the model.frame
.y
is TRUE
, this is the response variables used.x
is TRUE
, this is the design matrix used.qr
is TRUE
, this is the QR decomposition of the design matrix."glm"
are normally of class c("glm",
"lm")
, that is inherit from class "lm"
, and well-designed
methods for class "lm"
will be applied to the weighted linear
model at the final iteration of IWLS. However, care is needed, as
extractor functions for class "glm"
such as
residuals
and weights
do not just pick out
the component of the fit with the same name.manyglm
is used to calculate the parameter estimates of generalised linear models fitted to each of many variables simultaneously as in Warton et. al. (2012) and Wang et.al.(2012). Models for manyglm
are specified symbolically. For details on this compare the details section of lm
and formula
. A formula has an implied intercept term. To remove this use either y ~ x - 1
or y ~ 0 + x
. See formula
for more details of allowed formulae.
Generalised linear models are designed for non-normal data where a specific alternative distribution offers a reasonable model for data, as specified using the argument family
(Warton (2011)). Count data can be analysed using family="poisson"
if the data is not "overdispersed" (that is, if the variance is not larger than the mean), although for multivariate abundance data it has been shown that the negative binomial distribution (family="negative.binomial"
) is a good choice of model for counts (Warton (2005)). For binary (presence/absence) data, family="binomial"
should be used.
The negative binomial function requires knowledge of the overdispersion
parameter (theta
) in order to fit a generalised linear model. However, it
is usual not to have knowledge of the value of theta
. So we estimatea separate theta
for each variable from the data.
The method of estimating theta
is controlled by theta.method
for negative binomial distributions.
family
not only chooses the type of distribution to fit to data,
but also controls the link function - the function of the mean that is fitted
with a linear predictor. The default is the canonical link function, except
for negative binomial, which uses the log-link.
cor.type
is the structure imposed on the estimated correlation
matrix under the fitted model. Possible values are:
"I"(default) = independence is assumed (correlation matrix is the identity)
"shrink" = sample correlation matrix is shrunk towards I to improve its stability.
"R" = unstructured correlation matrix is used. (Only available when N>p.)
If cor.type=="shrink"
, a numerical value will be assigned to shrink.param
either through the argument or by internal estimation. The working horse function for the internal estimation is ridgeParamEst
, which is based on cross-validation (Warton 2008). The validation groups are chosen by random assignment, so some slight variation in the estimated values may be observed in repeat analyses. See ridgeParamEst
for more details. The shrinkage parameter can be any value between 0 and 1 (0="I" and 1="R", values closer towards 0 indicate more shrinkage towards "I").anova.manyglm
, summary.manyglm
,data(spider)
spiddat <- mvabund(spider$abund)
X <- spider$x
#To fit a log-linear model assuming counts are poisson:
glm.spid <- manyglm(spiddat~X, family="poisson")
glm.spid
summary(glm.spid, resamp="residual")
#To fit a binomial regression model to presence/absence data:
pres.abs <- spiddat
pres.abs[pres.abs>0] = 1
X <- data.frame(spider$x) #turn into a data frame to refer to variables in formula
glm.spid.bin <- manyglm(pres.abs~soil.dry+bare.sand+moss, data=X, family="binomial")
glm.spid.bin
drop1(glm.spid.bin) #AICs for one-term deletions, suggests dropping bare.sand
glm2.spid.bin <- manyglm(pres.abs~soil.dry+moss, data=X, family="binomial")
drop1(glm2.spid.bin) #backward elimination suggests settling on this model.
Run the code above in your browser using DataLab