manyglm: Fitting Generalized Linear Models for Multivariate Abundance Data

Description

manyglm is used to fit generalized linear models to high-dimensional data, such as multivariate abundance data in ecology. This is the base model-fitting function - see plot.manyglm for assumption checking, and anova.manyglm or summary.manyglm for significance testing.

Usage

manyglm(formula, family="negative.binomial", K=1, data=NULL, subset=NULL, 
    na.action=options("na.action"), theta.method = "PHI", model = FALSE,
    x = TRUE, y = TRUE, qr = TRUE, cor.type= "I", shrink.param=NULL, 
    tol=sqrt(.Machine$double.eps), maxiter=25, maxiter2=10, 
    show.coef=FALSE, show.fitted=FALSE, show.residuals=FALSE, 
    show.warning=FALSE, offset, ...)

Arguments

formula

an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under Details.

family

a description of the error distribution function to be used in the model. At the moment, this must be a character string naming one of the following members of the exponential family: 'gaussian', 'poisson', 'binomial', 'negative.binomial'

number of trials in binomial regression. By default, K=1 for presence-absence data using logistic regression.

data

an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables

subset

an optional vector specifying a subset of observations to be used in the fitting process.

na.action

a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is

theta.method

the method used for the estimation of the overdisperson parameter theta, such that the mean-variance relationship is V=m+m^2/theta for the negative binomial family. Here offers three options "PHI" = Maximum likelihood estimation with respect to p

model, x, y, qr

logicals. If TRUE the corresponding components of the fit (the model frame, the model matrix, the model matrix, the response, the QR decomposition of the model matrix) are returned.

cor.type

the structure imposed on the estimated correlation matrix under the fitted model. Can be "I"(default), "shrink", or "R". See Details. This parameter is merely stored in manyglm, and will be used as the default value for cor.

shrink.param

shrinkage parameter to be used if cor.type="shrink". If a numerical value is not supplied, it will be estimated from the data by cross validation-penalised normal likelihood as in Warton (2008). The parameter value is stored as an

tol

the tolerance used in estimations.

maxiter

maximum allowed iterations in the weighted least square estimation of beta. The default value is 100.

maxiter2

maximum allowed iterations in the internal ML estimations of negative binomial regression. The default value is 5.

show.coef, show.fitted, show.residuals, show.warning

logical. Whether to show model coefficients, fitted values, standardized pearson residuals, or operation warnings.

offset

this can be used to specify an _a priori_ known component to be included in the linear predictor during fitting. This should be 'NULL' or a numeric vector of length equal to NROW (i.e. number of observations) or a matrix of NROW times p (i.e. number of sp

...

further arguments passed to or from other methods.

Value

manyglm returns an object inheriting from "manyglm", "manylm" and "mglm". The function summary (i.e. summary.manyglm) can be used to obtain or print a summary of the results and the function anova (i.e. anova.manyglm) to produce an analysis of variance table. The generic accessor functions coefficients, fitted.values and residuals can be used to extract various useful features of the value returned by manyglm. An object of class "manyglm" is a list containing at least the following components:
coefficientsa named matrix of coefficients.
fitted.valuesthe matrix of fitted mean values, obtained by transforming the linear predictors by the inverse of the link function.
linear.predictorsthe linear fit on link scale.
residualsthe matrix of working residuals, that is the Pearson residuals standardized by the leverage adjustment h obtained from the diagonal elements of the hat matrix H.
sqrt.1_Hiithe matrix of scale terms used to standardize the Pearson reidusals.
variance.estimatorthe variance estimator for the corresponding family function.
sqrt.weightthe matrix of square root of working weights, estimated for the corresponding family function.
thetathe estimated nuisance parameters accounting for overdispersion
phithe inverse of theta
two.logliketwo times the log likelihood.
devianceup to a constant, minus twice the maximized log-likelihood. Where sensible, the constant is chosen so that a saturated model has deviance zero.
AICAkaike's An Information Criterion, minus twice the maximized log-likelihood plus twice the number of coefficients.
AICsumthe sum of the AIC's over all variables.
iterthe number of iterations of IWLS used.
tolthe tolerance used in estimations.
familythe family argument supplied.
theta.methodthe theta.method argument supplied.
cor.typethe cor.type argument supplied.
shrink.paramthe shrink parameter to be used in subsequent inference.
callthe matched call.
termsthe terms object used.
formulathe formula supplied.
rankthe numeric rank of the fitted linear model.
xlevels(where relevant) a record of the levels of the factors used in fitting.
df.residualthe residual degrees of freedom.
modelif the argument model is TRUE, this is the model.frame.
yif the argument y is TRUE, this is the response variables used.
xif the argument x is TRUE, this is the design matrix used.
qrif the argument qr is TRUE, this is the QR decomposition of the design matrix.
Objects of class "glm" are normally of class c("glm", "lm"), that is inherit from class "lm", and well-designed methods for class "lm" will be applied to the weighted linear model at the final iteration of IWLS. However, care is needed, as extractor functions for class "glm" such as residuals and weights do not just pick out the component of the fit with the same name.

Details

manyglm is used to calculate the parameter estimates of generalised linear models fitted to each of many variables simultaneously as in Warton et. al. (2012) and Wang et.al.(2012). Models for manyglm are specified symbolically. For details on this compare the details section of lm and formula. A formula has an implied intercept term. To remove this use either y ~ x - 1 or y ~ 0 + x. See formula for more details of allowed formulae. Generalised linear models are designed for non-normal data where a specific alternative distribution offers a reasonable model for data, as specified using the argument family (Warton (2011)). Count data can be analysed using family="poisson" if the data is not "overdispersed" (that is, if the variance is not larger than the mean), although for multivariate abundance data it has been shown that the negative binomial distribution (family="negative.binomial") is a good choice of model for counts (Warton (2005)). For binary (presence/absence) data, family="binomial" should be used. The negative binomial function requires knowledge of the overdispersion parameter (theta) in order to fit a generalised linear model. However, it is usual not to have knowledge of the value of theta. So we estimatea separate theta for each variable from the data. The method of estimating theta is controlled by theta.method for negative binomial distributions. family not only chooses the type of distribution to fit to data, but also controls the link function - the function of the mean that is fitted with a linear predictor. The default is the canonical link function, except for negative binomial, which uses the log-link. cor.type is the structure imposed on the estimated correlation matrix under the fitted model. Possible values are: "I"(default) = independence is assumed (correlation matrix is the identity) "shrink" = sample correlation matrix is shrunk towards I to improve its stability. "R" = unstructured correlation matrix is used. (Only available when N>p.) If cor.type=="shrink", a numerical value will be assigned to shrink.param either through the argument or by internal estimation. The working horse function for the internal estimation is ridgeParamEst, which is based on cross-validation (Warton 2008). The validation groups are chosen by random assignment, so some slight variation in the estimated values may be observed in repeat analyses. See ridgeParamEst for more details. The shrinkage parameter can be any value between 0 and 1 (0="I" and 1="R", values closer towards 0 indicate more shrinkage towards "I").

References

Lawless, J. F. (1987) Negative binomial and mixed Poisson regression, Canadian Journal of Statistics 15, 209-225. Hilbe, J. M. (2008) Negative Binomial Regression, Cambridge University Press, Cambridge. Warton D.I. (2005) Many zeros does not mean zero inflation: comparing the goodness of-fit of parametric models to multivariate abundance data, Environmetrics 16(3), 275-289. Warton D.I. (2008). Penalized normal likelihood and ridge regularization of correlation and covariance matrices. Journal of the American Statistical Association 103, 340-349. Warton D.I. (2011). Regularized sandwich estimators for analysis of high dimensional data using generalized estimating equations. Biometrics, 67(1), 116-123. Warton D. I., Wright S., and Wang, Y. (2012). Distance-based multivariate analyses confound location and dispersion effects. Methods in Ecology and Evolution, 3(1), 89-101. Wang Y., Neuman U., Wright S. and Warton D. I. (2012). mvabund: an R package for model-based analysis of multivariate abundance data. Methods in Ecology and Evolution. online 21 Feb 2012.

Examples

Run this code

data(spider)
spiddat <- mvabund(spider$abund)
X <- spider$x

#To fit a log-linear model assuming counts are poisson:
glm.spid <- manyglm(spiddat~X, family="poisson")
glm.spid 

summary(glm.spid, resamp="residual")

#To fit a binomial regression model to presence/absence data:
pres.abs <- spiddat
pres.abs[pres.abs>0] = 1
X <- data.frame(spider$x) #turn into a data frame to refer to variables in formula
glm.spid.bin <- manyglm(pres.abs~soil.dry+bare.sand+moss, data=X, family="binomial")
glm.spid.bin
drop1(glm.spid.bin) #AICs for one-term deletions, suggests dropping bare.sand

glm2.spid.bin <- manyglm(pres.abs~soil.dry+moss, data=X, family="binomial")
drop1(glm2.spid.bin) #backward elimination suggests settling on this model.

Run the code above in your browser using DataLab