dbglm: Distance-based generalized linear models

Description

dbglm fits a variety of generalized linear model in which the explanatory information is coded as distances between individuals. These distances can either be computed from observed explanatory variables or supplied directly as a matrix of squared inter-individual distances. The response and link function are specified as in the glm function for ordinary generalized linear models.

Notation convention: in distance-based methods we must distinguish observed explanatory variables, denoted by Z or z, from Euclidean coordinates, denoted by X or x. For an explanation of both terms, see the references below.
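As a minimal sketch of the second input route, the squared inter-distance matrix can be built from observed variables with base R and tagged with class "D2", following the pattern used in the Examples section. The data are simulated and purely illustrative.

```r
# Simulated observed explanatory variables Z (illustrative only)
set.seed(1)
Z <- matrix(rnorm(20), nrow = 10, ncol = 2)

# Euclidean distances between the 10 individuals, then squared
D2 <- as.matrix(dist(Z))^2

# Tag the matrix as a squared-distances object,
# as done in the Examples section of this page
class(D2) <- "D2"

dim(D2)                       # 10 x 10
isSymmetric(unclass(D2))      # squared distances are symmetric
all(diag(unclass(D2)) == 0)   # zero distance from each individual to itself
```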

Usage

## S3 method for class 'formula':
dbglm(formula,data,family=gaussian,...,
        metric="euclidean",weights,maxiter=100,eps1=1e-10,
        eps2=1e-10,rel.gvar=0.95,eff.rank=NULL,offset,mustart=NULL) 
   
# method for distance class 'dist' or 'dissimilarity'
dbglm.dist(y,distance,family=gaussian,weights,
        maxiter=100,eps1=1e-10,eps2=1e-10,rel.gvar=0.95,eff.rank=NULL,
        offset,mustart=NULL,...)
                
#  method for distance class 'D2'
dbglm.D2(y,D2,...,family=gaussian,weights,maxiter=100,
        eps1=1e-10,eps2=1e-10,rel.gvar=0.95,eff.rank=NULL,offset,
        mustart=NULL)

#  method for class 'Gram'
dbglm.Gram(y,G,...,family=gaussian,weights,maxiter=100,
        eps1=1e-10,eps2=1e-10,rel.gvar=0.95,eff.rank=NULL,
        offset,mustart=NULL)

Arguments

formula

an object of class formula. A formula of the form y~Z. This argument is a remnant of the glm function, kept for compatibility.

data

an optional data frame containing the variables in the model (both response and explanatory variables, either the observed ones, Z, or a Euclidean configuration X).

y

(required if no formula is given as the principal argument). Response (dependent variable); must be numeric, factor, matrix or data.frame.

distance

a dist or dissimilarity class object. See functions dist in the package stats and daisy in the package cluster

D2

a D2 class object. Squared distances matrix between individuals. See the Details section in dblm to learn the usage.

G

a Gram class object. Doubly centered inner product matrix of the squared distances matrix D2. See details in dblm.
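The double centering mentioned for the Gram argument can be sketched in base R as follows. This is the standard unweighted formula, G = -1/2 J D2 J with J = I - (1/n) 1 1'; whether dbstats applies weights in its own centering is not shown here, so treat this as an illustration rather than the package's implementation.

```r
# Unweighted double centering of a squared-distances matrix
set.seed(1)
X  <- matrix(rnorm(30), nrow = 10, ncol = 3)
D2 <- as.matrix(dist(X))^2
n  <- nrow(D2)
J  <- diag(n) - matrix(1 / n, n, n)   # centering matrix
G  <- -0.5 * J %*% D2 %*% J

# For Euclidean distances, G equals the inner products of the centered data
Xc <- scale(X, center = TRUE, scale = FALSE)
all.equal(G, Xc %*% t(Xc), check.attributes = FALSE)

# Rows (and columns) of G sum to zero up to rounding error
max(abs(rowSums(G)))
```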

family

a description of the error distribution and link function to be used in the model. This can be a character string naming a family function, a family function or the result of a call to a family function. (See

metric

metric function to be used when computing distances from observed explanatory variables. One of "euclidean" (the default), "manhattan", or "gower".
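For orientation, the three metric names correspond to distances that can also be computed outside dbglm with dist (package stats) and daisy (package cluster). The sketch below assumes the cluster package is installed for the Gower case, and it does not claim that dbglm computes its distances through exactly these calls.

```r
set.seed(1)
Z <- data.frame(z1 = rnorm(8), z2 = runif(8))

# Euclidean and Manhattan distances with stats::dist
d_euc <- dist(Z, method = "euclidean")
d_man <- dist(Z, method = "manhattan")

# Gower dissimilarity with cluster::daisy, only if cluster is available
if (requireNamespace("cluster", quietly = TRUE)) {
  d_gow <- cluster::daisy(Z, metric = "gower")
  class(d_gow)
}

# Manhattan distance is never smaller than Euclidean distance
all(d_man >= d_euc - 1e-12)
```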

weights

an optional numeric vector of prior weights to be used in the fitting process. By default all individuals have the same weight.

maxiter

maximum number of iterations in the iterated dblm algorithm. (Default = 100)

eps1

stopping criterion 1, "DevStat": convergence tolerance eps1, a positive (small) number; the iterations converge when |dev - dev_{old}|/(|dev|) < eps1. Stationarity of deviance has been attained.

eps2

stopping criterion 2, "mustat": convergence tolerance eps2, a positive (small) number; the iterations converge when |mu - mu_{old}|/(|mu|) < eps2. Stationarity of fitted.values mu has been attained.
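The two stopping rules described for eps1 and eps2 can be sketched as a small check. The helper check_convergence is hypothetical (it is not part of dbstats), and the use of the Euclidean norm for |mu - mu_{old}| and |mu| is an assumption.

```r
# Hypothetical helper: returns the name of the first criterion met, or NA
check_convergence <- function(dev, dev_old, mu, mu_old,
                              eps1 = 1e-10, eps2 = 1e-10) {
  # Criterion 1: relative change in deviance
  rel_dev <- abs(dev - dev_old) / abs(dev)
  # Criterion 2: relative change in fitted means (Euclidean norm assumed)
  rel_mu <- sqrt(sum((mu - mu_old)^2)) / sqrt(sum(mu^2))
  if (rel_dev < eps1) return("DevStat")
  if (rel_mu < eps2) return("muStat")
  NA_character_
}

check_convergence(10, 10, c(1, 2), c(1, 2.5))   # deviance is stationary
check_convergence(10, 9,  c(1, 2), c(1, 2))     # fitted means are stationary
check_convergence(10, 9,  c(1, 2), c(1, 3))     # neither criterion is met
```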

rel.gvar

relative geometric variability (a real number between 0 and 1). In each dblm iteration, take the lowest effective rank, with a relative geometric variability higher or equal to rel.gvar. Default value (rel.gv

eff.rank

integer between 1 and the number of observations minus one. Number of Euclidean coordinates used for model fitting in each dblm iteration. If specified its value overrides rel.gvar. When eff.rank=NULL (default
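One way to picture how rel.gvar and eff.rank interact is the sketch below. The helper choose_eff_rank is hypothetical, and it assumes that the share of geometric variability captured by the first k Euclidean coordinates is the share of the k largest positive eigenvalues of the Gram matrix; the package's own computation may differ, for instance by using weights.

```r
# Hypothetical helper: smallest rank whose leading eigenvalues reach rel.gvar,
# unless eff.rank is given, in which case it takes precedence
choose_eff_rank <- function(G, rel.gvar = 0.95, eff.rank = NULL) {
  if (!is.null(eff.rank)) return(as.integer(eff.rank))
  lambda <- eigen(G, symmetric = TRUE, only.values = TRUE)$values
  # Keep only eigenvalues that are clearly positive
  lambda <- lambda[lambda > sqrt(.Machine$double.eps) * max(lambda)]
  cum <- cumsum(lambda) / sum(lambda)
  which(cum >= rel.gvar - 1e-12)[1]
}

set.seed(1)
X  <- cbind(rnorm(20, sd = 10), rnorm(20, sd = 1), rnorm(20, sd = 0.1))
Xc <- scale(X, center = TRUE, scale = FALSE)
G  <- Xc %*% t(Xc)

choose_eff_rank(G, rel.gvar = 0.95)   # rank chosen from the eigenvalue shares
choose_eff_rank(G, eff.rank = 2)      # eff.rank overrides rel.gvar
```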

offset

this can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be NULL or a numeric vector of length equal to the number of cases.

mustart

starting values for the vector of means.

...

arguments passed to or from other methods to the low level. Currently not used.

Value

A list of class dbglm containing the following components:
residuals

the working residuals, that is, the dblm residuals in the last iteration of the dblm fit.

fitted.values

the fitted mean values, results of the final dblm iteration.

family

the family object used.

deviance

measure of discrepancy or badness of fit. Proportional to twice the difference between the maximum achievable log-likelihood and that achieved by the current model.

aic.model

a version of Akaike's Information Criterion. Equal to minus twice the maximized log-likelihood plus twice the number of parameters. Computed by the aic component of the family. For binomial and Poisson families the dispersion is fixed at one and the number of parameters is the number of coefficients. For gaussian, Gamma and inverse gaussian families the dispersion is estimated from the residual deviance, and the number of parameters is the number of coefficients plus one. For a gaussian family the MLE of the dispersion is used, so this is a valid value of AIC, but for Gamma and inverse gaussian families it is not. For families fitted by quasi-likelihood the value is NA.

null.deviance

the deviance for the null model. The null model will include the offset, and an intercept if there is one in the model. Note that this will be incorrect if the link function depends on the data other than through the fitted mean: specify a zero offset to force a correct calculation.

iter

number of Fisher scoring (dblm) iterations.

prior.weights

the original weights.

weights

the working weights, that is, the weights in the last iteration of the dblm fit.

df.residual

the residual degrees of freedom.

df.null

the residual degrees of freedom for the null model.

y

the response vector used.

convcrit

convergence criterion. One of: "DevStat" (stopping criterion 1), "muStat" (stopping criterion 2), "maxiter" (maximum allowed number of iterations has been exceeded).

Hhat

hat matrix projector of the last dblm iteration.

rel.gvar

the relative geometric variability in the last dblm iteration.

eff.rank

the working effective rank, that is, the eff.rank in the last dblm iteration.

Objects of class "dbglm" are actually of class c("dbglm", "dblm"), inheriting the plot.dblm method from class "dblm".
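The AIC definition given for aic.model can be checked numerically on an ordinary glm fit, which uses the same family machinery. The sketch below uses glm rather than dbglm, so it illustrates the formula only; the data are simulated.

```r
set.seed(1)
z <- rnorm(50)
y <- rpois(50, exp(1 + z))
fit <- glm(y ~ z, family = poisson(link = "log"))

# Poisson family: dispersion fixed at one, so the number of parameters
# is the number of coefficients
k <- length(coef(fit))
aic_manual <- -2 * as.numeric(logLik(fit)) + 2 * k

all.equal(aic_manual, fit$aic)   # TRUE
```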

Details

The ways of supplying the explanatory information through distances, squared distances, and so on are the same as in dblm.

For Gamma distributions, the domain of the canonical link function is not the same as the permitted range of the mean. In particular, the linear predictor might be negative, yielding an impossible negative mean. Should that occur, dbglm stops with an error message. The proposed alternative is to use a non-canonical link function.
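The issue with the canonical Gamma link can be seen directly from the family objects in base R. The linear predictor values below are made up for illustration.

```r
eta <- c(-0.5, 0.5, 2)   # linear predictor values, one of them negative

# Canonical (inverse) link: mu = 1 / eta, negative whenever eta < 0
mu_inverse <- Gamma(link = "inverse")$linkinv(eta)
mu_inverse

# Non-canonical log link: mu = exp(eta), always positive
mu_log <- Gamma(link = "log")$linkinv(eta)
mu_log

all(mu_inverse > 0)   # FALSE: an impossible mean for a Gamma response
all(mu_log > 0)       # TRUE
```

With a Gamma response, passing family = Gamma(link = "log") to the fitting function is therefore one way to keep the fitted means positive.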

References

Boj E, Delicado P, Fortiana J (2010). Distance-based local linear regression for functional predictors. Computational Statistics and Data Analysis 54, 429-437.

Boj E, Grane A, Fortiana J, Claramunt MM (2007). Selection of predictors in distance-based regression. Communications in Statistics B - Simulation and Computation 36, 87-98.

Cuadras CM, Arenas C, Fortiana J (1996). Some computational aspects of a distance-based model for prediction. Communications in Statistics B - Simulation and Computation 25, 593-609.

Cuadras C, Arenas C (1990). A distance-based regression model for prediction with mixed data. Communications in Statistics A - Theory and Methods 19, 2261-2279.

Cuadras CM (1989). Distance analysis in discrimination and classification using both continuous and categorical variables. In: Y. Dodge (ed.), Statistical Data Analysis and Inference. Amsterdam, The Netherlands: North-Holland Publishing Co., pp. 459-473.

Examples


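## CASE POISSON
library(dbstats)   # provides dbglm and its methods

# Simulate one explanatory variable and a Poisson response with log link
z <- rnorm(100)
y <- rpois(100, exp(1 + z))

# Ordinary GLM fit, for reference
glm1 <- glm(y ~ z, family = poisson(link = "log"))

# Distance-based fit from the squared Euclidean distances between the z values
D2 <- as.matrix(dist(z))^2
class(D2) <- "D2"
dbglm1 <- dbglm.D2(y, D2, family = poisson(link = "log"))

# Compare fitted values: glm in colour 2, dbglm in colour 3
plot(z, y)
points(z, glm1$fitted.values, col = 2)
points(z, dbglm1$fitted.values, col = 3)

# Residual sums of squares of the two fits
sum((glm1$fitted.values - y)^2)
sum((dbglm1$fitted.values - y)^2)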

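## CASE BINOMIAL
# Reuses z from the Poisson case above; simulate a binary response
y <- rbinom(100, 1, plogis(z))

# Ordinary logistic regression, for reference
glm2 <- glm(y ~ z, family = binomial(link = "logit"))

# Distance-based fit from the squared Euclidean distances between the z values
D2 <- as.matrix(dist(z))^2
class(D2) <- "D2"
dbglm2 <- dbglm.D2(y, D2, family = binomial(link = "logit"))

# Compare fitted probabilities: glm in colour 2, dbglm in colour 3
plot(z, y)
points(z, glm2$fitted.values, col = 2)
points(z, dbglm2$fitted.values, col = 3)

# Residual sums of squares of the two fits
sum((glm2$fitted.values - y)^2)
sum((dbglm2$fitted.values - y)^2)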
