dbglm is a variety of generalized linear model in which explanatory information is coded as distances between individuals. These distances can either be computed from observed explanatory variables or input directly as a squared distances matrix.
The response and link function are specified as in the glm function for ordinary generalized linear models.
Notation convention: in distance-based methods we must distinguish observed explanatory variables, which we denote by Z or z, from Euclidean coordinates, which we denote by X or x. For an explanation of the meaning of both terms, see the bibliographic references below.
# S3 method for formula
dbglm(formula, data, family=gaussian, method ="GCV", full.search=TRUE,...,
metric="euclidean", weights, maxiter=100, eps1=1e-10,
eps2=1e-10, rel.gvar=0.95, eff.rank=NULL, offset, mustart=NULL, range.eff.rank)
# S3 method for dist
dbglm(distance,y,family=gaussian, method ="GCV", full.search=TRUE, weights,
maxiter=100,eps1=1e-10,eps2=1e-10,rel.gvar=0.95,eff.rank=NULL,
offset,mustart=NULL, range.eff.rank,...)
# S3 method for D2
dbglm(D2,y,...,family=gaussian, method ="GCV", full.search=TRUE, weights,maxiter=100,
eps1=1e-10,eps2=1e-10,rel.gvar=0.95,eff.rank=NULL,offset,
mustart=NULL, range.eff.rank)
# S3 method for Gram
dbglm(G,y,...,family=gaussian, method ="GCV", full.search=TRUE, weights,maxiter=100,
eps1=1e-10,eps2=1e-10,rel.gvar=0.95,eff.rank=NULL,
offset,mustart=NULL, range.eff.rank)
A list of class dbglm containing the following components:
the working residuals, that is, the dblm residuals in the last iteration of the dblm fit.
the fitted mean values, results of the final dblm iteration.
the family object used.
measure of discrepancy or badness of fit. Proportional to twice the difference between the maximum achievable log-likelihood and that achieved by the current model.
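To make this definition concrete, the sketch below checks it on an ordinary glm fit (not dbglm) with a Poisson family, where the dispersion is one and the deviance is exactly twice the log-likelihood gap to the saturated model. The simulated data and variable names are illustrative assumptions.

```r
set.seed(1)
z <- rnorm(50)
y <- rpois(50, exp(1 + z))
fit <- glm(y ~ z, family = poisson(link = "log"))

# Log-likelihood of the saturated model (fitted mean equals each observation)
ll_sat <- sum(dpois(y, lambda = y, log = TRUE))
# Log-likelihood achieved by the fitted model
ll_fit <- sum(dpois(y, lambda = fitted(fit), log = TRUE))

# Twice the gap between the two reproduces the reported deviance
dev_by_hand <- 2 * (ll_sat - ll_fit)
all.equal(dev_by_hand, deviance(fit))
```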
a version of Akaike's Information Criterion. Equal to minus twice the maximized log-likelihood plus twice the number of parameters. Computed by the aic component of the family. For binomial and Poisson families the dispersion is fixed at one and the number of parameters is the number of coefficients. For gaussian, Gamma and inverse gaussian families the dispersion is estimated from the residual deviance, and the number of parameters is the number of coefficients plus one. For a gaussian family the MLE of the dispersion is used, so this is a valid value of AIC, but for Gamma and inverse gaussian families it is not. For families fitted by quasi-likelihood the value is NA.
a version of the Bayesian Information Criterion. Equal to minus twice the maximized log-likelihood plus the logarithm of the number of observations times the number of parameters (see, e.g., Wood 2006).
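As a rough check of the two information criteria, the following sketch recomputes them by hand for a gaussian fit with the ordinary glm function (not dbglm). It assumes, as stated above for the gaussian family, that the dispersion is replaced by its MLE and counted as one extra parameter. The data are simulated for illustration only.

```r
set.seed(2)
z <- rnorm(40)
y <- 1 + 2 * z + rnorm(40)
fit <- glm(y ~ z, family = gaussian)

n <- length(y)
# MLE of the dispersion divides the residual sum of squares by n
sigma2_mle <- sum((y - fitted(fit))^2) / n
ll <- sum(dnorm(y, mean = fitted(fit), sd = sqrt(sigma2_mle), log = TRUE))
k <- length(coef(fit)) + 1   # coefficients plus the dispersion

aic_by_hand <- -2 * ll + 2 * k
bic_by_hand <- -2 * ll + log(n) * k

all.equal(aic_by_hand, AIC(fit))
all.equal(bic_by_hand, BIC(fit))
```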
a version of the Generalized Cross-Validation Criterion. We refer to Wood (2006) pp. 177-178 for details.
the deviance for the null model. The null model will include the offset, and an intercept if there is one in the model. Note that this will be incorrect if the link function depends on the data other than through the fitted mean: specify a zero offset to force a correct calculation.
number of Fisher scoring (dblm) iterations.
the original weights.
the working weights, that is, the weights in the last iteration of the dblm fit.
the residual degrees of freedom.
the residual degrees of freedom for the null model.
the response vector used.
convergence criterion. One of: "DevStat" (stopping criterion 1), "muStat" (stopping criterion 2), "maxiter" (maximum allowed number of iterations has been exceeded).
hat matrix projector of the last dblm iteration.
the relative geometric variability in the last dblm iteration.
the working effective rank, that is, the eff.rank in the last dblm iteration.
vector of estimated variance of each observation.
deviance residuals
the matched call.
Objects of class "dbglm" are actually of class c("dbglm", "dblm"), inheriting the plot.dblm method from class "dblm".
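The inheritance mechanism can be illustrated with a toy S3 generic. Here describe is a hypothetical generic invented for this sketch, and the object is an empty list carrying only the class vector, not a real fit.

```r
# Toy generic and method; 'describe' is a hypothetical name
describe <- function(x, ...) UseMethod("describe")
describe.dblm <- function(x, ...) "handled by the dblm method"

# An object with the same class vector as a dbglm fit
obj <- structure(list(), class = c("dbglm", "dblm"))

# No describe.dbglm exists, so dispatch falls through to describe.dblm
describe(obj)
inherits(obj, "dblm")
```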
an object of class formula. A formula of the form y~Z.
This argument is a remnant of the glm function, kept for compatibility.
an optional data frame containing the variables in the model (both response and explanatory variables, either the observed ones, Z, or a Euclidean configuration X).
(required if no formula is given as the principal argument). Response (dependent variable) must be numeric, factor, matrix or data.frame.
a dist or dissimilarity class object. See functions dist in the package stats and daisy in the package cluster.
a D2 class object. Squared distances matrix between individuals. See the Details section in dblm to learn the usage.
a Gram class object. Doubly centered inner product matrix of the squared distances matrix D2. See details in dblm.
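A minimal sketch of how a doubly centered inner product matrix relates to a squared distances matrix, using base R only. It assumes uniform weights in the centering step; the weighted version used internally by dblm may differ, and the variable names are illustrative.

```r
set.seed(3)
z  <- rnorm(6)
D2 <- as.matrix(dist(z))^2            # squared Euclidean distances

n <- nrow(D2)
J <- diag(n) - matrix(1 / n, n, n)    # centering matrix (uniform weights assumed)
G <- -0.5 * J %*% D2 %*% J            # doubly centered inner products

# Rows and columns of G sum to zero (up to rounding)
max(abs(rowSums(G)))

# Squared distances are recovered from G as g_ii + g_jj - 2 g_ij
D2_back <- outer(diag(G), diag(G), "+") - 2 * G
all.equal(D2_back, D2, check.attributes = FALSE)
```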
a description of the error distribution and link function to be used in the model. This can be a character string naming a family function, a family function or the result of a call to a family function. (See family for details of family functions.)
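The three ways of specifying a family can be seen with the ordinary glm function, which accepts the same forms. The sketch below uses simulated data and is not a dbglm call.

```r
set.seed(4)
z <- rnorm(30)
y <- rpois(30, exp(0.5 + 0.3 * z))

fit_chr  <- glm(y ~ z, family = "poisson")              # character string
fit_fun  <- glm(y ~ z, family = poisson)                # family function
fit_call <- glm(y ~ z, family = poisson(link = "log"))  # result of a call

# All three describe the same model
all.equal(coef(fit_chr), coef(fit_fun))
all.equal(coef(fit_fun), coef(fit_call))
```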
metric function to be used when computing distances from observed explanatory variables. One of "euclidean" (the default), "manhattan", or "gower".
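For orientation, the distances named above can be computed directly with dist from stats and daisy from cluster. This is only a sketch of what each metric means; the page does not say how dbglm computes them internally, and the data frame below is made up.

```r
set.seed(5)
Z <- data.frame(z1 = rnorm(5), z2 = rnorm(5))

d_euc <- dist(Z, method = "euclidean")
d_man <- dist(Z, method = "manhattan")

# Gower's dissimilarity handles mixed numeric and categorical data;
# computed here with cluster::daisy when that package is installed
if (requireNamespace("cluster", quietly = TRUE)) {
  Zmix  <- cbind(Z, f = factor(c("a", "b", "a", "b", "a")))
  d_gow <- cluster::daisy(Zmix, metric = "gower")
}

# Manhattan distances are never smaller than Euclidean ones
all(d_man >= d_euc - 1e-12)
```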
an optional numeric vector of prior weights to be used in the fitting process. By default all individuals have the same weight.
sets the method to be used in deciding the effective rank, which is defined as the number of linearly independent Euclidean coordinates used in prediction. There are five different methods: "AIC", "BIC", "GCV" (default), "eff.rank" and "rel.gvar". GCV takes the effective rank minimizing a cross-validatory quantity. AIC and BIC take the effective rank minimizing, respectively, the Akaike or Bayesian Information Criterion (see AIC for more details).
sets which optimization procedure will be used to minimize the modelling criterion specified in method. Needs to be specified only if method is "AIC", "BIC" or "GCV". If full.search=TRUE, the effective rank is set to its global best value, after evaluating the criterion for all possible ranks. This is potentially too computationally expensive. If full.search=FALSE, the optimize function is called. Computation time is then shorter, but the result may be only a local minimum.
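The trade-off between the two search strategies can be sketched on a toy criterion. The function crit below is hypothetical and has two minima. Rounding the continuous optimize result to an integer rank is an assumption made for this illustration, not a description of what dbglm does internally.

```r
# Hypothetical criterion over ranks 1..20, with two separate minima
ranks <- 1:20
crit  <- function(r) (r - 5)^2 * (r - 16)^2 / 100 + 0.3 * r

# Exhaustive search: evaluate every rank and keep the global minimizer
best_full <- ranks[which.min(crit(ranks))]

# One-dimensional search: optimize() works on a continuous interval,
# so its result is rounded to the nearest integer rank here
opt      <- optimize(crit, interval = range(ranks))
best_opt <- round(opt$minimum)

# The exhaustive search can never do worse than the one-dimensional search
crit(best_full) <= crit(best_opt)
```

The exhaustive search costs one criterion evaluation (here, one model fit in practice) per candidate rank, while optimize evaluates only a handful of points.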
maximum number of iterations in the iterated dblm algorithm (default = 100).
stopping criterion 1, "DevStat": convergence tolerance eps1, a positive (small) number; the iterations converge when |dev - dev_{old}|/(|dev|) < eps1, that is, when stationarity of the deviance has been attained.
stopping criterion 2, "muStat": convergence tolerance eps2, a positive (small) number; the iterations converge when |mu - mu_{old}|/(|mu|) < eps2, that is, when stationarity of the fitted.values mu has been attained.
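The stopping rules can be sketched as a small helper. The function check_convergence is hypothetical. The Euclidean norm for the vector mu and the order in which the criteria are tested are assumptions, since the text gives neither.

```r
# Hypothetical helper: returns which criterion (if any) stops the iterations
check_convergence <- function(dev, dev_old, mu, mu_old, iter,
                              eps1 = 1e-10, eps2 = 1e-10, maxiter = 100) {
  # Relative change in deviance (stopping criterion 1)
  rel_dev <- abs(dev - dev_old) / abs(dev)
  # Relative change in fitted means (stopping criterion 2); Euclidean norm assumed
  rel_mu <- sqrt(sum((mu - mu_old)^2)) / sqrt(sum(mu^2))
  if (rel_dev < eps1) return("DevStat")
  if (rel_mu < eps2) return("muStat")
  if (iter >= maxiter) return("maxiter")
  NA_character_   # no criterion met: keep iterating
}

# Deviance unchanged between two iterations
check_convergence(dev = 10, dev_old = 10, mu = c(1, 2), mu_old = c(1.5, 2), iter = 3)
```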
relative geometric variability (a real number between 0 and 1). In each dblm iteration, take the lowest effective rank with a relative geometric variability greater than or equal to rel.gvar. The default value (rel.gvar=0.95) retains 95% of the total variability.
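One way to picture this rule is as a threshold on the cumulative share of the eigenvalues of the doubly centered inner product matrix. The sketch below assumes that reading of relative geometric variability and uniform weights; the exact computation inside dblm may differ.

```r
set.seed(6)
Z  <- matrix(rnorm(40), ncol = 4)     # 10 individuals, 4 observed variables
D2 <- as.matrix(dist(Z))^2
n  <- nrow(D2)
J  <- diag(n) - matrix(1 / n, n, n)
G  <- -0.5 * J %*% D2 %*% J           # doubly centered inner products

lambda <- eigen(G, symmetric = TRUE, only.values = TRUE)$values
lambda <- lambda[lambda > 1e-8 * max(lambda)]   # keep the positive eigenvalues
cum_rel_gvar <- cumsum(lambda) / sum(lambda)

rel.gvar <- 0.95
# Lowest rank whose cumulative share reaches the threshold
eff_rank <- which(cum_rel_gvar >= rel.gvar)[1]
eff_rank
```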
integer between 1 and the number of observations minus one. Number of Euclidean coordinates used for model fitting in each dblm iteration. If specified, its value overrides rel.gvar. When eff.rank=NULL (default), calls to dblm are made with method="rel.gvar".
this can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be NULL or a numeric vector of length equal to the number of cases.
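A common use of an offset is a known exposure in a Poisson rate model. The sketch below uses the ordinary glm function, whose offset argument has the same meaning; the exposure variable and the simulated data are illustrative.

```r
set.seed(7)
exposure <- runif(60, 1, 10)
z <- rnorm(60)
y <- rpois(60, exposure * exp(0.2 + 0.5 * z))

# log(exposure) enters the linear predictor with its coefficient fixed at one
fit <- glm(y ~ z, family = poisson(link = "log"), offset = log(exposure))

# Rebuild the linear predictor by hand: estimated part plus the known offset
eta <- coef(fit)[1] + coef(fit)[2] * z + log(exposure)
all.equal(unname(exp(eta)), unname(fitted(fit)))
```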
starting values for the vector of means.
vector of size two defining the range of values for the effective rank with which the dblm iterations will be evaluated (must be specified when method is "AIC", "BIC" or "GCV"). The range must lie within c(1,n-1).
arguments passed to or from other methods to the low level.
Boj, Eva <evaboj@ub.edu>, Caballe, Adria <adria.caballe@upc.edu>, Delicado, Pedro <pedro.delicado@upc.edu> and Fortiana, Josep <fortiana@ub.edu>
The various possible ways of inputting the model explanatory information through distances, or their squares, etc., are the same as in dblm.
For Gamma distributions, the domain of the canonical link function is not the same as the permitted range of the mean. In particular, the linear predictor might be negative, yielding an impossible negative mean. Should that event occur, dbglm stops with an error message. The proposed alternative is to use a non-canonical link function.
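The problem with the canonical link can be seen directly from the family objects in stats. This sketch only inspects the link functions and does not fit a model.

```r
# Canonical (inverse) link: the mean is 1 / eta
canon <- Gamma(link = "inverse")
mu_canon <- canon$linkinv(-0.5)   # 1 / (-0.5) = -2, not a valid Gamma mean
canon$validmu(mu_canon)           # the family's own validity check

# Non-canonical log link: the mean is exp(eta), positive for any real eta
loglink <- Gamma(link = "log")
mu_log <- loglink$linkinv(-0.5)
loglink$validmu(mu_log)
```

With the log link, any value of the linear predictor maps to an admissible mean, which is why a non-canonical link avoids the error.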
Boj E, Caballe A, Delicado P, Esteve A, Fortiana J (2016). Global and local distance-based generalized linear models. TEST 25, 170-195.
Boj E, Delicado P, Fortiana J (2010). Distance-based local linear regression for functional predictors. Computational Statistics and Data Analysis 54, 429-437.
Boj E, Grane A, Fortiana J, Claramunt MM (2007). Selection of predictors in distance-based regression. Communications in Statistics B - Simulation and Computation 36, 87-98.
Cuadras CM, Arenas C, Fortiana J (1996). Some computational aspects of a distance-based model for prediction. Communications in Statistics B - Simulation and Computation 25, 593-609.
Cuadras C, Arenas C (1990). A distance-based regression model for prediction with mixed data. Communications in Statistics A - Theory and Methods 19, 2261-2279.
Cuadras CM (1989). Distance analysis in discrimination and classification using both continuous and categorical variables. In: Y. Dodge (ed.), Statistical Data Analysis and Inference. Amsterdam, The Netherlands: North-Holland Publishing Co., pp. 459-473.
Wood SN (2006). Generalized Additive Models: An Introduction with R. Chapman & Hall, Boca Raton.
summary.dbglm for summary.
plot.dbglm for plots.
predict.dbglm for predictions.
dblm for distance-based linear models.
library(dbstats)  # package providing dbglm

## CASE POISSON
z <- rnorm(100)
y <- rpois(100, exp(1+z))
glm1 <- glm(y ~z, family = poisson(link = "log"))
D2 <- as.matrix(dist(z))^2
class(D2) <- "D2"
dbglm1 <- dbglm(D2,y,family = poisson(link = "log"), method="rel.gvar")
plot(z,y)
points(z,glm1$fitted.values,col=2)
points(z,dbglm1$fitted.values,col=3)
sum((glm1$fitted.values-y)^2)
sum((dbglm1$fitted.values-y)^2)
## CASE BINOMIAL
y <- rbinom(100, 1, plogis(z))
# needs to set a starting value for the next fit
glm2 <- glm(y ~z, family = binomial(link = "logit"))
D2 <- as.matrix(dist(z))^2
class(D2) <- "D2"
dbglm2 <- dbglm(D2,y,family = binomial(link = "logit"), method="rel.gvar")
plot(z,y)
points(z,glm2$fitted.values,col=2)
points(z,dbglm2$fitted.values,col=3)
sum((glm2$fitted.values-y)^2)
sum((dbglm2$fitted.values-y)^2)