ldblm is a localized version of a distance-based linear model. As in the global model dblm, explanatory information is coded as distances between individuals.
Neighborhood definition for localizing is done by the (semi)metric dist1, whereas a second (semi)metric dist2 (which may coincide with dist1) is used for distance-based prediction. Both dist1 and dist2 can either be computed from observed explanatory variables or input directly as a squared distances matrix or as a Gram matrix.
The response is a continuous variable, as in the ordinary linear model. The model allows for a mixture of continuous and qualitative explanatory variables or, in fact, for more general quantities such as functional data.
Notation convention: in distance-based methods we must distinguish observed explanatory variables, which we denote by Z or z, from Euclidean coordinates, which we denote by X or x. For an explanation of both terms, see the references below.
# S3 method for formula
ldblm(formula,data,...,kind.of.kernel=1,
metric1="euclidean",metric2=metric1,method.h="GCV",weights,
user.h=NULL,h.range=NULL,noh=10,k.knn=3,rel.gvar=0.95,eff.rank=NULL)
# S3 method for dist
ldblm(dist1,dist2=dist1,y,kind.of.kernel=1,
method.h="GCV",weights,user.h=quantile(dist1,.25),
h.range=quantile(as.matrix(dist1),c(.05,.5)),noh=10,
k.knn=3,rel.gvar=0.95,eff.rank=NULL,...)
# S3 method for D2
ldblm(D2.1,D2.2=D2.1,y,kind.of.kernel=1,method.h="GCV",
weights,user.h=quantile(D2.1,.25)^.5,
h.range=quantile(as.matrix(D2.1),c(.05,.5))^.5,noh=10,k.knn=3,
rel.gvar=0.95,eff.rank=NULL,...)
# S3 method for Gram
ldblm(G1,G2=G1,y,kind.of.kernel=1,method.h="GCV",
weights,user.h=NULL,h.range=NULL,noh=10,k.knn=3,rel.gvar=0.95,
eff.rank=NULL,...)
A list of class ldblm containing the following components:
the residuals (response minus fitted values).
the fitted mean values.
the optimal bandwidth h used in the fitting process (if method.h != "user.h").
the smoother hat projector.
the specified weights.
the response variable used.
the matched call.
the distance matrix (object of class "D2" or "dist") used to calculate the weights of the observations.
the distance matrix (object of class "D2" or "dist") used to fit the dblm.
formula: an object of class formula, of the form y~Z. This argument is a remnant of the loess function, kept for compatibility.
data: an optional data frame containing the variables in the model (both response and explanatory variables, either the observed ones, Z, or a Euclidean configuration X).
y: (required if no formula is given as the principal argument) the response (dependent variable). Must be numeric, matrix or data.frame.
dist1: a dist or dissimilarity class object. Distances between observations, used to define the local neighborhoods. Weights for observations are computed as a decreasing function of their dist1 distances to the neighborhood center, e.g. a new observation whose response has to be predicted. These weights are then passed to a dblm, where distances are evaluated with dist2.
dist2: a dist or dissimilarity class object. Distances between observations, used for fitting dblm. Default dist2=dist1.
D2.1: a D2 class object. Squared distances matrix between individuals. One of the alternative ways of entering distance information to a function. See the Details section in dblm. See dist1 above for an explanation of its role in this function.
D2.2: a D2 class object. Squared distances between observations. One of the alternative ways of entering distance information to a function. See the Details section in dblm. See dist2 above for an explanation of its role in this function. Default D2.2=D2.1.
G1: a Gram class object. Doubly centered inner product matrix associated with the squared distances matrix D2.1.
G2: a Gram class object. Doubly centered inner product matrix associated with the squared distances matrix D2.2. Default G2=G1.
kind.of.kernel: integer between 1 and 6 which determines the user's choice of smoothing kernel: (1) Epanechnikov (default), (2) Biweight, (3) Triweight, (4) Normal, (5) Triangular, (6) Uniform.
metric1: metric function to be used when computing dist1 from observed explanatory variables. One of "euclidean" (default), "manhattan", or "gower".
metric2: metric function to be used when computing dist2 from observed explanatory variables. One of "euclidean", "manhattan", or "gower". Defaults to the value of metric1.
method.h: sets the method used to choose the optimal bandwidth h. There are five methods: "AIC", "BIC", "OCV", "GCV" (default) and "user.h". "OCV" and "GCV" take the bandwidth minimizing a cross-validatory quantity (either ocv or gcv). "AIC" and "BIC" take the bandwidth minimizing, respectively, the Akaike or Bayesian Information Criterion (see AIC for more details). When method.h is "user.h", the bandwidth is set explicitly by the user through the user.h parameter, which is otherwise optional but in this case becomes mandatory.
weights: an optional numeric vector of weights to be used in the fitting process. By default all individuals have the same weight.
user.h: global bandwidth (smoothing parameter), set by the user, controlling the size of the local neighborhood of Z. (Default: 1st quartile of all the distances d(i,j) in dist1.) Applies only if method.h="user.h".
h.range: a vector of length 2 giving the range for automatic bandwidth choice. (Default: quantiles 0.05 and 0.5 of d(i,j) in dist1.)
noh: number of bandwidth h values within h.range for automatic bandwidth choice (if method.h!="user.h").
k.knn: minimum number of observations with positive weight in each local neighborhood. This avoids runtime errors caused by a bandwidth so small that a neighborhood contains only one observation. By default k.knn=3.
rel.gvar: relative geometric variability (a real number between 0 and 1). In each dblm iteration, take the lowest effective rank with a relative geometric variability greater than or equal to rel.gvar. The default value (rel.gvar=0.95) uses 95% of the total variability.
eff.rank: integer between 1 and the number of observations minus one. Number of Euclidean coordinates used for model fitting in each dblm iteration. If specified, its value overrides rel.gvar. When eff.rank=NULL (default), calls to dblm are made with method="rel.gvar".
...: arguments passed to or from other methods to the low level.
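The relationship between the D2 and Gram inputs, and the role of rel.gvar, can be illustrated in base R. The sketch below is not the package's implementation. It assumes the usual unweighted double centering G = -1/2 J D2 J, and it assumes that relative geometric variability is measured by the cumulative share of the positive eigenvalues of G. The helper names gram_from_d2 and rank_for_rel_gvar are hypothetical.

```r
# Hypothetical helpers, base R only; not part of any package.
gram_from_d2 <- function(D2) {
  n <- nrow(D2)
  J <- diag(n) - matrix(1 / n, n, n)   # centering matrix
  -0.5 * J %*% D2 %*% J                # doubly centered inner products
}

# Assumption: smallest number of coordinates whose eigenvalues reach
# a share rel.gvar of the total (positive) eigenvalue mass.
rank_for_rel_gvar <- function(G, rel.gvar = 0.95) {
  ev <- eigen(G, symmetric = TRUE, only.values = TRUE)$values
  ev <- ev[ev > 1e-8 * max(ev)]        # keep numerically positive eigenvalues
  which(cumsum(ev) / sum(ev) >= rel.gvar)[1]
}

set.seed(1)
Z  <- matrix(rnorm(40), nrow = 20)     # 20 individuals, 2 observed variables
D2 <- as.matrix(dist(Z))^2             # squared Euclidean distances
G  <- gram_from_d2(D2)
r  <- rank_for_rel_gvar(G, 0.95)       # effective rank under the assumption above
```

For Euclidean distances, G coincides with the inner products of the column-centered data, which is why either form carries the same distance information.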
Boj, Eva <evaboj@ub.edu>, Caballe, Adria <adria.caballe@upc.edu>, Delicado, Pedro <pedro.delicado@upc.edu> and Fortiana, Josep <fortiana@ub.edu>
There are two semi-metrics involved in local linear distance-based estimation: dist1 and dist2. The two semi-metrics may coincide, but they need not. For instance, when dist1=||xi-xj|| and dist2=||(xi,xi^2,xi^3)-(xj,xj^2,xj^3)||, the estimator for new observations coincides with fitting a local cubic polynomial regression.
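As a small illustration of the two semi-metrics in this example, the following base R sketch builds dist1 from x alone and dist2 from the cubic features (x, x^2, x^3). It only constructs the distances and fits nothing; the variable names are illustrative.

```r
set.seed(2)
x <- rnorm(30)

# dist1: plain Euclidean distance on x, used to define neighborhoods
dist1 <- dist(x)

# dist2: Euclidean distance between cubic feature vectors, used for fitting
X3    <- cbind(x, x^2, x^3)
dist2 <- dist(X3)

# Both are "dist" objects over the same 30 individuals
d1 <- as.matrix(dist1)
d2 <- as.matrix(dist2)
```

Going by the usage of the dist method, objects like these would be supplied as the dist1 and dist2 arguments, together with a response y.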
The set of bandwidth h values checked in automatic bandwidth choice is defined by h.range and noh, together with k.knn. For each h in this set a local linear model is fitted, and the optimal h is chosen according to the statistic specified in method.h.
kind.of.kernel designates which kernel function is to be used in determining individual weights from dist1 values. See density for more information.
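The six kernel choices can be sketched with their textbook shapes, evaluated at u = d/h, where d is a dist1 distance to the neighborhood center and h is the bandwidth. The function kernel_weight below is hypothetical and omits normalizing constants; the scaling used internally by the package (or by density) may differ.

```r
# Textbook kernel shapes indexed as in kind.of.kernel (constants omitted).
kernel_weight <- function(u, kind.of.kernel = 1) {
  inside <- as.numeric(abs(u) < 1)   # compact support for all but the Normal
  switch(kind.of.kernel,
    (1 - u^2) * inside,        # 1: Epanechnikov
    (1 - u^2)^2 * inside,      # 2: Biweight
    (1 - u^2)^3 * inside,      # 3: Triweight
    exp(-u^2 / 2),             # 4: Normal
    (1 - abs(u)) * inside,     # 5: Triangular
    inside                     # 6: Uniform
  )
}

d <- c(0, 0.5, 1, 2)   # distances to the neighborhood center
h <- 1.5               # bandwidth
w <- kernel_weight(d / h, kind.of.kernel = 1)
```

In every case the weight decreases (or stays constant) with distance, and for the compactly supported kernels it drops to zero once d reaches h.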
Boj E, Caballe A, Delicado P, Esteve A, Fortiana J (2016). Global and local distance-based generalized linear models. TEST 25, 170-195.
Boj E, Delicado P, Fortiana J (2010). Distance-based local linear regression for functional predictors. Computational Statistics and Data Analysis 54, 429-437.
Boj E, Grane A, Fortiana J, Claramunt MM (2007). Selection of predictors in distance-based regression. Communications in Statistics B - Simulation and Computation 36, 87-98.
Cuadras CM, Arenas C, Fortiana J (1996). Some computational aspects of a distance-based model for prediction. Communications in Statistics B - Simulation and Computation 25, 593-609.
Cuadras C, Arenas C (1990). A distance-based regression model for prediction with mixed data. Communications in Statistics A - Theory and Methods 19, 2261-2279.
Cuadras CM (1989). Distance analysis in discrimination and classification using both continuous and categorical variables. In: Y. Dodge (ed.), Statistical Data Analysis and Inference. Amsterdam, The Netherlands: North-Holland Publishing Co., pp. 459-473.
dblm for distance-based linear models.
ldbglm for local distance-based generalized linear models.
summary.ldblm for summary.
plot.ldblm for plots.
predict.ldblm for predictions.
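# Example of use of the ldblm function
library(dbstats)  # package providing ldblm
n <- 100  # number of observations
p <- 1    # number of observed explanatory variables
k <- 5    # scale of the simulated coefficients
Z <- matrix(rnorm(n*p),nrow=n)
# random coefficients for the linear, quadratic and cubic terms
b1 <- matrix(runif(p)*k,nrow=p)
b2 <- matrix(runif(p)*k,nrow=p)
b3 <- matrix(runif(p)*k,nrow=p)
s <- 1    # noise standard deviation
e <- rnorm(n)*s
y <- Z%*%b1 + Z^2%*%b2 +Z^3%*%b3 + e
# squared Euclidean distances, stored as an object of class "D2"
D2 <- as.matrix(dist(Z)^2)
class(D2) <- "D2"
# formula interface, bandwidth chosen by GCV over 3 candidate values
ldblm1 <- ldblm(y~Z,kind.of.kernel=1,method.h="GCV",noh=3,k.knn=3)
# D2 interface, bandwidth fixed at the default user.h
ldblm2 <- ldblm(D2.1=D2,D2.2=D2,y,kind.of.kernel=1,method.h="user.h",k.knn=3)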