ldblm: Local distance-based linear model

Description

ldblm is a localized version of a distance-based linear model. As in the global model dblm, explanatory information is coded as distances between individuals.

Neighborhoods for localizing are defined by the (semi)metric dist1, whereas a second (semi)metric dist2 (which may coincide with dist1) is used for distance-based prediction. Both dist1 and dist2 can either be computed from observed explanatory variables or be input directly as a matrix of squared interdistances or as a Gram matrix. The response is a continuous variable, as in the ordinary linear model. The model allows for a mixture of continuous and qualitative explanatory variables or, in fact, for more general quantities such as functional data.

Notation convention: in distance-based methods we must distinguish observed explanatory variables, which we denote by Z or z, from Euclidean coordinates, which we denote by X or x. For an explanation of the meaning of both terms, see the references below.

Usage

## S3 method for class 'formula':
ldblm(formula,data,...,kind.of.kernel=1,
        metric1="euclidean",metric2=metric1,method="GCV",weights,
        user_h=NULL,h.range=NULL,noh=10,k.knn=3,rel.gvar=0.95,eff.rank=NULL)

# method for distance class 'dist' or 'dissimilarity'
ldblm.dist(y,dist1,dist2=dist1,kind.of.kernel=1,
        method="GCV",weights,user_h=quantile(dist1,.25)^.5,
        h.range=quantile(as.matrix(dist1),c(.05,0.5))^.5,noh=10,
        k.knn=3,rel.gvar=0.95,eff.rank=NULL,...)  

#  method for distance class 'D2'
ldblm.D2(y,D2_1,D2_2=D2_1,kind.of.kernel=1,method="GCV",
        weights,user_h=NULL,h.range=NULL,noh=10,k.knn=3,rel.gvar=0.95,
        eff.rank=NULL,...) 
         
#  method for class 'Gram'
ldblm.Gram(y,G1,G2=G1,kind.of.kernel=1,method="GCV",
        weights,user_h=NULL,h.range=NULL,noh=10,k.knn=3,rel.gvar=0.95,
        eff.rank=NULL,...)

Arguments

formula

an object of class formula. A formula of the form y~Z. This argument is a remnant of the loess function, kept for compatibility.

data

an optional data frame containing the variables in the model (both response and explanatory variables, either the observed ones, Z, or a Euclidean configuration X).

y

(required if no formula is given as the principal argument). Response (dependent variable). Must be numeric, matrix or data.frame.

dist1

a dist or dissimilarity class object. Distances between observations, used to define the local neighborhoods. Weights for observations are computed as a decreasing function of their dist1 distances.

dist2

a dist or dissimilarity class object. Distances between observations, used for fitting dblm. Default dist2=dist1.

D2_1

a D2 class object. Squared distances matrix between individuals. One of the alternative ways of entering distance information to a function. See the Details section in dblm. See above

D2_2

a D2 class object. Squared distances between observations. One of the alternative ways of entering distance information to a function. See the Details section in dblm. See above d

G1

a Gram class object. Doubly centered inner product matrix associated with the squared distances matrix D2_1.

G2

a Gram class object. Doubly centered inner product matrix associated with the squared distances matrix D2_2. Default G2=G1.
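The link between a squared distances matrix and its doubly centered inner product matrix can be illustrated in base R. This is a minimal sketch assuming uniform weights on the observations; the helper name d2_to_gram is hypothetical and is not part of the package.

```r
# Hypothetical helper: Gram matrix from a squared distance matrix by
# double centering, G = -1/2 * J %*% D2 %*% J with J = I - (1/n) 11'.
# Assumes uniform weights on the observations.
d2_to_gram <- function(D2) {
  D2 <- unclass(as.matrix(D2))
  n <- nrow(D2)
  J <- diag(n) - matrix(1 / n, n, n)
  -0.5 * J %*% D2 %*% J
}

set.seed(1)
Z <- matrix(rnorm(20), nrow = 10)
D2 <- as.matrix(dist(Z))^2
G <- d2_to_gram(D2)

# For Euclidean distances, G equals the inner products of the centered data.
Zc <- scale(Z, center = TRUE, scale = FALSE)
max(abs(G - Zc %*% t(Zc)))  # numerically zero
```

Rows and columns of G sum to zero, which is what "doubly centered" refers to.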

kind.of.kernel

integer number between 1 and 6 which determines the user's choice of smoothing kernel. (1) Epanechnikov (Default), (2) Biweight, (3) Triweight, (4) Normal, (5) Triangular, (6) Uniform.
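As a rough illustration of how the six kernels turn distances into weights, the following base R sketch evaluates each kernel profile at u = d/h. The function kernel_weights is hypothetical, normalizing constants are dropped, and the scaling of the Normal kernel relative to h is an assumption; the package may differ in these details.

```r
# Kernel profiles on u = d / h, numbered as in kind.of.kernel.
# Normalizing constants are omitted: only relative weights matter here.
kernel_weights <- function(d, h, kind.of.kernel = 1) {
  u <- d / h
  inside <- as.numeric(abs(u) <= 1)
  switch(as.character(kind.of.kernel),
    "1" = (1 - u^2) * inside,      # Epanechnikov
    "2" = (1 - u^2)^2 * inside,    # Biweight
    "3" = (1 - u^2)^3 * inside,    # Triweight
    "4" = exp(-u^2 / 2),           # Normal (assumed scale: h as standard deviation)
    "5" = (1 - abs(u)) * inside,   # Triangular
    "6" = inside,                  # Uniform
    stop("kind.of.kernel must be an integer between 1 and 6")
  )
}

d <- c(0, 0.5, 1, 2)        # distances from a target point
kernel_weights(d, h = 1)    # Epanechnikov: 1.00 0.75 0.00 0.00
```

All kernels except the Normal give zero weight to observations farther than h from the target point, which is why a lower bound such as k.knn on the neighborhood size is useful.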

metric1

metric function to be used when computing dist1 from observed explanatory variables. One of "euclidean" (default), "manhattan", or "gower".

metric2

metric function to be used when computing dist2 from observed explanatory variables. One of "euclidean", "manhattan", or "gower". Default metric2=metric1.

method

sets the method to be used in deciding the optimal bandwidth h. There are five different methods, AIC, BIC, OCV, GCV (default) and user_h. OCV and GCV

weights

an optional numeric vector of weights to be used in the fitting process. By default all individuals have the same weight.

user_h

global bandwidth user_h, set by the user, controlling the size of the local neighborhood of Z. Smoothing parameter (Default: 1st quartile of all the distances d(i,j) in dist1). Applies only if method="user_

h.range

a vector of length 2 giving the range for automatic bandwidth choice. (Default: quantiles 0.05 and 0.5 of d(i,j) in dist1).

noh

number of bandwidth h values within h.range for automatic bandwidth choice (if method!="user_h").
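The candidate bandwidths implied by h.range and noh can be sketched as follows. The range is computed as in the ldblm.dist usage shown above; the equal spacing of the candidates is an assumption made for illustration, not something this page states.

```r
set.seed(1)
Z <- matrix(rnorm(50), nrow = 50)
dist1 <- dist(Z)

# Default range as written in the ldblm.dist usage above.
h.range <- unname(quantile(as.matrix(dist1), c(.05, 0.5))^.5)
noh <- 10

# Assumption: the candidates are equally spaced between the two endpoints.
h.grid <- seq(h.range[1], h.range[2], length.out = noh)
h.grid
```

Each value in h.grid would then be scored with the criterion chosen in method, and the best one kept.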

k.knn

minimum number of observations with positive weight in each local neighborhood. This avoids runtime errors caused by a bandwidth so small that some neighborhoods contain only one observation. By default k.knn=3.

rel.gvar

relative geometric variability (a real number between 0 and 1). In each dblm iteration, take the lowest effective rank, with a relative geometric variability higher or equal to rel.gvar. Default value (rel.gv

eff.rank

integer between 1 and the number of observations minus one. Number of Euclidean coordinates used for model fitting in each dblm iteration. If specified its value overrides rel.gvar. When eff.rank=NULL (default
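The interplay between rel.gvar and the effective rank can be illustrated with a small base R sketch. It assumes uniform weights and measures geometric variability through the eigenvalues of the Gram matrix; the helper effective_rank is hypothetical and may not match the package's internal computation.

```r
# Hypothetical helper: smallest number of principal coordinates whose
# eigenvalues account for at least rel.gvar of the total.
# Assumes uniform weights and a Euclidean (positive semidefinite) Gram matrix.
effective_rank <- function(G, rel.gvar = 0.95) {
  lambda <- eigen(G, symmetric = TRUE, only.values = TRUE)$values
  lambda <- lambda[lambda > 1e-10 * max(lambda)]  # drop numerically null directions
  which(cumsum(lambda) / sum(lambda) >= rel.gvar)[1]
}

set.seed(1)
# Three columns with very different spreads, so few coordinates dominate.
Z <- matrix(rnorm(60), nrow = 20) %*% diag(c(10, 1, 0.1))
Zc <- scale(Z, center = TRUE, scale = FALSE)
G <- Zc %*% t(Zc)

effective_rank(G, rel.gvar = 0.5)
effective_rank(G, rel.gvar = 0.999)
```

Raising rel.gvar can only keep or increase the number of coordinates retained.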

...

arguments passed to or from other methods to the low level. Currently not used.

Value

A list of class ldblm containing the following components:
residuals

the residuals (response minus fitted values).

fitted.values

the fitted mean values.

h_opt

the optimal bandwidth h used in the fitting process (if method!="user_h").

Shat

the smoothing hat projector.

weights

the specified weights.

y

the response variable used.

call

the matched call.

dist1

the distance matrix (object of class "D2" or "dist") used to calculate the weights of the observations.

dist2

the distance matrix (object of class "D2" or "dist") used to fit the dblm.

Details

There are two semi-metrics involved in local linear distance-based estimation: dist1 and dist2. The two may coincide, but they need not. For instance, when dist1=||xi-xj|| and dist2=||(xi,xi^2,xi^3)-(xj,xj^2,xj^3)||, the estimator for new observations coincides with the one obtained by fitting a local cubic polynomial regression.

The set of bandwidth h values checked in automatic bandwidth choice is defined by h.range and noh, together with k.knn. For each h in this set a local linear model is fitted, and the optimal h is decided according to the statistic specified in method.

kind.of.kernel designates which kernel function is to be used in determining individual weights from dist1 values. See density for more information.
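The two distances in the local cubic example can be built with base R as follows. This sketch only constructs the dist objects; the variable names are illustrative, and the objects would then be supplied as dist1 and dist2 to the 'dist' method shown in Usage.

```r
set.seed(1)
x <- runif(30, -1, 1)

# dist1: plain Euclidean distance on x, used to define neighborhoods.
dist1 <- dist(x)

# dist2: Euclidean distance on the cubic feature map (x, x^2, x^3),
# used for the distance-based fit inside each neighborhood.
dist2 <- dist(cbind(x, x^2, x^3))

# dist2 adds non-negative squared terms to dist1, so it is never smaller.
summary(as.vector(dist2) / as.vector(dist1))
```

Keeping dist1 on the original scale means the neighborhoods are those of ordinary local regression, while the richer dist2 controls the shape of the local fit.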

References

Boj E, Delicado P, Fortiana J (2010). Distance-based local linear regression for functional predictors. Computational Statistics and Data Analysis 54, 429-437.

Boj E, Grane A, Fortiana J, Claramunt MM (2007). Selection of predictors in distance-based regression. Communications in Statistics B - Simulation and Computation 36, 87-98.

Cuadras CM, Arenas C, Fortiana J (1996). Some computational aspects of a distance-based model for prediction. Communications in Statistics B - Simulation and Computation 25, 593-609.

Cuadras C, Arenas C (1990). A distance-based regression model for prediction with mixed data. Communications in Statistics A - Theory and Methods 19, 2261-2279.

Cuadras CM (1989). Distance analysis in discrimination and classification using both continuous and categorical variables. In: Y. Dodge (ed.), Statistical Data Analysis and Inference. Amsterdam, The Netherlands: North-Holland Publishing Co., pp. 459-473.

Examples

# example of use of the ldblm function
library(dbstats)
n <- 100
p <- 1
k <- 5

Z <- matrix(rnorm(n*p),nrow=n)
b1 <- matrix(runif(p)*k,nrow=p)
b2 <- matrix(runif(p)*k,nrow=p)
b3 <- matrix(runif(p)*k,nrow=p)

s <- 1
e <- rnorm(n)*s


y <- Z%*%b1 + Z^2%*%b2 +Z^3%*%b3 + e

D2<-as.matrix(dist(Z)^2)
class(D2)<-"D2"

ldblm1<-ldblm(y~Z,kind.of.kernel=1,method="GCV",noh=3,k.knn=3)
ldblm2<-ldblm.D2(y,D2_1=D2,D2_2=D2,kind.of.kernel=1,method="user_h",k.knn=3)
