dblm: Distance-based linear model

Description

dblm is a variety of linear model in which explanatory information is coded as distances between individuals. These distances can either be computed from observed explanatory variables or be supplied directly as a matrix of squared distances between individuals. The response is a continuous variable, as in the ordinary linear model. Since distances can be computed from a mixture of continuous and qualitative explanatory variables or, in fact, from more general quantities, dblm is a proper extension of lm.

Notation convention: in distance-based methods we must distinguish the observed explanatory variables, which we denote by Z or z, from the Euclidean coordinates, which we denote by X or x. For an explanation of both terms, see the references below.

Usage

## S3 method for class 'formula':
dblm(formula,data,...,metric="euclidean",method="OCV",
        full_search=FALSE,weights,rel.gvar=0.95,eff.rank)

# method for distance class 'dist' or 'dissimilarity'
dblm.dist(y,distance,...,method="OCV",full_search=FALSE,
        weights,rel.gvar=0.95,eff.rank)

#  method for distance class 'D2'
dblm.D2(y,D2,...,method="OCV",full_search=FALSE,weights,rel.gvar=0.95,
        eff.rank)
              
#  method for class 'Gram'
dblm.Gram(y,G,...,method="OCV",full_search=FALSE,weights,rel.gvar=0.95,
        eff.rank)

Arguments

formula

an object of class formula. A formula of the form y~Z. This argument is a remnant of the lm function, kept for compatibility.

data

an optional data frame containing the variables in the model (both response and explanatory variables, either the observed ones, Z, or a Euclidean configuration X).

y

(required if no formula is given as the principal argument). Response (dependent variable) must be numeric, matrix or data.frame.

distance

a dist or dissimilarity class object. See functions dist in the package stats and daisy in the package cluster.

D2

a D2 class object. Matrix of squared distances between individuals. See Details below for the usage of dblm.D2.

G

a Gram class object. Doubly centered inner product matrix of the squared distances matrix D2. See Details below for the usage of dblm.Gram.

metric

metric function to be used when computing distances from observed explanatory variables. One of "euclidean" (default), "manhattan", or "gower".

method

sets the method to be used in deciding the effective rank, which is defined as the number of linearly independent Euclidean coordinates used in prediction. There are six different methods: "AIC", "BIC", "OCV" (default), "GCV", "eff.rank" and "rel.gvar".

full_search

sets which optimization procedure will be used to minimize the modelling criterion specified in method. Needs to be specified only if method is "AIC", "BIC", "OCV" or "GCV".

weights

an optional numeric vector of weights to be used in the fitting process. By default all individuals have the same weight.

rel.gvar

relative geometric variability (real between 0 and 1). Take the lowest effective rank with a relative geometric variability higher than or equal to rel.gvar. The default value (rel.gvar=0.95) uses 95% of the total variability. Applies only if method="rel.gvar".

eff.rank

integer between 1 and the number of observations minus one. Number of Euclidean coordinates used for model fitting. Applies only if method="eff.rank".

...

arguments passed to or from other methods. Currently not used.

Value

A list of class dblm containing the following components:

residuals

the residuals (response minus fitted values).

fitted.values

the fitted mean values.

df.residuals

the residual degrees of freedom.

weights

the specified weights.

y

the response used to fit the model.

Hhat

the hat matrix projector.

call

the matched call.

rel.gvar

the relative geometric variability used to fit the model.

eff.rank

the number of dimensions chosen to estimate the model.

ocv

the ordinary cross-validation estimate of the prediction error.

gcv

the generalized cross-validation estimate of the prediction error.

aic

the value of the Akaike Information Criterion of the model (only if method="AIC").

bic

the value of the Bayesian Information Criterion of the model (only if method="BIC").

Details

The dblm model uses the matrix of distances between individuals to build a prediction method. There are many ways to compute this matrix besides the three metrics offered as options of this function. Several R functions address this problem, in particular dist in the package stats and daisy in the package cluster (the three metrics in dblm call the daisy function).
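
As an illustration of computing distances outside dblm, the following minimal sketch calls daisy directly. The data frame Z and the matrix Znum are hypothetical toy data, not taken from the package; the sketch only assumes that the recommended package cluster is installed.

```r
library(cluster)  # recommended package shipped with R; provides daisy()

# Hypothetical mixed data: one continuous and one qualitative variable
Z <- data.frame(
  age   = c(23, 35, 47, 51, 62),
  group = factor(c("a", "b", "a", "c", "b"))
)

# Gower dissimilarities handle the numeric and factor columns together
d_gower <- daisy(Z, metric = "gower")
class(d_gower)          # "dissimilarity" "dist"

# For purely numeric data, the Euclidean metric agrees with stats::dist
Znum  <- matrix(c(1, 2, 4, 0, 3, 1), ncol = 2)
d_euc <- daisy(Znum, metric = "euclidean")
all.equal(as.vector(d_euc), as.vector(dist(Znum)))
```

An object such as d_gower is of class "dissimilarity", which is the kind of input the distance argument is described as accepting.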
	
Another way to supply a distance matrix to the model is through an object of class "D2" (containing the matrix of squared distances). An object of class "dist" or "dissimilarity" can easily be transformed into one of class "D2"; see disttoD2. Conversely, an object of class "D2" can be transformed into one of class "dist"; see D2toDist.
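
The relationship between a "dist" object and a matrix of squared distances can be sketched in base R as follows. This is only an illustration of the underlying arithmetic: it builds a plain numeric matrix, not an object of class "D2", and it does not reproduce what disttoD2 or D2toDist do internally. The matrix Z is hypothetical toy data.

```r
# Three hypothetical points on a line through the origin
Z <- matrix(c(0, 0,
              3, 4,
              6, 8), ncol = 2, byrow = TRUE)

d  <- dist(Z)                 # class "dist": distances 5, 10, 5
D2 <- as.matrix(d)^2          # full n x n matrix of squared distances

# Going back: take square roots and store as a "dist" object again
d_back <- as.dist(sqrt(D2))
all.equal(as.vector(d), as.vector(d_back))   # TRUE
```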
  
The S3 method for class "Gram" uses the doubly centered inner product matrix G=XX'. It can also easily be transformed into an object of class "D2". See D2toG and GtoD2.
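
The link between squared distances and the doubly centered inner product matrix can be sketched in base R. The sketch assumes equal weights for all individuals (the package may center with the weights instead) and uses plain matrices rather than objects of class "D2" or "Gram"; X, J, and the other names are illustrative only.

```r
set.seed(1)
n  <- 6
X  <- matrix(rnorm(n * 2), nrow = n)   # hypothetical Euclidean coordinates
D2 <- as.matrix(dist(X))^2             # squared distances between rows

# Double centering: G = -1/2 * J %*% D2 %*% J, with J = I - (1/n) 1 1'
J <- diag(n) - matrix(1 / n, n, n)
G <- -0.5 * J %*% D2 %*% J

# G equals the inner products of the column-centered coordinates
Xc <- scale(X, center = TRUE, scale = FALSE)
all.equal(G, Xc %*% t(Xc), check.attributes = FALSE)

# Back to squared distances: D2[i, j] = G[i, i] + G[j, j] - 2 * G[i, j]
g <- diag(G)
D2_back <- outer(g, g, "+") - 2 * G
all.equal(D2_back, D2, check.attributes = FALSE)
```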
  
The weights vector is useful when the responses for different individuals have different variances. In this case the weights should be (proportional to) the reciprocals of the variances.
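
A minimal sketch of building such weights, assuming the error variances are known. The variance vector v and the simulated data are hypothetical, and lm is used here only to show that rescaling the weights does not change a weighted least squares fit.

```r
# Hypothetical known error variances for five individuals
v <- c(1, 1, 4, 4, 16)

w <- 1 / v            # weights proportional to the reciprocal variances
w <- w / sum(w)       # optional rescaling; only the proportions matter

# Weighted least squares in base R, with raw and rescaled weights
set.seed(2)
z <- rnorm(5)
y <- 1 + 2 * z + rnorm(5, sd = sqrt(v))
fit_a <- lm(y ~ z, weights = 1 / v)
fit_b <- lm(y ~ z, weights = w)
all.equal(coef(fit_a), coef(fit_b))   # rescaling weights leaves the fit unchanged
```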
   
When using method="eff.rank" or method="rel.gvar", a compromise between the possible consequences of a bad choice has to be reached. If the rank is too large, the model can be overfitted, possibly leading to an increased prediction error for new cases (even though R2 is higher). On the other hand, a rank that is too small may yield an inadequate model (R2 is small). The other four methods are less error prone (but they still do not guarantee good predictions).
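
The trade-off described above can be explored with a small base R sketch that regresses the response on the first k principal coordinates for each candidate rank k. Everything here is an assumption made for illustration: equal weights, relative geometric variability taken as the cumulative share of the eigenvalues of G, and the textbook hat-matrix formulas for OCV and GCV. The package may define or scale these quantities differently.

```r
set.seed(3)
n <- 60
Z <- matrix(rnorm(n * 4), nrow = n)               # hypothetical predictors
y <- drop(Z[, 1:2] %*% c(2, -1)) + rnorm(n)       # only two of them matter

# Principal coordinates from the doubly centered Gram matrix
D2 <- as.matrix(dist(Z))^2
J  <- diag(n) - matrix(1 / n, n, n)
G  <- -0.5 * J %*% D2 %*% J
eg <- eigen(G, symmetric = TRUE)
r  <- sum(eg$values > 1e-8 * eg$values[1])        # numerical rank (4 here)
X  <- sweep(eg$vectors[, 1:r, drop = FALSE], 2, sqrt(eg$values[1:r]), "*")

# One row of criteria per candidate effective rank
rank_table <- t(sapply(1:r, function(k) {
  Xk  <- cbind(1, X[, 1:k, drop = FALSE])
  H   <- Xk %*% solve(crossprod(Xk), t(Xk))       # hat matrix for rank k
  res <- y - drop(H %*% y)
  c(rank     = k,
    rel.gvar = sum(eg$values[1:k]) / sum(eg$values[1:r]),
    R2       = 1 - sum(res^2) / sum((y - mean(y))^2),
    OCV      = mean((res / (1 - diag(H)))^2),     # leave-one-out error
    GCV      = mean(res^2) / (1 - sum(diag(H)) / n)^2)
}))
rank_table

# Lowest rank whose relative geometric variability reaches 0.95
min(which(rank_table[, "rel.gvar"] >= 0.95))
```

In this sketch R2 can only grow with the rank, whereas OCV and GCV penalize the extra coordinates, which is why a criterion-based choice is usually safer than fixing the rank by hand.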

References

Boj E, Delicado P, Fortiana J (2010). Distance-based local linear regression for functional predictors. Computational Statistics and Data Analysis 54, 429-437.

Boj E, Grane A, Fortiana J, Claramunt MM (2007). Selection of predictors in distance-based regression. Communications in Statistics B - Simulation and Computation 36, 87-98.

Cuadras CM, Arenas C, Fortiana J (1996). Some computational aspects of a distance-based model for prediction. Communications in Statistics B - Simulation and Computation 25, 593-609.

Cuadras C, Arenas C (1990). A distance-based regression model for prediction with mixed data. Communications in Statistics A - Theory and Methods 19, 2261-2279.

Cuadras CM (1989). Distance analysis in discrimination and classification using both continuous and categorical variables. In: Y. Dodge (ed.), Statistical Data Analysis and Inference. Amsterdam, The Netherlands: North-Holland Publishing Co., pp. 459-473.

See Also

summary.dblm for summary.
plot.dblm for plots.
predict.dblm for predictions.
ldblm for distance-based local linear models.

Examples

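# easy example to illustrate usage of the dblm function
library(dbstats)   # provides dblm and its methods
set.seed(1)        # make the simulated data reproducible

n <- 100
p <- 3
k <- 5

Z <- matrix(rnorm(n*p), nrow=n)   # observed explanatory variables
b <- matrix(runif(p)*k, nrow=p)   # true regression coefficients
s <- 1
e <- rnorm(n)*s                   # Gaussian noise with standard deviation s
y <- Z%*%b + e

D <- dist(Z)                      # Euclidean distances between individuals

dblm1 <- dblm.dist(y, D)
lm1 <- lm(y~Z)
# the same fitted values with the lm
mean(lm1$fitted.values - dblm1$fitted.values)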