dbplsr: Distance-based partial least squares regression

Description

dbplsr is a variety of partial least squares regression where explanatory information is coded as distances between individuals. These distances can either be computed from observed explanatory variables or directly input as a squared interdistances matrix. Since distances can be computed from a mixture of continuous and qualitative explanatory variables or, in fact, from more general quantities, dbplsr is a proper extension of plsr. Notation convention: in distance-based methods we must distinguish observed explanatory variables which we denote by Z or z, from Euclidean coordinates which we denote by X or x. For explanation on the meaning of both terms see the bibliography references below.

Usage

## S3 method for class 'formula':
dbplsr(formula,data,...,metric="euclidean",
        method="ncomp",weights,ncomp) 

# method for distance class 'dist' or 'dissimilarity'
dbplsr.dist(y,distance,...,weights,ncomp=ncomp,method="ncomp")

#  method for distance class 'D2'
dbplsr.D2(y,D2,...,weights,ncomp=ncomp,method="ncomp")

#  method for class 'Gram'
dbplsr.Gram(y,G,...,weights,ncomp=ncomp,method="ncomp")

Arguments

formula

an object of class formula. A formula of the form y~Z. This argument is a remnant of the plsr function, kept for compatibility.

data

an optional data frame containing the variables in the model (both response and explanatory variables, either the observed ones, Z, or a Euclidean configuration X).

y

(required if no formula is given as the principal argument). Response (dependent variable); must be numeric, matrix or data.frame.

distance

a dist or dissimilarity class object. See functions dist in the package stats and daisy in the package cluster.

D2

a D2 class object. Squared distances matrix between individuals. See details below to learn the usage of dblm.D2.

G

a Gram class object. Weighted centered inner products matrix of the squared distances matrix D2. See details in dblm.

metric

metric function to be used when computing distances from observed explanatory variables. One of "euclidean" (default), "manhattan", or "gower".

method

sets the method used to decide how many components are needed to fit the best model for new predictions. There are five different methods: "AIC", "BIC", "OCV", "GCV" and "ncomp".

weights

an optional numeric vector of weights to be used in the fitting process. By default all individuals have the same weight.

ncomp

the number of components to include in the model.

...

arguments passed to or from other methods to the low level. Currently not used.

Value

A list of class dbplsr containing the following components:
residuals: a list containing the residuals (response minus fitted values) for each iteration.
fitted.values: a list containing the fitted values for each iteration.
fk: a list containing the scores for each iteration.
bk: regression coefficients. fitted.values = fk*bk
Pk: orthogonal projector on the one-dimensional linear space spanned by fk.
ncomp: number of components included in the model.
ncomp_opt: optimum number of components according to the selected method.
weights: the specified weights.
method: the method used.
y: the response used to fit the model.
Hhat: the hat matrix projector.
G0: initial weighted centered inner products matrix of the squared distance matrix.
Gk: weighted centered inner products matrix in the last iteration.
gvar: total weighted geometric variability.
gvec: the diagonal entries in G0.
gvar.iter: geometric variability for each iteration.
ocv: the ordinary cross-validation estimate of the prediction error.
gcv: the generalized cross-validation estimate of the prediction error.
aic: the Akaike Information Criterion (AIC) of the model.
bic: the Bayesian Information Criterion (BIC) of the model.
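
The relationship between a squared distances matrix, its weighted centered inner products (Gram) matrix and the geometric variability can be sketched in base R. This is only an illustration under the standard definitions from the distance-based regression literature (G = -1/2 J D2 J' with J = I - 1w', and geometric variability 1/2 w' D2 w); the toy data and all variable names are made up for the example, and the internal computations of dbplsr may differ in detail.

```r
# Toy Euclidean configuration: 5 individuals, 2 coordinates (made-up data).
X <- matrix(c(1, 2, 0, 3, 5,
              2, 1, 4, 3, 0), ncol = 2)
n <- nrow(X)
w <- rep(1 / n, n)                     # uniform weights summing to 1

# Squared interdistances matrix between individuals.
D2 <- as.matrix(dist(X))^2

# Weighted centering matrix J = I - 1 w' and Gram matrix G = -1/2 J D2 J'.
J <- diag(n) - outer(rep(1, n), w)
G <- -0.5 * J %*% D2 %*% t(J)

# Geometric variability computed in two equivalent ways:
# from the squared distances, and from the diagonal of G.
gvar_from_D2 <- 0.5 * drop(t(w) %*% D2 %*% w)
gvar_from_G  <- sum(w * diag(G))

# For Euclidean distances, G coincides with the inner products of the
# coordinates centered at their weighted mean.
Xc <- J %*% X
G_from_coords <- Xc %*% t(Xc)
```

With non-uniform weights the same formulas apply, provided the weights are positive and sum to one.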

Details

Partial least squares (PLS) is a method for constructing predictive models when the predictors (Z) are many and highly collinear. A PLS model seeks the directions in the Z space that explain as much as possible of the variance in the Y space. dbplsr is particularly suited to cases where the matrix of predictors has more variables than observations. By contrast, standard regression (dblm) will fail in these cases. The various possible ways of inputting the model explanatory information through distances, their squares, etc., are the same as in dblm. The number of components to fit is specified with the argument ncomp.
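
To see why PLS can be phrased in terms of distances, note that the first classical PLS score depends on the centered predictors only through their inner products matrix. The base R sketch below illustrates this for a single response and uniform weights, with more predictors than observations; it uses simulated data and hypothetical variable names, and is a simplified illustration of the idea rather than the algorithm implemented in dbplsr.

```r
set.seed(1)
n <- 20; p <- 50                       # more predictors than observations
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
y <- rnorm(n)
yc <- y - mean(y)                      # centered response

# Classical PLS1: weight vector along X'y, first score t1 = X w.
w_dir <- crossprod(X, yc)
w_dir <- w_dir / sqrt(sum(w_dir^2))
t1 <- drop(X %*% w_dir)

# The same score direction obtained from the inner products matrix alone.
G  <- tcrossprod(X)                    # G = X X'
f1 <- drop(G %*% yc)

# Orthogonal projector on the line spanned by f1, the regression
# coefficient, and the one-component fitted values.
P1   <- tcrossprod(f1) / sum(f1^2)
b1   <- sum(f1 * yc) / sum(f1^2)
fit1 <- mean(y) + f1 * b1
```

Here t1 and f1 differ only by a positive scale factor, so both lead to the same one-component fit.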

References

Boj E, Delicado P, Fortiana J (2010). Distance-based local linear regression for functional predictors. Computational Statistics and Data Analysis 54, 429-437.

Boj E, Grane A, Fortiana J, Claramunt MM (2007). Implementing PLS for distance-based regression: computational issues. Computational Statistics 22, 237-248.

Boj E, Grane A, Fortiana J, Claramunt MM (2007). Selection of predictors in distance-based regression. Communications in Statistics B - Simulation and Computation 36, 87-98.

Cuadras CM, Arenas C, Fortiana J (1996). Some computational aspects of a distance-based model for prediction. Communications in Statistics B - Simulation and Computation 25, 593-609.

Cuadras C, Arenas C (1990). A distance-based regression model for prediction with mixed data. Communications in Statistics A - Theory and Methods 19, 2261-2279.

Cuadras CM (1989). Distance analysis in discrimination and classification using both continuous and categorical variables. In: Y. Dodge (ed.), Statistical Data Analysis and Inference. Amsterdam, The Netherlands: North-Holland Publishing Co., pp. 459-473.

Examples


#require(pls)
library(pls)      # provides the yarn data set
library(dbstats)  # provides dbplsr
data(yarn)
## Default methods:
yarn.dbplsr <- dbplsr(density ~ NIR, data = yarn, ncomp=6, method="GCV")
