The function spls
performs compression and variable selection
in the context of linear regression (with possible prediction)
using the adaptive SPLS algorithm of Durif et al. (2017).
spls(Xtrain, Ytrain, lambda.l1, ncomp, weight.mat = NULL, Xtest = NULL,
adapt = TRUE, center.X = TRUE, center.Y = TRUE, scale.X = TRUE,
scale.Y = TRUE, weighted.center = FALSE)
Xtrain: a (ntrain x p) data matrix of predictor values. Xtrain must be a matrix. Each row corresponds to an observation and each column to a predictor variable.
Ytrain: a (ntrain) vector of (continuous) responses. Ytrain must be a vector or a one column matrix, and contains the response variable for each observation.
lambda.l1: a positive real value, in [0,1]. lambda.l1 is the sparse penalty parameter for the dimension reduction step by sparse PLS (see details).
ncomp: a positive integer. ncomp is the number of PLS components.
weight.mat: a (ntrain x ntrain) matrix used to weight the l2 metric in the observation space. It can be the inverse of the covariance matrix of the Ytrain observations in a heteroskedastic context. If NULL, the l2 metric is the standard one, corresponding to a homoskedastic model (weight.mat is the identity matrix).
Xtest: a (ntest x p) matrix containing the predictor values for the test data set. Xtest may also be a vector of length p (corresponding to only one test observation). Default value is NULL, meaning that no prediction is performed.
adapt: a boolean value indicating whether the sparse PLS selection step should be adaptive or not (see details).
center.X: a boolean value indicating whether the data matrices Xtrain and Xtest (if provided) should be centered or not.
center.Y: a boolean value indicating whether the response values Ytrain should be centered or not.
scale.X: a boolean value indicating whether the data matrices Xtrain and Xtest (if provided) should be scaled or not (scale.X=TRUE implies center.X=TRUE).
scale.Y: a boolean value indicating whether the response values Ytrain should be scaled or not (scale.Y=TRUE implies center.Y=TRUE).
weighted.center: a boolean value indicating whether the centering should take into account the weighted l2 metric or not (if TRUE, it requires that weight.mat is non NULL).
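As an illustration of the weight.mat argument described above, here is a minimal base R sketch of how a weight matrix could be built in a heteroskedastic setting. The per-observation noise variances (noise.var) are hypothetical values chosen for the example; in practice they would have to be known or estimated, and this sketch does not reproduce what spls does internally with the matrix.

```r
## Hypothetical heteroskedastic setting: one noise variance per
## training observation (assumed known here, for illustration only).
ntrain <- 5
noise.var <- c(0.5, 1, 2, 4, 8)

## Inverse of the (diagonal) covariance matrix of the responses.
weight.mat <- diag(1 / noise.var)

## Under the weighted l2 metric, the squared norm of a vector u is
## t(u) %*% weight.mat %*% u, so noisier observations count less.
u <- rep(1, ntrain)
weighted.norm2 <- drop(t(u) %*% weight.mat %*% u)

## With the identity matrix, the standard l2 metric is recovered.
standard.norm2 <- drop(t(u) %*% diag(ntrain) %*% u)
```

Such a matrix could then be passed as weight.mat=weight.mat in the call to spls, provided its dimension matches the number of training observations.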
An object of class spls with the following attributes:
the ntrain x p predictor matrix.
the response observations.
the predictor matrix, centered and scaled if requested.
the response, centered and scaled if requested.
the linear coefficients in the model sYtrain = sXtrain %*% betahat + residuals.
the (p+1) vector containing the intercept and the coefficients for the non centered and non scaled model Ytrain = cbind(rep(1,ntrain),Xtrain) %*% betahat.nc + residuals.nc.
the (p) vector of Xtrain column means, used for centering if requested.
the (p) vector of Xtrain column standard deviations, used for scaling if requested.
the mean of Ytrain, used for centering if requested.
the standard deviation of Ytrain, used for scaling if requested.
a (n x ncomp) matrix of observation coordinates (scores) in the new component basis produced by the compression step (sparse PLS). Each column t.k of X.score is an SPLS component.
a (n x ncomp) matrix of PLS components computed with the selected predictors only.
the (ncomp x p) matrix of coefficients in the regression of Xtrain on the new components X.score.
the (ncomp) vector of coefficients in the regression of Ytrain on the SPLS components X.score.
a (p x ncomp) matrix of the coefficients of the predictors in each component produced by sparse PLS. Each column w.k of X.weight satisfies t.k = Xtrain %*% w.k (as a matrix product).
the (ntrain) vector of residuals in the model sYtrain = sXtrain %*% betahat + residuals.
the (ntrain) vector of residuals in the non centered and non scaled model Ytrain = cbind(rep(1,ntrain),Xtrain) %*% betahat.nc + residuals.nc.
the (ntrain) vector containing the estimated response values on the train set of centered and scaled (if requested) predictors sXtrain, hatY = sXtrain %*% betahat.
the (ntrain) vector containing the estimated response values on the train set of non centered and non scaled predictors Xtrain, hatY.nc = cbind(rep(1,ntrain),Xtrain) %*% betahat.nc.
the (ntest) vector containing the predicted values for the response on the centered and scaled test set sXtest (if provided), hatYtest = sXtest %*% betahat.
the (ntest) vector containing the predicted values for the response on the non centered and non scaled test set Xtest (if provided), hatYtest.nc = cbind(rep(1,ntest),Xtest) %*% betahat.nc.
the active set of predictors selected by the procedure. A is a subset of 1:p.
a (ncomp) list of the coefficient vectors betahat in the model with k components, for k=1,...,ncomp.
a (ncomp) list of subsets of (1:p) indicating the variables that are selected when constructing the component k, for k=1,...,ncomp.
the sparse hyper-parameter used to fit the model.
the number of components used to fit the model.
the (ntrain x ntrain) matrix used to weight the metric in the sparse PLS step.
a boolean value, indicating whether the sparse PLS selection step was adaptive or not.
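The link between the standardized coefficients betahat and the original-scale vector betahat.nc listed above can be checked with a small base R sketch. It is only an illustration under stated assumptions: ordinary least squares stands in for the sparse PLS fit, scaling is done with the column standard deviations, and the names meanXtrain and sigmaXtrain are local variables chosen for the example, not confirmed attribute names of the spls object.

```r
## Base R sketch (not the plsgenomics implementation): ordinary least
## squares is used as a stand-in for the sparse PLS fit.
set.seed(1)
ntrain <- 50
p <- 3
Xtrain <- matrix(rnorm(ntrain * p, mean = 5, sd = 2), ntrain, p)
Ytrain <- drop(1 + Xtrain %*% c(0.5, -1, 2) + rnorm(ntrain))

## Centering and scaling statistics (local, illustrative names).
meanXtrain  <- colMeans(Xtrain)
sigmaXtrain <- apply(Xtrain, 2, sd)
meanYtrain  <- mean(Ytrain)
sigmaYtrain <- sd(Ytrain)

sXtrain <- scale(Xtrain, center = meanXtrain, scale = sigmaXtrain)
sYtrain <- (Ytrain - meanYtrain) / sigmaYtrain

## Coefficients on the standardized scale (no intercept is needed
## because both sXtrain and sYtrain are centered).
betahat <- unname(coef(lm(sYtrain ~ sXtrain - 1)))

## Back-transformation to the non centered and non scaled model.
slopes     <- sigmaYtrain * betahat / sigmaXtrain
intercept  <- meanYtrain - sum(meanXtrain * slopes)
betahat.nc <- c(intercept, slopes)

## Fitted values under both parameterizations.
hatY    <- drop(sXtrain %*% betahat)
hatY.nc <- drop(cbind(rep(1, ntrain), Xtrain) %*% betahat.nc)

## The two agree once hatY is mapped back to the original scale.
same.fit <- all.equal(hatY * sigmaYtrain + meanYtrain, hatY.nc)
```

The same rescaling explains why predictions on the standardized scale (hatYtest) must be compared with a response standardized using the training mean and standard deviation, while hatYtest.nc can be compared with the raw response directly.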
The columns of the data matrices Xtrain and Xtest need not be standardized beforehand, since standardization can be performed by the function spls as a preliminary step.
The procedure described in Durif et al. (2017) is used to compute
latent sparse components that are used in a regression model.
In addition, when a matrix Xtest is supplied, the procedure
predicts the response associated with these new values of the predictors.
Durif G., Modolo L., Michaelsson J., Mold J. E., Lambert-Lacroix S., Picard F. (2017). High Dimensional Classification with combined Adaptive Sparse PLS and Logistic Regression, (in prep), available on (http://arxiv.org/abs/1502.05933).
Chun, H., & Keles, S. (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society. Series B (Methodological), 72(1), 3-25. doi:10.1111/j.1467-9868.2009.00723.x
# NOT RUN {
### load plsgenomics library
library(plsgenomics)
### generating data
n <- 100
p <- 100
sample1 <- sample.cont(n=n, p=p, kstar=10, lstar=2, beta.min=0.25,
beta.max=0.75, mean.H=0.2, sigma.H=10,
sigma.F=5, sigma.E=5)
X <- sample1$X
Y <- sample1$Y
### splitting between learning and testing set
index.train <- sort(sample(1:n, size=round(0.7*n)))
index.test <- (1:n)[-index.train]
Xtrain <- X[index.train,]
Ytrain <- Y[index.train,]
Xtest <- X[index.test,]
Ytest <- Y[index.test,]
### fitting the model, and predicting new observations
model1 <- spls(Xtrain=Xtrain, Ytrain=Ytrain, lambda.l1=0.5, ncomp=2,
weight.mat=NULL, Xtest=Xtest, adapt=TRUE, center.X=TRUE,
center.Y=TRUE, scale.X=TRUE, scale.Y=TRUE,
weighted.center=FALSE)
str(model1)
### plotting the estimation versus real values for the non centered response
plot(model1$Ytrain, model1$hatY.nc,
xlab="real Ytrain", ylab="Ytrain estimates")
points(-1000:1000,-1000:1000, type="l")
### plotting residuals versus centered response values
plot(model1$sYtrain, model1$residuals, xlab="sYtrain", ylab="residuals")
### plotting the predictor coefficients
plot(model1$betahat.nc, xlab="variable index", ylab="coeff")
### mean squared error of prediction on the test sample
### (computed on the centered and scaled response)
sYtest <- as.matrix(scale(Ytest, center=model1$meanYtrain, scale=model1$sigmaYtrain))
sum((model1$hatYtest - sYtest)^2) / length(index.test)
### plotting predicted values versus centered and scaled real response values
## on the test set
plot(sYtest, model1$hatYtest, xlab="real sYtest", ylab="predicted values")
points(-1000:1000,-1000:1000, type="l")
# }