Functions to perform orthogonal projections of high-dimensional data matrices using principal component analysis (pca) and partial least squares (pls).
ortho_projection(Xr, Xu = NULL,
  Yr = NULL,
  method = "pca",
  pc_selection = list(method = "var", value = 0.01),
  center = TRUE, scale = FALSE, ...)

pc_projection(Xr, Xu = NULL, Yr = NULL,
  pc_selection = list(method = "var", value = 0.01),
  center = TRUE, scale = FALSE,
  method = "pca",
  tol = 1e-6, max_iter = 1000, ...)

pls_projection(Xr, Xu = NULL, Yr,
  pc_selection = list(method = "opc", value = min(dim(Xr), 40)),
  scale = FALSE, method = "pls",
  tol = 1e-6, max_iter = 1000, ...)

# S3 method for ortho_projection
predict(object, newdata, ...)
A list of class ortho_projection with the following components:
scores: a matrix of scores corresponding to the observations in Xr (and Xu if it was provided). The components retrieved correspond to the ones optimized or specified.
X_loadings: a matrix of loadings corresponding to the explanatory variables. The components retrieved correspond to the ones optimized or specified.
Y_loadings: a matrix of partial least squares loadings corresponding to Yr. The components retrieved correspond to the ones optimized or specified. This object is only returned if the partial least squares algorithm was used.
weights: a matrix of partial least squares ("pls") weights. This object is only returned if the "pls" algorithm was used.
projection_mat: a matrix that can be used to project new data onto a "pls" space. This object is only returned if the "pls" algorithm was used.
variance: a list with information on the original variance and the explained variances. This list contains a matrix indicating the amount of variance explained by each component (var), the ratio between the variance explained by each single component and the original variance (explained_var) and the cumulative ratio of explained variance (cumulative_explained_var). The amount of variance explained by each component is computed by multiplying its score vector by its corresponding loading vector and calculating the variance of the result (a conceptual sketch of this computation is given after these component descriptions). These values are computed based on the observations used to create the projection matrices. For example, if the "pls" method was used, then these values are computed based only on the data that contains information on Yr (i.e. the Xr data). If the principal component method is used, these values are computed on the basis of Xr and Xu (if applicable), since both matrices are employed in the computation of the projection matrix (loadings in this case).
sdv: the standard deviation of the retrieved scores. This vector can be different from the "sd" in variance.
n_components: the number of components (either principal components or partial least squares components) used for computing the global dissimilarity scores.
opc_evaluation: a matrix containing the statistics computed for optimizing the number of principal components based on the variable(s) specified in the Yr argument. If Yr was a continuous vector or matrix, this object indicates the root mean square of differences (rmsd) for each number of components. If Yr was a categorical variable, this object indicates the kappa values for each number of components. This object is returned only if "opc" was used within the pc_selection argument. See the sim_eval function for more details.
method: the ortho_projection method used.
predict.ortho_projection returns a matrix of scores projected for newdata.
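As a conceptual illustration of how the explained variance reported in the variance component is obtained, the following minimal sketch reconstructs the rank-one contribution of each component and takes the variance of the result. It assumes a scores matrix and a loadings matrix with matching columns; variance_by_component is a hypothetical helper used here for illustration, not a function of the package.

# Hypothetical helper (not part of the package): approximate the variance
# explained by each component from its score and loading vectors.
variance_by_component <- function(scores, loadings) {
  sapply(seq_len(ncol(scores)), function(i) {
    # rank-one contribution of component i: t_i %*% t(p_i)
    contribution_i <- scores[, i, drop = FALSE] %*% t(loadings[, i, drop = FALSE])
    # variance of the entries of the resulting reconstruction
    var(as.vector(contribution_i))
  })
}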
Xr: a matrix of observations.
Xu: an optional matrix containing data of a second set of observations.
Yr: if the method used in the pc_selection argument is "opc" or if method = "pls", then it must be a matrix containing the side information corresponding to the spectra in Xr. It is equivalent to the side_info parameter of the sim_eval function. In case method = "pca", a matrix (with one or more continuous variables) can also be used as input. The root mean square of differences (rmsd) is used for assessing the similarity between the observations and their corresponding most similar observations in terms of the side information provided. A single discrete variable of class factor can also be passed. In that case, the kappa index is used. See the sim_eval function for more details.
method: the method for projecting the data. Options are:
"pca": principal component analysis using the singular value decomposition algorithm.
"pca.nipals": principal component analysis using the non-linear iterative partial least squares (nipals) algorithm.
"pls": partial least squares.
"mpls": modified partial least squares. See details.
pc_selection: a list of length 2 which specifies the method to be used for optimizing the number of components (principal components or pls factors) to be retained. This list must contain two elements (in the following order): method (a character indicating the method for selecting the number of components) and value (a numerical value that complements the selected method). The methods available are:
"opc": optimized principal component selection based on Ramirez-Lopez et al. (2013a, 2013b). The optimal number of components of a given set of observations is the one for which its distance matrix minimizes the differences between the Yr value of each observation and the Yr value of its closest observation. In this case value must be a value (larger than 0 and below min(nrow(Xr) + nrow(Xu), ncol(Xr))) indicating the maximum number of principal components to be tested. See details.
"cumvar": selection of the principal components based on a given cumulative amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of cumulative variance that the combination of retained components should explain.
"var": selection of the principal components based on a given amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of variance that a single component should explain in order to be retained.
"manual": for manually specifying a fixed number of principal components. In this case, value must be a value (larger than 0 and below the minimum dimension of Xr, or of Xr and Xu combined) indicating the number of components to be retained.
The list list(method = "var", value = 0.01) is the default. Optionally, the pc_selection argument admits "opc", "cumvar", "var" or "manual" as a single character string. In such a case, the default "value" when either "opc" or "manual" is used is 40. When "cumvar" is used the default "value" is set to 0.99 and when "var" is used, the default "value" is set to 0.01. Equivalent ways of specifying this argument are sketched after the argument descriptions below.
center: a logical indicating if the data Xr (and Xu if specified) must be centered. If Xu is specified, the data is centered on the basis of Xr ∪ Xu (i.e. Xr and Xu combined). NOTE: This argument only applies to the principal components projection. For pls projections the data is always centered.
scale: a logical indicating if Xr (and Xu if specified) must be scaled. If Xu is specified, the data is scaled on the basis of Xr ∪ Xu (i.e. Xr and Xu combined).
...: additional arguments to be passed to pc_projection or pls_projection.
tol: tolerance limit for convergence of the nipals algorithm (default is 1e-06). In the case of pls this applies only to Yr matrices with more than one variable.
max_iter: maximum number of iterations (default is 1000). In the case of method = "pls" this applies only to Yr matrices with more than one variable.
object: an object of class "ortho_projection".
newdata: an optional data frame or matrix in which to look for variables with which to predict. If omitted, the scores are used. It must contain the same number of columns as the data used to build the projection, with the variables in the same order.
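As referenced in the pc_selection description above, the following short sketch illustrates equivalent ways of specifying that argument (the values shown are only examples):

# Explicit list form: retain components explaining at least 99% cumulative variance
pc_sel_a <- list(method = "cumvar", value = 0.99)
# Single-string shorthand: "cumvar" with its default value of 0.99
pc_sel_b <- "cumvar"
# Manual selection of a fixed number of components
pc_sel_c <- list(method = "manual", value = 5)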
In the case of method = "pca", the algorithm used is the singular value decomposition, in which a given data matrix (X) is factorized as follows:
X = U D V^T
where U and V are orthogonal matrices, U being the matrix of left singular vectors and V the matrix of right singular vectors of X, and D is a diagonal matrix containing the singular values of X. The matrix of principal component scores is obtained by the matrix multiplication of U and D, and the matrix of principal component loadings is equivalent to V.
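A minimal illustration of this factorization with base R's svd (a sketch of the underlying idea for a column-centered matrix, not the package's internal code):

set.seed(1)
X <- scale(matrix(rnorm(100 * 10), 100, 10), center = TRUE, scale = FALSE)
decomposition <- svd(X)
# principal component scores: U D
pc_scores <- decomposition$u %*% diag(decomposition$d)
# principal component loadings: V
pc_loadings <- decomposition$v
# X is recovered (up to numerical error) as scores %*% t(loadings)
max(abs(X - pc_scores %*% t(pc_loadings)))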
When method = "pca.nipals", the algorithm used for principal component analysis is the non-linear iterative partial least squares (nipals) algorithm.
In the case of the partial least squares projection (a.k.a. projection to latent structures) the nipals regression algorithm is used by default. Details on the "nipals" algorithm are presented in Martens (1991). Another method called modified pls ("mpls") can also be used. The modified pls was proposed by Shenk and Westerhaus (1991, see also Westerhaus, 2014) and it differs from the standard pls method in the way the weights of Xr (used to compute the matrix of scores) are obtained. While pls uses the covariance between Yr and Xr (and later their deflated versions at each pls component iteration) to obtain these weights, the modified pls uses the correlation as weights. The authors indicate that by using correlation, a larger portion of the response variable(s) can be explained.
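The difference between the two weighting schemes can be sketched as follows. This is a conceptual illustration for a centered Xr and a single centered response vector yr, showing only the first component (deflation steps are omitted); pls_weights and mpls_weights are hypothetical helpers, not functions of the package.

# Hypothetical helpers contrasting the two weighting schemes
pls_weights <- function(Xr, yr) {
  w <- crossprod(Xr, yr)   # proportional to the covariance of each column of Xr with yr
  w / sqrt(sum(w^2))       # normalized weight vector
}
mpls_weights <- function(Xr, yr) {
  w <- cor(Xr, yr)         # correlation of each column of Xr with yr
  w / sqrt(sum(w^2))
}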
When the "opc" method is used in pc_selection, the selection of the components is carried out by using an iterative method based on the side information concept (Ramirez-Lopez et al. 2013a, 2013b). First, let P be a sequence of retained components (so that P = 1, 2, ..., k). At each iteration, the function computes a dissimilarity matrix retaining p_i components. The side information value of each observation is then compared against the side information value of its most spectrally similar observation (i.e. its closest Xr observation according to that dissimilarity matrix). The optimal number of components retrieved by the function is the one that minimizes the root mean square of differences (rmsd) in the case of continuous variables, or maximizes the kappa index in the case of categorical variables. In this process, the sim_eval function is used. Note that for the "opc" method Yr is required (i.e. the side information of the observations).
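The iteration described above can be sketched conceptually as follows. This is a simplified illustration for a continuous Yr using a plain Euclidean distance on the scores; the package itself relies on sim_eval and its own dissimilarity measures, and opc_rmsd is a hypothetical helper.

# Hypothetical helper: rmsd between each observation's side information and
# that of its nearest neighbour, for an increasing number of components.
opc_rmsd <- function(scores, yr, max_components) {
  sapply(seq_len(max_components), function(k) {
    d <- as.matrix(dist(scores[, 1:k, drop = FALSE]))  # dissimilarity matrix
    diag(d) <- Inf                                     # exclude self-matches
    nearest <- apply(d, 1, which.min)                  # closest observation
    sqrt(mean((yr - yr[nearest])^2))                   # rmsd of side information
  })
}
# The optimal number of components would then be which.min(opc_rmsd(...)).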
Martens, H. (1991). Multivariate calibration. John Wiley & Sons.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.
Shenk, J. S., & Westerhaus, M. O. 1991. Populations structuring of near infrared spectra and modified partial least squares regression. Crop Science, 31(6), 1548-1555.
Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.
Westerhaus, M. 2014. Eastern Analytical Symposium Award for outstanding achievements in near infrared spectroscopy: my contributions to near infrared spectroscopy. NIR news, 25(8), 16-20.
ortho_diss, sim_eval, mbl
# \donttest{
library(prospectr)
data(NIRsoil)
# Preprocess the data using detrend plus first derivative with Savitzky and
# Golay smoothing filter
sg_det <- savitzkyGolay(
detrend(NIRsoil$spc,
wav = as.numeric(colnames(NIRsoil$spc))
),
m = 1,
p = 1,
w = 7
)
NIRsoil$spc_pr <- sg_det
# split into training and testing sets
test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC), ]
test_y <- NIRsoil$CEC[NIRsoil$train == 0 & !is.na(NIRsoil$CEC)]
train_y <- NIRsoil$CEC[NIRsoil$train == 1 & !is.na(NIRsoil$CEC)]
train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC), ]
# A principal component analysis using 5 components
pca_projected <- ortho_projection(train_x, pc_selection = list("manual", 5))
pca_projected
# A principal components projection using the "opc" method
# for the selection of the optimal number of components
pca_projected_2 <- ortho_projection(
Xr = train_x, Xu = test_x, Yr = train_y,
method = "pca",
pc_selection = list("opc", 40)
)
pca_projected_2
plot(pca_projected_2)
# A partial least squares projection using the "opc" method
# for the selection of the optimal number of components
pls_projected <- ortho_projection(
Xr = train_x, Xu = test_x, Yr = train_y,
method = "pls",
pc_selection = list("opc", 40)
)
pls_projected
plot(pls_projected)
# A partial least squares projection using the "cumvar" method
# for the selection of the optimal number of components
pls_projected_2 <- ortho_projection(
Xr = train_x, Xu = test_x, Yr = train_y,
method = "pls",
pc_selection = list("cumvar", 0.99)
)
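# Scores for new observations can be obtained with the predict method
# (illustrative use of predict.ortho_projection described above)
pls_scores_test <- predict(pls_projected_2, newdata = test_x)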
# }