pca
is used to build and explore a principal component analysis (PCA) model.
pca(x, ncomp = 15, center = T, scale = F, cv = NULL, exclrows = NULL,
exclcols = NULL, x.test = NULL, method = "svd", rand = NULL,
lim.type = "jm", alpha = 0.05, gamma = 0.01, info = "")
a numerical matrix with calibration data.
maximum number of components to calculate.
logical, do mean centering of data or not.
logical, do sdandardization of data or not.
number of segments for random cross-validation (1 for full cross-validation).
rows to be excluded from calculations (numbers, names or vector with logical values)
columns to be excluded from calculations (numbers, names or vector with logical values)
a numerical matrix with test data.
method to compute principal components ('svd', 'nipals').
vector with parameters for randomized PCA methods (if NULL, conventional PCA is used instead)
which method to use for calculation of critical limits for residuals (see details)
significance level for calculating critical limits for T2 and Q residuals.
significance level for calculating outlier limits for T2 and Q residuals.
a short text line with model description.
Returns an object of pca
class with following fields:
number of components included to the model.
selected (optimal) number of components.
matrix with loading values (nvar x ncomp).
vector with eigenvalues for all existent components.
vector with explained variance for each component (in percent).
vector with cumulative explained variance for each component (in percent).
statistical limit for T2 distance.
statistical limit for Q residuals.
information about the model, provided by user when build the model.
an object of class pcares
with PCA results for a calibration data.
an object of class pcares
with PCA results for a test data, if it
was provided.
an object of class pcares
with PCA results for cross-validation,
if this option was chosen.
More details and examples can be found in the Bookdown tutorial.
By default pca
uses number of components (ncomp
) as a minimum of number of
objects - 1, number of variables and default or provided value. Besides that, there is also
a parameter for selecting an optimal number of components (ncomp.selected
). The optimal
number of components is used to build a residuals plot (with Q residuals vs. Hotelling T2
values), calculate confidence limits for Q residuals, as well as for SIMCA classification.
You can provde number, names or logical values to exclode rows or columns from calibration and validation of PCA model. In this case the outcome, e.g. scores and loadings will correspond to the original size of the data, but:
Loadings (and all performance statistics) will be computed without excluded objects and variables
Matrix with loadings will have zero values for the excluded variables and the corresponding columns will be hidden.
Matrix with scores will have score values calculated for the hidden objects but the rows will be hidden.
You can see scores and loadings for hidden rows and columns by using parameter 'show.excluded = T' in plots. If you see other packages to make plots (e.g. ggplot2) you will not be able to distinguish between hidden and normal objects.
By default loadings are computed for the original dataset using either SVD or NIPALS algorithm.
However, for datasets with large number of rows (e.g. hyperspectral images), there is a
possibility to run algorithms based on random permutations [1, 2]. In this case you have
to define parameter rand
as a vector with two values: p - oversampling parameter and
k - number of iterations. Usually rand = c(15, 0)
or rand = c(5, 1)
are good
options, which give quite precise solution using several times less computational time. It must
be noted that statistical limits for residuals will not be computed in this case.
There are several ways to calculate critical limits for Q and T2 residuals. In mdatools
you can specify one of the following methods via parameter lim.type
: 'jm'
- method
based on Jackson-Mudholkar approach [3], 'chisq'
- method based on chi-square distribution
[4] and 'ddrobust'
and 'ddmoments'
- both related to data driven method proposed
by Pomerantsev and Rodionova [5]. The 'ddmoments'
is based on method of moments for
estimation of distribution parameters while 'ddrobust'
is based in robust estimation.
It must be noted that the first two methods calculate limits for Q-residuals only, assuming, that limits for T2 residuals must be computed using Hotelling's T-squared distribution. The methods based on the data driven approach calculate limits for both Q and T2 residuals based on chi-square distribution and parameters estimated from the calibration data.
The critical limits are calculated for a significance level defined by parameter 'alpha'
.
You can also specify another parameter, 'gamma'
, which is used to calculate acceptance
limit for outliers (shown as dashed line on residuals plot).
You can also recalculate the limits for existent model by using different values for alpha and
gamme, without recomputing the model itself. In this case use the following code (it is assumed
that you current PCA/SIMCA model is stored in variable m
):
m = setResLimits(m, alpha, gamma)
.
In case of PCA the critical limits are just shown on residual plot as lines and can be used for
detection of extreme objects (solid line) and outliers (dashed line). When PCA model is used for
classification in SIMCA (see simca
) the limits are utilized for classification of
objects.
1. N. Halko, P.G. Martinsson, J.A. Tropp. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53 (2010) pp. 217-288. 2. S. Kucheryavskiy, Blessing of randomness against the curse of dimensionality, Journal of Chemometrics, 32 (2018), pp. 3. J.E. Jackson, A User's Guide to Principal Components, John Wiley & Sons, New York, NY (1991). 4. A.L. Pomerantsev, Acceptance areas for multivariate classification derived by projection methods, Journal of Chemometrics, 22 (2008) pp. 601-609. 5. A.L. Pomerantsev, O.Ye. Rodionova, Concept and role of extreme objects in PCA/SIMCA, Journal of Chemometrics, 28 (2014) pp. 429-438.
Methods for pca
objects:
plot.pca |
makes an overview of PCA model with four plots. |
summary.pca |
shows some statistics for the model. |
selectCompNum.pca |
set number of optimal components in the model |
setResLimits.pca |
set critical limits for residuals |
predict.pca |
applies PCA model to a new data. |
plotScores.pca |
shows scores plot. |
plotLoadings.pca |
shows loadings plot. |
plotVariance.pca |
shows explained variance plot. |
plotCumVariance.pca |
shows cumulative explained variance plot. |
plotResiduals.pca |
shows Q vs. T2 residuals plot. |
Most of the methods for plotting data are also available for PCA results (pcares
)
objects. Also check pca.mvreplace
, which replaces missing values in a data matrix
with approximated using iterative PCA decomposition.
# NOT RUN {
library(mdatools)
### Examples for PCA class
## 1. Make PCA model for People data with autoscaling
## and full cross-validation
data(people)
model = pca(people, scale = TRUE, cv = 1, info = 'Simple PCA model')
model = selectCompNum(model, 4)
summary(model)
plot(model, show.labels = TRUE)
## 3. Show scores and loadings plots for the model
par(mfrow = c(2, 2))
plotScores(model, comp = c(1, 3), show.labels = TRUE)
plotScores(model, comp = 2, type = 'h', show.labels = TRUE)
plotLoadings(model, comp = c(1, 3), show.labels = TRUE)
plotLoadings(model, comp = c(1, 2), type = 'h', show.labels = TRUE)
par(mfrow = c(1, 1))
## 4. Show residuals and variance plots for the model
par(mfrow = c(2, 2))
plotVariance(model, type = 'h')
plotCumVariance(model, show.labels = TRUE, legend.position = 'bottomright')
plotResiduals(model, show.labels = TRUE)
plotResiduals(model, ncomp = 2, show.labels = TRUE)
par(mfrow = c(1, 1))
# }
Run the code above in your browser using DataLab