Robpca(x, ...)
## S3 method for class 'default':
Robpca(x, k = 0, kmax = 10, alpha = 0.75, mcd = TRUE, trace = FALSE, ...)
## S3 method for class 'formula':
Robpca(formula, data = NULL, subset, na.action, ...)
Arguments

formula: a formula with no response variable, referring only to numeric variables.

data: an optional data frame (or similar: see model.frame) containing the variables in the formula formula.

subset: an optional vector used to select rows (observations) of the data matrix x.

na.action: a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options.

x: a numeric matrix (or data frame) which provides the data for the principal components analysis.

k: number of principal components to compute. If k is missing, or k = 0, the algorithm itself will determine the number of components by finding such k that $l_k/l_1 \ge 10^{-3}$ and $\sum_{j=1}^k l_j/\sum_{j=1}^r l_j \ge 0.8$.

kmax: maximal number of principal components to compute. Default is kmax = 10. If k is provided, kmax does not need to be specified, unless k is larger than 10.

alpha: this parameter measures the fraction of outliers the algorithm should resist. Default is alpha = 0.75.

mcd: logical; when the number of variables is sufficiently small, the loadings are computed as the eigenvectors of the MCD covariance matrix, and CovMcd() is automatically called. Default is mcd = TRUE.

trace: whether to print intermediate results. Default is trace = FALSE.

Value

An S4 object of class Robpca-class, which is a subclass of the virtual class PcaRobust-class.

Details

Robpca is a generic function with "formula" and "default" methods.
The calculation is done using the ROBPCA method of Hubert et al. (2005), which can
be described briefly as follows. For details see the relevant references.
Let n
denote the number of observations, and p
the
number of original variables in the input data matrix X
. The
ROBPCA algorithm finds a robust center M (p x 1)
of the data
and a loading matrix P
which is (p x k)
dimensional.
Its columns are orthogonal and define a new coordinate system. The
scores T, an (n x k)
matrix, are the coordinates of the
centered observations with respect to the loadings:
$$T=(X-M)P$$ The ROBPCA algorithm also yields a robust covariance
matrix (often singular) which can be computed as
$$S=PLP^t$$
where $L$ is the diagonal matrix with the eigenvalues $l_1, \dots, l_k$.
This is done in the following three main steps:
Step 1: The data are preprocessed by reducing their data space to
the subspace spanned by the n
observations. This is done by
singular value decomposition of the input data matrix. As a result
the data are represented using at most $n-1 = rank(X)$ dimensions, without
loss of information.
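This SVD-based reduction can be sketched in base R as follows (an illustration only, not the package internals; the iris data and the rank tolerance are assumptions made here for the example):

```r
## Step 1 sketch: represent the centered data in at most rank(X) dimensions
## via singular value decomposition (illustration only).
X  <- as.matrix(iris[, 1:4])                  # any numeric data matrix
Xc <- scale(X, center = TRUE, scale = FALSE)  # center the columns
sv <- svd(Xc)
r  <- sum(sv$d > max(dim(Xc)) * .Machine$double.eps * sv$d[1])  # numerical rank
Z  <- Xc %*% sv$v[, 1:r, drop = FALSE]        # n x r representation
## no information is lost: Xc can be reconstructed exactly from Z
max(abs(Xc - Z %*% t(sv$v[, 1:r, drop = FALSE])))
```

The further steps of the algorithm then operate on the lower-dimensional representation Z.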
Step 2: In this step for each data point a measure of outlyingness
is computed. For this purpose the high-dimensional data points are
projected on many univariate directions, each time the univariate
MCD estimator of location and scale is computed and the
standardized distance to the center is measured. The largest of
these distances (over all considered directions) is the outlyingness
measure of the data point. The h
data points with smallest
outlyingness measure are used to compute the covariance matrix
$\Sigma_h$ and to select the number k
of principal
components to retain. This is done by finding such k
that
$l_k/l_1 \ge 10^{-3}$ and $\sum_{j=1}^k l_j/\sum_{j=1}^r l_j \ge 0.8$.
Alternatively the number of principal components k
can be
specified by the user after inspecting the scree plot.
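The projection-pursuit idea behind this outlyingness measure can be sketched in base R. Note the simplifications: median/MAD stand in here for the univariate MCD location/scale estimates used by the real algorithm, and the number of directions and the choice alpha = 0.75 are assumptions of this example:

```r
## Step 2 sketch: outlyingness via projections on random directions
## (median/MAD replace the univariate MCD of the actual ROBPCA algorithm).
set.seed(1)
X    <- as.matrix(iris[, 1:4])
n    <- nrow(X)
outl <- rep(0, n)
for (i in 1:250) {
  ij <- sample(n, 2)                 # direction through two data points
  d  <- X[ij[1], ] - X[ij[2], ]
  if (all(d == 0)) next
  d  <- d / sqrt(sum(d^2))
  z  <- drop(X %*% d)                # univariate projection
  outl <- pmax(outl, abs(z - median(z)) / mad(z))
}
h       <- floor(0.75 * n)           # size of the "clean" subset
Hset    <- order(outl)[1:h]          # h points with smallest outlyingness
Sigma_h <- cov(X[Hset, ])            # covariance matrix of the h-subset
```

The eigen-decomposition of Sigma_h then supplies the candidate subspace used in Step 3.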
Step 3: The data points are projected on the k-dimensional subspace
spanned by the k
eigenvectors corresponding to the largest
k
eigenvalues of the matrix $\Sigma_h$. The location and
scatter and shape of the projected data are computed using the
reweighted MCD estimator and the eigenvectors of this scatter matrix
yield the robust principal components.

Examples

## PCA of the Hawkins Bradu Kass's Artificial Data
## using all 4 variables
data(hbk)
pca <- Robpca(hbk)
pca
## Compare with the classical PCA
prcomp(hbk)
## or
PcaClassic(hbk)
## If you want to print the scores too, use
print(pca, print.x=TRUE)
## Using the formula interface
Robpca(~., data=hbk)
## To plot the results:
plot(pca) # distance plot
plot(Robpca(hbk, k=2)) # PCA diagnostic plot (or outlier map)
## Use the standard plots for prcomp and princomp
screeplot(getPrcomp(pca))
biplot(getPrcomp(pca))
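The score and covariance identities from the details section, $T = (X-M)P$ and $S = PLP^t$, are the same ones that classical PCA satisfies; they can be checked numerically with base R's prcomp (iris is used here merely as an example data set):

```r
## Check T = (X - M) P and S = P L P^t for classical PCA with base R
X  <- as.matrix(iris[, 1:4])
pc <- prcomp(X)                        # classical analogue: M = column means
M  <- colMeans(X)
P  <- pc$rotation                      # loadings (here p x p, so k = p)
Tm <- sweep(X, 2, M) %*% P             # scores T = (X - M) P
max(abs(Tm - pc$x))                    # numerically zero
S  <- P %*% diag(pc$sdev^2) %*% t(P)   # S = P L P^t, L = diag(l_1, ..., l_k)
max(abs(S - cov(X)))                   # recovers the covariance matrix
```

For the robust fit the same relations hold with the robust center, loadings, and eigenvalues in place of the classical ones.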