PCAgrid: (Sparse) Robust Principal Components using the Grid search algorithm

Description

Computes a desired number of (sparse) (robust) principal components using the grid search algorithm in the plane. The global optimum of the objective function is searched for in planes rather than in the full p-dimensional space, using regular grids in these planes.
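The search within a single plane can be illustrated with a short base R sketch. This is only an illustration under simplifying assumptions, not the pcaPP implementation: the helper name grid_search_plane is hypothetical, and the data are assumed to be already centered and expressed as an n x 2 matrix of coordinates within the plane.

```r
# Illustrative sketch only: search one plane for the direction with the
# largest (robust) scale. `grid_search_plane` is a hypothetical helper and
# is NOT part of pcaPP.
grid_search_plane <- function(y, splitcircle = 25, scale_fun = mad) {
  # y: n x 2 matrix with the coordinates of the data within the plane
  # candidate angles: equally spaced points in [-pi/2, pi/2)
  angles <- seq(-pi / 2, pi / 2, length.out = splitcircle + 1)[-(splitcircle + 1)]
  # scale of the data projected onto each candidate direction
  scales <- vapply(angles, function(a) {
    scale_fun(drop(y %*% c(cos(a), sin(a))))
  }, numeric(1))
  best <- which.max(scales)
  list(angle     = angles[best],
       direction = c(cos(angles[best]), sin(angles[best])),
       objective = scales[best])
}

# toy data: much larger spread along the first coordinate
set.seed(1)
y <- cbind(rnorm(200, sd = 5), rnorm(200, sd = 1))
res <- grid_search_plane(y)
res$direction   # should point roughly along the first coordinate axis
```

Because only splitcircle directions are evaluated per plane, the cost of one such search does not depend on p; the price is that the result is limited to the resolution of the grid.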

Usage

PCAgrid (x, k = 2, method = c ("mad", "sd", "qn"), maxiter = 10, splitcircle = 25, scores = TRUE, zero.tol = 1e-16, 
	 center = l1median, scale, trace = 0, store.call = TRUE, control, ...)
sPCAgrid (x, k = 2, method = c ("mad", "sd", "qn"), lambda = 1, maxiter = 10, splitcircle = 25, scores = TRUE, zero.tol = 1e-16, 
	  center = l1median, scale, trace = 0, store.call = TRUE, control, ...)

Arguments

x

a numerical matrix or data frame of dimension (n x p), which provides the data for the principal components analysis.

k

the desired number of components to compute.

method

the scale estimator used to detect the direction with the largest variance. Possible values are "sd", "mad" and "qn"; the latter can also be written as "Qn". "mad" is the default value.

lambda

the strength of the sparseness constraint (sPCAgrid only). Either a single value used for all components, or a vector of length k with a separate value for each component. See opt.TPO for the choice of this argument.

maxiter

the maximum number of iterations.

splitcircle

the number of directions in which the algorithm should search for the largest variance. The direction with the largest variance is searched for among the directions defined by a number of equally spaced points on the unit circle. This argument determines how many such points are used to split the unit circle.

scores

a logical value indicating whether the scores of the principal components should be calculated.

zero.tol

the zero tolerance used internally for checking convergence, etc.

center

this argument indicates how the data is to be centered. It can be a function like mean or median or a vector of length ncol(x) containing the center value of each column.

scale

this argument indicates how the data is to be rescaled. It can be a function like sd or mad or a vector of length ncol(x) containing the scale value of each column.

trace

an integer value >= 0, specifying the tracing level.

store.call

a logical variable, specifying whether the function call shall be stored in the result structure.

control

a list whose elements must be the same as (or a subset of) the parameters above. If the control object is supplied, the parameters from it are used and override any values given for the same parameters directly.

...

further arguments passed to or from other functions.
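As a hedged illustration of how the arguments above combine, the following sketch calls both functions on simulated data. It assumes the pcaPP package is installed; the data and the lambda values are arbitrary choices for demonstration, not recommendations.

```r
library(pcaPP)

set.seed(42)
x <- matrix(rnorm(100 * 5), ncol = 5)   # arbitrary toy data, n = 100, p = 5

# robust PCA: Qn scale estimator, columnwise median centering and
# MAD rescaling passed as functions
pc_qn <- PCAgrid(x, k = 3, method = "qn", center = median, scale = mad)

# sparse PCA: one lambda per component (vector of length k);
# these values are placeholders chosen for illustration only
spc <- sPCAgrid(x, k = 3, method = "sd", lambda = c(0.5, 0.5, 0.1))

print(pc_qn)
print(spc)
```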

Value

sdev: the (robust) standard deviations of the principal components.
loadings: the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors). This is of class "loadings": see loadings for its print method.
center: the means that were subtracted.
scale: the scalings applied to each variable.
n.obs: the number of observations.
scores: if scores = TRUE, the scores of the supplied data on the principal components.
call: the matched call.
obj: a vector containing the objective function values. For function PCAgrid this is the same as sdev.
lambda: The lambda each component has been calculated with (sPCAgrid only).
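A minimal sketch of how the components listed above might be inspected, assuming pcaPP is installed and using arbitrary simulated data:

```r
library(pcaPP)

set.seed(1)
x <- matrix(rnorm(150 * 4), ncol = 4)   # arbitrary toy data, n = 150, p = 4
pc <- PCAgrid(x, k = 2, method = "mad")

pc$sdev              # robust standard deviations of the components
unclass(pc$loadings) # loadings as a plain matrix
head(pc$scores)      # scores, available because scores = TRUE by default
pc$n.obs             # number of observations
pc$obj               # objective function values
```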

Details

In contrast to PCAgrid, the function sPCAgrid computes sparse principal components. The strength of the applied sparseness constraint is specified by argument lambda.

Similar to the function princomp, there is a print method for these objects that prints the results in a readable format, and the plot method produces a scree plot (screeplot). There is also a biplot method.

Angle halving is an extension of the original algorithm. In the original algorithm, the search directions are determined by a number of points on the unit circle in the interval [-pi/2 ; pi/2). Angle halving means that this interval is halved in each iteration, e.g. the first approximation uses the interval mentioned above, the second approximation uses the halved interval [-pi/4 ; pi/4), and so on. This usually gives better results with fewer iterations. NOTE: in previous implementations angle halving could be suppressed by the former argument "anglehalving". This can still be done by setting argument maxiter = 0.
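The idea of angle halving can be sketched in base R for a single plane. This is an illustration under simplifying assumptions, not the code used by pcaPP: the helper refine_by_halving is hypothetical, the data are an n x 2 matrix of coordinates within the plane, and the current best angle is always kept as a candidate so that the objective cannot decrease between iterations.

```r
# Hypothetical helper, NOT pcaPP's implementation: refine a direction in one
# plane by halving the searched angle interval around the current best angle.
refine_by_halving <- function(y, splitcircle = 25, maxiter = 10, scale_fun = mad) {
  best_angle <- 0
  half_width <- pi / 2                 # first pass covers [-pi/2, pi/2)
  for (i in seq_len(maxiter)) {
    offsets <- seq(-half_width, half_width,
                   length.out = splitcircle + 1)[-(splitcircle + 1)]
    # keep the current best angle among the candidates
    angles <- c(best_angle, best_angle + offsets)
    scales <- vapply(angles, function(a) {
      scale_fun(drop(y %*% c(cos(a), sin(a))))
    }, numeric(1))
    best_angle <- angles[which.max(scales)]
    half_width <- half_width / 2       # next pass: [-pi/4, pi/4), and so on
  }
  c(cos(best_angle), sin(best_angle))
}

set.seed(1)
y <- cbind(rnorm(200, sd = 5), rnorm(200, sd = 1))
d_mad <- refine_by_halving(y)                  # robust scale (MAD)
d_sd  <- refine_by_halving(y, scale_fun = sd)  # classical scale
```

With scale_fun = sd the objective is the classical standard deviation, so d_sd can be compared (up to sign) with the first loading vector from prcomp(y) as a sanity check on the sketch.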

References

C. Croux, P. Filzmoser, M. Oliveira, (2007). Algorithms for Projection-Pursuit Robust Principal Component Analysis, Chemometrics and Intelligent Laboratory Systems, Vol. 87, pp. 218-225.

C. Croux, P. Filzmoser, H. Fritz (2011). Robust Sparse Principal Component Analysis Based on Projection-Pursuit. To appear.

Examples

  # multivariate data with outliers
  library(pcaPP)
  library(mvtnorm)
  set.seed(0)  # make the simulated data reproducible
  x <- rbind(rmvnorm(200, rep(0, 6), diag(c(5, rep(1,5)))),
             rmvnorm( 15, c(0, rep(20, 5)), diag(rep(1, 6))))
  # Here we calculate the robust principal components with PCAgrid
  pc <- PCAgrid(x)
  # we could draw a biplot too:
  biplot(pc)
  # now we want to compare the results with the non-robust principal components
  pc <- princomp(x)
  # again, a biplot for comparison:
  biplot(pc)

  ##  Sparse loadings
  set.seed (0)
  x <- data.Zou ()

                   ##  applying PCA
  pc <-  princomp (x)
                   ##  the corresponding non-sparse loadings
  unclass (pc$loadings[,1:3])
  pc$sdev[1:3]

                   ##  lambda as calculated in the opt.TPO - example
  lambda <- c (0.23, 0.34, 0.005)
                   ##  applying sparse PCA
  spc <- sPCAgrid (x, k = 3, lambda = lambda, method = "sd")
  unclass (spc$loadings)
  spc$sdev[1:3]

                   ## comparing the non-sparse and sparse biplot
  par (mfrow = 1:2)
  biplot (pc, main = "non-sparse PCs")
  biplot (spc, main = "sparse PCs")