pca_spca: Scalable principal component analysis via streaming power iterations

Description

Implements the scalable PCA (sPCA) procedure of Elgamal et al. (2015), which uses block power iterations to approximate the leading principal components while streaming the data in manageable chunks. The algorithm requires only matrix-vector products, so large matrices can be processed without materialising the full cross-product in memory.
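
The idea of streaming block power iterations can be illustrated in base R. The sketch below is not the package's implementation: block_power_pca() is a hypothetical helper, the QR re-orthonormalisation and the final Rayleigh-Ritz step are assumptions chosen for clarity, and an ordinary in-memory matrix stands in for a big.matrix. It accumulates t(Xc) %*% (Xc %*% Q) over row blocks, so the p x p cross-product is never formed.

```r
# Illustrative sketch only; block_power_pca() is a hypothetical helper.
block_power_pca <- function(x, ncomp = 2L, block_size = 256L,
                            max_iter = 100L, tol = 1e-6, seed = 1L) {
  n <- nrow(x)
  p <- ncol(x)
  starts <- seq(1L, n, by = block_size)
  block_rows <- function(s) s:min(s + block_size - 1L, n)

  # Pass 1: accumulate column sums block by block to obtain the means.
  col_sum <- numeric(p)
  for (s in starts) {
    col_sum <- col_sum + colSums(x[block_rows(s), , drop = FALSE])
  }
  mu <- col_sum / n

  # Random orthonormal starting basis (p x ncomp).
  set.seed(seed)
  Q <- qr.Q(qr(matrix(rnorm(p * ncomp), p, ncomp)))

  converged <- FALSE
  iter <- 0L
  for (iter in seq_len(max_iter)) {
    # Accumulate t(Xc) %*% (Xc %*% Q) one row block at a time.
    Z <- matrix(0, p, ncomp)
    for (s in starts) {
      xb <- sweep(x[block_rows(s), , drop = FALSE], 2L, mu)  # centre block
      Z <- Z + crossprod(xb, xb %*% Q)
    }
    Q_new <- qr.Q(qr(Z))  # re-orthonormalise the updated basis

    # Frobenius distance between successive projectors, using
    # ||P_old - P_new||_F^2 = 2k - 2 ||t(Q_old) %*% Q_new||_F^2.
    overlap <- sum(crossprod(Q, Q_new)^2)
    delta <- sqrt(max(0, 2 * ncomp - 2 * overlap))
    Q <- Q_new
    if (delta < tol) {
      converged <- TRUE
      break
    }
  }

  # Rayleigh-Ritz step: eigen-decompose the small ncomp x ncomp matrix
  # t(Q) %*% cov(X) %*% Q, again accumulated over row blocks.
  S <- matrix(0, ncomp, ncomp)
  for (s in starts) {
    xb <- sweep(x[block_rows(s), , drop = FALSE], 2L, mu)
    S <- S + crossprod(xb %*% Q)
  }
  eig <- eigen(S / (n - 1), symmetric = TRUE)

  list(sdev = sqrt(eig$values), rotation = Q %*% eig$vectors,
       center = mu, iterations = iter, converged = converged)
}
```

On data with a clear gap between the retained and discarded eigenvalues, the sdev and rotation returned by this sketch should agree with stats::prcomp() up to the sign of each component.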

Usage

pca_spca(
  x,
  ncomp = NULL,
  center = TRUE,
  scale = FALSE,
  block_size = 2048L,
  max_iter = 50L,
  tol = 1e-04,
  seed = NULL,
  return_scores = FALSE,
  verbose = FALSE
)
pca_spca_R(
  x,
  ncomp = NULL,
  center = TRUE,
  scale = FALSE,
  block_size = 2048L,
  max_iter = 50L,
  tol = 1e-04,
  seed = NULL,
  return_scores = FALSE,
  verbose = FALSE
)

Value

A bigpca object containing the approximate PCA solution with the same structure as pca_bigmatrix(). The result includes component standard deviations, rotation/loadings, optional scores, column statistics, and variance summaries. Additional metadata is stored in attr(result, "iterations") (number of iterations performed), attr(result, "tolerance") (requested tolerance), and attr(result, "converged") (logical convergence flag).
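
The convergence metadata can be inspected with attr(). The following is a minimal sketch: describe_spca_fit() is a hypothetical helper, not part of the package, and the mock object carries only the three documented attributes so that the example runs without any data. A real fit would come from pca_spca().

```r
# Hypothetical helper: checks the documented convergence attributes.
describe_spca_fit <- function(result) {
  iterations <- attr(result, "iterations")
  tolerance  <- attr(result, "tolerance")
  converged  <- isTRUE(attr(result, "converged"))
  if (!converged) {
    warning("stopped after ", iterations,
            " iterations without reaching tol = ", tolerance,
            call. = FALSE)
  }
  invisible(converged)
}

# Stand-in for a fitted object (assumption: only the attributes matter here).
mock <- structure(list(), class = "bigpca",
                  iterations = 12L, tolerance = 1e-4, converged = TRUE)
describe_spca_fit(mock)
```

If the flag is FALSE, raising max_iter or relaxing tol are the obvious things to try.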

Arguments

x: A numeric matrix, data frame, bigmemory::big.matrix, or an external pointer referencing a big.matrix. The input is processed in row-wise blocks so that large matrices can be analysed without creating dense copies in R memory.
ncomp: Number of principal components to retain. Use NULL or a non-positive value to keep min(nrow(x), ncol(x)) components.
center: Logical; should column means be subtracted before performing PCA?
scale: Logical; when TRUE, columns are scaled to unit variance after centring. Scaling requires center = TRUE.
block_size: Number of rows to stream per block when computing column statistics and matrix-vector products.
max_iter: Maximum number of block power iterations.
tol: Convergence tolerance applied to the Frobenius norm of the difference between successive subspace projectors.
seed: Optional integer seed used to initialise the random starting basis.
return_scores: Logical; when TRUE, principal component scores are computed in a final streaming pass over the data.
verbose: Logical; when TRUE, diagnostic messages describing the iteration progress are emitted.
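
The center, scale, and block_size arguments imply that column means and variances are gathered over row blocks rather than from the whole matrix at once. One way to do this in a single pass is sketched below. stream_col_stats() is a hypothetical helper, and the pairwise merge formula is an assumption about how such statistics could be combined, not a description of the package internals.

```r
# Hypothetical helper: one-pass column means and standard deviations.
stream_col_stats <- function(x, block_size = 2048L) {
  n_seen   <- 0
  mean_acc <- numeric(ncol(x))
  m2_acc   <- numeric(ncol(x))  # running sum of squared deviations

  for (s in seq(1L, nrow(x), by = block_size)) {
    xb <- x[s:min(s + block_size - 1L, nrow(x)), , drop = FALSE]
    nb <- nrow(xb)

    # Statistics of the current block alone.
    mean_b <- colMeans(xb)
    m2_b   <- colSums(sweep(xb, 2L, mean_b)^2)

    # Merge the block into the running totals.
    delta    <- mean_b - mean_acc
    n_new    <- n_seen + nb
    mean_acc <- mean_acc + delta * nb / n_new
    m2_acc   <- m2_acc + m2_b + delta^2 * n_seen * nb / n_new
    n_seen   <- n_new
  }

  list(center = mean_acc, scale = sqrt(m2_acc / (n_seen - 1)))
}
```

Merging per-block deviations avoids the cancellation that the naive sum-of-squares formula suffers when the column means are large relative to the spread.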

References

Tarek Elgamal, Maysam Yabandeh, Ashraf Aboulnaga, Waleed Mustafa, and Mohamed Hefeeda (2015). sPCA: Scalable Principal Component Analysis for Big Data on Distributed Platforms. Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. doi:10.1145/2723372.2751520.