
BigDataStatMeth (version 1.0.3)

bdPCA_hdf5: Principal Component Analysis for HDF5-Stored Matrices

Description

Performs Principal Component Analysis (PCA) on a large matrix stored in an HDF5 file. PCA reduces the dimensionality of the data while preserving as much variance as possible. The implementation uses SVD internally for efficient and numerically stable computation.
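The link between PCA and SVD can be illustrated on a small in-memory matrix with base R only. This is a minimal sketch, not the package's implementation: it assumes column centering and an n - 1 divisor for the eigenvalues (the convention used by prcomp()), and the variable names are chosen here for illustration.

```r
# Minimal in-memory sketch: PCA of a small matrix through its SVD (base R only).
set.seed(1)
X <- matrix(rnorm(200), nrow = 20, ncol = 10)

# Center the columns (the operation bcenter = TRUE is documented to perform).
Xc <- scale(X, center = TRUE, scale = FALSE)

# SVD of the centered matrix: Xc = U diag(d) V'.
s <- svd(Xc)

# Eigenvalues of the sample covariance matrix, derived from the singular values.
# ASSUMPTION: n - 1 divisor, as in prcomp(); the package may normalize differently.
lambda <- s$d^2 / (nrow(Xc) - 1)

# Proportion of variance explained by each component, and its running total.
variance <- lambda / sum(lambda)
cumvar <- cumsum(variance)

# Scores (rotated data): projection of the centered data onto the right singular vectors.
scores <- Xc %*% s$v

# Cross-check against prcomp(), which also relies on SVD.
ref <- prcomp(X, center = TRUE, scale. = FALSE)
```

Here sqrt(lambda) should agree with ref$sdev, and scores with ref$x up to the sign of each column.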

Usage

bdPCA_hdf5(
  filename,
  group,
  dataset,
  ncomponents = 0L,
  bcenter = FALSE,
  bscale = FALSE,
  k = 2L,
  q = 1L,
  rankthreshold = 0,
  SVDgroup = NULL,
  overwrite = FALSE,
  method = NULL,
  threads = NULL
)

Value

A list containing the paths to the PCA results stored in the HDF5 file:

fn

Character string. Path to the HDF5 file containing the results

lambda

Character string. Dataset path to eigenvalues \(\lambda\)

variance

Character string. Dataset path to variance explained by each PC

cumvar

Character string. Dataset path to cumulative variance explained

var.coord

Character string. Dataset path to variable coordinates on the PCs

var.cos2

Character string. Dataset path to squared cosines (quality of representation) for variables

ind.dist

Character string. Dataset path to distances of individuals from the origin

components

Character string. Dataset path to principal components (rotated data)

ind.coord

Character string. Dataset path to individual coordinates on the PCs

ind.cos2

Character string. Dataset path to squared cosines (quality of representation) for individuals

ind.contrib

Character string. Dataset path to contributions of individuals to each PC

All results are written to the HDF5 file in the group 'PCA/dataset'.
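For orientation, the quantities named above can be computed on a small in-memory matrix with base R. The sketch below uses the conventional textbook definitions of coordinates, squared cosines, and contributions; whether bdPCA_hdf5 applies exactly these normalizations is an assumption, and the object names simply mirror the list elements for readability.

```r
# Sketch of the conventional PCA summaries on standardized data (base R only).
# ASSUMPTION: textbook definitions; the package's normalizations may differ.
set.seed(42)
X <- matrix(rnorm(300), nrow = 30, ncol = 10)
Z <- scale(X, center = TRUE, scale = TRUE)
n <- nrow(Z)

s <- svd(Z)
lambda <- s$d^2 / (n - 1)            # eigenvalues of the correlation matrix

# Individuals
ind.coord <- Z %*% s$v               # coordinates of individuals on the PCs
ind.dist <- sqrt(rowSums(Z^2))       # distance of each individual from the origin
ind.cos2 <- ind.coord^2 / ind.dist^2 # quality of representation per PC
# Contribution (in percent) of each individual to each PC
ind.contrib <- 100 * sweep(ind.coord^2, 2, colSums(ind.coord^2), "/")

# Variables
var.coord <- sweep(s$v, 2, sqrt(lambda), "*") # correlations between variables and PCs
var.cos2 <- var.coord^2                       # quality of representation per PC
```

Under these definitions, each column of ind.contrib sums to 100, and each row of ind.cos2 and var.cos2 sums to 1 when all components are retained.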

Arguments

filename

Character string. Path to the HDF5 file containing the input matrix.

group

Character string. Path to the group containing the input dataset.

dataset

Character string. Name of the input dataset to analyze.

ncomponents

Integer. Number of principal components to compute (default = 0, which computes all components).

bcenter

Logical. If TRUE, centers the data by subtracting column means. Default is FALSE.

bscale

Logical. If TRUE, scales each column by its standard deviation (if the data are centered) or by its root mean square (if they are not). Default is FALSE.

k

Integer. Number of local SVDs to concatenate at each level (default = 2). Controls memory usage in block computation.

q

Integer. Number of levels for SVD computation (default = 1). Higher values can improve accuracy but increase computation time.

rankthreshold

Numeric. Threshold for determining matrix rank (default = 0). Must be between 0 and 0.1.

SVDgroup

Character string. Group name where intermediate SVD results are stored. If SVD was previously computed, results will be reused from this group.

overwrite

Logical. If TRUE, forces recomputation of the SVD even if results already exist. Default is FALSE.

method

Character string. Computation method:

  • "auto": Automatically selects method based on matrix size

  • "blocks": Uses block-based computation (for large matrices)

  • "full": Performs direct computation (for smaller matrices)

threads

Integer. Number of threads for parallel computation.
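The wording of bcenter and bscale mirrors base R's scale(), so their effect can be sketched in memory. The example below assumes the same conventions as scale(): standard deviation for centered columns, and a root mean square with an n - 1 divisor for uncentered ones. The package's exact divisor is not stated here and may differ.

```r
# Sketch of the two preprocessing modes, following base R's scale() conventions.
set.seed(7)
X <- matrix(rnorm(60, mean = 5, sd = 2), nrow = 12, ncol = 5)
n <- nrow(X)

# Centering and scaling: subtract column means, then divide by column standard deviations.
centered <- sweep(X, 2, colMeans(X), "-")
std <- sweep(centered, 2, apply(X, 2, sd), "/")

# Scaling without centering: divide by the column root mean square.
# ASSUMPTION: n - 1 divisor, as documented for scale(center = FALSE).
rms <- sqrt(colSums(X^2) / (n - 1))
rms_scaled <- sweep(X, 2, rms, "/")
```

The first result matches scale(X) and the second matches scale(X, center = FALSE).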

Details

This function implements a scalable PCA algorithm suitable for large matrices that may not fit in memory. Key features include:

  • Automatic method selection based on matrix size

  • Block-based computation for large matrices

  • Optional data preprocessing (centering and scaling)

  • Parallel processing support

  • Memory-efficient incremental algorithm

  • Reuse of existing SVD results

The implementation uses SVD internally and supports two computation methods:

  • Full decomposition: Suitable for matrices that fit in memory

  • Block-based decomposition: For large matrices, uses an incremental algorithm
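The idea behind a block-based decomposition can be sketched in memory: split the matrix into column blocks, take the SVD of each block, concatenate the products U_i D_i, and decompose the result. This is a single-level illustration in base R, not the package's HDF5 implementation; reading k as the number of blocks merged per level is an interpretation of the argument description, and splitting by columns is a choice made for this sketch.

```r
# One-level sketch of a block-based SVD (base R only).
set.seed(123)
X <- matrix(rnorm(40 * 12), nrow = 40, ncol = 12)
p <- ncol(X)
k <- 2L  # number of column blocks (hypothetical stand-in for the k argument)

# Assign each column to one of k contiguous blocks.
block_id <- ceiling(seq_len(p) * k / p)
blocks <- split(seq_len(p), block_id)

# Local SVD of each block; keep U_i %*% diag(d_i), which preserves X_i X_i'.
local <- lapply(blocks, function(idx) {
  s <- svd(X[, idx, drop = FALSE])
  s$u %*% diag(s$d, nrow = length(s$d))
})

# SVD of the concatenated factors recovers the singular values of X,
# because sum_i X_i X_i' equals X X'.
merged <- svd(do.call(cbind, unname(local)))
d_block <- merged$d

# Direct decomposition for comparison.
d_full <- svd(X)$d
```

In this exact (non-truncated) form d_block equals d_full up to rounding; memory savings arise in practice because each block is decomposed separately and can be truncated before merging.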

References

  • Halko, N., Martinsson, P. G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 217-288.

  • Jolliffe, I. T. (2002). Principal Component Analysis, Second Edition. Springer Series in Statistics.

See Also

  • bdSVD_hdf5 for the underlying SVD computation

  • bdNormalize_hdf5 for data preprocessing options

Examples

if (FALSE) {
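# Create a sample matrix in an HDF5 file
library(rhdf5)
library(BigDataStatMeth)
X <- matrix(rnorm(10000), 1000, 10)
h5createFile("data.h5")
h5createGroup("data.h5", "data")  # the parent group must exist before writing
h5write(X, "data.h5", "data/matrix")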

# Basic PCA with default parameters
bdPCA_hdf5("data.h5", "data", "matrix")

# PCA with preprocessing and specific number of components
bdPCA_hdf5("data.h5", "data", "matrix",
           ncomponents = 3,
           bcenter = TRUE, bscale = TRUE,
           method = "blocks",
           threads = 4)
}
