BigDataStatMeth (version 1.0.3)

bdSVD_hdf5: Singular Value Decomposition for HDF5-Stored Matrices

Description

Computes the Singular Value Decomposition (SVD) of a large matrix stored in an HDF5 file. The SVD factors a matrix A as A = UDV', where the columns of U and V are orthonormal (the left and right singular vectors) and D is a diagonal matrix containing the singular values.
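On small in-memory data, the same factorization can be reproduced with base R's svd(); this is a hedged illustration of the decomposition itself, not the BigDataStatMeth code path:

```r
# Minimal sketch with base R: verify A = U D V' on a small matrix
set.seed(1)
A <- matrix(rnorm(20), 5, 4)
s <- svd(A)                              # s$u, s$d, s$v
A_rec <- s$u %*% diag(s$d) %*% t(s$v)    # reconstruct A from the factors
stopifnot(isTRUE(all.equal(A, A_rec)))
```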

Usage

bdSVD_hdf5(
  filename,
  group = NULL,
  dataset = NULL,
  k = 2L,
  q = 1L,
  bcenter = TRUE,
  bscale = TRUE,
  rankthreshold = 0,
  overwrite = NULL,
  method = NULL,
  threads = NULL
)

Value

A list with the following elements:

fn

Path to the HDF5 file

ds_d

Path to the dataset containing singular values

ds_u

Path to the dataset containing left singular vectors

ds_v

Path to the dataset containing right singular vectors

Arguments

filename

Character string. Path to the HDF5 file containing the input matrix.

group

Character string. Path to the group containing the input dataset.

dataset

Character string. Name of the input dataset to decompose.

k

Integer. Number of local SVDs to concatenate at each level (default = 2). Controls the trade-off between memory usage and computation speed.

q

Integer. Number of levels for SVD computation (default = 1). Higher values can improve accuracy but increase computation time.

bcenter

Logical. If TRUE (default), centers the data by subtracting column means.

bscale

Logical. If TRUE (default), scales the centered columns by their standard deviations or root mean square.

rankthreshold

Numeric. Threshold for determining matrix rank (default = 0). Must be between 0 and 0.1. Used to approximate rank for nearly singular matrices.

overwrite

Logical. If TRUE, allows overwriting existing result datasets in the HDF5 file.

method

Character string. Computation method:

  • "auto": Automatically selects between "full" and "blocks" based on matrix size

  • "blocks": Uses block-based computation (recommended for large matrices)

  • "full": Performs direct computation without partitioning

threads

Integer. Number of threads for parallel computation.
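The bcenter/bscale preprocessing described above behaves like base R's scale() (an assumption drawn from the argument descriptions; the HDF5-backed implementation computes the same statistics blockwise):

```r
# Illustrative equivalence (assumption based on the argument descriptions):
# with bcenter = TRUE and bscale = TRUE the preprocessing matches base R's
# scale(), i.e. column means 0 and column standard deviations 1.
set.seed(7)
A <- matrix(rnorm(30), 10, 3)
Z <- scale(A, center = TRUE, scale = TRUE)
stopifnot(max(abs(colMeans(Z))) < 1e-12)
stopifnot(isTRUE(all.equal(as.numeric(apply(Z, 2, sd)), rep(1, 3))))
```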

Details

This function implements a block-based SVD algorithm suitable for large matrices that may not fit in memory. Key features include:

  • Automatic method selection based on matrix size

  • Block-based computation for large matrices

  • Data centering and scaling options

  • Parallel processing support

  • Rank approximation through threshold

  • Memory-efficient incremental algorithm

The implementation uses an incremental algorithm with two key parameters:

  • k: number of local SVDs to concatenate at each level

  • q: number of levels in the computation
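The role of k can be sketched in base R for a single level (q = 1): each row block gets its own local SVD, and only the small D V' factors are stacked and decomposed again. This is a hedged sketch of the incremental idea, not the package's internal code:

```r
# One-level (q = 1) incremental SVD sketch with k = 2 row blocks.
# Since A'A = A1'A1 + A2'A2 = B'B with B = rbind(D1 V1', D2 V2'),
# the singular values of B equal those of A. For a tall matrix
# (nrow >> ncol), B is only (k * ncol) x ncol, so it fits in memory.
set.seed(42)
A  <- matrix(rnorm(200), 20, 10)
A1 <- A[1:10, ]
A2 <- A[11:20, ]
s1 <- svd(A1)
s2 <- svd(A2)
B  <- rbind(diag(s1$d) %*% t(s1$v),
            diag(s2$d) %*% t(s2$v))
all.equal(svd(B)$d, svd(A)$d)   # TRUE (up to numerical tolerance)
```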

References

  • Halko, N., Martinsson, P. G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 217-288.

See Also

  • bdPCA_hdf5 for Principal Component Analysis

  • bdQR_hdf5 for QR decomposition

Examples

if (FALSE) {
library(BigDataStatMeth)
library(rhdf5)

# Create a sample large matrix in HDF5
A <- matrix(rnorm(10000), 1000, 10)

fn <- "test_temp.hdf5"
bdCreate_hdf5_matrix(filename = fn, object = A, group = "data", dataset = "matrix")

# Compute SVD with default parameters
res <- bdSVD_hdf5(fn, "data", "matrix")

# Compute SVD with custom parameters (overwrite = TRUE since results
# from the previous call already exist in the file)
res <- bdSVD_hdf5(fn, "data", "matrix",
           k = 4, q = 2,
           bcenter = TRUE, bscale = TRUE,
           method = "blocks",
           overwrite = TRUE,
           threads = 4)

# list contents
h5ls(res$fn)

# Extract the result from HDF5 (d)
result_d_hdf5 <- h5read(res$fn, res$ds_d)
result_d_hdf5

# Compute the same SVD in R; bdSVD_hdf5 centers and scales by default,
# so compare against the SVD of the scaled matrix
result_d_r <- svd(scale(A))$d
result_d_r

# Compare both results (should be TRUE)
all.equal(result_d_hdf5, result_d_r)

# Remove file
if (file.exists(fn)) {
  file.remove(fn)
}

}
