
BigDataStatMeth (version 1.0.3)

bdCrossprod_hdf5: Crossprod with hdf5 matrix

Description

Performs optimized cross product operations on matrices stored in HDF5 format. For a single matrix A, computes A^t * A. For two matrices A and B, computes A^t * B. Uses block-wise processing for memory efficiency.
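For small in-memory matrices, base R's crossprod() computes the same two products. The sketch below only illustrates the quantities involved; it does not touch HDF5, and the matrix sizes are arbitrary values chosen for illustration.

```r
# Small in-memory matrices (sizes chosen only for illustration)
set.seed(1)
A <- matrix(rnorm(6 * 3), nrow = 6, ncol = 3)
B <- matrix(rnorm(6 * 2), nrow = 6, ncol = 2)

# Single-matrix case: A^t * A is a symmetric 3 x 3 matrix
AtA <- crossprod(A)
all.equal(AtA, t(A) %*% A)

# Two-matrix case: A^t * B is 3 x 2; A and B must have the same number of rows
AtB <- crossprod(A, B)
all.equal(AtB, t(A) %*% B)
dim(AtB)
```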

Usage

bdCrossprod_hdf5(
  filename,
  group,
  A,
  B = NULL,
  groupB = NULL,
  block_size = NULL,
  mixblock_size = NULL,
  paral = NULL,
  threads = NULL,
  outgroup = NULL,
  outdataset = NULL,
  overwrite = NULL
)

Value

A list containing the location of the crossproduct result:

fn

Character string. Path to the HDF5 file containing the result

ds

Character string. Full dataset path to the crossproduct result (t(A) %*% A or t(A) %*% B) within the HDF5 file

Arguments

filename

String indicating the HDF5 file path

group

String indicating the input group containing matrix A

A

String specifying the dataset name for matrix A

B

Optional string specifying dataset name for matrix B. If NULL, performs A^t * A

groupB

Optional string indicating group containing matrix B. If NULL, uses same group as A

block_size

Optional integer specifying the block size for processing. If NULL, the block size is determined automatically from the matrix dimensions

mixblock_size

Optional integer for memory block size in parallel processing

paral

Optional logical indicating whether to use parallel processing. Default is FALSE

threads

Optional integer specifying number of threads for parallel processing. If NULL, uses maximum available threads

outgroup

Optional string specifying output group. Default is "OUTPUT"

outdataset

Optional string specifying output dataset name. Default is "CrossProd_A_x_B"

overwrite

Optional logical indicating whether to overwrite existing datasets. Default is FALSE

Details

The function implements block-wise matrix multiplication to handle large matrices efficiently. Block size is automatically optimized based on:

  • Available memory

  • Matrix dimensions

  • Whether parallel processing is enabled
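The block-wise idea can be sketched in base R on an in-memory matrix. This is not the package's implementation: blockwise_crossprod is a hypothetical helper, and splitting by row blocks is an assumption made for illustration. It relies on the identity that A^t * B equals the sum of the partial products over matching row blocks of A and B, so only block_size rows need to be handled at a time.

```r
# Hypothetical helper, not part of BigDataStatMeth: accumulates t(A) %*% B
# over consecutive row blocks so that only `block_size` rows are used per step.
blockwise_crossprod <- function(A, B = A, block_size = 256L) {
  stopifnot(nrow(A) >= 1L, nrow(A) == nrow(B), block_size >= 1L)
  result <- matrix(0, nrow = ncol(A), ncol = ncol(B))
  starts <- seq(1L, nrow(A), by = block_size)
  for (s in starts) {
    rows <- s:min(s + block_size - 1L, nrow(A))
    # Partial product for this row block, added to the running total
    result <- result + crossprod(A[rows, , drop = FALSE],
                                 B[rows, , drop = FALSE])
  }
  result
}

set.seed(555)
A <- matrix(rnorm(1000 * 50), 1000, 50)
B <- matrix(rnorm(1000 * 20), 1000, 20)

# Both comparisons should agree up to floating-point rounding
all.equal(blockwise_crossprod(A, block_size = 128L), crossprod(A))
all.equal(blockwise_crossprod(A, B, block_size = 300L), crossprod(A, B))
```

A smaller block_size lowers the amount of data handled per step at the cost of more iterations; the last block is simply shorter when the row count is not a multiple of block_size.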

For parallel processing:

  • Uses OpenMP for thread management

  • Implements cache-friendly block operations

  • Provides automatic thread count optimization
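The same decomposition parallelizes naturally, because the partial products of different blocks are independent. The sketch below swaps in R's parallel package (separate worker processes) in place of OpenMP threads, purely to show the pattern at the R level; parallel_crossprod and its workers argument are hypothetical names, not part of BigDataStatMeth.

```r
library(parallel)

# Hypothetical helper: each worker computes the partial product for one
# row block, and the partial results are summed at the end.
parallel_crossprod <- function(A, block_size = 250L, workers = 2L) {
  stopifnot(nrow(A) >= 1L, block_size >= 1L, workers >= 1L)
  starts <- seq(1L, nrow(A), by = block_size)
  blocks <- lapply(starts, function(s) {
    A[s:min(s + block_size - 1L, nrow(A)), , drop = FALSE]
  })
  cl <- makeCluster(workers)
  on.exit(stopCluster(cl), add = TRUE)  # release the workers even on error
  partials <- parLapply(cl, blocks, crossprod)
  Reduce(`+`, partials)
}

set.seed(555)
A <- matrix(rnorm(1000 * 40), 1000, 40)
all.equal(parallel_crossprod(A), crossprod(A))
```

Process-based workers copy each block to the worker, so this sketch shows the independence of the blocks, not the shared-memory efficiency of a threaded implementation.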

Memory efficiency is achieved through:

  • Block-wise reading and writing

  • Minimal temporary storage

  • Proper resource cleanup

Examples

if (FALSE) {
  library(BigDataStatMeth)
  library(rhdf5)
  
  # Create test matrix
  N <- 1000
  M <- 1000
  set.seed(555)
  a <- matrix(rnorm(N*M), N, M)
  
  # Save to HDF5
  bdCreate_hdf5_matrix("test.hdf5", a, "INPUT", "A", overwriteFile = TRUE)
  
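  # Compute the cross product t(A) %*% A in blocks, using 4 threads.
  # The returned list holds the result's file path (fn) and dataset path (ds).
  res <- bdCrossprod_hdf5("test.hdf5", "INPUT", "A",
                          outgroup = "OUTPUT",
                          outdataset = "result",
                          block_size = 1024,
                          paral = TRUE,
                          threads = 4)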
}
