bdSplit_matrix_hdf5: Split HDF5 Dataset into Submatrices

Description

Splits a large dataset in an HDF5 file into smaller submatrices, with support for both row-wise and column-wise splitting.

Usage

bdSplit_matrix_hdf5(
  filename,
  group,
  dataset,
  outgroup = NULL,
  outdataset = NULL,
  nblocks = NULL,
  blocksize = NULL,
  bycols = TRUE,
  overwrite = FALSE
)

Value

A list with the following components. If an error occurs, every string value is returned as an empty string (""):

fn: Character string with the HDF5 filename
ds: Character string with the output group path where the split datasets are stored. Multiple datasets are created in this location, named <outdataset>.1, <outdataset>.2, etc.
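
As a minimal sketch of the naming pattern described above, the expected dataset paths can be built in base R. The values are taken from the first call in the Examples section, and the sketch assumes the call produces exactly nblocks datasets, which this page does not state explicitly. The variable names split_names and split_paths are illustrative only.

```r
# Values taken from the first call in the Examples section
outgroup <- "data_split"
outdataset <- "block"
nblocks <- 4  # assumption: the split yields exactly this many datasets

# Names follow the documented <outdataset>.<index> pattern
split_names <- paste0(outdataset, ".", seq_len(nblocks))

# Full paths inside the HDF5 file: <outgroup>/<outdataset>.<index>
split_paths <- paste(outgroup, split_names, sep = "/")
print(split_paths)
```

These paths could then be checked against the file contents with any HDF5 browsing tool.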

Arguments

filename: Character string. Path to the HDF5 file.
group: Character string. Path to the group containing the input dataset.
dataset: Character string. Name of the dataset to split.
outgroup: Character string (optional). Output group path. If NULL, the input group is used.
outdataset: Character string (optional). Base name for the output datasets. If NULL, the input dataset name is used, with a block number suffix.
nblocks: Integer (optional). Number of blocks to split into. Mutually exclusive with blocksize.
blocksize: Integer (optional). Size of each block. Mutually exclusive with nblocks.
bycols: Logical (optional). Whether to split by columns (TRUE) or rows (FALSE). Default is TRUE.
overwrite: Logical (optional). Whether to overwrite existing datasets. Default is FALSE.

Details

This function splits an HDF5 dataset block by block and offers the following.

Splitting options:
- Row-wise or column-wise splitting
- Fixed block size splitting
- Fixed block count splitting

Implementation features:
- Memory-efficient processing
- Block-based operations
- Safe file operations
- Progress reporting
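
To make the row-wise versus column-wise distinction concrete, here is a minimal in-memory sketch in base R. It is not the package's implementation, which works on the HDF5 file rather than on an R matrix, and the helper name split_in_memory is hypothetical. It only shows what each submatrix contains when a matrix is cut into fixed-size blocks along one dimension.

```r
# Hypothetical helper, for illustration only: split an in-memory matrix
# into submatrices of at most `blocksize` columns (or rows).
split_in_memory <- function(m, blocksize, bycols = TRUE) {
  # Length of the dimension being split
  n <- if (bycols) ncol(m) else nrow(m)

  # Assign each column/row index to a block: 1,1,...,2,2,...
  block_id <- ceiling(seq_len(n) / blocksize)
  idx_list <- split(seq_len(n), block_id)

  # Extract each block; drop = FALSE keeps single-column blocks as matrices
  lapply(idx_list, function(idx) {
    if (bycols) m[, idx, drop = FALSE] else m[idx, , drop = FALSE]
  })
}

m <- matrix(seq_len(40), nrow = 4, ncol = 10)
col_blocks <- split_in_memory(m, blocksize = 4, bycols = TRUE)
row_blocks <- split_in_memory(m, blocksize = 2, bycols = FALSE)

print(sapply(col_blocks, ncol))
print(sapply(row_blocks, nrow))
```

In this sketch the last block is simply smaller when the dimension is not a multiple of the block size. How bdSplit_matrix_hdf5 treats such a remainder is not stated on this page.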

The function supports two splitting strategies:

- By number of blocks (nblocks): splits the dataset into the specified number of roughly equal-sized blocks
- By block size (blocksize): splits the dataset into blocks of the specified size
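
The two strategies can be sketched as index arithmetic along the dimension being split. The helper below, block_ranges, is hypothetical and written in base R. It assumes that "roughly equal-sized" means rounding the per-block size up, which may differ from what the package actually does, and with that rounding it can return fewer blocks than requested for some inputs.

```r
# Hypothetical helper, for illustration only: compute the start and end
# index of each block along a dimension of length n.
block_ranges <- function(n, nblocks = NULL, blocksize = NULL) {
  # nblocks and blocksize are mutually exclusive: require exactly one
  if (is.null(nblocks) == is.null(blocksize)) {
    stop("Specify exactly one of 'nblocks' or 'blocksize'")
  }
  if (is.null(blocksize)) {
    # Assumption: derive the block size by rounding up
    blocksize <- ceiling(n / nblocks)
  }
  starts <- seq(1, n, by = blocksize)
  ends <- pmin(starts + blocksize - 1, n)  # last block may be shorter
  data.frame(block = seq_along(starts), start = starts, end = ends)
}

by_count <- block_ranges(100, nblocks = 4)
by_size <- block_ranges(10, blocksize = 4)
print(by_count)
print(by_size)
```

Supplying both arguments, or neither, stops with an error in this sketch, mirroring the mutual exclusivity noted in the Arguments section.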

References

The HDF Group. (2000-2010). HDF5 User's Guide.

Examples


if (FALSE) {
library(BigDataStatMeth)

# Create test data
data <- matrix(rnorm(1000), 100, 10)

# Save to HDF5
fn <- "test.hdf5"
bdCreate_hdf5_matrix(fn, data, "data", "matrix1",
                     overwriteFile = TRUE)

# Split by number of blocks
bdSplit_matrix_hdf5(
  filename = fn,
  group = "data",
  dataset = "matrix1",
  outgroup = "data_split",
  outdataset = "block",
  nblocks = 4,
  bycols = TRUE
)

# Split by block size
bdSplit_matrix_hdf5(
  filename = fn,
  group = "data",
  dataset = "matrix1",
  outgroup = "data_split2",
  outdataset = "block",
  blocksize = 25,
  bycols = TRUE
)

# Cleanup
if (file.exists(fn)) {
  file.remove(fn)
}
}
