BigDataStatMeth (version 1.0.3)

bdSplit_matrix_hdf5: Split HDF5 Dataset into Submatrices

Description

Splits a large dataset in an HDF5 file into smaller submatrices, with support for both row-wise and column-wise splitting.

Usage

bdSplit_matrix_hdf5(
  filename,
  group,
  dataset,
  outgroup = NULL,
  outdataset = NULL,
  nblocks = NULL,
  blocksize = NULL,
  bycols = TRUE,
  overwrite = FALSE
)

Value

A list with the following components. If an error occurs, all string values are returned as empty strings (""):

fn

Character string with the HDF5 filename

ds

Character string with the output group path where the split datasets are stored. Multiple datasets are created in this location, named <outdataset>.1, <outdataset>.2, and so on.
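
To make the naming scheme concrete, the following base R sketch builds the dataset paths one would expect to find after a split. The helper `split_dataset_paths()` is hypothetical and not part of BigDataStatMeth; it only assumes the `<outdataset>.<index>` pattern described above, with numbering starting at 1.

```r
# Hypothetical helper (not a BigDataStatMeth function): build the expected
# paths of the split datasets from the output group and the base name.
split_dataset_paths <- function(outgroup, outdataset, n_blocks) {
  # Assumes the "<outdataset>.<index>" pattern with indices starting at 1
  paste0(outgroup, "/", outdataset, ".", seq_len(n_blocks))
}

paths <- split_dataset_paths("data_split", "block", 4)
print(paths)
```

The resulting strings could then be passed to any HDF5 reader to load one block at a time.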

Arguments

filename

Character string. Path to the HDF5 file.

group

Character string. Path to the group containing the input dataset.

dataset

Character string. Name of the dataset to split.

outgroup

Character string (optional). Output group path. If NULL, the input group is used.

outdataset

Character string (optional). Base name for output datasets. If NULL, the input dataset name is used, with the block number appended as a suffix.

nblocks

Integer (optional). Number of blocks to split into. Mutually exclusive with blocksize.

blocksize

Integer (optional). Size of each block. Mutually exclusive with nblocks.

bycols

Logical (optional). Whether to split by columns (TRUE) or rows (FALSE). Default is TRUE.

overwrite

Logical (optional). Whether to overwrite existing datasets. Default is FALSE.

Details

This function provides efficient dataset splitting capabilities with:

  • Splitting options:

    • Row-wise or column-wise splitting

    • Fixed block size splitting

    • Fixed block count splitting

  • Implementation features:

    • Memory-efficient processing

    • Block-based operations

    • Safe file operations

    • Progress reporting

The function supports two splitting strategies:

  1. By number of blocks: Splits the dataset into a specified number of roughly equal-sized blocks

  2. By block size: Splits the dataset into blocks of a specified size
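
The two strategies can be illustrated with a small base R sketch that computes index ranges for each block. This is only one plausible scheme under stated assumptions, not the package's actual implementation: the helper `block_ranges()` is hypothetical, it derives the block size as `ceiling(n / nblocks)` when a block count is given, and it lets the last block be shorter when the length does not divide evenly.

```r
# Hypothetical helper (not a BigDataStatMeth function): compute the start and
# end index of each block when splitting `n` rows or columns.
block_ranges <- function(n, nblocks = NULL, blocksize = NULL) {
  # Mirror the documented rule that nblocks and blocksize are mutually exclusive
  if (is.null(nblocks) == is.null(blocksize)) {
    stop("Supply exactly one of 'nblocks' or 'blocksize'")
  }
  if (is.null(blocksize)) {
    # Assumed rule: roughly equal-sized blocks, rounding the size up
    blocksize <- ceiling(n / nblocks)
  }
  starts <- seq(1, n, by = blocksize)
  ends <- pmin(starts + blocksize - 1, n)  # the last block may be shorter
  data.frame(block = seq_along(starts), start = starts, end = ends)
}

by_count <- block_ranges(100, nblocks = 4)    # strategy 1: number of blocks
by_size  <- block_ranges(100, blocksize = 30) # strategy 2: block size
print(by_count)
print(by_size)
```

Under this scheme the rounded-up block size can yield fewer blocks than requested for some inputs (for example, 10 elements split into 6 blocks); the function documented here may handle such remainders differently.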

References

  • The HDF Group. (2000-2010). HDF5 User's Guide.

See Also

  • bdCreate_hdf5_matrix for creating HDF5 matrices

Examples

if (FALSE) {
library(BigDataStatMeth)

# Create test data
data <- matrix(rnorm(1000), 100, 10)

# Save to HDF5
fn <- "test.hdf5"
bdCreate_hdf5_matrix(fn, data, "data", "matrix1",
                     overwriteFile = TRUE)

# Split by number of blocks
bdSplit_matrix_hdf5(
  filename = fn,
  group = "data",
  dataset = "matrix1",
  outgroup = "data_split",
  outdataset = "block",
  nblocks = 4,
  bycols = TRUE
)

# Split by block size
bdSplit_matrix_hdf5(
  filename = fn,
  group = "data",
  dataset = "matrix1",
  outgroup = "data_split2",
  outdataset = "block",
  blocksize = 25,
  bycols = TRUE
)

# Cleanup
if (file.exists(fn)) {
  file.remove(fn)
}
}