BigDataStatMeth (version 1.0.3)

bdsubset_hdf5_dataset: Create Subset of HDF5 Dataset

Description

Creates a new HDF5 dataset containing only the specified rows or columns from an existing dataset. The operation is memory-efficient: it uses HDF5's hyperslab selection for direct disk-to-disk copying, so the full dataset is never loaded into memory.

Usage

bdsubset_hdf5_dataset(
  filename,
  dataset_path,
  indices,
  select_rows = TRUE,
  new_group = "",
  new_name = "",
  overwrite = FALSE
)

Value

Logical. TRUE on success, FALSE on failure

Arguments

filename

Character string. Path to the HDF5 file

dataset_path

Character string. Path to the source dataset (e.g., "/group1/dataset1")

indices

Integer vector. Row or column indices to include (1-based, as per R convention)

select_rows

Logical. If TRUE, selects rows; if FALSE, selects columns (default: TRUE)

new_group

Character string. Target group for the new dataset (default: same as source)

new_name

Character string. Name for the new dataset (default: original_name + "_subset")

overwrite

Logical. Whether to overwrite destination if it exists (default: FALSE)
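
The default naming rule for new_name described above can be illustrated with a small plain-R sketch; the helper function below is hypothetical (the package computes this internally) and only mirrors the documented "original_name + _subset" behavior:

```r
# Hypothetical helper mirroring the documented default for new_name:
# an empty new_name falls back to the source dataset name plus "_subset".
default_subset_name <- function(dataset_path, new_name = "") {
  if (nzchar(new_name)) return(new_name)
  paste0(basename(dataset_path), "_subset")
}

default_subset_name("/group1/dataset1")          # "dataset1_subset"
default_subset_name("/group1/dataset1", "xyz")   # "xyz"
```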

Index Convention

Indices follow R's 1-based convention (first element is index 1), but are automatically converted to HDF5's 0-based indexing internally.
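
The conversion amounts to subtracting one from each index. It happens internally (in the package's C++ layer), so the helper below is purely illustrative:

```r
# Illustrative only: map R's 1-based indices to HDF5's 0-based offsets.
to_hdf5_offsets <- function(indices) {
  stopifnot(all(indices >= 1))   # valid R indices start at 1
  as.integer(indices - 1L)
}

to_hdf5_offsets(c(1, 3, 5))   # 0 2 4
```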

Performance

This function is designed for big data scenarios. Memory usage is minimal regardless of source dataset size, making it suitable for datasets that don't fit in memory.

Requirements

  • The HDF5 file must exist and be accessible

  • The source dataset must exist and contain numeric data

  • Indices must be valid (within dataset dimensions)

  • User must have read-write permissions on the file

Author

BigDataStatMeth package authors

Details

This function provides an efficient way to create subsets of large HDF5 datasets without loading all data into memory. It uses HDF5's native hyperslab selection mechanism for optimal performance with big data.

Key features:

  • Memory efficient - processes one row/column at a time

  • Direct disk-to-disk copying using HDF5 hyperslab selection

  • Preserves all dataset attributes and properties

  • Works with datasets of any size

  • Automatic creation of parent groups if needed

  • Support for both row and column selection
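
The on-disk result matches base R's matrix subsetting semantics; the following plain-R sketch (no HDF5 involved) shows the in-memory analogue of row versus column selection:

```r
# In-memory analogue of the selection the function performs on disk.
m <- matrix(1:20, nrow = 5, ncol = 4)

rows_subset <- m[c(1, 3, 5), , drop = FALSE]  # like select_rows = TRUE
cols_subset <- m[, c(2, 4), drop = FALSE]     # like select_rows = FALSE

dim(rows_subset)  # 3 4
dim(cols_subset)  # 5 2
```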

See Also

Other BigDataStatMeth HDF5 utilities: bdmove_hdf5_dataset()

Examples

if (FALSE) {
# Select specific rows (e.g., rows 1, 3, 5, 10-15)
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/matrix/data",
                                 indices = c(1, 3, 5, 10:15),
                                 select_rows = TRUE,
                                 new_name = "selected_rows")

# Select specific columns
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/matrix/data",
                                 indices = c(2, 4, 6:10),
                                 select_rows = FALSE,
                                 new_group = "/filtered",
                                 new_name = "selected_cols")

# Create subset in a different group
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/raw_data/matrix",
                                 indices = 1:100,  # First 100 rows
                                 select_rows = TRUE,
                                 new_group = "/processed",
                                 new_name = "top_100_rows")

# Extract specific samples for analysis
interesting_samples <- c(15, 23, 45, 67, 89, 123)
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/experiments/results",
                                 indices = interesting_samples,
                                 select_rows = TRUE,
                                 new_name = "analysis_subset")
}