BigDataStatMeth (version 1.0.3)

bdsubset_hdf5_dataset: Create Subset of HDF5 Dataset

Description

Creates a new HDF5 dataset containing only the specified rows or columns from an existing dataset. The operation is memory-efficient: it uses HDF5's hyperslab selection for direct disk-to-disk copying, so the full dataset is never loaded into memory.

Usage

bdsubset_hdf5_dataset(
  filename,
  dataset_path,
  indices,
  select_rows = TRUE,
  new_group = "",
  new_name = "",
  overwrite = FALSE
)

Value

Logical. TRUE on success, FALSE on failure

Arguments

filename

Character string. Path to the HDF5 file

dataset_path

Character string. Path to the source dataset (e.g., "/group1/dataset1")

indices

Integer vector. Row or column indices to include (1-based, as per R convention)

select_rows

Logical. If TRUE, selects rows; if FALSE, selects columns (default: TRUE)

new_group

Character string. Target group for the new dataset (default: same as source)

new_name

Character string. Name for the new dataset (default: original_name + "_subset")

overwrite

Logical. Whether to overwrite destination if it exists (default: FALSE)
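
The default naming rule for new_name described above can be illustrated with a small plain-R sketch; the helper function below is hypothetical (the package computes this internally) and only mirrors the documented "original_name + _subset" behavior:

```r
# Hypothetical helper mirroring the documented default for new_name:
# an empty new_name falls back to the source dataset name plus "_subset".
default_subset_name <- function(dataset_path, new_name = "") {
  if (nzchar(new_name)) return(new_name)
  paste0(basename(dataset_path), "_subset")
}

default_subset_name("/group1/dataset1")          # "dataset1_subset"
default_subset_name("/group1/dataset1", "xyz")   # "xyz"
```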

Index Convention

Indices follow R's 1-based convention (first element is index 1), but are automatically converted to HDF5's 0-based indexing internally.
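
The conversion amounts to subtracting one from each index. It happens internally (in the package's C++ layer), so the helper below is purely illustrative:

```r
# Illustrative only: map R's 1-based indices to HDF5's 0-based offsets.
to_hdf5_offsets <- function(indices) {
  stopifnot(all(indices >= 1))   # valid R indices start at 1
  as.integer(indices - 1L)
}

to_hdf5_offsets(c(1, 3, 5))   # 0 2 4
```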

Performance

This function is designed for big data scenarios. Memory usage is minimal regardless of source dataset size, making it suitable for datasets that don't fit in memory.

Requirements

  • The HDF5 file must exist and be accessible

  • The source dataset must exist and contain numeric data

  • Indices must be valid (within dataset dimensions)

  • User must have read-write permissions on the file

Author

BigDataStatMeth package authors

Details

This function provides an efficient way to create subsets of large HDF5 datasets without loading all data into memory. It uses HDF5's native hyperslab selection mechanism for optimal performance with big data.

Key features:

  • Memory efficient - processes one row/column at a time

  • Direct disk-to-disk copying using HDF5 hyperslab selection

  • Preserves all dataset attributes and properties

  • Works with datasets of any size

  • Automatic creation of parent groups if needed

  • Support for both row and column selection
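
The on-disk result matches base R's matrix subsetting semantics; the following plain-R sketch (no HDF5 involved) shows the in-memory analogue of row versus column selection:

```r
# In-memory analogue of the selection the function performs on disk.
m <- matrix(1:20, nrow = 5, ncol = 4)

rows_subset <- m[c(1, 3, 5), , drop = FALSE]  # like select_rows = TRUE
cols_subset <- m[, c(2, 4), drop = FALSE]     # like select_rows = FALSE

dim(rows_subset)  # 3 4
dim(cols_subset)  # 5 2
```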

See Also

Other BigDataStatMeth HDF5 utilities: bdmove_hdf5_dataset()

Examples

if (FALSE) {
# Select specific rows (e.g., rows 1, 3, 5, 10-15)
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/matrix/data",
                                 indices = c(1, 3, 5, 10:15),
                                 select_rows = TRUE,
                                 new_name = "selected_rows")

# Select specific columns
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/matrix/data",
                                 indices = c(2, 4, 6:10),
                                 select_rows = FALSE,
                                 new_group = "/filtered",
                                 new_name = "selected_cols")

# Create subset in a different group
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/raw_data/matrix",
                                 indices = 1:100,  # First 100 rows
                                 select_rows = TRUE,
                                 new_group = "/processed",
                                 new_name = "top_100_rows")

# Extract specific samples for analysis
interesting_samples <- c(15, 23, 45, 67, 89, 123)
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/experiments/results",
                                 indices = interesting_samples,
                                 select_rows = TRUE,
                                 new_name = "analysis_subset")
}