bdSort_hdf5_dataset: Sort HDF5 Dataset Using Predefined Order

Description

Sorts a dataset in an HDF5 file based on a predefined ordering specified through a list of sorting blocks.

Usage

bdSort_hdf5_dataset(
  filename,
  group,
  dataset,
  outdataset,
  blockedSortlist,
  func,
  outgroup = NULL,
  overwrite = FALSE
)

Value

A list with the following components. If an error occurs, all string values are returned as empty strings (""):

fn: Character string with the HDF5 filename
ds: Character string with the full dataset path to the sorted dataset (group/dataset)
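
Because failures are reported through empty strings rather than an R error, callers may want to check the result before using it. The following is a minimal sketch, assuming the returned list has exactly the fn and ds components described above. The helper sort_succeeded is hypothetical and not part of BigDataStatMeth, and the two result lists are mocked for illustration.

```r
# Hypothetical helper (not part of BigDataStatMeth):
# TRUE only when x is a single, non-missing, non-empty string.
is_nonempty_string <- function(x) {
  is.character(x) && length(x) == 1L && !is.na(x) && nzchar(x)
}

# TRUE when both documented components, fn and ds, are non-empty strings.
sort_succeeded <- function(res) {
  is.list(res) && is_nonempty_string(res$fn) && is_nonempty_string(res$ds)
}

# Mocked return values, shaped as described in the Value section
ok  <- list(fn = "test.hdf5", ds = "data/matrix1_sorted")
bad <- list(fn = "", ds = "")

sort_succeeded(ok)
sort_succeeded(bad)
```

In real use, res would be the value returned by bdSort_hdf5_dataset().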

Arguments

filename

Character string. Path to the HDF5 file.

group

Character string. Path to the group containing the input dataset.

dataset

Character string. Name of the dataset to sort.

outdataset

Character string. Name for the sorted dataset.

blockedSortlist

List of data frames. Each data frame specifies the sorting order for a block of elements. See Details for structure.

func

Character string. Function to apply:

"sortRows" for row-wise sorting
"sortCols" for column-wise sorting

outgroup

Character string (optional). Output group path. If NULL, the input group is used.

overwrite

Logical (optional). Whether to overwrite an existing dataset. Default is FALSE.
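
The outgroup default can be pictured with a short sketch. It assumes the output location is the group and dataset name joined by "/", as the Value section suggests with "group/dataset". The helper resolve_output_path is hypothetical and only mirrors the documented fallback; it is not how the package necessarily builds the path internally.

```r
# Hypothetical helper: a NULL outgroup falls back to the input group,
# as documented for the outgroup argument.
resolve_output_path <- function(group, outdataset, outgroup = NULL) {
  target_group <- if (is.null(outgroup)) group else outgroup
  paste(target_group, outdataset, sep = "/")
}

# Default: the sorted dataset lands next to the input dataset
resolve_output_path("data", "matrix1_sorted")

# Explicit output group ("sorted" is an illustrative name)
resolve_output_path("data", "matrix1_sorted", outgroup = "sorted")
```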

Details

This function provides efficient dataset sorting capabilities with:

Sorting options:
- Row-wise sorting
- Column-wise sorting
- Block-based processing

Implementation features:
- Memory-efficient processing
- Block-based operations
- Safe file operations
- Progress reporting

The sorting order is specified through a list of data frames, where each data frame represents a block of elements to be sorted. Each data frame must contain:

Row names (current identifiers)
chr (new identifiers)
order (current positions)
newOrder (target positions)
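
A block can be checked against this structure before it is passed to the function. The sketch below is a hypothetical validator, not part of BigDataStatMeth. It assumes that order and newOrder hold numeric positions, which the descriptions above imply but do not state outright.

```r
# Hypothetical validator for one sorting block.
check_sort_block <- function(block) {
  if (!is.data.frame(block)) {
    stop("block must be a data frame")
  }
  # Columns listed as required in the Details section
  missing_cols <- setdiff(c("chr", "order", "newOrder"), names(block))
  if (length(missing_cols) > 0L) {
    stop("missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  # Assumption: positions are numeric
  if (!is.numeric(block$order) || !is.numeric(block$newOrder)) {
    stop("order and newOrder must be numeric positions")
  }
  invisible(TRUE)
}

# A block with current identifiers as row names
block <- data.frame(
  chr = paste0("TCGA-OR-A5J", 1:4),
  order = 1:4,
  newOrder = 1:4,
  row.names = paste0("TCGA-OR-A5J", 1:4)
)
check_sort_block(block)
```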

Example sorting blocks structure:

Block 1 (maintaining order):

                      chr order newOrder Diagonal
TCGA-OR-A5J1 TCGA-OR-A5J1     1        1        1
TCGA-OR-A5J2 TCGA-OR-A5J2     2        2        1
TCGA-OR-A5J3 TCGA-OR-A5J3     3        3        1
TCGA-OR-A5J4 TCGA-OR-A5J4     4        4        1

Block 2 (reordering with new identifiers):

                      chr order newOrder Diagonal
TCGA-OR-A5J5 TCGA-OR-A5JA    10        5        1
TCGA-OR-A5J6 TCGA-OR-A5JB    11        6        1
TCGA-OR-A5J7 TCGA-OR-A5JC    12        7        0
TCGA-OR-A5J8 TCGA-OR-A5JD    13        8        1

Block 3 (reordering with identifier swaps):

                      chr order newOrder Diagonal
TCGA-OR-A5J9 TCGA-OR-A5J5     5        9        1
TCGA-OR-A5JA TCGA-OR-A5J6     6       10        1
TCGA-OR-A5JB TCGA-OR-A5J7     7       11        1
TCGA-OR-A5JC TCGA-OR-A5J8     8       12        1
TCGA-OR-A5JD TCGA-OR-A5J9     9       13        0

In this example:

Block 1 maintains the original order
Block 2 assigns new identifiers (A5JA-D) to elements
Block 3 swaps identifiers between elements
The Diagonal column indicates whether the element is on the diagonal (1) or not (0)
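
The effect of the three example blocks can be illustrated in memory, without touching an HDF5 file. This is only a sketch under one assumption: the element currently at position order moves to position newOrder and takes the identifier in chr. The function's internal behaviour is not confirmed here. The vector x is stand-in data and the Diagonal column is omitted.

```r
# Small helper to build the identifiers used in the example
ids <- function(suffix) paste0("TCGA-OR-A5J", suffix)

# The three example blocks (Diagonal column omitted)
block1 <- data.frame(chr = ids(1:4), order = 1:4, newOrder = 1:4,
                     row.names = ids(1:4))
block2 <- data.frame(chr = ids(c("A", "B", "C", "D")), order = 10:13,
                     newOrder = 5:8, row.names = ids(5:8))
block3 <- data.frame(chr = ids(5:9), order = 5:9, newOrder = 9:13,
                     row.names = ids(c("9", "A", "B", "C", "D")))
blocks <- rbind(block1, block2, block3)

# Stand-in data: 13 elements with values 10, 20, ..., 130
x <- seq_len(13) * 10

# Assumed semantics: the value at position `order` moves to `newOrder`
sorted <- numeric(length(x))
sorted[blocks$newOrder] <- x[blocks$order]

# The moved elements are labelled with the identifiers in `chr`
new_names <- character(length(x))
new_names[blocks$newOrder] <- as.character(blocks$chr)
names(sorted) <- new_names

sorted
```

Under this reading, positions 1 to 4 are unchanged, positions 5 to 8 receive the elements previously at 10 to 13, and positions 9 to 13 receive those previously at 5 to 9.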

Examples

if (FALSE) {
library(BigDataStatMeth)

# Create test data
data <- matrix(rnorm(100), 10, 10)
rownames(data) <- paste0("TCGA-OR-A5J", 1:10)

# Save to HDF5
fn <- "test.hdf5"
bdCreate_hdf5_matrix(fn, data, "data", "matrix1",
                     overwriteFile = TRUE)

# Create sorting blocks
block1 <- data.frame(
  chr = paste0("TCGA-OR-A5J", c(2,1,3,4)),
  order = 1:4,
  newOrder = c(2,1,3,4),
  row.names = paste0("TCGA-OR-A5J", 1:4)
)

block2 <- data.frame(
  chr = paste0("TCGA-OR-A5J", c(6,5,8,7)),
  order = 5:8,
  newOrder = c(6,5,8,7),
  row.names = paste0("TCGA-OR-A5J", 5:8)
)

# Sort dataset
bdSort_hdf5_dataset(
  filename = fn,
  group = "data",
  dataset = "matrix1",
  outdataset = "matrix1_sorted",
  blockedSortlist = list(block1, block2),
  func = "sortRows"
)

# Cleanup
if (file.exists(fn)) {
  file.remove(fn)
}
}