BigDataStatMeth (version 1.0.3)

bdCorr_hdf5: Compute correlation matrix for matrices stored in HDF5 format

Description

This function computes a Pearson or Spearman correlation matrix for matrices stored in HDF5 format. It automatically detects which of the following to compute:

  • Single matrix correlation cor(X) - when only dataset_x is provided

  • Cross-matrix correlation cor(X,Y) - when both dataset_x and dataset_y are provided

It also selects automatically between direct computation for small matrices and block-wise processing for large matrices, to limit memory usage and improve performance.
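To illustrate what block-wise processing means, here is a minimal base R sketch. It is not the BigDataStatMeth implementation: the helper name blockwise_cor and the toy matrix are assumptions made for illustration, and the data are held in memory rather than read from HDF5. The idea is that, once the columns are standardised, each block of the correlation matrix depends on only two column blocks of the data.

```r
# Illustrative only: block-wise Pearson correlation in base R.
# This is NOT the package's algorithm; names and sizes are hypothetical.
set.seed(1)
X <- matrix(rnorm(200 * 12), nrow = 200, ncol = 12)  # 200 samples x 12 variables

blockwise_cor <- function(X, block_size = 5L) {
  n <- nrow(X)
  p <- ncol(X)
  # Standardise each column so that crossprod(Z) / (n - 1) equals cor(X).
  Z <- scale(X)
  R <- matrix(NA_real_, nrow = p, ncol = p)
  starts <- seq(1L, p, by = block_size)
  for (i in starts) {
    rows <- i:min(i + block_size - 1L, p)
    for (j in starts) {
      cols <- j:min(j + block_size - 1L, p)
      # Each output block needs only two column blocks of the data.
      R[rows, cols] <- crossprod(Z[, rows, drop = FALSE],
                                 Z[, cols, drop = FALSE]) / (n - 1)
    }
  }
  R
}

R_block  <- blockwise_cor(X, block_size = 5L)
R_direct <- cor(X)
max_abs_diff <- max(abs(R_block - R_direct))
```

In this sketch a smaller block_size lowers the size of each intermediate product at the cost of more loop iterations; how the package itself chooses between the direct and block-wise paths is not stated here.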

Correlation types supported:

  • Single matrix: cor(X) when only dataset_x provided

  • Single matrix transposed: cor(t(X)) when trans_x=TRUE

  • Cross-correlation: cor(X,Y) when both datasets provided

  • Cross with transpose: cor(t(X),Y), cor(X,t(Y)), cor(t(X),t(Y))

For omics data analysis:

  • trans_x=FALSE, trans_y=FALSE: Variables vs Variables (genes vs genes, CpGs vs CpGs)

  • trans_x=TRUE, trans_y=FALSE: Samples vs Variables (individuals vs genes)

  • trans_x=FALSE, trans_y=TRUE: Variables vs Samples (genes vs individuals)

  • trans_x=TRUE, trans_y=TRUE: Samples vs Samples (individuals vs individuals) - optimized to cor(X,Y)
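The effect of transposition on the shape of the result can be checked with base R's cor() on small in-memory matrices. This sketch assumes the base R convention that rows are samples and columns are variables; the matrices expr and meth are hypothetical toy data, and nothing here calls bdCorr_hdf5 itself.

```r
# Illustrative only: how transposition changes what is correlated.
# Assumes rows = samples and columns = variables (base R convention).
set.seed(42)
expr <- matrix(rnorm(6 * 4), nrow = 6, ncol = 4)  # 6 samples x 4 genes
meth <- matrix(rnorm(6 * 3), nrow = 6, ncol = 3)  # 6 samples x 3 CpG sites

# Variables vs variables within one matrix: one row/column per gene.
dim_gene_gene <- dim(cor(expr))

# Samples vs samples within one matrix: one row/column per individual.
dim_sample_sample <- dim(cor(t(expr)))

# Cross-correlation, variables vs variables: genes by CpG sites.
dim_gene_cpg <- dim(cor(expr, meth))
```

In base R, a cross-correlation requires both operands to have the same number of rows after any transposition, so not every combination of transposes is valid for a given pair of matrices.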

Usage

bdCorr_hdf5(
  filename_x,
  group_x,
  dataset_x,
  filename_y = "",
  group_y = "",
  dataset_y = "",
  trans_x = FALSE,
  trans_y = FALSE,
  method = "pearson",
  use_complete_obs = TRUE,
  compute_pvalues = TRUE,
  block_size = 1000L,
  overwrite = FALSE,
  output_filename = "",
  output_group = "",
  output_dataset_corr = "",
  output_dataset_pval = "",
  threads = -1L
)

Value

List with components:

fn

Character string with the HDF5 filename

ds

Character string with the full dataset path to the correlation matrix (group/dataset)

Arguments

filename_x

Character string with the path to the HDF5 file containing matrix X

group_x

Character string indicating the group containing matrix X

dataset_x

Character string indicating the dataset name of matrix X

filename_y

Character string with the path to the HDF5 file containing matrix Y (optional, default: "")

group_y

Character string indicating the group containing matrix Y (optional, default: "")

dataset_y

Character string indicating the dataset name of matrix Y (optional, default: "")

trans_x

Logical, whether to transpose matrix X (default: FALSE)

trans_y

Logical, whether to transpose matrix Y (default: FALSE, ignored for single matrix)

method

Character string indicating correlation method ("pearson" or "spearman", default: "pearson")

use_complete_obs

Logical, whether to use only complete observations (default: TRUE)

compute_pvalues

Logical, whether to compute p-values for correlations (default: TRUE)

block_size

Integer, block size for large matrix processing (default: 1000)

overwrite

Logical, whether to overwrite existing results (default: FALSE)

output_filename

Character string, output HDF5 file (default: same as filename_x)

output_group

Character string, custom output group name (default: auto-generated)

output_dataset_corr

Character string, custom correlation dataset name (default: "correlation")

output_dataset_pval

Character string, custom p-values dataset name (default: "pvalues")

threads

Integer, number of threads for parallel computation (optional; the default of -1 selects the number automatically)
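The method, use_complete_obs and compute_pvalues arguments correspond to standard statistical notions that can be illustrated in base R. The sketch below uses toy vectors and base R functions only; it assumes, without confirmation from this page, that the package follows the same definitions (in particular, the formula the package uses for p-values is not documented here).

```r
# Illustrative only: base R counterparts of three arguments.
set.seed(7)
x <- rnorm(30)
y <- 0.5 * x + rnorm(30)

# method: Spearman correlation is Pearson correlation computed on ranks.
rho_direct <- cor(x, y, method = "spearman")
rho_ranks  <- cor(rank(x), rank(y), method = "pearson")

# use_complete_obs: with complete observations, any pair containing
# a missing value is dropped before the correlation is computed.
x_na <- x
x_na[3] <- NA
r_complete <- cor(x_na, y, use = "complete.obs")
r_subset   <- cor(x[-3], y[-3])

# compute_pvalues: a two-sided p-value for a Pearson correlation can be
# derived from the t distribution with n - 2 degrees of freedom.
n <- length(x)
r <- cor(x, y)
t_stat    <- r * sqrt((n - 2) / (1 - r^2))
p_manual  <- 2 * pt(-abs(t_stat), df = n - 2)
p_cortest <- cor.test(x, y, method = "pearson")$p.value
```

Setting compute_pvalues = FALSE is worth considering when only the correlation matrix is needed, since the p-values form a second matrix of the same size.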

Examples

if (FALSE) {
# Backward compatible - existing code works unchanged
result_original <- bdCorr_hdf5("data.h5", "expression", "genes")

# New transpose functionality
# Gene-gene correlations (variables)
gene_corr <- bdCorr_hdf5("omics.h5", "expression", "genes", trans_x = FALSE)

# Sample-sample correlations (individuals) 
sample_corr <- bdCorr_hdf5("omics.h5", "expression", "genes", trans_x = TRUE)

# Cross-correlation: genes vs methylation sites (variables vs variables)
cross_vars <- bdCorr_hdf5("omics.h5", "expression", "genes", 
                         "omics.h5", "methylation", "cpg_sites",
                         trans_x = FALSE, trans_y = FALSE)

# Cross-correlation: samples vs methylation sites (samples vs variables)
samples_vs_cpg <- bdCorr_hdf5("omics.h5", "expression", "genes",
                             "omics.h5", "methylation", "cpg_sites", 
                             trans_x = TRUE, trans_y = FALSE)
}
