BigDataStatMeth

Overview

BigDataStatMeth provides statistical methods and linear algebra operations for large-scale data analysis, built on block-wise algorithms and HDF5 storage. It is designed for genomic, transcriptomic, and multi-omic data, and it can process datasets that exceed available RAM by partitioning them into blocks and computing on disk-backed data.

The package offers both R and C++ APIs, allowing flexible integration into existing workflows while maintaining high performance for computationally intensive operations.

Key Features

  • Block-wise algorithms: Process data larger than memory through intelligent partitioning
  • HDF5 integration: Seamless storage and computation with hierarchical data format
  • Parallel processing: Multi-threaded operations for enhanced performance
  • Dual API: Complete R interface with underlying C++ implementation for performance
  • Statistical methods: PCA, SVD, CCA, regression models, and more
  • Production-ready: Extensively tested on genomic datasets with millions of features
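
The block-wise idea behind the first feature can be illustrated in plain R. The sketch below is not the package's implementation: `block_mult()` and its `block_size` argument are hypothetical names used only here. It multiplies two in-memory matrices one row block at a time, so only a slice of `A` is needed at each step.

```r
# Minimal base-R sketch of block-wise multiplication (illustrative only;
# block_mult() is a hypothetical helper, not part of BigDataStatMeth).
block_mult <- function(A, B, block_size = 100L) {
  stopifnot(ncol(A) == nrow(B))
  out <- matrix(0, nrow(A), ncol(B))
  starts <- seq(1L, nrow(A), by = block_size)
  for (s in starts) {
    rows <- s:min(s + block_size - 1L, nrow(A))
    # Only this row block of A is touched in this iteration
    out[rows, ] <- A[rows, , drop = FALSE] %*% B
  }
  out
}

set.seed(1)
A <- matrix(rnorm(250 * 40), 250, 40)
B <- matrix(rnorm(40 * 30), 40, 30)
C_block <- block_mult(A, B, block_size = 64L)
```

In a disk-backed setting each row block would be read from storage instead of sliced from memory, which is what keeps the working set small.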

Installation

From CRAN (Stable Release)

install.packages("BigDataStatMeth")

From GitHub (Development Version)

# Install devtools if needed
install.packages("devtools")

# Install BigDataStatMeth
devtools::install_github("isglobal-brge/BigDataStatMeth")

System Requirements

R packages:

  • Matrix
  • rhdf5 (Bioconductor)
  • RcppEigen
  • RSpectra

System dependencies:

  • HDF5 library (>= 1.8)
  • C++11 compatible compiler
  • For Windows: Rtools

Install Bioconductor dependencies:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

BiocManager::install(c("rhdf5", "HDF5Array"))

Quick Start

Basic Workflow: PCA on Large Genomic Data

library(BigDataStatMeth)
library(rhdf5)

# Create HDF5 file from matrix
genotype_matrix <- matrix(rnorm(5000 * 10000), 5000, 10000)
bdCreate_hdf5_matrix(
  filename = "genomics.hdf5",
  object = genotype_matrix,
  group = "data",
  dataset = "genotypes"
)

# Perform block-wise PCA
pca_result <- bdPCA_hdf5(
  filename = "genomics.hdf5",
  group = "data",
  dataset = "genotypes",
  k = 4,              # Number of blocks
  bcenter = TRUE,     # Center data
  bscale = FALSE,     # Don't scale
  threads = 4         # Use 4 threads
)

# Access results
components <- pca_result$components
variance_explained <- pca_result$variance_prop

Working with HDF5 Files

# Matrix operations directly on HDF5
result <- bdblockmult_hdf5(
  filename = "data.hdf5",
  group = "matrices",
  A = "matrix_A",
  B = "matrix_B"
)

# Cross-product
crossp <- bdCrossprod_hdf5(
  filename = "data.hdf5",
  group = "matrices",
  A = "matrix_A"
)

# SVD decomposition
svd_result <- bdSVD_hdf5(
  filename = "data.hdf5",
  group = "matrices",
  dataset = "matrix_A",
  k = 8,
  threads = 4
)

Core Functionality

Linear Algebra Operations

Operation | R Function | Features
Matrix multiplication | bdblockmult_hdf5() | Block-wise, parallel, HDF5
Cross-product | bdCrossprod_hdf5() | t(A) %*% A, t(A) %*% B
Transposed cross-product | bdtCrossprod_hdf5() | A %*% t(A), A %*% t(B)
SVD | bdSVD_hdf5() | Block-wise, hierarchical
QR decomposition | bdQR_hdf5() | Block-wise
Cholesky | bdCholesky_hdf5() | For positive-definite matrices
Matrix inversion | bdInvCholesky_hdf5() | Via Cholesky decomposition
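
The cross-product identities and the Cholesky route to an inverse can be checked on small in-memory matrices. This sketch uses only `crossprod()`, `tcrossprod()`, `chol()`, and `chol2inv()` from base R; it says nothing about how the HDF5 functions are implemented.

```r
set.seed(7)
A <- matrix(rnorm(50 * 6), 50, 6)
B <- matrix(rnorm(50 * 4), 50, 4)

cp_AA  <- crossprod(A)        # t(A) %*% A
cp_AB  <- crossprod(A, B)     # t(A) %*% B
tcp_AA <- tcrossprod(A)       # A %*% t(A)

# Inversion via Cholesky: t(A) %*% A is positive definite
# when A has full column rank, so chol() applies
R_fac    <- chol(cp_AA)       # upper triangular, cp_AA = t(R_fac) %*% R_fac
inv_chol <- chol2inv(R_fac)   # inverse of cp_AA from its Cholesky factor
```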

Statistical Methods

Method | R Function | Description
Principal Component Analysis | bdPCA_hdf5() | Block-wise PCA with centering/scaling
Singular Value Decomposition | bdSVD_hdf5() | Hierarchical block-wise SVD
Canonical Correlation Analysis | bdCCA_hdf5() | Multi-omic data integration
Linear Regression | bdlm_hdf5() | Large-scale regression models

Data Management

Operation | R Function | Purpose
Create HDF5 dataset | bdCreate_hdf5_matrix() | Initialize HDF5 files
Normalize data | bdNormalize_hdf5() | Center and/or scale
Remove low-quality data | bdRemovelowdata_hdf5() | Filter by missing values
Impute missing values | bdImputeSNPs_hdf5() | Mean/median imputation
Split datasets | bdSplit_matrix_hdf5() | Partition into blocks
Merge datasets | bdBind_hdf5_datasets() | Combine by rows/columns

Utility Functions

Function | Purpose
bdgetDim_hdf5() | Get dataset dimensions
bdExists_hdf5_element() | Check if dataset exists
bdgetDatasetsList_hdf5() | List all datasets in group
bdRemove_hdf5_element() | Delete dataset or group
bdImportTextFile_hdf5() | Import text files to HDF5

Documentation

Comprehensive documentation is available at https://isglobal-brge.github.io/BigDataStatMeth/

Vignettes

# List available vignettes
vignette(package = "BigDataStatMeth")

# View specific vignette
vignette("getting-started", package = "BigDataStatMeth")
vignette("pca-genomics", package = "BigDataStatMeth")

Performance

BigDataStatMeth is designed for efficiency:

  • Block-wise computation: Process 100+ GB datasets with 8-16 GB RAM
  • Parallel algorithms: Multi-core support for matrix operations
  • Optimized I/O: Efficient HDF5 chunking and access patterns
  • Memory management: Controlled memory usage through block size tuning
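
As a rough illustration of block size tuning, the arithmetic below estimates how many rows of a double-precision matrix fit in a given memory budget. The helper `rows_per_block()` and the three-copies safety factor are assumptions made for this sketch, not package defaults.

```r
# A double takes 8 bytes, so a block of r rows and p columns needs 8 * r * p bytes.
rows_per_block <- function(n_cols, budget_gb, copies = 3) {
  budget_bytes <- budget_gb * 1024^3
  # Reserve room for `copies` simultaneous blocks (input, output, workspace)
  floor(budget_bytes / (8 * n_cols * copies))
}

n_rows <- 2e6
n_cols <- 1e4
full_size_gb <- 8 * n_rows * n_cols / 1024^3   # size of the full matrix in GB
block_rows   <- rows_per_block(n_cols, budget_gb = 8)
n_blocks     <- ceiling(n_rows / block_rows)
```

With these numbers the full matrix is about 149 GB, and an 8 GB budget allows blocks of 35,791 rows, giving 56 blocks.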

Use Cases

BigDataStatMeth is particularly suited for:

  • Genomics: GWAS, eQTL analysis, population genetics
  • Transcriptomics: RNA-seq analysis, differential expression
  • Multi-omics: Data integration (CCA, MOFA-style analyses)
  • Large-scale statistics: Any analysis requiring matrix operations on big data
  • Method development: Building new statistical methods for big data

Examples

Example 1: Genomic PCA with Quality Control

library(BigDataStatMeth)

# Write genomic data to HDF5
# (assumes `genotypes` is a numeric matrix already in memory)
bdCreate_hdf5_matrix("gwas.hdf5", genotypes, "data", "snps")

# Quality control
bdRemovelowdata_hdf5("gwas.hdf5", "data", "snps", 
                     pcent = 0.05, bycols = TRUE)  # Remove SNPs >5% missing

# Impute remaining missing values
bdImputeSNPs_hdf5("gwas.hdf5", "data", "snps_filtered")

# Perform PCA
pca <- bdPCA_hdf5("gwas.hdf5", "data", "snps_filtered", 
                  k = 8, bcenter = TRUE, threads = 4)

# Plot results
plot(pca$components[,1], pca$components[,2],
     xlab = "PC1", ylab = "PC2",
     main = "Population Structure")
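
The filter-then-impute logic of this pipeline can be sketched in base R on a toy genotype matrix. This is an in-memory analogue under stated assumptions (0/1/2 coding, SNPs in columns, mean imputation); it does not reproduce the package's HDF5 functions or their exact imputation rule.

```r
set.seed(123)
geno <- matrix(sample(0:2, 100 * 20, replace = TRUE), 100, 20)
storage.mode(geno) <- "double"
geno[sample(length(geno), 60)] <- NA   # scatter some missing calls
geno[1:30, 5] <- NA                    # make SNP 5 clearly low quality

# Drop SNPs (columns) with more than 5% missing values
miss_rate <- colMeans(is.na(geno))
keep <- miss_rate <= 0.05
geno_f <- geno[, keep, drop = FALSE]

# Impute remaining missing values with the column mean
col_mu <- colMeans(geno_f, na.rm = TRUE)
na_idx <- which(is.na(geno_f), arr.ind = TRUE)
geno_f[na_idx] <- col_mu[na_idx[, "col"]]
```

Filtering before imputing matters: imputing first would fill in SNPs that are mostly missing and let them pass the quality filter.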

Example 2: Multi-Omic CCA

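library(BigDataStatMeth)
library(rhdf5)  # provides h5read(), used below

# Prepare data
# (assumes `gene_expression` and `methylation` are numeric matrices in memory)
bdCreate_hdf5_matrix("multi_omic.hdf5", gene_expression, "data", "genes")
bdCreate_hdf5_matrix("multi_omic.hdf5", methylation, "data", "cpgs")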

# Normalize
bdNormalize_hdf5("multi_omic.hdf5", "data", "genes", 
                 bcenter = TRUE, bscale = TRUE)
bdNormalize_hdf5("multi_omic.hdf5", "data", "cpgs",
                 bcenter = TRUE, bscale = TRUE)

# Canonical Correlation Analysis
cca <- bdCCA_hdf5(
  filename = "multi_omic.hdf5",
  X = "NORMALIZED/data/genes",
  Y = "NORMALIZED/data/cpgs",
  m = 10  # Number of blocks
)

# Extract canonical correlations
correlations <- h5read("multi_omic.hdf5", "Results/cor")
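
For a sense of what canonical correlations are, the toy example below uses `cancor()` from base R's stats package on two small simulated blocks that share one latent signal. It is an in-memory illustration with made-up data, not the block-wise routine called above.

```r
set.seed(2024)
n <- 300
latent <- rnorm(n)                       # signal shared by the two blocks

# First column of each block carries the shared signal plus noise;
# the remaining columns are pure noise
X <- cbind(latent + rnorm(n, sd = 0.5), matrix(rnorm(n * 3), n, 3))
Y <- cbind(latent + rnorm(n, sd = 0.5), matrix(rnorm(n * 2), n, 2))

cc  <- cancor(X, Y)           # centers both blocks by default
rho <- cc$cor                 # canonical correlations, in decreasing order
```

Only the first canonical correlation should be large here, because the two blocks share a single latent variable.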

Example 3: Custom Method Development (C++ API)

#include <Rcpp.h>
#include "BigDataStatMeth.hpp"

using namespace BigDataStatMeth;

// [[Rcpp::export]]
void custom_analysis(std::string filename, std::string dataset) {
  
  hdf5Dataset* ds = new hdf5Dataset(filename, dataset, false);
  ds->openDataset();
  
  // Your custom algorithm using BigDataStatMeth functions
  // Block-wise processing, matrix operations, etc.
  
  delete ds;
}

See Developing Methods for complete examples.

Citation

If you use BigDataStatMeth in your research, please cite:

Pelegri-Siso D, Gonzalez JR (2024). BigDataStatMeth: Statistical Methods 
for Big Data Using Block-wise Algorithms and HDF5 Storage. 
R package version X.X.X, https://github.com/isglobal-brge/BigDataStatMeth

BibTeX entry:

@Manual{bigdatastatmeth,
  title = {BigDataStatMeth: Statistical Methods for Big Data},
  author = {Dolors Pelegri-Siso and Juan R. Gonzalez},
  year = {2024},
  note = {R package version X.X.X},
  url = {https://github.com/isglobal-brge/BigDataStatMeth},
}

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

  • Follow existing code style (Rcpp coding standards)
  • Add tests for new functionality
  • Update documentation (Roxygen2 for R, Doxygen for C++)
  • Run R CMD check before submitting

License

MIT License - see LICENSE file for details.

Authors

Dolors Pelegri-Siso
Bioinformatics Research Group in Epidemiology (BRGE)
ISGlobal - Barcelona Institute for Global Health

Juan R. Gonzalez
Bioinformatics Research Group in Epidemiology (BRGE)
ISGlobal - Barcelona Institute for Global Health

Acknowledgments

Development of BigDataStatMeth was supported by ISGlobal and the Bioinformatics Research Group in Epidemiology (BRGE).

Package Details

Version: 1.0.3
License: MIT + file LICENSE
Maintainer: Dolors Pelegri-Siso
Last Published: December 22nd, 2025
Monthly Downloads: 253

Functions in BigDataStatMeth (1.0.3)

  • bdImputeSNPs_hdf5: Impute Missing SNP Values in HDF5 Dataset
  • bdPCA_hdf5: Principal Component Analysis for HDF5-Stored Matrices
  • bdInvCholesky_hdf5: Matrix Inversion using Cholesky Decomposition for HDF5-Stored Matrices
  • bdReduce_hdf5_dataset: Reduce Multiple HDF5 Datasets
  • bdNormalize_hdf5: Normalize dataset in HDF5 file
  • bdRemoveMAF_hdf5: Remove SNPs Based on Minor Allele Frequency
  • bdQR: QR Decomposition for In-Memory Matrices
  • bdQR_hdf5: QR Decomposition for HDF5-Stored Matrices
  • bdIsLocked_hdf5: Test whether an HDF5 file is locked (in use)
  • bdRemove_hdf5_element: Remove Elements from HDF5 File
  • bdSort_hdf5_dataset: Sort HDF5 Dataset Using Predefined Order
  • bdScalarwproduct: Matrix–scalar weighted product
  • bdSplit_matrix_hdf5: Split HDF5 Dataset into Submatrices
  • bdWrite_hdf5_dimnames: Write dimnames to an HDF5 dataset
  • bdSolve: Solve Linear System AX = B (In-Memory)
  • bdSolve_hdf5: Solve Linear System AX = B (HDF5-Stored)
  • bdRemovelowdata_hdf5: Remove Low-Representation SNPs from HDF5 Dataset
  • bdWriteDiagonal_hdf5: Write Matrix Diagonal to HDF5
  • bdWriteOppsiteTriangularMatrix_hdf5: Write Upper/Lower Triangular Matrix
  • bdSVD_hdf5: Singular Value Decomposition for HDF5-Stored Matrices
  • bd_wproduct: Weighted matrix–vector products and cross-products
  • bdblockmult_hdf5: HDF5 datasets multiplication
  • bdblockSubstract: Block-Based Matrix Subtraction
  • bdblockSum: Block-Based Matrix Addition
  • bdblockSum_hdf5: HDF5 dataset addition
  • bdcomputeMatrixVector_hdf5: Apply Vector Operations to HDF5 Matrix
  • bdblockmult_sparse_hdf5: Block matrix multiplication for sparse matrices
  • bdblockSubstract_hdf5: HDF5 dataset subtraction
  • bdapply_Function_hdf5: Apply function to different datasets inside a group
  • bdblockMult: Block-Based Matrix Multiplication
  • bdpseudoinv: Compute Matrix Pseudoinverse (In-Memory)
  • bdgetDim_hdf5: Get HDF5 Dataset Dimensions
  • bdtCrossprod: Efficient Matrix Transposed Cross-Product Computation
  • bdgetSDandMean_hdf5: Compute Matrix Standard Deviation and Mean in HDF5
  • bdgetDatasetsList_hdf5: List Datasets in HDF5 Group
  • bdgetDiagonal_hdf5: Get Matrix Diagonal from HDF5
  • bdpseudoinv_hdf5: Compute Matrix Pseudoinverse (HDF5-Stored)
  • bdsubset_hdf5_dataset: Create Subset of HDF5 Dataset
  • miRNA: miRNA
  • colesterol: Dataset colesterol
  • bdtCrossprod_hdf5: Transposed cross product with HDF5 matrices
  • cancer: Cancer classification
  • bdmove_hdf5_dataset: Move HDF5 Dataset
  • bdCreate_hdf5_group: Create Group in an HDF5 File
  • bdCreate_hdf5_emptyDataset: Create an empty HDF5 dataset (no data written)
  • bdBind_hdf5_datasets: Bind matrices by rows or columns
  • bdCheckMatrix_hdf5: Check Matrix Suitability for Eigenvalue Decomposition with Spectra
  • BigDataStatMeth: BigDataStatMeth package documentation
  • bdCholesky_hdf5: Cholesky Decomposition for HDF5-Stored Matrices
  • bdCorr_matrix: Compute correlation matrix for in-memory matrices (unified function)
  • bdCreate_diagonal_hdf5: Create Diagonal Matrix or Vector in HDF5 File
  • bdCreate_hdf5_matrix: Create HDF5 data file and write data to it
  • bdCorr_hdf5: Compute correlation matrix for matrices stored in HDF5 format
  • bdImportData_hdf5: Import data from URL or file to HDF5 format
  • bdDiag_add_hdf5: Add Diagonal Elements from HDF5 Matrices or Vectors
  • bdDiag_scalar_hdf5: Apply Scalar Operations to Diagonal Elements
  • bdCrossprod_hdf5: Crossprod with HDF5 matrix
  • bdDiag_subtract_hdf5: Subtract Diagonal Elements from HDF5 Matrices or Vectors
  • bdCrossprod: Efficient Matrix Cross-Product Computation
  • bdImportTextFile_hdf5: Import Text File to HDF5
  • bdEigen_hdf5: Eigenvalue Decomposition for HDF5-Stored Matrices using Spectra
  • bdDiag_multiply_hdf5: Multiply Diagonal Elements from HDF5 Matrices or Vectors
  • bdDiag_divide_hdf5: Divide Diagonal Elements from HDF5 Matrices or Vectors