# BigDataStatMeth

## Overview
BigDataStatMeth provides efficient statistical methods and linear algebra operations for large-scale data analysis using block-wise algorithms and HDF5 storage. Designed for genomic, transcriptomic, and multi-omic data analysis, it enables processing datasets that exceed available RAM through intelligent data partitioning and disk-based computation.
The package offers both R and C++ APIs, allowing flexible integration into existing workflows while maintaining high performance for computationally intensive operations.
## Key Features
- **Block-wise algorithms**: Process data larger than memory through intelligent partitioning (see the sketch below)
- **HDF5 integration**: Seamless storage and computation with the hierarchical data format
- **Parallel processing**: Multi-threaded operations for enhanced performance
- **Dual API**: Complete R interface with an underlying C++ implementation for performance
- **Statistical methods**: PCA, SVD, CCA, regression models, and more
- **Production-ready**: Extensively tested on genomic datasets with millions of features
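To make the block-wise idea concrete, here is a minimal sketch using rhdf5 directly. It illustrates the read-a-block, compute, discard pattern; it is not the package's internal implementation, and the helper name and block size are illustrative:

```r
# Conceptual sketch only: column means of a large HDF5 matrix computed one
# column block at a time, so memory use is bounded by the block size.
library(rhdf5)

blockwise_colmeans <- function(filename, dataset, ncols, block_size = 1000) {
  means <- numeric(ncols)
  for (start in seq(1, ncols, by = block_size)) {
    end <- min(start + block_size - 1, ncols)
    # Only columns start:end are ever held in memory
    block <- h5read(filename, dataset, index = list(NULL, start:end))
    means[start:end] <- colMeans(block)
  }
  means
}

# e.g., on the file created in the Quick Start below:
# blockwise_colmeans("genomics.hdf5", "data/genotypes", ncols = 10000)
```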
## Installation

### From CRAN (Stable Release)

```r
install.packages("BigDataStatMeth")
```

### From GitHub (Development Version)
```r
# Install devtools if needed
install.packages("devtools")

# Install BigDataStatMeth
devtools::install_github("isglobal-brge/BigDataStatMeth")
```

### System Requirements
**R packages:**

- Matrix
- rhdf5 (Bioconductor)
- RcppEigen
- RSpectra

**System dependencies:**

- HDF5 library (>= 1.8)
- C++11-compatible compiler
- On Windows: Rtools
Install Bioconductor dependencies:

```r
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("rhdf5", "HDF5Array"))
```

## Quick Start
### Basic Workflow: PCA on Large Genomic Data
```r
library(BigDataStatMeth)
library(rhdf5)

# Create an HDF5 file from an in-memory matrix
genotype_matrix <- matrix(rnorm(5000 * 10000), 5000, 10000)
bdCreate_hdf5_matrix(
  filename = "genomics.hdf5",
  object = genotype_matrix,
  group = "data",
  dataset = "genotypes"
)

# Perform block-wise PCA
pca_result <- bdPCA_hdf5(
  filename = "genomics.hdf5",
  group = "data",
  dataset = "genotypes",
  k = 4,           # Number of blocks
  bcenter = TRUE,  # Center data
  bscale = FALSE,  # Don't scale
  threads = 4      # Use 4 threads
)

# Access results
components <- pca_result$components
variance_explained <- pca_result$variance_prop
```

### Working with HDF5 Files
```r
# Matrix operations directly on HDF5
result <- bdblockmult_hdf5(
  filename = "data.hdf5",
  group = "matrices",
  A = "matrix_A",
  B = "matrix_B"
)

# Cross-product
crossp <- bdCrossprod_hdf5(
  filename = "data.hdf5",
  group = "matrices",
  A = "matrix_A"
)

# SVD decomposition
svd_result <- bdSVD_hdf5(
  filename = "data.hdf5",
  group = "matrices",
  dataset = "matrix_A",
  k = 8,
  threads = 4
)
```

## Core Functionality
### Linear Algebra Operations
| Operation | R Function | Features |
|---|---|---|
| Matrix multiplication | `bdblockmult_hdf5()` | Block-wise, parallel, HDF5 |
| Cross-product | `bdCrossprod_hdf5()` | `t(A) %*% A`, `t(A) %*% B` |
| Transposed cross-product | `bdtCrossprod_hdf5()` | `A %*% t(A)`, `A %*% t(B)` |
| SVD | `bdSVD_hdf5()` | Block-wise, hierarchical |
| QR decomposition | `bdQR_hdf5()` | Block-wise |
| Cholesky | `bdCholesky_hdf5()` | For positive-definite matrices |
| Matrix inversion | `bdInvCholesky_hdf5()` | Via Cholesky decomposition |
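These operations remain feasible on disk-backed data because cross-products decompose over row blocks: `t(A) %*% B` equals the sum of `t(A_i) %*% B_i` across blocks, so each partial product fits in memory. A pure base-R illustration of the accumulation (not the package's implementation):

```r
# Block-wise cross-product: accumulate t(A_i) %*% B_i over row blocks.
A <- matrix(rnorm(1000 * 50), 1000, 50)
B <- matrix(rnorm(1000 * 20), 1000, 20)

block_size <- 100
acc <- matrix(0, ncol(A), ncol(B))
for (start in seq(1, nrow(A), by = block_size)) {
  rows <- start:min(start + block_size - 1, nrow(A))
  acc <- acc + crossprod(A[rows, , drop = FALSE], B[rows, , drop = FALSE])
}

all.equal(acc, crossprod(A, B))  # TRUE: block-wise matches in-memory
```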
### Statistical Methods
| Method | R Function | Description |
|---|---|---|
| Principal Component Analysis | `bdPCA_hdf5()` | Block-wise PCA with centering/scaling |
| Singular Value Decomposition | `bdSVD_hdf5()` | Hierarchical block-wise SVD |
| Canonical Correlation Analysis | `bdCCA_hdf5()` | Multi-omic data integration |
| Linear Regression | `bdlm_hdf5()` | Large-scale regression models |
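Block-wise regression follows the same accumulation pattern as the cross-products above: `X'X` and `X'y` are summed over row blocks, then the small normal-equations system is solved. A base-R sketch of the principle, not `bdlm_hdf5()`'s actual implementation:

```r
# Least squares from accumulated cross-products: beta = solve(X'X, X'y).
# The p x p and p x 1 accumulators are tiny, so n can be arbitrarily large.
set.seed(1)
X <- cbind(1, matrix(rnorm(10000 * 5), 10000, 5))
y <- drop(X %*% c(2, 1, -1, 0.5, 0, 3)) + rnorm(10000)

block_size <- 1000
XtX <- matrix(0, ncol(X), ncol(X))
Xty <- numeric(ncol(X))
for (start in seq(1, nrow(X), by = block_size)) {
  rows <- start:min(start + block_size - 1, nrow(X))
  XtX <- XtX + crossprod(X[rows, , drop = FALSE])
  Xty <- Xty + drop(crossprod(X[rows, , drop = FALSE], y[rows]))
}
beta <- solve(XtX, Xty)

all.equal(beta, unname(coef(lm(y ~ X - 1))))  # TRUE
```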
### Data Management
| Operation | R Function | Purpose |
|---|---|---|
| Create HDF5 dataset | `bdCreate_hdf5_matrix()` | Initialize HDF5 files |
| Normalize data | `bdNormalize_hdf5()` | Center and/or scale |
| Remove low-quality data | `bdRemovelowdata_hdf5()` | Filter by missing values |
| Impute missing values | `bdImputeSNPs_hdf5()` | Mean/median imputation |
| Split datasets | `bdSplit_matrix_hdf5()` | Partition into blocks |
| Merge datasets | `bdBind_hdf5_datasets()` | Combine by rows/columns |
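A minimal create-and-normalize round trip, using only the signatures that appear in the examples later in this README. Note that, as Example 2 below suggests, `bdNormalize_hdf5()` writes its output to a parallel `NORMALIZED` group rather than overwriting the source dataset:

```r
library(BigDataStatMeth)

# Create a small on-disk dataset
expr <- matrix(rnorm(200 * 50), 200, 50)
bdCreate_hdf5_matrix("example.hdf5", expr, group = "data", dataset = "expr")

# Center and scale; the result appears under "NORMALIZED/data/expr"
# (output location inferred from the CCA example below)
bdNormalize_hdf5("example.hdf5", "data", "expr",
                 bcenter = TRUE, bscale = TRUE)
```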
### Utility Functions
| Function | Purpose |
|---|---|
| `bdgetDim_hdf5()` | Get dataset dimensions |
| `bdExists_hdf5_element()` | Check whether a dataset exists |
| `bdgetDatasetsList_hdf5()` | List all datasets in a group |
| `bdRemove_hdf5_element()` | Delete a dataset or group |
| `bdImportTextFile_hdf5()` | Import text files to HDF5 |
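These utilities combine naturally for housekeeping between pipeline steps. The calls below are a hedged sketch: the argument order is an assumption for illustration only; check the API Reference for the authoritative signatures:

```r
library(BigDataStatMeth)

# NOTE: argument order below is assumed for illustration; see the API
# Reference for the exact signatures.

# Inspect what is stored and how large it is
bdgetDatasetsList_hdf5("genomics.hdf5", "data")
bdgetDim_hdf5("genomics.hdf5", "data/genotypes")

# Guard a step on existence, then clean up an intermediate result
if (bdExists_hdf5_element("genomics.hdf5", "data/tmp")) {
  bdRemove_hdf5_element("genomics.hdf5", "data/tmp")
}
```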
## Documentation
Comprehensive documentation is available at https://isglobal-brge.github.io/BigDataStatMeth/
### Sections
- **Getting Started**: Installation and first steps
- **Fundamentals**: HDF5 storage and block-wise computing concepts
- **Workflows**: Complete analysis examples (PCA, CCA, cross-platform integration)
- **Developing Methods**: Building new statistical methods with BigDataStatMeth
- **API Reference**: Complete function documentation (R and C++)
- **Technical Guide**: Performance optimization and benchmarking
### Vignettes
```r
# List available vignettes
vignette(package = "BigDataStatMeth")

# View specific vignettes
vignette("getting-started", package = "BigDataStatMeth")
vignette("pca-genomics", package = "BigDataStatMeth")
```

## Performance
BigDataStatMeth is designed for efficiency:

- **Block-wise computation**: Process 100+ GB datasets with 8-16 GB of RAM
- **Parallel algorithms**: Multi-core support for matrix operations
- **Optimized I/O**: Efficient HDF5 chunking and access patterns
- **Memory management**: Controlled memory usage through block-size tuning (see the estimate below)
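The memory arithmetic behind block-size tuning is straightforward: a double-precision block of `r` rows by `c` columns occupies roughly `8 * r * c` bytes. A back-of-the-envelope helper (illustrative only; the overhead factor is an assumption to leave headroom for temporary copies):

```r
# Estimate rows per block for a given memory budget (double precision).
rows_per_block <- function(ncols, budget_gb = 2, overhead = 4) {
  # overhead > 1 leaves headroom for temporaries created during computation
  floor(budget_gb * 1024^3 / (8 * ncols * overhead))
}

rows_per_block(ncols = 500000, budget_gb = 2)  # ~134 rows per block
```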
## Use Cases
BigDataStatMeth is particularly suited for:

- **Genomics**: GWAS, eQTL analysis, population genetics
- **Transcriptomics**: RNA-seq analysis, differential expression
- **Multi-omics**: Data integration (CCA, MOFA-style analyses)
- **Large-scale statistics**: Any analysis requiring matrix operations on big data
- **Method development**: Building new statistical methods for big data
## Examples

### Example 1: Genomic PCA with Quality Control
```r
library(BigDataStatMeth)

# Load genomic data
bdCreate_hdf5_matrix("gwas.hdf5", genotypes, "data", "snps")

# Quality control: remove SNPs with more than 5% missing values
bdRemovelowdata_hdf5("gwas.hdf5", "data", "snps",
                     pcent = 0.05, bycols = TRUE)

# Impute remaining missing values
bdImputeSNPs_hdf5("gwas.hdf5", "data", "snps_filtered")

# Perform PCA
pca <- bdPCA_hdf5("gwas.hdf5", "data", "snps_filtered",
                  k = 8, bcenter = TRUE, threads = 4)

# Plot the first two principal components
plot(pca$components[, 1], pca$components[, 2],
     xlab = "PC1", ylab = "PC2",
     main = "Population Structure")
```

### Example 2: Multi-Omic CCA
```r
library(BigDataStatMeth)
library(rhdf5)

# Prepare data
bdCreate_hdf5_matrix("multi_omic.hdf5", gene_expression, "data", "genes")
bdCreate_hdf5_matrix("multi_omic.hdf5", methylation, "data", "cpgs")

# Normalize both blocks (output written to the NORMALIZED group)
bdNormalize_hdf5("multi_omic.hdf5", "data", "genes",
                 bcenter = TRUE, bscale = TRUE)
bdNormalize_hdf5("multi_omic.hdf5", "data", "cpgs",
                 bcenter = TRUE, bscale = TRUE)

# Canonical Correlation Analysis
cca <- bdCCA_hdf5(
  filename = "multi_omic.hdf5",
  X = "NORMALIZED/data/genes",
  Y = "NORMALIZED/data/cpgs",
  m = 10  # Number of blocks
)

# Extract canonical correlations (h5read() is from rhdf5)
correlations <- h5read("multi_omic.hdf5", "Results/cor")
```

### Example 3: Custom Method Development (C++ API)
```cpp
#include <Rcpp.h>
#include "BigDataStatMeth.hpp"

using namespace BigDataStatMeth;

// [[Rcpp::export]]
void custom_analysis(std::string filename, std::string dataset) {
    hdf5Dataset* ds = new hdf5Dataset(filename, dataset, false);
    ds->openDataset();

    // Your custom algorithm using BigDataStatMeth functions:
    // block-wise processing, matrix operations, etc.

    delete ds;
}
```

See Developing Methods for complete examples.
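For interactive experimentation, a snippet like the one above can be compiled with `Rcpp::sourceCpp()`, assuming BigDataStatMeth exposes its C++ headers to client code (the depends attribute below is an assumption; the Developing Methods guide documents the supported setup):

```r
# Assumes the package exposes headers via LinkingTo; verify in the
# Developing Methods guide before relying on this.
# At the top of custom_analysis.cpp, add:
#   // [[Rcpp::depends(BigDataStatMeth)]]
Rcpp::sourceCpp("custom_analysis.cpp")
custom_analysis("genomics.hdf5", "data/genotypes")
```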
## Citation
If you use BigDataStatMeth in your research, please cite:

```
Pelegri-Siso D, Gonzalez JR (2024). BigDataStatMeth: Statistical Methods
for Big Data Using Block-wise Algorithms and HDF5 Storage. R package
version X.X.X, https://github.com/isglobal-brge/BigDataStatMeth.
```

BibTeX entry:
```bibtex
@Manual{bigdatastatmeth,
  title = {BigDataStatMeth: Statistical Methods for Big Data},
  author = {Dolors Pelegri-Siso and Juan R. Gonzalez},
  year = {2024},
  note = {R package version X.X.X},
  url = {https://github.com/isglobal-brge/BigDataStatMeth},
}
```

## Contributing
Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
### Development Guidelines

- Follow the existing code style (Rcpp coding standards)
- Add tests for new functionality
- Update documentation (Roxygen2 for R, Doxygen for C++)
- Run `R CMD check` before submitting
## Getting Help

- **Documentation**: https://isglobal-brge.github.io/BigDataStatMeth/
- **Issues**: [GitHub Issues](https://github.com/isglobal-brge/BigDataStatMeth/issues)
## License
MIT License - see LICENSE file for details.
## Authors

- **Dolors Pelegri-Siso**, Bioinformatics Research Group in Epidemiology (BRGE), ISGlobal - Barcelona Institute for Global Health
- **Juan R. Gonzalez**, Bioinformatics Research Group in Epidemiology (BRGE), ISGlobal - Barcelona Institute for Global Health
## Acknowledgments
Development of BigDataStatMeth was supported by ISGlobal and the Bioinformatics Research Group in Epidemiology (BRGE).