Modern genomic datasets are big (large n), high-dimensional (large p), and multi-layered. The main challenges that need to be addressed are memory requirements and computational demands. Our goal is to develop software that enables researchers to carry out analyses with big genomic data within the R environment.
Functions with the `chunkSize` parameter work best with file-backed matrices such as `BEDMatrix::BEDMatrix` objects. To avoid loading the whole, potentially very large matrix into memory, these functions load chunks of the file-backed matrix into memory and perform the operations one chunk at a time. The size of the chunks is determined by the `chunkSize` parameter. Take care not to set `chunkSize` too high, to avoid memory shortage, particularly when combined with parallel computing.
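The chunking strategy can be illustrated with base R alone: the loop below processes a matrix in column blocks of `chunkSize`, mirroring what the `chunkSize`-aware functions do internally. The in-memory matrix here only keeps the sketch self-contained; in practice the input would be a file-backed `BEDMatrix::BEDMatrix`, and each chunk read would hit the disk.

```r
# Sketch of chunked column-wise processing; only one chunk is held in
# memory at a time.
chunked_col_means <- function(X, chunkSize = 5000L) {
    p <- ncol(X)
    res <- numeric(p)
    for (start in seq(1L, p, by = chunkSize)) {
        end <- min(start + chunkSize - 1L, p)
        chunk <- X[, start:end, drop = FALSE]  # load this block only
        res[start:end] <- colMeans(chunk)
    }
    res
}

X <- matrix(rnorm(100 * 12), nrow = 100)
all.equal(chunked_col_means(X, chunkSize = 5L), colMeans(X))  # TRUE
```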
Functions with the `nCores`, `i`, and `j` parameters provide capabilities for both parallel and distributed computing.

For parallel computing, `nCores` determines the number of cores the code is run on. Memory usage can be an issue for higher values of `nCores` as R is not particularly memory-efficient. As a rule of thumb, at least around `(nCores * object_size(chunk)) + object_size(result)` MB of total memory will be needed for operations on file-backed matrices, not including potential copies of your data that might be created (for example, `stats::lsfit()` runs `cbind(1, X)`). `i` and `j` can be used to include or exclude certain rows or columns. Internally, the `parallel::mclapply()` function is used, and therefore parallel computing will not work on Windows machines.
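A minimal sketch of how this style of parallelism works under the hood, using `parallel::mclapply()` directly (falling back to one core on Windows, where forking is unavailable). The index vector `i` below plays the role of the `i` parameter, restricting which rows are processed:

```r
library(parallel)

X <- matrix(rnorm(200 * 50), nrow = 200)
i <- seq_len(100)  # include only the first 100 rows (like the `i` parameter)
nCores <- if (.Platform$OS.type == "windows") 1L else 2L  # forking needs Unix

# Split the selected rows into one block per core; each block is a "chunk"
# that occupies memory simultaneously, hence the rule of thumb above.
blocks <- split(i, cut(seq_along(i), nCores, labels = FALSE))
res <- mclapply(blocks, function(idx) {
    rowSums(X[idx, , drop = FALSE])
}, mc.cores = nCores)
rowsums <- unlist(res, use.names = FALSE)

all.equal(rowsums, unname(rowSums(X[i, ])))  # TRUE
```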
For distributed computing, `i` and `j` determine the subset of the input matrix that the code runs on. In an HPC environment, this can be used not just to include or exclude certain rows or columns, but also to partition the task among many nodes rather than cores. Scheduler-specific code and code to aggregate the results need to be written by the user. It is recommended to set `nCores` to `1` as nodes are often cheaper than cores.
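A hedged sketch of this pattern under a SLURM-style array job (the environment variable names and the partitioning scheme are assumptions; adapt them to your scheduler). Each task computes its own column slice via `j` with one core, saves a partial result, and a separate user-written step aggregates the pieces:

```r
# Assumed scheduler interface: SLURM array jobs expose SLURM_ARRAY_TASK_ID.
taskId <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "1"))
nTasks <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_COUNT", unset = "4"))

X <- matrix(rnorm(100 * 40), nrow = 100)  # stand-in for a file-backed matrix

# Partition the columns among tasks; this `j` would be passed to the
# function call on this node, with nCores = 1.
parts <- split(seq_len(ncol(X)), cut(seq_len(ncol(X)), nTasks, labels = FALSE))
j <- parts[[taskId]]
partial <- colSums(X[, j, drop = FALSE])

# Each task saves its slice; user-written code aggregates the .rds files later.
saveRDS(partial, sprintf("colsums_part%02d.rds", taskId))
```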
The `extdata` folder contains example files that were generated from the 250k SNP and phenotype data in Atwell et al. (2010). Only the first 300 SNPs of chromosomes 1, 2, and 3 were included to keep the size of the example dataset small. PLINK was used to convert the data to `.bed` and `.raw` files. `FT10` was chosen as a phenotype and is provided as an alternate phenotype file. That file is intentionally shuffled to demonstrate that the additional phenotypes are put in the same order as the rest of the phenotypes.
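The PLINK conversion can be reproduced with commands along these lines (a sketch only: the input prefix and SNP-list file are hypothetical, and flag spellings differ slightly between PLINK 1.07 and 1.9):

```shell
# Assumed input: PLINK files with prefix "atwell"; snps.txt lists the
# first 300 SNPs of chromosomes 1-3 (both names are placeholders).
plink --bfile atwell --extract snps.txt --make-bed --out example
plink --bfile example --recode A --out example   # writes example.raw
```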
We have identified several approaches to tackle these challenges within R:

- File-backed matrices: The data is stored on the hard drive, and users can read in smaller chunks when they are needed.
- Linked arrays: For very large datasets, a single file-backed array may not be sufficient or convenient. A linked array is an array whose content is distributed over multiple file-backed nodes.
- Multiple dispatch: Methods are presented to users so that they can treat these arrays much as if they were in-memory (RAM) arrays.
- Multi-level parallelism: Exploit multi-core and multi-node computing.
- Inputs: Users can create these arrays from standard formats (e.g., PLINK `.bed`).
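The linked-array idea (implemented by the LinkedMatrix package) can be sketched in base R: keep a list of column-wise nodes and translate a requested global column index into a per-node read. This toy version holds its nodes in memory; in the package each node would typically be file-backed.

```r
# Toy column-linked array: two nodes that together form a 10 x 8 matrix.
nodes <- list(matrix(1:50, nrow = 10),   # columns 1-5
              matrix(51:80, nrow = 10))  # columns 6-8
bounds <- cumsum(vapply(nodes, ncol, integer(1)))  # last global column per node

# Map a global column index j to (node, local column) and fetch it.
get_col <- function(nodes, bounds, j) {
    node <- which(j <= bounds)[1]
    local <- j - c(0L, bounds)[node]
    nodes[[node]][, local]
}

get_col(nodes, bounds, 7)  # column 7 lives in node 2, local column 2 -> 61:70
```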
The BGData package is an umbrella package that comprises several packages: BEDMatrix, LinkedMatrix, and symDMatrix. It features scalable and efficient computational methods for large genomic datasets, such as genome-wide association studies (GWAS) or genomic relationship matrices (G matrix). It also contains a data structure called `BGData` that holds genotypes in the `@geno` slot, phenotypes in the `@pheno` slot, and additional information in the `@map` slot.
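The slot layout can be illustrated with a minimal S4 analogue (a sketch, not the package's actual class definition: the real `BGData` class lives in the BGData package and supports file-backed genotypes):

```r
library(methods)

# Hypothetical stand-in mirroring the @geno / @pheno / @map slot layout.
setClass("BGDataSketch", slots = c(
    geno  = "matrix",      # genotypes (file-backed in the real package)
    pheno = "data.frame",  # one row per sample
    map   = "data.frame"   # one row per variant
))

bg <- new("BGDataSketch",
    geno  = matrix(sample(0:2, 20, replace = TRUE), nrow = 4,
                   dimnames = list(paste0("id", 1:4), paste0("snp", 1:5))),
    pheno = data.frame(FT10 = rnorm(4), row.names = paste0("id", 1:4)),
    map   = data.frame(chromosome = 1L, snp = paste0("snp", 1:5))
)

dim(bg@geno)    # 4 5: samples by variants
nrow(bg@pheno)  # 4: one phenotype row per sample
```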
See `BEDMatrix::BEDMatrix-package`, `LinkedMatrix::LinkedMatrix-package`, and `symDMatrix::symDMatrix-package` for an introduction to the respective packages.