Modern genomic datasets are big (large n), high-dimensional (large p), and multi-layered. The main challenges that need to be addressed are memory requirements and computational demands. Our goal is to develop software that enables researchers to carry out analyses with big genomic data within the R environment.
Functions with the `chunkSize` parameter work best with file-backed matrices such as `BEDMatrix::BEDMatrix` objects. To avoid loading the whole, potentially very large matrix into memory, these functions load chunks of the file-backed matrix into memory and perform the operations one chunk at a time. The size of the chunks is determined by the `chunkSize` parameter. Take care not to set `chunkSize` too high, to avoid memory shortage, particularly when combined with parallel computing.
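The chunking strategy can be illustrated with base R alone: the loop below processes a matrix in column blocks of `chunkSize`, mirroring what the `chunkSize`-aware functions do internally. The in-memory matrix here only keeps the sketch self-contained; in practice the input would be a file-backed `BEDMatrix::BEDMatrix`, and each chunk read would hit the disk.

```r
# Sketch of chunked column-wise processing; only one chunk is held in
# memory at a time.
chunked_col_means <- function(X, chunkSize = 5000L) {
    p <- ncol(X)
    res <- numeric(p)
    for (start in seq(1L, p, by = chunkSize)) {
        end <- min(start + chunkSize - 1L, p)
        chunk <- X[, start:end, drop = FALSE]  # load this block only
        res[start:end] <- colMeans(chunk)
    }
    res
}

X <- matrix(rnorm(100 * 12), nrow = 100)
all.equal(chunked_col_means(X, chunkSize = 5L), colMeans(X))  # TRUE
```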
Functions with the `nCores`, `i`, and `j` parameters provide capabilities for both parallel and distributed computing.

For parallel computing, `nCores` determines the number of cores the code is run on. Memory usage can be an issue for higher values of `nCores` as R is not particularly memory-efficient. As a rule of thumb, at least around `(nCores * object_size(chunk)) + object_size(result)` MB of total memory will be needed for operations on file-backed matrices, not including potential copies of your data that might be created (for example, `stats::lsfit()` runs `cbind(1, X)`). `i` and `j` can be used to include or exclude certain rows or columns. Internally, the `parallel::mclapply()` function is used, and therefore parallel computing will not work on Windows machines.
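A minimal sketch of how this style of parallelism works under the hood, using `parallel::mclapply()` directly (falling back to one core on Windows, where forking is unavailable). The index vector `i` below plays the role of the `i` parameter, restricting which rows are processed:

```r
library(parallel)

X <- matrix(rnorm(200 * 50), nrow = 200)
i <- seq_len(100)  # include only the first 100 rows (like the `i` parameter)
nCores <- if (.Platform$OS.type == "windows") 1L else 2L  # forking needs Unix

# Split the selected rows into one block per core; each block is a "chunk"
# that occupies memory simultaneously, hence the rule of thumb above.
blocks <- split(i, cut(seq_along(i), nCores, labels = FALSE))
res <- mclapply(blocks, function(idx) {
    rowSums(X[idx, , drop = FALSE])
}, mc.cores = nCores)
rowsums <- unlist(res, use.names = FALSE)

all.equal(rowsums, unname(rowSums(X[i, ])))  # TRUE
```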
For distributed computing, `i` and `j` determine the subset of the input matrix that the code runs on. In an HPC environment, this can be used not just to include or exclude certain rows or columns, but also to partition the task among many nodes rather than cores. Scheduler-specific code and code to aggregate the results need to be written by the user. It is recommended to set `nCores` to `1` as nodes are often cheaper than cores.
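A hedged sketch of this pattern under a SLURM-style array job (the environment variable names and the partitioning scheme are assumptions; adapt them to your scheduler). Each task computes its own column slice via `j` with one core, saves a partial result, and a separate user-written step aggregates the pieces:

```r
# Assumed scheduler interface: SLURM array jobs expose SLURM_ARRAY_TASK_ID.
taskId <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "1"))
nTasks <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_COUNT", unset = "4"))

X <- matrix(rnorm(100 * 40), nrow = 100)  # stand-in for a file-backed matrix

# Partition the columns among tasks; this `j` would be passed to the
# function call on this node, with nCores = 1.
parts <- split(seq_len(ncol(X)), cut(seq_len(ncol(X)), nTasks, labels = FALSE))
j <- parts[[taskId]]
partial <- colSums(X[, j, drop = FALSE])

# Each task saves its slice; user-written code aggregates the .rds files later.
saveRDS(partial, sprintf("colsums_part%02d.rds", taskId))
```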
The `extdata` folder contains example files that were generated from the 250k SNP and phenotype data in Atwell et al. (2010). Only the first 300 SNPs of chromosomes 1, 2, and 3 were included to keep the size of the example dataset small. PLINK was used to convert the data to `.bed` and `.raw` files. `FT10` was chosen as a phenotype and is provided as an alternate phenotype file. That file is intentionally shuffled to demonstrate that the additional phenotypes are put in the same order as the rest of the phenotypes.
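The PLINK conversion can be reproduced with commands along these lines (a sketch only: the input prefix and SNP-list file are hypothetical, and flag spellings differ slightly between PLINK 1.07 and 1.9):

```shell
# Assumed input: PLINK files with prefix "atwell"; snps.txt lists the
# first 300 SNPs of chromosomes 1-3 (both names are placeholders).
plink --bfile atwell --extract snps.txt --make-bed --out example
plink --bfile example --recode A --out example   # writes example.raw
```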
We have identified several approaches to tackle these challenges within R:

- File-backed matrices: The data is stored on the hard drive, and users can read in smaller chunks when they are needed.
- Linked arrays: For very large datasets, a single file-backed array may not be sufficient or convenient. A linked array is an array whose content is distributed over multiple file-backed nodes.
- Multiple dispatch: Methods are presented to users so that they can treat these arrays much as if they were in-memory (RAM) arrays.
- Multi-level parallelism: Exploit multi-core and multi-node computing.
- Inputs: Users can create these arrays from standard formats (e.g., PLINK `.bed`).
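The linked-array idea (implemented by the LinkedMatrix package) can be sketched in base R: keep a list of column-wise nodes and translate a requested global column index into a per-node read. This toy version holds its nodes in memory; in the package each node would typically be file-backed.

```r
# Toy column-linked array: two nodes that together form a 10 x 8 matrix.
nodes <- list(matrix(1:50, nrow = 10),   # columns 1-5
              matrix(51:80, nrow = 10))  # columns 6-8
bounds <- cumsum(vapply(nodes, ncol, integer(1)))  # last global column per node

# Map a global column index j to (node, local column) and fetch it.
get_col <- function(nodes, bounds, j) {
    node <- which(j <= bounds)[1]
    local <- j - c(0L, bounds)[node]
    nodes[[node]][, local]
}

get_col(nodes, bounds, 7)  # column 7 lives in node 2, local column 2 -> 61:70
```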
The BGData package is an umbrella package that comprises several packages: BEDMatrix, LinkedMatrix, and symDMatrix. It features scalable and efficient computational methods for large genomic datasets, such as genome-wide association studies (GWAS) or genomic relationship matrices (G matrix). It also contains a data structure called `BGData` that holds genotypes in the `@geno` slot, phenotypes in the `@pheno` slot, and additional information in the `@map` slot.
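The slot layout can be illustrated with a minimal S4 analogue (a sketch, not the package's actual class definition: the real `BGData` class lives in the BGData package and supports file-backed genotypes):

```r
library(methods)

# Hypothetical stand-in mirroring the @geno / @pheno / @map slot layout.
setClass("BGDataSketch", slots = c(
    geno  = "matrix",      # genotypes (file-backed in the real package)
    pheno = "data.frame",  # one row per sample
    map   = "data.frame"   # one row per variant
))

bg <- new("BGDataSketch",
    geno  = matrix(sample(0:2, 20, replace = TRUE), nrow = 4,
                   dimnames = list(paste0("id", 1:4), paste0("snp", 1:5))),
    pheno = data.frame(FT10 = rnorm(4), row.names = paste0("id", 1:4)),
    map   = data.frame(chromosome = 1L, snp = paste0("snp", 1:5))
)

dim(bg@geno)    # 4 5: samples by variants
nrow(bg@pheno)  # 4: one phenotype row per sample
```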
See `BEDMatrix::BEDMatrix-package`, `LinkedMatrix::LinkedMatrix-package`, and `symDMatrix::symDMatrix-package` for an introduction to the respective packages.