Computes a positive semi-definite, symmetric genomic relationship matrix G=XX',
with options for centering and scaling the columns of X beforehand.
getG_symDMatrix(X, center = TRUE, scale = TRUE, scaleG = TRUE,
minVar = 1e-05, blockSize = 5000L,
folderOut = paste0("symDMatrix_", randomString()), vmode = "double",
i = seq_len(nrow(X)), j = seq_len(ncol(X)), chunkSize = 5000L,
nCores = getOption("mc.cores", 2L), verbose = FALSE)
center: Either a logical value or a numeric vector of length equal to the number of columns of X. If FALSE, no centering is done. Defaults to TRUE.
scale: Either a logical value or a numeric vector of length equal to the number of columns of X. If FALSE, no scaling is done. Defaults to TRUE.
scaleG: Logical (TRUE or FALSE). Whether XX' must be scaled. Defaults to TRUE.
minVar: Columns with variance lower than this value will not be used in the computation (only if scale is not FALSE). Defaults to 1e-05.
blockSize: The number of rows and columns of each block. If NULL, a single block of the same length as i will be created. Defaults to 5000.
folderOut: The path to the folder in which to save the symDMatrix::symDMatrix object. Defaults to a random string prefixed with "symDMatrix_".
vmode: The vmode of the ff objects. Defaults to "double".
i: Indicates which rows of X should be used. Can be an integer, logical, or character vector. By default, all rows are used.
j: Indicates which columns of X should be used. Can be an integer, logical, or character vector. By default, all columns are used.
chunkSize: The number of columns of X that are brought into physical memory for processing per core. If NULL, all columns of X are used. Defaults to 5000.
nCores: The number of cores (passed to parallel::mclapply()). Defaults to the value of the mc.cores option, or 2 if that option is not set.
verbose: Whether progress updates will be posted. Defaults to FALSE.
A symDMatrix::symDMatrix object.
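The computation behind G=XX' can be illustrated on a small in-memory matrix. The following is a minimal base-R sketch, not the package's implementation: the toy matrix is made up, and dividing by the number of retained columns is an assumption about what scaleG does.

```r
# Toy genotype matrix: 6 individuals (rows) by 4 markers (columns).
X <- matrix(c(0, 1, 2, 1, 0, 2,
              1, 1, 0, 2, 2, 0,
              2, 0, 1, 1, 0, 1,
              0, 0, 1, 2, 1, 2), nrow = 6, ncol = 4)

# Drop columns whose variance is below minVar.
minVar <- 1e-05
keep <- apply(X, 2, var) >= minVar

# Center and scale the remaining columns (center = TRUE, scale = TRUE).
Z <- scale(X[, keep, drop = FALSE], center = TRUE, scale = TRUE)

# G = ZZ'. Assumption: scaleG = TRUE divides by the number of columns used.
G <- tcrossprod(Z) / ncol(Z)
```

The result is a 6 x 6 symmetric matrix with non-negative eigenvalues, one row and column per individual.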
Functions with the nCores, i, and j parameters provide capabilities for both parallel and distributed computing.

For parallel computing, nCores determines the number of cores the code is run on. Memory usage can be an issue for higher values of nCores as R is not particularly memory-efficient. As a rule of thumb, at least around (nCores * object_size(chunk)) + object_size(result) MB of total memory will be needed for operations on file-backed matrices, not including potential copies of your data that might be created (for example stats::lsfit() runs cbind(1, X)). i and j can be used to include or exclude certain rows or columns. Internally, the parallel::mclapply() function is used and therefore parallel computing will not work on Windows machines.
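The rule of thumb can be worked through with concrete numbers. This sketch assumes a hypothetical run with 10,000 rows, double-precision values (8 bytes each), and a square result with one row and column per individual; none of these figures come from the package.

```r
# Hypothetical sizes, chosen only for illustration.
n         <- 10000   # number of rows of X in use (length of i)
chunkSize <- 5000    # columns held in memory per core
nCores    <- 2
bytes     <- 8       # size of one double-precision value

# Size of one chunk (n x chunkSize) and of the result (n x n), in MB.
chunk_mb  <- n * chunkSize * bytes / 1024^2
result_mb <- n * n * bytes / 1024^2

# Rule of thumb: (nCores * object_size(chunk)) + object_size(result)
total_mb <- nCores * chunk_mb + result_mb
```

With these numbers the estimate comes to roughly 1.5 GB, before counting any copies R makes along the way.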
For distributed computing, i and j determine the subset of the input matrix that the code runs on. In an HPC environment, this can be used not just to include or exclude certain rows or columns, but also to partition the task among many nodes rather than cores. Scheduler-specific code and code to aggregate the results need to be written by the user. It is recommended to set nCores to 1 as nodes are often cheaper than cores.
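One way to prepare such a partition is to split the row indices into contiguous sets, one per node. The sketch below uses base R only; the variable names (nNodes, row_sets) and the sizes are hypothetical, and the scheduler and aggregation code are left out, as they depend on the environment.

```r
# Hypothetical setup: 23 rows to be spread over at most 4 nodes.
n      <- 23L
nNodes <- 4L

# Contiguous index sets of (nearly) equal size, at most nNodes of them.
setSize  <- ceiling(n / nNodes)
row_sets <- split(seq_len(n), ceiling(seq_len(n) / setSize))

# Node k would then pass i = row_sets[[k]] (or a column set as j)
# and run with nCores = 1.
lengths(row_sets)
```

Each index appears in exactly one set, so the per-node results cover the full input without overlap.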
Even very large genomic relationship matrices are supported by partitioning X into blocks and calling getG() on these blocks. This function performs the block computations sequentially, which may be slow. In an HPC environment, performance can be improved by manually distributing these operations to different nodes.
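The block-wise idea can be sketched in base R on an in-memory matrix. This is an illustration under simplifying assumptions, not the package's code: tcrossprod() stands in for getG(), X is treated as already centered and scaled, and the blocks are kept in an ordinary matrix rather than in file-backed storage.

```r
# Toy matrix: 8 rows, 5 columns (assumed already centered and scaled).
X <- matrix(as.numeric(1:40) %% 7, nrow = 8, ncol = 5)
n <- nrow(X)

# Partition the rows into blocks of at most blockSize rows.
blockSize <- 3L
blocks <- split(seq_len(n), ceiling(seq_len(n) / blockSize))

# Compute only the upper-triangular blocks and mirror them, one at a time.
G <- matrix(NA_real_, n, n)
for (r in seq_along(blocks)) {
  for (s in r:length(blocks)) {
    rows_r <- blocks[[r]]
    rows_s <- blocks[[s]]
    G[rows_r, rows_s] <- tcrossprod(X[rows_r, , drop = FALSE],
                                    X[rows_s, , drop = FALSE])
    G[rows_s, rows_r] <- t(G[rows_r, rows_s, drop = FALSE])
  }
}
```

Because each block depends only on two sets of rows of X, the iterations of the inner loop are independent of one another, which is what makes distributing them across nodes possible.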