BGData (version 2.1.0)

GWAS: Performs Single Marker Regressions Using BGData Objects.

Description

Implements single marker regressions. The regression model includes all the covariates specified on the right-hand side of the formula plus one column of @geno at a time. The data for the association tests are obtained from a BGData object.
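To make the model concrete, here is a minimal base-R sketch of a single marker regression on simulated data. It is not the package's implementation: the names geno, pheno, and fitMarker are hypothetical, and lm() stands in for whichever method is selected. Each marker is added to the covariate model in turn, and the marker's row of coef(summary(fit)) is kept.

```r
# Base-R sketch of a single marker regression (simulated data, not BGData).
set.seed(1)
n <- 100                                   # individuals
p <- 5                                     # markers
geno <- matrix(rbinom(n * p, size = 2, prob = 0.3), nrow = n,
               dimnames = list(NULL, paste0("snp", seq_len(p))))
pheno <- data.frame(y = rnorm(n), age = rnorm(n, mean = 50, sd = 10))

# Fit y ~ age + marker for one marker and return the marker's coefficient row
# (estimate, standard error, t value, p-value).
fitMarker <- function(k) {
    d <- cbind(pheno, marker = geno[, k])
    fit <- lm(y ~ age + marker, data = d)
    coef(summary(fit))["marker", ]
}

# One row per marker, one column per statistic.
res <- t(vapply(seq_len(p), fitMarker, numeric(4)))
rownames(res) <- colnames(geno)
res
```

The covariate part of the model (here y ~ age) corresponds to the formula argument; the marker column is the part that changes from one fit to the next.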

Usage

GWAS(formula, data, method = "lsfit", i = seq_len(nrow(data@geno)),
  j = seq_len(ncol(data@geno)), chunkSize = 5000L,
  nCores = getOption("mc.cores", 2L), verbose = FALSE, ...)

Arguments

formula

The formula for the GWAS model, excluding the marker, e.g. y ~ 1 or y ~ factor(sex) + age. The variables in the formula must be in the @pheno object of the BGData object.

data

A BGData object.

method

The regression method to be used. Currently, the following methods are implemented: rayOLS, stats::lsfit(), stats::lm(), stats::lm.fit(), stats::glm(), lme4::lmer(), and SKAT::SKAT(). Defaults to lsfit.

i

Indicates which rows of @geno should be used. Can be integer, boolean, or character. By default, all rows are used.

j

Indicates which columns of @geno should be used. Can be integer, boolean, or character. By default, all columns are used.

chunkSize

The number of columns of @geno that are brought into physical memory for processing per core. If NULL, all elements in j are used. Defaults to 5000.

nCores

The number of cores (passed to parallel::mclapply()). Defaults to getOption("mc.cores", 2L), that is, the value of the mc.cores option, or 2 if that option is not set.

verbose

Whether progress updates will be posted. Defaults to FALSE.

...

Additional arguments passed to chunkedApply and the regression method.

Value

The same matrix that would be returned by coef(summary(model)).

File-backed matrices

Functions with the chunkSize parameter work best with file-backed matrices such as BEDMatrix::BEDMatrix objects. To avoid loading the whole, potentially very large matrix into memory, these functions will load chunks of the file-backed matrix into memory and perform the operations on one chunk at a time. The size of the chunks is determined by the chunkSize parameter. Care must be taken to not set chunkSize too high to avoid memory shortage, particularly when combined with parallel computing.
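The chunking idea can be sketched in base R as follows. This is only an illustration under stated assumptions: an ordinary in-memory matrix X stands in for a file-backed matrix, and the variable names and the per-column function (sd) are hypothetical, not the package's internals.

```r
# Base-R sketch of chunked column processing (an ordinary in-memory matrix
# stands in for a file-backed one such as a BEDMatrix object).
set.seed(2)
X <- matrix(rnorm(20 * 23), nrow = 20)      # 20 rows, 23 columns
j <- seq_len(ncol(X))                       # columns to process
chunkSize <- 5L                             # columns loaded per chunk

# Split the column indices into consecutive chunks of at most chunkSize.
chunks <- split(j, ceiling(seq_along(j) / chunkSize))

# Extract one chunk at a time and apply a per-column function to it.
colSd <- unlist(lapply(chunks, function(cols) {
    chunk <- X[, cols, drop = FALSE]        # only this subset is materialized
    apply(chunk, 2, sd)
}), use.names = FALSE)
```

With a file-backed matrix, the subsetting step is where data would be read from disk, so peak memory depends on chunkSize rather than on the total number of columns.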

Multi-level parallelism

Functions with the nCores, i, and j parameters provide capabilities for both parallel and distributed computing.

For parallel computing, nCores determines the number of cores the code is run on. Memory usage can be an issue for higher values of nCores as R is not particularly memory-efficient. As a rule of thumb, at least around (nCores * object_size(chunk)) + object_size(result) MB of total memory will be needed for operations on file-backed matrices, not including potential copies of your data that might be created (for example stats::lsfit() runs cbind(1, X)). i and j can be used to include or exclude certain rows or columns. Internally, the parallel::mclapply() function is used and therefore parallel computing will not work on Windows machines.
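As a worked instance of the rule of thumb above, the following sketch plugs in hypothetical sizes. It assumes that each chunk is held as a matrix of doubles (8 bytes per value) and that the result has four numeric columns per marker; integer storage or a different result shape would change the numbers.

```r
# Rough memory estimate for a chunked, parallel run (all sizes hypothetical).
nRows <- 1000L        # rows of @geno selected by i
chunkSize <- 5000L    # columns per chunk
nCores <- 4L          # chunks held in memory at the same time
nMarkers <- 500000L   # columns selected by j
bytesPerValue <- 8    # assumption: values stored as doubles

chunkBytes <- nRows * chunkSize * bytesPerValue
resultBytes <- nMarkers * 4 * bytesPerValue   # assumption: four result columns

# (nCores * object_size(chunk)) + object_size(result), converted to MB
estimateMB <- (nCores * chunkBytes + resultBytes) / 1024^2
estimateMB
```

This is a lower bound: copies made by the regression method, such as the cbind(1, X) mentioned above, come on top of it.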

For distributed computing, i and j determine the subset of the input matrix that the code runs on. In an HPC environment, this can be used not just to include or exclude certain rows or columns, but also to partition the task among many nodes rather than cores. Scheduler-specific code and code to aggregate the results need to be written by the user. It is recommended to set nCores to 1 as nodes are often cheaper than cores.
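One way to partition markers across nodes is sketched below. It is an assumption-laden illustration, not a prescribed workflow: it supposes that the scheduler exposes a 1-based task index in the SLURM_ARRAY_TASK_ID environment variable, and nMarkers, nNodes, and the output file name are hypothetical. The GWAS() call is left as a comment so that the sketch runs without the package.

```r
# Sketch of partitioning markers across nodes (base R; hypothetical setup).
nMarkers <- 1000L     # stands in for ncol(bg@geno)
nNodes <- 8L          # hypothetical number of array tasks

# Assumption: the scheduler provides a 1-based task index; fall back to 1
# when the variable is not set (e.g. in an interactive session).
taskId <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "1"))

# Assign each marker index to one of nNodes contiguous, nearly equal blocks.
idx <- seq_len(nMarkers)
blocks <- split(idx, ceiling(idx * nNodes / nMarkers))
jNode <- blocks[[taskId]]

# Each node would then process its own block and save the result for later
# aggregation (hypothetical file name):
# res <- GWAS(formula = FT10 ~ 1, data = bg, j = jNode, nCores = 1)
# saveRDS(res, sprintf("gwas_part_%03d.rds", taskId))
```

Aggregation could then be as simple as reading the saved parts in task order and combining them with rbind(), provided every node used the same formula and method.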

Examples

# NOT RUN {
# Restrict number of cores to 1 on Windows
if (.Platform$OS.type == "windows") {
    options(mc.cores = 1)
}

# Load example data
bg <- BGData:::loadExample()

# Perform a single marker regression
res1 <- GWAS(formula = FT10 ~ 1, data = bg)

# Draw a Manhattan plot
plot(-log10(res1[, 4]))

# Use lm instead of lsfit (the default)
res2 <- GWAS(formula = FT10 ~ 1, data = bg, method = "lm")

# Use glm instead of lsfit (the default)
y <- bg@pheno$FT10
bg@pheno$FT10.01 <- y > quantile(y, 0.8, na.rm = TRUE)
res3 <- GWAS(formula = FT10.01 ~ 1, data = bg, method = "glm")

# Perform a single marker regression on the first 50 markers (useful for
# distributed computing)
res4 <- GWAS(formula = FT10 ~ 1, data = bg, j = 1:50)
# }
