BGData (version 2.1.0)

summarize: Generates Various Summary Statistics.

Description

Computes the frequency of missing values, the (minor) allele frequency, and the standard deviation of each column of X.

Usage

summarize(X, i = seq_len(nrow(X)), j = seq_len(ncol(X)),
  chunkSize = 5000L, nCores = getOption("mc.cores", 2L),
  verbose = FALSE)

Arguments

X

A matrix-like object, typically @geno of a BGData object.

i

Indicates which rows of X should be used. Can be an integer, logical, or character vector. By default, all rows are used.

j

Indicates which columns of X should be used. Can be an integer, logical, or character vector. By default, all columns are used.

chunkSize

The number of columns of X that are brought into physical memory for processing per core. If NULL, all columns indexed by j are processed in a single chunk. Defaults to 5000.

nCores

The number of cores (passed to parallel::mclapply()). Defaults to getOption("mc.cores", 2L), i.e., the value of the mc.cores option, or 2 if that option is not set.

verbose

Whether progress updates are printed. Defaults to FALSE.

Value

A data.frame with three columns: freq_na for frequencies of missing values, allele_freq for (minor) allele frequencies, and sd for standard deviations.
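As a minimal sketch of working with the returned data.frame (res is assumed here to be the output of a summarize() call, and the assumption is that its row names carry the column names of X):

# Minimal sketch; res is assumed to come from a summarize() call
# Markers with a minor allele frequency below 5%
rare <- rownames(res)[!is.na(res$allele_freq) & res$allele_freq < 0.05]
# Markers with more than 10% missing values
high_missing <- rownames(res)[res$freq_na > 0.1]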

File-backed matrices

Functions with the chunkSize parameter work best with file-backed matrices such as BEDMatrix::BEDMatrix objects. To avoid loading the whole, potentially very large matrix into memory, these functions load chunks of the file-backed matrix into memory and perform the operations on one chunk at a time. The size of the chunks is determined by the chunkSize parameter. Care must be taken not to set chunkSize too high, to avoid running out of memory, particularly when combined with parallel computing.
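A minimal sketch of this pattern, assuming a PLINK .bed file exists at "data/chr1.bed" (the path is an assumption for illustration):

library(BGData)
library(BEDMatrix)

# Map the genotypes as a file-backed matrix; no data is read into memory yet
geno <- BEDMatrix("data/chr1.bed")

# Bring at most 1000 columns per core into memory at a time
res <- summarize(X = geno, chunkSize = 1000L, nCores = 2L)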

Multi-level parallelism

Functions with the nCores, i, and j parameters provide capabilities for both parallel and distributed computing.

For parallel computing, nCores determines the number of cores the code is run on. Memory usage can be an issue for higher values of nCores as R is not particularly memory-efficient. As a rule of thumb, at least around (nCores * object_size(chunk)) + object_size(result) MB of total memory will be needed for operations on file-backed matrices, not including potential copies of your data that might be created (for example stats::lsfit() runs cbind(1, X)). i and j can be used to include or exclude certain rows or columns. Internally, the parallel::mclapply() function is used and therefore parallel computing will not work on Windows machines.
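To make the rule of thumb concrete, the in-memory size of a chunk can be estimated before choosing nCores and chunkSize. The sketch below reuses the geno object from the sketch above and assumes chunks are materialized as double-precision values (8 bytes per element); if chunks are stored as integers, halve the estimate:

# Back-of-the-envelope estimate of memory needed for the chunks alone
n <- nrow(geno)                          # all rows of each chunk are loaded
chunkSize <- 5000L
nCores <- 2L
chunk_mb <- n * chunkSize * 8 / 1024^2   # 8 bytes per double-precision element
cat(sprintf("~%.0f MB for the chunks across %d cores\n", nCores * chunk_mb, nCores))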

For distributed computing, i and j determine the subset of the input matrix that the code runs on. In an HPC environment, this can be used not just to include or exclude certain rows or columns, but also to partition the task among many nodes rather than cores. Scheduler-specific code and code to aggregate the results need to be written by the user. It is recommended to set nCores to 1 as nodes are often cheaper than cores.
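One possible partitioning pattern for an array job is sketched below; SLURM_ARRAY_TASK_ID and the per-task column width are assumptions specific to a Slurm setup, and other schedulers expose similar task indices. geno is the file-backed matrix from the earlier sketch.

# Hypothetical Slurm array-job script: each task summarizes one slice of columns
task <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID"))
width <- 10000L                                   # columns per task (assumption)
j <- seq((task - 1L) * width + 1L, min(task * width, ncol(geno)))
res <- summarize(X = geno, j = j, nCores = 1L)    # one core per node
saveRDS(res, sprintf("summary_%04d.rds", task))

# Later, on a single machine, stack the per-task results in column order:
# files <- sort(list.files(pattern = "^summary_.*\\.rds$"))
# full <- do.call(rbind, lapply(files, readRDS))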

Examples

# Restrict number of cores to 1 on Windows
if (.Platform$OS.type == "windows") {
    options(mc.cores = 1)
}

# Load example data
bg <- BGData:::loadExample()

# Summarize the whole dataset
sum1 <- summarize(X = bg@geno)

# Summarize the first 50 individuals
sum2 <- summarize(X = bg@geno, i = 1:50)

# Summarize the first 100 markers (useful for distributed computing)
sum3 <- summarize(X = bg@geno, j = 1:100)

# Summarize the first 50 individuals on the first 100 markers
sum4 <- summarize(X = bg@geno, i = 1:50, j = 1:100)

# Summarize by names
sum5 <- summarize(X = bg@geno, j = c("snp81233_C", "snp81234_C", "snp81235_T"))