pulverize: Computes p-values for interaction terms from linear regressions

Description

Given matrices ymat, xmat, and zmat, pulverize evaluates every linear regression $$y = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 xz + \epsilon$$ where $y$, $x$, and $z$ are columns of ymat, xmat, and zmat, respectively, and returns the p-value for the null hypothesis $\beta_3=0$. For example, if ymat, xmat, and zmat have 200, 300, and 400 columns, then pulverize evaluates 24 million models ($200\times 300\times 400$) and returns the p-value for the interaction term for each model.

Usage

pulverize(ymat, xmat, zmat, output_file = NULL, colnames = c("y", "x", "z"),
  pvalue_threshold = NULL, cores = 1L, overwrite = FALSE,
  suppress_return = FALSE, .fp = NULL)

Arguments

ymat, xmat, zmat

Matrices of type "double", number of rows for all matrices are the number of observations and must have the same size

output_file

Output file if NULL no output file will be created (default: NULL)

colnames

Column names for result table (default: "y", "x", and "z")

pvalue_threshold

Report only p-values below threshold (default: NULL, i.e., report all p-values)

cores

Number of cores to use for parallelization (default: 1)

overwrite

If TRUE overwrite output_file (default: FALSE)

suppress_return

If TRUE return NULL instead of a data frame with p-values for the interaction term (default: FALSE)

.fp

File pointer, only for internal use. Necessary for the function pulverize_all to join results from different input files to the same output file (dafault: NULL)

Value

If suppress_return is FALSE, a data frame with columns "y", "x", "z", and "pvalue" containing the p-values for the interaction term $xz$ of the above linear model is returned. Otherwise NULL is returned.

Details

For reasons of computational efficiency, pulverize returns only p-values and only for the interaction term $xz$. Fast run time is achieved by using the correlation coefficient between the outcome variable $y$ and the interaction term $xz$ to test the null-hypothesis, which avoids the costly computation of inversions. Additional employed time-saving operations are a rearrangement of the order when iterating through the different matrices, and implementing the core algorithm in the fast programming language C++.

Once interesting models are identified based on the resulting p-values, the number of models will be greatly reduced. At this point additional model characteristics, e.g. effect estimates and standard errors, can be obtained via traditional methods such as R's lm function.

Matrices ymat, xmat, and zmat must have column names. Missing values are imputed using their column means. If pvalue_threshold is supplied, only p-values (p < pvalue_threshold) strictly below the threshold are included in the returned data frame and saved in output_file.

In cases where the resulting data frame would be too large to fit in memory, it is possible to write the results to output_file without returning a data frame by setting suppress_return to TRUE.

The column names of the results table are by default set to "y", "x", and "z", but can be changed using the colnames argument.

An error will be signaled if output_file already exists. Setting overwrite to TRUE will silently overwrite the file.

The computation can be parallelized by specifying a number of cores greater than 1. By default only a single CPU is used. Note that parallelization is only supported in environments with C/C++ compilers that support OpenMP.

References

Shabalin, Andrey A (2012) Bioinformatics: Matrix eQTL: ultra fast eQTL analysis via large matrix operations, Oxford Univ Press, *28*, 1353-1358

Examples

Run this code

# NOT RUN {
nobs <- 100
y <- matrix(rnorm(nobs * 2), ncol = 2, dimnames = list(paste0("row", 1:nobs),
paste0("column", 1:2)))
x <- matrix(rnorm(nobs * 3), ncol = 3, dimnames = list(paste0("row", 1:nobs),
paste0("column", 1:3)))
z <- matrix(rnorm(nobs * 4), ncol = 4, dimnames = list(paste0("row", 1:nobs),
paste0("column", 1:4)))
pulverize(y, x, z)

# }

Run the code above in your browser using DataLab