Given matrices ymat, xmat, and zmat,
pulverize evaluates every linear regression
$$y = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 xz + \epsilon$$
where \(y\), \(x\), and \(z\) are columns of ymat,
xmat, and zmat, respectively, and returns the p-value
for the null hypothesis \(\beta_3=0\). For example, if
ymat, xmat, and zmat have 200, 300, and 400
columns, then pulverize evaluates 24 million models
(\(200\times 300\times 400\)) and returns the
p-value for the interaction term for each model.
pulverize(ymat, xmat, zmat, output_file = NULL, colnames = c("y", "x", "z"),
pvalue_threshold = NULL, cores = 1L, overwrite = FALSE,
suppress_return = FALSE, .fp = NULL)Matrices of type "double", number of rows for all matrices are the number of observations and must have the same size
Output file if NULL no output file will be created
(default: NULL)
Column names for result table (default: "y", "x", and "z")
Report only p-values below threshold
(default: NULL, i.e., report all p-values)
Number of cores to use for parallelization (default: 1)
If TRUE overwrite output_file (default:
FALSE)
If TRUE return NULL instead of a data
frame with p-values for the interaction term (default: FALSE)
File pointer, only for internal use. Necessary for the function
pulverize_all to join results from different input files
to the same output file (dafault: NULL)
If suppress_return is FALSE, a data frame with
columns "y", "x", "z", and "pvalue" containing the p-values for
the interaction term \(xz\) of the above linear model is returned.
Otherwise NULL is returned.
For reasons of computational efficiency, pulverize returns
only p-values and only for the interaction term \(xz\).
Fast run time is achieved by using the correlation coefficient between
the outcome variable \(y\) and the interaction term \(xz\) to test
the null-hypothesis, which avoids the costly computation of inversions.
Additional employed time-saving operations are a rearrangement of the order when iterating through
the different matrices, and implementing the core algorithm in the fast
programming language C++.
Once interesting models are identified based on the resulting p-values, the number of
models will be greatly reduced. At this point additional model
characteristics, e.g. effect estimates and standard errors, can be
obtained via traditional methods such as R's
lm function.
Matrices ymat, xmat, and zmat must
have column names. Missing values are imputed using their
column means. If pvalue_threshold is supplied, only
p-values (p < pvalue_threshold) strictly below the
threshold are included in the returned data frame and saved
in output_file.
In cases where the resulting data frame would be too large to fit
in memory, it is possible to write the results to
output_file without returning a data frame by setting
suppress_return to TRUE.
The column names of the results table are by default set to "y", "x", and
"z", but can be changed using the colnames argument.
An error will be signaled if output_file already exists.
Setting overwrite to TRUE will silently overwrite the
file.
The computation can be parallelized by specifying a number of
cores greater than 1. By default only a single CPU is
used. Note that parallelization is only supported in
environments with C/C++ compilers that support OpenMP.
Shabalin, Andrey A (2012) Bioinformatics: Matrix eQTL: ultra fast eQTL analysis via large matrix operations, Oxford Univ Press, *28*, 1353-1358
# NOT RUN {
nobs <- 100
y <- matrix(rnorm(nobs * 2), ncol = 2, dimnames = list(paste0("row", 1:nobs),
paste0("column", 1:2)))
x <- matrix(rnorm(nobs * 3), ncol = 3, dimnames = list(paste0("row", 1:nobs),
paste0("column", 1:3)))
z <- matrix(rnorm(nobs * 4), ncol = 4, dimnames = list(paste0("row", 1:nobs),
paste0("column", 1:4)))
pulverize(y, x, z)
# }
Run the code above in your browser using DataLab