relationshipMatrix: Fast calculation of the Genomic Relationship Matrix and its derivatives

Description

relationshipMatrix calculates the relationship matrix $A=(M-P)^T (M-P) /\sigma^2$ from the SNP matrix $M$ where $P = p (1,\ldots,1)$ with $p = M \%*\% (1,\dots,1)^T / n$. Furthermore, $sigma^2$ equals $\sigma^2 = p^T (1 - p/2)\in[0,\infty)$.

crossprodx calculates the cross-product of SNPxIndiv, i.e. it is identical to call relationshipMatrix with optional argument, centered=FALSE, cf. RFoptions

allele_freq calculates $p/2$.

SNPeffect calculates $M (A + \tau I)^{-1} v$

solveRelMat calculates $$(A + \tau I)^{-1} v$$ and $$A(A + \tau I)^{-1} v + \beta$$ where $A$ is the relationship matrix.

Usage

relationshipMatrix(SNPxIndiv, ...)
crossprodx(SNPxIndiv) 
solveRelMat(A, tau, vec, betahat=NULL, destroy_A=FALSE)
SNPeffect(SNPxIndiv, vec, centered=TRUE, tau=0)
allele_freq(SNPxIndiv)

Arguments

SNPxIndiv

$\{0,1\,2\}$-valued (snps $\times$ indiv) matrix or the result of genomicmatrix.

...

see RFoptions -- better use RFoptions. The main two options are:

centered: see below

normalized:logical. if FALSE then the division by $sigma^2$ is not performed

centered

if FALSE then $P$ is not substracted.

a symmetric, positive definite matrix, which is a relationship matrix

tau

non-negative scalar

vec

the vector $v$

betahat

scalar or NULL. See also section value.

destroy_A

logical. If TRUE the values of the matrix A will be overwritten during the calculations (leading to a faster execution with less memory needs).

Value

relationsshipMatrix returns a (Indiv $\times$ Indiv) numerical matrix.

The return value of solveRelMat depends on betahat. If the latter is NULL, only the vector $(A + \tau I)^{-1} v$ is returned. Else, a list of 2 elements is returned. First element equals the vector $$(A + \tau I)^{-1} v,$$ the second element equals $$A(A + \tau I)^{-1} v + \beta.$$

Benchmarks

Computing times for the relationship matrix in comparison to 'crossprod' in standard implementation on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz, R version 3.6.0 (Linux) with indiv = 1000 and snps = 5e5 are:

Shuffle256 : 48 x faster (AVX2; 16x compressed) Packed256 : 36 x faster (AVX2; 16x compressed) Shuffle : 35 x faster (SSSE3; 16x compressed) Multiply256 : 29 x faster (AVX2; 16x compressed) Packed : 28 x faster (SSE2; 16x compressed) Hamming2 : 24 x faster (SSE2; 4x compressed) Hamming3 : 21 x faster (SSSE3; 4x compressed) Multiply : 17 x faster (SSE2; 16x compressed) ThreeBit : 17 x faster (uint64_t; 10x compressed) TwoBit : 15 x faster (uint64_t; 16x compressed) NoSNPcoding : 4 x faster (int, AVX2; not compressed) NoSNPcodingAVX: 2 x faster (double, AVX; not compressed) NoSNPcodingR : calls crossprod

In parantheses, first the instruction set or s the main data type is given, then the data compression with respect to 32 bit integer.

The following code was used:

RFoptions(cores = 1)
indiv <- 1000
snps <- 5e5 ## may cause memory allocation problems in R; better use 5e4 !!
methods <- c(NoSNPcodingR, NoSNPcodingAVX,
             FirstGenuineMethod:LastGenuineMethod)
M <- matrix(ncol=indiv, sample(0:2, indiv * snps, replace=TRUE))
for (storageMode in c("integer", "double")){
    storage.mode(M) <- storageMode
  cat("\n\n")
  print(S <- system.time(C <- crossprod(M)))
  p <- rowMeans(M)
  P <- p %*% t(rep(1, indiv))
  sigma2 <- sum(p * (1- p/2))
  A <- crossprod(M-P) / sigma2
  print(S <- system.time(C <- crossprod(M)))  
  for (method in methods) {
    RFoptions(snpcoding = method)
    cat("\nstorage=", storageMode, "  method=", SNPCODING_NAMES[method + 1],
    "\n")
    S0 <- system.time(G <- genomicmatrix(M))
    print(S1 <- system.time(C1 <- crossprodx(M)))
    print(S2 <- system.time(C2 <- crossprodx(G)))
    stopifnot(all(C == C1))
    stopifnot(all(C == C2))
    R1 <- S / S1 
    R2 <- S / S2 
    print(0.5 * (R1 + R2))
    print(S3 <- system.time(C3 <- relationshipMatrix(M)))
    print(S4 <- system.time(C4 <- relationshipMatrix(G)))
    R3 <- S / S3
    R4 <- S / S4 
    print(0.5 * (R3 + R4))
    stopifnot(all.equal(as.double(A), as.double(C3)))
    stopifnot(all.equal(as.double(A), as.double(C4)))
    gc()
  }
}

Details

Let $p = M \%*\% (1,\dots,1)^T / n$ where $n$ is the number of individuals. Then, the matrix $P$ equals $P = p (1,\ldots,1)$.

The constant $sigma^2$ equals $\sigma^2 = p^T (1 - p/2)$.

solveRelMat has a speed and memory advantage in comparison to the direct implementation of the above formulae.

Examples

Run this code

# NOT RUN {
 
# }
# NOT RUN {
<!-- %   library(miraculix) -->
# }
# NOT RUN {
require(RandomFieldsUtils)
set.seed(0)
snpcodes <- c(TwoBit, ThreeBit)
if (has.instruction.set("SSE2")) snpcodes <- c(snpcodes, Hamming2)
if (has.instruction.set("SSSE3")) snpcodes <- c(snpcodes, Hamming3, Shuffle)
if (has.instruction.set("AVX2")) snpcodes <- c(snpcodes, Shuffle256)
   
Print(snpcodes)

indiv <- 1 + sample(100:500, 1)
snps <- indiv * 2^sample(1:if (interactive()) 7 else 5, 1)
RFoptions(snpcoding=sample(snpcodes, 1))
M <- matrix(ncol=indiv, sample(0:2, indiv * snps, replace=TRUE))
print(system.time(G <- genomicmatrix(M)))
print(G)  

## crossprodx vs crossprod: about 10x faster
Print(system.time(C <- crossprodx(M)))   
print(system.time(C2 <- crossprod(M)))
stopifnot(all(C == C2))

## allele_freq vs rowMeans: about equally fast
Print(system.time(af <- allele_freq(M)))
print(system.time(alleleFreq <- 0.5 * rowMeans(M)))
stopifnot(all.equal(as.double(alleleFreq), as.double(af)))

## relationshipMatrix vs crossprod: about 10x faster
Print(system.time(R <- relationshipMatrix(M)))
print(system.time(R <- relationshipMatrix(G)))
print(system.time({
  sigma2 <- 2 * sum(alleleFreq * (1 - alleleFreq))
  R2 <- crossprod(M - 2 * alleleFreq) / sigma2
}))
stopifnot(all.equal(as.double(R), as.double(R2)))


### solveRelMat vs. solve: about equally fast
tau <- 0.0001
vec <- runif(indiv)
beta <- runif(1)
Print(system.time(S <- solveRelMat(R, tau=tau, vec=vec, betahat=beta)))
print(system.time({r <- solve(R + diag(indiv) * tau, vec);
                   y <- as.vector(R %*% r + beta)}))
stopifnot(all.equal(S$rest, r))
stopifnot(all.equal(S$yhat, y))

# }

Run the code above in your browser using DataLab