tgstat

The goal of tgstat is to provide fast and efficient implementations of certain R functions, such as ‘cor’ and ‘dist’, along with specific statistical tools.

Several approaches are used to boost performance, including multi-processing and the use of optimized functions provided by the Basic Linear Algebra Subprograms (BLAS) library.

Installation

To install tgstat:

install.packages("tgstat")

Examples

set.seed(seed = 1)
rows <- 3000
cols <- 3000
vals <- sample(1:(rows * cols / 2), rows * cols, replace = TRUE)
m <- matrix(vals, nrow = rows, ncol = cols)
m_with_NAs <- m
m_with_NAs[sample(1:(rows * cols), rows * cols / 10)] <- NA
dim(m)
#> [1] 3000 3000

Fast computation of correlation matrices

Pearson correlation without BLAS, no NAs:

options(tgs_use.blas = FALSE)
system.time(tgs_cor(m))
#>    user  system elapsed 
#>  19.961   1.332   0.913

Same with BLAS:

# tgs_cor, with BLAS, no NAs, pearson
options(tgs_use.blas = TRUE)
system.time(tgs_cor(m))
#>    user  system elapsed 
#>   2.021   0.312   0.400

Base R version:

system.time(cor(m))
#>    user  system elapsed 
#>  17.307   0.082  17.430
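One way to see why an optimized BLAS can help here: a Pearson correlation matrix can be written as a single matrix product of standardized columns, and matrix products are what BLAS accelerates. The sketch below uses base R only. The helper name `cor_via_crossprod` is hypothetical, and this is an illustration of the general idea, not a description of how tgs_cor is implemented.

```r
# Pearson correlation between matrix columns via one matrix product.
# Illustration only: not claimed to be the method used inside tgstat.
cor_via_crossprod <- function(x) {
  n <- nrow(x)
  # Center each column and scale it to unit sample standard deviation.
  z <- scale(x, center = TRUE, scale = TRUE)
  # crossprod(z) computes t(z) %*% z; R typically hands this to BLAS.
  crossprod(z) / (n - 1)
}

set.seed(1)
x <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)

r_fast <- cor_via_crossprod(x)
r_base <- cor(x)

# Both results should agree up to floating-point tolerance.
print(all.equal(r_fast, r_base))
```

The sketch assumes no missing values and no constant columns; either would need extra handling.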

Pearson correlation without BLAS, with NAs:

options(tgs_use.blas = FALSE)
system.time(tgs_cor(m_with_NAs, pairwise.complete.obs = TRUE))
#>    user  system elapsed 
#>  63.072   1.317   1.443

Same with BLAS:

options(tgs_use.blas = TRUE)
system.time(tgs_cor(m_with_NAs, pairwise.complete.obs = TRUE))
#>    user  system elapsed 
#>   6.431   0.888   0.639

Base R version:

system.time(cor(m_with_NAs, use = "pairwise.complete.obs"))
#>    user  system elapsed 
#> 238.613   0.164 239.340
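For readers unsure what pairwise-complete handling means, the following small sketch may help. For each pair of columns, only the rows where both values are present are used. The base R part is self-checking. The comparison with tgs_cor runs only if tgstat is installed, and it assumes tgs_cor returns a plain correlation matrix comparable to the output of cor.

```r
set.seed(1)
x <- matrix(rnorm(50 * 4), nrow = 50, ncol = 4)
x[sample(length(x), 20)] <- NA  # knock out 20 random cells

# Base R: every column pair uses only rows where both values are present.
r_base <- cor(x, use = "pairwise.complete.obs")

# Manual check for columns 1 and 2: keep rows complete in both columns.
ok <- complete.cases(x[, 1], x[, 2])
r_12 <- cor(x[ok, 1], x[ok, 2])
print(all.equal(r_base[1, 2], r_12))

# Optional comparison with tgstat (skipped when the package is missing).
if (requireNamespace("tgstat", quietly = TRUE)) {
  r_tgs <- tgstat::tgs_cor(x, pairwise.complete.obs = TRUE)
  print(all.equal(r_tgs, r_base, check.attributes = FALSE))
}
```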

Fast computation of distance matrices

Distance without BLAS, no NAs:

options(tgs_use.blas = FALSE)
system.time(tgs_dist(m))
#>    user  system elapsed 
#> 149.592   1.213   2.679

Same with BLAS:

options(tgs_use.blas = TRUE)
system.time(tgs_dist(m))
#>    user  system elapsed 
#>   1.905   0.277   0.278

Base R:

system.time(dist(m, method = "euclidean"))
#>    user  system elapsed 
#> 130.030   0.159 130.496
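Euclidean distances between rows can also be reduced to a matrix product, which is one reason a fast BLAS can make such a difference. The sketch below uses the identity d(i, j)^2 = |x_i|^2 + |x_j|^2 - 2 x_i . x_j in base R. The helper name `dist_via_tcrossprod` is hypothetical, and this illustrates the general technique rather than the code inside tgs_dist.

```r
# Euclidean distances between matrix rows via the Gram matrix.
# Illustration only: not claimed to be the method used inside tgstat.
dist_via_tcrossprod <- function(x) {
  g <- tcrossprod(x)                 # g[i, j] = sum(x[i, ] * x[j, ])
  sq <- diag(g)                      # squared norm of each row
  d2 <- outer(sq, sq, "+") - 2 * g   # squared distances
  d2[d2 < 0] <- 0                    # clamp tiny negatives from rounding
  sqrt(d2)
}

set.seed(1)
x <- matrix(rnorm(100 * 8), nrow = 100, ncol = 8)

d_fast <- dist_via_tcrossprod(x)
d_base <- as.matrix(dist(x, method = "euclidean"))

# The two should match up to floating-point tolerance.
print(all.equal(d_fast, d_base, check.attributes = FALSE))
```

This formulation trades some numerical accuracy for speed when rows are nearly identical, so it is best treated as a sketch rather than a drop-in replacement for dist.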

Notes regarding the usage of BLAS

tgstat runs best when R is linked with an optimized BLAS implementation.

Many optimized BLAS implementations are available, both proprietary (e.g. Intel’s MKL, Apple’s vecLib) and open source (e.g. OpenBLAS, ATLAS). Unfortunately, R often uses the reference BLAS implementation by default, which is known to perform poorly.

Having tgstat rely on the reference BLAS results in very poor performance and is strongly discouraged. If your R installation uses an optimized BLAS, set options(tgs_use.blas=TRUE) to allow tgstat to make BLAS calls. Otherwise, set options(tgs_use.blas=FALSE) (the default), which instructs tgstat to avoid BLAS and rely only on its own optimization methods. If in doubt, run one of tgstat's CPU-intensive functions (e.g. tgs_cor) and compare its run time under options(tgs_use.blas=FALSE) and options(tgs_use.blas=TRUE).
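The comparison suggested above could look roughly like the following sketch. It first reports which BLAS library the current R session is linked against, using base R's extSoftVersion() (available in R 3.4.0 and later), and then times tgs_cor under both settings if tgstat is installed. The matrix size and the variable names are arbitrary choices for illustration.

```r
# Which BLAS is this R session linked against? May be an empty string
# on platforms where R cannot determine the path.
blas_path <- extSoftVersion()[["BLAS"]]
print(blas_path)

# Time tgs_cor with and without BLAS (skipped when tgstat is missing).
if (requireNamespace("tgstat", quietly = TRUE)) {
  set.seed(1)
  m_small <- matrix(rnorm(1000 * 1000), nrow = 1000)

  old <- options(tgs_use.blas = FALSE)   # remember the previous setting
  t_no_blas <- system.time(tgstat::tgs_cor(m_small))[["elapsed"]]

  options(tgs_use.blas = TRUE)
  t_blas <- system.time(tgstat::tgs_cor(m_small))[["elapsed"]]

  options(old)                           # restore the previous setting
  print(c(no_blas = t_no_blas, blas = t_blas))
}
```

If the BLAS timing is clearly worse, R is probably linked against the reference BLAS and tgs_use.blas is better left at FALSE.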

Exact instructions for linking R with an optimized BLAS library are system-dependent and beyond the scope of this document.

Monthly Downloads: 276

Version: 2.3.14

License: GPL-2

Maintainer: Misha Hoichman

Last Published: August 19th, 2020

Functions in tgstat (2.3.14)

tgs_graph: Builds directed graph of correlations

tgs_cor: Calculates correlation or auto-correlation

tgs_finite: Checks whether all the elements of the vector are finite

tgs_graph_cover_resample: Clusters directed graph multiple times with randomized sample subset

tgs_knn: Returns k highest values of each column

tgs_graph_cover: Clusters directed graph

tgstat-package: Tanay's group statistical utilities

tgs_dist: Calculates distances between the matrix rows

tgs_matrix_tapply: For each matrix row apply a function over a ragged array