# Wei-Chen Chen

#### 22 packages on CRAN

#### 1 packages on GitHub

'ZeroMQ' is a well-known library for high-performance asynchronous messaging in scalable, distributed applications. This package provides high level R wrapper functions to easily utilize 'ZeroMQ'. We mainly focus on interactive client/server programming frameworks. For convenience, a minimal 'ZeroMQ' library (4.2.2) is shipped with 'pbdZMQ', which can be used if no system installation of 'ZeroMQ' is available. A few wrapper functions compatible with 'rzmq' are also provided.

EM algorithms and several efficient initialization methods for model-based clustering of finite mixture Gaussian distribution with unstructured dispersion in both of unsupervised and semi-supervised learning.

Phylogenetic clustering (phyloclustering) is an evolutionary Continuous Time Markov Chain model-based approach to identify population structure from molecular data without assuming linkage equilibrium. The package phyclust (Chen 2011) provides a convenient implementation of phyloclustering for DNA and SNP data, capable of clustering individuals into subpopulations and identifying molecular sequences representative of those subpopulations. It is designed in C for performance, interfaced with R for visualization, and incorporates other popular open source programs including ms (Hudson 2002) <doi:10.1093/bioinformatics/18.2.337>, seq-gen (Rambaut and Grassly 1997) <doi:10.1093/bioinformatics/13.3.235>, Hap-Clustering (Tzeng 2005) <doi:10.1002/gepi.20063> and PAML baseml (Yang 1997, 2007) <doi:10.1093/bioinformatics/13.5.555>, <doi:10.1093/molbev/msm088>, for simulating data, additional analyses, and searching the best tree. See the phyclust website for more information, documentations and examples.

The utility of this package is in simulating mixtures of Gaussian distributions with different levels of overlap between mixture components. Pairwise overlap, defined as a sum of two misclassification probabilities, measures the degree of interaction between components and can be readily employed to control the clustering complexity of datasets simulated from mixtures. These datasets can then be used for systematic performance investigation of clustering and finite mixture modeling algorithms. Among other capabilities of 'MixSim', there are computing the exact overlap for Gaussian mixtures, simulating Gaussian and non-Gaussian data, simulating outliers and noise variables, calculating various measures of agreement between two partitionings, and constructing parallel distribution plots for the graphical display of finite mixture models.

Generalized eigenvalues and QZ decomposition (generalized Schur form) for an N-by-N non-symmetric matrix A or paired matrices (A,B) with eigenvalues reordering mechanism. The package is mainly based complex*16 and double precision of LAPACK library (version 3.4.2.)

An efficient interface to MPI by utilizing S4 classes and methods with a focus on Single Program/Multiple Data ('SPMD') parallel programming style, which is intended for batch parallel execution.

Estimating mutation and selection coefficients on synonymous codon bias usage based on models of ribosome overhead cost (ROC). Multinomial logistic regression and Markov Chain Monte Carlo are used to estimate and predict protein production rates with/without the presence of expressions and measurement errors. Work flows with examples for simulation, estimation and prediction processes are also provided with parallelization speedup. The whole framework is tested with yeast genome and gene expression data of Yassour (2009).

Utilizing scalable linear algebra packages mainly including 'BLACS', 'PBLAS', and 'ScaLAPACK' in double precision via 'pbdMPI' based on 'ScaLAPACK' version 2.0.2.

Aims to utilize model-based clustering (unsupervised) for high dimensional and ultra large data, especially in a distributed manner. The code employs 'pbdMPI' to perform a expectation-gathering-maximization algorithm for finite mixture Gaussian models. The unstructured dispersion matrices are assumed in the Gaussian models. The implementation is default in the single program multiple data programming model. The code can be executed through 'pbdMPI' and MPI' implementations such as 'OpenMPI' and 'MPICH'. See the High Performance Statistical Computing website <https://snoweye.github.io/hpsc/> for more information, documents and examples.

A very light implementation yet secure for remote procedure calls with unified interface via ssh (OpenSSH) or plink/plink.exe (PuTTY).

Utilizing model-based clustering (unsupervised) for functional magnetic resonance imaging (fMRI) data. The developed methods (Chen and Maitra (2018, manuscript)) include 2D and 3D clustering analyses (for p-values with voxel locations) and segmentation analyses (for p-values alone) for fMRI data where p-values indicate significant level of activation responding to stimulate of interesting. The analyses are mainly identifying active voxel/signal associated with normal brain behaviors. Analysis pipelines (R scripts) utilizing this package (see examples in 'inst/workflow/') is also implemented with high performance techniques.

A micro-package for reading "passwords", i.e. reading user input with masking, so that the input is not displayed as it is typed. Currently we have support for 'RStudio', the command line (every OS), and any platform where 'tcltk' is present.

How much ram do you need to store a 100,000 by 100,000 matrix? How much ram is your current R session using? How much ram do you even have? Learn the scintillating answer to these and many more such questions with the 'memuse' package.

A set of classes for managing distributed matrices, and a collection of methods for computing linear algebra and statistics. Computation is handled mostly by routines from the 'pbdBASE' package, which itself relies on the 'ScaLAPACK' and 'PBLAS' numerical libraries for distributed computing.

R comes with a suite of utilities for linear algebra with "numeric" (double precision) vectors/matrices. However, sometimes single precision (or less!) is more than enough for a particular task. This package extends R's linear algebra facilities to include 32-bit float (single precision) data. Float vectors/matrices have half the precision of their "numeric"-type counterparts but are generally faster to numerically operate on, for a performance vs accuracy trade-off. The internal representation is an S4 class, which allows us to keep the syntax identical to that of base R's. Interaction between floats and base types for binary operators is generally possible; in these cases, type promotion always defaults to the higher precision. The package ships with copies of the single precision 'BLAS' and 'LAPACK', which are automatically built in the event they are not available on the system.

An interface to and extensions for the 'PBLAS' and 'ScaLAPACK' numerical libraries. This enables R to utilize distributed linear algebra for codes written in the 'SPMD' fashion. This interface is deliberately low-level and mimics the style of the native libraries it wraps. For a much higher level way of managing distributed matrices, see the 'pbdDMAT' package.

A set of utilities for client/server computing with R, controlling a remote R session (the server) from a local one (the client). Simply set up a server (see package vignette for more details) and connect to it from your local R session ('RStudio', terminal, etc). The client/server framework is a custom 'REPL' and runs entirely in your R session without the need for installing a custom environment on your system. Network communication is handled by the 'ZeroMQ' library by way of the 'pbdZMQ' package.

Provides a full implementation of the 'Jupyter' <http://jupyter.org/> messaging protocol in C++ by leveraging 'Rcpp' and 'Xeus' <https://github.com/QuantStack/xeus>. 'Jupyter' supplies an interactive computing environment and a messaging protocol defined over 'ZeroMQ' for multiple programming languages. This package implements the 'Jupyter' kernel interface so that 'R' is exposed to this interactive computing environment. 'ZeroMQ' functionality is provided by the 'pbdZMQ' package. 'Xeus' is a C++ library that facilitates the implementation of kernels for 'Jupyter'. Additionally, 'Xeus' provides an interface to libraries that exist in the 'Jupyter' ecosystem for building widgets, plotting, and more <https://blog.jupyter.org/interactive-workflows-for-c-with-jupyter-fe9b54227d92>. 'JuniperKernel' uses 'Xeus' as a library for the 'Jupyter' messaging protocol.

This package adds collective parallel read and write capability to the R package ncdf4 version 1.8. Typical use is as a parallel NetCDF4 file reader in SPMD style programming. Each R process reads and writes its own data in a synchronized collective mode, resulting in faster parallel performance. Performance improvement is conditional on a parallel file system.

Many data science problems reduce to operations on very tall, skinny matrices. However, sometimes these matrices can be so tall that they are difficult to work with, or do not even fit into main memory. One strategy to deal with such objects is to distribute their rows across several processors. To this end, we offer an 'S4' class for tall, skinny, distributed matrices, called the 'shaq'. We also provide many useful numerical methods and statistics operations for operating on these distributed objects. The naming is a bit "tongue-in-cheek", with the class a play on the fact that 'Shaquille' 'ONeal' ('Shaq') is very tall, and he starred in the film 'Kazaam'.

Implements several algorithms for supervised learning on sparse data and many matrix factorizations of sparse matrices (with a focus on applications for recommender systems). All algorithms work on sparse matrices. Also they extensively use BLAS and LAPACK and parallelized with OpenMP. Implementations are reasonably fast and nicely work with large datasets (millions of rows and millions of columns). List of algorithms for supervised learning: 1) Elastic net regression via Follow The Proximally-Regularized leader algorithm 2) Second order Factorization Machines via stochastic gradient descent with adaptive learning rates. Allows to learn model parameters out-of-core. Fast - asynchronous parallel, SIMD accelerated. List of algorithms for matrix factorization: 1) Weighted Regularazied Matrix Factorization with Alternating Least Squares (ALS) for implicit feedback (inculding approximate Conjugate Gradient solver). Optional non-negativity (NNMF, non-negative matrix factorization). 2) Regularazied Matrix Factorization with ALS for explicit feedback Optional non-negativity (NNMF, non-negative matrix factorization). 3) Fast Trunceate SVD and Soft-SVD via ALS 4) Soft-Impute via fast ALS and solution in SVD form 5) LinearFlow method which learns item-item similarity matrix from the data 6) GloVe - GlobalVectors embeddings Clustering: 1) kmeans from Armadillo library which provides smart (similar to kmeans++) cluster initializations. Misc utils/methods: 1) multithreaded `%*%` and `tcrossprod()` for `<dgRMatrix, matrix>` 2) multithreaded `%*%` and `crossprod()` for `<matrix, dgCMatrix>`