# Drew Schmidt

#### 24 packages on CRAN

#### 2 packages on GitHub

A micro-package for reading "passwords", i.e. reading user input with masking, so that the input is not displayed as it is typed. Currently we have support for 'RStudio', the command line (every OS), and any platform where 'tcltk' is present.

Utilities for secure password hashing via the argon2 algorithm. It is a relatively new hashing algorithm and is believed to be very secure. The 'argon2' implementation included in the package is the reference implementation. The package also includes some utilities that should be useful for digest authentication, including a wrapper of 'blake2b'. For similar R packages, see 'sodium' and 'bcrypt'. See <https://en.wikipedia.org/wiki/Argon2> or <https://eprint.iacr.org/2015/430.pdf> for more information.
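As a rough illustration of the digest-authentication idea mentioned above, the sketch below computes a keyed BLAKE2b digest and checks it in constant time. It uses Python's standard `hashlib` rather than the R package; the helper names `blake2b_hex` and `verify` are hypothetical, and argon2 itself is not shown.

```python
import hashlib
import hmac


def blake2b_hex(message: bytes, key: bytes = b"") -> str:
    """Return the hex-encoded 64-byte BLAKE2b digest of `message` (optionally keyed)."""
    return hashlib.blake2b(message, key=key).hexdigest()


def verify(message: bytes, key: bytes, expected_hex: str) -> bool:
    """Recompute the digest and compare it without leaking timing information."""
    return hmac.compare_digest(blake2b_hex(message, key), expected_hex)


tag = blake2b_hex(b"hello", key=b"shared-secret")
print(verify(b"hello", b"shared-secret", tag))     # matching message and key
print(verify(b"tampered", b"shared-secret", tag))  # altered message
```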

An n-gram is a sequence of n "words" taken, in order, from a body of text. This is a collection of utilities for creating, displaying, summarizing, and "babbling" n-grams. The 'tokenization' and "babbling" are handled by very efficient C code, which can even be built as its own standalone library. The babbler is a simple Markov chain. The package also offers a vignette with complete example 'workflows' and information about the utilities offered in the package.
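To make the n-gram and "babbling" ideas concrete, here is a minimal Python sketch, not the package's C implementation. It assumes whitespace tokenization, and the function names `ngrams`, `build_chain`, and `babble` are hypothetical.

```python
import random
from collections import defaultdict


def ngrams(text, n):
    """Return the n-grams of `text` as tuples of words, in order of appearance."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_chain(text, n):
    """Map each (n-1)-word prefix to the list of words observed to follow it."""
    chain = defaultdict(list)
    for gram in ngrams(text, n):
        chain[gram[:-1]].append(gram[-1])
    return chain


def babble(chain, length, seed=None):
    """Walk the chain at random until `length` words exist or no continuation is known."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))  # random starting prefix
    k = len(state)
    out = list(state)
    while len(out) < length and state in chain:
        out.append(rng.choice(chain[state]))
        state = tuple(out[len(out) - k:])  # slide the prefix window forward
    return " ".join(out[:length])


text = "the cat sat on the mat and the cat ran"
print(ngrams(text, 2)[:3])
print(babble(build_chain(text, 2), length=8, seed=1))
```

Because each next word depends only on the preceding n-1 words, the generator is a simple Markov chain over word prefixes.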

How much RAM do you need to store a 100,000 by 100,000 matrix? How much RAM is your current R session using? How much RAM do you even have? Learn the scintillating answer to these and many more such questions with the 'memuse' package.
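The first question can be answered with simple arithmetic. The sketch below assumes a dense matrix of 8-byte double precision values and ignores any object overhead; `matrix_bytes` is a hypothetical helper, not part of the package.

```python
def matrix_bytes(nrow, ncol, bytes_per_element=8):
    """Storage needed for a dense nrow x ncol matrix, excluding object overhead."""
    return nrow * ncol * bytes_per_element


size = matrix_bytes(100_000, 100_000)
print(size)          # total bytes
print(size / 10**9)  # decimal gigabytes (GB)
print(size / 2**30)  # binary gibibytes (GiB)
```

Under these assumptions the matrix needs 80,000,000,000 bytes, that is 80 GB, or roughly 74.5 GiB.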

A set of classes for managing distributed matrices, and a collection of methods for computing linear algebra and statistics. Computation is handled mostly by routines from the 'pbdBASE' package, which itself relies on the 'ScaLAPACK' and 'PBLAS' numerical libraries for distributed computing.

R comes with a suite of utilities for linear algebra with "numeric" (double precision) vectors/matrices. However, sometimes single precision (or less!) is more than enough for a particular task. This package extends R's linear algebra facilities to include 32-bit float (single precision) data. Float vectors/matrices have half the precision of their "numeric"-type counterparts but are generally faster to numerically operate on, for a performance vs accuracy trade-off. The internal representation is an S4 class, which allows us to keep the syntax identical to that of base R's. Interaction between floats and base types for binary operators is generally possible; in these cases, type promotion always defaults to the higher precision. The package ships with copies of the single precision 'BLAS' and 'LAPACK', which are automatically built in the event they are not available on the system.
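The precision trade-off can be seen by rounding a double to 32-bit storage and back. This is only an illustrative Python sketch using the standard `struct` module; `to_float32` is a hypothetical helper and has nothing to do with the package's S4 class.

```python
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest 32-bit float and return it."""
    return struct.unpack("f", struct.pack("f", x))[0]


print(struct.calcsize("f"), struct.calcsize("d"))  # bytes per single vs double value
print(to_float32(0.5) == 0.5)  # 0.5 is exactly representable in both formats
print(to_float32(0.1) == 0.1)  # 0.1 is not, so the rounded value differs
print(abs(to_float32(0.1) - 0.1))  # size of the rounding error
```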

An interface to and extensions for the 'PBLAS' and 'ScaLAPACK' numerical libraries. This enables R to utilize distributed linear algebra for codes written in the 'SPMD' fashion. This interface is deliberately low-level and mimics the style of the native libraries it wraps. For a much higher level way of managing distributed matrices, see the 'pbdDMAT' package.

Fast implementations of the co-operations: covariance, correlation, and cosine similarity. The implementations are fast and memory-efficient and their use is resolved automatically based on the input data, handled by R's S3 methods. Full descriptions of the algorithms and benchmarks are available in the package vignettes.
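For reference, the three co-operations can be written out directly for a pair of vectors. This plain Python sketch only shows the definitions (with the sample covariance using an n-1 denominator, which is an assumption here); it makes no attempt at the speed or memory efficiency the package describes.

```python
import math


def mean(x):
    return sum(x) / len(x)


def covariance(x, y):
    """Sample covariance with an n-1 denominator."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(covariance(x, y), correlation(x, y), cosine(x, y))
```

Note that correlation is the cosine similarity of the mean-centered vectors, which is why the three operations are naturally grouped together.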

A set of utilities for client/server computing with R, controlling a remote R session (the server) from a local one (the client). Simply set up a server (see package vignette for more details) and connect to it from your local R session ('RStudio', terminal, etc). The client/server framework is a custom 'REPL' and runs entirely in your R session without the need for installing a custom environment on your system. Network communication is handled by the 'ZeroMQ' library by way of the 'pbdZMQ' package.

Queues, stacks, and 'deques' are list-like, abstract data types. These are meant to be very cheap to "grow", or insert new objects into. A typical use case involves storing data in a list in a streaming fashion, when you do not necessarily know how many elements need to be stored. Unlike R's lists, the new data structures provided here are not necessarily stored contiguously, making insertions and deletions at the front/end of the structure much faster. The underlying implementation is new and uses a head/tail doubly linked list; thus, we do not rely on R's environments or hashing. To avoid unnecessary data copying, most operations on these data structures are performed via side-effects.
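The head/tail doubly linked list idea can be sketched as follows. This is a generic Python illustration with hypothetical class and method names, not the package's implementation; it shows why insertions and removals at either end take constant time and modify the structure in place.

```python
class _Node:
    __slots__ = ("value", "prev", "next")

    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None


class Deque:
    """Double-ended queue backed by a doubly linked list with head and tail pointers."""

    def __init__(self):
        self.head = None
        self.tail = None
        self.size = 0

    def push_front(self, value):
        node = _Node(value)
        node.next = self.head
        if self.head is None:
            self.tail = node  # first element is both head and tail
        else:
            self.head.prev = node
        self.head = node
        self.size += 1

    def push_back(self, value):
        node = _Node(value)
        node.prev = self.tail
        if self.tail is None:
            self.head = node
        else:
            self.tail.next = node
        self.tail = node
        self.size += 1

    def pop_front(self):
        if self.head is None:
            raise IndexError("pop from empty deque")
        node = self.head
        self.head = node.next
        if self.head is None:
            self.tail = None  # deque is now empty
        else:
            self.head.prev = None
        self.size -= 1
        return node.value

    def pop_back(self):
        if self.tail is None:
            raise IndexError("pop from empty deque")
        node = self.tail
        self.tail = node.prev
        if self.tail is None:
            self.head = None
        else:
            self.tail.next = None
        self.size -= 1
        return node.value
```

Using only `push_back` and `pop_front` gives a queue; using only `push_back` and `pop_back` gives a stack.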

Many data science problems reduce to operations on very tall, skinny matrices. However, sometimes these matrices can be so tall that they are difficult to work with, or do not even fit into main memory. One strategy to deal with such objects is to distribute their rows across several processors. To this end, we offer an 'S4' class for tall, skinny, distributed matrices, called the 'shaq'. We also provide many useful numerical methods and statistics operations for operating on these distributed objects. The naming is a bit "tongue-in-cheek", with the class a play on the fact that Shaquille O'Neal ('Shaq') is very tall, and he starred in the film 'Kazaam'.
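One reason row distribution works well for tall, skinny matrices is that quantities such as the cross product X^T X are sums over rows, so each processor can compute a small local result and only those small results need to be combined. The Python sketch below simulates this on one machine; the function names are hypothetical, and in a real MPI setting the final sum would typically be a reduction across processes.

```python
def crossprod(rows):
    """Return X^T X for a matrix given as a list of equal-length rows."""
    p = len(rows[0])
    out = [[0.0] * p for _ in range(p)]
    for r in rows:
        for i in range(p):
            for j in range(p):
                out[i][j] += r[i] * r[j]
    return out


def split_rows(rows, nparts):
    """Split the rows into at most `nparts` contiguous blocks."""
    size = -(-len(rows) // nparts)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def distributed_crossprod(rows, nparts):
    """Compute X^T X block by block, then add up the small p x p partial results."""
    partials = [crossprod(block) for block in split_rows(rows, nparts)]
    p = len(rows[0])
    return [[sum(part[i][j] for part in partials) for j in range(p)]
            for i in range(p)]


X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
print(distributed_crossprod(X, 3) == crossprod(X))
```

Each partial result is only p by p, no matter how many rows a block holds, so communication stays cheap when the matrix is skinny.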

A popular technique in text analysis today is sentiment analysis, or trying to determine the overall emotional attitude of a piece of text (positive or negative). We provide a new, basic implementation of a common method for computing sentiment, whereby words are scored as positive or negative according to a "dictionary", and then an average of those scores for the document is produced. The package uses the 'Hu' and 'Liu' sentiment dictionary for assigning sentiment.
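The dictionary method described above can be sketched in a few lines. This Python example uses a tiny made-up dictionary (not the Hu and Liu lexicon) and assumes the average is taken over the words that appear in the dictionary; the package may normalize differently.

```python
import re

# Hypothetical four-word dictionary, for illustration only.
SCORES = {"good": 1, "great": 1, "bad": -1, "awful": -1}


def sentiment(text, scores=SCORES):
    """Average the dictionary scores of the scored words; 0.0 if none are found."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [scores[w] for w in words if w in scores]
    return sum(hits) / len(hits) if hits else 0.0


print(sentiment("The food was great and the staff were good."))
print(sentiment("Great food, awful service."))
print(sentiment("It was a restaurant."))
```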

Most search engines have a "did you mean?" feature, where suggestions are given in the presence of likely typos. We are able to somewhat replicate this functionality with ancient spellchecker techniques. When R detects that a function or object listed in the user's input is not found, the package finds the minimum 'Levenshtein' distance between the "un-found" token and all symbols in the user's global environment plus all loaded namespaces. The word with minimum 'Levenshtein' distance (in the event of ties, the first such detected) is then suggested as an alternative to the missing symbol. To use, simply load the package from an interactive R session and start making some errors. However, there is an explicit interface for starting and stopping "did you mean?" behavior.
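The suggestion rule described above (minimum Levenshtein distance, first candidate wins on ties) can be sketched in Python as follows. The function names and the candidate list are hypothetical; the package itself searches R's global environment and loaded namespaces.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def did_you_mean(missing, symbols):
    """Return the first symbol with the smallest distance to `missing`, or None."""
    best, best_dist = None, None
    for s in symbols:
        d = levenshtein(missing, s)
        if best_dist is None or d < best_dist:  # strict '<' keeps the first on ties
            best, best_dist = s, d
    return best


print(did_you_mean("lenght", ["median", "length", "mean"]))
```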

'Radix trees', or 'tries', are key-value data structures optimised for efficient lookups, similar in purpose to hash tables. 'triebeard' provides an implementation of 'radix trees' for use in R programming and in developing packages with 'Rcpp'.
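To show how a trie supports key-value lookups by prefix, here is a minimal Python sketch. It is a plain character trie rather than a compressed radix tree, and the class and method names are hypothetical, not the 'triebeard' API.

```python
_END = object()  # sentinel marking "a key ends here" inside a node


class Trie:
    """Character trie mapping string keys to values."""

    def __init__(self):
        self.root = {}

    def insert(self, key, value):
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})  # create child nodes as needed
        node[_END] = value

    def longest_match(self, text):
        """Return the value of the longest stored key that is a prefix of `text`."""
        node, best = self.root, None
        for ch in text:
            if ch not in node:
                break
            node = node[ch]
            if _END in node:
                best = node[_END]
        return best


t = Trie()
t.insert("AO", "Angola")
t.insert("AOB", "Angola, branch B")
print(t.longest_match("AOB-1234"))
print(t.longest_match("AO-9"))
print(t.longest_match("ZZ-1"))
```

Lookup cost depends on the length of the query string rather than on the number of stored keys, which is what makes the structure comparable in purpose to a hash table.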

A toolkit for all URL-handling needs, including encoding and decoding, parsing, parameter extraction and modification. All functions are designed to be both fast and entirely vectorised. It is intended to be useful for people dealing with web-related datasets, such as server-side logs, although it may also be useful in other situations involving large sets of URLs.

'ZeroMQ' is a well-known library for high-performance asynchronous messaging in scalable, distributed applications. This package provides high level R wrapper functions to easily utilize 'ZeroMQ'. We mainly focus on interactive client/server programming frameworks. For convenience, a minimal 'ZeroMQ' library (4.2.2) is shipped with 'pbdZMQ', which can be used if no system installation of 'ZeroMQ' is available. A few wrapper functions compatible with 'rzmq' are also provided.

Wraps some of the matrix exponentiation utilities from EXPOKIT (<http://www.maths.uq.edu.au/expokit/>), a FORTRAN library that is widely recommended for matrix exponentiation (Sidje RB, 1998. "Expokit: A Software Package for Computing Matrix Exponentials." ACM Trans. Math. Softw. 24(1): 130-156). EXPOKIT includes functions for exponentiating both small, dense matrices, and large, sparse matrices (in sparse matrices, most of the cells have value 0). Rapid matrix exponentiation is useful in phylogenetics when we have a large number of states (as we do when we are inferring the history of transitions between the possible geographic ranges of a species), but is probably useful in other ways as well.

Connectors to online and offline sources for taking IP addresses and geolocating them to country, city, timezone and other geographic ranges. For individual connectors, see the package index.

An efficient interface to MPI using S4 classes and methods, with a focus on the Single Program/Multiple Data ('SPMD') parallel programming style, which is intended for batch parallel execution.

Estimates mutation and selection coefficients for synonymous codon usage bias based on models of ribosome overhead cost (ROC). Multinomial logistic regression and Markov Chain Monte Carlo are used to estimate and predict protein production rates with/without the presence of expressions and measurement errors. Workflows with examples for simulation, estimation and prediction processes are also provided with parallelization speedup. The whole framework is tested with yeast genome and gene expression data of Yassour (2009).

Provides access to scalable linear algebra packages, mainly 'BLACS', 'PBLAS', and 'ScaLAPACK', in double precision via 'pbdMPI'. Based on 'ScaLAPACK' version 2.0.2.

A very lightweight yet secure implementation of remote procedure calls, with a unified interface via ssh (OpenSSH) or plink/plink.exe (PuTTY).

This package adds collective parallel read and write capability to the R package ncdf4 version 1.8. Typical use is as a parallel NetCDF4 file reader in SPMD style programming. Each R process reads and writes its own data in a synchronized collective mode, resulting in faster parallel performance. Performance improvement is conditional on a parallel file system.

Implements several algorithms for supervised learning on sparse data and many matrix factorizations of sparse matrices (with a focus on applications for recommender systems). All algorithms work on sparse matrices. They make extensive use of BLAS and LAPACK and are parallelized with OpenMP. Implementations are reasonably fast and work well with large datasets (millions of rows and millions of columns). List of algorithms for supervised learning: 1) Elastic net regression via the Follow The Proximally-Regularized Leader algorithm. 2) Second order Factorization Machines via stochastic gradient descent with adaptive learning rates. Model parameters can be learned out-of-core. Fast: asynchronous parallel, SIMD accelerated. List of algorithms for matrix factorization: 1) Weighted Regularized Matrix Factorization with Alternating Least Squares (ALS) for implicit feedback (including an approximate Conjugate Gradient solver). Optional non-negativity (NNMF, non-negative matrix factorization). 2) Regularized Matrix Factorization with ALS for explicit feedback. Optional non-negativity (NNMF, non-negative matrix factorization). 3) Fast Truncated SVD and Soft-SVD via ALS. 4) Soft-Impute via fast ALS and solution in SVD form. 5) LinearFlow method, which learns an item-item similarity matrix from the data. 6) GloVe (Global Vectors) embeddings. Clustering: 1) kmeans from the Armadillo library, which provides smart (similar to kmeans++) cluster initializations. Misc utils/methods: 1) multithreaded `%*%` and `tcrossprod()` for `<dgRMatrix, matrix>`. 2) multithreaded `%*%` and `crossprod()` for `<matrix, dgCMatrix>`.