# kazaam v0.1-0

## Tools for Tall Distributed Matrices

Many data science problems reduce to operations on very tall,
skinny matrices. However, sometimes these matrices can be so tall that they
are difficult to work with, or do not even fit into main memory. One
strategy to deal with such objects is to distribute their rows across
several processors. To this end, we offer an 'S4' class for tall, skinny,
distributed matrices, called the 'shaq'. We also provide many useful
numerical methods and statistics operations for operating on these
distributed objects. The naming is a bit tongue-in-cheek: the class name
is a play on the fact that Shaquille O'Neal ("Shaq") is very tall, and he
starred in the film "Kazaam".

## Readme

# kazaam

- **Version:** 0.1-0
- **URL:** https://github.com/RBigData/kazaam
- **License:** BSD 2-Clause
- **Author:** Drew Schmidt, Wei-Chen Chen, Mike Matheson, and George Ostrouchov

Basic matrix and statistics operations for very tall, narrow distributed matrices. For a more general distributed matrix framework, see pbdDMAT.

## Installation

You can install the stable version from CRAN using the usual `install.packages()`:

```
install.packages("kazaam")
```

The development version is maintained on GitHub and can be installed with any of the packages that offer installation from GitHub:

```
### Pick your preference
devtools::install_github("RBigData/kazaam")
ghit::install_github("RBigData/kazaam")
remotes::install_github("RBigData/kazaam")
```

To simplify installation on cloud systems, we also have a Docker container available.

## Background

Our tall/skinny/distributed matrices are called `shaq`s, which stands for Super Huge Analytics done Quickly. This of course has nothing at all to do with esteemed actor Shaquille O'Neal, who is very tall. And since the package is so easy to use, it sometimes looks like a magic trick. And "kazaam!" is something a magician might say. It is by mere coincidence that Shaquille O'Neal starred in a movie titled Kazaam.

Throughout the package, we make a few key assumptions:

- The data local to each process has **the same number of columns**. The number of rows can vary freely, or be identical across ranks.
- Codes should be **run in batch**. Communication is handled by pbdMPI, which (as the name suggests) uses MPI.
- Finally, **adjacent ranks in the MPI communicator** as reported by `comm.rank()` (e.g., ranks 2 and 3, 20 and 21, 1000 and 1001, ...) should store **adjacent pieces of the matrix**.
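The layout assumptions above can be illustrated without MPI. The following is a minimal sketch in base R that only simulates ranks as list elements; the sizes and the names `owner` and `blocks` are hypothetical and are not part of kazaam:

```
# Hypothetical illustration in base R only; no MPI is involved here.
set.seed(1234)
m <- 10L       # total rows (made-up size)
n <- 3L        # columns, identical on every simulated rank
nranks <- 4L   # number of simulated ranks

x <- matrix(rnorm(m * n), nrow = m, ncol = n)

# Assign contiguous rows to simulated ranks 0..3; block sizes may differ.
owner <- sort(rep_len(seq_len(nranks) - 1L, m))
blocks <- lapply(split(seq_len(m), owner), function(rows) x[rows, , drop = FALSE])

# Every block has n columns, while the row counts vary by rank.
local_rows <- vapply(blocks, nrow, integer(1))
local_cols <- vapply(blocks, ncol, integer(1))

# Stacking the blocks in rank order recovers the global matrix.
x_rebuilt <- do.call(rbind, unname(blocks))
```

Because neighboring ranks hold neighboring row blocks, stacking the local pieces in rank order gives back the original matrix.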

In order to get good performance, there are several other considerations:

- The number of rows `m` should be **very large**. If you only have a few hundred thousand rows (and few columns), you're probably better off with base R matrices.
- The number of columns `n` should be **very small**. A shaq with 10,000 columns is pushing it.
- For most operations, the local problem size should be **as big as possible** so that the local BLAS/LAPACK operations can dominate over communication. This also keeps the total number of MPI ranks minimal, which cuts down on communication.
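As a rough back-of-envelope illustration of the last point, the sketch below compares local work against message size for a crossproduct-style operation. The cost model (about `2 * m_local * n^2` floating point operations locally, and one `n*n` matrix of 8-byte doubles communicated per rank) and all of the sizes are assumptions chosen for illustration, not measurements:

```
# Back-of-envelope model (an assumption, not a measurement).
m <- 1e9                   # total rows (hypothetical)
n <- 50                    # columns (hypothetical)
ranks <- c(16, 256, 4096)  # candidate MPI rank counts (hypothetical)

m_local <- m / ranks               # rows held by each rank
local_flops <- 2 * m_local * n^2   # approximate flops for a local n x n crossproduct
msg_bytes <- 8 * n^2               # one n x n matrix of doubles per rank

cost <- data.frame(
  ranks = ranks,
  m_local = m_local,
  local_flops = local_flops,
  msg_bytes = msg_bytes,
  flops_per_byte = local_flops / msg_bytes
)
print(cost)
```

Under this model the message size does not depend on the number of ranks, so the work done per byte communicated grows in proportion to the number of local rows.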

Because of these assumptions, we get a few distinct advantages over other, similar frameworks:

- Communication is minimal. Generally it amounts to a single `allreduce()` of an `n*n` matrix. With even a few hundred MPI ranks, this is basically instantaneous. And since most of the work is local, operations should complete very quickly.
- The total number of rows can be as large as you like, even if that exceeds the largest signed 32-bit integer, `2^31-1`.
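To see why a single reduction of an `n*n` matrix can suffice, consider the crossproduct: the crossproduct of the full matrix equals the sum of the crossproducts of its row blocks. The sketch below simulates this in base R, with `Reduce()` standing in for the sum-`allreduce()`. It is an illustration under that assumption, not the package's actual implementation, and the data and the name `local_pieces` are hypothetical:

```
# Hypothetical illustration in base R only; no MPI is involved here.
set.seed(42)
n <- 4L

# Simulated local pieces on 3 ranks with different row counts.
local_pieces <- list(
  matrix(rnorm(5 * n), ncol = n),
  matrix(rnorm(7 * n), ncol = n),
  matrix(rnorm(3 * n), ncol = n)
)

# Each simulated rank computes its own n x n crossproduct locally.
local_cp <- lapply(local_pieces, crossprod)

# Stand-in for the sum-allreduce: add up the n x n matrices.
global_cp <- Reduce(`+`, local_cp)

# Reference: crossproduct of the stacked (global) matrix.
reference <- crossprod(do.call(rbind, local_pieces))
```

The only data that would need to cross the network in this scheme is one `n*n` matrix per rank, regardless of how many rows each rank holds.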

## Examples and Documentation

Individual package methods are well-documented, both with example code and discussions of the total amount of communication required.

For complete example codes, see `inst/batchtests`. These are tests that are meant to be run in batch, generally with 2 or 4 MPI ranks. You can launch any one of them via:

```
mpirun -np 2 Rscript test_script.r
```

Finally, there is a comprehensive package vignette. If you installed the package from CRAN, you can view the vignette by entering:

```
vignette("kazaam", package="kazaam")
```

into your R session.

## Functions in kazaam

| Name | Description |
|---|---|
| collapse | collapse |
| cov | Covariance and Correlation |
| crossprod | Matrix Multiplication |
| expand | expand |
| cbind.shaq | cbind |
| col_ops | Column Operations |
| getters | getters |
| glms | Generalized Linear Model Fitters |
| arithmetic | Arithmetic Operators |
| bracket | subsetting |
| lm_coefs | Linear Model Coefficients |
| matmult | Matrix Multiplication |
| is.shaq | is.shaq |
| kazaam-package | Tall Matrices |
| qr | QR Decomposition Methods |
| shaq | shaq |
| ranshaq | ranshaq |
| scale | Scale |
| svd | svd |
| norm | norm |
| prcomp | Principal Components Analysis |
| setters | setters |
| shaq-class | Class shaq |

## Vignettes of kazaam

- cover/kazaam.pdf
- include/00-acknowledgement.tex
- include/kazaam.bib
- include/kazaam.bib.backup
- include/settings.tex
- include/uch_small.png
- build_pdf.sh
- kazaam.Rnw

## Details

| Field | Value |
|---|---|
| Type | Package |
| License | BSD 2-clause License + file LICENSE |
| ByteCompile | yes |
| URL | http://r-pbd.org/ |
| BugReports | http://group.r-pbd.org/ |
| MailingList | Please send questions and comments regarding pbdR to RBigData@gmail.com |
| RoxygenNote | 6.0.1 |
| NeedsCompilation | yes |
| Packaged | 2017-06-26 17:12:17 UTC; mschmid3 |
| Repository | CRAN |
| Date/Publication | 2017-06-29 13:19:58 UTC |
| Imports | methods, stats |
| Depends | pbdMPI (>= 0.3-0), R (>= 3.0.0) |
| Contributors | Wei-Chen Chen, George Ostrouchov, Mike Matheson, ORNL |
