Disclaimer
This project is still under development; CRAN release pending.
Rport is an R package that greatly facilitates common tasks found in many R Business Intelligence apps. It bridges R and SQL analytics similarly to how Rails bridges Ruby and Web Development.
Introduction
From our analytics work on data from Adjust and Apptrace, we've identified several tasks needed for nearly any new R project.
- Handling multiple database connections within one R session
- Caching results from long SQL statements in development
- Parallel job processing
- Building UNIX executables
- Organizing the growing codebase
While R supports all of these, there isn't a framework to contain the repeated code that you'll end up with when building a multi-project analytics app.
For our needs we built such a framework, and this article presents it. We named it Rport, and it's open-sourced as an R package.
Quick Start
Rport is distributed as an R package and you can get the most up-to-date version from GitHub, directly from within an R session:
```r
> library(devtools); install_github('rport', 'adeven')
```
Then if you want to set up a fresh Rport app do:
```r
> library(rport)
> rport.app.new('BI kingdom', root.dir='~/my_apps/')
# help(rport.app.new) for full list of options
```
This will create the file structure of a clean Rport app. Next you'll probably want to set up individual projects within your new Rport app. To do this:
```r
> rport.project.new('weekly kpi report', root.dir='~/my_apps/bi_kingdom')
# again help(rport.project.new) for more
```
Now you've bootstrapped your first Rport app with one project in it, for reporting weekly KPIs. The generators have created some code for you, so you can see some output right away:
```
$ cd ~/my_apps/bi_kingdom
$ ./bin/weekly_kpi_report.R
```
Go ahead and browse those generated files; they're well commented, and you'll find some next steps there too.
Finally, note that Rport generated some dummy spec files for you that you can already run from your shell:

```
$ ./spec/all
```

and see your new tests outputting:
```
Specs for weekly_kpi_report : .
/Users/nikola/rport_app/spec/weekly_kpi_report/main_spec.R
file    "main_spec.R"
context "Specs for weekly_kpi_report"
test    "Make sure you add proper tests here"
nb      1
failed  0
error   FALSE
user    0.003
system  0
real    0.003
```
Make sure you edit the relevant files in the spec folder to write real tests for your projects.
Rport Apps
Now that you've set up an Rport app, let's take a more detailed look at it. An Rport app would likely contain multiple projects, often serving different purposes. You might have:
- Cron jobs for analytics, reports, calculations and other tasks
- A lot of one-off exploration scripts
- Asynchronous processing of tasks from web apps
- Standalone web services
- Others
An Rport app with two projects might have the following folder structure:
```
├── README.md
├── bin
│   ├── monthly_aggregates.R
│   └── weekly_kpis_report.R
├── config
│   ├── database.yml
│   └── settings.R
├── doc
│   ├── monthly_aggregates.md
│   └── weekly_kpis_report.md
├── lib
│   ├── functions
│   │   ├── monthly_aggregates
│   │   │   └── main.R
│   │   └── weekly_kpis_report
│   │       └── main.R
│   ├── opts
│   │   ├── monthly_aggregates.R
│   │   └── weekly_kpis_report.R
│   └── shared
├── log
├── script
└── spec
    ├── all
    ├── monthly_aggregates
    │   └── main_spec.R
    └── weekly_kpi_report
        └── main_spec.R
```
For an illustration of Rport's features, we created a Demo Rport App. Make sure to refer to it while reading this post.
Rport Features
Rport takes care of tedious background tasks, so you can focus on the actual exploration and analysis of your data. Let's introduce the core features of the package below.
Database Connectivity
If you're directly interfacing R's SQL drivers, you'll likely find yourself often using the `dbGetQuery(connection, query)` routine, meaning that you need to carry your connection object around every time you issue a database query.
Multiple Connections Handling
Using Rport in the scenario above, you'll define all your connections in a `config/database.yml` file, similarly to what you'd do in other frameworks (e.g. Rails). The difference is that with Rport you can define not only multiple environments, but also multiple connections within each environment.
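A `config/database.yml` for this setup might look roughly like the sketch below. The exact keys depend on the underlying R database driver; the hosts and names here are made up for illustration, and the `read` and `backup` connection names match the accessors used in the next snippet.

```yaml
# config/database.yml -- illustrative sketch, not a canonical Rport config
production:
  read:
    adapter: postgresql
    host: replica.example.com
    database: bi_production
    user: rport
  backup:
    adapter: postgresql
    host: backup.example.com
    database: bi_backup
    user: rport
```

Each named connection in an environment becomes its own accessor, which is what makes the per-connection `rport.*` query methods possible.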
```r
# bootstrap production environment
rport('production')

# use the handy accessor method for the `read` connection, generated by Rport
dat <- rport.read('select me from you')

# access another database and get more data:
old.dat <- rport.backup('select me from old_you')

# `dat` and `old.dat` are now `data.table` objects with results from the
# `production->read` and `production->backup` connections respectively
```
A few things are worth mentioning about this snippet:
- Rport created the `rport.read` and `rport.backup` methods magically, based on the `read` and `backup` database configurations in `config/database.yml`. Check out the Example app to see more of this.
- You can have as many database configurations as you like and combine results from all of them in a single R session. We use this feature a lot when doing adjust.io <-> apptrace work, or when we offload heavy reads to a replication server.
- Note that different configurations could also mean entirely different database servers. Nothing stops you from bringing PostgreSQL and MySQL results together in R through Rport. And you needn't care about the underlying connection mechanics, because Rport handles them for you.
- Rport works with the data.table package, and all results returned from database queries are `data.table` objects. If you're wondering why we introduced this dependency, just check out this fantastic package from the link above.
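To see why `data.table` results are convenient, here is a minimal sketch using plain data.table, independent of Rport (the sample values are made up):

```r
library(data.table)

# pretend this came back from a query accessor such as rport.read(...)
dat <- data.table(app_id = c(1L, 1L, 2L), rank = c(10, 12, 3))

# fast grouped aggregation with data.table's `by` syntax
avg.ranks <- dat[, list(avg.rank = mean(rank)), by = app_id]
```

Because every query result already is a `data.table`, this kind of grouped crunching works immediately, without converting data frames first.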
Caching Query Results
When working on scripts, you might want to load larger chunks of raw data into memory and crunch them in R (ideally with data.table) to produce your results. When just 'playing' with or exploring the raw data, you won't want to wait for an unchanged SQL query to run again every time you rerun a script to test your R code.
Rport's connection accessors allow you to cache results using R's `load` and `save` routines.
```r
rport('development')

# Read the data in memory only if not found in the cache.
dat <- rport.read('select app_id, rank from application_ranks', cache=TRUE)

# do crazy crunching on `dat`
```
This query will now only run once and subsequent executions of the script will read and return the cached R object from the file system.
To ensure that nothing bad ever happens to you because of this caching, it only works in development, and it logs clearly when reading from cache. Furthermore, it works on a per-connection basis, so the same query issued over different connections is cached separately.
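The underlying pattern can be sketched in plain R using `save()` and `load()`. This is an illustration of the idea, not Rport's actual implementation; the function name and the use of the `digest` package to key cache files are assumptions.

```r
# Hypothetical sketch of file-based query caching; not Rport's real code.
library(digest)

cached.query <- function(query, run.query, cache.dir = 'cache') {
  dir.create(cache.dir, showWarnings = FALSE)
  cache.file <- file.path(cache.dir, paste0(digest(query), '.RData'))

  if (file.exists(cache.file)) {
    load(cache.file)              # restores `result` saved below
    return(result)
  }

  result <- run.query(query)      # only hit the database on a cache miss
  save(result, file = cache.file)
  result
}
```

Keying the cache file on a hash of the SQL text is what makes an unchanged query return instantly on the next run.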
Parallel Report Compilation
Since R 2.15, the `parallel` package is part of R core. Rport provides some wrappers around that package to address specific use cases.
For example, a newsletter or report will likely consist of several independent items that could be generated in parallel for better performance.
```r
rport('production')
rport.bootstrap('parallel', cluster.size=8)

users.stats <- function(opts) {
  sql <- 'SELECT count(*) FROM users WHERE created_at >= %s'
  rport.apptrace(sprintf(sql, opts$start_date))
}

products.stats <- function(opts) {
  sql <- 'SELECT count(*) FROM products WHERE created_at >= %s'
  rport.apptrace(sprintf(sql, opts$start_date))
}

# run the components in parallel
result <- rport.parallel(
  users    = { users.stats, opts },
  products = { products.stats, opts }
)

# `result` is now a list with the results like:
# list(users=data.table(..), products=data.table(..))
```
Working with Executables
Executables are a common interface to our Rport apps. We use them to schedule cron jobs (e.g. report generation, aggregations, etc.) or to run other analytics tasks. Rport uses Rscript to create cross-platform executables.
The `rport.project.new()` initializer already created a file in the `bin/` folder for you. Rport will also automatically load all R files relevant to your project.
CLI options
Rport manages CLI options using the `optparse` package, following the convention of placing opts files under `lib/opts/my_script_name.R`. Check what `rport.project.new()` generated for you above, or the sample app, for an illustration.
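An opts file following that convention might look roughly like this. The option name is made up for illustration (chosen to match the `opts$start_date` used in the parallel example); refer to the generated files for the real pattern.

```r
# lib/opts/weekly_kpi_report.R -- illustrative sketch, not generated code
library(optparse)

option.list <- list(
  make_option('--start_date', type = 'character', default = '2012-01-01',
              help = 'Count records created on or after this date')
)

opts <- parse_args(OptionParser(option_list = option.list))
# `opts$start_date` is now available to the project's functions
```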
Logging
Rport writes a lot about what it's doing, either interactively or in log files. You can use these logs to get an idea of query and script execution times, as well as for debugging.
The convention for executables is that all output, including output from parallel workers, is `sink()`ed to `log/my_script_name.log`.
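In plain R, that redirection pattern looks roughly like this (a sketch of the convention using base R's `sink()`, not Rport's internals; the file name is illustrative):

```r
# Hypothetical sketch: redirect a script's output to a log file with sink()
log.con <- file('log/my_script.log', open = 'wt')
sink(log.con)                       # redirect regular output
sink(log.con, type = 'message')     # redirect messages and warnings too

cat('crunching weekly KPIs...\n')   # lands in log/my_script.log

# restore normal output when done
sink(type = 'message')
sink()
close(log.con)
```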
Summary
Rport is an ambitious project under ongoing development. Be sure to follow the GitHub repository for all updates.
See more
Read the blog post for more of the features.
License
This Software is licensed under the MIT License.
Copyright (c) 2012 adeven GmbH, http://www.adeven.com
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.