Disclaimer
This project is still under development; CRAN release pending.
Rport is an R package that greatly facilitates common tasks found in many R Business Intelligence apps. It bridges R and SQL analytics similarly to how Rails bridges Ruby and Web Development.
Introduction
From our analytics work on data from Adjust and Apptrace, we've identified several tasks needed for nearly any new R project.
- Handling multiple database connections within one R session
- Caching results from long SQL statements in development
- Parallel job processing
- Building UNIX executables
- Organizing the growing codebase
While R supports all of these, there isn't a framework to contain the repeated code that you'll end up with when building a multi-project analytics app.
For our needs we built such a framework, and this article presents it. We named it Rport, and it's open-sourced as an R package.
Quick Start
Rport is distributed as an R package and you can get the most up-to-date version from GitHub, directly from within an R session:
```r
> library(devtools); install_github('rport', 'adeven')
```
Then if you want to set up a fresh Rport app do:
```r
> library(rport)
> rport.app.new('BI kingdom', root.dir='~/my_apps/')
# help(rport.app.new) for full list of options
```
This will create the file structure of a clean Rport app. Next you'll probably want to set up individual projects within your new Rport app. To do this:
```r
> rport.project.new('weekly kpi report', root.dir='~/my_apps/bi_kingdom')
# again help(rport.project.new) for more
```
Now you've bootstrapped your first Rport app with one project in it, for reporting weekly KPIs. The generators have created some code for you, so you can see some output right away:
```
$ cd ~/my_apps/bi_kingdom
$ ./bin/weekly_kpi_report.R
```
Go ahead and browse those generated files; they're well commented, and you'll find some next steps there too.
Finally, note that Rport generated some dummy spec files for you that you can already run from your shell:

```
$ ./spec/all
```

and see your new tests outputting:
```
Specs for weekly_kpi_report : .
/Users/nikola/rport_app/spec/weekly_kpi_report/main_spec.R
file    "main_spec.R"
context "Specs for weekly_kpi_report"
test    "Make sure you add proper tests here"
nb      1
failed  0
error   FALSE
user    0.003
system  0
real    0.003
```
Make sure you edit the relevant files in the spec folder to write real tests for your projects.
Rport Apps
Now that you've set up an Rport app, let's take a more detailed look at it. An Rport app would likely contain multiple projects, often serving different purposes. You might have:
- Cron jobs for analytics, reports, calculations and other tasks
- A lot of one-off exploration scripts
- Asynchronous processing of tasks from web apps
- Standalone web services
- Others
An Rport app with two projects might have the following folder structure:
```
├── README.md
├── bin
│   ├── monthly_aggregates.R
│   └── weekly_kpis_report.R
├── config
│   ├── database.yml
│   └── settings.R
├── doc
│   ├── monthly_aggregates.md
│   └── weekly_kpis_report.md
├── lib
│   ├── functions
│   │   ├── monthly_aggregates
│   │   │   └── main.R
│   │   └── weekly_kpis_report
│   │       └── main.R
│   ├── opts
│   │   ├── monthly_aggregates.R
│   │   └── weekly_kpis_report.R
│   └── shared
├── log
├── script
└── spec
    ├── all
    ├── monthly_aggregates
    │   └── main_spec.R
    └── weekly_kpi_report
        └── main_spec.R
```
For an illustration of Rport's features, we created a Demo Rport App. Make sure to refer to it while reading this post.
Rport Features
Rport takes care of tedious background tasks, so you can focus on the actual exploration and analysis of your data. Let's introduce the core features of the package below.
Database Connectivity
If you're directly interfacing R's SQL drivers, you'll likely find yourself often using the `dbGetQuery(connection, query)` routine, meaning that you need to carry your connection object around every time you issue a database query.
Multiple Connections Handling
Using Rport in the scenario above, you'll define all your connections in a `config/database.yml` file, similarly to what you'd do in other frameworks (e.g. Rails). The difference is that with Rport you can define not only multiple environments, but also multiple connections within each environment.
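A `config/database.yml` for this setup might look roughly like the sketch below. The exact keys depend on the underlying R database driver; the hosts and names here are made up for illustration, and the `read` and `backup` connection names match the accessors used in the next snippet.

```yaml
# config/database.yml -- illustrative sketch, not a canonical Rport config
production:
  read:
    adapter: postgresql
    host: replica.example.com
    database: bi_production
    user: rport
  backup:
    adapter: postgresql
    host: backup.example.com
    database: bi_backup
    user: rport
```

Each named connection in an environment becomes its own accessor, which is what makes the per-connection `rport.*` query methods possible.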
```r
# bootstrap production environment
rport('production')

# use the handy accessor method for the `read` connection, generated by Rport
dat <- rport.read('select me from you')

# access another database and get more data:
old.dat <- rport.backup('select me from old_you')

# `dat` and `old.dat` are now `data.table` objects with results from the
# `production->read` and `production->backup` connections respectively
```
A few things are worth mentioning about this snippet:
- Rport created the `rport.read` and `rport.backup` methods magically, based on the `read` and `backup` database configurations in `config/database.yml`. Check out the Example app to see more of this.
- You can have as many database configurations as you like and combine results from all of them in a single R session. We use this feature a lot when doing adjust.io <-> apptrace work, or when we offload heavy reads to a replication server.
- Note that different configurations could also mean entirely different database servers. Nothing stops you from bringing PostgreSQL and MySQL results together in R through Rport. And you needn't care about the underlying connection mechanics, because Rport handles them for you.
- Rport works with the data.table package, and all results returned from database queries are `data.table` objects. If you're wondering why we introduced this dependency, just check out this fantastic package from the link above.
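To see why `data.table` results are convenient, here is a minimal sketch using plain data.table, independent of Rport (the sample values are made up):

```r
library(data.table)

# pretend this came back from a query accessor such as rport.read(...)
dat <- data.table(app_id = c(1L, 1L, 2L), rank = c(10, 12, 3))

# fast grouped aggregation with data.table's `by` syntax
avg.ranks <- dat[, list(avg.rank = mean(rank)), by = app_id]
```

Because every query result already is a `data.table`, this kind of grouped crunching works immediately, without converting data frames first.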
Caching Query Results
When working on scripts, you might want to load larger chunks of raw data into memory and crunch them in R (ideally with data.table) to produce your results. When just 'playing' with or exploring the raw data, you won't want to wait for an unchanged SQL query to run again every time you rerun a script to test your R code.
Rport's connection accessors allow you to cache results using R's `load` and `save` routines.
```r
rport('development')

# Read the data in memory only if not found in the cache.
dat <- rport.read('select app_id, rank from application_ranks', cache=TRUE)

# do crazy crunching on `dat`
```
This query will now only run once and subsequent executions of the script will read and return the cached R object from the file system.
To ensure that nothing bad ever happens to you because of this caching, it only works in development, and it logs clearly when reading from cache. Furthermore, it works on a per-connection basis, so the same query issued over different connections is cached separately.
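The underlying pattern can be sketched in plain R using `save()` and `load()`. This is an illustration of the idea, not Rport's actual implementation; the function name and the use of the `digest` package to key cache files are assumptions.

```r
# Hypothetical sketch of file-based query caching; not Rport's real code.
library(digest)

cached.query <- function(query, run.query, cache.dir = 'cache') {
  dir.create(cache.dir, showWarnings = FALSE)
  cache.file <- file.path(cache.dir, paste0(digest(query), '.RData'))

  if (file.exists(cache.file)) {
    load(cache.file)              # restores `result` saved below
    return(result)
  }

  result <- run.query(query)      # only hit the database on a cache miss
  save(result, file = cache.file)
  result
}
```

Keying the cache file on a hash of the SQL text is what makes an unchanged query return instantly on the next run.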
Parallel Report Compilation
Since R 2.15, the `parallel` package is part of R core. Rport provides some wrappers around that package to address specific use cases.
For example, a newsletter or report will likely consist of several independent items that could be generated in parallel for better performance.
```r
rport('production')
rport.bootstrap('parallel', cluster.size=8)

users.stats <- function(opts) {
  sql <- 'SELECT count(*) FROM users WHERE created_at >= %s'
  rport.apptrace(sprintf(sql, opts$start_date))
}

products.stats <- function(opts) {
  sql <- 'SELECT count(*) FROM products WHERE created_at >= %s'
  rport.apptrace(sprintf(sql, opts$start_date))
}

# run the components in parallel
result <- rport.parallel(
  users    = { users.stats, opts },
  products = { products.stats, opts }
)

# `result` is now a list with the results like:
# list(users=data.table(..), products=data.table(..))
```
Working with Executables
Executables are a common interface to our Rport apps. We use them to schedule cron jobs (e.g. report generation, aggregations, etc.) or to run other analytics tasks. Rport uses Rscript to create cross-platform executables.
The `rport.project.new()` initializer already created a file in the `bin/` folder for you. Rport will also automatically load all R files relevant to your project.
CLI options
Rport manages CLI options using the `optparse` package, following the convention of placing opts files under `lib/opts/my_script_name.R`. Check what `rport.project.new()` generated for you above, or the sample app, for an illustration.
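An opts file following that convention might look roughly like this. The option name is made up for illustration (chosen to match the `opts$start_date` used in the parallel example); refer to the generated files for the real pattern.

```r
# lib/opts/weekly_kpi_report.R -- illustrative sketch, not generated code
library(optparse)

option.list <- list(
  make_option('--start_date', type = 'character', default = '2012-01-01',
              help = 'Count records created on or after this date')
)

opts <- parse_args(OptionParser(option_list = option.list))
# `opts$start_date` is now available to the project's functions
```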
Logging
Rport writes a lot about what it's doing, either interactively or in log files. You can use these logs to get an idea of query and script execution times, as well as for debugging.
The convention for executables is that all output, including output from parallel workers, is `sink()`ed to `log/my_script_name.log`.
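In plain R, that redirection pattern looks roughly like this (a sketch of the convention using base R's `sink()`, not Rport's internals; the file name is illustrative):

```r
# Hypothetical sketch: redirect a script's output to a log file with sink()
log.con <- file('log/my_script.log', open = 'wt')
sink(log.con)                       # redirect regular output
sink(log.con, type = 'message')     # redirect messages and warnings too

cat('crunching weekly KPIs...\n')   # lands in log/my_script.log

# restore normal output when done
sink(type = 'message')
sink()
close(log.con)
```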
Summary
Rport is an ambitious project under ongoing development. Be sure to follow the GitHub repository for all updates.
See more
Read the blog post for more of the features.
License
This Software is licensed under the MIT License.
Copyright (c) 2012 adeven GmbH, http://www.adeven.com
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.