Learn R Programming

⚠️There's a newer version (2.6.1) of this package.Take me there.

RAthena

The goal of the RAthena package is to provide a DBI-compliant interface to Amazon’s Athena (https://aws.amazon.com/athena/) using Boto3 SDK. This allows for an efficient, easy setup connection to Athena using the Boto3 SDK as a driver.

NOTE: Before using RAthena you must have an aws account or have access to aws account with permissions allowing you to use Athena.

Installation:

Before installing RAthena ensure that Python 3+ is installed onto your machine: https://www.python.org/downloads/. To install Boto3 either it can installed the pip command or using RAthena installation function:

pip install boto3

RAthena Method (after RAthena has been installed this method can be used)

RAthena::install_boto()

To install RAthena you can get it from CRAN with:

install.packages("RAthena")

Or to get the development version from Github with:

remotes::install_github("dyfanjones/rathena")

Connection Methods

Hard Coding

The most basic way to connect to AWS Athena is to hard-code your access key and secret access key. However this method is not recommended as your credentials are hard-coded.

library(DBI)

con <- dbConnect(RAthena::athena(),
                aws_access_key_id='YOUR_ACCESS_KEY_ID',
                aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
                s3_staging_dir='s3://path/to/query/bucket/',
                region_name='eu-west-1')

AWS Profile Name

The next method is to use profile names set up by AWS CLI or created manually in the ~/.aws directory. To create the profile names manually please refer to: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html.

Setting up AWS CLI

RAthena is compatible with AWS CLI. This allows your aws credentials to be stored and not be hard coded in your connection.

To install AWS CLI please refer to: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html, to configure AWS CLI please refer to: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html

Once AWS CLI has been set up you will be able to connect to Athena by only putting the s3_staging_dir.

Using default profile name:

library(DBI)
con <- dbConnect(RAthena::athena(),
                 s3_staging_dir = 's3://path/to/query/bucket/')

Connecting to Athena using profile name other than default.

library(DBI)
con <- dbConnect(RAthena::athena(),
                 profile_name = "your_profile",
                 s3_staging_dir = 's3://path/to/query/bucket/')

Temporary Credentials with MFA Account:

library(RAthena)
get_session_token("YOUR_PROFILE_NAME",
                  serial_number='arn:aws:iam::123456789012:mfa/user',
                  token_code = "531602",
                  set_env = TRUE)

# Connect to Athena using temporary credentials
con <- dbConnect(athena(),
                s3_staging_dir = 's3://path/to/query/bucket/')

Assuming ARN Role for connection

Another method in connecting to Athena is to use Amazon Resource Name (ARN) role.

Setting credentials in environmental variables:

library(RAthena)
assume_role(profile_name = "YOUR_PROFILE_NAME",
            role_arn = "arn:aws:sts::123456789012:assumed-role/role_name/role_session_name",
            set_env = TRUE)

# Connect to Athena using temporary credentials
con <- dbConnect(athena(),
                s3_staging_dir = 's3://path/to/query/bucket/')

Connecting to Athena directly using ARN role:

library(DBI)
 con <- dbConnect(athena(),
                  profile_name = "YOUR_PROFILE_NAME",
                  role_arn = "arn:aws:sts::123456789012:assumed-role/role_name/role_session_name",
                  s3_staging_dir = 's3://path/to/query/bucket/')

To change the duration of ARN role session please change the parameter duration_seconds. By default duration_seconds is set to 3600 seconds (1 hour).

Usage

Basic Usage

Connect to athena, and send a query and return results back to R.

library(DBI)

con <- dbConnect(RAthena::athena(),
                aws_access_key_id='YOUR_ACCESS_KEY_ID',
                aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
                s3_staging_dir='s3://path/to/query/bucket/',
                region_name='eu-west-1')

res <- dbExecute(con, "SELECT * FROM one_row")
dbFetch(res)
dbClearResult(res)

To retrieve query in 1 step.

dbGetQuery(con, "SELECT * FROM one_row")

Intermediate Usage

To create a tables in athena, dbExecute will send the query to athena and wait until query has been executed. This makes it and idea method to create tables within athena.

query <- 
  "CREATE EXTERNAL TABLE impressions (
      requestBeginTime string,
      adId string,
      impressionId string,
      referrer string,
      userAgent string,
      userCookie string,
      ip string,
      number string,
      processId string,
      browserCookie string,
      requestEndTime string,
      timers struct<modelLookup:string, requestTime:string>,
      threadId string,
      hostname string,
      sessionId string)
  PARTITIONED BY (dt string)
  ROW FORMAT  serde 'org.apache.hive.hcatalog.data.JsonSerDe'
      with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
  LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/' ;"
  
dbExecute(con, query)

RAthena has 2 extra function to return extra information around Athena tables: dbGetParitiions and dbShow

dbGetPartitions will return all the partitions (returns data.frame):

RAthena::dbGetPartition(con, "impressions")

dbShow will return the table’s ddl, so you will able to see how the table was constructed in Athena (returns SQL character):

RAthena::dbShow(con, "impressions")

Advanced Usage

library(DBI)
con <- dbConnect(RAthena::athena(),
                 s3_staging_dir = 's3://path/to/query/bucket/')

Sending data to Athena

RAthena has created a method to send data.frame from R to Athena.

# Check existing tables
dbListTables(con)
# Upload iris to Athena
dbWriteTable(con, "iris", iris, 
             partition=c("TIMESTAMP" = format(Sys.Date(), "%Y%m%d")))

# Read in iris from Athena
dbReadTable(con, "iris")

# Check new existing tables in Athena
dbListTables(con)

# Check if iris exists in Athena
dbExistsTable(con, "iris")

Please check out RAthena method for dbWriteTable for more information in how to upload data to AWS Athena and AWS S3.

For more information around how to get the most out of AWS Athena when uploading data please check out: Top 10 Performance Tuning Tips for Amazon Athena

Tidyverse Usage

Creating a connection to Athena and query and already existing table iris that was created in previous example.

library(DBI)
library(dplyr)

con <- dbConnect(RAthena::athena(),
                aws_access_key_id='YOUR_ACCESS_KEY_ID',
                aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
                s3_staging_dir='s3://path/to/query/bucket/',
                region_name='eu-west-1')
tbl(con, sql("SELECT * FROM iris"))
# Source:   SQL [?? x 5]
# Database: Athena 1.9.210 [eu-west-1/default]
   sepal_length sepal_width petal_length petal_width species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# … with more rows

dplyr provides lazy querying with allows to short hand tbl(con, sql("SELECT * FROM iris")) to tbl(con, "iris"). For more information please look at https://db.rstudio.com/dplyr/.

tbl(con, "iris")
# Source:   table<iris> [?? x 5]
# Database: Athena 1.9.210 [eu-west-1/default]
   sepal_length sepal_width petal_length petal_width species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# … with more rows

Querying Athena with profile_name instead of hard coding aws_access_key_id and aws_secret_access_key. By using profile_name extra Meta Data is returned in the query to give users extra information.

con <- dbConnect(RAthena::athena(),
                profile_name = "your_profile",
                s3_staging_dir='s3://path/to/query/bucket/')
tbl(con, "iris")) %>% 
  filter(petal_length < 1.3)
# Source:   lazy query [?? x 5]
# Database: Athena 1.9.210 [your_profile@eu-west-1/default]
   sepal_length sepal_width petal_length petal_width species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          4.7         3.2          1.3         0.2 setosa 
 2          4.3         3            1.1         0.1 setosa 
 3          5.8         4            1.2         0.2 setosa 
 4          5.4         3.9          1.3         0.4 setosa 
 5          4.6         3.6          1           0.2 setosa 
 6          5           3.2          1.2         0.2 setosa 
 7          5.5         3.5          1.3         0.2 setosa 
 8          4.4         3            1.3         0.2 setosa 
 9          5           3.5          1.3         0.3 setosa 
10          4.5         2.3          1.3         0.3 setosa 
# … with more rows
tbl(con, "iris") %>% 
  select(contains("sepal"), contains("petal"))
# Source:   lazy query [?? x 4]
# Database: Athena 1.9.210 [your_profile@eu-west-1/default]
   sepal_length sepal_width petal_length petal_width
          <dbl>       <dbl>        <dbl>       <dbl>
 1          5.1         3.5          1.4         0.2
 2          4.9         3            1.4         0.2
 3          4.7         3.2          1.3         0.2
 4          4.6         3.1          1.5         0.2
 5          5           3.6          1.4         0.2
 6          5.4         3.9          1.7         0.4
 7          4.6         3.4          1.4         0.3
 8          5           3.4          1.5         0.2
 9          4.4         2.9          1.4         0.2
10          4.9         3.1          1.5         0.1
# … with more rows

Upload data using dplyr function copy_to and compute.

library(DBI)
library(dplyr)

con <- dbConnect(RAthena::athena(),
                profile_name = "your_profile",
                s3_staging_dir='s3://path/to/query/bucket/')

Write data.frame to Athena table

copy_to(con, mtcars,
        s3_location = "s3://mybucket/data/")

Write Athena table from tbl_sql

athena_mtcars <- tbl(con, "mtcars")
mtcars_filter <- athena_mtcars %>% filter(gear >=4)

Create athena with unique table name

mtcars_filer %>% compute()

Create athena with specified name and s3 location

mtcars_filer %>% 
  compute("mtcars_filer",
          s3_location = "s3://mybucket/mtcars_filer/")

# Disconnect from Athena
dbDisconnect(con)

Work Groups

Creating work group:

library(RAthena)
library(DBI)

con <- dbConnect(RAthena::athena(),
                profile_name = "your_profile",
                encryption_option = "SSE_S3",
                s3_staging_dir='s3://path/to/query/bucket/')

create_work_group(con, "demo_work_group", description = "This is a demo work group",
                  tags = tag_options(key= "demo_work_group", value = "demo_01"))

List work groups:

list_work_groups(con)
[[1]]
[[1]]$Name
[1] "demo_work_group"

[[1]]$State
[1] "ENABLED"

[[1]]$Description
[1] "This is a demo work group"

[[1]]$CreationTime
2019-09-06 18:51:28.902000+01:00


[[2]]
[[2]]$Name
[1] "primary"

[[2]]$State
[1] "ENABLED"

[[2]]$Description
[1] ""

[[2]]$CreationTime
2019-08-22 16:14:47.902000+01:00

Update work group:

update_work_group(con, "demo_work_group", description = "This is a demo work group update")

Return work group meta data:

get_work_group(con, "demo_work_group")
$Name
[1] "demo_work_group"

$State
[1] "ENABLED"

$Configuration
$Configuration$ResultConfiguration
$Configuration$ResultConfiguration$OutputLocation
[1] "s3://path/to/query/bucket/"

$Configuration$ResultConfiguration$EncryptionConfiguration
$Configuration$ResultConfiguration$EncryptionConfiguration$EncryptionOption
[1] "SSE_S3"



$Configuration$EnforceWorkGroupConfiguration
[1] FALSE

$Configuration$PublishCloudWatchMetricsEnabled
[1] FALSE

$Configuration$BytesScannedCutoffPerQuery
[1] 10000000

$Configuration$RequesterPaysEnabled
[1] FALSE


$Description
[1] "This is a demo work group update"

$CreationTime
2019-09-06 18:51:28.902000+01:00

Connect to Athena using work group:

con <- dbConnect(RAthena::athena(),
                profile_name = "your_profile",
                work_group = "demo_work_group")

Delete work group:

delete_work_group(con, "demo_work_group")

Similar Projects

Python:

R:

awsathena and metis family of packages are currently used in production every day to analyze petabytes of internet scan and honeypot data.

Comparison:

The reason why RAthena stands slightly apart from AWR.Athena is that AWR.Athena uses the Athena JDBC drivers and RAthena uses the Python AWS SDK Boto3. The ultimate goal is to provide an extra method for R users to interface with AWS Athena. As pyAthena is the most similar project, this project has used an appropriate name to reflect this … RAthena.

Copy Link

Version

Install

install.packages('RAthena')

Monthly Downloads

792

Version

1.7.0

License

MIT + file LICENSE

Issues

Pull Requests

Stars

Forks

Maintainer

Dyfan Jones

Last Published

January 7th, 2020

Functions in RAthena (1.7.0)

dbGetPartition

Athena table partitions
dbDisconnect

Disconnect (close) an Athena connection
dbExistsTable

Does Athena table exist?
dbIsValid

Is this DBMS object still valid?
dbListFields

List Field names of Athena table
dbGetInfo

Get DBMS metadata
dbFetch

Fetch records from previously executed query
db_desc

S3 implementation of db_desc for Athena
db_save_query

S3 implementation of db_save_query for Athena
dbGetQuery

Send query, retrieve results and then clear result set
dbRemoveTable

Remove table from Athena
dbGetTables

List Athena Schema, Tables and TableTypes
dbColumnInfo

Information about result types
dbShow

Show Athena table's DDL
dbHasCompleted

Completion status
dbConnect,AthenaDriver-method

Connect to Athena using python's sdk boto3
sql_translate_env

AWS Athena backend dbplyr
work_group

Athena Work Groups
db_copy_to

S3 implementation of db_copy_to for Athena
sqlData

Converts data frame into suitable format to be uploaded to Athena
sqlCreateTable

Creates query to create a simple Athena table
db_compute

S3 implementation of db_compute for Athena
dbDataType,AthenaDriver,ANY-method

Determine SQL data type of object
dbQuote

Quote Identifiers
dbListTables

List Athena Tables
session_token

Get Session Tokens for Boto3 Connection
install_boto

Install Amazon SDK boto3 for Athena connection
AthenaConnection

Athena Connection Methods
RAthena_options

A method to change RAthena backend file parser.
athena

Athena Driver
dbClearResult

Clear Results
assume_role

Assume AWS ARN Role
AthenaWriteTables

Convenience functions for reading/writing DBMS tables
RAthena-package

RAthena: a DBI interface into Athena using Boto3 SDK
AthenaDriver

Athena Driver Methods
Query

Execute a query on Athena