sparkbq v0.1.1


Google 'BigQuery' Support for 'sparklyr'

A 'sparklyr' extension package providing an integration with Google 'BigQuery'. It supports direct import/export, where records are streamed directly from/to 'BigQuery'. In addition, data may be imported/exported via intermediate data extracts on Google 'Cloud Storage'.


sparkbq: Google BigQuery Support for sparklyr


sparkbq is a sparklyr extension package providing an integration with Google BigQuery. It builds on top of spark-bigquery, which provides a Google BigQuery data source to Apache Spark.

Installation

You can install the released version of sparkbq from CRAN via

```r
install.packages("sparkbq")
```

or the latest development version through

```r
devtools::install_github("miraisolutions/sparkbq", ref = "develop")
```

Version Information

The following table provides an overview of the supported versions of Apache Spark, Scala, and Google Dataproc:

| sparkbq | spark-bigquery | Apache Spark    | Scala | Google Dataproc |
|---------|----------------|-----------------|-------|-----------------|
| 0.1.x   | 0.1.0          | 2.2.x and 2.3.x | 2.11  | 1.2.x and 1.3.x |

sparkbq is based on the Spark package spark-bigquery which is available in a separate GitHub repository.

Example Usage

```r
library(sparklyr)
library(sparkbq)
library(dplyr)

config <- spark_config()

# Connect to a local Spark instance
sc <- spark_connect(master = "local[*]", config = config)

# Set Google BigQuery default settings
bigquery_defaults(
  billingProjectId = "<your_billing_project_id>",
  gcsBucket = "<your_gcs_bucket>",
  datasetLocation = "US",
  serviceAccountKeyFile = "<your_service_account_key_file>",
  type = "direct"
)

# Read the public shakespeare sample table
# https://cloud.google.com/bigquery/public-data/
# https://cloud.google.com/bigquery/sample-tables
hamlet <-
  spark_read_bigquery(
    sc,
    name = "hamlet",
    projectId = "bigquery-public-data",
    datasetId = "samples",
    tableId = "shakespeare") %>%
  filter(corpus == "hamlet") # NOTE: predicate pushdown to BigQuery!

# Retrieve the results into a local tibble
hamlet %>% collect()

# Write the result to the "mysamples" dataset in our BigQuery (billing) project
spark_write_bigquery(
  hamlet,
  datasetId = "mysamples",
  tableId = "hamlet",
  mode = "overwrite")

# Close the Spark connection when done
spark_disconnect(sc)
```
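Since `spark_read_bigquery` returns a regular Spark DataFrame reference, further dplyr verbs compose lazily before anything is collected. As a minimal sketch (assuming the sample table's `word` and `word_count` columns), the following aggregates on the Spark side and collects only the summary:

```r
# Sketch: aggregate in Spark, then collect only the small summary result.
# Assumes the shakespeare sample table columns `word` and `word_count`.
top_words <- hamlet %>%
  group_by(word) %>%
  summarise(total = sum(word_count, na.rm = TRUE)) %>%
  arrange(desc(total)) %>%
  head(10) %>%
  collect()
```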

Authentication

When running outside of Google Cloud it is necessary to specify a service account JSON key file. The key file can be passed as parameter `serviceAccountKeyFile` to `bigquery_defaults`, or directly to `spark_read_bigquery` and `spark_write_bigquery`.
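For example, a per-call override might look as follows (a sketch; the key file path is a placeholder):

```r
# Sketch: pass the service account key file directly to spark_read_bigquery
# instead of relying on bigquery_defaults(). The path is a placeholder.
shakespeare <- spark_read_bigquery(
  sc,
  name = "shakespeare",
  projectId = "bigquery-public-data",
  datasetId = "samples",
  tableId = "shakespeare",
  serviceAccountKeyFile = "/path/to/your/service_account_keyfile.json"
)
```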

Alternatively, the environment variable `GOOGLE_APPLICATION_CREDENTIALS` can be set, e.g. `export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service_account_keyfile.json` (see https://cloud.google.com/docs/authentication/getting-started for more information). Make sure the variable is set before the R session is started.

When running on Google Cloud, e.g. on Google Cloud Dataproc, Application Default Credentials (ADC) may be used, in which case it is not necessary to specify a service account key file.


Functions in sparkbq

| Name | Description |
|------|-------------|
| `bigquery_defaults` | Google BigQuery Default Settings |
| `default_billing_project_id` | Default Google BigQuery Billing Project ID |
| `default_dataset_location` | Default Google BigQuery Dataset Location |
| `default_service_account_key_file` | Default Google BigQuery Service Account Key File |
| `default_gcs_bucket` | Default Google BigQuery GCS Bucket |
| `default_bigquery_type` | Default BigQuery import/export type |
| `spark_read_bigquery` | Reading data from Google BigQuery |
| `spark_write_bigquery` | Writing data to Google BigQuery |

Details

| Field | Value |
|-------|-------|
| Type | Package |
| URL | http://www.mirai-solutions.com, https://github.com/miraisolutions/sparkbq |
| BugReports | https://github.com/miraisolutions/sparkbq/issues |
| License | GPL-3 \| file LICENSE |
| SystemRequirements | Spark (>= 2.2.x) |
| Encoding | UTF-8 |
| LazyData | yes |
| RoxygenNote | 6.1.1 |
| NeedsCompilation | no |
| Packaged | 2019-12-18 17:03:34 UTC; simon |
| Repository | CRAN |
| Date/Publication | 2019-12-18 18:00:02 UTC |
