sparkwarc (version 0.1.6)

spark_read_warc: Reads a WARC File into Apache Spark

Description

Reads a WARC (Web ARChive) file into Apache Spark using sparklyr.

Usage

spark_read_warc(
  sc,
  name,
  path,
  repartition = 0L,
  memory = TRUE,
  overwrite = TRUE,
  match_warc = "",
  match_line = "",
  parser = c("r", "scala"),
  ...
)

Arguments

sc

An active spark_connection.

name

The name to assign to the newly generated table.

path

The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3n://" and "file://" protocols.

repartition

The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning.

memory

Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?)

overwrite

Boolean; overwrite the table with the given name if it already exists?

match_warc

Include only WARC files matching this character string.

match_line

Include only lines matching this character string.

parser

Which parser implementation to use? Options are "scala" or "r" (the default).

...

Additional arguments reserved for future use.
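The filtering arguments can be combined in a single call. The sketch below is illustrative only: it assumes a local Spark connection and uses the sample WARC file bundled with the package; the `match_line` pattern `"<html"` is a hypothetical filter chosen for demonstration.

```r
library(sparklyr)
library(sparkwarc)

# Connect to a local Spark instance (assumes Spark is installed locally).
sc <- spark_connect(master = "local")

# Read the bundled sample WARC, keeping only lines that contain "<html"
# (a hypothetical pattern), using the default R parser and no repartitioning.
filtered <- spark_read_warc(
  sc,
  name = "filtered_warc",
  path = system.file(file.path("samples", "sample.warc"), package = "sparkwarc"),
  match_line = "<html",
  parser = "r",
  memory = FALSE
)

spark_disconnect(sc)
```

Because `match_line` is applied while the file is read, narrowing it can substantially reduce the size of the resulting table before any Spark transformations run.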

Examples

library(sparklyr)
library(sparkwarc)

# Connect to a local Spark instance.
sc <- spark_connect(master = "local")

# Read the sample WARC file bundled with the package into a Spark table
# named "sample_warc", without caching it in memory.
sdf <- spark_read_warc(
  sc,
  name = "sample_warc",
  path = system.file(file.path("samples", "sample.warc"), package = "sparkwarc"),
  memory = FALSE,
  overwrite = FALSE
)

spark_disconnect(sc)
