sparkwarc (version 0.1.1)

spark_read_warc: Reads a WARC File into Apache Spark

Description

Reads a WARC (Web ARChive) file into Apache Spark using sparklyr.

Usage

spark_read_warc(sc, name, path, repartition = 0L, memory = TRUE,
  overwrite = TRUE, group = FALSE, parse = FALSE, ...)

Arguments

sc

An active spark_connection.

name

The name to assign to the newly generated table.

path

The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3n://" and "file://" protocols.

repartition

The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning.

memory

Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?)

overwrite

Boolean; overwrite the table with the given name if it already exists?

group

If TRUE, group the records by WARC segment. Currently supported only for uncompressed files stored in HDFS.

parse

If TRUE, parse each WARC entry into tag, attribute, and value components rather than loading the raw content.

...

Additional arguments reserved for future use.
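
The arguments above can be combined to load and then query a parsed WARC table. The sketch below is illustrative, not part of this help page: the local master, the table name "sample_warc", and the use of a DBI query against the registered table are assumptions.

```r
library(sparklyr)
library(sparkwarc)
library(DBI)

# Assumption: a local Spark instance is sufficient for the bundled sample file
sc <- spark_connect(master = "local")

# parse = TRUE splits entries into components instead of raw content;
# repartition = 0L (the default) leaves the data unpartitioned
warc_tbl <- spark_read_warc(
  sc,
  name = "sample_warc",
  path = system.file("samples/sample.warc", package = "sparkwarc"),
  repartition = 0L,
  memory = FALSE,
  overwrite = TRUE,
  parse = TRUE
)

# The table is registered under `name`, so it can be queried with SQL
# (illustrative: counts the parsed rows)
dbGetQuery(sc, "SELECT COUNT(*) FROM sample_warc")

spark_disconnect(sc)
```

Because `memory = FALSE` skips caching, the file is re-read on each query; set `memory = TRUE` when the table will be queried repeatedly.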

Examples

# NOT RUN {
library(sparklyr)
sc <- spark_connect(master = "spark://HOST:PORT")
df <- spark_read_warc(
  sc,
  name = "sample_warc",
  path = system.file("samples/sample.warc", package = "sparkwarc"),
  repartition = 0L,
  memory = FALSE,
  overwrite = FALSE
)

spark_disconnect(sc)

# }
