spark_rcpp_read_warc: Reads a WARC File using Rcpp
Description
Reads a WARC (Web ARChive) file using Rcpp.
Usage
spark_rcpp_read_warc(path, match_warc, match_line)
Arguments
path
The path to the file. Needs to be accessible from the cluster.
Supports the "hdfs://", "s3n://" and "file://" protocols.
match_warc
Include only WARC files matching this character string.
match_line
Include only lines matching this character string.
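Examples
The following is a minimal usage sketch, not taken from the package's own documentation. It assumes the function is provided by the sparkwarc package, that the hypothetical local file "sample.warc.gz" exists, and that an empty string for match_warc means "no filter"; all three are assumptions. The shape of the returned object is not described above, so the sketch only inspects it.

```r
# Hypothetical path: replace with a WARC file you actually have.
warc_path <- "sample.warc.gz"

# Only run when the package is installed and the file exists, so the
# sketch does nothing (rather than failing) in other environments.
if (requireNamespace("sparkwarc", quietly = TRUE) && file.exists(warc_path)) {
  result <- sparkwarc::spark_rcpp_read_warc(
    path = warc_path,
    match_warc = "",      # assumption: empty string applies no file-level filter
    match_line = "<html"  # keep only lines containing this string
  )
  str(result)  # inspect the result; its structure is not documented here
}
```

Narrowing match_line to a short, distinctive string keeps the result small when the archive is large.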