Read a CSV file into a Spark DataFrame
spark_read_csv(sc, name, path, header = TRUE, delimiter = ",",
quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL,
options = list(), repartition = 0, memory = TRUE, overwrite = TRUE)
sc: The Spark connection.
name: Name of the table.
path: The path to the file. Needs to be accessible from the cluster. Supports "hdfs://", "s3n://", and "file://" paths.
header: Should the first row of data be used as a header? Defaults to TRUE.
delimiter: The character used to delimit each column, defaults to ",".
quote: The character used as a quote, defaults to "\"" (a double quote).
escape: The character used to escape other characters, defaults to "\\" (a backslash).
charset: The character set, defaults to "UTF-8".
null_value: The character to use for null values, defaults to NULL.
options: A list of strings with additional options.
repartition: Total number of partitions used to distribute the table, or 0 (default) to avoid partitioning.
memory: Load data eagerly into memory. Defaults to TRUE.
overwrite: Overwrite the table with the given name if it already exists. Defaults to TRUE.
Returns a reference to a Spark DataFrame / dplyr tbl.
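A minimal sketch of a typical call, assuming the sparklyr package is attached and a local Spark installation is available. The table name "mtcars_csv" and the temporary file are illustrative choices, not part of the API, and the file:// prefix assumes a Unix-like absolute path.

```r
library(sparklyr)

# Assumption: Spark is installed locally (for example via spark_install()).
sc <- spark_connect(master = "local")

# Write a small CSV file with a header row to a temporary location.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Read the file into Spark as the table "mtcars_csv" (a hypothetical name).
# The file:// prefix assumes a Unix-like absolute path.
mtcars_tbl <- spark_read_csv(
  sc,
  name = "mtcars_csv",
  path = paste0("file://", csv_path),
  header = TRUE,
  delimiter = ","
)

spark_disconnect(sc)
```

In a real session the connection would usually stay open so that the returned reference can be used with dplyr verbs before disconnecting.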
You can read data from HDFS (hdfs://), S3 (s3n://), as well as the local file system (file://).
If you are reading from a secure S3 bucket, be sure that the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables are both defined.
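One way to check the credentials before reading from S3 is sketched below. The helper aws_credentials_defined() is a hypothetical name introduced here for illustration, and the bucket path in the commented call is a placeholder; only base R functions are used for the check itself.

```r
# Hypothetical helper: TRUE only when both AWS variables are set and non-empty.
aws_credentials_defined <- function() {
  keys <- c("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
  # Sys.getenv() returns "" for variables that are not set.
  all(nzchar(Sys.getenv(keys)))
}

if (!aws_credentials_defined()) {
  message("Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY before reading from S3.")
} else {
  # Assumption: sc is an open Spark connection and the path below is a placeholder.
  # flights_tbl <- spark_read_csv(sc, name = "flights", path = "s3n://my-bucket/flights.csv")
}
```

The variables can be set for the current R session with Sys.setenv(), or defined in the environment that launches R.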
When header is FALSE, the column names are generated with a V prefix; e.g. V1, V2, ...
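The naming rule for headerless files can be tried with a sketch like the following, assuming a local Spark installation. The table name "no_header" and the two-column sample data are made up for illustration, and the file:// prefix assumes a Unix-like absolute path.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally (for example via spark_install()).
sc <- spark_connect(master = "local")

# Write a two-column CSV file without a header row.
csv_path <- tempfile(fileext = ".csv")
write.table(
  data.frame(id = 1:3, label = c("x", "y", "z")),
  csv_path,
  sep = ",", col.names = FALSE, row.names = FALSE, quote = FALSE
)

# With header = FALSE the first row is treated as data,
# so the column names are generated rather than read from the file.
no_header_tbl <- spark_read_csv(
  sc,
  name = "no_header",
  path = paste0("file://", csv_path),
  header = FALSE
)

# Inspect the generated column names.
cols <- tbl_vars(no_header_tbl)
print(cols)

spark_disconnect(sc)
```

If meaningful names are needed, the generated columns can be renamed afterwards, for example with dplyr's rename().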
Other reading and writing data: spark_read_json, spark_read_parquet, spark_write_csv, spark_write_json, spark_write_parquet