This function allows you to write a dataset. By writing to more efficient binary storage formats, and by specifying relevant partitioning, you can make it much faster to read and query.
write_dataset(
  dataset,
  path,
  format = c("parquet", "feather", "arrow", "ipc"),
  partitioning = dplyr::group_vars(dataset),
  basename_template = paste0("part-{i}.", as.character(format)),
  hive_style = TRUE,
  ...
)
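As a brief illustrative sketch (mtcars is the built-in example data frame; "mtcars_ds" is an arbitrary output directory): write a partitioned Parquet dataset and open it again as a Dataset.

library(arrow)

# Write mtcars as Parquet, creating one subdirectory per distinct value of cyl.
write_dataset(mtcars, "mtcars_ds", format = "parquet", partitioning = "cyl")

# Open the result lazily; queries can then skip partitions they do not need.
ds <- open_dataset("mtcars_ds")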
dataset: a Dataset, RecordBatch, Table, arrow_dplyr_query, or data.frame. If an arrow_dplyr_query or grouped_df, schema and partitioning will be taken from the result of any select() and group_by() operations done on the dataset. filter() queries will be applied to restrict written rows. Note that select()-ed columns may not be renamed. See the sketch below.
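For instance, this sketch (using mtcars, dplyr, and an arbitrary output directory) relies on that behaviour: filter() restricts the written rows and the group_by() column becomes the partition key.

library(arrow)
library(dplyr)

mtcars %>%
  filter(mpg > 20) %>%   # only rows with mpg > 20 are written
  group_by(cyl) %>%      # cyl becomes the default partitioning column
  write_dataset("mtcars_subset")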
path: string path, URI, or SubTreeFileSystem referencing a directory to write to (the directory will be created if it does not exist). See the sketch below.
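As a sketch (the bucket name and key prefix are placeholders, and this assumes an Arrow build with S3 support plus valid credentials), a SubTreeFileSystem such as the one returned by s3_bucket() can be passed instead of a local path:

library(arrow)

bucket <- s3_bucket("my-example-bucket")   # SubTreeFileSystem rooted at the bucket
write_dataset(mtcars, bucket$path("datasets/mtcars"), format = "parquet")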
format: a string identifier of the file format. Default is to use "parquet" (see FileFormat).
partitioning: a Partitioning or a character vector of columns to use as partition keys (to be written as path segments). Default is to use the current group_by() columns. See the sketch below.
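As a sketch (the output directory and file names are illustrative), passing several column names partitions by each in turn:

library(arrow)

write_dataset(mtcars, "mtcars_by_cyl_gear", partitioning = c("cyl", "gear"))

# With the default hive_style = TRUE this produces paths such as
#   mtcars_by_cyl_gear/cyl=4/gear=3/part-0.parquet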
basename_template: string template for the names of files to be written. Must contain "{i}", which will be replaced with an autoincremented integer to generate basenames of datafiles. For example, "part-{i}.feather" will yield "part-0.feather", "part-1.feather", and so on.
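For example, in this sketch (arbitrary output directory) the Feather files are named chunk-0.feather, chunk-1.feather, and so on:

library(arrow)

write_dataset(
  mtcars,
  "mtcars_feather",
  format = "feather",
  partitioning = "cyl",
  basename_template = "chunk-{i}.feather"
)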
hive_style: logical: write partition segments as Hive-style (key1=value1/key2=value2/file.ext) or as just bare values. Default is TRUE. See the sketch below.
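As a sketch (directory and file names are illustrative), compare the two layouts:

library(arrow)

# Hive-style (default): mtcars_hive/cyl=4/part-0.parquet, cyl=6/..., cyl=8/...
write_dataset(mtcars, "mtcars_hive", partitioning = "cyl")

# Bare values: mtcars_bare/4/part-0.parquet, 6/..., 8/...
write_dataset(mtcars, "mtcars_bare", partitioning = "cyl", hive_style = FALSE)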
...: additional format-specific arguments. For available Parquet options, see write_parquet(). The available Feather options are:

use_legacy_format: logical: write data formatted so that Arrow libraries versions 0.14 and lower can read it. Default is FALSE. You can also enable this by setting the environment variable ARROW_PRE_0_15_IPC_FORMAT=1.

metadata_version: a string like "V5" or the equivalent integer indicating the Arrow IPC MetadataVersion. Default (NULL) will use the latest version, unless the environment variable ARROW_PRE_1_0_METADATA_VERSION=1 is set, in which case it will be V4.

codec: a Codec which will be used to compress body buffers of written files. Default (NULL) will not compress body buffers.

null_fallback: character to be used in place of missing values (NA or NULL) when using Hive-style partitioning. See hive_partition().
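For example (a sketch; output directories are arbitrary, and whether a particular compression codec is available depends on how your Arrow library was built):

library(arrow)

# Parquet-specific option passed through ...; see write_parquet() for the full set.
write_dataset(mtcars, "mtcars_zstd", format = "parquet", compression = "zstd")

# Feather-specific option: compress body buffers with a Codec.
write_dataset(
  mtcars,
  "mtcars_feather_zstd",
  format = "feather",
  codec = Codec$create("zstd")
)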
The input dataset, invisibly.