A Dataset can have one or more Sources. A Source contains one or more
Fragments, such as files, of a common storage location, format, and
partitioning. This function helps you construct a Source that you can
pass to open_dataset().
open_source(
path,
filesystem = c("auto", "local"),
format = c("parquet", "arrow", "ipc"),
partitioning = NULL,
allow_non_existent = FALSE,
recursive = TRUE,
...
)
path: A string file path containing data files
filesystem: A string identifier for the filesystem corresponding to path. Currently only "local" is supported.
format: A string identifier of the format of the files in path. Currently supported options are "parquet", "arrow", and "ipc" (an alias for the Arrow file format).
partitioning: One of
- A Schema, in which case the file paths relative to path will be
  parsed, and path segments will be matched with the schema fields. For
  example, schema(year = int16(), month = int8()) would create partitions
  for file paths like "2019/01/file.parquet", "2019/02/file.parquet", etc.
- A character vector that defines the field names corresponding to those
  path segments (that is, you're providing the names that would correspond
  to a Schema, but the types will be autodetected)
- A HivePartitioning or HivePartitioningFactory, as returned by
  hive_partition(), which parses explicit or autodetected fields from
  Hive-style path segments
- NULL for no partitioning
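The partitioning options above can be sketched as follows. This is a minimal illustration, not a definitive recipe: it assumes the installed arrow version exports open_source() as documented here, and the temporary directories, the column x, and the file names are hypothetical stand-ins for real data.

```r
library(arrow)

# Hypothetical layout with bare path segments:
#   <root>/2019/01/file.parquet, <root>/2019/02/file.parquet
root <- tempfile("source_demo_")
for (month in c("01", "02")) {
  dir.create(file.path(root, "2019", month), recursive = TRUE)
  write_parquet(
    data.frame(x = 1:3),
    file.path(root, "2019", month, "file.parquet")
  )
}

# A Schema: field names and types are both stated explicitly
src_schema <- open_source(
  root,
  partitioning = schema(year = int16(), month = int8())
)

# A character vector: only the names are given, types are autodetected
src_names <- open_source(root, partitioning = c("year", "month"))

# NULL: path segments are not interpreted as partition fields
src_none <- open_source(root, partitioning = NULL)

# Hypothetical Hive-style layout: <hive_root>/year=2019/month=1/file.parquet
hive_root <- tempfile("hive_demo_")
dir.create(file.path(hive_root, "year=2019", "month=1"), recursive = TRUE)
write_parquet(
  data.frame(x = 1:3),
  file.path(hive_root, "year=2019", "month=1", "file.parquet")
)

# hive_partition() with explicit fields; field names come from the
# "key=value" path segments
src_hive <- open_source(
  hive_root,
  partitioning = hive_partition(year = int16(), month = int8())
)
```

The Schema and character-vector forms rely on the position of each path segment, so they suit layouts like "2019/01/...", while the Hive form reads the field name from each "key=value" segment.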
allow_non_existent: Logical. Is path allowed to not exist? Default FALSE. See FileSelector.
recursive: Logical. Should files be discovered in subdirectories of path? Default TRUE.
...: Additional arguments passed to the FileSystem$create() method
A SourceFactory object. Pass this to open_dataset(), in a list potentially
with other SourceFactory objects, to create a Dataset.
If you only have a single Source, such as a directory containing Parquet
files, you can call open_dataset() directly. Use open_source() when you
want to combine different directories, file systems, or file formats.
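As a rough sketch of the difference, the example below opens one directory directly and then combines two directories through open_source(). It assumes an arrow version that exports open_source() as documented here; the temporary directories, the column x, and the file names are hypothetical.

```r
library(arrow)

# Two hypothetical directories of Parquet files with the same columns
dir_a <- tempfile("source_a_")
dir_b <- tempfile("source_b_")
dir.create(dir_a)
dir.create(dir_b)
write_parquet(data.frame(x = 1:3), file.path(dir_a, "part.parquet"))
write_parquet(data.frame(x = 4:6), file.path(dir_b, "part.parquet"))

# Single Source: pass the directory to open_dataset() directly
ds_single <- open_dataset(dir_a)

# Several Sources: build one SourceFactory per directory, then pass
# them to open_dataset() together in a list
ds_combined <- open_dataset(list(
  open_source(dir_a, format = "parquet"),
  open_source(dir_b, format = "parquet")
))
```

Each call to open_source() can carry its own format and partitioning arguments, which is what makes it possible to mix locations or file formats in one Dataset.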