arrow (version 0.16.0.2)

open_source: Create a Source for a Dataset

Description

A Dataset can have one or more Sources. A Source contains one or more Fragments, such as files, that share a common storage location, format, and partitioning. This function helps you construct a Source that you can pass to open_dataset().

Usage

open_source(
  path,
  filesystem = c("auto", "local"),
  format = c("parquet", "arrow", "ipc"),
  partitioning = NULL,
  allow_non_existent = FALSE,
  recursive = TRUE,
  ...
)

Arguments

path

A string path to a directory containing data files.

filesystem

A string identifier for the filesystem corresponding to path. Currently only "local" is supported.

format

A string identifier for the format of the files in path. Currently supported options are "parquet", "arrow", and "ipc" (an alias for the Arrow file format).

partitioning

One of

  • A Schema, in which case the file paths relative to path will be parsed, and path segments will be matched with the schema fields. For example, schema(year = int16(), month = int8()) would create partitions for file paths like "2019/01/file.parquet", "2019/02/file.parquet", etc.

  • A character vector that defines the field names corresponding to those path segments (that is, you provide the names that would correspond to a Schema, but the types are autodetected).

  • A HivePartitioning or HivePartitioningFactory, as returned by hive_partition(), which parses explicit or autodetected fields from Hive-style path segments.

  • NULL for no partitioning
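
To illustrate how path segments map to partition fields, here is a minimal base R sketch. It is not arrow's implementation: the helpers parse_partition_path() and parse_hive_path() are hypothetical names introduced only for this illustration, as.integer() stands in for int16() and int8(), and the Hive-style example path assumes the usual key=value directory naming.

```r
# Hypothetical helper, not part of arrow: pair the directory segments of a
# relative file path with the given partition field names.
parse_partition_path <- function(rel_path, field_names) {
  segments <- strsplit(rel_path, "/", fixed = TRUE)[[1]]
  # Drop the last segment, which is the file name.
  dirs <- segments[-length(segments)]
  stopifnot(length(dirs) == length(field_names))
  # as.integer() stands in for the int16()/int8() types of a real Schema.
  stats::setNames(as.list(as.integer(dirs)), field_names)
}

# Hypothetical helper, not part of arrow: read field names and values from
# Hive-style "key=value" directory segments. Values are kept as strings.
parse_hive_path <- function(rel_path) {
  segments <- strsplit(rel_path, "/", fixed = TRUE)[[1]]
  dirs <- segments[-length(segments)]
  kv <- strsplit(dirs, "=", fixed = TRUE)
  keys <- vapply(kv, `[`, character(1), 1)
  values <- vapply(kv, `[`, character(1), 2)
  stats::setNames(as.list(values), keys)
}

# Schema-style layout: field names come from the caller.
parts <- parse_partition_path("2019/01/file.parquet", c("year", "month"))

# Hive-style layout: field names come from the path itself.
hive_parts <- parse_hive_path("year=2019/month=1/file.parquet")
```

In the first layout the names must be supplied (as a Schema or a character vector) because the path carries only values. In the Hive-style layout each segment carries both the name and the value.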

allow_non_existent

logical: is path allowed to not exist? Default FALSE. See FileSelector.

recursive

logical: should files be discovered in subdirectories of path? Default TRUE.

...

Additional arguments passed to the FileSystem$create() method

Value

A SourceFactory object. Pass it to open_dataset(), optionally in a list with other SourceFactory objects, to create a Dataset.

Details

If you only have a single Source, such as a directory containing Parquet files, you can call open_dataset() directly. Use open_source() when you want to combine different directories, file systems, or file formats.
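
As a rough sketch of the multi-source case, the following combines a directory of Parquet files with a directory of Arrow files. The directory names are hypothetical placeholders, and the example assumes both directories hold data with compatible columns. The guard skips the calls when the installed arrow does not export open_source() or when the directories do not exist, so treat it as an outline rather than a tested recipe.

```r
# Hypothetical directories; replace them with real locations.
parquet_dir <- "data/parquet_files"
ipc_dir <- "data/arrow_files"

# open_source() is the function documented on this page. Check that the
# installed arrow exports it before calling it.
have_api <- requireNamespace("arrow", quietly = TRUE) &&
  "open_source" %in% getNamespaceExports("arrow")

if (have_api && dir.exists(parquet_dir) && dir.exists(ipc_dir)) {
  library(arrow)

  # One SourceFactory per storage location and format.
  src_parquet <- open_source(parquet_dir, format = "parquet")
  src_ipc <- open_source(ipc_dir, format = "ipc")

  # Pass the factories together, in a list, to build a single Dataset.
  ds <- open_dataset(list(src_parquet, src_ipc))
}
```

With a single directory of Parquet files, the open_source() step is unnecessary and the path can be passed to open_dataset() directly.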