Create a list of nanoparquet options.
parquet_options(
  class = getOption("nanoparquet.class", "tbl"),
  compression_level = getOption("nanoparquet.compression_level", NA_integer_),
  keep_row_groups = FALSE,
  num_rows_per_row_group = getOption("nanoparquet.num_rows_per_row_group", 10000000L),
  use_arrow_metadata = getOption("nanoparquet.use_arrow_metadata", TRUE),
  write_arrow_metadata = getOption("nanoparquet.write_arrow_metadata", TRUE),
  write_data_page_version = getOption("nanoparquet.write_data_page_version", 1L),
  write_minmax_values = getOption("nanoparquet.write_minmax_values", TRUE)
)
List of nanoparquet options.
class: The extra class or classes to add to data frames created in read_parquet(). By default nanoparquet adds the "tbl" class, so data frames are printed differently if the pillar package is loaded.
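As an illustrative sketch (not part of the official examples), one might pass extra tibble classes per call or set them once for the session; this assumes the nanoparquet package is attached and uses mtcars as stand-in data:

tmp <- tempfile(fileext = ".parquet")
write_parquet(mtcars, tmp)
# add "tbl_df"/"tbl" so the result prints like a tibble when pillar/tibble is loaded
read_parquet(tmp, options = parquet_options(class = c("tbl_df", "tbl")))
# or set the default once per session
options(nanoparquet.class = c("tbl_df", "tbl"))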
compression_level: The compression level in write_parquet(). NA is the default, and it selects the default compression level of each method. Inf always selects the highest possible compression level. More details:
Snappy does not support compression levels currently.
GZIP supports levels from 0 (uncompressed) and 1 (fastest) to 9 (best). The default is 6.
ZSTD allows positive levels up to 22 currently; levels 20 and above require more memory. Negative levels are also allowed: the lower the level, the faster the speed, at the cost of compression ratio. Currently the smallest level is -131072. The default level is 3.
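As a hedged sketch (assuming nanoparquet is attached and that write_parquet()'s compression argument is combined with this option as described above), requesting the best GZIP compression might look like:

tmp <- tempfile(fileext = ".parquet")
write_parquet(mtcars, tmp, compression = "gzip",
  options = parquet_options(compression_level = 9))
# equivalently, set the session-wide default
options(nanoparquet.compression_level = 9)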
keep_row_groups: This option is used when appending to a Parquet file with append_parquet(). If TRUE, then the existing row groups of the file are always kept as is and nanoparquet creates new row groups for the new data. If FALSE (the default), then the last row group of the file will be overwritten if it is smaller than the default row group size, i.e. num_rows_per_row_group.
num_rows_per_row_group: The number of rows to put into a row group, if row groups are not specified explicitly. It should be an integer scalar. Defaults to 10 million.
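A minimal sketch of the two row group options above, assuming nanoparquet is attached and that append_parquet() accepts the same options argument:

tmp <- tempfile(fileext = ".parquet")
# force small row groups (16 rows each) instead of the 10 million row default
write_parquet(mtcars, tmp,
  options = parquet_options(num_rows_per_row_group = 16L))
# keep the existing row groups untouched and create new ones for the appended rows
append_parquet(mtcars, tmp,
  options = parquet_options(keep_row_groups = TRUE))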
use_arrow_metadata: TRUE or FALSE. If TRUE, then read_parquet() and read_parquet_schema() make use of the Apache Arrow metadata to assign R classes to Parquet columns. This is currently used to detect factor columns and "difftime" columns.
If this option is FALSE:
"factor" columns are read as character vectors.
"difftime" columns are read as real numbers, in one of seconds, milliseconds, microseconds or nanoseconds; it is impossible to tell which without the Arrow metadata.
write_arrow_metadata: Whether to add the Apache Arrow types as metadata to the file in write_parquet().
write_data_page_version: Data page version to write by default. Possible values are 1 and 2. The default is 1.
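A hedged sketch combining the two writer options above (assuming nanoparquet is attached):

tmp <- tempfile(fileext = ".parquet")
# omit the Arrow type metadata and write version 2 data pages
write_parquet(mtcars, tmp, options = parquet_options(
  write_arrow_metadata = FALSE,
  write_data_page_version = 2L
))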
write_minmax_values: Whether to write minimum and maximum values per row group, for data types that support this, in write_parquet(). However, nanoparquet currently does not support minimum and maximum values for the DECIMAL, UUID and FLOAT16 logical types, nor for the BOOLEAN, BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY primitive types if they are written without a logical type. Currently the default is TRUE.
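A minimal sketch of turning the statistics off, assuming nanoparquet is attached; the comment about inspecting the result is an assumption about where the statistics would appear:

tmp <- tempfile(fileext = ".parquet")
write_parquet(mtcars, tmp,
  options = parquet_options(write_minmax_values = FALSE))
# the column chunk statistics (if any) can be inspected with read_parquet_metadata(tmp)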
if (FALSE) {
  # the effect of using Arrow metadata
  tmp <- tempfile(fileext = ".parquet")
  d <- data.frame(
    fct = as.factor("a"),
    dft = as.difftime(10, units = "secs")
  )
  write_parquet(d, tmp)
  read_parquet(tmp, options = parquet_options(use_arrow_metadata = TRUE))
  read_parquet(tmp, options = parquet_options(use_arrow_metadata = FALSE))
}