You can use this schema to specify how to write out a data frame to
a Parquet file with write_parquet()
.
parquet_schema(...)
Data frame with the same columns as read_parquet_schema()
:
file_name
, name
, r_type
, type
, type_length
, repetition_type
, converted_type
, logical_type
, num_children
, scale
, precision
, field_id
.
Parquet type specifications, see below.
For backwards compatibility, you can supply a file name
here, and then parquet_schema
behaves as read_parquet_schema()
.
Special type:
"AUTO"
: this is not a Parquet type, but it tells write_parquet()
to map the R type to Parquet automatically, using the default mapping
rules.
Primitive Parquet types:
"BOOLEAN"
"INT32"
"INT64"
"INT96"
"FLOAT"
"DOUBLE"
"BYTE_ARRAY"
"FIXED_LEN_BYTE_ARRAY"
: fixed-length byte array. It needs a
type_length
parameter, an integer between 0 and 2^31-1.
Parquet logical types:
"STRING"
"ENUM"
"UUID"
"INTEGER"
: signed or unsigned integer. It needs a bit_width
and
an is_signed
parameter. bit_width
must be 8, 16, 32 or 64.
is_signed
must be TRUE
or FALSE
.
"INT"
: same as "INTEGER"
. The Parquet documentation uses "INT"
,
but the actual specification uses "INTEGER"
. Both are supported in
nanoparquet.
"DECIMAL"
: decimal number of specified scale and precision.
It needs the precision
and primitive_type
parameters. Also
supports the scale
parameter, it defaults to zero if not specified.
"FLOAT16"
"DATE"
"TIME"
: needs an is_adjusted_utc
(TRUE
or FALSE
) and a
unit
parameter. unit
must be "MILLIS"
, "MICROS"
or "NANOS"
.
"TIMESTAMP"
: needs an is_adjusted_utc
(TRUE
or FALSE
) and a
unit
parameter. unit
must be "MILLIS"
, "MICROS"
or "NANOS"
.
"JSON"
"BSON"
Logical types MAP
, LIST
and UNKNOWN
are not supported currently.
Converted types are deprecated in the Parquet specification in favor of
logical types, but parquet_schema()
accepts some converted types as a
syntactic shortcut for the corresponding logical types:
INT_8
mean list("INT", bit_width = 8, is_signed = TRUE)
.
INT_16
mean list("INT", bit_width = 16, is_signed = TRUE)
.
INT_32
mean list("INT", bit_width = 32, is_signed = TRUE)
.
INT_64
mean list("INT", bit_width = 64, is_signed = TRUE)
.
TIME_MICROS
means list("TIME", is_adjusted_utc = TRUE, unit = "MICROS")
.
TIME_MILLIS
means list("TIME", is_adjusted_utc = TRUE, unit = "MILLIS")
.
TIMESTAMP_MICROS
means list("TIMESTAMP", is_adjusted_utc = TRUE, unit = "MICROS")
.
TIMESTAMP_MILLIS
means list("TIMESTAMP", is_adjusted_utc = TRUE, unit = "MILLIS")
.
UINT_8
means list("INT", bit_width = 8, is_signed = FALSE)
.
UINT_16
means list("INT", bit_width = 16, is_signed = FALSE)
.
UINT_32
means list("INT", bit_width = 32, is_signed = FALSE)
.
UINT_64
means list("INT", bit_width = 64, is_signed = FALSE)
.
Each type might also have a repetition_type
parameter, with possible
values "REQUIRED"
, "OPTIONAL"
or "REPEATED"
. "REQUIRED"
columns
do not allow missing values. Missing values are allowed in "OPTIONAL"
columns. "REPEATED"
columns are currently not supported in
write_parquet()
.
A schema is a list of potentially named type specifications. A schema
is stored in a data frame. Each (potentially named) argument of
parquet_schema
may be a character scalar, or a list. Parameterized
types need to be specified as a list. Primitive Parquet types may be
specified as a string or a list.
parquet_schema(
c1 = "INT32",
c2 = list("INT", bit_width = 64, is_signed = TRUE),
c3 = list("STRING", repetition_type = "OPTIONAL")
)
Run the code above in your browser using DataLab