Metadata of all pages of a Parquet file
read_parquet_pages(file)
Data frame with columns:
file_name
: file name.
row_group
: id of the row group the page belongs to,
an integer between 0 and the number of row groups
minus one.
column
: id of the column. An integer between 0 and the
number of leaf columns minus one. Note that only leaf
columns are considered, as non-leaf columns do not
have any pages.
page_type
: DATA_PAGE, INDEX_PAGE, DICTIONARY_PAGE or DATA_PAGE_V2.
page_header_offset
: offset of the data page (its header) in the
file.
uncompressed_page_size
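: size of the page in bytes before compression. It does not
include the page header, as per the Parquet spec.
compressed_page_size
: size of the page in bytes as stored in the file, without the page header.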
crc
: integer checksum of the page, if present in the file; otherwise NA.
num_values
: number of data values in this page, including NULL (NA in R) values.
encoding
: encoding of the page. Currently possible encodings:
"PLAIN", "GROUP_VAR_INT", "PLAIN_DICTIONARY", "RLE", "BIT_PACKED", "DELTA_BINARY_PACKED", "DELTA_LENGTH_BYTE_ARRAY", "DELTA_BYTE_ARRAY", "RLE_DICTIONARY", "BYTE_STREAM_SPLIT".
definition_level_encoding
: encoding of the definition levels; see encoding for possible values.
This can be missing in V2 data pages, where they are always RLE encoded.
repetition_level_encoding
: encoding of the repetition levels; see encoding for possible values.
This can be missing in V2 data pages, where they are always RLE encoded.
data_offset
: offset of the actual data in the file.
page_header_length
: size of the page header, in bytes.
file
: Path to a Parquet file.
Reading all the page headers might be slow for large files, especially if the file has many small pages.
read_parquet_page() to read a page.
file_name <- system.file("extdata/userdata1.parquet", package = "nanoparquet")
nanoparquet:::read_parquet_pages(file_name)
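The returned data frame can be summarized with base R. The sketch below is only an illustration: `pages` is a hypothetical hand-made stand-in with a few of the columns described above, and its values are invented rather than read from a real file. With the package installed, `pages` would instead be the result of the call shown above.

```r
# Hypothetical stand-in for the result of read_parquet_pages(); the values
# are made up for illustration and do not come from a real Parquet file.
pages <- data.frame(
  row_group = c(0L, 0L, 0L, 1L, 1L),
  column = c(0L, 0L, 1L, 0L, 1L),
  page_type = c("DICTIONARY_PAGE", "DATA_PAGE", "DATA_PAGE",
                "DATA_PAGE", "DATA_PAGE"),
  compressed_page_size = c(120L, 800L, 400L, 760L, 390L),
  uncompressed_page_size = c(150L, 1600L, 400L, 1500L, 410L),
  stringsAsFactors = FALSE
)

# Count how many pages of each type there are.
page_counts <- table(pages$page_type)

# Total compressed and uncompressed bytes per leaf column
# (page headers are not included in either size).
sizes <- aggregate(
  cbind(compressed_page_size, uncompressed_page_size) ~ column,
  data = pages,
  FUN = sum
)

# Compression ratio per column: uncompressed bytes over compressed bytes.
sizes$ratio <- sizes$uncompressed_page_size / sizes$compressed_page_size

print(page_counts)
print(sizes)
```

A per-column summary like this may help spot columns that are split into many small pages, which is the case where reading all page headers tends to be slow.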