read_parquet_pages: Metadata of all pages of a Parquet file

Description

Metadata of all pages of a Parquet file

read_parquet_pages(file)

Data frame with columns:

file_name: file name.
row_group: id of the row group the page belongs to, an integer between 0 and the number of row groups minus one.
column: id of the column, an integer between 0 and the number of leaf columns minus one. Note that only leaf columns are considered, as non-leaf columns do not have any pages.
page_type: DATA_PAGE, INDEX_PAGE, DICTIONARY_PAGE or DATA_PAGE_V2.
page_header_offset: offset of the data page (its header) in the file.
uncompressed_page_size: size of the uncompressed page in bytes, not including the page header, as per the Parquet spec.
compressed_page_size: size of the compressed page in bytes, not including the page header.
crc: integer checksum of the page, if present in the file. NA if the file does not store one.
num_values: number of data values in this page, including NULL (NA in R) values.
encoding: encoding of the page. Currently possible encodings: "PLAIN", "GROUP_VAR_INT", "PLAIN_DICTIONARY", "RLE", "BIT_PACKED", "DELTA_BINARY_PACKED", "DELTA_LENGTH_BYTE_ARRAY", "DELTA_BYTE_ARRAY", "RLE_DICTIONARY", "BYTE_STREAM_SPLIT".
definition_level_encoding: encoding of the definition levels, see encoding for possible values. This can be missing in V2 data pages, where they are always RLE encoded.
repetition_level_encoding: encoding of the repetition levels, see encoding for possible values. This can be missing in V2 data pages, where they are always RLE encoded.
data_offset: offset of the actual data in the file.
page_header_length: size of the page header, in bytes.
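
The columns above lend themselves to simple per-column summaries. The following is a minimal sketch in base R that totals page sizes per leaf column and derives a compression ratio. The `pages` data frame here is a hand-built stand-in with made-up values (an assumption for illustration, not real output); only its column names follow the list above.

```r
# Hypothetical stand-in for the data frame returned by read_parquet_pages().
# The values are invented; only the column names follow the documentation.
pages <- data.frame(
  row_group = c(0L, 0L, 0L, 1L),
  column = c(0L, 0L, 1L, 0L),
  page_type = c("DICTIONARY_PAGE", "DATA_PAGE", "DATA_PAGE", "DATA_PAGE"),
  compressed_page_size = c(100L, 400L, 250L, 300L),
  uncompressed_page_size = c(200L, 800L, 250L, 900L),
  stringsAsFactors = FALSE
)

# Total bytes per leaf column, summed over all row groups and page types.
per_column <- aggregate(
  cbind(compressed_page_size, uncompressed_page_size) ~ column,
  data = pages,
  FUN = sum
)

# Ratio of uncompressed to compressed bytes; 1 means no size reduction.
per_column$ratio <-
  per_column$uncompressed_page_size / per_column$compressed_page_size

# Number of pages of each type per column.
pages_per_type <- table(pages$column, pages$page_type)

print(per_column)
print(pages_per_type)
```

With a real file, `pages` would instead come from the call shown in the example at the end of this page. Because both size columns exclude the page header, these totals do not account for header bytes.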

Reading all the page headers might be slow for large files, especially if the file has many small pages.

file_name <- system.file("extdata/userdata1.parquet", package = "nanoparquet")
nanoparquet:::read_parquet_pages(file_name)
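
Since reading every page header can be slow, it may be worth timing the call and checking how many pages a file has before running it on large inputs. The sketch below assumes the nanoparquet package is installed and does nothing otherwise. It reuses the example file from above, and the summary statistics it prints are an illustrative choice rather than part of the package.

```r
# Assumes the nanoparquet package is installed; skipped otherwise.
pages <- NULL
if (requireNamespace("nanoparquet", quietly = TRUE)) {
  file_name <- system.file("extdata/userdata1.parquet", package = "nanoparquet")

  # Time the read of all page headers.
  timing <- system.time(
    pages <- nanoparquet:::read_parquet_pages(file_name)
  )
  print(timing[["elapsed"]])

  # Many small pages mean many headers to read: report the number of pages
  # and their mean compressed size in bytes.
  print(nrow(pages))
  print(mean(pages$compressed_page_size))
}
```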
