This function should work on all files, even if read_parquet() is unable to read them because of an unsupported schema, encoding, compression or other reason.
read_parquet_metadata(file, options = parquet_options())
parquet_metadata(file)
A named list with the entries below (a short inspection sketch follows the list):
file_meta_data: a data frame with file meta data:
  file_name: file name.
  version: Parquet version, an integer.
  num_rows: total number of rows.
  key_value_metadata: list column of data frames with two character
    columns called key and value. This is the key-value metadata of the
    file. Arrow stores its schema here.
  created_by: a string scalar, usually the name of the software that
    created the file.
schema: data frame, the schema of the file. It has one row for each node
(inner node or leaf node). For flat files this means one root node (inner
node), always the first one, and then one row for each "real" column. For
nested schemas, the rows are in depth-first search order. Most important
columns are:
  file_name: file name.
  name: column name.
  r_type: the R type that corresponds to the Parquet type. Might be NA if
    read_parquet() cannot read this column. See nanoparquet-types for the
    type mapping rules.
  type: data type. One of the low level data types.
  type_length: length for fixed length byte arrays.
  repetition_type: character, one of REQUIRED, OPTIONAL or REPEATED.
  logical_type: a list column, the logical types of the columns. An
    element has at least an entry called type, and potentially additional
    entries, e.g. bit_width, is_signed, etc.
  num_children: number of child nodes. Should be a non-negative integer
    for the root node, and NA for a leaf node.
$row_groups: a data frame, information about the row groups. Some
important columns:
  file_name: file name.
  id: row group id, an integer from zero to the number of row groups
    minus one.
  total_byte_size: total uncompressed size of all column data.
  num_rows: number of rows.
  file_offset: where the row group starts in the file. This is optional,
    so it might be NA.
  total_compressed_size: total byte size of all compressed (and
    potentially encrypted) column data in this row group. This is
    optional, so it might be NA.
  ordinal: ordinal position of the row group in the file, starting from
    zero. This is optional, so it might be NA. If NA, then the order of
    the row groups is as they appear in the metadata.
$column_chunks: a data frame, information about all column chunks, across
all row groups. Some important columns:
  file_name: file name.
  row_group: which row group this chunk belongs to.
  column: which leaf column this chunk belongs to. The order is the same
    as in $schema, but only leaf columns (i.e. columns with NA children)
    are counted.
  file_path: which file the chunk is stored in. NA means the same file.
  file_offset: where the column chunk begins in the file.
  type: low level Parquet data type.
  encodings: encodings used to store this chunk. It is a list column of
    character vectors of encoding names. Current possible encodings:
    "PLAIN", "GROUP_VAR_INT", "PLAIN_DICTIONARY", "RLE", "BIT_PACKED",
    "DELTA_BINARY_PACKED", "DELTA_LENGTH_BYTE_ARRAY", "DELTA_BYTE_ARRAY",
    "RLE_DICTIONARY", "BYTE_STREAM_SPLIT".
  path_in_schema: list column of character vectors, the path from the
    root node. For flat schemas it is simply the column name.
  codec: compression codec used for the column chunk. Possible values
    are: "UNCOMPRESSED", "SNAPPY", "GZIP", "LZO", "BROTLI", "LZ4",
    "ZSTD".
  num_values: number of values in this column chunk.
  total_uncompressed_size: total uncompressed size in bytes.
  total_compressed_size: total compressed size in bytes.
  data_page_offset: absolute position of the first data page of the
    column chunk in the file.
  index_page_offset: absolute position of the first index page of the
    column chunk in the file, or NA if there are no index pages.
  dictionary_page_offset: absolute position of the first dictionary page
    of the column chunk in the file, or NA if there are no dictionary
    pages.
  null_count: the number of missing values in the column chunk. It may
    be NA.
  min_value: list column of raw vectors, the minimum value of the column,
    in binary. If NULL, then it is not specified. This column is
    experimental.
  max_value: list column of raw vectors, the maximum value of the column,
    in binary. If NULL, then it is not specified. This column is
    experimental.
  is_min_value_exact: whether the minimum value is an actual value of a
    column, or a bound. It may be NA.
  is_max_value_exact: whether the maximum value is an actual value of a
    column, or a bound. It may be NA.
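For orientation, here is a minimal sketch of drilling into the returned
list, using the example file that ships with the package (see the Examples
below). It only uses the entries documented above; subsetting the data
frames with base R is one possible way to explore them, not a required
workflow.

file_name <- system.file("extdata/userdata1.parquet", package = "nanoparquet")
md <- nanoparquet::read_parquet_metadata(file_name)

# Total number of rows and the software that wrote the file.
md$file_meta_data$num_rows
md$file_meta_data$created_by

# Leaf columns (nodes with NA num_children) and the R types they map to.
# An NA r_type means read_parquet() cannot read that column.
leaf <- is.na(md$schema$num_children)
md$schema[leaf, c("name", "r_type", "type")]

# Compression codec and encodings used by each column chunk.
md$column_chunks[, c("row_group", "column", "codec")]
md$column_chunks$encodings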
file: Path to a Parquet file.
options: Options that potentially alter the default Parquet to R type
mappings, see parquet_options().
read_parquet_info() for a much shorter summary. read_parquet_schema() for
column information. read_parquet() to read, write_parquet() to write
Parquet files, nanoparquet-types for the R <-> Parquet type mappings.
file_name <- system.file("extdata/userdata1.parquet", package = "nanoparquet")
nanoparquet::read_parquet_metadata(file_name)
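Because read_parquet_metadata() also works on files that read_parquet()
cannot read, one possible pattern is to fall back to the metadata when
reading fails. The sketch below is illustrative only: the helper name
inspect_or_read() and the file path are hypothetical.

inspect_or_read <- function(path) {
  tryCatch(
    nanoparquet::read_parquet(path),
    error = function(err) {
      # read_parquet() failed, e.g. because of an unsupported encoding or
      # compression. The metadata can still be listed.
      md <- nanoparquet::read_parquet_metadata(path)
      message(
        "Cannot read ", path, ": ", conditionMessage(err), "\n",
        "The file has ", md$file_meta_data$num_rows, " rows and uses the ",
        paste(unique(md$column_chunks$codec), collapse = ", "),
        " compression codec(s)."
      )
      invisible(md)
    }
  )
}
# inspect_or_read("some-file.parquet")  # hypothetical path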