Allow Parquet column access by field_id #6128
Labels: core, feature request, parquet
When Parquet is used with external schemas or catalogs (such as Iceberg) where columns may be renamed, removed, and added in arbitrary combination, writers can attach a `field_id` to each `SchemaElement` so that the data can still be mapped to the correct column.
Right now, Deephaven only supports indexing into the Parquet file via "columnName" and "path", with `path` being the primary key. We should add support for indexing based on a `field_id`. This is in support of #6118.
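As a rough illustration of the requested behavior, here is a minimal sketch of resolving a column by `field_id` first and falling back to `path`. The names `ParquetColumn` and `resolve_column` are hypothetical and are not part of Deephaven's API; the fallback order is an assumption about one reasonable policy, not a decided design.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class ParquetColumn:
    """Hypothetical stand-in for a leaf column described by a Parquet schema."""
    path: Tuple[str, ...]            # mirrors path_in_schema
    field_id: Optional[int] = None   # mirrors SchemaElement.field_id, if set


def resolve_column(
    columns: Sequence[ParquetColumn],
    *,
    field_id: Optional[int] = None,
    path: Optional[Sequence[str]] = None,
) -> Optional[ParquetColumn]:
    """Return the column matching field_id, else the one matching path, else None."""
    if field_id is not None:
        matches = [c for c in columns if c.field_id == field_id]
        if len(matches) > 1:
            # Ambiguous ids cannot be mapped safely, so fail loudly.
            raise ValueError(f"field_id {field_id} is not unique in this file")
        if matches:
            return matches[0]
    if path is not None:
        wanted = tuple(path)
        for column in columns:
            if column.path == wanted:
                return column
    return None


# A file written before the catalog renamed "name" to "customer_name":
# the path no longer matches, but the field_id still does.
file_columns = [ParquetColumn(("name",), 1), ParquetColumn(("amount",), 2)]
renamed = resolve_column(file_columns, field_id=1, path=["customer_name"])
```

Keeping `path` as a fallback means files written without field ids would continue to resolve exactly as they do today.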
Relatedly, it has been noted that we go through the row group metadata to reach `path_in_schema`, which we use as the primary key for a column, and, absent other information, we use the first element of that list (in some situations) to determine the column name.
This seems roundabout and fragile, as a Parquet file can actually be empty and not have any row groups.
There is explicit documentation on `RowGroup.columns` stating that the list must have the same order as the `SchemaElement` list in `FileMetaData`, which means that in any context where we are dealing with a `ColumnChunk`, we should (or at least could) pass along the corresponding `SchemaElement`. This also means we might prefer to resolve the column name from the actual Parquet schema.
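To make the ordering argument concrete, here is a minimal sketch that derives leaf column paths from a flattened, depth-first schema list and pairs them positionally with the column chunks of a row group. The `SchemaElement` dataclass below is a simplified stand-in for the Parquet structure (only `name`, `num_children`, and `field_id`), and `leaf_elements` and `pair_chunks_with_schema` are hypothetical helper names, not existing Deephaven functions.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class SchemaElement:
    """Simplified stand-in for a Parquet SchemaElement."""
    name: str
    num_children: int = 0            # 0 marks a leaf (a primitive column)
    field_id: Optional[int] = None


Leaf = Tuple[Tuple[str, ...], SchemaElement]


def leaf_elements(schema: Sequence[SchemaElement]) -> List[Leaf]:
    """Return (path, element) for every leaf, in schema order.

    `schema` is the flattened depth-first list with the root first.
    The root's name is not included in the path.
    """
    leaves: List[Leaf] = []

    def walk(index: int, prefix: Tuple[str, ...]) -> int:
        element = schema[index]
        path = prefix + (element.name,)
        index += 1
        if element.num_children == 0:
            leaves.append((path, element))
            return index
        for _ in range(element.num_children):
            index = walk(index, path)
        return index

    index = 1
    for _ in range(schema[0].num_children):
        index = walk(index, ())
    return leaves


def pair_chunks_with_schema(
    schema: Sequence[SchemaElement], column_chunks: Sequence[Any]
) -> List[Tuple[Tuple[str, ...], SchemaElement, Any]]:
    """Attach the matching leaf SchemaElement to each column chunk by position."""
    leaves = leaf_elements(schema)
    if len(leaves) != len(column_chunks):
        raise ValueError("row group does not have one column chunk per leaf")
    return [(path, element, chunk) for (path, element), chunk in zip(leaves, column_chunks)]


schema = [
    SchemaElement("schema", num_children=2),
    SchemaElement("id", field_id=1),
    SchemaElement("point", num_children=2, field_id=2),
    SchemaElement("x", field_id=3),
    SchemaElement("y", field_id=4),
]
# Column names are available from the schema alone, even with zero row groups.
column_paths = [path for path, _ in leaf_elements(schema)]
```

Because `leaf_elements` only needs the schema, column names can be resolved for an empty file, and the paths of nested leaves (such as `point.x` here) remain visible so they can still be explicitly excluded.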
We might need to take special care with nested structs. We don't generally support them, but we want to make sure we don't break downstream users who may be reading files with nested structs and explicitly excluding them.