Allow Parquet column access by field_id #6128

devinrsmith · 2024-09-25T16:43:40Z

When used in combination with external schemas or catalogs (such as Iceberg, or others) where columns may be renamed, removed, and added in arbitrary combination, Parquet provides a utility to attach a field_id in the SchemaElement to support properly mapping the data:

  /** When the original schema supports field ids, this will save the
   * original field id in the parquet schema
   */
  9: optional i32 field_id;

Right now, Deephaven only supports indexing into the parquet file via "columnName" and "path"; with path being the primary key.

public interface RowGroupReader {
    /**
     * Returns the accessor to a given Column Chunk
     *
     * @param columnName the name of the column
     * @param path the full column path
     * @return the accessor to a given Column Chunk, or null if the column is not present in this Row Group
     */
    @Nullable
    ColumnChunkReader getColumnChunk(@NotNull String columnName, @NotNull List<String> path);

We should add support for indexing based on a field_id.

This is in support of #6118.

Related, it has been noted that we use the following hierarchy to access path_in_schema as the primary key to access a row group, and absent other information, use the first element from that list (in some situations) to determine the column name.

FileMetaData ->
row_groups[rgIx] : RowGroup ->
columns[colIx] : ColumnChunk ->
meta_data: ColumnMetaData ->
path_in_schema: list<string>

This seems somewhat round-a-bout and fragile, as a parquet file can actually be empty and not have any row groups.

There is explicit documentation on RowGroup.columns

  /** Metadata for each column chunk in this row group.
   * This list must have the same order as the SchemaElement list in FileMetaData.
   **/
  1: required list<ColumnChunk> columns

which means in any context where we are dealing with a ColumnChunk, we should (/could) pass along the corresponding SchemaElement.

This also means we might prefer to resolve the column name from the actual Parquet schema.

There might be special consideration we need to take for nested structs, which we don't generally support, but want to make sure we don't break downstream users who may be reading files with nested structs and explicitly excluding them.

The text was updated successfully, but these errors were encountered:

devinrsmith · 2024-09-25T18:53:01Z

Our io.deephaven.parquet.base.RowGroupReader#getColumnChunk(java.lang.String, java.util.List<java.lang.String>) smells; we should not need to do the resolution mapping per column per row group; it should only need to happen once per column.

Ie, a given List<String> path or int fieldId should map to a specific columnIndex; that columnIndex can then be used across all row groups for that column to get the appropriate ColumnChunk (RowGroup.columns[columnIndex]).

…bleLocation Fixes an inconsistency between `ParquetTools#readSingleFileTable` and `ParquetTools#readTable` in regards to whether they set the TableDefinition in the ParquetInstructions after inference. This is done as part of an effort towards deephaven#6128.

This allows the the resolution of a parquet column by field_id instead of by its "path". This is a lower-level option that will not typically be used by end-users; as such, this option has not been plumbed through to python. This feature will be used in follow-up PRs in combination with Iceberg's field-ids to improve column mappings. Fixes deephaven#6128

devinrsmith added feature request New feature or request triage parquet Related to the Parquet integration labels Sep 25, 2024

devinrsmith added this to the 0.37.0 milestone Sep 25, 2024

devinrsmith self-assigned this Sep 25, 2024

rcaudy added core Core development tasks and removed triage labels Sep 25, 2024

devinrsmith mentioned this issue Sep 27, 2024

fix: Ensure ParquetInstruction's TableDefinition is set for ParquetTableLocation #6149

Open

devinrsmith linked a pull request Sep 30, 2024 that will close this issue

feat: Allow parquet column access by field_id #6156

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow Parquet column access by field_id #6128

Allow Parquet column access by field_id #6128

devinrsmith commented Sep 25, 2024

devinrsmith commented Sep 25, 2024

Allow Parquet column access by field_id #6128

Allow Parquet column access by field_id #6128

Comments

devinrsmith commented Sep 25, 2024

devinrsmith commented Sep 25, 2024