Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow Parquet column access by field_id #6128

Open
devinrsmith opened this issue Sep 25, 2024 · 1 comment · May be fixed by #6156
Open

Allow Parquet column access by field_id #6128

devinrsmith opened this issue Sep 25, 2024 · 1 comment · May be fixed by #6156
Assignees
Labels
core Core development tasks feature request New feature or request parquet Related to the Parquet integration
Milestone

Comments

@devinrsmith
Copy link
Member

When used in combination with external schemas or catalogs (such as Iceberg, or others) where columns may be renamed, removed, and added in arbitrary combination, Parquet provides a utility to attach a field_id in the SchemaElement to support properly mapping the data:

  /** When the original schema supports field ids, this will save the
   * original field id in the parquet schema
   */
  9: optional i32 field_id;

Right now, Deephaven only supports indexing into the parquet file via "columnName" and "path"; with path being the primary key.

public interface RowGroupReader {
    /**
     * Returns the accessor to a given Column Chunk
     *
     * @param columnName the name of the column
     * @param path the full column path
     * @return the accessor to a given Column Chunk, or null if the column is not present in this Row Group
     */
    @Nullable
    ColumnChunkReader getColumnChunk(@NotNull String columnName, @NotNull List<String> path);

We should add support for indexing based on a field_id.

This is in support of #6118.

Related, it has been noted that we use the following hierarchy to access path_in_schema as the primary key to access a row group, and absent other information, use the first element from that list (in some situations) to determine the column name.

FileMetaData ->
row_groups[rgIx] : RowGroup ->
columns[colIx] : ColumnChunk ->
meta_data: ColumnMetaData ->
path_in_schema: list<string>

This seems somewhat round-a-bout and fragile, as a parquet file can actually be empty and not have any row groups.

There is explicit documentation on RowGroup.columns

  /** Metadata for each column chunk in this row group.
   * This list must have the same order as the SchemaElement list in FileMetaData.
   **/
  1: required list<ColumnChunk> columns

which means in any context where we are dealing with a ColumnChunk, we should (/could) pass along the corresponding SchemaElement.

This also means we might prefer to resolve the column name from the actual Parquet schema.

There might be special consideration we need to take for nested structs, which we don't generally support, but want to make sure we don't break downstream users who may be reading files with nested structs and explicitly excluding them.

@devinrsmith devinrsmith added feature request New feature or request triage parquet Related to the Parquet integration labels Sep 25, 2024
@devinrsmith devinrsmith added this to the 0.37.0 milestone Sep 25, 2024
@devinrsmith devinrsmith self-assigned this Sep 25, 2024
@devinrsmith
Copy link
Member Author

Our io.deephaven.parquet.base.RowGroupReader#getColumnChunk(java.lang.String, java.util.List<java.lang.String>) smells; we should not need to do the resolution mapping per column per row group; it should only need to happen once per column.

Ie, a given List<String> path or int fieldId should map to a specific columnIndex; that columnIndex can then be used across all row groups for that column to get the appropriate ColumnChunk (RowGroup.columns[columnIndex]).

@rcaudy rcaudy added core Core development tasks and removed triage labels Sep 25, 2024
devinrsmith added a commit to devinrsmith/deephaven-core that referenced this issue Sep 27, 2024
…bleLocation

Fixes an inconsistency between `ParquetTools#readSingleFileTable` and `ParquetTools#readTable` in regards to whether they set the TableDefinition in the ParquetInstructions after inference.

This is done as part of an effort towards deephaven#6128.
devinrsmith added a commit to devinrsmith/deephaven-core that referenced this issue Sep 30, 2024
This allows the the resolution of a parquet column by field_id instead of by its "path". This is a lower-level option that will not typically be used by end-users; as such, this option has not been plumbed through to python. This feature will be used in follow-up PRs in combination with Iceberg's field-ids to improve column mappings.

Fixes deephaven#6128
@devinrsmith devinrsmith linked a pull request Sep 30, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core development tasks feature request New feature or request parquet Related to the Parquet integration
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants