feat: Allow parquet column access by field_id #6156

devinrsmith · 2024-09-30T22:28:41Z

This allows the resolution of a parquet column by field_id instead of by its "path". This is a lower-level option that will not typically be used by end-users; as such, this option has not been plumbed through to Python. This feature will be used in follow-up PRs in combination with Iceberg's field-ids to improve column mappings.

Writing support has also been added.

Fixes #6128

malhotrashivam

First level of review, can do a more detailed review tomorrow.

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReaderImpl.java

malhotrashivam · 2024-09-30T22:52:11Z

Do verify the nightlies pass before merging.

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

This also fixes a bug where `parquetColumnNameToInstructions.put(parquetColumnName, ci);` was called without setting the parquet column name on ci and the KeyDef would blow up.

…t skip the logic when a user explicitly sets the parquet column name the same as the column name

devinrsmith · 2024-10-01T14:40:15Z

Do verify the nightlies pass before merging.

Verified.

malhotrashivam

I really like the change, minor comments.

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

malhotrashivam · 2024-10-01T15:28:18Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetSchemaReader.java

+                        // TODO: how should we handle this? Ignore?
+                        // throw new IllegalArgumentException();


I feel that this should be an error in ParquetInstructions, right when the user sets it instead of here

This whole code path is very smelly; I want to go through a larger refactoring that would alleviate the need to make these types of calls in the first place. This code path is only hit when inferring the TableDefinition, so I don't think it should be an error to set the same field id multiple times in general. We have set it up this way with parquet column names, but we shouldn't technically need to do that either - every little modelling mismatch we present is a small papercut that can lead to larger modelling problems at higher layers IMO.

I would be okay throwing an error here or silently ignoring wrt inference. Ideally, the user would be able to choose the behavior they desire. The structure of ParquetInstructions / builder makes that tedious (I wish we could redo it w/ Immutables and saner structures).

I'll change this to throw an error here, with a note that we could think about exposing an option to silently ignore if that's what the user wants.

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ParquetFileReader.java

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReaderImpl.java

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

devinrsmith · 2024-10-01T16:53:45Z

I couldn't find any resources to confirm, but this does feel incorrect to me, having two columns with same field ID. For example, if we get a field ID by Iceberg, it would expect a single column, right?

Iceberg probably mandates the uniqueness of field-ids.

Parquet doesn't have any mandates wrt that. And even the column names aren't guaranteed to be unique. I need to find the reference I found earlier that the parquet format "strongly recommends" unique column names, but it's not even a guarantee.

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnChunkReaderImpl.java

malhotrashivam · 2024-10-01T17:30:09Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/TypeInfos.java

@@ -474,6 +474,7 @@ default Type createSchemaType(
                builder = getBuilder(isRequired(columnDefinition), false, dataType);
                isRepeating = false;
            }
+            instructions.getFieldId(columnDefinition.getName()).ifPresent(builder::id);


You can skip it here, I am making the change and testing it as part of my PR here.
Or if you have already added the tests, you can copy the logic from my PR. The main difference is how we handle nested columns like lists.

The ability to write Parquet field ids doesn't necessarily need to be tied into Iceberg's usage of it. Given how simple it was here, I think we can leave it in?

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReaderImpl.java

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetSchemaReader.java

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

rcaudy · 2024-10-07T15:14:01Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

        private final String columnName;
        private String parquetColumnName;
        private String codecName;
        private String codecArgs;
        private boolean useDictionary;
+        private Integer fieldId;


What's the point of this? Seems like we don't get anything but a little bit of extra allocation for this.

The javadoc on OptionalInt specifically calls out that it is intended for return types.

    /*
     * ...
     * @apiNote
     * {@code OptionalInt} is primarily intended for use as a method return type where
     * there is a clear need to represent "no result." A variable whose type is
     * {@code OptionalInt} should never itself be {@code null}; it should always point
     * to an {@code OptionalInt} instance.
     */

IntelliJ, and likely other editors, will complain.

Immutables will also use this pattern internally when you have an object that returns OptionalInt.

I'm very heavily in favor of preferring the Java-canonical approach, especially when it comes to configuration objects which we should not really care about in terms of performance implications.
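For readers following along, the pattern being defended here (a nullable boxed field internally, `OptionalInt` only on the accessor) can be sketched as below. This is a minimal illustration, not the actual `ParquetInstructions` code; the class name `ColumnConfigSketch` and its methods are hypothetical.

```java
import java.util.OptionalInt;

// Hypothetical stand-in for a per-column configuration object; not the real ParquetInstructions class.
final class ColumnConfigSketch {
    private final String columnName;
    // Nullable boxed field: null means "no field id was set".
    private Integer fieldId;

    ColumnConfigSketch(String columnName) {
        this.columnName = columnName;
    }

    String columnName() {
        return columnName;
    }

    void setFieldId(int fieldId) {
        this.fieldId = fieldId;
    }

    // OptionalInt appears only as a return type, as its javadoc recommends.
    OptionalInt fieldId() {
        return fieldId == null ? OptionalInt.empty() : OptionalInt.of(fieldId);
    }

    public static void main(String[] args) {
        ColumnConfigSketch ci = new ColumnConfigSketch("Price");
        System.out.println(ci.fieldId().isPresent()); // no field id set yet
        ci.setFieldId(42);
        System.out.println(ci.fieldId().getAsInt());
    }
}
```

The field never holds an `OptionalInt`, so nothing of that type can be `null`, and callers still get the "no result" signal from the getter.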

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnPageReaderImpl.java

rcaudy · 2024-10-07T16:13:14Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetSchemaReader.java

-        for (final ColumnDescriptor column : schema.getColumns()) {
+        final Map<Integer, Long> fieldIdCount = schema.getColumns()
+                .stream()
+                .map(ColumnDescriptor::getPrimitiveType)


I find it weird that field ID is on the primitive type.

Good catch, yes; in the case of a list type, the field id is on the list and not on the primitive.

rcaudy · 2024-10-07T16:15:39Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetSchemaReader.java

+            throw new IllegalStateException(String.format(
+                    "Field count inconsistent with number of columns, schema.getFieldCount()=%d, schema.getColumns().size()=%d",
+                    schema.getFieldCount(), schema.getColumns().size()));


Do we really think this holds? I wonder, since "field" and "column" are distinct names.

Great callout - I've dug into the distinction between "field" and "column"; for nested types, there is 1 field and multiple columns (potentially recursive).

I've improved the code to iterate through each field with its respective starting column index.

For inference purposes, we'll fail saying "we can't handle nested types" #871. For reading purposes when the user provides a specific table definition, we'll skip over nested columns.

I suspect we could greatly improve inference if we wanted (potentially to give the user the option to continue failing or to skip inference of nested fields by default) to not fail on these cases. I also suspect it should be pretty easy to actually support nested fields, at least a single level deep, by flattening them out into the table definition.
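The field/column distinction described in this reply can be illustrated with a small sketch. It assumes a simplified model in which each top-level field only reports how many leaf columns it expands to; the `Field` type and `singleColumnFields` method are hypothetical and do not use the Parquet schema classes.

```java
import java.util.ArrayList;
import java.util.List;

final class FieldColumnWalkSketch {
    // Hypothetical model of a top-level schema field: a name plus how many leaf columns it expands to.
    static final class Field {
        final String name;
        final int numLeafColumns;

        Field(String name, int numLeafColumns) {
            this.name = name;
            this.numLeafColumns = numLeafColumns;
        }
    }

    // Returns "name@startIndex" for each single-column field; multi-column (nested) fields are skipped.
    static List<String> singleColumnFields(List<Field> fields, int totalColumns) {
        List<String> out = new ArrayList<>();
        int columnIx = 0;
        for (Field field : fields) {
            if (field.numLeafColumns == 1) {
                out.add(field.name + "@" + columnIx);
            }
            // Advance past all leaf columns of this field, nested or not.
            columnIx += field.numLeafColumns;
        }
        if (columnIx != totalColumns) {
            throw new IllegalStateException(String.format(
                    "Expected fields to cover %d columns, but they cover %d", totalColumns, columnIx));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Field> fields = List.of(new Field("id", 1), new Field("address", 3), new Field("name", 1));
        System.out.println(singleColumnFields(fields, 5)); // [id@0, name@4]
    }
}
```

Here three fields cover five columns, so comparing the field count to the column count would fail even though the schema is valid.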

rcaudy · 2024-10-07T16:36:11Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetSchemaReader.java

+                        colName = columnNames.get(0);
+                        break COL_NAME;
+                    } else if (columnNames.size() > 1) {
+                        throw new IllegalArgumentException(String.format(


I think this limitation is entirely because you didn't want to refactor the code. If you're going to argue for that, we should at least guide the user to use an updateView to achieve their goals.

Agreed. Added a comment that this could be improved with refactoring of the code.

Did you also update the error message? If we're not going to let the user do this, tell them how they can achieve the same result.

rcaudy · 2024-10-07T16:40:24Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetSchemaReader.java

+                if (mappedName != null) {
+                    colName = mappedName;
+                    break COL_NAME;
+                }
                final String legalized = legalizeColumnNameFunc.apply(


I think there may be a name legalization bug:

I think we should be using builderSupplier in the below code.

I think we should be recording any column we assign as a taken name, in order to ensure that we don't collide between a user-specified name and a legalized name.

Potentially related to #6119
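A rough sketch of the "record every assigned name as taken" idea follows. It is not the code under review: `legalize` and `assign` are hypothetical helpers, the legalization rule is deliberately simplistic, and pre-seeding the taken set with all user-specified names is an assumption made here so that the result does not depend on column order.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TakenNamesSketch {
    // Hypothetical legalizer: replaces illegal characters, then appends a numeric suffix until unused.
    static String legalize(String parquetName, Set<String> taken) {
        String base = parquetName.replaceAll("[^A-Za-z0-9_]", "_");
        String candidate = base;
        for (int i = 2; taken.contains(candidate); i++) {
            candidate = base + i;
        }
        return candidate;
    }

    // Maps each parquet column name to a table column name. Explicit (user-specified) mappings win,
    // and every name handed out, explicit or legalized, is recorded as taken.
    static Map<String, String> assign(List<String> parquetNames, Map<String, String> explicit) {
        Set<String> taken = new HashSet<>(explicit.values()); // assumption: reserve user names up front
        Map<String, String> result = new LinkedHashMap<>();
        for (String parquetName : parquetNames) {
            String mapped = explicit.get(parquetName);
            if (mapped == null) {
                mapped = legalize(parquetName, taken);
                taken.add(mapped);
            }
            result.put(parquetName, mapped);
        }
        return result;
    }

    public static void main(String[] args) {
        // "a-b" would legalize to "a_b", but the user already mapped "x" to "a_b".
        System.out.println(assign(List.of("a-b", "x"), Map.of("x", "a_b"))); // {a-b=a_b2, x=a_b}
    }
}
```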

…ield id mappings are provided

rcaudy · 2024-10-08T19:47:38Z

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ParquetFileReader.java

+import shaded.parquet.org.apache.thrift.protocol.TSimpleJSONProtocol;
+import shaded.parquet.org.apache.thrift.transport.TIOStreamTransport;
+import shaded.parquet.org.apache.thrift.transport.TTransport;


Feels a little questionable to depend on someone else's shaded packages.

rcaudy · 2024-10-08T19:54:52Z

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReader.java

     * @param fieldId the field_id to fetch
     * @return the accessor to a given Column Chunk, or null if the column is not present in this Row Group
     */
    @Nullable
-    ColumnChunkReader getColumnChunk(@NotNull String columnName, @NotNull List<String> path, @Nullable Integer fieldId);
+    ColumnChunkReader getColumnChunk(@NotNull String columnName, @NotNull List<String> defaultPath,


Document defaultPath. Is it a parquet path?

rcaudy · 2024-10-08T19:58:57Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

+         * In the case where both a field id mapping and a parquet colum name mapping is provided, the field id will
+         * take precedence over the parquet column name. This may happen in cases where the parquet file is managed by a
+         * higher-level schema that has the concept of a "field id"; for example, Iceberg. As <a href=


Change it. There's no precedence if both are present, we insist that they be consistent.

rcaudy · 2024-10-08T20:03:21Z

...s/parquet/table/src/main/java/io/deephaven/parquet/table/location/ParquetColumnLocation.java

     * @param columnChunkReaders The {@link ColumnChunkReader column chunk readers} for this location
     */
    ParquetColumnLocation(
            @NotNull final ParquetTableLocation tableLocation,
            @NotNull final String columnName,
-            @NotNull final String parquetColumnName,
+            @Nullable final String parquetColumnName,


We should check what happens if we're inferring and legalized a Parquet column name to get the Deephaven column name. I think in that case, this change is broken as-is.

rcaudy · 2024-10-08T20:11:19Z

...ns/parquet/table/src/main/java/io/deephaven/parquet/table/location/ParquetTableLocation.java

-                columnPath == null ? Collections.singletonList(parquetColumnName) : Arrays.asList(columnPath);
+        final List<String> defaultPath;
+        {
+            final String[] path = parquetColumnNameToPath.get(columnName);


There's something about this making me uncomfortable. I think there may be a buggy path if legalization is used.

rcaudy · 2024-10-08T20:28:50Z

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReaderImpl.java

+        for (String nonUniquePath : nonUniquePaths) {
+            byPath.remove(nonUniquePath);
        }
        for (Integer nonUniqueFieldId : nonUniqueFieldIds) {
-            chunkMapByFieldId.remove(nonUniqueFieldId);
-            schemaMapByFieldId.remove(nonUniqueFieldId);
+            byFieldId.remove(nonUniqueFieldId);
        }


Last wins or first wins is better than "pretend we had nothing, and just give nulls".

I wonder what pyarrow does.
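To make the two policies concrete, here is a small comparison of "drop every non-unique field id" (what the diff above does) against "first wins". The `Col` type and both methods are hypothetical simplifications; the real code maps to column chunk and schema information rather than to a path string.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DuplicateFieldIdSketch {
    // Hypothetical column stand-in: a field id and a path.
    static final class Col {
        final int fieldId;
        final String path;

        Col(int fieldId, String path) {
            this.fieldId = fieldId;
            this.path = path;
        }
    }

    // "Pretend we had nothing": any field id seen more than once is removed entirely.
    static Map<Integer, String> dropDuplicates(List<Col> cols) {
        Map<Integer, String> byFieldId = new HashMap<>();
        Set<Integer> nonUnique = new HashSet<>();
        for (Col c : cols) {
            if (byFieldId.putIfAbsent(c.fieldId, c.path) != null) {
                nonUnique.add(c.fieldId);
            }
        }
        byFieldId.keySet().removeAll(nonUnique);
        return byFieldId;
    }

    // "First wins": the first column carrying a field id keeps it; later ones are ignored.
    static Map<Integer, String> firstWins(List<Col> cols) {
        Map<Integer, String> byFieldId = new HashMap<>();
        for (Col c : cols) {
            byFieldId.putIfAbsent(c.fieldId, c.path);
        }
        return byFieldId;
    }

    public static void main(String[] args) {
        List<Col> cols = List.of(new Col(1, "a"), new Col(2, "b"), new Col(1, "c"));
        System.out.println(dropDuplicates(cols).containsKey(1)); // false: id 1 resolves to nothing
        System.out.println(firstWins(cols).get(1));              // a
    }
}
```

With the drop policy, a lookup for field id 1 finds no column at all, which is the "just give nulls" outcome objected to above.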

rcaudy · 2024-10-08T20:37:12Z

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReaderImpl.java

+            if (byFieldId != null && byPath != null) {
+                if (byFieldId != byPath) {
+                    throw new IllegalArgumentException(String.format(
+                            "For columnName=%s, providing an explicit parquet column name path (%s) and field id (%d) mapping, but they are resolving to different columns, byFieldId=[%s], byPath=[%s]",


Suggested change

-                            "For columnName=%s, providing an explicit parquet column name path (%s) and field id (%d) mapping, but they are resolving to different columns, byFieldId=[%s], byPath=[%s]",
+                            "For columnName=%s, instructions provided an explicit parquet column name path (%s) and field id (%d) mapping, but they are resolving to different columns, byFieldId=[%s], byPath=[%s]",

rcaudy · 2024-10-08T20:46:45Z

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReaderImpl.java

-            }
-            columnChunk = cc;
-            nonRequiredFields = schemaMap.get(key);
+            holder = byFieldId != null ? byFieldId : byPath;


If the user specified a field ID and we didn't find it, I'm not sure it's correct to fall back to name mapping.
https://iceberg.apache.org/spec/#schema-evolution specifies a set of rules, and we should be making sure our Parquet implementation will let our Iceberg implementation follow them.

For Iceberg support, it looks like we need:

A list of name mappings, which we fall back to if and only if the field ID was not found.

Some kind of handling for encountering multiple Parquet fields with names from the name mappings: first? last? exception?

Some kind of handling for finding a fallback field by name mappings, and determining that it does not match the expected field ID. Exception?
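One possible shape for the three requirements listed above is sketched below. It is only an illustration of this comment, not the PR's implementation and not a statement of what the Iceberg spec requires: the `Col` type and `resolve` method are hypothetical, "first match" is an assumed answer to the multiple-names question, and throwing on a field id mismatch is an assumed answer to the last one.

```java
import java.util.List;
import java.util.Optional;

final class FieldIdResolutionSketch {
    // Hypothetical column stand-in; fieldId is null when the file carries no field id for the column.
    static final class Col {
        final String name;
        final Integer fieldId;

        Col(String name, Integer fieldId) {
            this.name = name;
            this.fieldId = fieldId;
        }
    }

    static Optional<Col> resolve(List<Col> columns, int fieldId, List<String> fallbackNames) {
        // 1. Resolve by field id whenever a column carries it.
        for (Col c : columns) {
            if (c.fieldId != null && c.fieldId == fieldId) {
                return Optional.of(c);
            }
        }
        // 2. Only when the field id was not found, try the name mappings in order (assumed: first match).
        for (String name : fallbackNames) {
            for (Col c : columns) {
                if (!c.name.equals(name)) {
                    continue;
                }
                // 3. A name match that carries some other field id is treated as an error (assumed policy).
                if (c.fieldId != null) {
                    throw new IllegalArgumentException(String.format(
                            "Column %s matched by name but has field id %d, expected %d",
                            c.name, c.fieldId, fieldId));
                }
                return Optional.of(c);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Col> columns = List.of(new Col("a", 1), new Col("b", null));
        System.out.println(resolve(columns, 1, List.of("b")).get().name); // a (found by field id)
        System.out.println(resolve(columns, 7, List.of("b")).get().name); // b (fallback by name)
    }
}
```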

rcaudy · 2024-10-08T21:12:02Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetSchemaReader.java

+                // TODO(deephaven-core#871): Parquet: Support repetition level >1 and multi-column fields
+                throw new UnsupportedOperationException(
+                        String.format("Encountered unsupported multi-column field %s, has %d total columns",
+                                fieldType.getName(), numColumns));


You suggested we should maybe just start skipping nested fields. I bet we could also choose to include them, with some weird default. Like "UnprocessedField" singleton POJOs.

rcaudy · 2024-10-08T21:14:03Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetSchemaReader.java

-        if (fieldIt.hasNext() || columnDescriptorIterator.hasNext()) {
-            throw new IllegalStateException("Iterators not exhausted");
+        if (columnIx != columnDescriptors.size()) {
+            throw new IllegalStateException("Not proper size");


We can do better than this.

devinrsmith added the parquet, NoDocumentationNeeded, and ReleaseNotesNeeded labels Sep 30, 2024

devinrsmith added this to the 0.37.0 milestone Sep 30, 2024

devinrsmith requested a review from malhotrashivam September 30, 2024 22:28

devinrsmith self-assigned this Sep 30, 2024

devinrsmith requested a review from rcaudy September 30, 2024 22:28

malhotrashivam reviewed Sep 30, 2024

malhotrashivam reviewed Sep 30, 2024

Review response

65453c4

devinrsmith requested a review from malhotrashivam October 1, 2024 00:00

devinrsmith added 3 commits September 30, 2024 17:52

Cleanup ParquetInstructions.addColumnNameMapping

9fa979e

This also fixes a bug where `parquetColumnNameToInstructions.put(parquetColumnName, ci);` was called without setting the parquet column name on ci and the KeyDef would blow up.

Given statefulness we maintain around parquetColumnName, we should no…

6b58468

…t skip the logic when a user explicitly sets the parquet column name the same as the column name

Add ParquetInstructions test

3e34bfa

malhotrashivam reviewed Oct 1, 2024

Add writing support

19a6490

Review response

ce2f2b8

devinrsmith requested a review from malhotrashivam October 1, 2024 16:56

malhotrashivam reviewed Oct 1, 2024

devinrsmith added 2 commits October 1, 2024 12:18

Handle case where a parquet field has non-unique field ids

a6ed292

Ensure LIST support for field_id

35e2983

devinrsmith requested a review from malhotrashivam October 1, 2024 19:59

malhotrashivam reviewed Oct 2, 2024

review response

1a2aa69

devinrsmith requested a review from malhotrashivam October 2, 2024 18:42

malhotrashivam previously approved these changes Oct 2, 2024

malhotrashivam mentioned this pull request Oct 2, 2024

feat: Added support to write iceberg tables #5989

Open

rcaudy reviewed Oct 7, 2024

devinrsmith added 4 commits October 7, 2024 10:19

easy

057bd2c

Ensure getColumnChunk is consistent if both parquet column name and f…

1408d2a

…ield id mappings are provided

Merge remote-tracking branch 'upstream/main' into parquet-field-ids

83ea295

Add Nested parquet file testing

4e7a3b1

devinrsmith dismissed malhotrashivam’s stale review via 4e7a3b1 October 8, 2024 17:32

devinrsmith requested a review from rcaudy October 8, 2024 17:32

rcaudy reviewed Oct 8, 2024
