GH-459: Add Variant logical type annotation #460

Open
wants to merge 4 commits into
base: master
35 changes: 35 additions & 0 deletions LogicalTypes.md
@@ -563,6 +563,41 @@ defined by the [BSON specification][bson-spec].

The sort order used for `BSON` is unsigned byte-wise comparison.

### VARIANT

`VARIANT` is used for a Variant value. It must annotate a group. The group must
contain a `binary` field named `metadata`, and a `binary` field named `value`.
Member


We have been using `BYTE_ARRAY` instead of `binary` in this doc.

Contributor


Ah, you're right. The type is `BYTE_ARRAY` in thrift but `binary` in actual type definitions.

I think `binary` is clearer, but at a minimum we should mention that the two are synonyms. How about this?

> The group must contain a field named `metadata` and a field named `value`. Both fields must have type `binary`, which is also called `BYTE_ARRAY` in the Parquet thrift definition.

The `VARIANT` annotated group can be used to store either an unshredded Variant
value, or a shredded Variant value.

* The Variant group must be annotated with the `VARIANT` logical type.
* Both fields `value` and `metadata` must be of type `binary`.
* The `metadata` field is required and must be a valid Variant metadata component,
  as defined by the [Variant binary encoding specification](VariantEncoding.md).
* When present, the `value` field must be a valid Variant value component,
  as defined by the [Variant binary encoding specification](VariantEncoding.md).
* The `value` field is required for unshredded Variant values.
* The `value` field is optional and may be null only when parts of the Variant
  value are shredded according to the [Variant shredding specification](VariantShredding.md).
* Additional fields which start with `_` (underscore) can be ignored.
Contributor


Why is this needed? None of the other types allow writing columns that should be ignored.

Contributor Author


This was meant to let writers store some additional (but redundant) metadata or values while still keeping the group a valid Variant value.

Contributor


I don't think that we want to add ignored columns. If we need to update the spec because something is missing, we should just do that directly instead of working around it with unspecified columns that only work in certain proprietary cases.


This is the expected representation of an unshredded Variant in Parquet:
```
optional group variant_unshredded (VARIANT) {
  required binary metadata;
  required binary value;
}
```

This is an example representation of a shredded Variant in Parquet:
```
optional group variant_shredded (VARIANT) {
  required binary metadata;
  optional binary value;
  optional int64 typed_value;
}
```
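
The structural rules listed above for a `VARIANT` group can be sketched as a small checker. This is only an illustration, not part of the proposed spec change: the `Field` class and `check_variant_group` function are hypothetical names, the schema model is heavily simplified, and the code assumes that a `typed_value` child is what marks a group as shredded (as in the shredded example above). It does not validate the binary contents of `metadata` or `value`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical, simplified stand-in for a Parquet schema node."""
    name: str
    repetition: str                       # "required" or "optional"
    physical_type: Optional[str] = None   # e.g. "binary"; None for a group
    logical_type: Optional[str] = None    # e.g. "VARIANT"
    children: List["Field"] = field(default_factory=list)


def check_variant_group(group: Field) -> List[str]:
    """Return the list of rule violations; an empty list means the group passes."""
    problems = []
    if group.physical_type is not None:
        problems.append("VARIANT must annotate a group")
    if group.logical_type != "VARIANT":
        problems.append("group is not annotated with VARIANT")

    by_name = {child.name: child for child in group.children}
    for name in ("metadata", "value"):
        child = by_name.get(name)
        if child is None:
            problems.append(f"missing field '{name}'")
        elif child.physical_type != "binary":
            problems.append(f"field '{name}' must be binary")

    metadata = by_name.get("metadata")
    if metadata is not None and metadata.repetition != "required":
        problems.append("field 'metadata' must be required")

    # Assumption: the presence of a `typed_value` child means the group is shredded.
    shredded = "typed_value" in by_name
    value = by_name.get("value")
    if value is not None and value.repetition != "required" and not shredded:
        problems.append("field 'value' must be required when unshredded")
    return problems


# The two example schemas from the text, expressed in the simplified model.
unshredded = Field("variant_unshredded", "optional", logical_type="VARIANT", children=[
    Field("metadata", "required", "binary"),
    Field("value", "required", "binary"),
])
shredded = Field("variant_shredded", "optional", logical_type="VARIANT", children=[
    Field("metadata", "required", "binary"),
    Field("value", "optional", "binary"),
    Field("typed_value", "optional", "int64"),
])
```

Both example groups pass this check. A group with an optional `value` and no `typed_value` would be reported, since under the stated assumption it is unshredded and `value` must then be required.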

## Nested Types

This section specifies how `LIST` and `MAP` can be used to encode nested types
8 changes: 8 additions & 0 deletions src/main/thrift/parquet.thrift
@@ -380,6 +380,12 @@ struct JsonType {
struct BsonType {
}

/**
 * Embedded Variant logical type annotation
 */
struct VariantType {
Contributor


Looks good to me.

}

/**
* LogicalType annotations to replace ConvertedType.
*
@@ -410,6 +416,7 @@ union LogicalType {
  13: BsonType BSON           // use ConvertedType BSON
  14: UUIDType UUID           // no compatible ConvertedType
  15: Float16Type FLOAT16     // no compatible ConvertedType
  16: VariantType VARIANT     // no compatible ConvertedType
}

/**
@@ -980,6 +987,7 @@ union ColumnOrder {
* ENUM - unsigned byte-wise comparison
* LIST - undefined
* MAP - undefined
* VARIANT - undefined
*
* In the absence of logical types, the sort order is determined by the physical type:
* BOOLEAN - false, true