-
Notifications
You must be signed in to change notification settings - Fork 430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-459: Add Variant logical type annotation #460
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -563,6 +563,41 @@ defined by the [BSON specification][bson-spec]. | |
|
||
The sort order used for `BSON` is unsigned byte-wise comparison. | ||
|
||
### VARIANT | ||
|
||
`VARIANT` is used for a Variant value. It must annotate a group. The group must | ||
contain a `binary` field named `metadata`, and a `binary` field named `value`. | ||
The `VARIANT` annotated group can be used to store either an unshredded Variant | ||
value, or a shredded Variant value. | ||
|
||
* The Variant group must be annotated with the `VARIANT` logical type. | ||
* Both fields `value` and `metadata` must be of type `binary`. | ||
* The `metadata` field is required and must be a valid Variant metadata component, | ||
as defined by the [Variant binary encoding specification](VariantEncoding.md). | ||
* When present, the `value` field must be a valid Variant value component, | ||
as defined by the [Variant binary encoding specification](VariantEncoding.md). | ||
* The `value` field is required for unshredded Variant values. | ||
* The `value` field is optional and may be null only when parts of the Variant | ||
value are shredded according to the [Variant shredding specification](VariantShredding.md). | ||
* Additional fields which start with `_` (underscore) can be ignored. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is this needed? None of the other types allow writing columns that should be ignored. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This was desired in case there were some additional (but redundant) metadata or values we might store, and still allow it to be a valid Variant value (group). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think that we want to add ignored columns. If we need to update the spec because something is missing, we should just do that directly instead of working around it with unspecified columns that only work in certain proprietary cases. |
||
|
||
This is the expected representation of an unshredded Variant in Parquet: | ||
``` | ||
optional group variant_unshredded (VARIANT) { | ||
required binary metadata; | ||
required binary value; | ||
} | ||
``` | ||
|
||
This is an example representation of a shredded Variant in Parquet: | ||
``` | ||
optional group variant_shredded (VARIANT) { | ||
required binary metadata; | ||
optional binary value; | ||
optional int64 typed_value; | ||
} | ||
``` | ||
|
||
## Nested Types | ||
|
||
This section specifies how `LIST` and `MAP` can be used to encode nested types | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -380,6 +380,12 @@ struct JsonType { | |
struct BsonType { | ||
} | ||
|
||
/** | ||
* Embedded Variant logical type annotation | ||
*/ | ||
struct VariantType { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looks good to me. |
||
} | ||
|
||
/** | ||
* LogicalType annotations to replace ConvertedType. | ||
* | ||
|
@@ -410,6 +416,7 @@ union LogicalType { | |
13: BsonType BSON // use ConvertedType BSON | ||
14: UUIDType UUID // no compatible ConvertedType | ||
15: Float16Type FLOAT16 // no compatible ConvertedType | ||
16: VariantType VARIANT // no compatible ConvertedType | ||
} | ||
|
||
/** | ||
|
@@ -980,6 +987,7 @@ union ColumnOrder { | |
* ENUM - unsigned byte-wise comparison | ||
* LIST - undefined | ||
* MAP - undefined | ||
* VARIANT - undefined | ||
* | ||
* In the absence of logical types, the sort order is determined by the physical type: | ||
* BOOLEAN - false, true | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have been using
BYTE_ARRAY
instead ofbinary
in this doc.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you're right. The type is
BYTE_ARRAY
in thrift butbinary
in actual type definitions.I think that
binary
is more clear, but we should mention that they are synonyms at a minimum. How about this?