Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Format] Add Opaque canonical extension type #41823

Closed
wants to merge 27 commits into from

Conversation

lidavidm
Copy link
Member

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels May 25, 2024
@lidavidm
Copy link
Member Author

Updated NA -> Null, add "vendor_name"

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for proposing this and putting together all of the language! I am excited to implement this in any drivers if/when it is available.

Is there any desire to include any other information that the producer has available? If some future Parquet standard allows for extension types it might make sense for the Parquet reader to pass on the extension metadata somehow (or maybe that should only be considered when the feature is first requested, which might be never?)

Comment on lines 333 to 337
- The PostgreSQL ``polygon`` type may be represented as Other[binary] with
metadata ``{"type_name": "polygon", "vendor_name": "PostgreSQL"}``.
- The PostgreSQL ``point`` type may be represented as
Other[fixed_size_binary[16]] with metadata
``{"type_name": "point", "vendor_name": "PostgreSQL"}``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a more realistic example might be using the name "geometry", since this is what the PostGIS extension type would give you here (I'm not aware of widespread usage of the point or polygon types, although they are certainly fine for illustration purposes). apache/arrow-adbc#546 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the example.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels May 28, 2024
@lidavidm
Copy link
Member Author

Is there any desire to include any other information that the producer has available? If some future Parquet standard allows for extension types it might make sense for the Parquet reader to pass on the extension metadata somehow (or maybe that should only be considered when the feature is first requested, which might be never?)

In that case, wouldn't it just get mapped as a different extension type? I suppose I can see this being used as a generic "Parquet gave an extension type that I don't know how to deal with". I think we can add a new field at that point.

@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels May 29, 2024
@jorisvandenbossche
Copy link
Member

Thoughts on using "Unknown" instead of "Other" ?

It's not entirely clear to me what the motivation is to bucket those other types together in a single arrow.other extension type, instead of just letting ADBC drivers define specific extension types for types that don't match to one of the arrow types (eg the ADBC postgres driver could annotate a binary field with adbc_driver_postgresql.polygon or adbc.postgresql.polygon when encountering a polygon type)

@lidavidm
Copy link
Member Author

The set of unknown types isn't a closed set.

@paleolimbot
Copy link
Member

Thoughts on using "Unknown" instead of "Other" ?

I recently was looking into this concept in Parquet and apparently they use the term "undefined" for this there. I think "unknown" best communicates this concept (as in, this is not a type that is known to the implementation); however, I don't have strong feelings about the name.

It's not entirely clear to me what the motivation is to bucket those other types together in a single arrow.other extension

For something like nanoarrow this might work; however, in Arrow C++ the extension type information is aggressively dropped, so it's very possible that that a pyarrow consumer would never be able to query the extension name to get the relevant information (what the type name is and where it came from). One could argue that also maybe Arrow C++ should stop aggressively dropping extension information, but even with that I think that this extension type is a better solution than an extension name that a database might not even know that it has until it receives a query.

@jorisvandenbossche
Copy link
Member

The set of unknown types isn't a closed set.

But also this extension type is not a closed set, as you can put whatever you want in the type_name and vendor_name metadata?

For something like nanoarrow this might work; however, in Arrow C++ the extension type information is aggressively dropped

Then we should maybe consider if that behaviour in C++ is practical, otherwise this "arrow.other" type feels like a workaround for a limitation in the C++ API.
In the past I had considered whether we (C++/pyarrow) should have a generic UnknownExtensionType subclass that is used for any extension type that is not recognized (i.e. for which the name is not registered). See #22572 (however, doing that now with dropping of the metadata in the schema's field metadata (as we do for registered types) would be a breaking change).

@jorisvandenbossche
Copy link
Member

To give a concrete example, to try to clarify:

  • GDAL reads data from a file that contains a column with WKB values and returns Arrow data with a column of binary type with the "ogc.wkb" annotation
    • Should GDAL instead also use the "arrow.other" type?
  • The ADBC postgresql driver reads data from a postgres table that has a column with the polygon type, and returns Arrow data with a column of binary type with the "arrow.other" annotation (and type_name "polygon" in the metadata)
    • Why would the driver not return a custom extension type?

In both cases, a producer of Arrow data is returning data with a data type that is not natively supported. So from an Arrow implementation (like Arrow C++) point of view, both cases seem equivalent. But so why would we treat the one case differently (assuming we add explicit support for "arrow.other" extension type) than the other?

Maybe what I am after is some better guidelines for when data producers should use the "arrow.other" type vs a custom extension type name.

@lidavidm
Copy link
Member Author

But also this extension type is not a closed set, as you can put whatever you want in the type_name and vendor_name metadata?

Then you risk colliding with an actual extension type, and you aren't communicating that the intermediate system doesn't know how to interpret the type. Part of the use case is just in catalogs like Flight SQL where there isn't necessarily concrete data in play.

GDAL wouldn't use it for types it knows about.

I would argue the typical extension type is to claim support for a type that is not part of Arrow's type system with some particular semantics (UUID, or geometry, etc.). "Other" is different, it is to explicitly disclaim support while also avoiding a hard error or silent data loss. See the examples already written in the proposal.

@paleolimbot
Copy link
Member

otherwise this "arrow.other" type feels like a workaround for a limitation in the C++ API.

I think they are considering two different scenarios: an "extension type" is perhaps something that has a definition or specification somewhere, whereas when the Postgres driver encounters a type that it didn't know about at compile time, all ADBC has is const char* typname and const char* vendor_name and it needs a way to pass that on. There is no guarantee regarding the content of the vendor name or the type name, so you might need something like adbc:::PostgreSQL:::Some.Type in the off chance one of the components contained a period.

I would also like to see pyarrow flag unknown extension types, but using that system for this use-case feels very hack-y to me.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented May 30, 2024

While I certainly understand the distinction of it being the intermediate system not knowing how to interpret the type (where in the example of GDAL the type is of course definitly known to GDAL), the "not knowing" part is bit fluid.

One of the examples given is the PostgreSQL polygon type, which AFAIK is a built-in postgres type. The driver not knowing about this type seems to be a choice you make in the implementation (although disclaimer I am not familiar with the actual implementation). The postgres driver could perfectly know about the polygon type if you wanted? But I assume that is not done right now because there is no matching type on the Arrow side to map it to.


Something else: since it seems this mostly comes from ADBC use cases, it would also be an option to use an "adbc.other" extension type with the described semantics? That avoids requiring a specific implementation for this on the Arrow side.

@paleolimbot
Copy link
Member

The driver not knowing about this type seems to be a choice you make in the implementation

Probably a better example would be a user-defined type (type alias for a built-in or extension type) or a type from a loaded extension (which the ADBC driver would never have an opportunity to know). I think the example was changed to "geometry", which is from an extension (although we could hard-code support for PostGIS).

That avoids requiring a specific implementation for this on the Arrow side.

If extension type information wasn't dropped in multiple implementations this would probably work. Ensuring extension type registration in the presence of shared/static libraries and multiple languages is not trivial (as we know from considering this for geoarrow 🙂 ).

since it seems this mostly comes from ADBC use cases

I think this issue comes up at any place data is converted to Arrow from something else. In R we (I) invented the undocumented arrow.r.vctrs extension type to handle something like this (unknown type name but storage that we do know how to convert). That's an example where the ability to attach some extra payload would be helpful (but also one where an extension type is fine because it only applies to one binding).

@ianmcook
Copy link
Member

ianmcook commented May 31, 2024

Do we want to try to prevent haphazard use of type_name and vendor_name?

Without some attempt to canonicalize well-known pairs of type and vendor, we could lose opportunities for interoperability. For example one developer uses PostgreSQL, another uses pgsql, another uses postgres, and so on.

Does that matter?
EDIT: Disregard; I now understand that this is not the point of this type.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Jul 22, 2024
docs/source/format/CanonicalExtensions.rst Outdated Show resolved Hide resolved

An example of the extension metadata would be::

{"type_name": "geometry", "vendor_name": "PostGIS"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: JSON MUST use UTF-8: https://datatracker.ietf.org/doc/html/rfc8259#section-8.1

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jul 23, 2024
Copy link
Collaborator

@vibhatha vibhatha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lidavidm for this proposal and Java component LGTM!

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look through the C++ and Python implementations (as well as the definition in CanonicalExtensions.rst). Thank you!

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Jul 23, 2024
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The format spec addition looks good to me!

For the Python bindings, looks good as well, just a few minor comments:

  • We probably want to add OpaqueScalar to pyarrow/__init__.py as well? (although I see we forgot this for the tensor type, which I assume you have been imitating)
  • Can you add the three new classes to test_extension_type_constructor_errors in test_misc.py?
  • Can you list the new type in the python API docs (docs/source/python/api/datatypes.rst, docs/source/python/api/arrays.rst
    )? See eg #39652 for a previous PR adding a new type where this is added to the docs.

(FWIW I think we ideally would have a separate PR(s) for the implementations, so this Format PR (and what we vote on) is just the format change)

o.VendorName == rhs.VendorName
}

type OpaqueArray struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: godoc comment

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Jul 24, 2024
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General LGTM!

std::string OpaqueType::ToString(bool show_metadata) const {
std::stringstream ss;
ss << "extension<" << this->extension_name()
<< "[storage_type=" << storage_type_->ToString() << ", type_name=" << type_name_
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

show_metadata passing into storage_type_->ToString()?

@@ -22,6 +22,7 @@
#include "arrow/compute/kernels/scalar_cast_internal.h"
#include "arrow/compute/kernels/util_internal.h"
#include "arrow/scalar.h"
#include "arrow/type_fwd.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why type_fwd is included?

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor nits from a review of the python.

python/pyarrow/array.pxi Outdated Show resolved Hide resolved
python/pyarrow/types.pxi Outdated Show resolved Hide resolved
python/pyarrow/types.pxi Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting review Awaiting review awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels Jul 24, 2024
Copy link
Member

@danepitkin danepitkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java LGTM!

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review awaiting merge Awaiting merge labels Jul 25, 2024
@lidavidm
Copy link
Member Author

I've split this into four issues/PRs: #43453

Thanks all for the feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.