-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41810: [C++] Support cast kernel from (dense or sparse) union to (large) string #41827
base: main
Are you sure you want to change the base?
Conversation
|
BuilderType builder(input.type->GetSharedPtr(), ctx->memory_pool()); | ||
RETURN_NOT_OK(builder.Reserve(input.length)); | ||
|
||
for (int64_t i = 0; i < input.length; ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's a better way to implement this by expand StringFormatter
(include other nested types)? @felipecrv
In this way, we can unify it with other type's implementations and shield the logic of converting strings in the current file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review and the insightful suggestions.
That makes sense, and I'll consider applying that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this issue is related: #41831.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this way, we can unify it with other type's implementations and shield the logic of converting strings in the current file.
True, but that can prevent optimizations in the future. The approach of taking a scalar function and turning it into an array function by mapping —array::map(scalar_function: scalar -> scalar) -> array
— is appealing but prevents vectorization techniques.
UPDATE: that's what we will do here because the set of unions and their parametrizations is infinite, but StringFormatter<MonthIntervalType>
is not the way to go because it would have to switch
on the type for every invocation of the formatter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A string formatter for unions should allocate a vector of string formatters that can do virtual dispatching (and deal with nesting themselves as well). StringFormatter<T>
performs static dispatch which allows loop-specialization for the non-nested types. But for nested types we will need to setup a vector of VirtualStringFormatter
(which is actually a tree) so that all the "switching on the type" happens at construction time (beginning of the loop) and invocations inside the loop are following the same function pointers from the vtables.
// in header
class VirtualStringFormatter {
virtual ... = 0;
};
Result<std::unique_ptr<VirtualStringFormatter>> MakeFormatter(
const std::shared_ptr<DataType>& type);
// in an anon namespace of the .cpp
// one sub-class per `Type::type`
class <T>StringFormatter : public VirtualStringFormatter {
}
// you can use templates to cover most cases delegating to StringFormatter<T>
class UnionStringFormatter : ...
This hierarchy would be similar to the builder class hierarchy.
I have been examining the code with the intention of using Not only did I review the code, but I also attempted to modify it. These modifications are related to the Legacy code such as For reference, please see: Lines 1239 to 1250 in ff9921f
The reason I am addressing these issues is fundamentally to remove the legacy In the short term, there may be code that requires improvement, but I would like to approach this with the following steps:
Even the existing implementation of For reference, please see: Lines 1309 to 1326 in ff9921f
What do you think of this approach? I believe this is a way to address each issue and improve the code accordingly. Mentioning everyone who has reviewed the cast code. cc @kou, @felipecrv, @ZhangHuiGui |
I have created an issue to add support for nested types in StringFormatter. It seems like addressing this first would be beneficial. |
I'm ok to move these issues forward with this way. |
const ArraySpan& child_span = input.child_data[union_type.child_ids()[type_id]]; | ||
|
||
std::shared_ptr<Scalar> child_scalar; | ||
auto child_index = union_type.mode() == UnionMode::DENSE ? offsets[i] : i; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a template parameter (a UnionMode::type
) so we can specialize with if constexpr
checks in the loop.
@llama90 what is the motivation for working on these kernels? It's hard to define what would be a desired string representation of nested types (something for humans to read? JSON-like notation?) and it's quite a lot of work. If you're looking for compute kernel work, may I interest you in making sure run-end encoded [1] types are handled everywhere? Scalar kernels can be fixed generically by applying the transformation to the values and keeping the same run-ends. There might be something like this for handling DICTIONARY automatically as well that you can use for inspiration. [1] https://arrow.apache.org/docs/format/Columnar.html#run-end-encoded-layout |
@felipecrv Thank you for your review. The main reason for handling this issue is to address the following problem: I have implemented However, I encountered a problem (such as in #39192 (comment)) where various languages binding C++ produced errors (e.g., Python, Ruby). Recently, I found a hint that implementing the As you mentioned, the long-term aspect of For example, the Lines 978 to 996 in 54bece3
Thank you again for the in-depth review. I feel a bit overwhelmed about how to proceed. Solving new issues is important, but I also want to finish the PRs I have already submitted. It feels like I'm not making much progress. Anyway, I will think more about it based on your review. |
OK, it's good that there is precedent in the codebase for the string conversions to be this way.
I'm sorry. Anything that touches all the Arrow types becomes very overwhelming. And when you're dealing with nested types you move away from the world where static dispatching works and you need a way to perform dynamic dispatching (either via Imagine converting an instance of |
I think I understand what you mean. Thank you for your response! If you proceed with the work on the |
Rationale for this change
Support
cast
kernel from (dense
orsparse
) union to (large) stringWhat changes are included in this PR?
cast
kerneldense union
to (large)string
sparse union
to (large)string
Are these changes tested?
Yes. It is passed by existing test cases.
Are there any user-facing changes?
No.