GH-41810: [C++] Support cast kernel from (dense or sparse) union to (large) string #41827

Open
wants to merge 9 commits into
base: main

llama90
Contributor

@llama90 llama90 commented May 25, 2024

Rationale for this change

Support cast kernel from (dense or sparse) union to (large) string

What changes are included in this PR?

  • Support cast kernel
    • from dense union to (large) string
    • from sparse union to (large) string

Are these changes tested?

Yes. The changes are covered by the existing test cases, which pass.

Are there any user-facing changes?

No.

⚠️ GitHub issue #41810 has been automatically assigned in GitHub to PR creator.

BuilderType builder(input.type->GetSharedPtr(), ctx->memory_pool());
RETURN_NOT_OK(builder.Reserve(input.length));

for (int64_t i = 0; i < input.length; ++i) {
Collaborator

@ZhangHuiGui ZhangHuiGui May 27, 2024

Maybe a better way to implement this is to expand StringFormatter (to include other nested types)? @felipecrv
That way, we can unify it with the other types' implementations and keep the string-conversion logic out of the current file.

Contributor Author

Thank you for the review and the insightful suggestions.

That makes sense, and I'll consider applying that.

Contributor Author

Also, this issue is related: #41831.

Contributor

@felipecrv felipecrv Jun 2, 2024

> That way, we can unify it with the other types' implementations and keep the string-conversion logic out of the current file.

True, but that can prevent optimizations in the future. The approach of taking a scalar function and turning it into an array function by mapping, i.e. array::map(scalar_function: scalar -> scalar) -> array, is appealing but prevents vectorization techniques.

UPDATE: that's what we will do here because the set of unions and their parametrizations is infinite, but StringFormatter<MonthIntervalType> is not the way to go because it would have to switch on the type for every invocation of the formatter.

Contributor

A string formatter for unions should allocate a vector of string formatters that can do virtual dispatching (and deal with nesting themselves as well). StringFormatter<T> performs static dispatch, which allows loop specialization for the non-nested types. But for nested types we will need to set up a vector of VirtualStringFormatter (which is actually a tree) so that all the "switching on the type" happens at construction time (before the loop starts) and invocations inside the loop follow the same function pointers from the vtables.

// in header
class VirtualStringFormatter {
  virtual ... = 0;
};

Result<std::unique_ptr<VirtualStringFormatter>> MakeFormatter(
  const std::shared_ptr<DataType>& type);

// in an anon namespace of the .cpp
// one sub-class per `Type::type`
class <T>StringFormatter : public VirtualStringFormatter {
}
// you can use templates to cover most cases delegating to StringFormatter<T>

class UnionStringFormatter : ...

This hierarchy would be similar to the builder class hierarchy.
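
For illustration only, here is a minimal, self-contained sketch of the idea in the comment above. It assumes toy stand-ins (`ToyType`, `ToyColumn`, `Kind`, `Int64Formatter`, `Utf8Formatter`, `UnionFormatter`, `FormatAll` are all hypothetical names, not Arrow APIs) and a sparse-union-like layout. The point it tries to show is that the switch on the type runs once, while the formatter tree is built, and the per-row loop only makes virtual calls. The `union{...}` output text is arbitrary and not a proposed format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for a data type and its column data (not Arrow classes).
enum class Kind { kInt64, kString, kSparseUnion };

struct ToyType {
  Kind kind;
  std::vector<ToyType> children;  // only used by kSparseUnion
};

struct ToyColumn {
  std::vector<int64_t> ints;         // values when the type is kInt64
  std::vector<std::string> strings;  // values when the type is kString
  std::vector<int8_t> child_ids;     // kSparseUnion: selected child per row
  std::vector<ToyColumn> children;   // kSparseUnion: one full-length column per child
};

class VirtualStringFormatter {
 public:
  virtual ~VirtualStringFormatter() = default;
  // Appends the text for row `i` of `col` to `out`.
  virtual void Format(const ToyColumn& col, int64_t i, std::string* out) const = 0;
};

class Int64Formatter : public VirtualStringFormatter {
 public:
  void Format(const ToyColumn& col, int64_t i, std::string* out) const override {
    out->append(std::to_string(col.ints[i]));
  }
};

class Utf8Formatter : public VirtualStringFormatter {
 public:
  void Format(const ToyColumn& col, int64_t i, std::string* out) const override {
    out->append(col.strings[i]);
  }
};

// Owns one formatter per child, so the formatter tree mirrors the type tree.
class UnionFormatter : public VirtualStringFormatter {
 public:
  explicit UnionFormatter(std::vector<std::unique_ptr<VirtualStringFormatter>> children)
      : children_(std::move(children)) {}

  void Format(const ToyColumn& col, int64_t i, std::string* out) const override {
    const auto child = static_cast<std::size_t>(col.child_ids[i]);
    assert(child < children_.size());
    out->append("union{");
    children_[child]->Format(col.children[child], i, out);  // sparse: same row index
    out->append("}");
  }

 private:
  std::vector<std::unique_ptr<VirtualStringFormatter>> children_;
};

// The only place that switches on the type; it runs once, before the loop.
std::unique_ptr<VirtualStringFormatter> MakeFormatter(const ToyType& type) {
  switch (type.kind) {
    case Kind::kInt64:
      return std::make_unique<Int64Formatter>();
    case Kind::kString:
      return std::make_unique<Utf8Formatter>();
    case Kind::kSparseUnion: {
      std::vector<std::unique_ptr<VirtualStringFormatter>> children;
      for (const ToyType& child : type.children) {
        children.push_back(MakeFormatter(child));
      }
      return std::make_unique<UnionFormatter>(std::move(children));
    }
  }
  return nullptr;  // not reached for valid Kind values
}

// Builds the formatter tree once, then formats every row through virtual calls.
std::vector<std::string> FormatAll(const ToyType& type, const ToyColumn& col,
                                   int64_t length) {
  const auto formatter = MakeFormatter(type);
  std::vector<std::string> out(static_cast<std::size_t>(length));
  for (int64_t i = 0; i < length; ++i) {
    formatter->Format(col, i, &out[static_cast<std::size_t>(i)]);
  }
  return out;
}
```

In a real implementation the leaf classes would presumably delegate to the existing StringFormatter<T>, as the comment suggests; the sketch hard-codes two leaves only to stay short.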

@github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label on May 27, 2024
@llama90
Contributor Author

llama90 commented May 27, 2024

I have been examining the code with the intention of using StringFormatter, as it seems like a reasonable choice. However, StringFormatter currently handles mostly primitive types, and it would take time to extend it to complex types such as Union, Struct, and List-like types.

I not only reviewed the code but also attempted to modify it. These modifications touch legacy code such as CastImpl.

For reference, please see:

arrow/cpp/src/arrow/scalar.cc

Lines 1239 to 1250 in ff9921f

// formattable to string
template <typename To, typename From, typename T = typename From::TypeClass,
          typename Formatter = internal::StringFormatter<T>,
          // note: Value unused but necessary to trigger SFINAE if Formatter is
          // undefined
          typename Value = typename Formatter::value_type>
typename std::enable_if_t<std::is_same<To, StringType>::value,
                          Result<std::shared_ptr<Scalar>>>
CastImpl(const From& from, std::shared_ptr<DataType> to_type) {
  return std::make_shared<StringScalar>(FormatToBuffer(Formatter{from.type.get()}, from),
                                        std::move(to_type));
}

The reason I am addressing these issues is fundamentally to remove the legacy CastTo.

In the short term, there may be code that requires improvement, but I would like to approach this with the following steps:

  1. Implement Cast Compute Kernel.
  2. Remove Legacy Cast.
  3. Extend StringFormatter to support additional types (e.g., Union, Struct, List-Like, etc.).
    • New Issues. e.g., [C++] Support StringFormatter Union Type
  4. Enable the use of StringFormatter within the Cast Compute Kernel.
    • New Issues. e.g., [C++] Enhance Cast Kernel from Union to String using StringFormatter

Even the existing implementation of CastTo (to string) does not utilize StringFormatter for complex types.

For reference, please see:

arrow/cpp/src/arrow/scalar.cc

Lines 1309 to 1326 in ff9921f

// union types to string
template <typename To>
typename std::enable_if_t<std::is_same<To, StringType>::value,
                          Result<std::shared_ptr<Scalar>>>
CastImpl(const UnionScalar& from, std::shared_ptr<DataType> to_type) {
  const auto& union_ty = checked_cast<const UnionType&>(*from.type);
  std::stringstream ss;
  const Scalar* selected_value;
  if (from.type->id() == Type::DENSE_UNION) {
    selected_value = checked_cast<const DenseUnionScalar&>(from).value.get();
  } else {
    const auto& sparse_scalar = checked_cast<const SparseUnionScalar&>(from);
    selected_value = sparse_scalar.value[sparse_scalar.child_id].get();
  }
  ss << "union{" << union_ty.field(union_ty.child_ids()[from.type_code])->ToString()
     << " = " << selected_value->ToString() << '}';
  return std::make_shared<StringScalar>(Buffer::FromString(ss.str()), std::move(to_type));
}

What do you think of this approach? I believe this is a way to address each issue and improve the code accordingly. Mentioning everyone who has reviewed the cast code.

cc @kou, @felipecrv, @ZhangHuiGui

@llama90
Contributor Author

llama90 commented May 28, 2024

I have created an issue to add support for nested types in StringFormatter. It seems like addressing this first would be beneficial.

@ZhangHuiGui
Collaborator

ZhangHuiGui commented May 29, 2024

> In the short term, there may be code that requires improvement, but I would like to approach this with the following steps:
>
>   1. Implement Cast Compute Kernel.
>   2. Remove Legacy Cast.
>   3. Extend StringFormatter to support additional types (e.g., Union, Struct, List-Like, etc.).
>     • New Issues. e.g., [C++] Support StringFormatter Union Type
>   4. Enable the use of StringFormatter within the Cast Compute Kernel.
>     • New Issues. e.g., [C++] Enhance Cast Kernel from Union to String using StringFormatter
>
> Even the existing implementation of CastTo (to string) does not utilize StringFormatter for complex types.

I'm OK with moving these issues forward this way.

@github-actions bot added the "awaiting changes" label and removed the "awaiting committer review" label on Jun 2, 2024
const ArraySpan& child_span = input.child_data[union_type.child_ids()[type_id]];

std::shared_ptr<Scalar> child_scalar;
auto child_index = union_type.mode() == UnionMode::DENSE ? offsets[i] : i;
Contributor

This should be a template parameter (a UnionMode::type) so we can specialize with if constexpr checks in the loop.
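
As a rough sketch of that suggestion (C++17, toy code rather than the PR's kernel): `ToyUnionMode`, `ChildIndices`, and `ChildIndicesDispatch` are hypothetical names, with `ToyUnionMode` standing in for `UnionMode::type`. The runtime check on the mode happens once in the dispatcher, and each instantiation of the loop contains only its own branch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for arrow::UnionMode::type.
enum class ToyUnionMode { kSparse, kDense };

// Maps each logical row of a union to the row to read in the selected child.
// The mode is a template parameter, so the branch is resolved at compile time.
template <ToyUnionMode kMode>
std::vector<int64_t> ChildIndices(int64_t length, const int32_t* offsets) {
  std::vector<int64_t> out(static_cast<std::size_t>(length));
  for (int64_t i = 0; i < length; ++i) {
    if constexpr (kMode == ToyUnionMode::kDense) {
      out[static_cast<std::size_t>(i)] = offsets[i];  // dense: read the offsets buffer
    } else {
      out[static_cast<std::size_t>(i)] = i;  // sparse: children share the row index
    }
  }
  return out;
}

// The only runtime check on the mode, performed once, outside the loop.
std::vector<int64_t> ChildIndicesDispatch(ToyUnionMode mode, int64_t length,
                                          const int32_t* offsets) {
  if (mode == ToyUnionMode::kDense) {
    assert(offsets != nullptr);
    return ChildIndices<ToyUnionMode::kDense>(length, offsets);
  }
  return ChildIndices<ToyUnionMode::kSparse>(length, offsets);
}
```

Calling `ChildIndicesDispatch(ToyUnionMode::kSparse, n, nullptr)` never touches `offsets`, since the sparse instantiation discards the dense branch entirely.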

@felipecrv
Contributor

@llama90 what is the motivation for working on these kernels? It's hard to define what would be a desired string representation of nested types (something for humans to read? JSON-like notation?) and it's quite a lot of work.

If you're looking for compute kernel work, may I interest you in making sure run-end encoded [1] types are handled everywhere? Scalar kernels can be fixed generically by applying the transformation to the values and keeping the same run-ends. There might be something like this for handling DICTIONARY automatically as well that you can use for inspiration.

[1] https://arrow.apache.org/docs/format/Columnar.html#run-end-encoded-layout
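
To make the run-end encoded idea concrete, here is a small sketch under simplifying assumptions: `ToyRunEndEncoded` and `MapRunEndEncoded` are hypothetical names, the container is a plain struct rather than an Arrow array, and nulls are ignored. An element-wise function is applied once per run to the values, and the run ends are copied unchanged.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>
#include <utility>
#include <vector>

// Toy run-end encoded column: values[k] covers the logical rows from
// run_ends[k-1] (or 0 for the first run) up to, but excluding, run_ends[k].
template <typename T>
struct ToyRunEndEncoded {
  std::vector<int32_t> run_ends;  // strictly increasing; last entry == logical length
  std::vector<T> values;          // one value per run
};

// Applies an element-wise function to the values only and keeps the run ends.
// Cost is proportional to the number of runs, not the logical length.
template <typename T, typename UnaryFn>
auto MapRunEndEncoded(const ToyRunEndEncoded<T>& input, UnaryFn fn) {
  using Out = std::decay_t<decltype(fn(std::declval<const T&>()))>;
  assert(input.run_ends.size() == input.values.size());
  ToyRunEndEncoded<Out> result;
  result.run_ends = input.run_ends;  // same run boundaries as the input
  result.values.reserve(input.values.size());
  for (const T& value : input.values) {
    result.values.push_back(fn(value));
  }
  return result;
}
```

If the function maps two neighbouring runs to the same value, this sketch leaves them as separate runs and does not merge them.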

@llama90
Contributor Author

llama90 commented Jun 3, 2024

@felipecrv Thank you for your review.

The main reason for handling this issue is to address the following problem:

I have implemented Cast Kernels for the necessary types. And I wanted to tackle the next issue to finally remove the Scalar::CastTo function:

However, I encountered a problem (such as in #39192 (comment)) where several language bindings of the C++ library (e.g., Python, Ruby) produced errors. Recently, while trying to fix those problems again, I found a hint that implementing the ${TYPES} to String Cast Kernel could solve the issue.

As you mentioned, the long-term aspect of "something for humans to read? JSON-like notation?" is important, but the primary problem I wanted to solve was to remove the Scalar::CastTo function while maintaining compatibility with the existing code.

For example, the Scalar::ToString function used CastTo(utf8()) to handle ToString(), and I thought I could solve this by implementing the Cast Kernel for String Type. This is why I attempted to create these kernels.

std::string Scalar::ToString() const {
  if (!this->is_valid) {
    return "null";
  }
  if (type->id() == Type::DICTIONARY) {
    auto dict_scalar = checked_cast<const DictionaryScalar*>(this);
    return dict_scalar->value.dictionary->ToString() + "[" +
           dict_scalar->value.index->ToString() + "]";
  }
  auto maybe_repr = CastTo(utf8());
  if (maybe_repr.ok()) {
    return checked_cast<const StringScalar&>(*maybe_repr.ValueOrDie()).value->ToString();
  }
  std::string result;
  std::shared_ptr<Array> as_array = *MakeArrayFromScalar(*this, 1);
  DCHECK_OK(PrettyPrint(*as_array, PrettyPrintOptions::Defaults(), &result));
  return result;
}

Thank you again for the in-depth review. I feel a bit overwhelmed about how to proceed.

Solving new issues is important, but I also want to finish the PRs I have already submitted. It feels like I'm not making much progress. Anyway, I will think more about it based on your review.

@felipecrv
Contributor

> For example, the Scalar::ToString function used CastTo(utf8()) to handle ToString(), and I thought I could solve this by implementing the Cast Kernel for String Type. This is why I attempted to create these kernels.

OK, it's good that there is precedent in the codebase for the string conversions to be this way.

> Thank you again for the in-depth review. I feel a bit overwhelmed about how to proceed.

I'm sorry. Anything that touches all the Arrow types becomes very overwhelming.

And when you're dealing with nested types, you move away from the world where static dispatching works and you need a way to perform dynamic dispatching (either via a switch on type ids or via virtual calls, i.e. vtables).

Imagine converting an instance of list<list<struct<utf8, list<union<int64, utf8>>>>> to string. You need a tree of string formatters prepared before the loop, and it has to follow the same shape as the DataType. Doing it value-by-value is just as hard, but slower.

@llama90
Contributor Author

llama90 commented Jun 12, 2024

> Imagine converting an instance of list<list<struct<utf8, list<union<int64, utf8>>>>> to string. You need a tree of string formatters prepared before the loop, and it has to follow the same shape as the DataType. Doing it value-by-value is just as hard, but slower.

I think I understand what you mean. Thank you for your response!

If you proceed with the work on the StringFormatter as mentioned, I will review it and continue with the process. Thank you once again.
