GH-41974: [C++][Compute] Support more precise pre-allocation and more pre-allocated types for ScalarExecutor and VectorExecutor #41975

Open · wants to merge 5 commits into main
Conversation

@ZhangHuiGui (Collaborator) commented Jun 5, 2024

Rationale for this change

Mainly for the improvements and code simplification below:

  1. NullGeneralization does not support checking a chunked array's null status, which may prevent some preallocation optimizations.
  2. ComputeDataPreallocate does not support preallocation for [Large]ListView types.
  3. Some DCHECKs and branch checks for preallocating the validity bitmaps are unnecessary.

What changes are included in this PR?

Described above.

Are these changes tested?

Yes, covered by existing tests.

Are there any user-facing changes?

No


github-actions bot commented Jun 5, 2024

⚠️ GitHub issue #41974 has been automatically assigned in GitHub to PR creator.

@@ -966,12 +998,6 @@ class ScalarExecutor : public KernelExecutorImpl<ScalarKernel> {
data_preallocated_.size() == static_cast<size_t>(output_num_buffers_ - 1) &&
!is_nested(out_type_id) && !is_dictionary(out_type_id));

// TODO(wesm): why was this check ever here? Fixed width binary
Collaborator Author:

ComputeDataPreallocate ensures that every element in data_preallocated_ satisfies this DCHECK, so it's unnecessary to keep it here.

Member:

Maybe keep this, since it only runs in debug builds?

@ZhangHuiGui (Collaborator, Author) Jun 13, 2024:

We can keep the DCHECK you suggested earlier (https://github.com/apache/arrow/pull/41975/files#r1636669846). In essence, both detect the same condition; keeping this one too would just duplicate the DCHECK.

// chunked arrays (VectorKernel::exec_chunked), so we check if we
// have any chunked arrays. If we do and an exec_chunked function
// is defined then we call that.
bool have_chunked_arrays = false;
Collaborator Author:

have_chunked_arrays is only used in the kernel_->can_execute_chunkwise == false branch.

Contributor:

I like this code move. I could only understand it by looking at the commits one by one.

output_num_buffers_ = static_cast<int>(output_type_.type->layout().buffers.size());

// Decide if we need to preallocate memory for this kernel
validity_preallocated_ =
@ZhangHuiGui (Collaborator, Author) Jun 5, 2024:

We could use a more precise pre-allocation strategy by checking the input batches, as ScalarExecutor does.

This precise strategy (checking the input batches' null status) avoids some unnecessary pre-allocation, and it improves situations like the one below:
before:

TakeChunkedFlatInt64RandomIndicesWithNulls/524288/1000        4395737 ns      4393473 ns          138 items_per_second=119.333M/s null_percent=0.1 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/10          5109157 ns      5106714 ns          129 items_per_second=102.666M/s null_percent=10 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/2           6586773 ns      6583881 ns          104 items_per_second=79.6321M/s null_percent=50 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/1           1868864 ns      1867719 ns          364 items_per_second=280.71M/s null_percent=100 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/0           3133277 ns      3131282 ns          248 items_per_second=167.436M/s null_percent=0 size=524.288k

after:

TakeChunkedFlatInt64RandomIndicesWithNulls/524288/1000        1813275 ns      1812657 ns          392 items_per_second=289.237M/s null_percent=0.1 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/10          2711716 ns      2710747 ns          255 items_per_second=193.411M/s null_percent=10 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/2           4460442 ns      4459029 ns          154 items_per_second=117.579M/s null_percent=50 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/1            248396 ns       248270 ns         2695 items_per_second=2.11176G/s null_percent=100 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/0            878407 ns       878028 ns          769 items_per_second=597.12M/s null_percent=0 size=524.288k

Contributor:

FWIW I'm working on Take at the moment and setting up the chunked exec kernels (#41700).

Collaborator Author:

Thanks for the reminder, I will take a look.
The current PR only optimizes the chunked-array execution logic in the null pre-allocation part.

@@ -473,7 +485,7 @@ bool ExecSpanIterator::Next(ExecSpan* span) {
namespace {

struct NullGeneralization {
enum type { PERHAPS_NULL, ALL_VALID, ALL_NULL };
enum type { PERHAPS_NULL = 0, ALL_VALID = 1, ALL_NULL = 2, INVALID = 3 };
Member:

Would the name INVALID be ambiguous here? Could it be ALL_NULL?

Collaborator Author:

Yes, it can be replaced by ALL_NULL with ALL_NULL = 3, so we can keep each chunk's original NullGeneralization::type when doing the & operation in type Get(const ChunkedArray& chunk_array).
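For illustration, a minimal sketch of the &-combination with the encoding in the diff above (PERHAPS_NULL = 0b00, ALL_VALID = 0b01, ALL_NULL = 0b10, INVALID = 0b11); it assumes a per-chunk Get(const ArrayData&) overload and is not the PR's exact code:

// INVALID (all bits set) is the identity for &, so it seeds the fold; equal
// chunk results are preserved, and mixing ALL_VALID with ALL_NULL collapses
// to PERHAPS_NULL (0b01 & 0b10 == 0b00).
static type Get(const ChunkedArray& chunk_array) {
  auto combined = INVALID;
  for (const auto& chunk : chunk_array.chunks()) {
    combined = static_cast<type>(combined & Get(*chunk->data()));
  }
  // An empty chunked array never leaves the identity; treat it as ALL_VALID.
  return combined == INVALID ? ALL_VALID : combined;
}

The same algebra works if INVALID is renamed as suggested, as long as the identity value 3 is kept.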

@@ -738,12 +772,10 @@ class KernelExecutorImpl : public KernelExecutor {
}
for (size_t i = 0; i < data_preallocated_.size(); ++i) {
const auto& prealloc = data_preallocated_[i];
if (prealloc.bit_width >= 0) {
Member:

What would NA type do here?

Collaborator Author:

data_preallocated_ will be filled in ComputeDataPreallocate, and the NA type will not be added to data_preallocated_.

Member:

Yes, it seems this could be a DCHECK?

Contributor:

fixed_size_binary<0> can have bit_width == 0

Collaborator Author:

fixed_size_binary<0> can have bit_width == 0

Yes, fixed_size_binary<0> is added to data_preallocated_ normally in ComputeDataPreallocate, and that call ensures every element in data_preallocated_ satisfies bit_width >= 0. So we just add a DCHECK here.
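A minimal sketch of what that could look like (hypothetical form, not the exact diff):

for (size_t i = 0; i < data_preallocated_.size(); ++i) {
  const auto& prealloc = data_preallocated_[i];
  // ComputeDataPreallocate only emits entries with a known width, so a
  // negative bit_width would be a programming error; fixed_size_binary<0>
  // legitimately yields bit_width == 0 and still passes.
  DCHECK_GE(prealloc.bit_width, 0);
  // ... allocate the data buffer from prealloc.bit_width ...
}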

@@ -473,7 +485,7 @@ bool ExecSpanIterator::Next(ExecSpan* span) {
namespace {

struct NullGeneralization {
enum type { PERHAPS_NULL, ALL_VALID, ALL_NULL };
enum type { PERHAPS_NULL = 1, ALL_VALID = 2, ALL_NULL = 3 };
Member:

Would this introduce an undefined value here? (return 0)

Contributor:

Agreed. This should start from 0 to give an easier time to the compiler (when it's proving boundaries in the optimizer).

Comment on lines 523 to 524
const ArrayData* curr_chunk = chunk_array.chunk(chunk_idx)->data().get();
value.SetArray(*curr_chunk);
Member:

Suggested change
const ArrayData* curr_chunk = chunk_array.chunk(chunk_idx)->data().get();
value.SetArray(*curr_chunk);
const ArrayData& curr_chunk = *chunk_array.chunk(chunk_idx)->data();
value.SetArray(curr_chunk);

Or construct an ArraySpan?
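For reference, a minimal sketch of the ArraySpan alternative (assuming ArraySpan's explicit ArrayData constructor; untested):

// A non-owning view over the chunk; cheap to construct and copy.
ArraySpan chunk_span(*chunk_array.chunk(chunk_idx)->data());

Whether SetArray can consume the span directly depends on ExecValue's API, so this only shows the construction half.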


@@ -1123,6 +1142,34 @@ class VectorExecutor : public KernelExecutorImpl<VectorKernel> {
}
}

Status SetupPreallocation(const std::vector<Datum>& args) {
Member:

What's the difference between it and SetupPreallocation in Scalar?

Contributor:

Very hard to review this code extraction. Was this code churn necessary to achieve the goals of this PR?

@ZhangHuiGui (Collaborator, Author) Jun 13, 2024:

Was this code churn necessary to achieve the goals of this PR?

Yes, that's one of the goals of this PR: using a more precise pre-allocation strategy for VectorExecutor.

The main change is to support detecting null values when kernel_->null_handling == NullHandling::INTERSECTION, so as to skip some unnecessary pre-allocations, including null-value detection for chunked arrays, which was not done before. Also, the logic here is the same as in ScalarExecutor::SetupPreallocation.

Because this logic is decoupled from the scheduling in VectorExecutor's Execute, it is worth extracting.

What's the difference between it and SetupPreallocation in Scalar?

The difference from ScalarExecutor's SetupPreallocation is that ScalarExecutor does not have an independent chunk_exec, so it needs preallocate_contiguous_ to decide, via span_iterator_, whether PrepareOutput must be executed per chunk; VectorExecutor does not require this.

Member:

The difference from ScalarExecutor's SetupPreallocation is that ScalarExecutor does not have an independent chunk_exec, so it needs preallocate_contiguous_ to decide, via span_iterator_, whether PrepareOutput must be executed per chunk; VectorExecutor does not require this.

Do you think we can extract SetupPreallocation into the base class, factor the common logic out, and then implement chunked-array handling with:

void SetupPreallocation(...) final {
  Base::SetupPreallocation(...)
  ...
}

?

And what if kernel_->null_handling == NullHandling::OUTPUT_NOT_NULL?

@ZhangHuiGui (Collaborator, Author) Jun 14, 2024:

And what if kernel_->null_handling == NullHandling::OUTPUT_NOT_NULL?

The kernel output is never null, so a validity bitmap does not need to be allocated.
validity_preallocated_ only pre-allocates the validity bitmap buffer, which is unnecessary for OUTPUT_NOT_NULL.
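As a rough paraphrase of the executor's setup logic (a sketch, not a quote of the actual code):

// OUTPUT_NOT_NULL and COMPUTED_NO_PREALLOCATE both mean the executor must
// not pre-allocate a validity bitmap; outputs of type NA have no bitmap at all.
validity_preallocated_ =
    kernel_->null_handling != NullHandling::OUTPUT_NOT_NULL &&
    kernel_->null_handling != NullHandling::COMPUTED_NO_PREALLOCATE &&
    output_type_.type->id() != Type::NA;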

Comment on lines 313 to 314
widths->emplace_back(64, /*added_length=*/1);
widths->emplace_back(64, /*added_length=*/1);
Contributor:

list-views don't need an extra offset/size, so you don't need the added_length=1.
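A sketch of the suggested fix (hypothetical; list-view offsets and sizes buffers hold exactly length elements, unlike list offsets, which hold length + 1; assumes added_length defaults to 0, otherwise pass /*added_length=*/0 explicitly):

case Type::LARGE_LIST_VIEW:
  widths->emplace_back(64);  // offsets: one per slot, no trailing entry
  widths->emplace_back(64);  // sizes: one per slot
  break;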


Comment on lines 520 to 521
int chunk_idx = 0;
while (chunk_idx < chunk_array.num_chunks()) {
Contributor:

You can use a for loop here.
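The suggested form (untested sketch):

for (int chunk_idx = 0; chunk_idx < chunk_array.num_chunks(); ++chunk_idx) {
  // ... loop body unchanged ...
}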

@@ -498,15 +510,36 @@ struct NullGeneralization {
return PERHAPS_NULL;
}

static type Get(const ChunkedArray& chunk_array) {
if (chunk_array.num_chunks() == 0 ||
Contributor:

If num_chunks() == 0, we should return ALL_VALID instead of ALL_NULL or PERHAPS_NULL, like Get does above (it doesn't check .length() at all, which I think was a missed opportunity to make ALL_VALID more likely).



@felipecrv (Contributor) left a comment:

I will finish the review later. I need to check that code movement more carefully. It would help to have a commit history with two commits:

  1. refactoring the setup preallocation logic in-place
  2. extracting that refactored logic to a function

Code movement + changes mixed together takes more time to review.


@ZhangHuiGui (Collaborator, Author):

I will finish the review later. I need to check that code movement more carefully. It would help to have a commit history with two commits:

  1. refactoring the setup preallocation logic in-place
  2. extracting that refactored logic to a function

Code movement + changes mixed together takes more time to review.

It's my fault. Thanks so much for your review @felipecrv @mapleFU.
I reorganized the commits.

@felipecrv (Contributor) left a comment:

I finally understand the first goal of the PR. And I disapprove of it. I want kernel dispatching logic to depend solely on types and immutable properties of the kernels.

Extending the range of types that are pre-allocated is definitely a good idea though.

}
validity_preallocated_ = !elide_validity_bitmap;
}
}
Contributor:

I don't think this logic is safe. If null_handling is INTERSECTION the function might want to follow a single code path that uses the pre-allocated buffer to put the result of the bitmaps intersection.

And even if that is not the case (assuming all kernels are perfect at the moment), this extra logic introduces branching in the possible states of pre-allocated buffers based on runtime input data and not something statically defined in the kernel_ when it's configured.

If you want the speed benefits of not pre-allocating a validity bitmap when not necessary, declare the kernel as COMPUTED_NO_PREALLOCATE and work on the kernel logic to allocate only when needed. The generic binding and execution logic should be simple and not depend on values so much. It's too complicated as it is.
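For illustration, such a declaration is a one-time static configuration when the kernel is registered (a sketch with a hypothetical kernel; the NullHandling and MemAllocation enumerators are the ones from arrow/compute/kernel.h):

ScalarKernel kernel;  // hypothetical kernel being configured
// The kernel computes and allocates its own validity bitmap ...
kernel.null_handling = NullHandling::COMPUTED_NO_PREALLOCATE;
// ... and allocates its own data buffers, so the executor pre-allocates nothing.
kernel.mem_allocation = MemAllocation::NO_PREALLOCATE;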

@ZhangHuiGui (Collaborator, Author) Jun 14, 2024:

follow a single code path that uses the pre-allocated buffer to put the result of the bitmaps intersection

Are you talking about PropagateNulls? Currently, it seems that logic is used when the kernel sets INTERSECTION.

If not, my understanding is that if the kernel's internal execution path must follow a single code path that uses the pre-allocated buffer, COMPUTED_PREALLOCATE can be set statically, which should be safer for this kernel?

Back to the pre-allocated validity buffer: if a kernel function sets NullHandling::INTERSECTION, does that mean the kernel function may not be sure whether it needs a pre-allocated validity buffer and lets the executor decide for itself? In that case, the executor can only determine dynamically, based on the input data, whether to pre-allocate the buffer.

And even if that is not the case (assuming all kernels are perfect at the moment), this extra logic introduces branching in the possible states of pre-allocated buffers based on runtime input data and not something statically defined in the kernel_ when it's configured.

From a performance perspective, the current logic does introduce many branches at runtime for INTERSECTION; compared to PropagateNulls, though, the path through NullGeneralization::Get actually introduces very few branches.

The generic binding and execution logic should be simple and not depend on values so much. It's too complicated as it is.

Yes, the current pre-allocation logic for INTERSECTION, which introduced NullPropagator and NullGeneralization, is complicated.

@pitrou What do you think of this PR for VectorExecutor's INTERSECTION validity-bitmap preallocation?
It seems that you have changed the logic here before.

Collaborator Author:

Besides that, the current ScalarExecutor already pre-allocates based on NullGeneralization in INTERSECTION mode.

Contributor:

Sorry, by branches I meant branches in the state-space the execution framework and the kernels might traverse. Meaning that without duplication of test cases for every kernel, we might hide a latent bug until an unlikely array configuration causes a SIGSEGV on the non-pre-allocated validity buffer.

Back to the pre-allocated validity buffer: if a kernel function sets NullHandling::INTERSECTION, does that mean the kernel function may not be sure whether it needs a pre-allocated validity buffer and lets the executor decide for itself? In that case, the executor can only determine dynamically, based on the input data, whether to pre-allocate the buffer.

By statically I mean a one-time configuration of the kernel_ when it's added to the function in the registry.

For a kernel implementer, it's simpler to assume NullHandling::INTERSECTION implies pre-allocated validity bitmap buffer. No matter what arrays are passed as input.

Collaborator Author:

For a kernel implementer, it's simpler to assume NullHandling::INTERSECTION implies pre-allocated validity bitmap buffer. No matter what arrays are passed as input.

I see, thank you for your explanation @felipecrv .

Do you think the changes in the third commit of this PR are reasonable? If so, I will revert the first two commits. The third commit mainly concerns pre-allocation of the output data buffer and refactors part of the code, which can bring the benefits mentioned by @mapleFU.

In addition, I will create an issue to track pre-allocation of the validity buffer in INTERSECTION mode, because ScalarExecutor currently pre-allocates according to NullGeneralization in this mode.

Contributor:

I think the handling of list-view types in the pre-allocation code is good. The 3rd commit has a mix of things in it.

Collaborator Author:

Yes, apart from the [Large]ListView and chunked-array NullGeneralization support, the rest are minor refactorings (some DCHECKs are unnecessary now).

@mapleFU (Member) commented Jun 14, 2024:

@felipecrv I'm just wondering whether there is a technique to denote that an ArrayData is "owned and writable", like the output of PrepareOutput. That would enable things like:

  1. Optimizing data transforms in the Parquet reader
  2. Optimizations like reusing the input

@felipecrv (Contributor):

@felipecrv I'm just wondering whether there is a technique to denote that an ArrayData is "owned and writable", like the output of PrepareOutput. That would enable things like:

  1. Optimizing data transforms in the Parquet reader
  2. Optimizations like reusing the input

It's tricky to re-use given that Buffers wrapped in shared_ptrs are supposed to be immutable. You never know if someone is also holding a reference to that shared Buffer.
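A sketch of the kind of check such an "owned and writable" scheme would need (illustrative only; CanReuseBuffer is a hypothetical helper):

bool CanReuseBuffer(const std::shared_ptr<Buffer>& buffer) {
  // Sole ownership plus mutability is necessary, but use_count() is only a
  // snapshot: another reference can appear via an ArrayData copy before the
  // write happens, which is exactly why reuse is tricky.
  return buffer != nullptr && buffer->is_mutable() && buffer.use_count() == 1;
}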

if (chunk_array.num_chunks() == 0) {
return ALL_VALID;
}
if (chunk_array.null_count() == chunk_array.length()) {
Member:

This does not follow the same constraint as Array above: "Do not count the bits if they haven't been counted already".

I think you should instead just iterate on the chunks, such as (untested):

  static type Get(const ArraySpan& arr) {
    // Do not count the bits if they haven't been counted already
    if ((arr.null_count == 0) || (arr.buffers[0].data == nullptr)) {
      return ALL_VALID;
    }
    if (arr.null_count == arr.length) {
      return ALL_NULL;
    }
    return PERHAPS_NULL;
  }

  static type Get(const ChunkedArray& chunk_array) {
    std::optional<type> current_gen;
    for (const auto& chunk : chunk_array.chunks()) {
      if (chunk->length() == 0) {
        continue;
      }
      // Wrap the chunk's ArrayData in a span so the overload above applies.
      const type chunk_gen = Get(ArraySpan(*chunk->data()));
      if (current_gen.has_value() && chunk_gen != *current_gen) {
        return PERHAPS_NULL;
      }
      current_gen = chunk_gen;
    }
    return current_gen.value_or(ALL_VALID);
  }

Collaborator Author:

Thanks for this, addressed.

@felipecrv requested a review from pitrou, June 26, 2024 14:58
@ZhangHuiGui (Collaborator, Author) commented Jul 11, 2024:

cc @pitrou, I think this can be moved forward?
