GH-41974: [C++][Compute] Support more precise pre-allocation and more pre-allocated types for ScalarExecutor and VectorExecutor #41975
base: main
Conversation
@@ -966,12 +998,6 @@ class ScalarExecutor : public KernelExecutorImpl<ScalarKernel> {
      data_preallocated_.size() == static_cast<size_t>(output_num_buffers_ - 1) &&
      !is_nested(out_type_id) && !is_dictionary(out_type_id));

  // TODO(wesm): why was this check ever here? Fixed width binary
`ComputeDataPreallocate` already ensures that every element of `data_preallocated_` satisfies this DCHECK, so it's unnecessary to keep it here.
Maybe keep it, since it only runs in debug builds?
We can keep the DCHECK you suggested before (https://github.com/apache/arrow/pull/41975/files#r1636669846). In essence, they check the same thing; otherwise there would be two duplicate DCHECKs.
// chunked arrays (VectorKernel::exec_chunked), so we check if we
// have any chunked arrays. If we do and an exec_chunked function
// is defined then we call that.
bool have_chunked_arrays = false;
`have_chunked_arrays` is only used in the `kernel_->can_execute_chunkwise == false` branch.
I like this code move; I could only understand it by looking commit by commit.
cpp/src/arrow/compute/exec.cc
Outdated
output_num_buffers_ = static_cast<int>(output_type_.type->layout().buffers.size());

// Decide if we need to preallocate memory for this kernel
validity_preallocated_ =
We could use a more precise pre-allocation strategy by checking the input batches, as `ScalarExecutor` does. This precise strategy (checking the status of the input batches) avoids some unnecessary pre-allocation, and improves situations like the one below:
before:

```
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/1000 4395737 ns 4393473 ns 138 items_per_second=119.333M/s null_percent=0.1 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/10   5109157 ns 5106714 ns 129 items_per_second=102.666M/s null_percent=10 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/2    6586773 ns 6583881 ns 104 items_per_second=79.6321M/s null_percent=50 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/1    1868864 ns 1867719 ns 364 items_per_second=280.71M/s null_percent=100 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/0    3133277 ns 3131282 ns 248 items_per_second=167.436M/s null_percent=0 size=524.288k
```

after:

```
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/1000 1813275 ns 1812657 ns 392 items_per_second=289.237M/s null_percent=0.1 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/10   2711716 ns 2710747 ns 255 items_per_second=193.411M/s null_percent=10 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/2    4460442 ns 4459029 ns 154 items_per_second=117.579M/s null_percent=50 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/1    248396 ns  248270 ns  2695 items_per_second=2.11176G/s null_percent=100 size=524.288k
TakeChunkedFlatInt64RandomIndicesWithNulls/524288/0    878407 ns  878028 ns  769 items_per_second=597.12M/s null_percent=0 size=524.288k
```
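The idea behind the precise strategy can be sketched with a simplified, self-contained model. The enum and function names below are illustrative stand-ins, not Arrow's actual API: only inputs that may genuinely contain nulls force a pre-allocated validity bitmap, while all-valid and all-null inputs let the executor skip that work.

```cpp
#include <cassert>

// Simplified stand-in for Arrow's NullGeneralization of one input.
enum class NullGen { kPerhapsNull, kAllValid, kAllNull };

// Combine the generalization of two inputs under INTERSECTION semantics:
// the output is null wherever any input is null.
inline NullGen Combine(NullGen a, NullGen b) {
  if (a == NullGen::kAllNull || b == NullGen::kAllNull) return NullGen::kAllNull;
  if (a == NullGen::kAllValid && b == NullGen::kAllValid) return NullGen::kAllValid;
  return NullGen::kPerhapsNull;
}

// Only a kPerhapsNull result actually needs a pre-allocated validity bitmap;
// kAllValid can elide the bitmap and kAllNull can be filled wholesale.
inline bool NeedValidityPrealloc(NullGen combined) {
  return combined == NullGen::kPerhapsNull;
}
```

With all-valid inputs (the `null_percent=0` rows above) no bitmap is allocated at all, which is where the largest speedups come from.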
FWIW I'm working on Take at the moment and setting the chunked exec kernels #41700
Thanks for the reminder, I will take a look. The current PR only optimizes the execution logic for chunked arrays in the null pre-allocation part.
cpp/src/arrow/compute/exec.cc
Outdated
@@ -473,7 +485,7 @@ bool ExecSpanIterator::Next(ExecSpan* span) {
namespace {

struct NullGeneralization {
-  enum type { PERHAPS_NULL, ALL_VALID, ALL_NULL };
+  enum type { PERHAPS_NULL = 0, ALL_VALID = 1, ALL_NULL = 2, INVALID = 3 };
Wouldn't the name "INVALID" be ambiguous here? Could it be `ALL_NULL`?
Yes, it can be replaced by `ALL_NULL` with `ALL_NULL = 3`, so we can keep the original `NullGeneralization::type` of each chunk when doing the `&` operation in `type Get(const ChunkedArray& chunk_array)`.
@@ -738,12 +772,10 @@ class KernelExecutorImpl : public KernelExecutor {
  }
  for (size_t i = 0; i < data_preallocated_.size(); ++i) {
    const auto& prealloc = data_preallocated_[i];
    if (prealloc.bit_width >= 0) {
What would the `NA` type do here?
`data_preallocated_` is filled in `ComputeDataPreallocate`, and the `NA` type is never added to `data_preallocated_`.
Yes, it seems this could be a DCHECK?
`fixed_size_binary<0>` can have `bit_width == 0`
> `fixed_size_binary<0>` can have `bit_width == 0`

Yes, `fixed_size_binary<0>` can be added to `data_preallocated_` normally in `ComputeDataPreallocate`, and that function makes sure every element of `data_preallocated_` satisfies `bit_width >= 0`. So we just add a DCHECK here.
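The invariant under discussion (pre-allocation entries always carry a non-negative bit width, with `fixed_size_binary<0>` legally contributing a width of 0) can be sketched in a simplified, self-contained model. The struct and function below are hypothetical illustrations, not Arrow's actual API:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of one data-buffer pre-allocation entry,
// loosely mirroring the (bit_width, added_length) pairs that
// ComputeDataPreallocate fills into data_preallocated_.
struct BufferPrealloc {
  int bit_width = 0;     // width of each value in bits; 0 is legal (fixed_size_binary<0>)
  int added_length = 0;  // extra elements, e.g. +1 for offset buffers
};

// Bytes needed for `length` values at `bit_width` bits, rounded up.
inline int64_t PreallocBytes(const BufferPrealloc& p, int64_t length) {
  // The producer is expected to only emit non-negative widths, so a
  // negative width is a programming error (a DCHECK), not a runtime case.
  assert(p.bit_width >= 0);
  const int64_t bits = static_cast<int64_t>(p.bit_width) * (length + p.added_length);
  return (bits + 7) / 8;
}
```

A zero-width entry simply allocates zero bytes, which is why `bit_width == 0` must stay on the "valid" side of the check.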
cpp/src/arrow/compute/exec.cc
Outdated
@@ -473,7 +485,7 @@ bool ExecSpanIterator::Next(ExecSpan* span) {
namespace {

struct NullGeneralization {
-  enum type { PERHAPS_NULL, ALL_VALID, ALL_NULL };
+  enum type { PERHAPS_NULL = 1, ALL_VALID = 2, ALL_NULL = 3 };
Wouldn't this introduce an undefined value here? (a return of 0)
Agreed. This should start from 0 to give an easier time to the compiler (when it's proving boundaries in the optimizer).
cpp/src/arrow/compute/exec.cc
Outdated
```cpp
const ArrayData* curr_chunk = chunk_array.chunk(chunk_idx)->data().get();
value.SetArray(*curr_chunk);
```

Suggested change:

```cpp
const ArrayData& curr_chunk = *chunk_array.chunk(chunk_idx)->data();
value.SetArray(curr_chunk);
```
Or construct an `ArraySpan`?
cpp/src/arrow/compute/exec.cc
Outdated
@@ -1123,6 +1142,34 @@ class VectorExecutor : public KernelExecutorImpl<VectorKernel> {
    }
  }

  Status SetupPreallocation(const std::vector<Datum>& args) {
What's the difference between it and `SetupPreallocation` in `ScalarExecutor`?
Very hard to review this code extraction. Was this code churn necessary to achieve the goals of this PR?
> Was this code churn necessary to achieve the goals of this PR?

Yes, that's one of the goals of this PR: we can use a more precise pre-allocation strategy for the vector executor. The main change is to support null-value detection when `kernel_->null_handling == NullHandling::INTERSECTION`, so as to skip some unnecessary pre-allocations, including null detection for chunked arrays, which was not done before. The logic here is the same as `ScalarExecutor::SetupPreallocation`. Because this logic is decoupled from the scheduling in `VectorExecutor::Execute`, it is better to extract it.
> What's the difference between it and `SetupPreallocation` in Scalar?

The difference from `ScalarExecutor`'s `SetupPreallocation` is that `ScalarExecutor` does not have an independent `chunk_exec`, so it needs `preallocate_contiguous_` to decide whether to run `PrepareOutput` per chunk through `span_iterator_`; this is not required in `VectorExecutor`.
> The difference between this function and SetupPreallocation of ScalarExecutor is that ScalarExecutor does not have an independent chunk_exec, so it needs preallocate_contiguous_ to decide whether to run PrepareOutput per chunk through span_iterator_, which is not required in VectorExecutor.

Do you think we can extract `SetupPreallocation` to the base class, extract the common logic out, and then implement the chunked-array handling with:

```cpp
void SetupPreallocation(...) final {
  Base::SetupPreallocation(...);
  ...
}
```

?

And what if `kernel_->null_handling == NullHandling::OUTPUT_NOT_NULL`?
> And what if `kernel_->null_handling == NullHandling::OUTPUT_NOT_NULL`?

For `OUTPUT_NOT_NULL`, the kernel output is never null and a validity bitmap does not need to be allocated. `validity_preallocated_` just pre-allocates the validity bitmap buffer, which is unnecessary for `OUTPUT_NOT_NULL`.
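The `NullHandling` names in this thread are Arrow's real enum values; the decision function below is only an illustrative sketch of how `validity_preallocated_` could be derived from them, not Arrow's actual code:

```cpp
#include <cassert>

// Names mirror Arrow's NullHandling options (kernel.h); the layout here
// is a simplified stand-in for illustration.
enum class NullHandling {
  INTERSECTION,             // output validity = intersection of input validity bitmaps
  COMPUTED_PREALLOCATE,     // kernel fills a pre-allocated validity bitmap
  COMPUTED_NO_PREALLOCATE,  // kernel allocates its own validity bitmap
  OUTPUT_NOT_NULL           // output is never null; no bitmap needed
};

// Sketch: only INTERSECTION and COMPUTED_PREALLOCATE ask the executor to
// pre-allocate a validity bitmap; OUTPUT_NOT_NULL never needs one.
inline bool ValidityPreallocated(NullHandling h) {
  return h == NullHandling::INTERSECTION ||
         h == NullHandling::COMPUTED_PREALLOCATE;
}
```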
cpp/src/arrow/compute/exec.cc
Outdated
widths->emplace_back(64, /*added_length=*/1);
widths->emplace_back(64, /*added_length=*/1);
List-views don't need an extra offset/size, so you don't need `added_length=1`.
cpp/src/arrow/compute/exec.cc
Outdated
int chunk_idx = 0;
while (chunk_idx < chunk_array.num_chunks()) {
You can use a `for` loop here.
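As a generic illustration of the suggested while-to-for conversion (plain nested vectors standing in for chunked arrays; all names below are hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum the lengths of all "chunks" using a for loop, instead of a while
// loop with a manually advanced chunk index as in the code above.
inline int TotalLength(const std::vector<std::vector<int>>& chunks) {
  int total = 0;
  for (std::size_t chunk_idx = 0; chunk_idx < chunks.size(); ++chunk_idx) {
    total += static_cast<int>(chunks[chunk_idx].size());
  }
  return total;
}
```

The `for` form keeps the index's declaration, bound, and increment on one line, which is the readability point of the review comment.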
cpp/src/arrow/compute/exec.cc
Outdated
@@ -498,15 +510,36 @@ struct NullGeneralization {
    return PERHAPS_NULL;
  }

  static type Get(const ChunkedArray& chunk_array) {
    if (chunk_array.num_chunks() == 0 ||
If `num_chunks() == 0`, we should return `ALL_VALID` instead of `ALL_NULL` or `PERHAPS_NULL`, like `Get` does above (it doesn't check `.length()` at all, which I think was a missed opportunity to make `ALL_VALID` more likely).
I will finish the review later. I need to check that code movement more carefully. It would help to have a commit history with two commits:
- refactoring the setup-preallocation logic in place
- extracting that refactored logic to a function

Code movement and changes mixed together take more time to review.
It's my fault. Thanks so much for your review @felipecrv @mapleFU.
I finally understand the first goal of the PR. And I disapprove of it. I want kernel dispatching logic to depend solely on types and immutable properties of the kernels.
Extending the range of types that are pre-allocated is definitely a good idea though.
cpp/src/arrow/compute/exec.cc
Outdated
  }
  validity_preallocated_ = !elide_validity_bitmap;
}
}
I don't think this logic is safe. If null_handling
is INTERSECTION
the function might want to follow a single code path that uses the pre-allocated buffer to put the result of the bitmaps intersection.
And even if that is not the case (assuming all kernels are perfect at the moment), this extra logic introduces branching in the possible states of pre-allocated buffers based on runtime input data and not something statically defined in the kernel_
when it's configured.
If you want the speed benefits of not pre-allocating a validity bitmap when not necessary, declare the kernel as COMPUTED_NO_PREALLOCATE
and work on the kernel logic to allocate only when needed. The generic binding and execution logic should be simple and not depend on values so much. It's too complicated as it is.
> follow a single code path that uses the pre-allocated buffer to put the result of the bitmaps intersection

Are you talking about `PropagateNulls`? Currently, that logic seems to be used when a kernel sets `INTERSECTION`.

If not, I understand that if the kernel's internal execution path must follow a single code path that uses the pre-allocated buffer, `COMPUTED_PREALLOCATE` can be set statically, which should be safer for such a kernel?

Back to the pre-allocated validity buffer: if a kernel function sets `NullHandling::INTERSECTION`, does it mean the kernel function may not be sure whether it needs a pre-allocated validity buffer, and lets the executor decide for itself? In that case, the executor can only dynamically determine whether to set the pre-allocated buffer based on the input data.
> And even if that is not the case (assuming all kernels are perfect at the moment), this extra logic introduces branching in the possible states of pre-allocated buffers based on runtime input data and not something statically defined in the `kernel_` when it's configured.

From a performance perspective, the current logic for `INTERSECTION` does introduce many branches at runtime; compared to `PropagateNulls`, the extra pass through `NullGeneralization::Get` actually introduces very few branches.
> The generic binding and execution logic should be simple and not depend on values so much. It's too complicated as it is.

Yes, the current pre-allocation logic for `INTERSECTION` introduced by `NullPropagator` and `NullGeneralization` is complicated.

@pitrou What do you think of this PR for `VectorExecutor`'s `INTERSECTION` validity-bitmap pre-allocation? It seems that you have changed the logic here before.
Besides that, the current `ScalarExecutor` pre-allocates with `NullGeneralization` in `INTERSECTION` mode.
Sorry, by branches I meant branches in the state space the execution framework and the kernels might traverse. Meaning that, without duplicating test cases for every kernel, we might hide a latent bug until an unlikely array configuration causes a SIGSEGV on the non-pre-allocated validity buffer.

> Back to the pre-allocated validity buffer, if a kernel function sets `NullHandling::INTERSECTION`, does it mean the kernel function may not be sure whether it needs a pre-allocated validity buffer, and lets the executor decide for itself? In that case, the executor can only dynamically determine whether to set the pre-allocated buffer based on the input data.

By statically I mean a one-time configuration of the `kernel_` when it's added to the function in the registry. For a kernel implementer, it's simpler to assume `NullHandling::INTERSECTION` implies a pre-allocated validity bitmap buffer, no matter what arrays are passed as input.
> For a kernel implementer, it's simpler to assume `NullHandling::INTERSECTION` implies a pre-allocated validity bitmap buffer, no matter what arrays are passed as input.

I see, thank you for your explanation @felipecrv.

Do you think the changes in the third commit of this PR are reasonable? If so, I will revert the first two commits. The third commit is mainly related to the pre-allocation of the output data buffer and refactors part of the code, which can bring the benefits mentioned by @mapleFU.

In addition, I will create an issue to track the pre-allocation of the validity buffer in `INTERSECTION` mode, because currently `ScalarExecutor` does pre-allocation according to `NullGeneralization` in this mode.
I think the handling of list-view types in the pre-allocation code is good. The 3rd commit has a mix of things in it.
Yes, except for the [Large]ListView and chunked-array `NullGeneralization` support, the others are some minor refactorings (some DCHECKs are unnecessary for now).

@felipecrv I was just wondering whether there are techniques to denote an …

It's tricky to re-use, given that Buffers wrapped in shared_ptrs are supposed to be immutable. You never know if someone is also holding a reference to that shared …
cpp/src/arrow/compute/exec.cc
Outdated
if (chunk_array.num_chunks() == 0) {
  return ALL_VALID;
}
if (chunk_array.null_count() == chunk_array.length()) {
This does not follow the same constraint as the `Array` overload above: "Do not count the bits if they haven't been counted already".

I think you should instead just iterate on the chunks, such as (untested):

```cpp
static type Get(const ArraySpan& arr) {
  // Do not count the bits if they haven't been counted already
  if ((arr.null_count == 0) || (arr.buffers[0].data == nullptr)) {
    return ALL_VALID;
  }
  if (arr.null_count == arr.length) {
    return ALL_NULL;
  }
  return PERHAPS_NULL;
}

static type Get(const ChunkedArray& chunk_array) {
  std::optional<type> current_gen;
  for (const auto& chunk : chunk_array.chunks()) {
    if (chunk->length() == 0) {
      continue;
    }
    const type chunk_gen = Get(*chunk);
    if (current_gen.has_value() && chunk_gen != *current_gen) {
      return PERHAPS_NULL;
    }
    current_gen = chunk_gen;
  }
  return current_gen.value_or(ALL_VALID);
}
```
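The chunk-iteration logic sketched above can be modeled in a self-contained way (plain structs instead of Arrow types; every name below is hypothetical): empty chunks are skipped, mixed non-empty chunks degrade to "perhaps null", and an array with no non-empty chunks counts as all-valid.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

enum class NullGen { kPerhapsNull, kAllValid, kAllNull };

// Minimal stand-in for a chunk: just a length and its null generalization.
struct FakeChunk {
  int64_t length;
  NullGen gen;
};

// Mirrors the suggested ChunkedArray logic: skip empty chunks; two
// non-empty chunks with different generalizations yield kPerhapsNull;
// an all-empty (or chunkless) array counts as kAllValid.
inline NullGen GetChunked(const std::vector<FakeChunk>& chunks) {
  std::optional<NullGen> current;
  for (const auto& chunk : chunks) {
    if (chunk.length == 0) continue;
    const NullGen chunk_gen = chunk.gen;
    if (current.has_value() && chunk_gen != *current) {
      return NullGen::kPerhapsNull;
    }
    current = chunk_gen;
  }
  return current.value_or(NullGen::kAllValid);
}
```

Note how this never counts bits itself: each chunk's generalization is taken as given, which is the "do not count the bits" constraint from the review comment.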
Thanks for this, addressed.
cc @pitrou, I think this can be moved forward?
Rationale for this change

Mainly for the improvements and code simplification below:
- `NullGeneralization` does not support checking a chunked array's null status, which may prevent some pre-allocation optimizations.
- `ComputeDataPreallocate` does not support pre-allocation for [Large]ListView.

What changes are included in this PR?

Described above.

Are these changes tested?

Yes, covered by existing tests.

Are there any user-facing changes?

No