-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-36892: [C++] Fix performance regressions in FieldPath::Get
#37032
Conversation
|
I was able to use the new benchmarks at four different points in time. Got some interesting results... (these are all approximate best-case) Immediately before PR-35197This was right after I added initial support for tables/chunked arrays. Fortunately, those implementations probably didn't see any use before they were replaced.
Immediately after PR-35197Major regressions for array data and record batches.
Immediately before this PRAt some point, performance went way up for everything besides array data and batches. Not sure why.
Immediately after this PR
|
@ursabot please benchmark |
Benchmark runs are scheduled for commit c7c3059. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete. |
Out of curiosity I ran these benchmarks against 12.0.1: 12.0.1
After this PR
|
Thanks for your patience. Conbench analyzed the 6 benchmarking runs that have been run so far on PR commit c7c3059. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them. |
It seems that Original regression: |
@github-actions crossbow submit -g cpp |
Revision: 0b75284 Submitted crossbow builds: ursacomputing/crossbow @ actions-c076bca0da |
I pushed a variation so that Table inputs are benchmarks with multiple chunk sizes.
|
That's probably because selecting a top-level column from a table is just a basic index into a vector, whereas for chunked arrays, you need to process all of the chunks individually to construct a new Right now, the benchmarks only test top-level selections - but you'd likely see a similar degradation at higher depth levels. |
@raulcd You can cherry-pick this now. |
### Rationale for this change #35197 appears to have introduced significant performance regressions in `FieldPath::Get` - indicated [here](https://conbench.ursa.dev/compare/runs/9cf73ac83f0a44179e6538b2c1c7babd...3d76cb5ffb8849bf8c3ea9b32d08b3b7/), in a benchmark that uses a wide (10K column) dataframe. ### What changes are included in this PR? - Adds basic benchmarks for `FieldPath::Get` across various input types, as they didn't previously exist - Addresses several performance issues. These came in the form of extremely high upfront costs for the `RecordBatch` and `ArrayData` overloads specifically - Some minor refactoring of `NestedSelector` ### Are these changes tested? Yes (covered by existing tests) ### Are there any user-facing changes? No * Closes: #36892 Lead-authored-by: benibus <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Thanks, I've cherry-picked and I am re-running some jobs on the maintenance branch to validate the status before creating the new RC |
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 59f30f0. There was 1 benchmark result indicating a performance regression:
The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them. |
…apache#37032) ### Rationale for this change apache#35197 appears to have introduced significant performance regressions in `FieldPath::Get` - indicated [here](https://conbench.ursa.dev/compare/runs/9cf73ac83f0a44179e6538b2c1c7babd...3d76cb5ffb8849bf8c3ea9b32d08b3b7/), in a benchmark that uses a wide (10K column) dataframe. ### What changes are included in this PR? - Adds basic benchmarks for `FieldPath::Get` across various input types, as they didn't previously exist - Addresses several performance issues. These came in the form of extremely high upfront costs for the `RecordBatch` and `ArrayData` overloads specifically - Some minor refactoring of `NestedSelector` ### Are these changes tested? Yes (covered by existing tests) ### Are there any user-facing changes? No * Closes: apache#36892 Lead-authored-by: benibus <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Rationale for this change
#35197 appears to have introduced significant performance regressions in
FieldPath::Get
- indicated here, in a benchmark that uses a wide (10K column) dataframe.What changes are included in this PR?
FieldPath::Get
across various input types, as they didn't previously existRecordBatch
andArrayData
overloads specificallyNestedSelector
Are these changes tested?
Yes (covered by existing tests)
Are there any user-facing changes?
No