GH-44081: [C++][Parquet] Fix reported metrics in parquet-arrow-reader-writer-benchmark #44082

Merged (1 commit) on Sep 12, 2024

pitrou (Member) commented Sep 12, 2024

Rationale for this change

  1. items/sec and bytes/sec were set to the same value in some benchmarks
  2. bytes/sec was incorrectly computed for boolean columns

What changes are included in this PR?

Fix parquet-arrow-reader-writer-benchmark to report correct metrics.
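
As a rough illustration of the two fixes, here is a minimal sketch of how byte and item counts can be kept separate, with booleans counted as one bit per value. The helper name `BytesForItems` and its use of plain C++ types are assumptions made for this example; this is not the code from the patch.

```cpp
#include <cstdint>

// Hypothetical helper for illustration only (not taken from the patch).
// Bytes occupied by `num_items` fixed-width values of type T.
template <typename T>
constexpr std::int64_t BytesForItems(std::int64_t num_items) {
  return num_items * static_cast<std::int64_t>(sizeof(T));
}

// Booleans are counted as bit-packed: 8 values per byte, rounded up.
template <>
constexpr std::int64_t BytesForItems<bool>(std::int64_t num_items) {
  return (num_items + 7) / 8;
}
```

In a Google Benchmark body, the item count would then go to `state.SetItemsProcessed(...)` and the byte count to `state.SetBytesProcessed(...)`, instead of the same number being passed to both.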

Example (column writing)

Before:

--------------------------------------------------------------------------------------------------------------------
Benchmark                                                          Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------
BM_WriteColumn<false,Int32Type>                             43138428 ns     43118609 ns           15 bytes_per_second=927.674Mi/s items_per_second=972.736M/s
BM_WriteColumn<true,Int32Type>                             150528627 ns    150480597 ns            5 bytes_per_second=265.815Mi/s items_per_second=278.727M/s
BM_WriteColumn<false,Int64Type>                             49243514 ns     49214955 ns           14 bytes_per_second=1.58742Gi/s items_per_second=1.70448G/s
BM_WriteColumn<true,Int64Type>                             151526550 ns    151472832 ns            5 bytes_per_second=528.148Mi/s items_per_second=553.803M/s
BM_WriteColumn<false,DoubleType>                            59101372 ns     59068058 ns           12 bytes_per_second=1.32263Gi/s items_per_second=1.42016G/s
BM_WriteColumn<true,DoubleType>                            159944872 ns    159895095 ns            4 bytes_per_second=500.328Mi/s items_per_second=524.632M/s
BM_WriteColumn<false,BooleanType>                           32855604 ns     32845322 ns           21 bytes_per_second=304.457Mi/s items_per_second=319.247M/s
BM_WriteColumn<true,BooleanType>                           150566118 ns    150528329 ns            5 bytes_per_second=66.4327Mi/s items_per_second=69.6597M/s

After:

Benchmark                                                          Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------
BM_WriteColumn<false,Int32Type>                             43919180 ns     43895926 ns           16 bytes_per_second=911.246Mi/s items_per_second=238.878M/s
BM_WriteColumn<true,Int32Type>                             153981290 ns    153929841 ns            5 bytes_per_second=259.859Mi/s items_per_second=68.1204M/s
BM_WriteColumn<false,Int64Type>                             49906105 ns     49860098 ns           14 bytes_per_second=1.56688Gi/s items_per_second=210.304M/s
BM_WriteColumn<true,Int64Type>                             154273499 ns    154202319 ns            5 bytes_per_second=518.799Mi/s items_per_second=68M/s
BM_WriteColumn<false,DoubleType>                            59789490 ns     59733498 ns           12 bytes_per_second=1.30789Gi/s items_per_second=175.542M/s
BM_WriteColumn<true,DoubleType>                            161235860 ns    161169670 ns            4 bytes_per_second=496.371Mi/s items_per_second=65.0604M/s
BM_WriteColumn<false,BooleanType>                           32962097 ns     32950864 ns           21 bytes_per_second=37.9353Mi/s items_per_second=318.224M/s
BM_WriteColumn<true,BooleanType>                           154103499 ns    154052873 ns            5 bytes_per_second=8.1141Mi/s items_per_second=68.066M/s
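
The After figures can be sanity-checked by hand: multiplying items_per_second by the value width should reproduce bytes_per_second. The sketch below does that arithmetic at compile time. The function names are hypothetical, and the one-bit width for booleans is inferred from the reported numbers, not taken from the patch.

```cpp
// Expected MiB/s for a given item rate and value width in bits.
constexpr double MiBPerSecond(double items_per_second, int bits_per_value) {
  return items_per_second * bits_per_value / 8.0 / (1024.0 * 1024.0);
}

// True if a and b differ by at most tol.
constexpr bool Near(double a, double b, double tol) {
  return (a > b ? a - b : b - a) <= tol;
}

// Figures from the "After" table, non-nullable columns.
static_assert(Near(MiBPerSecond(238.878e6, 32), 911.246, 0.01), "Int32Type");
static_assert(Near(MiBPerSecond(318.224e6, 1), 37.9353, 0.01), "BooleanType");
```

The same check fails on the Before table: the Int32Type row reports 972.736M items/s, which is just the 927.674 MiB/s byte rate expressed in bytes.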

Example (column reading)

Before:

---------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                 Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------------------
BM_ReadColumn<false,BooleanType>/-1/0                               6456731 ns      6453510 ns          108 bytes_per_second=1.51323Gi/s items_per_second=1.62482G/s
BM_ReadColumn<false,BooleanType>/1/20                              19012505 ns     19006068 ns           36 bytes_per_second=526.148Mi/s items_per_second=551.706M/s
BM_ReadColumn<true,BooleanType>/-1/1                               58365426 ns     58251529 ns           12 bytes_per_second=171.669Mi/s items_per_second=180.008M/s
BM_ReadColumn<true,BooleanType>/5/10                               46498966 ns     46442191 ns           15 bytes_per_second=215.321Mi/s items_per_second=225.781M/s

BM_ReadIndividualRowGroups                                         29617575 ns     29600557 ns           24 bytes_per_second=2.63931Gi/s items_per_second=2.83394G/s
BM_ReadMultipleRowGroups                                           47416980 ns     47288951 ns           15 bytes_per_second=1.65208Gi/s items_per_second=1.7739G/s
BM_ReadMultipleRowGroupsGenerator                                  29741012 ns     29722112 ns           24 bytes_per_second=2.62851Gi/s items_per_second=2.82235G/s

After:

---------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                 Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------------------
BM_ReadColumn<false,BooleanType>/-1/0                               6438249 ns      6435159 ns          109 bytes_per_second=194.245Mi/s items_per_second=1.62945G/s
BM_ReadColumn<false,BooleanType>/1/20                              19427495 ns     19419378 ns           37 bytes_per_second=64.3687Mi/s items_per_second=539.964M/s
BM_ReadColumn<true,BooleanType>/-1/1                               58342877 ns     58298236 ns           12 bytes_per_second=21.4415Mi/s items_per_second=179.864M/s
BM_ReadColumn<true,BooleanType>/5/10                               46591584 ns     46532288 ns           15 bytes_per_second=26.8631Mi/s items_per_second=225.344M/s

BM_ReadIndividualRowGroups                                         30039049 ns     30021676 ns           23 bytes_per_second=2.60229Gi/s items_per_second=349.273M/s
BM_ReadMultipleRowGroups                                           47877663 ns     47650438 ns           15 bytes_per_second=1.63954Gi/s items_per_second=220.056M/s
BM_ReadMultipleRowGroupsGenerator                                  30377987 ns     30360019 ns           23 bytes_per_second=2.57329Gi/s items_per_second=345.381M/s

Are these changes tested?

Manually by running benchmarks.

Are there any user-facing changes?

No, but this breaks historical comparisons in continuous benchmarking.

pitrou (Member Author) commented Sep 12, 2024

cc @boshek @austin3dickey for CB history breakage

pitrou requested a review from mapleFU on September 12, 2024 12:20
…reader-writer-benchmark

1. items/sec and bytes/sec were set to the same value in some benchmarks
2. bytes/sec was incorrectly computed for boolean columns
pitrou force-pushed the gh44081-pq-benchmarks-metrics branch from 9076d07 to 726e7de on September 12, 2024 12:32
pitrou (Member Author) commented Sep 12, 2024

Also note that every incorrect figure erred on the optimistic side; none was too pessimistic.

@@ -104,13 +107,28 @@ std::shared_ptr<ColumnDescriptor> MakeSchema(Repetition::type repetition) {
repetition == Repetition::REPEATED);
}

template <bool nullable, typename ParquetType>
Review comment (Member):

So nullable was previously unused?

Reply (Member Author):

Yes, as you can see.

github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label on Sep 12, 2024
mapleFU (Member) commented Sep 12, 2024

Float16Type is not a physical type, just a logical type, which makes it different from the other cases. But this LGTM.

pitrou (Member Author) commented Sep 12, 2024

> Float16Type is not a physical type, just a logical type.

I know, but this was convenient :-)

pitrou merged commit e0ac5d5 into apache:main on Sep 12, 2024
37 checks passed
pitrou removed the "awaiting committer review" label on Sep 12, 2024
pitrou deleted the gh44081-pq-benchmarks-metrics branch on September 12, 2024 14:29
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit e0ac5d5.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 32 possible false positives for unstable benchmarks that are known to sometimes produce them.

khwilson pushed a commit to khwilson/arrow that referenced this pull request Sep 14, 2024
…reader-writer-benchmark (apache#44082)

* GitHub Issue: apache#44081

Authored-by: Antoine Pitrou
Signed-off-by: Antoine Pitrou