Adds support for large number of items to DevicePartition::If with the ThreeWayPartition overload #2506

Open
wants to merge 40 commits into
base: main


elstehle
Collaborator

@elstehle elstehle commented Oct 4, 2024

Description

Closes #2442

Note: this is a chained PR building on #2400
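
For context, the case this PR targets can be illustrated on the host. The following is a minimal sketch, not CUB code: `three_way_partition_ref`, `ThreeWayCounts`, and `needs_64bit_offsets` are hypothetical names introduced here for illustration. It splits an input into three outputs using two predicates and counts the first two parts with a 64-bit offset type, which is what a `num_items` beyond the 32-bit signed range requires. The ordering of the outputs in the device algorithm may differ from this sequential reference.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical host-side reference (not part of CUB): sizes of the first two
// parts, stored in a 64-bit offset type so they cannot overflow for large inputs.
struct ThreeWayCounts {
  std::int64_t first;
  std::int64_t second;
};

// Items selected by `select_first` go to `first_out`; of the remaining items,
// those selected by `select_second` go to `second_out`; all others go to
// `unselected_out`.
template <typename T, typename Pred1, typename Pred2>
ThreeWayCounts three_way_partition_ref(const std::vector<T>& in,
                                       std::vector<T>& first_out,
                                       std::vector<T>& second_out,
                                       std::vector<T>& unselected_out,
                                       Pred1 select_first,
                                       Pred2 select_second) {
  ThreeWayCounts counts{0, 0};
  for (const T& item : in) {
    if (select_first(item)) {
      first_out.push_back(item);
      ++counts.first;
    } else if (select_second(item)) {
      second_out.push_back(item);
      ++counts.second;
    } else {
      unselected_out.push_back(item);
    }
  }
  return counts;
}

// True when an item count cannot be represented by a 32-bit signed offset.
constexpr bool needs_64bit_offsets(std::int64_t num_items) {
  return num_items > std::numeric_limits<std::int32_t>::max();
}
```

With this sketch, `needs_64bit_offsets(std::int64_t{1} << 28)` is false, so the largest benchmarked size below still fits in 32-bit offsets, while `std::int64_t{1} << 32` does not.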

Performance summary:

| | Diff i32 vs i32.main (any num items) | Diff i32 vs i32.main (2^28 num items) | Diff i64 vs i32.main (any num items) | Diff i64 vs i32.main (2^28 num items) |
|---|---|---|---|---|
| min | -4.39% | 0.09% | -8.46% | 0.14% |
| max | 3.91% | 1.99% | 19.53% | 19.53% |
| avg | -0.07% | 0.79% | 0.30% | 3.58% |

H100 three_way_partition

| T{ct} | Entropy | Elements{io} | i32 | Diff i32 vs i32.main | i64 | Diff i64 vs i32.main |
|---|---|---|---|---|---|---|
| I8 | 1 | 2^16 | 11.589 | 1.10% | 11.235 | -1.99% |
| I8 | 1 | 2^20 | 14.819 | 2.68% | 13.909 | -3.63% |
| I8 | 1 | 2^24 | 70.931 | 1.33% | 70.711 | 1.02% |
| I8 | 1 | 2^28 | 1015 | 1.47% | 1021 | 2.07% |
| I8 | 0.544 | 2^16 | 11.565 | 1.64% | 10.898 | -4.22% |
| I8 | 0.544 | 2^20 | 14.352 | 2.94% | 13.563 | -2.72% |
| I8 | 0.544 | 2^24 | 69.678 | 1.31% | 69.484 | 1.03% |
| I8 | 0.544 | 2^28 | 998.363 | 1.37% | 1003 | 1.84% |
| I8 | 0 | 2^16 | 10.897 | 0.33% | 10.768 | -0.86% |
| I8 | 0 | 2^20 | 13.8 | 1.34% | 13.567 | -0.37% |
| I8 | 0 | 2^24 | 68.644 | 1.03% | 68.741 | 1.17% |
| I8 | 0 | 2^28 | 987.011 | 0.98% | 991.846 | 1.47% |
| I16 | 1 | 2^16 | 11.857 | -2.89% | 11.338 | -7.14% |
| I16 | 1 | 2^20 | 14.88 | -2.56% | 14.793 | -3.13% |
| I16 | 1 | 2^24 | 78.842 | 0.97% | 90.149 | 15.45% |
| I16 | 1 | 2^28 | 1095 | 1.86% | 1285 | 19.53% |
| I16 | 0.544 | 2^16 | 11.453 | -4.39% | 11.237 | -6.19% |
| I16 | 0.544 | 2^20 | 14.416 | -2.39% | 14.654 | -0.78% |
| I16 | 0.544 | 2^24 | 77.864 | 1.10% | 89.04 | 15.61% |
| I16 | 0.544 | 2^28 | 1079 | 1.56% | 1267 | 19.26% |
| I16 | 0 | 2^16 | 11.125 | -2.97% | 10.972 | -4.30% |
| I16 | 0 | 2^20 | 14.002 | -3.00% | 14.414 | -0.15% |
| I16 | 0 | 2^24 | 76.452 | 0.51% | 87.869 | 15.52% |
| I16 | 0 | 2^28 | 1064 | 1.27% | 1248 | 18.78% |
| I32 | 1 | 2^16 | 11.718 | -0.48% | 11.113 | -5.62% |
| I32 | 1 | 2^20 | 18.86 | 1.87% | 18.143 | -2.00% |
| I32 | 1 | 2^24 | 123.674 | 1.72% | 123.868 | 1.88% |
| I32 | 1 | 2^28 | 1773 | 1.99% | 1787 | 2.80% |
| I32 | 0.544 | 2^16 | 11.671 | 3.91% | 11.139 | -0.83% |
| I32 | 0.544 | 2^20 | 18.122 | 0.78% | 18.008 | 0.15% |
| I32 | 0.544 | 2^24 | 121.747 | 1.65% | 122.065 | 1.92% |
| I32 | 0.544 | 2^28 | 1740 | 1.74% | 1749 | 2.27% |
| I32 | 0 | 2^16 | 11.142 | 3.44% | 11.172 | 3.72% |
| I32 | 0 | 2^20 | 17.897 | 0.46% | 17.522 | -1.64% |
| I32 | 0 | 2^24 | 120.483 | 0.97% | 120.853 | 1.28% |
| I32 | 0 | 2^28 | 1726 | 1.31% | 1734 | 1.78% |
| I64 | 1 | 2^16 | 10.963 | -0.45% | 10.616 | -3.60% |
| I64 | 1 | 2^20 | 20.237 | -0.97% | 19.995 | -2.15% |
| I64 | 1 | 2^24 | 170.984 | 0.36% | 170.828 | 0.27% |
| I64 | 1 | 2^28 | 2621 | 0.26% | 2620 | 0.22% |
| I64 | 0.544 | 2^16 | 10.98 | -1.24% | 10.635 | -4.34% |
| I64 | 0.544 | 2^20 | 20.299 | -0.90% | 19.967 | -2.52% |
| I64 | 0.544 | 2^24 | 170.839 | 0.31% | 170.451 | 0.08% |
| I64 | 0.544 | 2^28 | 2613 | 0.22% | 2616 | 0.34% |
| I64 | 0 | 2^16 | 10.608 | -4.30% | 10.607 | -4.31% |
| I64 | 0 | 2^20 | 19.703 | -2.09% | 19.724 | -1.99% |
| I64 | 0 | 2^24 | 168.043 | -0.10% | 168.246 | 0.02% |
| I64 | 0 | 2^28 | 2584 | 0.09% | 2588 | 0.24% |
| I128 | 1 | 2^16 | 12.05 | -2.70% | 11.337 | -8.46% |
| I128 | 1 | 2^20 | 29.705 | 0.11% | 29.355 | -1.07% |
| I128 | 1 | 2^24 | 324.479 | -0.17% | 324.349 | -0.21% |
| I128 | 1 | 2^28 | 5095 | 0.14% | 5095 | 0.14% |
| I128 | 0.544 | 2^16 | 11.516 | -1.27% | 11.351 | -2.68% |
| I128 | 0.544 | 2^20 | 29.736 | 1.01% | 29.341 | -0.33% |
| I128 | 0.544 | 2^24 | 324.803 | 0.24% | 324.555 | 0.16% |
| I128 | 0.544 | 2^28 | 5085 | 0.16% | 5084 | 0.14% |
| I128 | 0 | 2^16 | 11.392 | -1.78% | 11.508 | -0.78% |
| I128 | 0 | 2^20 | 28.933 | -0.57% | 28.719 | -1.31% |
| I128 | 0 | 2^24 | 322.584 | -0.01% | 321.669 | -0.29% |
| I128 | 0 | 2^28 | 5053 | 0.17% | 5053 | 0.17% |
| F32 | 1 | 2^16 | 11.828 | -0.28% | 10.945 | -7.72% |
| F32 | 1 | 2^20 | 18.054 | -4.21% | 17.843 | -5.33% |
| F32 | 1 | 2^24 | 122.669 | -0.15% | 123.065 | 0.17% |
| F32 | 1 | 2^28 | 1762 | 0.77% | 1778 | 1.69% |
| F32 | 0.544 | 2^16 | 11.624 | -1.42% | 10.903 | -7.53% |
| F32 | 0.544 | 2^20 | 17.732 | -3.87% | 17.557 | -4.82% |
| F32 | 0.544 | 2^24 | 120.894 | -0.10% | 121.181 | 0.14% |
| F32 | 0.544 | 2^28 | 1733 | 0.60% | 1742 | 1.12% |
| F32 | 0 | 2^16 | 10.869 | -2.72% | 10.936 | -2.12% |
| F32 | 0 | 2^20 | 17.408 | -3.00% | 17.352 | -3.31% |
| F32 | 0 | 2^24 | 119.708 | -0.35% | 120.096 | -0.03% |
| F32 | 0 | 2^28 | 1715 | 0.15% | 1722 | 0.56% |
| F64 | 1 | 2^16 | 10.532 | -1.70% | 10.565 | -1.39% |
| F64 | 1 | 2^20 | 20.149 | -0.40% | 20.244 | 0.07% |
| F64 | 1 | 2^24 | 170.075 | 0.19% | 170.476 | 0.43% |
| F64 | 1 | 2^28 | 2609 | 0.17% | 2612 | 0.29% |
| F64 | 0.544 | 2^16 | 10.639 | 0.25% | 10.776 | 1.54% |
| F64 | 0.544 | 2^20 | 20.318 | -0.15% | 20.117 | -1.14% |
| F64 | 0.544 | 2^24 | 170.03 | 0.17% | 170.433 | 0.41% |
| F64 | 0.544 | 2^28 | 2604 | 0.19% | 2609 | 0.38% |
| F64 | 0 | 2^16 | 10.441 | -1.67% | 10.822 | 1.92% |
| F64 | 0 | 2^20 | 19.894 | -0.58% | 19.733 | -1.38% |
| F64 | 0 | 2^24 | 167.27 | -0.04% | 167.655 | 0.19% |
| F64 | 0 | 2^28 | 2576 | 0.11% | 2577 | 0.15% |

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@elstehle elstehle requested review from a team as code owners October 4, 2024 05:06
Contributor

github-actions bot commented Oct 4, 2024

🟨 CI finished in 52m 14s: Pass: 97%/208 | Total: 1d 06h | Avg: 8m 51s | Max: 41m 07s | Hits: 99%/11150
  • 🟨 cub: Pass: 94%/104 | Total: 20h 12m | Avg: 11m 39s | Max: 41m 07s

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  93%/96  | Total: 18h 41m | Avg: 11m 41s | Max: 41m 07s
      🟩 arm64              Pass: 100%/8   | Total:  1h 30m | Avg: 11m 19s | Max: 13m 07s
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total: 14m 08s | Avg:  7m 04s | Max:  7m 10s
      🔍 nvcc               Pass:  94%/102 | Total: 19h 58m | Avg: 11m 44s | Max: 41m 07s
    🟨 ctk
      🟨 11.1               Pass:  93%/15  | Total:  3h 05m | Avg: 12m 23s | Max: 41m 07s
      🟩 11.8               Pass: 100%/3   | Total: 43m 52s | Avg: 14m 37s | Max: 15m 53s
      🟨 12.6               Pass:  94%/86  | Total: 16h 22m | Avg: 11m 25s | Max: 33m 39s
    🟨 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 14m 08s | Avg:  7m 04s | Max:  7m 10s
      🟨 nvcc11.1           Pass:  93%/15  | Total:  3h 05m | Avg: 12m 23s | Max: 41m 07s
      🟩 nvcc11.8           Pass: 100%/3   | Total: 43m 52s | Avg: 14m 37s | Max: 15m 53s
      🟨 nvcc12.6           Pass:  94%/84  | Total: 16h 08m | Avg: 11m 31s | Max: 33m 39s
    🟨 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  1h 03m | Avg: 10m 35s | Max: 11m 51s
      🟩 Clang10            Pass: 100%/3   | Total: 31m 57s | Avg: 10m 39s | Max: 11m 19s
      🟩 Clang11            Pass: 100%/4   | Total: 38m 18s | Avg:  9m 34s | Max: 10m 38s
      🟩 Clang12            Pass: 100%/4   | Total: 39m 52s | Avg:  9m 58s | Max: 11m 16s
      🟩 Clang13            Pass: 100%/4   | Total: 38m 39s | Avg:  9m 39s | Max: 10m 37s
      🟩 Clang14            Pass: 100%/4   | Total: 40m 04s | Avg: 10m 01s | Max: 10m 15s
      🟩 Clang15            Pass: 100%/4   | Total: 38m 23s | Avg:  9m 35s | Max: 10m 19s
      🟩 Clang16            Pass: 100%/4   | Total: 41m 41s | Avg: 10m 25s | Max: 11m 38s
      🟩 Clang17            Pass: 100%/4   | Total: 40m 06s | Avg: 10m 01s | Max: 12m 01s
      🟩 Clang18            Pass: 100%/9   | Total:  1h 48m | Avg: 12m 00s | Max: 23m 27s
      🟩 GCC6               Pass: 100%/2   | Total: 23m 23s | Avg: 11m 41s | Max: 11m 49s
      🟨 GCC7               Pass:  83%/6   | Total: 59m 03s | Avg:  9m 50s | Max: 11m 14s
      🟩 GCC8               Pass: 100%/6   | Total: 57m 26s | Avg:  9m 34s | Max: 10m 22s
      🟩 GCC9               Pass: 100%/6   | Total:  1h 24m | Avg: 14m 07s | Max: 33m 39s
      🟩 GCC10              Pass: 100%/4   | Total: 39m 15s | Avg:  9m 48s | Max: 10m 59s
      🟩 GCC11              Pass: 100%/7   | Total:  1h 23m | Avg: 11m 55s | Max: 15m 53s
      🟩 GCC12              Pass: 100%/4   | Total: 39m 32s | Avg:  9m 53s | Max: 10m 45s
      🟨 GCC13              Pass:  93%/16  | Total:  3h 26m | Avg: 12m 55s | Max: 25m 09s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 37m 21s | Avg: 12m 27s | Max: 13m 44s
      🟥 MSVC14.16          Pass:   0%/1   | Total: 41m 07s | Avg: 41m 07s | Max: 41m 07s
      🟥 MSVC14.29          Pass:   0%/2   | Total: 38m 01s | Avg: 19m 00s | Max: 20m 01s
      🟥 MSVC14.39          Pass:   0%/1   | Total: 21m 36s | Avg: 21m 36s | Max: 21m 36s
    🟨 cxx_family
      🟩 Clang              Pass: 100%/46  | Total:  8h 00m | Avg: 10m 26s | Max: 23m 27s
      🟨 GCC                Pass:  96%/51  | Total:  9h 53m | Avg: 11m 38s | Max: 33m 39s
      🟩 Intel              Pass: 100%/3   | Total: 37m 21s | Avg: 12m 27s | Max: 13m 44s
      🟥 MSVC               Pass:   0%/4   | Total:  1h 40m | Avg: 25m 11s | Max: 41m 07s
    🟨 jobs
      🟨 Build              Pass:  94%/96  | Total: 17h 34m | Avg: 10m 59s | Max: 41m 07s
      🟥 DeviceLaunch       Pass:   0%/1   | Total: 18m 02s | Avg: 18m 02s | Max: 18m 02s
      🟩 GraphCapture       Pass: 100%/1   | Total: 15m 47s | Avg: 15m 47s | Max: 15m 47s
      🟩 HostLaunch         Pass: 100%/3   | Total: 50m 34s | Avg: 16m 51s | Max: 17m 39s
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 13m | Avg: 24m 28s | Max: 25m 09s
    🟨 gpu
      🟨 v100               Pass:  94%/104 | Total: 20h 12m | Avg: 11m 39s | Max: 41m 07s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 43m 52s | Avg: 14m 37s | Max: 15m 53s
      🟩 90a                Pass: 100%/4   | Total: 23m 37s | Avg:  5m 54s | Max:  6m 30s
    🟨 std
      🟨 11                 Pass:  96%/28  | Total:  5h 53m | Avg: 12m 37s | Max: 33m 39s
      🟨 14                 Pass:  92%/27  | Total:  5h 35m | Avg: 12m 26s | Max: 41m 07s
      🟨 17                 Pass:  96%/26  | Total:  4h 05m | Avg:  9m 25s | Max: 18m 00s
      🟨 20                 Pass:  91%/23  | Total:  4h 38m | Avg: 12m 05s | Max: 24m 48s
    
  • 🟩 thrust: Pass: 100%/103 | Total: 10h 15m | Avg: 5m 58s | Max: 26m 05s | Hits: 99%/11150

    🟩 cpu
      🟩 amd64              Pass: 100%/95  | Total:  9h 38m | Avg:  6m 05s | Max: 26m 05s | Hits:  99%/11150 
      🟩 arm64              Pass: 100%/8   | Total: 36m 57s | Avg:  4m 37s | Max:  5m 14s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  1h 33m | Avg:  6m 12s | Max: 21m 34s | Hits:  99%/2230  
      🟩 11.8               Pass: 100%/3   | Total: 14m 42s | Avg:  4m 54s | Max:  5m 28s
      🟩 12.6               Pass: 100%/85  | Total:  8h 28m | Avg:  5m 58s | Max: 26m 05s | Hits:  99%/8920  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  9m 55s | Avg:  4m 57s | Max:  4m 59s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  1h 33m | Avg:  6m 12s | Max: 21m 34s | Hits:  99%/2230  
      🟩 nvcc11.8           Pass: 100%/3   | Total: 14m 42s | Avg:  4m 54s | Max:  5m 28s
      🟩 nvcc12.6           Pass: 100%/83  | Total:  8h 18m | Avg:  6m 00s | Max: 26m 05s | Hits:  99%/8920  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  9m 55s | Avg:  4m 57s | Max:  4m 59s
      🟩 nvcc               Pass: 100%/101 | Total: 10h 05m | Avg:  5m 59s | Max: 26m 05s | Hits:  99%/11150 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 30m 05s | Avg:  5m 00s | Max:  5m 58s
      🟩 Clang10            Pass: 100%/3   | Total: 16m 51s | Avg:  5m 37s | Max:  5m 56s
      🟩 Clang11            Pass: 100%/4   | Total: 19m 11s | Avg:  4m 47s | Max:  5m 10s
      🟩 Clang12            Pass: 100%/4   | Total: 18m 22s | Avg:  4m 35s | Max:  4m 46s
      🟩 Clang13            Pass: 100%/4   | Total: 18m 47s | Avg:  4m 41s | Max:  4m 59s
      🟩 Clang14            Pass: 100%/4   | Total: 18m 42s | Avg:  4m 40s | Max:  4m 58s
      🟩 Clang15            Pass: 100%/4   | Total: 20m 16s | Avg:  5m 04s | Max:  5m 32s
      🟩 Clang16            Pass: 100%/4   | Total: 19m 06s | Avg:  4m 46s | Max:  4m 56s
      🟩 Clang17            Pass: 100%/4   | Total: 19m 35s | Avg:  4m 53s | Max:  5m 19s
      🟩 Clang18            Pass: 100%/9   | Total: 54m 14s | Avg:  6m 01s | Max: 15m 18s
      🟩 GCC6               Pass: 100%/2   | Total: 25m 22s | Avg: 12m 41s | Max: 21m 34s
      🟩 GCC7               Pass: 100%/6   | Total: 24m 29s | Avg:  4m 04s | Max:  4m 46s
      🟩 GCC8               Pass: 100%/6   | Total: 25m 00s | Avg:  4m 10s | Max:  5m 04s
      🟩 GCC9               Pass: 100%/6   | Total: 26m 39s | Avg:  4m 26s | Max:  4m 54s
      🟩 GCC10              Pass: 100%/4   | Total: 19m 10s | Avg:  4m 47s | Max:  5m 00s
      🟩 GCC11              Pass: 100%/7   | Total: 34m 45s | Avg:  4m 57s | Max:  5m 32s
      🟩 GCC12              Pass: 100%/4   | Total: 20m 19s | Avg:  5m 04s | Max:  5m 30s
      🟩 GCC13              Pass: 100%/14  | Total:  1h 27m | Avg:  6m 15s | Max: 15m 11s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 18m 45s | Avg:  6m 15s | Max:  6m 31s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 19m 11s | Avg: 19m 11s | Max: 19m 11s | Hits:  99%/2230  
      🟩 MSVC14.29          Pass: 100%/2   | Total: 33m 54s | Avg: 16m 57s | Max: 17m 17s | Hits:  99%/4460  
      🟩 MSVC14.39          Pass: 100%/2   | Total: 45m 20s | Avg: 22m 40s | Max: 26m 05s | Hits:  99%/4460  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/46  | Total:  3h 55m | Avg:  5m 06s | Max: 15m 18s
      🟩 GCC                Pass: 100%/49  | Total:  4h 23m | Avg:  5m 22s | Max: 21m 34s
      🟩 Intel              Pass: 100%/3   | Total: 18m 45s | Avg:  6m 15s | Max:  6m 31s
      🟩 MSVC               Pass: 100%/5   | Total:  1h 38m | Avg: 19m 41s | Max: 26m 05s | Hits:  99%/11150 
    🟩 gpu
      🟩 v100               Pass: 100%/103 | Total: 10h 15m | Avg:  5m 58s | Max: 26m 05s | Hits:  99%/11150 
    🟩 jobs
      🟩 Build              Pass: 100%/96  | Total:  8h 46m | Avg:  5m 28s | Max: 21m 34s | Hits:  99%/8920  
      🟩 TestCPU            Pass: 100%/4   | Total: 47m 25s | Avg: 11m 51s | Max: 26m 05s | Hits:  99%/2230  
      🟩 TestGPU            Pass: 100%/3   | Total: 41m 55s | Avg: 13m 58s | Max: 15m 18s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 14m 42s | Avg:  4m 54s | Max:  5m 28s
      🟩 90a                Pass: 100%/4   | Total: 16m 07s | Avg:  4m 01s | Max:  4m 16s
    🟩 std
      🟩 11                 Pass: 100%/28  | Total:  2h 29m | Avg:  5m 20s | Max: 21m 34s
      🟩 14                 Pass: 100%/27  | Total:  2h 34m | Avg:  5m 43s | Max: 19m 11s | Hits:  99%/4460  
      🟩 17                 Pass: 100%/26  | Total:  2h 21m | Avg:  5m 26s | Max: 16m 37s | Hits:  99%/2230  
      🟩 20                 Pass: 100%/22  | Total:  2h 50m | Avg:  7m 44s | Max: 26m 05s | Hits:  99%/4460  
    
  • 🟩 pycuda: Pass: 100%/1 | Total: 16m 12s | Avg: 16m 12s | Max: 16m 12s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 16m 12s | Avg: 16m 12s | Max: 16m 12s
    🟩 ctk
      🟩 12.5               Pass: 100%/1   | Total: 16m 12s | Avg: 16m 12s | Max: 16m 12s
    🟩 cudacxx
      🟩 nvcc12.5           Pass: 100%/1   | Total: 16m 12s | Avg: 16m 12s | Max: 16m 12s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 16m 12s | Avg: 16m 12s | Max: 16m 12s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 16m 12s | Avg: 16m 12s | Max: 16m 12s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 16m 12s | Avg: 16m 12s | Max: 16m 12s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 16m 12s | Avg: 16m 12s | Max: 16m 12s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 16m 12s | Avg: 16m 12s | Max: 16m 12s
    

👃 Inspect Changes

Modifications in project?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| | pycuda |
| | CUDA C Core Library |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | pycuda |
| +/- | CUDA C Core Library |

🏃‍ Runner counts (total jobs: 208)

| # | Runner |
|---|---|
| 171 | linux-amd64-cpu16 |
| 16 | linux-arm64-cpu16 |
| 12 | linux-amd64-gpu-v100-latest-1 |
| 9 | windows-amd64-cpu16 |

Labels
None yet
Projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

Add support for large num_items to DevicePartition::ThreeWayPartition
1 participant