Skip configurations with fewer than 4 warps in tuning #188

Open · wants to merge 1 commit into base: master

Conversation

thomasfaingnaert
Member

Given that the SMs in Volta, Turing, Ampere, and Hopper have four processing blocks, each with its own warp scheduler, I don't think it makes sense to try configurations during tuning where the number of warps per CTA is less than 4. This reduces the search space by 18.75% (well, assuming that each (WARPS_M, WARPS_N) option yields the same number of valid kernels, which is probably not true...).

We could also bump the limit to 8, so we allocate at least 2 warps per processing block. That allows the SM to switch to another warp if one warp stalls. This would reduce the search space by another 18.75%.

We might even want to restrict this further. For example, I don't think a configuration like WARPS_M = 1, WARPS_N = 8 makes sense, as it has less data reuse across the N dimension than the configuration WARPS_M = 2, WARPS_N = 4, so we might want to only try the following configurations:

  • 2 x 4
  • 4 x 2
  • 4 x 4
  • 8 x 4
  • 4 x 8

That would reduce the search space by 68.75% in total.
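
As a quick check of those percentages: assuming WARPS_M and WARPS_N each range over 1, 2, 4, and 8 (so 16 candidate pairs) and that every pair contributes equally many valid kernels, a few REPL lines reproduce the numbers (the second figure is cumulative, i.e. 18.75% + 18.75%):

julia> configs = [(m, n) for m in (1, 2, 4, 8), n in (1, 2, 4, 8)];

julia> pct(pred) = 100 * count(pred, configs) / length(configs);

julia> pct(c -> prod(c) < 4)
18.75

julia> pct(c -> prod(c) < 8)
37.5

julia> pct(c -> c ∉ [(2, 4), (4, 2), (4, 4), (8, 4), (4, 8)])
68.75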

@maleadt Thoughts?


thomasfaingnaert commented Jan 25, 2024

FWIW, for the 7 GPUs we ran the tuning on (V100, RTX 4070, V100S, RTX6000, A100, RTX 2080 Ti, H100), this is the distribution of (WARPS_M, WARPS_N) for the optimal configurations the tuning script found:

julia> counters = countmap(sizes)
Dict{Any, Int64} with 13 entries:
  (1, 2) => 4
  (8, 4) => 1
  (1, 4) => 21
  (4, 1) => 35
  (2, 1) => 5
  (2, 8) => 5
  (4, 2) => 44
  (2, 2) => 39
  (4, 4) => 21
  (8, 1) => 10
  (2, 4) => 31
  (1, 8) => 3
  (8, 2) => 5

and the percentage of optimal configurations that would no longer be tested under each of the proposed restrictions:

julia> 100 * sum(v for (k, v) in counters if prod(k) < 4) / sum(v for (k, v) in counters)
4.017857142857143

julia> 100 * sum(v for (k, v) in counters if prod(k) < 8) / sum(v for (k, v) in counters)
46.42857142857143

julia> 100 * sum(v for (k, v) in counters if k ∉ [(2, 4), (4, 2), (4, 4), (8, 4), (4, 8)]) / sum(v for (k, v) in counters)
56.69642857142857

Maybe we should hold off on this for now...
Though, the question remains: are those configurations truly better than configurations where WARPS_M and WARPS_N are sufficiently large, or did they just happen to be selected while performing essentially the same as those larger configurations?
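
One way to answer that, once we have the raw timings, would be to compute per benchmarked problem how much slower the best whitelisted configuration is than the overall best one. A rough sketch, where `results` is a hypothetical mapping from each problem to the (WARPS_M, WARPS_N, time) tuples measured for it (not the tuning script's actual output format):

whitelist = [(2, 4), (4, 2), (4, 4), (8, 4), (4, 8)]

# For each problem, the ratio best-whitelisted-time / best-overall-time;
# ratios close to 1.0 mean restricting the search loses (almost) nothing.
function restriction_slowdowns(results)
    map(collect(values(results))) do runs
        best_overall = minimum(t for (_, _, t) in runs)
        best_allowed = minimum(t for (wm, wn, t) in runs if (wm, wn) in whitelist)
        best_allowed / best_overall
    end
end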


maleadt commented Jan 26, 2024

Though, the question remains: are those configurations truly better than configurations where WARPS_M and WARPS_N are sufficiently large, or did they just happen to be selected while performing essentially the same as those larger configurations?

Maybe we could gate the extended coverage behind a --slow arg or so (or the limited one behind --fast) and evaluate both? Once we figure out the other tuning script issues, that is.
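
For concreteness, a minimal sketch of what that gating could look like in the tuning script; the --fast flag follows the suggestion above, but the surrounding names (all_configs, the warps_m/warps_n fields) are placeholders rather than the script's actual API:

# Hypothetical: restrict the warp shapes searched unless full coverage is requested.
const FAST_WARP_SHAPES = [(2, 4), (4, 2), (4, 4), (8, 4), (4, 8)]

candidate_configs(all_configs; fast = "--fast" in ARGS) =
    fast ? filter(cfg -> (cfg.warps_m, cfg.warps_n) in FAST_WARP_SHAPES, all_configs) :
           all_configs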
