Skip configurations with fewer than 4 warps in tuning #188

Open · wants to merge 1 commit into base: master

Conversation

thomasfaingnaert
Member

Given that the SMs in Volta, Turing, Ampere, and Hopper have four processing blocks, each with its own warp scheduler, I don't think it makes sense to try configurations during tuning where the number of warps per CTA is less than 4. This reduces the search space by 18.75% (well, assuming that each (WARPS_M, WARPS_N) option yields the same number of valid kernels, which is probably not true...).

We could also bump the limit to 8, so we allocate at least 2 warps per processing block. That allows the SM to switch to another warp if one warp stalls. This would reduce the search space by another 18.75%.

We might even want to restrict this further. For example, I don't think a configuration like WARPS_M = 1, WARPS_N = 8 makes sense, as it has less data reuse across the N dimension than the configuration WARPS_M = 2, WARPS_N = 4, so we might want to only try the following configurations:

  • 2 x 4
  • 4 x 2
  • 4 x 4
  • 8 x 4
  • 4 x 8

That would reduce the search space by 68.75% in total.
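
As a quick check of those percentages: assuming WARPS_M and WARPS_N each range over 1, 2, 4, and 8 (so 16 candidate pairs) and that every pair contributes equally many valid kernels, a few REPL lines reproduce the numbers (the second figure is cumulative, i.e. 18.75% + 18.75%):

julia> configs = [(m, n) for m in (1, 2, 4, 8), n in (1, 2, 4, 8)];

julia> pct(pred) = 100 * count(pred, configs) / length(configs);

julia> pct(c -> prod(c) < 4)
18.75

julia> pct(c -> prod(c) < 8)
37.5

julia> pct(c -> c ∉ [(2, 4), (4, 2), (4, 4), (8, 4), (4, 8)])
68.75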

@maleadt Thoughts?


thomasfaingnaert commented Jan 25, 2024

FWIW, for the 7 GPUs we ran the tuning on (V100, RTX 4070, V100S, RTX6000, A100, RTX 2080 Ti, H100), this is the distribution of (WARPS_M, WARPS_N) for the optimal configurations the tuning script found:

julia> counters = countmap(sizes)
Dict{Any, Int64} with 13 entries:
  (1, 2) => 4
  (8, 4) => 1
  (1, 4) => 21
  (4, 1) => 35
  (2, 1) => 5
  (2, 8) => 5
  (4, 2) => 44
  (2, 2) => 39
  (4, 4) => 21
  (8, 1) => 10
  (2, 4) => 31
  (1, 8) => 3
  (8, 2) => 5

and the percentage of optimal configurations that would no longer be tested under each of the proposed restrictions:

julia> 100 * sum(v for (k, v) in counters if prod(k) < 4) / sum(v for (k, v) in counters)
4.017857142857143

julia> 100 * sum(v for (k, v) in counters if prod(k) < 8) / sum(v for (k, v) in counters)
46.42857142857143

julia> 100 * sum(v for (k, v) in counters if k ∉ [(2, 4), (4, 2), (4, 4), (8, 4), (4, 8)]) / sum(v for (k, v) in counters)
56.69642857142857

Maybe we should hold off on this for now...
Though, the question remains: are those configurations truly better than configurations where WARPS_M and WARPS_N are sufficiently large, or did they just happen to be selected while performing essentially the same as those larger configurations?
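
One way to answer that, once we have the raw timings, would be to compute per benchmarked problem how much slower the best whitelisted configuration is than the overall best one. A rough sketch, where `results` is a hypothetical mapping from each problem to the (WARPS_M, WARPS_N, time) tuples measured for it (not the tuning script's actual output format):

whitelist = [(2, 4), (4, 2), (4, 4), (8, 4), (4, 8)]

# For each problem, the ratio best-whitelisted-time / best-overall-time;
# ratios close to 1.0 mean restricting the search loses (almost) nothing.
function restriction_slowdowns(results)
    map(collect(values(results))) do runs
        best_overall = minimum(t for (_, _, t) in runs)
        best_allowed = minimum(t for (wm, wn, t) in runs if (wm, wn) in whitelist)
        best_allowed / best_overall
    end
end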


maleadt commented Jan 26, 2024

Though, the question remains: are those configurations truly better than configurations where WARPS_M and WARPS_N are sufficiently large, or did they just happen to be selected while performing essentially the same as those larger configurations?

Maybe we could gate the extended coverage behind a --slow arg or so (or the limited one behind --fast) and evaluate both? Once we figure out the other tuning script issues, that is.
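
For concreteness, a minimal sketch of what that gating could look like in the tuning script; the --fast flag follows the suggestion above, but the surrounding names (all_configs, the warps_m/warps_n fields) are placeholders rather than the script's actual API:

# Hypothetical: restrict the warp shapes searched unless full coverage is requested.
const FAST_WARP_SHAPES = [(2, 4), (4, 2), (4, 4), (8, 4), (4, 8)]

candidate_configs(all_configs; fast = "--fast" in ARGS) =
    fast ? filter(cfg -> (cfg.warps_m, cfg.warps_n) in FAST_WARP_SHAPES, all_configs) :
           all_configs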
