Skip configurations with fewer than 4 warps in tuning #188

Given the fact that SMs in Volta, Turing, Ampere, and Hopper have four processing blocks, each with one warp scheduler, I don't think it makes sense to try configurations during tuning where the number of warps per CTA is less than 4. This reduces the search space by 18.75% (well, assuming that each of the options of WARPS_M and WARPS_N amounts to the same number of valid kernels, which is probably not true...).

We could also bump the limit to 8, so we allocate at least 2 warps per processing block. That allows the SM to switch to another warp if one warp stalls. This would reduce the search space by another 18.75%.

We might even want to restrict this further. For example, I don't think a configuration like WARPS_M = 1, WARPS_N = 8 makes sense, as that has reduced data reuse across the M dimension compared to the configuration WARPS_M = 2, WARPS_N = 4, so we might also only want to try the following configurations:

- 2 x 4
- 4 x 2
- 4 x 4
- 8 x 4
- 4 x 8

That would reduce the search space by 68.75% in total.
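The percentages above can be checked with a small script. This is only a sketch: it assumes that WARPS_M and WARPS_N are each tuned over the values 1, 2, 4, and 8 (an assumption that is consistent with the quoted numbers but not stated in this PR) and that every pair counts equally. The names `WARP_OPTIONS`, `ALLOWED`, and `reduction` are made up for illustration and are not taken from the tuning code.

```python
from itertools import product

# Assumption: each of WARPS_M and WARPS_N is tuned over these values.
WARP_OPTIONS = (1, 2, 4, 8)

# The five (WARPS_M, WARPS_N) configurations proposed in the description.
ALLOWED = {(2, 4), (4, 2), (4, 4), (8, 4), (4, 8)}


def reduction(keep):
    """Percentage of (WARPS_M, WARPS_N) pairs that the predicate `keep` rejects."""
    configs = list(product(WARP_OPTIONS, repeat=2))
    kept = [c for c in configs if keep(*c)]
    return 100.0 * (len(configs) - len(kept)) / len(configs)


# Skip configurations with fewer than 4 warps per CTA.
at_least_4 = reduction(lambda m, n: m * n >= 4)
# Skip configurations with fewer than 8 warps per CTA.
at_least_8 = reduction(lambda m, n: m * n >= 8)
# Keep only the explicitly listed configurations.
allow_list = reduction(lambda m, n: (m, n) in ALLOWED)

print(f"warps >= 4: {at_least_4}% of the search space removed")
print(f"warps >= 8: {at_least_8}% of the search space removed")
print(f"allow list: {allow_list}% of the search space removed")
```

Under these assumptions, the minimum of 8 warps removes 37.5% in total, which is the first 18.75% plus the additional 18.75% mentioned above.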

Commits on Jan 25, 2024
