Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add benchmark sample for vector times matrix transposed #38

Closed
wants to merge 1 commit into from

Conversation

qedawkins
Copy link
Collaborator

This adds benchmarks for vmt, with very similar supporting structure to the existing mmt benchmark, but with different strategies tuned for matvec. This add three strategies:

  1. Treat it like a reduction with one workgroup per row, relying on
    cache to get reuse of the vector.
  2. Copy the vector to shared memory using all threads in the workgroup
    and then process N0 rows per workgroup, with WG_Y | N0 threadgroups.
  3. Use a fixed number of workgroups and each workgroup strides the
    full problem space. This should limit the overhead of setting up the
    vector in shared memory, as well as improves scheduling overhead.

Currently, the best configuration for each of the above three strategies are in the same performance ballpark (~20us for a 4096 * 4096x4096 matvec on an AMD 7900xtx).

This adds benchmarks for `vmt`, with very similar supporting structure
to the existing `mmt` benchmark, but with different strategies tuned for
matvec. This add three strategies:

1) Treat it like a reduction with one workgroup per row, relying on
   cache to get reuse of the vector.
2) Copy the vector to shared memory using all threads in the workgroup
   and then process N0 rows per workgroup, with WG_Y | N0 threadgroups.
3) Use a fixed number of workgroups and each workgroup strides the
   full problem space. This should limit the overhead of setting up the
   vector in shared memory, as well as improves scheduling overhead.

Currently, the best configuration for each of the above three strategies
are in the same performance ballpark (~20us for a 4096 * 4096x4096 matvec
on an AMD 7900xtx).
@qedawkins
Copy link
Collaborator Author

I don't see a way to assign reviewers, so @antiagainst @kuhar I am posting progress here as discussed offline.

@@ -0,0 +1,106 @@
// Copyright 2023 Google LLC
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: For any substantially modified file, I believe you should put your affiliation here instead of Google. In case of files where multiple parties made significant contributions to, you can list multiple copyright lines, e.g.:

// Copyright 2020-2022 Google LLC
// Copyright 2023 Costco Inc.

Or we can switch it to "uVkCompute Authors" if that's all too annoying.

const uint threadID = gl_SubgroupInvocationID;
const uvec2 gridDim = gl_NumWorkGroups.xy;

const uint laneCount = gl_WorkGroupSize.x;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we rely on the subgroup size being the same as the workgroup size? Or a specific multiple of the workgroup size. If yes, I think we can set it with VK_EXT_subgroup_size_control.

Comment on lines +99 to +102
int32_t wgResult = subgroupAdd(laneResult);
if (subgroupElect()) {
outputO.x[r] = wgResult;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it's not clear to me if we process all the elements because the outer loop has the trip count gl_WorkGroupSize.x * K0_VEC but we reduce only within the subgroup. Don't we need an outer loop that reduces over the whole workgroup?

Comment on lines +372 to +392
for (const ShaderCode &shader : kShaderCodeCases) {
std::string vecmat_size = absl::StrCat(N, "x", K);
std::string tiling_scheme = absl::StrCat(shader.N0, "x", shader.K0);
BM_CHECK(isMultipleOf(N, shader.N0))
<< "Incompatible tiling scheme: " << tiling_scheme;
BM_CHECK(isMultipleOf(K, shader.K0))
<< "Incompatible tiling scheme: " << tiling_scheme;
BM_CHECK(isMultipleOf(shader.K0, 4))
<< "Incompatible tiling scheme: " << tiling_scheme;

std::string workgroup_size =
absl::StrCat(shader.wg_size_x, "x", shader.wg_size_y, "x1");
std::string type_info = absl::StrCat(GetName(shader.input_type), "->",
GetName(shader.output_type));
std::string test_name = absl::StrCat(
gpu_name, "/vmt[", vecmat_size, "]/", type_info, "/", shader.name,
"/Workgroup", "_", shader.B0, "x", "[", workgroup_size, "]");
::benchmark::RegisterBenchmark(test_name.c_str(), Vmt, device,
latency_measure, shader, N, K)
->UseManualTime()
->Unit(::benchmark::kMicrosecond);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have one main.cc file for both variants? This is a lot of that is hard to tell apart without running under diff. Or alternatively, maybe move common code into a header/library?

Copy link
Collaborator

@antiagainst antiagainst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks pretty nice! Just a few nits. I mainly reviewed the grid_strided_vmt_i8.glsl file and comments there applies to other simpler strategies. Thanks for adding it!

layout(binding = 1) buffer InputB { i8vec4 x[]; } inputB;
layout(binding = 2) buffer Output { int32_t x[]; } outputO;

layout(local_size_x = WG_X, local_size_y = WG_Y, local_size_z = 1) in;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to add comment here to mention that WG_X and WG_Y is macros that gots their values during shader compilation.

SRC
"vmt_i8.glsl"
PERMUTATION
"N0=[1|2|4]"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to provide some background on why we have such permuation in the build system--it was to bypass limitations in driver compilers in mobile GPUs. This can and should be specialization constants in the kernel actually; but mobile GPU driver compilers used to have the issue if it's appearing as loop bounds it cannot properly unroll so tanking the perf. That might be not a problem anymore today. But we never checked. Anyway, nothing to do here; just wanted to explain a bit.


VkExtent3D dimensions1 = {uint32_t(N / 8), uint32_t(K), 1};
BM_CHECK_OK_AND_ASSIGN(
auto src_image1,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The images and samplers aren't used and should be removed.


layout(constant_id = 0) const uint N = 1;
layout(constant_id = 1) const uint K = 1;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly it would be nice to explain what M, N, K, M0, N0, K0 mean. And mention that the latter three are defined during shader compilation too.


const uint strideB = K_VEC; // Stride of the `inputB` matrix.

// Each workgroup processes a total of N0 rows per iteration, therefore
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to give an overview of the algorithm at the beginning, after explaining various macros and constant values. Saying how we decide WG_X (subgroup size) and WG_Y (subgroup count), and how workload are distributed (which is this comment).

void main() {
const uvec2 wgID = gl_WorkGroupID.xy;
const uvec2 localID = gl_LocalInvocationID.xy;
const uint threadID = gl_SubgroupInvocationID;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

threadID is just localID.x? It would be nice to just use one source to avoid confusion.

shared i8vec4 LHS[K_VEC]; // Shared data for the LHS.

void main() {
const uvec2 wgID = gl_WorkGroupID.xy;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only ever use the workgroup X dim. So it's better just to only assign that to wgID.

int32_t laneResult = 0;

for (uint k = 0; k < K_VEC; k += partialVec) {
[[unroll]] for (uint kk = 0; kk < K0_VEC; ++kk) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, do you know what kind of ISA it generates with this unrolled scalar (i8vec4=>i32) load? Does the driver compiler potentially merge them into 128-bit loads?

}

double numOperation = double(N) * double(K) * 2.;
state.counters["Ops"] =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of showing Ops/s, it's more meaningful to show bytes/s given this is mostly memory bound so we can compare with theoretical peak easier.

@antiagainst
Copy link
Collaborator

Currently, the best configuration for each of the above three strategies are in the same performance ballpark (~20us for a 4096 * 4096x4096 matvec on an AMD 7900xtx).

That's (4096 LHS + 4096 * 4096 RHS + 4096 * 4 OUTPUT) bytes / (20 * 10^(-6)) s ~= 0.84 TB/s? The theoritcal peak is 3.5 TB/s. So that's still quite far from it. So memory access is still not best. May need to dump ISA and see if there is anything suspicious and grab RGP traces to check.

@qedawkins
Copy link
Collaborator Author

Closing this in favor of #40. If any of the other strategies tried here seem relevant at a later point I will open a new PR on top.

@qedawkins qedawkins closed this Nov 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants