Add benchmark sample for vector times matrix transposed #38

qedawkins · 2023-09-14T02:49:32Z

This adds benchmarks for `vmt`, with very similar supporting structure to the existing `mmt` benchmark, but with different strategies tuned for matvec. This adds three strategies:

1. Treat it like a reduction with one workgroup per row, relying on the cache to get reuse of the vector.
2. Copy the vector to shared memory using all threads in the workgroup, then process N0 rows per workgroup, with WG_Y | N0 threadgroups.
3. Use a fixed number of workgroups, with each workgroup striding over the full problem space. This should limit the overhead of setting up the vector in shared memory and reduce scheduling overhead.

Currently, the best configurations for the three strategies above are in the same performance ballpark (~20us for a 4096 * 4096x4096 matvec on an AMD 7900xtx).
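To make the work distribution of the third strategy concrete, here is a minimal CPU-side sketch in C++. It is not the shader from this PR: `VmtReference`, `VmtGridStrided`, and the `numWorkgroups` parameter are hypothetical names, the "workgroups" run serially, and the exact row assignment in the real kernel may differ. It only illustrates the idea of a fixed number of workgroups striding over all rows, N0 rows at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference vmt: out[n] = sum over k of vec[k] * mat[n * K + k],
// with i8 inputs and i32 accumulation. `mat` is N x K, row-major.
std::vector<int32_t> VmtReference(const std::vector<int8_t>& vec,
                                  const std::vector<int8_t>& mat,
                                  size_t N, size_t K) {
  assert(vec.size() == K && mat.size() == N * K);
  std::vector<int32_t> out(N, 0);
  for (size_t n = 0; n < N; ++n)
    for (size_t k = 0; k < K; ++k)
      out[n] += int32_t(vec[k]) * int32_t(mat[n * K + k]);
  return out;
}

// Serial emulation of a grid-strided distribution (assumed layout):
// workgroup `wg` starts at row wg * N0 and advances by numWorkgroups * N0.
std::vector<int32_t> VmtGridStrided(const std::vector<int8_t>& vec,
                                    const std::vector<int8_t>& mat,
                                    size_t N, size_t K,
                                    size_t numWorkgroups, size_t N0) {
  assert(vec.size() == K && mat.size() == N * K);
  assert(numWorkgroups > 0 && N0 > 0 && N % N0 == 0);
  std::vector<int32_t> out(N, 0);
  for (size_t wg = 0; wg < numWorkgroups; ++wg) {
    // On a GPU, the vector would be staged into shared memory once here,
    // then reused for every row block this workgroup visits.
    for (size_t rowBase = wg * N0; rowBase < N; rowBase += numWorkgroups * N0) {
      for (size_t r = rowBase; r < rowBase + N0; ++r) {
        int32_t acc = 0;
        for (size_t k = 0; k < K; ++k)
          acc += int32_t(vec[k]) * int32_t(mat[r * K + k]);
        out[r] = acc;
      }
    }
  }
  return out;
}
```

Because the number of workgroups is fixed, the one-time cost of staging the vector is paid `numWorkgroups` times rather than once per row block, which is the overhead reduction the third strategy aims for.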

qedawkins · 2023-09-14T02:51:24Z

I don't see a way to assign reviewers, so @antiagainst @kuhar I am posting progress here as discussed offline.

kuhar · 2023-09-15T01:58:34Z

benchmarks/vmt/grid_strided_vmt_i8.glsl

@@ -0,0 +1,106 @@
+// Copyright 2023 Google LLC


nit: For any substantially modified file, I believe you should put your affiliation here instead of Google. For files where multiple parties made significant contributions, you can list multiple copyright lines, e.g.:

// Copyright 2020-2022 Google LLC
// Copyright 2023 Costco Inc.

Or we can switch it to "uVkCompute Authors" if that's all too annoying.

kuhar · 2023-09-15T02:27:48Z

benchmarks/vmt/grid_strided_vmt_i8.glsl

+  const uint threadID = gl_SubgroupInvocationID;
+  const uvec2 gridDim = gl_NumWorkGroups.xy;
+
+  const uint laneCount = gl_WorkGroupSize.x;


Do we rely on the subgroup size being the same as the workgroup size, or on a specific multiple of the workgroup size? If yes, I think we can set it with VK_EXT_subgroup_size_control.

kuhar · 2023-09-15T02:35:08Z

benchmarks/vmt/grid_strided_vmt_i8.glsl

+        int32_t wgResult = subgroupAdd(laneResult);
+        if (subgroupElect()) {
+          outputO.x[r] = wgResult;
+        }


Here it's not clear to me if we process all the elements because the outer loop has the trip count gl_WorkGroupSize.x * K0_VEC but we reduce only within the subgroup. Don't we need an outer loop that reduces over the whole workgroup?
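For reference, the kind of two-level reduction this question alludes to can be sketched on the CPU as follows. This is only an illustration under the assumption that a workgroup holds several subgroups; `WorkgroupReduce` and its parameters are hypothetical names, and the inner sum merely stands in for `subgroupAdd`. If the subgroup size equals WG_X, the second step is trivial and a single `subgroupAdd` already covers the row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Sums per-lane partial results across a whole workgroup in two steps.
// `laneResults` holds one value per invocation, ordered subgroup by subgroup.
int32_t WorkgroupReduce(const std::vector<int32_t>& laneResults,
                        size_t subgroupSize) {
  assert(subgroupSize > 0 && laneResults.size() % subgroupSize == 0);
  const size_t subgroupCount = laneResults.size() / subgroupSize;

  // Step 1: reduce within each subgroup (stand-in for subgroupAdd). On a GPU,
  // the elected lane of each subgroup would write this to shared memory.
  std::vector<int32_t> partial(subgroupCount, 0);
  for (size_t s = 0; s < subgroupCount; ++s) {
    auto first = laneResults.begin() + s * subgroupSize;
    partial[s] = std::accumulate(first, first + subgroupSize, int32_t{0});
  }

  // Step 2: after a workgroup barrier, combine the per-subgroup partials.
  return std::accumulate(partial.begin(), partial.end(), int32_t{0});
}
```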

kuhar · 2023-09-15T02:39:39Z

benchmarks/vmt/grid_strided_vmt_main.cc

+  for (const ShaderCode &shader : kShaderCodeCases) {
+    std::string vecmat_size = absl::StrCat(N, "x", K);
+    std::string tiling_scheme = absl::StrCat(shader.N0, "x", shader.K0);
+    BM_CHECK(isMultipleOf(N, shader.N0))
+        << "Incompatible tiling scheme: " << tiling_scheme;
+    BM_CHECK(isMultipleOf(K, shader.K0))
+        << "Incompatible tiling scheme: " << tiling_scheme;
+    BM_CHECK(isMultipleOf(shader.K0, 4))
+        << "Incompatible tiling scheme: " << tiling_scheme;
+
+    std::string workgroup_size =
+        absl::StrCat(shader.wg_size_x, "x", shader.wg_size_y, "x1");
+    std::string type_info = absl::StrCat(GetName(shader.input_type), "->",
+                                         GetName(shader.output_type));
+    std::string test_name = absl::StrCat(
+        gpu_name, "/vmt[", vecmat_size, "]/", type_info, "/", shader.name,
+        "/Workgroup", "_", shader.B0, "x", "[", workgroup_size, "]");
+    ::benchmark::RegisterBenchmark(test_name.c_str(), Vmt, device,
+                                   latency_measure, shader, N, K)
+        ->UseManualTime()
+        ->Unit(::benchmark::kMicrosecond);


Can we have one main.cc file for both variants? This is a lot of code that is hard to tell apart without running a diff. Alternatively, maybe move the common code into a header/library?

antiagainst

Overall looks pretty nice! Just a few nits. I mainly reviewed the grid_strided_vmt_i8.glsl file, and the comments there apply to the other, simpler strategies. Thanks for adding it!

antiagainst · 2023-09-14T16:29:17Z

benchmarks/vmt/grid_strided_vmt_i8.glsl

+layout(binding = 1) buffer InputB { i8vec4 x[]; } inputB;
+layout(binding = 2) buffer Output { int32_t x[]; } outputO;
+
+layout(local_size_x = WG_X, local_size_y = WG_Y, local_size_z = 1) in;


It would be nice to add a comment here mentioning that WG_X and WG_Y are macros that get their values during shader compilation.

antiagainst · 2023-09-14T16:39:04Z

benchmarks/vmt/CMakeLists.txt

+  SRC
+    "vmt_i8.glsl"
+  PERMUTATION
+    "N0=[1|2|4]"


Just to provide some background on why we have such permutations in the build system: it was to bypass limitations of driver compilers on mobile GPUs. These can and should actually be specialization constants in the kernel, but mobile GPU driver compilers used to have an issue where, if a specialization constant appears as a loop bound, the loop cannot be properly unrolled, tanking performance. That might not be a problem anymore today, but we never checked. Anyway, nothing to do here; just wanted to explain a bit.

antiagainst · 2023-09-15T03:04:59Z

benchmarks/vmt/grid_strided_vmt_main.cc

+
+  VkExtent3D dimensions1 = {uint32_t(N / 8), uint32_t(K), 1};
+  BM_CHECK_OK_AND_ASSIGN(
+      auto src_image1,


The images and samplers aren't used and should be removed.

antiagainst · 2023-09-15T03:07:54Z

benchmarks/vmt/grid_strided_vmt_i8.glsl

+
+layout(constant_id = 0) const uint N = 1;
+layout(constant_id = 1) const uint K = 1;
+


Similarly, it would be nice to explain what M, N, K, M0, N0, K0 mean, and to mention that the latter three are defined during shader compilation too.

antiagainst · 2023-09-15T03:10:01Z

benchmarks/vmt/grid_strided_vmt_i8.glsl

+
+const uint strideB = K_VEC; // Stride of the `inputB` matrix.
+
+// Each workgroup processes a total of N0 rows per iteration, therefore


It would be nice to give an overview of the algorithm at the beginning, after explaining the various macros and constant values: how we decide WG_X (subgroup size) and WG_Y (subgroup count), and how the workload is distributed (which is what this comment covers).

antiagainst · 2023-09-15T03:18:02Z

benchmarks/vmt/grid_strided_vmt_i8.glsl

+void main() {
+  const uvec2 wgID = gl_WorkGroupID.xy;
+  const uvec2 localID = gl_LocalInvocationID.xy;
+  const uint threadID = gl_SubgroupInvocationID;


threadID is just localID.x? It would be nice to just use one source to avoid confusion.

antiagainst · 2023-09-15T03:22:35Z

benchmarks/vmt/grid_strided_vmt_i8.glsl

+shared i8vec4 LHS[K_VEC]; // Shared data for the LHS.
+
+void main() {
+  const uvec2 wgID = gl_WorkGroupID.xy;


We only ever use the workgroup X dim, so it's better to assign only that to wgID.

antiagainst · 2023-09-15T03:31:50Z

benchmarks/vmt/grid_strided_vmt_i8.glsl

+        int32_t laneResult = 0;
+
+        for (uint k = 0; k < K_VEC; k += partialVec) {
+          [[unroll]] for (uint kk = 0; kk < K0_VEC; ++kk) {


Curious, do you know what kind of ISA it generates with this unrolled scalar (i8vec4=>i32) load? Does the driver compiler potentially merge them into 128-bit loads?

antiagainst · 2023-09-15T03:40:35Z

benchmarks/vmt/grid_strided_vmt_main.cc

+  }
+
+  double numOperation = double(N) * double(K) * 2.;
+  state.counters["Ops"] =


Instead of showing Ops/s, it's more meaningful to show bytes/s, given this is mostly memory bound, so we can compare with the theoretical peak more easily.
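A bytes-based counter could be derived along these lines. This is a hedged sketch: `VmtBytesMoved` and `BytesPerSecond` are hypothetical helpers, not part of this PR, and the byte count assumes each i8 input element is read exactly once and each i32 output element is written exactly once. How the value is attached to `state.counters` is left out.

```cpp
#include <cassert>
#include <cstdint>

// Minimum memory traffic for an N x K i8 vmt with i32 output:
// K bytes of vector + N * K bytes of matrix + 4 * N bytes of output.
double VmtBytesMoved(uint64_t N, uint64_t K) {
  return double(K) * sizeof(int8_t) +
         double(N) * double(K) * sizeof(int8_t) +
         double(N) * sizeof(int32_t);
}

// Achieved bandwidth in bytes per second for one kernel invocation.
double BytesPerSecond(double bytes, double seconds) {
  assert(seconds > 0.0);
  return bytes / seconds;
}
```

For the 4096x4096 case, `VmtBytesMoved(4096, 4096)` gives 16,797,696 bytes; dividing by a measured kernel time yields a number that can be compared directly against the device's peak memory bandwidth.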

antiagainst · 2023-09-15T04:01:16Z

> Currently, the best configuration for each of the above three strategies are in the same performance ballpark (~20us for a 4096 * 4096x4096 matvec on an AMD 7900xtx).

That's (4096 LHS + 4096 * 4096 RHS + 4096 * 4 OUTPUT) bytes / (20 * 10^(-6)) s ~= 0.84 TB/s? The theoretical peak is 3.5 TB/s, so that's still quite far from it, and memory access is still not optimal. We may need to dump the ISA to see if there is anything suspicious, and grab RGP traces to check.

qedawkins · 2023-11-02T13:51:06Z

Closing this in favor of #40. If any of the other strategies tried here seem relevant at a later point I will open a new PR on top.

kuhar requested review from antiagainst and kuhar September 14, 2023 04:38

kuhar reviewed Sep 15, 2023

antiagainst requested changes Sep 15, 2023

antiagainst reviewed Sep 15, 2023

kuhar mentioned this pull request Oct 25, 2023

Add vector-times-matrix-transposed benchmark (V2) #40

Merged

qedawkins closed this Nov 2, 2023
