
Mesh shader emulation over draw-indirect #38

Open · 1 of 8 tasks
Try opened this issue Aug 7, 2022 · 10 comments

Comments


Try commented Aug 7, 2022

Based on #33

The initial implementation is practically working; this ticket is to track technical debt and profiling work.

TODO:

  • Lines/Points
  • test flat and other interpolators
  • Fix pollution in uvec3 gl_WorkGroupID
  • Fix uvec3 gl_NumWorkGroups - polluted due to dispatch indirect
  • Fix uvec3 gl_GlobalInvocationID - polluted, since it is a byproduct of gl_WorkGroupID
  • Control/sanitize out-of-memory case
  • perprimitiveEXT - no immediate need

ERR (won't fix):

  • Draw order is lost inside a single draw-call (not an issue for 3D)

Try commented Aug 7, 2022

Profiler view on NV (native mesh-shader is disabled):
[image]

Try added a commit that referenced this issue Aug 7, 2022
Try added a commit that referenced this issue Aug 9, 2022

Try commented Aug 9, 2022

New idea on how to avoid the scratch-buffer traffic problems (and make the solution more Intel-friendly): decouple .mesh into separate index and vertex shaders. This can be done, in most cases, if the vertex computation is a uniform function.

A uniform function, as meant here, is a function that can use only constants, locals, uniforms, read-only SSBOs and push-constants in various combinations, and has no side effects. Similar to a pure function in a way, but less restricted. This would allow moving most of the computation to the vertex shader.

The only problem is gl_WorkGroupID.x, which is used all over the place.
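
As a rough illustration (the buffer names and layout below are hypothetical, not the engine's actual code), the decoupled vertex shader would replay the uniform function verbatim, with the missing gl_WorkGroupID.x restored from a small per-vertex remap buffer written by the index/compute pass:

```glsl
#version 450

layout(push_constant) uniform Push { mat4 mvp; } push;
layout(binding = 0, std430) readonly buffer Vertices { vec4 pos[]; } vbo;
// Hypothetical side channel: for every output vertex, which mesh workgroup
// and which lane produced it; filled by the index/compute pass.
layout(binding = 1, std430) readonly buffer Remap { uvec2 groupAndLane[]; } remap;

layout(location = 0) out vec2 uv;

void main() {
  uint wg   = remap.groupAndLane[gl_VertexIndex].x; // replayed gl_WorkGroupID.x
  uint lane = remap.groupAndLane[gl_VertexIndex].y; // replayed gl_LocalInvocationID.x

  // Body of the uniform function from the .mesh shader, unchanged:
  uint vertexId = wg*32u + lane;
  gl_Position   = push.mvp * vbo.pos[vertexId];
  uv            = vbo.pos[vertexId].xy;
}
```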

Update to the strategy:

Cross-thread semantics is a big issue:
a different thread can populate different parts of gl_Position/varyings.

Initial thoughts: for any write-out store
out[exp1] = exp2;
analyze exp1 and exp2:
exp2, as said above: a constant/uniform/input expression
exp1 - a simple function of gl_LocalInvocationID
all other parts of the left side (.x etc.) - straight constants

For all possible values of gl_LocalInvocationID in [0..31), the engine should populate a mapping table, for each varying:
gl_LocalInvocationID <--> id in var[id]

If all varyings are written from the same thread, then that gl_LocalInvocationID can be written out to the index buffer.
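
A hypothetical mesh-shader fragment showing the store pattern this analysis would accept, and the cross-thread case it cannot:

```glsl
#version 450
#extension GL_EXT_mesh_shader : require
layout(local_size_x = 32) in;
layout(triangles, max_vertices = 32, max_primitives = 32) out;

layout(push_constant) uniform Push { mat4 mvp; } push;
layout(binding = 0, std430) readonly buffer Vertices { vec4 pos[]; } vbo;
layout(location = 0) out vec2 uv[];

void main() {
  SetMeshOutputsEXT(32u, 0u); // primitive output omitted for brevity
  uint i = gl_LocalInvocationID.x;

  // Accepted pattern: out[exp1] = exp2
  //   exp1 = i : a plain function of gl_LocalInvocationID
  //   exp2     : a constant/uniform/read-only-SSBO expression
  gl_MeshVerticesEXT[i].gl_Position = push.mvp * vbo.pos[i];
  uv[i]                             = vbo.pos[i].xy;
  // Mapping table over i in [0..31): lane i writes every varying of vertex i,
  // so that lane id can be emitted straight into the index buffer.

  // Problematic cross-thread pattern: a different lane fills part of the
  // same vertex, so no single lane owns vertex i any more:
  // uv[(i + 1u) & 31u] = vbo.pos[i].xy;
}
```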

Try added a commit that referenced this issue Aug 10, 2022
Try added a commit that referenced this issue Aug 18, 2022

Try commented Aug 18, 2022

Initial work on ShaderAnalyzer in 203ab2d.

Can roughly estimate the thread-mapping for vertices/varyings in simple (OpenGothic) cases.
TODO: handle more advanced control-flow instructions.

Try added a commit that referenced this issue Aug 21, 2022
Try added a commit that referenced this issue Aug 21, 2022

Try commented Aug 23, 2022

Mesh emulation is still slower than draw-call spam in the OpenGothic case.
The Hi-Z pass has become surprisingly expensive: ~1.4ms on NVIDIA, with 162k triangles in total.

Current ideas:

  1. Render only the close-most pieces into the HiZ (possible false holes)
  2. Cull the HiZ against itself
    a. Draw the close-most pieces normally, cull the others against the previous HiZ (a rough test sketch follows this list)
    b. Since the HiZ is 64x64 at most, compute-driven rasterization is possible
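
For option 2a, the visibility test itself would be the usual Hi-Z lookup; a rough sketch, assuming the HiZ keeps the farthest depth per texel (max-reduction) and UV-space bounds - the engine's actual conventions may differ:

```glsl
// Rough sketch: cull a piece against the already-drawn part of the HiZ.
layout(binding = 0) uniform sampler2D hiZ; // 64x64 at mip 0, full mip chain

bool isVisible(vec2 aabbMin, vec2 aabbMax, float nearestDepth) {
  // pick the mip where the screen-space AABB covers roughly 2x2 texels
  vec2  sizePx = (aabbMax - aabbMin) * vec2(textureSize(hiZ, 0));
  float lod    = ceil(log2(max(max(sizePx.x, sizePx.y), 1.0)));

  float d0 = textureLod(hiZ, vec2(aabbMin.x, aabbMin.y), lod).r;
  float d1 = textureLod(hiZ, vec2(aabbMax.x, aabbMin.y), lod).r;
  float d2 = textureLod(hiZ, vec2(aabbMin.x, aabbMax.y), lod).r;
  float d3 = textureLod(hiZ, vec2(aabbMax.x, aabbMax.y), lod).r;
  float farthestOccluder = max(max(d0, d1), max(d2, d3));

  // visible if the piece can be closer than everything drawn so far
  return nearestDepth <= farthestOccluder;
}
```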


Try commented Aug 27, 2022

Testing FPS on Intel UHD:

| FPS | SM | NoSM |
|---|---|---|
| DrawIndexed | 38 | 53 |
| Compute | 24 | 32 |
| Compute+Vertex | 29 | 40 |

[image]


Try commented Aug 30, 2022

More numbers (pass timings in ms):

|  | HiZ(M) | HiZ(sort) | HiZ(draw) | HiZ | SM0 | SM1 | BasePass | Transparent | FPS | Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| Nvidia native (MS) |  |  |  | 0.14 |  |  |  |  | 222 |  |
| Intel native (DI) |  |  |  |  |  |  |  |  | 30.05 |  |
| Nvidia MS0 | 0.46 | 0.53 | 0.10 | 1.66 | 0.36 | 1.21 | 5.13 | 0.8 | 75.00 | «Cake» shading |
| Intel MS0 |  |  |  |  |  |  |  |  | 14.2 |  |
| Nvidia MS-VS | 0.21 | 0.17 | 0.67 | 1.05 | 0.24 | 0.47 | 1.23 | 0.09 | 86 | VRAM, L1 pressure in draw |
| Intel MS-VS |  |  |  |  |  |  |  |  | 19.2 |  |

Try added a commit that referenced this issue Mar 2, 2023
Try added a commit that referenced this issue Mar 4, 2023
Try added a commit that referenced this issue Mar 4, 2023

Try commented Mar 5, 2023

Mesh stage is updated to EXT:
[image]

Since neither NVidia nor Intel supports compute-to-graphics overlap in the same command buffer, the new take on the runtime is:

  1. Mesh-compute populates the desc-buffer, index-length and scratch buffers. Scratch holds indices + varying data.
  2. Prefix pass - computes the first index for each draw-call; pre-initializes the indirect buffer (see the sketch after this list).
  3. Sort pass copies only the indices to the back of the scratch buffer.
  4. The scratch buffer is then used directly in the indirect draw-pass.
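
A minimal sketch of what the prefix pass could boil down to (the descriptor layout below is illustrative, not the actual one): an exclusive prefix sum over the per-meshlet index counts assigns each draw its first index and pre-initializes the indexed-indirect command:

```glsl
#version 450
layout(local_size_x = 1) in;

// Illustrative layouts; the real descriptor/scratch layout differs.
struct DrawDesc { uint indexCount; uint firstIndex; };

layout(binding = 0, std430) buffer Desc     { DrawDesc desc[]; };
layout(binding = 1, std430) buffer Indirect {
  uint indexCount; uint instanceCount; uint firstIndex; int vertexOffset; uint firstInstance;
} indirect;
layout(push_constant) uniform Push { uint meshletCount; } push;

void main() {
  // Naive single-thread exclusive scan: fine for a few thousand meshlets,
  // a parallel scan would be used for larger counts.
  uint runningTotal = 0u;
  for(uint i = 0u; i < push.meshletCount; ++i) {
    desc[i].firstIndex = runningTotal;
    runningTotal      += desc[i].indexCount;
  }
  // Pre-initialize the indirect draw command
  indirect.indexCount    = runningTotal;
  indirect.instanceCount = 1u;
  indirect.firstIndex    = 0u;
  indirect.vertexOffset  = 0;
  indirect.firstInstance = 0u;
}
```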

TODO:
  • task stage / task payload
  • nested structs in outputs


Try commented Mar 6, 2023

Hm, the task shader appears to be a way bigger problem than I expected. In a straight indirect-based workflow, one dispatchMeshTask can emit up to gl_NumWorkGroups follow-up mesh grids.
That means gl_NumWorkGroups compute-indirect commands. Not to mention: there is no clean way to pass the payload to those dispatches.

Some ideas:

  1. Inline mesh stage into task shader:
void main()
{
  // run the task stage on the first max_task_threads lanes
  if(gl_LocalInvocationID.x < max_task_threads)
    task_main();
  barrier(); // make sure that the task stage is done
  // replay every emitted mesh workgroup inside the same dispatch;
  // i plays the role of gl_WorkGroupID.x for the inlined mesh stage
  for(int i=0; i<mesh_groups; ++i)
  {
    if(gl_LocalInvocationID.x < max_mesh_threads)
      mesh_main();
  }
}

Cons: won't work reliably with inner barriers; won't be fast with a large expansion factor.

  2. Mega-dispatch-indirect
    issue only a single vkCmdDispatchIndirect, with the total group count = the sum of emitted workgroups from each task group.
    supplement it with some sort of LUT, so a mesh invocation can recover its 'real' gl_GlobalInvocationID and fetch its payload memory (sketched below)
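
A sketch of how that lookup could look on the mesh-compute side (names and layout are assumptions): the task pass records, for every emitted mesh workgroup, which task group produced it and where its payload lives; the flat gl_WorkGroupID.x of the mega dispatch is then translated back through that table:

```glsl
#version 450
layout(local_size_x = 32) in;

// Illustrative LUT written by the task pass: one entry per emitted mesh
// workgroup of the mega dispatch.
struct MeshGroupLut {
  uint taskGroupId;   // which task workgroup emitted this mesh group
  uint localGroupId;  // index of this mesh group within that emission
  uint payloadOffset; // where that task group's payload starts in the payload buffer
};

layout(binding = 0, std430) readonly buffer Lut     { MeshGroupLut lut[]; };
layout(binding = 1, std430) readonly buffer Payload { uint payload[];     };

void main() {
  // gl_WorkGroupID.x of the mega dispatch is a flat index over *all*
  // emitted mesh groups; translate it back to the emulated ids.
  MeshGroupLut e = lut[gl_WorkGroupID.x];

  uint emulatedWorkGroupId = e.localGroupId;  // what the .mesh code sees as gl_WorkGroupID.x
  uint payloadBase         = e.payloadOffset; // task payload for this group

  // ... run the translated mesh shader body with emulatedWorkGroupId and
  //     payload[payloadBase + n] in place of the task payload members ...
}
```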

Try added a commit that referenced this issue Jun 5, 2023
Try added a commit that referenced this issue Jan 1, 2024
Try added a commit that referenced this issue Jan 2, 2024

Try commented Jan 2, 2024

Task + Mesh, all emulated via GPU compute, Intel UHD:
[image]

  • Task for the GBuffer takes a lot of time. Probably due to front-end (FE) pressure (on NVidia it is FE pressure, with 2k of 1x1x1 compute jobs).
    The native task stage probably has a similar problem, just hidden via pipelining.
    In theory multi-queue can be a solution, but Intel doesn't have that.
  • Mesh-shader sort takes too big of a budget; can be fixed by pre-allocating the IBO for max_primitives upfront.
  • Cull-face should be utilized, to help with the shadow passes


Try commented Jan 3, 2024

Recent test on Intel, with a timestamp-based profiler (times in ms).

| Pass | Task | Task-lut | Mesh | Mesh-sort | Draw |
|---|---|---|---|---|---|
| HiZ occluder | 0.03 | 0.03 | 0.33 | 0.23 | 0.49 |
| GBuf all | 0.89 | 0.04 | 2.77 | 0.57 | 4.54 |
| Shadow0 | 0.11 | 0.02 | 0.19 | 0.11 | 0.83 |
| Shadow1 | 0.33 | 0.03 | 1.49 | 0.46 | 3.06 |
