
Mesh shader emulation over draw-indirect #38

Open · 1 of 8 tasks
Try opened this issue Aug 7, 2022 · 10 comments

Comments


Try commented Aug 7, 2022

Based on #33

The initial implementation is practically working; this ticket is to track technical debt and profiling work.

TODO:

  • Lines/Points
  • test flat and other interpolators
  • Fix pollution in uvec3 gl_WorkGroupID
  • Fix uvec3 gl_NumWorkGroups - polluted due to dispatch indirect
  • Fix uvec3 gl_GlobalInvocationID - polluted, since it is a byproduct of gl_WorkGroupID
  • Control/sanitize out-of-memory case
  • perprimitiveEXT - no immediate need

ERR (won't fix):

  • Draw order is lost inside a single draw-call (not an issue for 3D)

Try commented Aug 7, 2022

Profiler view on NV (native mesh-shader is disabled):
[image]

Try added a commit that referenced this issue Aug 7, 2022
Try added a commit that referenced this issue Aug 9, 2022

Try commented Aug 9, 2022

New idea on how to avoid the scratch-buffer traffic problems (and make the solution more Intel-friendly): decouple .mesh into separate index and vertex shaders. This can be done, in most cases, if the vertex computation is a uniform function.

A uniform function, as meant here, is a function that can use only constants, locals, uniforms, read-only SSBOs and push-constants in various combinations, and has no side effects. Similar to a pure function in a way, but less restricted. This would allow moving most of the computation to the vertex shader.

The only problem is gl_WorkGroupID.x, which is used all over the place.
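
As a rough illustration (the buffer names and layout below are hypothetical, not the engine's actual code), the decoupled vertex shader would replay the uniform function verbatim, with the missing gl_WorkGroupID.x restored from a small per-vertex remap buffer written by the index/compute pass:

```glsl
#version 450

layout(push_constant) uniform Push { mat4 mvp; } push;
layout(binding = 0, std430) readonly buffer Vertices { vec4 pos[]; } vbo;
// Hypothetical side channel: for every output vertex, which mesh workgroup
// and which lane produced it; filled by the index/compute pass.
layout(binding = 1, std430) readonly buffer Remap { uvec2 groupAndLane[]; } remap;

layout(location = 0) out vec2 uv;

void main() {
  uint wg   = remap.groupAndLane[gl_VertexIndex].x; // replayed gl_WorkGroupID.x
  uint lane = remap.groupAndLane[gl_VertexIndex].y; // replayed gl_LocalInvocationID.x

  // Body of the uniform function from the .mesh shader, unchanged:
  uint vertexId = wg*32u + lane;
  gl_Position   = push.mvp * vbo.pos[vertexId];
  uv            = vbo.pos[vertexId].xy;
}
```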

Update to the strategy:

Cross-thread semantics is a big issue:
a different thread can populate different parts of gl_Position/varyings.

Initial thoughts: for any write-out store
out[exp1] = exp2;
analyze exp1 and exp2:
exp2, as said above: a constant/uniform/input expression
exp1 - a simple function of gl_LocalInvocationID
all other parts of the left side (.x etc.) - straight constants

For all possible values of gl_LocalInvocationID in [0..31), the engine should populate a mapping table, for each varying:
gl_LocalInvocationID <--> id in var[id]

If all varyings are written from the same thread, then that gl_LocalInvocationID can be written out to the index buffer.
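
A hypothetical mesh-shader fragment showing the store pattern this analysis would accept, and the cross-thread case it cannot:

```glsl
#version 450
#extension GL_EXT_mesh_shader : require
layout(local_size_x = 32) in;
layout(triangles, max_vertices = 32, max_primitives = 32) out;

layout(push_constant) uniform Push { mat4 mvp; } push;
layout(binding = 0, std430) readonly buffer Vertices { vec4 pos[]; } vbo;
layout(location = 0) out vec2 uv[];

void main() {
  SetMeshOutputsEXT(32u, 0u); // primitive output omitted for brevity
  uint i = gl_LocalInvocationID.x;

  // Accepted pattern: out[exp1] = exp2
  //   exp1 = i : a plain function of gl_LocalInvocationID
  //   exp2     : a constant/uniform/read-only-SSBO expression
  gl_MeshVerticesEXT[i].gl_Position = push.mvp * vbo.pos[i];
  uv[i]                             = vbo.pos[i].xy;
  // Mapping table over i in [0..31): lane i writes every varying of vertex i,
  // so that lane id can be emitted straight into the index buffer.

  // Problematic cross-thread pattern: a different lane fills part of the
  // same vertex, so no single lane owns vertex i any more:
  // uv[(i + 1u) & 31u] = vbo.pos[i].xy;
}
```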

Try added a commit that referenced this issue Aug 10, 2022
Try added a commit that referenced this issue Aug 18, 2022

Try commented Aug 18, 2022

Initial work on ShaderAnalyzer in 203ab2d.

Can roughly estimate the thread-mapping for vertices/varyings in simple (OpenGothic) cases.
TODO: handle more advanced control-flow instructions.

Try added a commit that referenced this issue Aug 21, 2022
Try added a commit that referenced this issue Aug 21, 2022

Try commented Aug 23, 2022

Mesh emulation is still slower than draw-call spam in the OpenGothic case.
The Hi-Z pass has become surprisingly expensive: ~1.4ms on NVIDIA, with 162k triangles in total.

Current ideas:

  1. Render only the close-most pieces into the HiZ (possible false holes)
  2. Cull the HiZ against itself
    a. Draw the close-most pieces normally, cull the others against the previous HiZ (a rough test sketch follows this list)
    b. Since the HiZ is 64x64 at most, compute-driven rasterization is possible
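
For option 2a, the visibility test itself would be the usual Hi-Z lookup; a rough sketch, assuming the HiZ keeps the farthest depth per texel (max-reduction) and UV-space bounds - the engine's actual conventions may differ:

```glsl
// Rough sketch: cull a piece against the already-drawn part of the HiZ.
layout(binding = 0) uniform sampler2D hiZ; // 64x64 at mip 0, full mip chain

bool isVisible(vec2 aabbMin, vec2 aabbMax, float nearestDepth) {
  // pick the mip where the screen-space AABB covers roughly 2x2 texels
  vec2  sizePx = (aabbMax - aabbMin) * vec2(textureSize(hiZ, 0));
  float lod    = ceil(log2(max(max(sizePx.x, sizePx.y), 1.0)));

  float d0 = textureLod(hiZ, vec2(aabbMin.x, aabbMin.y), lod).r;
  float d1 = textureLod(hiZ, vec2(aabbMax.x, aabbMin.y), lod).r;
  float d2 = textureLod(hiZ, vec2(aabbMin.x, aabbMax.y), lod).r;
  float d3 = textureLod(hiZ, vec2(aabbMax.x, aabbMax.y), lod).r;
  float farthestOccluder = max(max(d0, d1), max(d2, d3));

  // visible if the piece can be closer than everything drawn so far
  return nearestDepth <= farthestOccluder;
}
```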


Try commented Aug 27, 2022

Testing FPS on Intel UHD:

| FPS | SM | NoSM |
|---|---|---|
| DrawIndexed | 38 | 53 |
| Compute | 24 | 32 |
| Compute+Vertex | 29 | 40 |

[image]


Try commented Aug 30, 2022

More numbers (pass timings in ms):

|  | HiZ(M) | HiZ(sort) | HiZ(draw) | HiZ | SM0 | SM1 | BasePass | Transparent | FPS | Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| Nvidia native (MS) |  |  |  | 0.14 |  |  |  |  | 222 |  |
| Intel native (DI) |  |  |  |  |  |  |  |  | 30.05 |  |
| Nvidia MS0 | 0.46 | 0.53 | 0.10 | 1.66 | 0.36 | 1.21 | 5.13 | 0.8 | 75.00 | «Cake» shading |
| Intel MS0 |  |  |  |  |  |  |  |  | 14.2 |  |
| Nvidia MS-VS | 0.21 | 0.17 | 0.67 | 1.05 | 0.24 | 0.47 | 1.23 | 0.09 | 86 | VRAM, L1 pressure in draw |
| Intel MS-VS |  |  |  |  |  |  |  |  | 19.2 |  |

Try added a commit that referenced this issue Mar 2, 2023
Try added a commit that referenced this issue Mar 4, 2023
Try added a commit that referenced this issue Mar 4, 2023

Try commented Mar 5, 2023

Mesh stage is updated to EXT:
[image]

Since neither NVidia nor Intel supports compute-to-graphics overlap in the same command buffer, the new take on the runtime is:

  1. Mesh-compute populates the desc-buffer, index-length and scratch buffers. Scratch holds indices + varying data.
  2. Prefix pass - computes the first index for each draw-call; pre-initializes the indirect buffer (see the sketch after this list).
  3. Sort pass copies only the indices to the back of the scratch buffer.
  4. The scratch buffer is then used directly in the indirect draw-pass.
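
A minimal sketch of what the prefix pass could boil down to (the descriptor layout below is illustrative, not the actual one): an exclusive prefix sum over the per-meshlet index counts assigns each draw its first index and pre-initializes the indexed-indirect command:

```glsl
#version 450
layout(local_size_x = 1) in;

// Illustrative layouts; the real descriptor/scratch layout differs.
struct DrawDesc { uint indexCount; uint firstIndex; };

layout(binding = 0, std430) buffer Desc     { DrawDesc desc[]; };
layout(binding = 1, std430) buffer Indirect {
  uint indexCount; uint instanceCount; uint firstIndex; int vertexOffset; uint firstInstance;
} indirect;
layout(push_constant) uniform Push { uint meshletCount; } push;

void main() {
  // Naive single-thread exclusive scan: fine for a few thousand meshlets,
  // a parallel scan would be used for larger counts.
  uint runningTotal = 0u;
  for(uint i = 0u; i < push.meshletCount; ++i) {
    desc[i].firstIndex = runningTotal;
    runningTotal      += desc[i].indexCount;
  }
  // Pre-initialize the indirect draw command
  indirect.indexCount    = runningTotal;
  indirect.instanceCount = 1u;
  indirect.firstIndex    = 0u;
  indirect.vertexOffset  = 0;
  indirect.firstInstance = 0u;
}
```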

TODO:
  • task stage / task payload
  • nested structs in outputs


Try commented Mar 6, 2023

Hm, the task shader appears to be a way bigger problem than I expected. In a straight indirect-based workflow, one dispatchMeshTask can emit up to gl_NumWorkGroups follow-up mesh grids.
That means gl_NumWorkGroups compute-indirect commands. Not to mention: there is no clean way to pass the payload to those dispatches.

Some ideas:

  1. Inline mesh stage into task shader:
void main()
{
  // run the task stage on the first max_task_threads lanes
  if(gl_LocalInvocationID.x < max_task_threads)
    task_main();
  barrier(); // make sure that the task stage is done
  // replay every emitted mesh workgroup inside the same dispatch;
  // i plays the role of gl_WorkGroupID.x for the inlined mesh stage
  for(int i=0; i<mesh_groups; ++i)
  {
    if(gl_LocalInvocationID.x < max_mesh_threads)
      mesh_main();
  }
}

Cons: won't work reliably with inner barriers; won't be fast with a large expansion factor.

  2. Mega-dispatch-indirect
    issue only a single vkCmdDispatchIndirect, with the total group count = the sum of emitted workgroups from each task group.
    supplement it with some sort of LUT, so a mesh invocation can recover its 'real' gl_GlobalInvocationID and fetch its payload memory (sketched below)
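
A sketch of how that lookup could look on the mesh-compute side (names and layout are assumptions): the task pass records, for every emitted mesh workgroup, which task group produced it and where its payload lives; the flat gl_WorkGroupID.x of the mega dispatch is then translated back through that table:

```glsl
#version 450
layout(local_size_x = 32) in;

// Illustrative LUT written by the task pass: one entry per emitted mesh
// workgroup of the mega dispatch.
struct MeshGroupLut {
  uint taskGroupId;   // which task workgroup emitted this mesh group
  uint localGroupId;  // index of this mesh group within that emission
  uint payloadOffset; // where that task group's payload starts in the payload buffer
};

layout(binding = 0, std430) readonly buffer Lut     { MeshGroupLut lut[]; };
layout(binding = 1, std430) readonly buffer Payload { uint payload[];     };

void main() {
  // gl_WorkGroupID.x of the mega dispatch is a flat index over *all*
  // emitted mesh groups; translate it back to the emulated ids.
  MeshGroupLut e = lut[gl_WorkGroupID.x];

  uint emulatedWorkGroupId = e.localGroupId;  // what the .mesh code sees as gl_WorkGroupID.x
  uint payloadBase         = e.payloadOffset; // task payload for this group

  // ... run the translated mesh shader body with emulatedWorkGroupId and
  //     payload[payloadBase + n] in place of the task payload members ...
}
```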

Try added a commit that referenced this issue Jun 5, 2023
Try added a commit that referenced this issue Jan 1, 2024
Try added a commit that referenced this issue Jan 2, 2024

Try commented Jan 2, 2024

Task + Mesh, all emulated via GPU compute, Intel UHD:
[image]

  • Task for the GBuffer takes a lot of time. Probably due to front-end (FE) pressure (on NVidia it is FE pressure, with 2k of 1x1x1 compute jobs).
    The native task stage probably has a similar problem, just hidden via pipelining.
    In theory multi-queue can be a solution, but Intel doesn't have that.
  • Mesh-shader sort takes too big of a budget; can be fixed by pre-allocating the IBO for max_primitives upfront.
  • Cull-face should be utilized, to help with the shadow passes


Try commented Jan 3, 2024

Recent test on Intel, with a timestamp-based profiler (times in ms).

| Pass | Task | Task-lut | Mesh | Mesh-sort | Draw |
|---|---|---|---|---|---|
| HiZ occluder | 0.03 | 0.03 | 0.33 | 0.23 | 0.49 |
| GBuf all | 0.89 | 0.04 | 2.77 | 0.57 | 4.54 |
| Shadow0 | 0.11 | 0.02 | 0.19 | 0.11 | 0.83 |
| Shadow1 | 0.33 | 0.03 | 1.49 | 0.46 | 3.06 |
