
Low FPS issue #188

Closed
okkindel opened this issue Aug 23, 2021 · 26 comments

Comments

@okkindel
Contributor

I love this project and I always check the commits you are pushing to see what new things you are adding. What hurts is the very low frame rate I'm able to achieve on my hardware, which is not weak. I have the impression that the graphics engine generates the whole world at once; generally, something is wrong if such an old game runs at less than 20 FPS on an nvidia card. Maybe it would be possible to limit the rendering range or use other tricks to overcome this problem?

@Try
Owner

Try commented Aug 24, 2021

Hi, @okkindel !

Generally speaking, OpenGothic != Gothic when it comes to appearance:
[image]

We have:

  • ~2 km draw distance - effectively you can see the whole game world from corner to corner.
  • full life simulation - in vanilla, NPCs exist only within a proximity of 20 around the player, but in OpenGothic they always exist and do things.
  • draw quality is not the same - we have per-pixel lighting, shadows, and a procedural atmosphere.

my hardware

What is your hardware then? My HW is a GTX960M + core-i7, resulting in ~30 FPS in debug and ~60 FPS in release builds.
On my machine the pain-points are:

  1. [GPU] Atmosphere (Nishita math model) - lots of math and a full-screen blending pass
  2. [CPU] Bullet collision - provided by Bullet, no clear way to improve it significantly
  3. [CPU] Draw-call recording - while the draw calls themselves don't draw many triangles each, an average frame contains ~65k draws
    3.1 Very little BSP/Portals/Occluder data - too little for smart scene processing

Can you measure/test what slows down the game on your setup (mine is probably not very representative)? Is it CPU or GPU performance for you?

@okkindel
Contributor Author

okkindel commented Aug 25, 2021

I will try to measure it. Another thing is that I have only played debug builds so far; I will check one of the release versions.

But maybe it would be possible, for example, to limit the number of simulated NPCs to some radius around the player? That would improve the collision cost; likewise, maybe you could render only the elements visible in the camera? There is probably quite a lot of room for optimization here.

This would reduce resource consumption and have no negative impact on gameplay.

@Try
Owner

Try commented Aug 27, 2021

Profile run on latest master, with my hardware.
Scene:
[image]

CPU:
[image]

Gothic::tick - 45% (most of it is Npc::tick - 37.23%)
Renderer::draw - 33.56%
Workers::execWork - 21.67% - all threaded work

GPU:
[image]

pass                  drawcalls  time(ns)
shadowmap layer 0           903      8962
shadowmap layer 1          7768     19797
main pass                 37335     44314
transparent pass           1899     58768
transparent pass:sky          1     41800

@Try
Owner

Try commented Aug 27, 2021

maybe you could only render the elements visible in the camera?

Oh, I wish to, but how? :)
The latest attempt was HiZ culling; there is a HiZ-prototype branch for it.

for example, to limit the number of simulated npc to some radius around the player?

I wish to keep this method as a last resort, for mobile platforms (or if all else fails).

Try added a commit that referenced this issue Aug 29, 2021
* use skip-move strategy to save CPU performance on idle npc's
#188
@Try
Owner

Try commented Aug 29, 2021

Situation update on top-level:

Renderer::draw                        40.87%   all drawing
MainWindow::tick                      36.22%   game logic
WorldObjects::updateAnimation         15.18%   skeletal animation
Tempest::Detail::VSwapchain::present   6.17%   presenting (probably an issue with my laptop - should be ~0.0%)

@Eulenmensch93
Contributor

I also had a quick look into this issue.
The only thing I noticed was that in Npc::tick() we call the same method at some point and it needs a lot of CPU (MoveAlgo::implTick).
I did not get to look into the method to see what it does, but maybe it's possible to optimize something here by caching?

[image: Screenshot_20210908_173100]

A quick, maybe stupid, question...
Are there any things, like animations or draw calls, that are executed for objects behind the player and could be skipped based on the player's angle and position (basically disabling those things for everything behind the player)?

@Try
Owner

Try commented Sep 9, 2021

moveAlgo::implTick

This is what I was talking about before, when Bullet collision was mentioned. This function has rare cases like swim/fly/jump and a common one - straight movement. A straight move comes down to DynamicWorld::NpcItem::implTryMove.

There is already a caching-like optimisation inside: the physics model is discrete, with a step of 2 cm.
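The 2 cm discrete-step idea can be sketched as follows; this is an illustrative reconstruction, not OpenGothic's actual code (the name `NpcPhysicsProxy` and the unit scale are assumptions):

```cpp
// Hypothetical sketch of the discrete physics step described above: the
// expensive physics-side update runs only when the logical position has
// diverged from the last synchronized position by more than `step`.
struct NpcPhysicsProxy {
  float sx = 0, sy = 0, sz = 0;      // last position pushed to the physics world
  int   syncCount = 0;               // counts how often we actually synced

  static constexpr float step = 2.f; // "2 cm" in game units (assumed scale)

  void tick(float x, float y, float z) {
    const float dx = x - sx, dy = y - sy, dz = z - sz;
    if(dx*dx + dy*dy + dz*dz < step*step)
      return;                        // within the discrete step: nothing to do
    sx = x; sy = y; sz = z;          // here the real engine would run implTryMove
    ++syncCount;
  }
};
```

A standing or idling NPC never crosses the threshold, so its collision-update cost drops to a few subtractions per tick.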

Are there any things like animations

Animations - sure; draws - no.
Animation cannot be skipped, since in G2 the movement speed is driven by animation frames. Also animation is relatively cheap - 15.18% of overall performance (with the help of a parallel for).

@pseregiet

"full life simulation - in vanilla npc do exist only in 20 proximity of player, but in OpenGothic - they always do exist and do things"

That seems needlessly excessive, since Gothic NPCs' daily routines are not that advanced. It shouldn't make any difference if you let them "think" while you are far away. I understand increasing the original radius a bit, but making all the NPCs in the world think at all times seems pointless.

@Try
Owner

Try commented Dec 24, 2021

@pseregiet
The plan was to start with full simulation and optimize it as far as we can, while having an easy time testing.
As a next step, once the AI itself is optimized, introduce a spawn radius - to make it even faster.

Currently the AI is good enough to run at 60 fps on current-gen hardware, so the effort has shifted to draw-call optimization.

@pseregiet

@pseregiet The plan was to start with full simulation and optimize it as far as we can, while having an easy time testing. As a next step, once the AI itself is optimized, introduce a spawn radius - to make it even faster.

Currently the AI is good enough to run at 60 fps on current-gen hardware, so the effort has shifted to draw-call optimization.

I see, sounds reasonable.

@okkindel
Contributor Author

okkindel commented Dec 29, 2021

Maybe it's worth turning this NPC intelligence radius into a variable, so that you can compile or run the project with more FPS?

@Try
Owner

Try commented Feb 7, 2022

Minor update from runs on RTX:

00:26.01 : vkQueuePresentKHR = 8
00:26.03 : vkQueuePresentKHR = 5
00:26.03 : vkQueuePresentKHR = 8
00:26.05 : vkQueuePresentKHR = 8
00:26.07 : vkQueuePresentKHR = 3
00:26.07 : vkQueuePresentKHR = 3
00:26.07 : vkQueuePresentKHR = 10
00:26.08 : vkQueuePresentKHR = 5
00:26.10 : vkQueuePresentKHR = 9
00:26.11 : vkQueuePresentKHR = 3
00:26.11 : vkQueuePresentKHR = 3
00:26.12 : vkQueuePresentKHR = 9
00:26.13 : vkQueuePresentKHR = 9
00:26.15 : vkQueuePresentKHR = 3
00:26.15 : vkQueuePresentKHR = 3
00:26.15 : vkQueuePresentKHR = 9

That's quite a ridiculous amount of time for one API call. Similar timings happen on DX12 as well.
I also tried vkcube - similar problem, vkQueuePresentKHR takes ~2ms.
To me it looks very much like a driver issue so far.

@CoffeeParser

CoffeeParser commented Jun 25, 2022

Profile run on latest master, with my hardware. Scene: [image]

CPU: [image]

Gothic::tick - 45% (most of it is Npc::tick - 37.23%)
Renderer::draw - 33.56%
Workers::execWork - 21.67% - all threaded work

GPU: [image]

pass                  drawcalls  time(ns)
shadowmap layer 0           903      8962
shadowmap layer 1          7768     19797
main pass                 37335     44314
transparent pass           1899     58768
transparent pass:sky          1     41800

Hi, I'm following this project and I'd like to give some tips to help improve the rendering performance. Maybe you're already following some of the tips, anyway I hope this is useful...

Draw the skybox as the last object in the opaque geometry pass, instead of in the transparent pass, to reduce overdraw. You're currently basically drawing the biggest thing last, on top of all transparent objects. Each transparent object adds lots of overdraw - where do all these transparent objects come from, are you rendering each particle separately?

Implement view frustum culling with a maximum distance to reduce the number of drawn objects - I guess you're already doing this.
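For illustration, a minimal bounding-sphere frustum test with a draw-distance cap might look like this (a generic sketch with made-up types, not the engine's implementation):

```cpp
#include <array>

// Plane in n·p + d >= 0 form; a point is "inside" when the expression is >= 0.
struct Plane  { float nx, ny, nz, d; };
struct Sphere { float x, y, z, r; };

// Object is drawn only if its bounding sphere is within maxDist of the camera
// and not fully outside any of the six frustum planes.
bool isVisible(const Sphere& s, const std::array<Plane,6>& frustum,
               float camX, float camY, float camZ, float maxDist) {
  const float dx = s.x - camX, dy = s.y - camY, dz = s.z - camZ;
  const float reach = maxDist + s.r;
  if(dx*dx + dy*dy + dz*dz > reach*reach)
    return false;                       // beyond the draw distance
  for(const Plane& p : frustum)
    if(p.nx*s.x + p.ny*s.y + p.nz*s.z + p.d < -s.r)
      return false;                     // completely outside one plane
  return true;
}
```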

Implement a draw-call batching technique for static meshes like trees, to reduce vertex-buffer binds and draw calls. This will probably outperform occlusion culling.

Order your draw calls to minimize overdraw (calling the fragment shader for a fragment that is already filled == overdraw) and therefore reduce pixel-shader runs for opaque geometry; see this detailed post:
https://realtimecollisiondetection.net/blog/?p=86
Generally it's a good idea to do command acquisition in parallel and then sort based on a compound key using radix sort. The sort criterion for opaque and transparent geometry is the squared distance to the camera position. Squared distance doesn't need a sqrt and gets the job done.

Try to use interleaved vertex buffers for non-shadow-caster objects; the vertex-shader runs will be faster due to cache locality.

Use a different update interval for Bullet: render at 60 fps but update physics at 20 fps. Only simulate physics for things within a certain radius of the player. Maybe consider using an octree to dynamically add and remove objects from the physics world based on distance to the player.
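A decoupled update rate like this is typically driven by a fixed-timestep accumulator, roughly as below (a generic pattern, not how OpenGothic currently drives Bullet):

```cpp
// Fixed-timestep accumulator: rendering runs at whatever rate it can, while
// physics advances in fixed 1/20 s steps; leftover time carries to the next frame.
struct PhysicsClock {
  static constexpr float stepSec = 1.f/20.f;  // 20 Hz physics
  float accumulator = 0.f;

  // Returns how many fixed physics steps to run for a frame of frameDtSec.
  int advance(float frameDtSec) {
    accumulator += frameDtSec;
    int steps = 0;
    while(accumulator >= stepSec) {
      accumulator -= stepSec;
      ++steps;
    }
    return steps;
  }
};
```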

Use an octree to shrink and unshrink NPCs. ReGoth has a good description of the process and states that shrinking uses a bigger distance than unshrinking. Shrunk NPCs are placed according to their schedule. Maybe make shrinking a config option for weak target devices.
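The asymmetric radii amount to simple hysteresis; a minimal sketch (the distances and names are placeholders, not values from ReGoth):

```cpp
// Shrink/unshrink with hysteresis: because shrinkDist > unshrinkDist, an NPC
// hovering near the boundary does not flip state on every frame.
struct NpcLod {
  bool shrunk = false;

  static constexpr float unshrinkDist = 4000.f; // re-activate inside this radius
  static constexpr float shrinkDist   = 5000.f; // deactivate outside this radius

  void update(float distToPlayer) {
    if(shrunk && distToPlayer < unshrinkDist)
      shrunk = false;  // restore full simulation at the schedule position
    else if(!shrunk && distToPlayer > shrinkDist)
      shrunk = true;   // drop to schedule-only representation
  }
};
```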

For lighting it's best to use the Forward+ rendering technique. Standard forward rendering is expensive when you have multiple lights in a scene.

Profile NPC tick with Intel VTune to identify cache misses. Optimize animation processing for cache performance by making sure the alignment fits. When possible, switch over to an ECS style instead of scene objects with components.

Use SIMD for matrix-vector multiplications and vector math. I guess you're already doing that.

@Try
Owner

Try commented Jun 26, 2022

Hi, @CoffeeParser , thanks for feedback!

I'll address your proposals in groups:

Overdraw

OpenGothic uses a hybrid approach: a gbuffer pass + forward sunlight + forward for translucent stuff.
Normally there is no need to optimize gbuffer overdraw until 4k textures.
I also tested without forward sunlight - no observable performance difference on NVidia and Intel.

Drawcalls/Geometry

Generally I'm looking forward to compute-driven geometry (and mesh shaders on NV) to handle geometry. That fully solves the visibility-test issue: each meshlet tests itself against the HiZ of the landscape, while the landscape takes advantage of a z-prepass (1 dispatch to fill the z-buffer for the opaque pieces of landscape).
Ideally bindless should go on top of this solution, but so far bindless works poorly in the engine.

interleaved vertexbuffers

I should use them for shadows and raytracing, but it's hard to fit into the engine, since a drawcall looks like:
template<class V, class I> drawIndexed(const VertexBuffer<V>&, const IndexBuffer<I>&)

use a different update interval for bullet...

Nope, that won't work: Bullet is used to test NPC-to-landscape collision. So instead of optimizing it by time, OpenGothic (already there since 1.0.1311) optimizes it by space: the physical representation is updated only if the logical position diverges by more than 2 cm. Naturally this also takes care of NPCs standing still.

shrink and unshrink NPCs.

Probably only on super low-profile devices, not on PC. The Bullet optimizations made a good deal of improvement, and the next build is going to contain more NPC-related work. I think we can reach a setup where processing all NPCs at once is feasible on a current-gen CPU with no shrinking.

Forward+

I was experimenting a little, yet there is a problem: with pure Forward+ you won't have a gbuffer. No gbuffer means no SSAO/SSDO and such. And a hybrid solution is somewhat pointless - you can just use a regular gbuffer in that case.

Optimize animation processing for cache performance by making sure alignment fits.

I was planning to move it to gpu-compute.

@Try
Owner

Try commented Jun 26, 2022

Implement view frustum culling

While frustum culling is the default option and OpenGothic has it, I now have doubts:
without culling, we could start to use resubmittable command buffers and save ~30% of CPU load.
For meshlets and meshlet-like geometry there is already HiZ culling, so draw-everything won't be an issue.

@CoffeeParser

Implement view frustum culling

While frustum culling is the default option and OpenGothic has it, I now have doubts: without culling, we could start to use resubmittable command buffers and save ~30% of CPU load. For meshlets and meshlet-like geometry there is already HiZ culling, so draw-everything won't be an issue.

That sounds like the acquisition and preparation of the command buffers + frustum culling is inefficient. Are you using an octree for frustum culling, or a brute-force loop over every collider, testing whether it intersects the frustum planes?

Anyway, I doubt that having 60k+ draw calls per frame is reasonable or required. Batching static meshes together for instanced rendering would shave off lots of draw calls and improve performance tremendously. In your example scene I saw lots of grass and trees; that could become 2 draw calls - one for all the grass and one for all the trees - just by using instanced rendering.
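The batching step itself is just a grouping pass; a toy sketch (the types and single-transform-buffer layout are illustrative, not engine code):

```cpp
#include <cstdint>
#include <map>
#include <vector>

struct Mat4 { float m[16]; };                     // stand-in transform type
struct StaticObject { uint32_t meshId; Mat4 world; };

// Group static objects by mesh id; each resulting entry corresponds to one
// instanced draw (e.g. vkCmdDrawIndexed with instanceCount = transforms.size())
// instead of one draw call per object.
std::map<uint32_t, std::vector<Mat4>>
batchByMesh(const std::vector<StaticObject>& objs) {
  std::map<uint32_t, std::vector<Mat4>> batches;
  for(const StaticObject& o : objs)
    batches[o.meshId].push_back(o.world);
  return batches;
}
```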

Are you sorting opaque geometry by distance to the camera? Draw opaque objects near the camera first, then the farther objects, and lastly the terrain and skybox. In the translucent pass the sorting goes in the opposite direction, to blend overlapping transparent geometry correctly.

The best rendering performance can be achieved by combining view frustum culling + HiZ culling + instanced rendering + sorting of GPU commands before submit. The problem with HiZ culling is that you must fill a full depth buffer, drawing everything in the frustum - limiting its use on cards with low fillrate and again issuing lots of draw calls?

On low-fillrate cards it's better anyway to limit the draw distance + fog and disable shadows, or use simple circle shadows for NPCs. Same with SSAO.

Maybe it's worth rethinking the whole animation-drives-movement-speed thing, as it seems to use most of the game tick and contributes to immersion only at close distance. A simple solution would be to precalculate the walk speed per animation frame upon loading the character, and then just interpolate the walk speed without interpolating the rig, thus getting precise movement speeds without actually having to animate the whole rig. This way it becomes possible to deactivate animation calculation for NPCs out of the player's view while still maintaining animation frames and getting the same movement speed.

The optimal solution would be to provide settings for these optimizations, so that users can set things up according to their machine: full NPC simulation vs. NPC shrinking mode, draw distance, ambient occlusion, shadow quality.

@Try
Owner

Try commented Jun 27, 2022

@CoffeeParser you are analyzing an almost year-old trace. Let me do profile runs on the latest build.

CPU side

[image]
Note: same view angle, roughly the same time of day. But a newer machine - my old laptop is no more :(

So in here:

Function                               %
Renderer::draw                         35.68%
MainWindow::tick                       19.79%
Tempest::Detail::VSwapchain::present   14.56%
WorldObjects::updateAnimation          13.53%

VSwapchain::present - contacted nvidia about it; they say it's a display driver bug (they send a frame to Intel, but the Intel driver takes a nap before forwarding it to the screen).

updateAnimation - spends most of its time converting quaternions to matrices. Can be moved to the GPU with no big risk.
tick:
[image]
npcMoveSpeed is implemented as moveOffset(time+dt) - moveOffset(time), where moveOffset is a linear sampling of a prefix table, plus the loop period.
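That scheme can be sketched as follows; this is a reconstruction from the one-line description above, so the table layout and edge-case handling are assumptions:

```cpp
#include <vector>

// prefix[i] = accumulated root displacement at keyframe i; prefix.back() is the
// distance covered by one full animation loop. moveOffset(t) samples the table
// linearly, so per-tick speed is moveOffset(t+dt) - moveOffset(t) and no rig
// evaluation is needed to move the NPC.
float moveOffset(const std::vector<float>& prefix, float fps, float timeSec) {
  const int   frames  = int(prefix.size()) - 1;  // intervals per loop
  const float loopSec = frames / fps;
  const float perLoop = prefix.back();           // distance per full loop
  const int   loops   = int(timeSec / loopSec);
  const float t       = (timeSec - loops*loopSec) * fps;  // frame position
  int i = int(t);
  if(i >= frames)
    i = frames - 1;                              // guard against rounding at loop end
  const float frac = t - float(i);
  return loops*perLoop + prefix[i] + (prefix[i+1] - prefix[i])*frac;
}
```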

draw
[image]

Draw workflow:
Objects are sorted in buckets. There are 2 major types of buckets:

  1. Standard - each object shares the same mesh, textures and shaders, therefore only a single descriptor set is needed.
  2. Multi-bucket - objects share only the same shader pipeline - used for unique/non-trivial stuff.

A bucket can have up to 255 objects; memory for those is pre-allocated and managed by the bucket.
A bucket has a single VisibleSet and 255 VisibilityGroup::Tokens, one per object.

A VisibleSet is 3 lists (one per viewport) of visible-object indexes, plus a count for each list.

A visibility token is a way to decouple the drawing and visibility workloads. The VisibilityGroup knows every object's bbox, position, and static-vs-dynamic hint.

On each frame the engine resets all VisibleSets to zero size and then traverses the objects:

  • Dynamic objects - via a parallel-for
  • Static objects - a binary tree (also multi-threaded traversal)

If an object is visible: atomic_inc of VisibleSet::size[view] + write the object's index.
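The lock-free append described here can be sketched like so (sizes and member layout are modeled on the description above, not copied from the source):

```cpp
#include <atomic>
#include <cstdint>
#include <cstddef>

// Visibility workers from the parallel traversal append visible-object indexes
// with one atomic increment each - no locks, and each slot is written by
// exactly one thread.
struct VisibleSet {
  static constexpr size_t capacity = 255;  // one bucket holds up to 255 objects
  uint16_t              index[capacity];
  std::atomic<uint32_t> size{0};

  void reset() { size.store(0, std::memory_order_relaxed); }

  void push(uint16_t objIndex) {
    const uint32_t at = size.fetch_add(1, std::memory_order_relaxed);
    index[at] = objIndex;
  }
};
```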

For drawing:
A bucket can have one of 3 instancing strategies:

  1. no instancing
  2. regular - can use instancing within a continuous range of objects
  3. aggressive - can also schedule dead objects (this is valid, since memory is always allocated for the full bucket + dead objects get a zero matrix)

If hardware supports mesh shaders, regular is never used - aggressive is always favored; let HiZ do the job.

GPU side

RTX 3070, mesh shaders enabled.
[image]
Also, sorry - no debug markers; never had the time to implement them :(
From left to right:

  • HiZ pass - blue pipeline, almost invisible
  • HiZ mips - purple compute job
  • shadow layers 0,1 - orange 27.5%
  • sky_lut + fog_lut - green
  • gbuffer - red 35.4%
  • ssao - blue 20% (rayquery used)
  • lighting + translucent + sky + fog - green

It's 10k API calls. Yet there is a catch: without mesh shaders the API-call count is way worse, due to the lack of HiZ and other stuff.
Generally I'm looking forward to implementing something like an emulation layer, to have mesh shaders on any hardware and make this workflow the main one.

[writing in progress]

@Try
Owner

Try commented Jun 27, 2022

@CoffeeParser

That sounds like the acquisition and preparation of the command buffers + frustum culling is inefficient.

Writing commands into the Vulkan/DX command buffer is the bottleneck here - mostly vkCmdBind* and vkCmdDraw*.

Are you sorting opaque geometry by distance to camera?

No, that is not promising at all: almost all hardware now does tile-based rendering and writes out the result only at the end of the renderpass. Saving fragment-shader invocations is not interesting - the shading is trivial anyway.

view frustum culling + hiz culling

There is frustum culling on the CPU + HiZ on the GPU for objects + frustum culling for landscape meshlets.
On the GPU side it's never both - HiZ is fast enough and capable of an early-out when the screen-space bbox is empty.
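In simplified form, the early-out looks like this (a 1-D toy model; the real test works against a 2-D mip chain of the depth buffer):

```cpp
#include <algorithm>
#include <vector>

// Smaller depth = nearer to the camera. hizMaxDepth stores, per coarse texel,
// the farthest depth of the already-rendered occluders. An object passes only
// if its screen-space bbox is non-empty and its nearest depth could still be
// in front of an occluder somewhere in the covered range.
bool passesHiZ(int x0, int x1, float objNearDepth,
               const std::vector<float>& hizMaxDepth) {
  x0 = std::max(x0, 0);
  x1 = std::min(x1, int(hizMaxDepth.size()));
  if(x0 >= x1)
    return false;                // empty screen-space bbox: early-out
  for(int x = x0; x < x1; ++x)
    if(objNearDepth < hizMaxDepth[x])
      return true;               // possibly in front of the occluders: keep
  return false;                  // fully behind the hierarchical depth: cull
}
```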

sorting of gpu commands before submit.

  1. this is not possible in the API (neither in Vulkan nor DX nor Metal) - you can only issue commands in order
  2. even sorting 10k keys every frame is way too much work
  3. this assumes rebuilding the command buffer every frame from scratch - something we want to avoid in the future

On low fillrate cards it's anyway better to limit draw distance + fog and disable shadows or use simple circle shadows for npc

Again, that's a super-low profile. Currently on Intel the game runs at 30-40 fps (assuming no SSAO). I assume it's possible to reach 60 only by massaging the index buffers.
On the fog side: the fog calculation is quite complex to begin with and unfortunately is not fully implemented yet. And it doesn't hide anything, since the game world is too small.

Optimal solution would be to provide settings for this optimizations so that users can setup according to their machine.

Yep, at this moment only SSAO checkbox is there, but not much else.

@CoffeeParser

Thanks for going into great detail on the current implementation. 10k is already much better than the previous 50k+ draw calls. Please don't remove building your command buffer every frame. I doubt that sorting 10k longs is slow. While shading is trivial, it can add up even on tiled hardware, and sorting is worth it to let the depth test skip fragment-shader calls - worth comparing frame times. The real problem is coming up with a nice way to generate the command key. You can find a thorough explanation here: https://blog.molecular-matters.com/2014/11/06/stateless-layered-multi-threaded-rendering-part-1/

@Try
Owner

Try commented Aug 8, 2022

Some good (almost) news on performance.

Now the engine supports mesh-shader emulation for any Vulkan 1.1 hardware. This will enable algorithmic-level solutions to culling to be used not only on RTX, but practically everywhere.

In the current iteration the emulator works correctly, but uses ~128MB of scratch memory. It also still requires compiler-level optimizations.

More on that in: Try/Tempest#38 and Try/Tempest#33

@Dehela

Dehela commented Sep 18, 2022

A bit of necro-posting.

Until now I didn't bother trying to change anything in the Extended Configuration, because the game runs somewhat acceptably for me (~22 fps). I just went through the settings there and realized how heavily Cloud Shadows affects game performance, giving roughly a 2x boost (~46 fps) when that setting is off.

Is it really about shadows cast by clouds? From what I can see, visually it's more like an extra layer of shadowing, but to me it doesn't look like it's caused by clouds (the shadows seem to be static). I wonder if it can be improved. Of course, turning it off in the game settings for low-spec devices is also an option, but it doesn't improve (?) things enough to justify cutting performance in half.

@Try
Owner

Try commented Sep 18, 2022

@Dehela
This is SSAO: #224

Since the menu is driven by script, the engine has to cannibalize existing settings.
Also, yes, SSAO is a very expensive effect, especially on low-end GPUs.

@simonsample

Just wanted to "+1" this issue.

On my setup (a rather up-to-date laptop, i7-6600U, integrated graphics) I get ~4 fps, which makes the game absolutely unplayable. (All settings set to minimum/off, but that did not really make a difference.)
FYI, the original Gothic works super smoothly on my setup.
Is there any chance that you can/will improve performance on older devices enough that it is playable? Maybe with a lower resolution / graphics settings?

Anyway, this is a really exciting and awesome project, thank you for your work!

@Try
Owner

Try commented Sep 20, 2022

Hi @simonsample !

I've spent most of June/July optimizing the game for Intel UHD (the integrated GPU on my laptop, GEN11). In short - this is a very bad GPU by design :(
Intel builds hardware with low compute capability and compensates by making draw calls cheap with binning and HiZB. And there is no programmable-culling support (except, maybe, via VK_EXT_conditional_rendering).
With that setup:
Brute force doesn't work (4 fps on your setup; ~30 fps on mine).
The good stuff, like mesh shading, is not supported (but they promise to add it).
GPU compute - tested; it shows ~25 fps on my machine, limited by memory throughput. It also crashes from time to time with out-of-memory.
Basically they are attempting to brute-force things without the hardware power to do so.

On my side - I'm taking a break from optimizations now; if the memory-speed issue can be solved here, then it may just work.

@Try
Owner

Try commented Sep 20, 2022

While on the topic:

It would be nice to have a good algorithm, similar to persistent culling, that can work with alpha-tested geometry - without bindless support (bindless is semi-broken in Vulkan), and without 64-bit atomics ;)

@Try
Owner

Try commented Jan 11, 2024

For general graphics-optimization stuff we have now: #568

@Try Try closed this as completed Jan 11, 2024

7 participants