zzre: Allocation-less colliders #371

Merged
merged 50 commits into from
Oct 20, 2024

Helco
Owner

@Helco Helco commented Aug 20, 2024

As discovered in #368 (and #313), the colliders are heavily allocating components due to many uses of LINQ and generator methods. Unfortunately it does not seem like we will get value generator methods in C# anytime soon, so we have to write manual enumerator structs to reduce memory allocations.
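As a rough illustration of what a manual enumerator struct looks like, here is a minimal sketch. `BelowThreshold` and everything inside it are hypothetical names made up for this example and are not taken from zzre. It relies on `foreach` binding to `GetEnumerator`/`MoveNext`/`Current` by pattern, so no `IEnumerable<T>` and no heap-allocated iterator object is involved.

```csharp
using System;

// Hypothetical example type (not from zzre): enumerates all values in a span
// that are below a threshold without allocating an iterator object.
public readonly ref struct BelowThreshold
{
    private readonly ReadOnlySpan<float> values;
    private readonly float threshold;

    public BelowThreshold(ReadOnlySpan<float> values, float threshold)
    {
        this.values = values;
        this.threshold = threshold;
    }

    // foreach finds this method by pattern; the enumerator is never boxed.
    public Enumerator GetEnumerator() => new Enumerator(values, threshold);

    public ref struct Enumerator
    {
        private readonly ReadOnlySpan<float> values;
        private readonly float threshold;
        private int index;

        internal Enumerator(ReadOnlySpan<float> values, float threshold)
        {
            this.values = values;
            this.threshold = threshold;
            index = -1;
        }

        public float Current => values[index];

        public bool MoveNext()
        {
            // The state a compiler-generated iterator would keep on the heap
            // lives in this struct's fields instead.
            while (++index < values.Length)
            {
                if (values[index] < threshold)
                    return true;
            }
            return false;
        }
    }
}
```

A caller would write `foreach (var v in new BelowThreshold(data, 4f)) { ... }` just as with a generator method; the cost is that the enumeration logic has to be written as an explicit state machine.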

For sorting intersections we might want to also look into cached lists as well as cached stacks inside the enumerators or have intersections (instead of raycasts) always write into a sorted list.

For testing we can use the TestRaycaster but it should be possible to have both implementations side-by-side and (behind a compiler flag) run them both, expecting the exact same results.

  • TreeCollider
  • WorldCollider
  • HumanPhysics
  • LensFlare
  • FairyHoverBehind
  • FairyPhysics
  • FindActorFloorCollisions
  • WorldViewer
  • (search for additional collider usages)

Review todos:

  • PrefixSums instead of SumSums
  • Remove baseline assembly and benchmark project
  • Tests for StackOverSpan
  • Impl and tests for ListOverSpan, maybe PooledList (TemporaryList? FixedTemporaryList? Inline-array?)
  • Use same collider for geometry instances (some other time)
  • Mass test executable to check world loading and collisions (also similar mass tests) (in different branch, will be rebased and merged after this PR)
  • Make IEnumerable Intersections harder to call
  • Replace line intersections with raycasts wherever possible
  • Add proper link in Triangle.ClosestPoint
  • Last benchmark results against baseline
  • Compare GC profile against baseline

Final benchmark results


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.5011/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15

Intersections
Method Mean Error StdDev Median Ratio Gen0 Allocated Alloc Ratio
IntersectionsBaseline 10.731 ms 0.0162 ms 0.0826 ms 10.711 ms 1.00 640.6250 6827633 B 1.000
IntersectionsList 4.273 ms 0.0094 ms 0.0486 ms 4.281 ms 0.40 - 8 B 0.000

Raycasts
Method Mean Error StdDev Ratio Gen0 Allocated Alloc Ratio
Baseline 33.15 ms 0.014 ms 0.070 ms 1.00 62.5000 814483 B 1.000
Merged 13.07 ms 0.004 ms 0.020 ms 0.39 - 7 B 0.000

GC Profile comparison

The GC profiler shows that TreeCollider was the main cause of per-frame allocations but also that we have quite a way to go for zero allocations per-frame.

image

Helco commented Aug 25, 2024

First benchmark results

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun IterationCount=100 LaunchCount=3
MaxIterationCount=1000 MinIterationCount=10 WarmupCount=15

Method Mean Error StdDev Median Ratio Gen0 Allocated Alloc Ratio
IntersectionsGenerator 9.125 ms 0.0115 ms 0.0588 ms 9.121 ms 1.00 140.6250 1475.41 KB 1.00
IntersectionsList 8.780 ms 0.0081 ms 0.0409 ms 8.773 ms 0.96 93.7500 989.76 KB 0.67
IntersectionsStruct 8.306 ms 0.0063 ms 0.0319 ms 8.303 ms 0.91 62.5000 774.8 KB 0.53
IntersectionsTaggedUnion 8.498 ms 0.0122 ms 0.0633 ms 8.486 ms 0.93 62.5000 774.8 KB 0.53

Of course not depicted in the performance benchmarks is the code quality: IntersectionsStruct has a horrible API that bleeds into all consumers.

Also the allocation-lessing is obviously not complete: the split stacks are still allocated per query and should either be fixed-size for a ridiculous tree size or pooled for amortization. For the next benchmark I will try to preserve the actual status quo as baseline, while this amortization will also be applied to a new generator-based method.

Helco commented Aug 26, 2024

Results with amortized split stacks

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3  
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15  

Method Mean Error StdDev Median Ratio Gen0 Allocated Alloc Ratio
IntersectionsBaseline 8.961 ms 0.0116 ms 0.0598 ms 8.951 ms 1.00 62.5000 782732 B 1.000
IntersectionsGenerator 8.969 ms 0.0082 ms 0.0427 ms 8.970 ms 1.00 62.5000 686644 B 0.877
IntersectionsList 9.131 ms 0.0085 ms 0.0442 ms 9.129 ms 1.02 15.6250 277140 B 0.354
IntersectionsStruct 8.591 ms 0.0080 ms 0.0414 ms 8.590 ms 0.96 - 12 B 0.000
IntersectionsTaggedUnion 8.505 ms 0.0113 ms 0.0587 ms 8.496 ms 0.95 - 12 B 0.000

Still a bit curious why IntersectionsList is both slower (with supposedly less branching) and allocates per intersection. The struct enumerators have an allocation, but that might be caused by the benchmark and not by the intersection query.
(Also baseline is not correct as I forgot to revert the amortization on the atomic layer)

Helco commented Aug 26, 2024

With Baseline corrected and the power of just removing coarse intersection tests entirely (let's just not care about out-of-bounds right?) we have no allocations for all three variants we would expect to have no allocations (minus amortization).

And we are still spending 20% less runtime.

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3  
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15  

Method Mean Error StdDev Median Ratio Gen0 Allocated Alloc Ratio
IntersectionsBaseline 10.642 ms 0.0141 ms 0.0731 ms 10.624 ms 1.00 640.6250 6830460 B 1.000
IntersectionsGenerator 10.532 ms 0.0092 ms 0.0476 ms 10.531 ms 0.99 562.5000 5991292 B 0.877
IntersectionsList 8.342 ms 0.0072 ms 0.0374 ms 8.343 ms 0.78 - 12 B 0.000
IntersectionsStruct 8.585 ms 0.0071 ms 0.0366 ms 8.585 ms 0.81 - 12 B 0.000
IntersectionsTaggedUnion 8.384 ms 0.0112 ms 0.0583 ms 8.372 ms 0.79 - 12 B 0.000

Helco commented Aug 29, 2024

Now we fix the baseline as a separate assembly, because I want to tackle some more shared code within zzre.core.
Starting with plastering most of the math functions with AggressiveInlining | AggressiveOptimization after observing that the JITted assembly is abysmal for hot-loop methods.
Then we can see that Triangle.ClosestPoint(Vector3), which is responsible for all end-stage math in most intersection queries (those using Sphere as primitive), uses a non-optimal implementation, so we replace it entirely. The new implementation apparently has some other behavior (probably in extreme or special cases) but gameplay seems to still work and

the benchmark results warrant taking that risk
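
For reference, this is roughly how the two `MethodImplOptions` flags mentioned above are applied to a math helper. `MathSketch.DistanceSquared` is a hypothetical function made up for this sketch, not one of the zzre.core methods. Both flags are only hints to the JIT, so any effect should be confirmed by benchmarks like the ones in this thread.

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;

// Hypothetical helper, only meant to show the attribute usage.
public static class MathSketch
{
    // AggressiveInlining asks the JIT to inline the call if possible;
    // AggressiveOptimization asks it to fully optimize the method right away
    // instead of going through tiered compilation.
    [MethodImpl(MethodImplOptions.AggressiveInlining | MethodImplOptions.AggressiveOptimization)]
    public static float DistanceSquared(Vector3 a, Vector3 b)
    {
        Vector3 delta = a - b;
        return Vector3.Dot(delta, delta);
    }
}
```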

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15

Method Mean Error StdDev Median Ratio RatioSD Gen0 Allocated Alloc Ratio
IntersectionsBaseline 10.713 ms 0.0379 ms 0.1897 ms 10.648 ms 1.00 0.02 640.6250 6827628 B 1.000
IntersectionsGenerator 7.621 ms 0.0401 ms 0.2065 ms 7.672 ms 0.71 0.02 570.3125 5988646 B 0.877
IntersectionsBaselineList 8.343 ms 0.0142 ms 0.0740 ms 8.320 ms 0.78 0.02 - 12 B 0.000
IntersectionsList 5.170 ms 0.0043 ms 0.0224 ms 5.169 ms 0.48 0.01 - 6 B 0.000
IntersectionsStruct 5.474 ms 0.0190 ms 0.0963 ms 5.427 ms 0.51 0.01 - 6 B 0.000
IntersectionsBaselineTaggedUnion 8.443 ms 0.0089 ms 0.0461 ms 8.440 ms 0.79 0.01 - 12 B 0.000
IntersectionsTaggedUnion 5.098 ms 0.0052 ms 0.0269 ms 5.095 ms 0.48 0.01 - 6 B 0.000

Helco commented Aug 29, 2024

And now with the KD optimization

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15

Method Mean Error StdDev Median Ratio Gen0 Allocated Alloc Ratio
IntersectionsBaseline 10.543 ms 0.0182 ms 0.0945 ms 10.532 ms 1.00 640.6250 6827628 B 1.000
IntersectionsGenerator 7.447 ms 0.0112 ms 0.0582 ms 7.450 ms 0.71 570.3125 5988646 B 0.877
IntersectionsBaselineList 8.323 ms 0.0051 ms 0.0266 ms 8.324 ms 0.79 - 12 B 0.000
IntersectionsList 5.220 ms 0.0083 ms 0.0432 ms 5.233 ms 0.50 - 6 B 0.000
IntersectionsListKD 3.915 ms 0.0030 ms 0.0156 ms 3.914 ms 0.37 - 6 B 0.000
IntersectionsStruct 5.407 ms 0.0051 ms 0.0265 ms 5.401 ms 0.51 - 6 B 0.000
IntersectionsBaselineTaggedUnion 8.474 ms 0.0090 ms 0.0467 ms 8.464 ms 0.80 - 12 B 0.000
IntersectionsTaggedUnion 5.129 ms 0.0050 ms 0.0258 ms 5.126 ms 0.49 - 6 B 0.000

Helco commented Sep 2, 2024

Now we merge the two levels of kd-trees into a single structure, which brings just a bit of performance (getting us to exactly 3x faster) but should also simplify some API stuff, so maybe looking into the struct enumerator might be worthwhile again.

Not much but it is there

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15

Method Mean Error StdDev Median Ratio Gen0 Allocated Alloc Ratio
IntersectionsBaseline 10.519 ms 0.0143 ms 0.0728 ms 10.506 ms 1.00 640.6250 6827628 B 1.000
IntersectionsGenerator 7.438 ms 0.0068 ms 0.0354 ms 7.433 ms 0.71 570.3125 5988646 B 0.877
IntersectionsBaselineList 8.365 ms 0.0200 ms 0.1042 ms 8.338 ms 0.80 - 12 B 0.000
IntersectionsList 5.229 ms 0.0052 ms 0.0267 ms 5.231 ms 0.50 - 6 B 0.000
IntersectionsListKD 3.887 ms 0.0049 ms 0.0256 ms 3.884 ms 0.37 - 6 B 0.000
IntersectionsListKDMerged 3.466 ms 0.0033 ms 0.0170 ms 3.464 ms 0.33 - 3 B 0.000
IntersectionsStruct 5.421 ms 0.0054 ms 0.0283 ms 5.422 ms 0.52 - 6 B 0.000
IntersectionsBaselineTaggedUnion 8.458 ms 0.0083 ms 0.0434 ms 8.454 ms 0.80 - 12 B 0.000
IntersectionsTaggedUnion 5.266 ms 0.0081 ms 0.0416 ms 5.263 ms 0.50 - 6 B 0.000

Also I should probably clean up a bit: both the math optimization and the KD optimization have proven themselves and we no longer need to run them every time.
Meaning: every test except the baseline ones will get KD, just the suffix is not kept.

Helco commented Sep 8, 2024

All benchmarks should now have the KD optimization. I also checked the differences in results, which seem to exist only between Baseline and Current and are due to the triangle-sphere intersection. These differences seem to point to erroneous behaviour of the old one, so I will let that slide.

And the benchmark results

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15

Method Mean Error StdDev Ratio RatioSD Gen0 Allocated Alloc Ratio
IntersectionsBaseline 11.115 ms 0.0337 ms 0.1743 ms 1.00 0.02 640.6250 6827628 B 1.000
IntersectionsList 4.086 ms 0.0055 ms 0.0282 ms 0.37 0.01 - 6 B 0.000
IntersectionsListKDMerged 3.609 ms 0.0067 ms 0.0347 ms 0.32 0.01 - 3 B 0.000
IntersectionsStruct 4.108 ms 0.0051 ms 0.0261 ms 0.37 0.01 - - 0.000
IntersectionsTaggedUnion 4.890 ms 0.0065 ms 0.0335 ms 0.44 0.01 - 6 B 0.000

Helco commented Sep 8, 2024

While writing the MergedCollider I asked myself whether the memory layout of the full split array would affect performance.

So enjoy this one-off benchmark

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10

Method Mean Error StdDev Median Ratio Gen0 Allocated Alloc Ratio
IntersectionsBaseline 11.184 ms 0.0465 ms 0.0681 ms 11.180 ms 1.00 640.6250 6827623 B 1.000
IntersectionsList 4.120 ms 0.0352 ms 0.0527 ms 4.115 ms 0.37 - 6 B 0.000
IntersectionsListKDMergedDF1 3.609 ms 0.0387 ms 0.0580 ms 3.605 ms 0.32 - 3 B 0.000
IntersectionsListKDMergedDF2 3.666 ms 0.0300 ms 0.0449 ms 3.665 ms 0.33 - 2 B 0.000
IntersectionsListKDMergedBF 3.632 ms 0.0382 ms 0.0560 ms 3.602 ms 0.32 - - 0.000
IntersectionsStruct 4.144 ms 0.0309 ms 0.0453 ms 4.131 ms 0.37 - 6 B 0.000
IntersectionsTaggedUnion 5.082 ms 0.0761 ms 0.1091 ms 5.082 ms 0.45 - 6 B 0.000

The answer: Not really, any difference here is pretty near the threshold of error... So let's go with the simplest one.

Helco commented Sep 12, 2024

Finally the SIMD (two-split) benchmarks are in with three-split being scribbled up.

Let's see how it turned out

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10

Method Mean Error StdDev Ratio Gen0 Allocated Alloc Ratio
IntersectionsBaseline 10.593 ms 0.0500 ms 0.0717 ms 1.00 640.6250 6827628 B 1.000
IntersectionsList 4.010 ms 0.0196 ms 0.0293 ms 0.38 - 6 B 0.000
IntersectionsListKDMerged 3.563 ms 0.0170 ms 0.0249 ms 0.34 - 3 B 0.000
IntersectionsListKDMergedInty 3.579 ms 0.0054 ms 0.0081 ms 0.34 - 3 B 0.000
IntersectionsStruct 4.058 ms 0.0102 ms 0.0152 ms 0.38 - 6 B 0.000
IntersectionsTaggedUnion 4.906 ms 0.0348 ms 0.0488 ms 0.46 - 6 B 0.000
IntersectionsSIMD128MoreBranches 3.472 ms 0.0083 ms 0.0124 ms 0.33 - 3 B 0.000
IntersectionsSIMD128 3.805 ms 0.0061 ms 0.0090 ms 0.36 - 3 B 0.000
IntersectionsSIMD256 3.772 ms 0.0045 ms 0.0067 ms 0.36 - 3 B 0.000

Oh well, this is surprisingly bad :) I probably still want to try the three-split one just for good measure, but we can already see that the branch reduction is not helpful and, if it does help performance, it is a minuscule benefit.

Helco commented Sep 12, 2024

And here are the results for the SIMD512 three-split collider. Because we have more loop iterations I also re-added the less-branching variant for the new benchmark.

I usually look at the results only after posting them here, so I cannot tell what hides under here...

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10

Method Mean Error StdDev Ratio Gen0 Allocated Alloc Ratio
IntersectionsBaseline 10.551 ms 0.0273 ms 0.0392 ms 1.00 640.6250 6827628 B 1.000
IntersectionsList 4.012 ms 0.0233 ms 0.0349 ms 0.38 - 6 B 0.000
IntersectionsListKDMerged 3.581 ms 0.0136 ms 0.0199 ms 0.34 - 3 B 0.000
IntersectionsStruct 4.056 ms 0.0121 ms 0.0182 ms 0.38 - 6 B 0.000
IntersectionsTaggedUnion 4.795 ms 0.0142 ms 0.0213 ms 0.45 - 6 B 0.000
IntersectionsSIMD128 3.462 ms 0.0080 ms 0.0117 ms 0.33 - 3 B 0.000
IntersectionsSIMD256 3.555 ms 0.0157 ms 0.0235 ms 0.34 - 3 B 0.000
IntersectionsSIMD512 3.628 ms 0.0098 ms 0.0144 ms 0.34 - 3 B 0.000
IntersectionsSIMD512LB 3.850 ms 0.0115 ms 0.0169 ms 0.36 - 3 B 0.000

The answer: not very much, we have again a minimal performance benefit of the SIMD128 two-split collider but anything higher performs worse and is naturally more complex.
At this point I might scrap SIMD altogether for this usecase unless I have another idea for this. If I get crazy I might try attaching Intel VTune for example and look whether the SIMD ones have some solvable problem.

Just as a text note without further benchmark results: I tested an SoA variant of the SIMD128 with no discernible difference in performance. VTune showed a major bottleneck to be branch mispredictions, especially in the leaf Triangle-Sphere intersection test, which I guess is to be expected (if we reasonably knew the outcome we would not have to ask this very question), so I can see no obvious fault in the algorithm at the microarchitecture level.

Helco commented Sep 14, 2024

I am almost at the end of the Intersections method, with the winner being the MergedCollider combined with one of the simpler variants like List, Struct or TaggedUnion. I had yet to benchmark the latter two in the merged collider,

so here are the numbers for that

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10

Method Mean Error StdDev Ratio RatioSD Gen0 Allocated Alloc Ratio
IntersectionsBaseline 11.121 ms 0.0906 ms 0.1356 ms 1.00 0.02 640.6250 6827628 B 1.000
IntersectionsListKDMerged 3.730 ms 0.0356 ms 0.0522 ms 0.34 0.01 - 3 B 0.000
IntersectionsStructMerged 3.747 ms 0.0094 ms 0.0140 ms 0.34 0.00 - 3 B 0.000
IntersectionsTaggedUnionMerged 4.044 ms 0.0176 ms 0.0258 ms 0.36 0.00 - 6 B 0.000

These numbers again show: simpler is better, so I will leave it at that. We can still cheat for one actual usecase in the game, where a line intersection is equivalent to a raycast. But for the other usecases (especially physics) we would need to incorporate more usecase-specific operations in order to allow for optimizations (e.g. filter by a product in order to reduce to a single-nearest-neighbor search). I am currently not inclined to do that.

I would still like to roughly benchmark raycasts, making sure that they do not allocate, and maybe try out a couple of variants. After that I will wrap up this PR by putting the experiments into a backup branch and applying the winning variants to the production game.
It would probably be nice to then have a comparison of GC behavior, but that will have to wait a bit yet again.

Helco commented Sep 15, 2024

Initial benchmark for raycasts, the results are worrying. A lot of allocations (which could be easily amortized though) and troublesome performance. I should add a benchmark with a sorted line intersection and also definitely figure out why the merged collider is so much worse than the other two. This is surprising.


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10

Method Mean Error StdDev Ratio Gen0 Allocated Alloc Ratio
Baseline 32.58 ms 0.111 ms 0.159 ms 1.00 62.5000 795.37 KB 1.00
SimpleOptimizations 27.32 ms 0.053 ms 0.079 ms 0.84 62.5000 795.35 KB 1.00
Merged 47.89 ms 0.102 ms 0.153 ms 1.47 - 148.5 KB 0.19

Helco commented Sep 17, 2024

Let's get unsurprised here with an easy one first:

Line intersections are not faster than ray casts

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10

Method Mean Error StdDev Ratio RatioSD Gen0 Allocated Alloc Ratio
Baseline 33.06 ms 0.052 ms 0.078 ms 1.00 0.00 62.5000 814462 B 1.000
LineIntersectionsWorld 368.22 ms 4.566 ms 6.835 ms 11.14 0.21 - 736 B 0.001
LineIntersectionsMerged 231.64 ms 1.979 ms 2.900 ms 7.01 0.09 - 245 B 0.000
SimpleOptimizations 27.52 ms 0.072 ms 0.106 ms 0.83 0.00 62.5000 814439 B 1.000
Merged 47.58 ms 0.119 ms 0.178 ms 1.44 0.01 - 152067 B 0.187

This can be attributed to intersection queries having to always return all intersections, while raycasts can exit out as soon as there cannot be a closer hit.
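
The difference can be sketched with a toy model. This is an assumption-laden illustration, not the zzre traversal: candidates are assumed to be sorted by the distance at which the ray enters their bounds, which is a lower bound for any hit inside them, and a miss is represented as NaN. All names are hypothetical.

```csharp
using System;

public static class EarlyExitSketch
{
    // entryDistances: ascending lower bounds per candidate (hypothetical input).
    // hitDistances: exact hit distance per candidate, NaN for a miss.
    // Returns the closest hit (PositiveInfinity if none) and how many
    // candidates actually had to be tested.
    public static (float Closest, int Tested) Raycast(
        ReadOnlySpan<float> entryDistances,
        ReadOnlySpan<float> hitDistances)
    {
        float closest = float.PositiveInfinity;
        int tested = 0;
        for (int i = 0; i < entryDistances.Length; i++)
        {
            // No later candidate can contain a closer hit, so stop here.
            // An intersection query has no such bound and must visit everything.
            if (entryDistances[i] >= closest)
                break;
            tested++;
            if (hitDistances[i] < closest) // false for NaN, so misses are skipped
                closest = hitDistances[i];
        }
        return (closest, tested);
    }
}
```

In this model a raycast over four candidates can stop after two of them once a hit closer than the third candidate's entry distance is known, while collecting all intersections always costs four tests.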

Helco commented Sep 17, 2024

A one-off benchmark before I have to use a profiler again: At some point I added a SSE 4.1 version of Triangle.Barycentric but never benchmarked it (FOR SHAME!), so here is a benchmark with scalar, explicit sse 4.1 and SIMD128 versions:

I should have benchmarked it earlier

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15

Method Mean Error StdDev Median Ratio RatioSD Gen0 Allocated Alloc Ratio
Baseline 33.18 ms 0.020 ms 0.101 ms 33.17 ms 1.00 0.00 62.5000 795.37 KB 1.00
SimpleOptimizations 27.65 ms 0.016 ms 0.085 ms 27.65 ms 0.83 0.00 62.5000 795.35 KB 1.00
MergedScalar 44.45 ms 0.043 ms 0.221 ms 44.40 ms 1.34 0.01 - 148.5 KB 0.19
MergedSse41 48.13 ms 0.066 ms 0.339 ms 48.23 ms 1.45 0.01 - 148.5 KB 0.19
MergedSIMD128 46.98 ms 0.362 ms 1.872 ms 48.20 ms 1.42 0.06 - 148.5 KB 0.19

We also have multi-modal distributions and vastly different results in MediumRun benchmarks, so in summary I would say: no use for either implementation, just scalar should be fine.

EDIT: Another benchmark not worthy of uploading is trying to just disable the degeneration test. We can do that during merging and save the test for the raycasts, but that is a tiny improvement over the current state. Profiler comparison it is.

Helco commented Sep 17, 2024

Also not uploading: adding MIOptions makes casting almost twice as slow. Just adding AggressiveOptimization (without inlining) is better

and here are the results for that

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10

Method Mean Error StdDev Median Ratio RatioSD Gen0 Allocated Alloc Ratio
Baseline 33.67 ms 0.478 ms 0.671 ms 33.17 ms 1.00 0.03 62.5000 795.37 KB 1.00
SimpleOptimizations 23.11 ms 0.051 ms 0.075 ms 23.09 ms 0.69 0.01 62.5000 795.35 KB 1.00
Merged 39.55 ms 0.148 ms 0.221 ms 39.49 ms 1.18 0.02 - 148.49 KB 0.19

But merged is still slower than baseline. The profiler unfortunately did not tell me much, so a bit of guesswork it is: I am working on an iterative version.

Helco commented Sep 20, 2024

Oh well. The first iterative raycast and... at least it is parity in performance with baseline and in allocation with merged?

Overall still bad

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10

Method Mean Error StdDev Ratio Gen0 Allocated Alloc Ratio
Baseline 33.15 ms 0.050 ms 0.074 ms 1.00 62.5000 795.37 KB 1.00
SimpleOptimizations 23.19 ms 0.196 ms 0.287 ms 0.70 62.5000 795.35 KB 1.00
Merged 39.49 ms 0.147 ms 0.216 ms 1.19 - 148.47 KB 0.19
MergedIterative 32.05 ms 0.056 ms 0.080 ms 0.97 - 148.48 KB 0.19

The allocations are due to the coarse check, in particular casting against a box allocates at the moment. I would still like to see better numbers for the casting itself.

Helco commented Sep 22, 2024

Now we are getting somewhere. A new iterative variant using additional subtree-elimination reaches allllllmost the WorldCollider+Recursive Cast. That it is not faster is still beyond me.

The results

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10

Method Mean Error StdDev Median Ratio RatioSD Gen0 Allocated Alloc Ratio
Baseline 33.82 ms 0.457 ms 0.670 ms 34.32 ms 1.00 0.03 66.6667 814465 B 1.000
SimpleOptimizations 23.03 ms 0.043 ms 0.063 ms 23.03 ms 0.68 0.01 62.5000 814439 B 1.000
Merged 39.42 ms 0.088 ms 0.126 ms 39.41 ms 1.17 0.02 - 152057 B 0.187
MergedIterative 31.72 ms 0.073 ms 0.109 ms 31.70 ms 0.94 0.02 - 46 B 0.000
MergedRW 25.25 ms 0.033 ms 0.048 ms 25.25 ms 0.75 0.01 - 23 B 0.000

Helco commented Sep 22, 2024

I omitted the break-even and just continued. In instrumentation profiles I saw Stack<T> operations taking an unusually high percentage of the runtime, so as a test I replaced it with a StackOverSpan<T> variant that uses ArrayPool<...>.Shared as backing memory.
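
For readers unfamiliar with the idea, a stack over pooled memory might look roughly like the following. This is a sketch under assumptions: `SpanStack<T>` and `SpanStackDemo` are hypothetical names, and the actual StackOverSpan<T> in the PR may differ in API and behavior. Only `ArrayPool<T>.Shared.Rent` and `Return` are real library calls.

```csharp
using System;
using System.Buffers;

// Sketch only: a fixed-capacity stack over a caller-provided Span<T>.
public ref struct SpanStack<T>
{
    private readonly Span<T> items;
    private int count;

    public SpanStack(Span<T> backing)
    {
        items = backing;
        count = 0;
    }

    public readonly int Count => count;

    public void Push(T item)
    {
        if (count == items.Length)
            throw new InvalidOperationException("Stack is full.");
        items[count++] = item;
    }

    public bool TryPop(out T item)
    {
        if (count == 0)
        {
            item = default!;
            return false;
        }
        item = items[--count];
        return true;
    }
}

public static class SpanStackDemo
{
    // Sums 1..n by pushing and popping, with the backing array rented from
    // the shared pool so that steady-state calls do not allocate.
    public static int SumViaStack(int n)
    {
        int[] rented = ArrayPool<int>.Shared.Rent(n);
        try
        {
            var stack = new SpanStack<int>(rented.AsSpan(0, n));
            for (int i = 1; i <= n; i++)
                stack.Push(i);

            int sum = 0;
            while (stack.TryPop(out int value))
                sum += value;
            return sum;
        }
        finally
        {
            ArrayPool<int>.Shared.Return(rented);
        }
    }
}
```

Compared to Stack<T>, such a type has no version checks or growth logic and no object of its own on the heap, at the price of a fixed capacity that the caller has to size correctly.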

It worked

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15

Method Mean Error StdDev Median Ratio RatioSD Gen0 Allocated Alloc Ratio
Baseline 33.44 ms 0.104 ms 0.538 ms 33.18 ms 1.00 0.02 62.5000 814462 B 1.000
SimpleOptimizations 22.98 ms 0.031 ms 0.159 ms 22.94 ms 0.69 0.01 62.5000 814439 B 1.000
Merged 39.02 ms 0.043 ms 0.222 ms 39.00 ms 1.17 0.02 - 152057 B 0.187
MergedIterative 31.64 ms 0.132 ms 0.690 ms 31.43 ms 0.95 0.03 - 46 B 0.000
MergedRWBR 22.50 ms 0.023 ms 0.120 ms 22.49 ms 0.67 0.01 - 23 B 0.000
MergedRWBRSS 20.46 ms 0.016 ms 0.084 ms 20.46 ms 0.61 0.01 - 34 B 0.000

I highly suspect we can also push this further. In the same profiles Nullable was also unusually high, and we can expect some additional performance from replacing the pretty ad-hoc Ray-Triangle intersection with a more standardized one (like Möller and Trumbore).

Helco commented Sep 23, 2024

I am probably going to abandon the recursive merged as well as the naive iterative versions, so I cleaned up the list of benchmarks a bit. Also, instead of appending ever more acronyms to RW, I am going to just compare the previous benchmark results with the current changes (and baseline/simple opt).

The current changes prepare for the alternative Ray-Triangle intersection and also remove the usage of nullables. By using a NaN invariant we can also omit the comparison for misses entirely. At some point I might want to even move the intersection into the TreeCollider to have access to precomputed data without uglifying the Ray interface.
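
One way to read the NaN invariant, as an assumption about the approach rather than the PR's actual code: if a miss is encoded as a NaN distance instead of a null, then a plain `<` comparison already rejects it, because every ordered comparison with NaN is false. `NaNMissSketch` is a hypothetical name.

```csharp
using System;

public static class NaNMissSketch
{
    // distances: one entry per tested primitive, NaN meaning "no hit".
    // Returns the smallest hit distance, or PositiveInfinity if all missed.
    public static float Closest(ReadOnlySpan<float> distances)
    {
        float closest = float.PositiveInfinity;
        foreach (float distance in distances)
        {
            // No HasValue check and no separate "is this a miss" branch:
            // NaN < closest is always false.
            if (distance < closest)
                closest = distance;
        }
        return closest;
    }
}
```

The invariant only holds as long as the comparison is written in this direction; `!(distance >= closest)` would accept NaN.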

Another millisecond gone

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10

Method Mean Error StdDev Ratio Gen0 Allocated Alloc Ratio
Baseline 32.93 ms 0.098 ms 0.143 ms 1.00 62.5000 814462 B 1.000
SimpleOptimizations 23.01 ms 0.045 ms 0.067 ms 0.70 62.5000 814439 B 1.000
MergedRWPrevious 20.69 ms 0.106 ms 0.158 ms 0.63 - 34 B 0.000
MergedRWNext 19.69 ms 0.088 ms 0.132 ms 0.60 - 34 B 0.000

Helco commented Sep 23, 2024

Möller-Trumbore came through, we are well within 2x, even though I had to add a naive check to cull backfacing triangles from the test. The old intersection method did that without my explicit knowledge. Oh well...

The deeds

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15

Method Mean Error StdDev Median Ratio Gen0 Allocated Alloc Ratio
Baseline 33.03 ms 0.017 ms 0.088 ms 33.04 ms 1.00 62.5000 814462 B 1.000
SimpleOptimizations 23.10 ms 0.020 ms 0.106 ms 23.08 ms 0.70 62.5000 814439 B 1.000
MergedRWPrevious 19.61 ms 0.010 ms 0.051 ms 19.61 ms 0.59 - 34 B 0.000
MergedRWNext 14.97 ms 0.011 ms 0.057 ms 14.97 ms 0.45 - 17 B 0.000

Helco commented Sep 23, 2024

As I suspected, there was an intermediate in Möller-Trumbore that we can use to cull back-faces (it's just the sign of the determinant). Also I finally removed degenerate triangles from the merged tree, removed the dummy splits of naive section collisions and reordered the triangles to remove the map indirection.
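
For illustration, here is a textbook-style sketch of the culling variant of Möller-Trumbore, where the sign of the determinant doubles as the back-face test. It is not the code from this PR: the class name, the epsilon and the NaN-for-miss return convention are assumptions made for the example.

```csharp
using System;
using System.Numerics;

public static class MollerTrumboreSketch
{
    // Returns the distance along the ray to the triangle (v0, v1, v2),
    // or NaN if the ray misses, is parallel, or hits the back face.
    public static float Cast(Vector3 origin, Vector3 direction, Vector3 v0, Vector3 v1, Vector3 v2)
    {
        const float Epsilon = 1e-7f; // assumed tolerance

        Vector3 edge1 = v1 - v0;
        Vector3 edge2 = v2 - v0;
        Vector3 p = Vector3.Cross(direction, edge2);
        float det = Vector3.Dot(edge1, p);

        // det is positive only if the ray hits the side that the normal
        // (edge1 x edge2) points to. Zero means parallel, negative means
        // back-facing, so one comparison culls both.
        if (det < Epsilon)
            return float.NaN;

        // Barycentric coordinates, still scaled by det to delay the division.
        Vector3 t = origin - v0;
        float u = Vector3.Dot(t, p);
        if (u < 0f || u > det)
            return float.NaN;

        Vector3 q = Vector3.Cross(t, edge1);
        float v = Vector3.Dot(direction, q);
        if (v < 0f || u + v > det)
            return float.NaN;

        float distance = Vector3.Dot(edge2, q) / det;
        return distance >= 0f ? distance : float.NaN;
    }
}
```

Because the determinant is needed anyway, the back-face cull costs no extra arithmetic, which matches the observation above.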

This might be the end for optimizations

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15

Method Mean Error StdDev Median Ratio Gen0 Allocated Alloc Ratio
Baseline 33.13 ms 0.020 ms 0.103 ms 33.12 ms 1.00 62.5000 814462 B 1.000
SimpleOptimizations 23.13 ms 0.014 ms 0.070 ms 23.12 ms 0.70 62.5000 814439 B 1.000
MergedRWPrevious 14.85 ms 0.015 ms 0.076 ms 14.88 ms 0.45 - 17 B 0.000
MergedRWNext 13.25 ms 0.017 ms 0.087 ms 13.22 ms 0.40 - 17 B 0.000

If I want to continue, I guess more precomputation and an even faster intersection algorithm might be the way to go. But that is not a way I want to go; I already use more memory for the merged trees.

@Helco Helco marked this pull request as ready for review October 19, 2024 15:36
@Helco Helco merged commit 4a36bd8 into master Oct 20, 2024
1 check passed
@Helco Helco deleted the allocation-less-collisions branch October 20, 2024 09:13
Helco added a commit that referenced this pull request Oct 20, 2024
@Helco Helco restored the allocation-less-collisions branch October 20, 2024 09:14