zzre: Allocation-less colliders #371

Helco · 2024-08-20T11:52:44Z

As discovered in #368 (and #313), the colliders allocate heavily due to many uses of LINQ and generator methods. Unfortunately it does not seem like we will get value-type generator methods in C# anytime soon, so we have to write manual enumerator structs to reduce memory allocations.
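For illustration, a minimal sketch of the difference between a generator method and a manual enumerator struct. The names `RangeQueries` and `InRangeEnumerator` are hypothetical and this is not the actual TreeCollider code; it only shows the pattern that `foreach` can bind to without going through `IEnumerable<T>`.

```csharp
using System;
using System.Collections.Generic;

public static class RangeQueries
{
    // Generator version: the compiler emits a heap-allocated state machine per call.
    public static IEnumerable<int> IndicesInRangeGenerator(float[] values, float min, float max)
    {
        for (int i = 0; i < values.Length; i++)
            if (values[i] >= min && values[i] <= max)
                yield return i;
    }

    // Struct version: foreach binds to GetEnumerator/MoveNext/Current by pattern,
    // so neither an interface nor a heap allocation is involved.
    public static InRangeEnumerator IndicesInRange(float[] values, float min, float max) =>
        new InRangeEnumerator(values, min, max);
}

public struct InRangeEnumerator
{
    private readonly float[] values;
    private readonly float min, max;
    private int index;

    public InRangeEnumerator(float[] values, float min, float max)
    {
        this.values = values;
        this.min = min;
        this.max = max;
        index = -1;
    }

    // Index of the element the enumerator currently points at.
    public int Current => index;

    public InRangeEnumerator GetEnumerator() => this;

    public bool MoveNext()
    {
        // Advance to the next element inside [min, max].
        while (++index < values.Length)
            if (values[index] >= min && values[index] <= max)
                return true;
        return false;
    }
}
```

A consumer writes `foreach (var i in RangeQueries.IndicesInRange(values, 2f, 4f))` exactly as with the generator, but the enumerator lives on the stack. The cost is that every query needs its own hand-written struct.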

For sorting intersections we might want to also look into cached lists as well as cached stacks inside the enumerators or have intersections (instead of raycasts) always write into a sorted list.

For testing we can use the TestRaycaster but it should be possible to have both implementations side-by-side and (behind a compiler flag) run them both, expecting the exact same results.

Review todos:

Final benchmark results


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.5011/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15

Method	Mean	Error	StdDev	Median	Ratio	Gen0	Allocated	Alloc Ratio
Intersections
IntersectionsBaseline	10.731 ms	0.0162 ms	0.0826 ms	10.711 ms	1.00	640.6250	6827633 B	1.000
IntersectionsList	4.273 ms	0.0094 ms	0.0486 ms	4.281 ms	0.40	-	8 B	0.000
Raycasts
Baseline	33.15 ms	0.014 ms	0.070 ms		1.00	62.5000	814483 B	1.000
Merged	13.07 ms	0.004 ms	0.020 ms		0.39	-	7 B	0.000

GC Profile comparison

The GC profiler shows that TreeCollider was the main cause of per-frame allocations but also that we have quite a way to go for zero allocations per-frame.

Helco · 2024-08-25T21:28:06Z

First benchmark results

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun IterationCount=100 LaunchCount=3
MaxIterationCount=1000 MinIterationCount=10 WarmupCount=15

Method	Mean	Error	StdDev	Median	Ratio	Gen0	Allocated	Alloc Ratio
IntersectionsGenerator	9.125 ms	0.0115 ms	0.0588 ms	9.121 ms	1.00	140.6250	1475.41 KB	1.00
IntersectionsList	8.780 ms	0.0081 ms	0.0409 ms	8.773 ms	0.96	93.7500	989.76 KB	0.67
IntersectionsStruct	8.306 ms	0.0063 ms	0.0319 ms	8.303 ms	0.91	62.5000	774.8 KB	0.53
IntersectionsTaggedUnion	8.498 ms	0.0122 ms	0.0633 ms	8.486 ms	0.93	62.5000	774.8 KB	0.53

Not depicted in the performance benchmarks, of course, is the code quality: IntersectionsStruct has a horrible API that bleeds into all consumers.

Also, the removal of allocations is obviously not complete: the split stacks are still allocated per query and should either be fixed-size for a ridiculous tree size or pooled for amortization. For the next benchmark I will try to preserve the actual status quo as baseline, while this amortization will also be applied to a new generator-based method.

Helco · 2024-08-26T12:42:52Z

Results with amortized split stacks


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3  
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15

Method	Mean	Error	StdDev	Median	Ratio	Gen0	Allocated	Alloc Ratio
IntersectionsBaseline	8.961 ms	0.0116 ms	0.0598 ms	8.951 ms	1.00	62.5000	782732 B	1.000
IntersectionsGenerator	8.969 ms	0.0082 ms	0.0427 ms	8.970 ms	1.00	62.5000	686644 B	0.877
IntersectionsList	9.131 ms	0.0085 ms	0.0442 ms	9.129 ms	1.02	15.6250	277140 B	0.354
IntersectionsStruct	8.591 ms	0.0080 ms	0.0414 ms	8.590 ms	0.96	-	12 B	0.000
IntersectionsTaggedUnion	8.505 ms	0.0113 ms	0.0587 ms	8.496 ms	0.95	-	12 B	0.000

I am still a bit curious why IntersectionsList is both slower (with supposedly less branching) and allocates per intersection. The struct enumerators still show an allocation, but that might be caused by the benchmark and not by the intersection query.
(Also, the baseline is not correct, as I forgot to revert the amortization on the atomic layer.)

Helco · 2024-08-26T18:12:29Z

With the baseline corrected, and by simply removing the coarse intersection tests entirely (let's just not care about out-of-bounds, right?), none of the three variants that we would expect to be allocation-free allocate anymore (apart from amortization).

And we still spend about 20% less runtime.


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3  
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15

Method	Mean	Error	StdDev	Median	Ratio	Gen0	Allocated	Alloc Ratio
IntersectionsBaseline	10.642 ms	0.0141 ms	0.0731 ms	10.624 ms	1.00	640.6250	6830460 B	1.000
IntersectionsGenerator	10.532 ms	0.0092 ms	0.0476 ms	10.531 ms	0.99	562.5000	5991292 B	0.877
IntersectionsList	8.342 ms	0.0072 ms	0.0374 ms	8.343 ms	0.78	-	12 B	0.000
IntersectionsStruct	8.585 ms	0.0071 ms	0.0366 ms	8.585 ms	0.81	-	12 B	0.000
IntersectionsTaggedUnion	8.384 ms	0.0112 ms	0.0583 ms	8.372 ms	0.79	-	12 B	0.000

Helco · 2024-08-29T11:36:44Z

Now we fix the baseline as a separate assembly, because I want to tackle some more shared code within zzre.core.
Starting with plastering most of the math functions with AggressiveInlining | AggressiveOptimization after observing that the JITted assembly is abysmal for hot-loop methods.
Then we can see that Triangle.ClosestPoint(Vector3), which is responsible for all end-stage math in most intersection queries (which use Sphere as primitive), uses a non-optimal implementation, so we replace it entirely. The new implementation apparently behaves differently in some cases (probably extreme or special ones), but gameplay seems to still work and the benchmark results warrant taking that risk.
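As a sketch of what such an annotation looks like, here is a small math helper carrying both `MethodImplOptions` flags. The helper `MathEx.DistanceSquaredToSegment` is a made-up example and not one of the functions in zzre.core; whether the flags help a given method has to be measured per case.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;

public static class MathEx
{
    // Hypothetical hot-loop helper: squared distance from a point to the segment a-b.
    // AggressiveInlining asks the JIT to inline the call if possible,
    // AggressiveOptimization asks it to skip tiered compilation and optimize immediately.
    [MethodImpl(MethodImplOptions.AggressiveInlining | MethodImplOptions.AggressiveOptimization)]
    public static float DistanceSquaredToSegment(Vector3 point, Vector3 a, Vector3 b)
    {
        Vector3 ab = b - a;
        float lengthSq = ab.LengthSquared();
        if (lengthSq <= 0f)
            return Vector3.DistanceSquared(point, a); // degenerate segment, treat as a point

        // Project the point onto the segment and clamp to its end points.
        float t = Math.Clamp(Vector3.Dot(point - a, ab) / lengthSq, 0f, 1f);
        return Vector3.DistanceSquared(point, a + ab * t);
    }
}
```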


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15

Method	Mean	Error	StdDev	Median	Ratio	RatioSD	Gen0	Allocated	Alloc Ratio
IntersectionsBaseline	10.713 ms	0.0379 ms	0.1897 ms	10.648 ms	1.00	0.02	640.6250	6827628 B	1.000
IntersectionsGenerator	7.621 ms	0.0401 ms	0.2065 ms	7.672 ms	0.71	0.02	570.3125	5988646 B	0.877
IntersectionsBaselineList	8.343 ms	0.0142 ms	0.0740 ms	8.320 ms	0.78	0.02	-	12 B	0.000
IntersectionsList	5.170 ms	0.0043 ms	0.0224 ms	5.169 ms	0.48	0.01	-	6 B	0.000
IntersectionsStruct	5.474 ms	0.0190 ms	0.0963 ms	5.427 ms	0.51	0.01	-	6 B	0.000
IntersectionsBaselineTaggedUnion	8.443 ms	0.0089 ms	0.0461 ms	8.440 ms	0.79	0.01	-	12 B	0.000
IntersectionsTaggedUnion	5.098 ms	0.0052 ms	0.0269 ms	5.095 ms	0.48	0.01	-	6 B	0.000

Helco · 2024-08-29T13:29:27Z

And now with the KD optimization


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15

Method	Mean	Error	StdDev	Median	Ratio	Gen0	Allocated	Alloc Ratio
IntersectionsBaseline	10.543 ms	0.0182 ms	0.0945 ms	10.532 ms	1.00	640.6250	6827628 B	1.000
IntersectionsGenerator	7.447 ms	0.0112 ms	0.0582 ms	7.450 ms	0.71	570.3125	5988646 B	0.877
IntersectionsBaselineList	8.323 ms	0.0051 ms	0.0266 ms	8.324 ms	0.79	-	12 B	0.000
IntersectionsList	5.220 ms	0.0083 ms	0.0432 ms	5.233 ms	0.50	-	6 B	0.000
IntersectionsListKD	3.915 ms	0.0030 ms	0.0156 ms	3.914 ms	0.37	-	6 B	0.000
IntersectionsStruct	5.407 ms	0.0051 ms	0.0265 ms	5.401 ms	0.51	-	6 B	0.000
IntersectionsBaselineTaggedUnion	8.474 ms	0.0090 ms	0.0467 ms	8.464 ms	0.80	-	12 B	0.000
IntersectionsTaggedUnion	5.129 ms	0.0050 ms	0.0258 ms	5.126 ms	0.49	-	6 B	0.000

Helco · 2024-09-02T12:11:15Z

Now we merge the two levels of kd-trees into a single structure, which brings just a bit of performance (getting us to exactly 3x faster) but should also simplify some API stuff, so maybe looking into the struct enumerator might be worthwhile again.

Not much but it is there


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15

Method	Mean	Error	StdDev	Median	Ratio	Gen0	Allocated	Alloc Ratio
IntersectionsBaseline	10.519 ms	0.0143 ms	0.0728 ms	10.506 ms	1.00	640.6250	6827628 B	1.000
IntersectionsGenerator	7.438 ms	0.0068 ms	0.0354 ms	7.433 ms	0.71	570.3125	5988646 B	0.877
IntersectionsBaselineList	8.365 ms	0.0200 ms	0.1042 ms	8.338 ms	0.80	-	12 B	0.000
IntersectionsList	5.229 ms	0.0052 ms	0.0267 ms	5.231 ms	0.50	-	6 B	0.000
IntersectionsListKD	3.887 ms	0.0049 ms	0.0256 ms	3.884 ms	0.37	-	6 B	0.000
IntersectionsListKDMerged	3.466 ms	0.0033 ms	0.0170 ms	3.464 ms	0.33	-	3 B	0.000
IntersectionsStruct	5.421 ms	0.0054 ms	0.0283 ms	5.422 ms	0.52	-	6 B	0.000
IntersectionsBaselineTaggedUnion	8.458 ms	0.0083 ms	0.0434 ms	8.454 ms	0.80	-	12 B	0.000
IntersectionsTaggedUnion	5.266 ms	0.0081 ms	0.0416 ms	5.263 ms	0.50	-	6 B	0.000

Also I should probably clean up a bit: both the math optimization and the KD optimization have proven themselves, and we no longer need to run them separately every time.
Meaning: every test except the baseline ones will get KD, just the suffix is not kept.

Helco · 2024-09-08T14:24:41Z

All benchmarks should now have the KD optimization. I also checked the differences in results, which seem to occur only between Baseline and Current and are due to the triangle-sphere intersection. These differences seem to point to erroneous behaviour of the old implementation, so I will let that slide.

And the benchmark results


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15

Method	Mean	Error	StdDev	Ratio	RatioSD	Gen0	Allocated	Alloc Ratio
IntersectionsBaseline	11.115 ms	0.0337 ms	0.1743 ms	1.00	0.02	640.6250	6827628 B	1.000
IntersectionsList	4.086 ms	0.0055 ms	0.0282 ms	0.37	0.01	-	6 B	0.000
IntersectionsListKDMerged	3.609 ms	0.0067 ms	0.0347 ms	0.32	0.01	-	3 B	0.000
IntersectionsStruct	4.108 ms	0.0051 ms	0.0261 ms	0.37	0.01	-	-	0.000
IntersectionsTaggedUnion	4.890 ms	0.0065 ms	0.0335 ms	0.44	0.01	-	6 B	0.000

Helco · 2024-09-08T17:12:38Z

While writing the MergedCollider I asked myself whether the memory layout of the full split array would affect performance.

So enjoy this one-off benchmark


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10

Method	Mean	Error	StdDev	Median	Ratio	Gen0	Allocated	Alloc Ratio
IntersectionsBaseline	11.184 ms	0.0465 ms	0.0681 ms	11.180 ms	1.00	640.6250	6827623 B	1.000
IntersectionsList	4.120 ms	0.0352 ms	0.0527 ms	4.115 ms	0.37	-	6 B	0.000
IntersectionsListKDMergedDF1	3.609 ms	0.0387 ms	0.0580 ms	3.605 ms	0.32	-	3 B	0.000
IntersectionsListKDMergedDF2	3.666 ms	0.0300 ms	0.0449 ms	3.665 ms	0.33	-	2 B	0.000
IntersectionsListKDMergedBF	3.632 ms	0.0382 ms	0.0560 ms	3.602 ms	0.32	-	-	0.000
IntersectionsStruct	4.144 ms	0.0309 ms	0.0453 ms	4.131 ms	0.37	-	6 B	0.000
IntersectionsTaggedUnion	5.082 ms	0.0761 ms	0.1091 ms	5.082 ms	0.45	-	6 B	0.000

The answer: Not really, any difference here is pretty near the threshold of error... So let's go with the simplest one.

Helco · 2024-09-12T07:33:55Z

Finally the SIMD (two-split) benchmarks are in with three-split being scribbled up.

Let's see how it turned out


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10

Method	Mean	Error	StdDev	Ratio	Gen0	Allocated	Alloc Ratio
IntersectionsBaseline	10.593 ms	0.0500 ms	0.0717 ms	1.00	640.6250	6827628 B	1.000
IntersectionsList	4.010 ms	0.0196 ms	0.0293 ms	0.38	-	6 B	0.000
IntersectionsListKDMerged	3.563 ms	0.0170 ms	0.0249 ms	0.34	-	3 B	0.000
IntersectionsListKDMergedInty	3.579 ms	0.0054 ms	0.0081 ms	0.34	-	3 B	0.000
IntersectionsStruct	4.058 ms	0.0102 ms	0.0152 ms	0.38	-	6 B	0.000
IntersectionsTaggedUnion	4.906 ms	0.0348 ms	0.0488 ms	0.46	-	6 B	0.000
IntersectionsSIMD128MoreBranches	3.472 ms	0.0083 ms	0.0124 ms	0.33	-	3 B	0.000
IntersectionsSIMD128	3.805 ms	0.0061 ms	0.0090 ms	0.36	-	3 B	0.000
IntersectionsSIMD256	3.772 ms	0.0045 ms	0.0067 ms	0.36	-	3 B	0.000

Oh well, this is surprisingly bad :) I probably still want to try the three-split one just for good measure, but we can already see that the branch reduction is not helpful, and if it does help performance, the benefit is minuscule.

Helco · 2024-09-12T13:41:28Z

And here are the results for the SIMD512 three-split collider. Because we have more loop iterations I also re-added the less-branching variant for the new benchmark.

I usually look at the results only after posting them here, so I cannot tell what hides under here...


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10

Method	Mean	Error	StdDev	Ratio	Gen0	Allocated	Alloc Ratio
IntersectionsBaseline	10.551 ms	0.0273 ms	0.0392 ms	1.00	640.6250	6827628 B	1.000
IntersectionsList	4.012 ms	0.0233 ms	0.0349 ms	0.38	-	6 B	0.000
IntersectionsListKDMerged	3.581 ms	0.0136 ms	0.0199 ms	0.34	-	3 B	0.000
IntersectionsStruct	4.056 ms	0.0121 ms	0.0182 ms	0.38	-	6 B	0.000
IntersectionsTaggedUnion	4.795 ms	0.0142 ms	0.0213 ms	0.45	-	6 B	0.000
IntersectionsSIMD128	3.462 ms	0.0080 ms	0.0117 ms	0.33	-	3 B	0.000
IntersectionsSIMD256	3.555 ms	0.0157 ms	0.0235 ms	0.34	-	3 B	0.000
IntersectionsSIMD512	3.628 ms	0.0098 ms	0.0144 ms	0.34	-	3 B	0.000
IntersectionsSIMD512LB	3.850 ms	0.0115 ms	0.0169 ms	0.36	-	3 B	0.000

The answer: not very much. We again have a minimal performance benefit from the SIMD128 two-split collider, but anything wider performs worse and is naturally more complex.
At this point I might scrap SIMD altogether for this use case unless I have another idea for this. If I get crazy I might try attaching Intel VTune, for example, and look whether the SIMD ones have some solvable problem.

Just as a text note without further benchmark results: I tested an SOA variant of the SIMD128 collider with no discernible difference in performance. VTune showed a major bottleneck to be branch mispredictions, especially in the leaf Triangle-Sphere intersection test, which I guess is to be expected (if we reasonably knew the outcome we would not have to ask this very question), so I can see no obvious fault in the algorithm at the microarchitecture level.

Helco · 2024-09-14T15:22:16Z

I am almost at the end of the Intersections method, with the winner being the MergedCollider combined with one of the simpler variants like List, Struct or TaggedUnion. I had yet to benchmark the latter two in the merged collider, so here are the numbers for that.


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10

Method	Mean	Error	StdDev	Ratio	RatioSD	Gen0	Allocated	Alloc Ratio
IntersectionsBaseline	11.121 ms	0.0906 ms	0.1356 ms	1.00	0.02	640.6250	6827628 B	1.000
IntersectionsListKDMerged	3.730 ms	0.0356 ms	0.0522 ms	0.34	0.01	-	3 B	0.000
IntersectionsStructMerged	3.747 ms	0.0094 ms	0.0140 ms	0.34	0.00	-	3 B	0.000
IntersectionsTaggedUnionMerged	4.044 ms	0.0176 ms	0.0258 ms	0.36	0.00	-	6 B	0.000

These numbers again show: simpler is better, so I will leave it at that. We can still cheat for one actual use case in the game, where a line intersection is equivalent to a raycast. But for the other use cases (especially physics) we would need to incorporate more use-case-specific operations in order to allow for optimizations (e.g. filter by a product in order to reduce to a single-nearest-neighbor search). I am currently not inclined to do that.

I still would like to roughly benchmark raycast, making sure that it does not allocate, and maybe try out a couple of variants. After that I will wrap up this PR by putting the experiments into a backup branch and applying the winning variants to the production game.
It would probably be nice to then have a comparison of GC behavior, but that will have to wait a bit yet again.

Helco · 2024-09-15T11:02:27Z

Initial benchmark for raycasts: the results are worrying. A lot of allocations (which could be easily amortized though) and troublesome performance. I should add a benchmark with a sorted line intersection and also definitely figure out why the merged collider is so much worse than the other two. This is surprising.


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10

Method	Mean	Error	StdDev	Ratio	Gen0	Allocated	Alloc Ratio
Baseline	32.58 ms	0.111 ms	0.159 ms	1.00	62.5000	795.37 KB	1.00
SimpleOptimizations	27.32 ms	0.053 ms	0.079 ms	0.84	62.5000	795.35 KB	1.00
Merged	47.89 ms	0.102 ms	0.153 ms	1.47	-	148.5 KB	0.19

Helco · 2024-09-17T11:42:24Z

Let's get unsurprised here with an easy one first:

Line intersections are not faster than ray casts


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10

Method	Mean	Error	StdDev	Ratio	RatioSD	Gen0	Allocated	Alloc Ratio
Baseline	33.06 ms	0.052 ms	0.078 ms	1.00	0.00	62.5000	814462 B	1.000
LineIntersectionsWorld	368.22 ms	4.566 ms	6.835 ms	11.14	0.21	-	736 B	0.001
LineIntersectionsMerged	231.64 ms	1.979 ms	2.900 ms	7.01	0.09	-	245 B	0.000
SimpleOptimizations	27.52 ms	0.072 ms	0.106 ms	0.83	0.00	62.5000	814439 B	1.000
Merged	47.58 ms	0.119 ms	0.178 ms	1.44	0.01	-	152067 B	0.187

This can be attributed to intersection queries having to always return all intersections, while raycasts can exit out as soon as there cannot be a closer hit.

Helco · 2024-09-17T14:55:45Z

A one-off benchmark before I have to use a profiler again: at some point I added an SSE 4.1 version of Triangle.Barycentric but never benchmarked it (FOR SHAME!), so here is a benchmark with scalar, explicit SSE 4.1 and SIMD128 versions:

I should have benchmarked it earlier


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15

Method	Mean	Error	StdDev	Median	Ratio	RatioSD	Gen0	Allocated	Alloc Ratio
Baseline	33.18 ms	0.020 ms	0.101 ms	33.17 ms	1.00	0.00	62.5000	795.37 KB	1.00
SimpleOptimizations	27.65 ms	0.016 ms	0.085 ms	27.65 ms	0.83	0.00	62.5000	795.35 KB	1.00
MergedScalar	44.45 ms	0.043 ms	0.221 ms	44.40 ms	1.34	0.01	-	148.5 KB	0.19
MergedSse41	48.13 ms	0.066 ms	0.339 ms	48.23 ms	1.45	0.01	-	148.5 KB	0.19
MergedSIMD128	46.98 ms	0.362 ms	1.872 ms	48.20 ms	1.42	0.06	-	148.5 KB	0.19

We also have multi-modal distributions and vastly different results in MediumRun benchmarks, so in summary I would say: no use for either implementation, just scalar should be fine.

EDIT: Another benchmark not worthy of uploading is trying to just disable the degeneration test. We can do that test during merging and save it for the raycasts, but that is a tiny improvement over the current state. Profiler comparison it is.

Helco · 2024-09-17T17:57:38Z

Also not uploading: adding MIOptions makes casting almost twice as slow. Just adding AggressiveOptimization (without inlining) is better

and here are the results for that


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10

Method	Mean	Error	StdDev	Median	Ratio	RatioSD	Gen0	Allocated	Alloc Ratio
Baseline	33.67 ms	0.478 ms	0.671 ms	33.17 ms	1.00	0.03	62.5000	795.37 KB	1.00
SimpleOptimizations	23.11 ms	0.051 ms	0.075 ms	23.09 ms	0.69	0.01	62.5000	795.35 KB	1.00
Merged	39.55 ms	0.148 ms	0.221 ms	39.49 ms	1.18	0.02	-	148.49 KB	0.19

But merged is still slower than baseline. The profiler unfortunately did not tell me much, so a bit of guesswork it is: I am working on an iterative version.

Helco · 2024-09-20T11:27:25Z

Oh well. The first iterative raycast and... at least it is parity in performance with baseline and in allocation with merged?

Overall still bad


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10

Method	Mean	Error	StdDev	Ratio	Gen0	Allocated	Alloc Ratio
Baseline	33.15 ms	0.050 ms	0.074 ms	1.00	62.5000	795.37 KB	1.00
SimpleOptimizations	23.19 ms	0.196 ms	0.287 ms	0.70	62.5000	795.35 KB	1.00
Merged	39.49 ms	0.147 ms	0.216 ms	1.19	-	148.47 KB	0.19
MergedIterative	32.05 ms	0.056 ms	0.080 ms	0.97	-	148.48 KB	0.19

The allocations are due to the coarse check, in particular casting against a box allocates at the moment. I would still like to see better numbers for the casting itself.

Helco · 2024-09-22T11:31:00Z

Now we are getting somewhere. A new iterative variant using additional subtree-elimination reaches allllllmost the WorldCollider+Recursive Cast. That it is not faster is still beyond me.

The results


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10

Method	Mean	Error	StdDev	Median	Ratio	RatioSD	Gen0	Allocated	Alloc Ratio
Baseline	33.82 ms	0.457 ms	0.670 ms	34.32 ms	1.00	0.03	66.6667	814465 B	1.000
SimpleOptimizations	23.03 ms	0.043 ms	0.063 ms	23.03 ms	0.68	0.01	62.5000	814439 B	1.000
Merged	39.42 ms	0.088 ms	0.126 ms	39.41 ms	1.17	0.02	-	152057 B	0.187
MergedIterative	31.72 ms	0.073 ms	0.109 ms	31.70 ms	0.94	0.02	-	46 B	0.000
MergedRW	25.25 ms	0.033 ms	0.048 ms	25.25 ms	0.75	0.01	-	23 B	0.000

Helco · 2024-09-22T15:54:49Z

I omitted the break-even and just continued. In instrumentation profiles I saw Stack<T> operations taking an unusually high percentage of the runtime, so to test this I replaced it with a StackOverSpan<T> variant that uses ArrayPool<...>.Shared as backing memory.
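To make the idea concrete, here is a minimal sketch of a stack over a span backed by `ArrayPool<T>.Shared`. The members of this `StackOverSpan<T>` and the `PooledTraversal` helper are assumptions for illustration and may differ from the actual zzre implementation; the traversal walks a hypothetical implicit binary tree (children of node `i` at `2i+1` and `2i+2`) rather than a real collider tree.

```csharp
using System;
using System.Buffers;

// Fixed-capacity stack living entirely in caller-provided memory.
public ref struct StackOverSpan<T>
{
    private readonly Span<T> span;
    private int count;

    public StackOverSpan(Span<T> span)
    {
        this.span = span;
        count = 0;
    }

    public int Count => count;

    public void Push(T value)
    {
        if (count >= span.Length)
            throw new InvalidOperationException("Stack is full");
        span[count++] = value;
    }

    public T Pop()
    {
        if (count == 0)
            throw new InvalidOperationException("Stack is empty");
        return span[--count];
    }
}

public static class PooledTraversal
{
    // Depth-first walk over an implicit binary tree with nodeCount nodes,
    // summing the visited node indices. Intended for small trees only.
    public static long SumNodeIndices(int nodeCount)
    {
        // Rent may hand out a larger array than requested; 64 entries are
        // plenty because the stack depth is bounded by the tree height.
        int[] rented = ArrayPool<int>.Shared.Rent(64);
        try
        {
            var stack = new StackOverSpan<int>(rented);
            long sum = 0;
            if (nodeCount > 0)
                stack.Push(0);
            while (stack.Count > 0)
            {
                int node = stack.Pop();
                sum += node;
                int left = 2 * node + 1, right = left + 1;
                if (right < nodeCount) stack.Push(right);
                if (left < nodeCount) stack.Push(left);
            }
            return sum;
        }
        finally
        {
            // Always hand the buffer back so later queries can reuse it.
            ArrayPool<int>.Shared.Return(rented);
        }
    }
}
```

Compared to `Stack<T>`, this has no growth logic and no version checks, and the backing array is amortized across queries by the pool.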

It worked


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15

Method	Mean	Error	StdDev	Median	Ratio	RatioSD	Gen0	Allocated	Alloc Ratio
Baseline	33.44 ms	0.104 ms	0.538 ms	33.18 ms	1.00	0.02	62.5000	814462 B	1.000
SimpleOptimizations	22.98 ms	0.031 ms	0.159 ms	22.94 ms	0.69	0.01	62.5000	814439 B	1.000
Merged	39.02 ms	0.043 ms	0.222 ms	39.00 ms	1.17	0.02	-	152057 B	0.187
MergedIterative	31.64 ms	0.132 ms	0.690 ms	31.43 ms	0.95	0.03	-	46 B	0.000
MergedRWBR	22.50 ms	0.023 ms	0.120 ms	22.49 ms	0.67	0.01	-	23 B	0.000
MergedRWBRSS	20.46 ms	0.016 ms	0.084 ms	20.46 ms	0.61	0.01	-	34 B	0.000

I highly suspect we can push this further. In the same profiles Nullable also showed up unusually high, and we can expect some additional performance by replacing the pretty ad-hoc Ray-Triangle intersection with a more standardized one (like Möller and Trumbore).

Helco · 2024-09-23T08:51:00Z

I am probably going to abandon the recursive merged as well as the naive iterative versions, so I cleaned up the list of benchmarks a bit. Also, instead of appending ever more acronyms to RW, I am going to just compare the previous benchmark results with the current changes (and baseline/simple opt).

The current changes prepare for the alternative Ray-Triangle intersection and also remove the usage of nullables. By using a NaN invariant we can also omit the comparison for misses entirely. At some point I might want to even move the intersection into the TreeCollider to have access to precomputed data without uglifying the Ray interface.
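One way to read the NaN invariant, sketched under the assumption that a miss is encoded as `float.NaN` instead of `null`: since every ordered comparison with NaN is false, the "is this hit closer?" check already rejects misses and no separate `HasValue` branch is needed. The one-dimensional `HitDistance` below is a made-up stand-in for the real Ray-Triangle test.

```csharp
using System;

public static class NaNMiss
{
    // Hypothetical 1D "ray against a point on the axis" test:
    // returns the hit distance, or NaN if the target lies behind the origin.
    public static float HitDistance(float origin, float direction, float target)
    {
        float t = (target - origin) / direction;
        return t >= 0f ? t : float.NaN;
    }

    // Returns the closest hit distance, or PositiveInfinity if nothing was hit.
    public static float Closest(float origin, float direction, ReadOnlySpan<float> targets)
    {
        float closest = float.PositiveInfinity;
        foreach (float target in targets)
        {
            float t = HitDistance(origin, direction, target);
            // NaN < closest is always false, so a miss never replaces the best hit.
            if (t < closest)
                closest = t;
        }
        return closest;
    }
}
```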

Another millisecond gone


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10

Method	Mean	Error	StdDev	Ratio	Gen0	Allocated	Alloc Ratio
Baseline	32.93 ms	0.098 ms	0.143 ms	1.00	62.5000	814462 B	1.000
SimpleOptimizations	23.01 ms	0.045 ms	0.067 ms	0.70	62.5000	814439 B	1.000
MergedRWPrevious	20.69 ms	0.106 ms	0.158 ms	0.63	-	34 B	0.000
MergedRWNext	19.69 ms	0.088 ms	0.132 ms	0.60	-	34 B	0.000

Helco · 2024-09-23T17:33:15Z

Möller-Trumbore came through, we are well within 2x, even though I had to add a naive check to cull backfacing triangles from the test. The old intersection method did that without my explicit knowledge. Oh well...

The deeds


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15

Method	Mean	Error	StdDev	Median	Ratio	Gen0	Allocated	Alloc Ratio
Baseline	33.03 ms	0.017 ms	0.088 ms	33.04 ms	1.00	62.5000	814462 B	1.000
SimpleOptimizations	23.10 ms	0.020 ms	0.106 ms	23.08 ms	0.70	62.5000	814439 B	1.000
MergedRWPrevious	19.61 ms	0.010 ms	0.051 ms	19.61 ms	0.59	-	34 B	0.000
MergedRWNext	14.97 ms	0.011 ms	0.057 ms	14.97 ms	0.45	-	17 B	0.000

Helco · 2024-09-23T20:40:44Z

As I suspected, there was an intermediate value in Möller-Trumbore that we can use to cull back-faces (it is just the sign of the determinant). Also I finally removed degenerate triangles from the merged tree, removed the dummy splits of naive section collisions and reordered the triangles to remove the map indirection.
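For reference, a sketch of the textbook Möller-Trumbore test with back-face culling via the determinant, written against `System.Numerics`. The class name `RayTriangle`, the epsilon and the NaN-for-miss return convention are assumptions here and not taken from the zzre sources; which winding counts as front-facing in the game is likewise assumed (counter-clockwise as seen from the ray origin).

```csharp
using System;
using System.Numerics;

public static class RayTriangle
{
    // Returns the distance along the ray to the triangle (a, b, c),
    // or NaN if the ray misses, is parallel, or hits the back face.
    public static float Cast(Vector3 origin, Vector3 direction, Vector3 a, Vector3 b, Vector3 c)
    {
        const float Epsilon = 1e-7f;
        Vector3 edge1 = b - a;
        Vector3 edge2 = c - a;
        Vector3 p = Vector3.Cross(direction, edge2);
        float det = Vector3.Dot(edge1, p);

        // det > 0: front face. det near 0: ray parallel to the triangle.
        // det < 0: back face. One comparison culls the latter two.
        if (det < Epsilon)
            return float.NaN;

        float invDet = 1f / det;
        Vector3 s = origin - a;
        float u = Vector3.Dot(s, p) * invDet; // first barycentric coordinate
        if (u < 0f || u > 1f)
            return float.NaN;

        Vector3 q = Vector3.Cross(s, edge1);
        float v = Vector3.Dot(direction, q) * invDet; // second barycentric coordinate
        if (v < 0f || u + v > 1f)
            return float.NaN;

        float t = Vector3.Dot(edge2, q) * invDet;
        return t >= 0f ? t : float.NaN; // hits behind the origin count as misses
    }
}
```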

This might be the end for optimizations


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15

Method	Mean	Error	StdDev	Median	Ratio	Gen0	Allocated	Alloc Ratio
Baseline	33.13 ms	0.020 ms	0.103 ms	33.12 ms	1.00	62.5000	814462 B	1.000
SimpleOptimizations	23.13 ms	0.014 ms	0.070 ms	23.12 ms	0.70	62.5000	814439 B	1.000
MergedRWPrevious	14.85 ms	0.015 ms	0.076 ms	14.88 ms	0.45	-	17 B	0.000
MergedRWNext	13.25 ms	0.017 ms	0.087 ms	13.22 ms	0.40	-	17 B	0.000

If I want to continue, I guess more precomputation and an even faster intersection algorithm might be the way to go. But that is not a way I want to go, as I already use more memory for the merged trees.


Helco added 9 commits October 19, 2024 17:34

Manual enumerator for TreeCollider.Intersections

88cad01

Add many more variants of world and triangle intersections

696748f

Prepare for benchmark

600310a

Add simple benchmark

1e03d45

dirty: Use shared split stack

03b7e87

Readd baseline

c14580c

Correct baseline and reduce allocations more

5457ce8

Fix baseline to prebuilt assembly

d3e8fd4

Use aggressive inlining and optimization

e40d487

Helco added 21 commits October 19, 2024 17:35

Use better backface culling

7b5f7b7

Merge trees into contiguous, non-degenerated triangle lists

f5f4068

Remove unused implementations

e936ebd

Remove unused raycast implementations

f6d8830

Add Location to TreeCollider

0b96b40

Maybe remove MergedCollider

061a545

Fix axis-aligned casts

6c0a089

Remove BaseGeometryCollider

1880333

Clean up collider creation

b1634cc

Remove IIntersectionQueries.Intersections

9b8651f

Merge collider and math folder

d2ae60e

Add PooledList and tests for it and StackOverSpan

0396f1f

Rename SubSums to PrefixSums

330826f

Add TreeCollider.Intersections variant for PooledList

8b6222c

Add the more flexible ListOverSpan

c21626b

Migrate HumanPhysics, FairyPhysics and LensFlare

3636d12

Migrate FindActorFloorCollisions and FairyHoverBehind

9028731

Fix RangeCollection.AddBestFit returning out-of-range range

5c44510

Remove IIntersectionable

71b8921

Remove benchmark project

5fc036d

Nitpicks

75fb68e

Helco force-pushed the allocation-less-collisions branch from 401f309 to 75fb68e Compare October 19, 2024 15:35

Helco marked this pull request as ready for review October 19, 2024 15:36

Helco added 2 commits October 19, 2024 17:47

Fix code analysis warnings

9b1d4d8

Add hidden renderdoc option in release builds for dev prod

2c9ef67

Helco merged commit 4a36bd8 into master Oct 20, 2024
1 check passed

Helco deleted the allocation-less-collisions branch October 20, 2024 09:13

Helco added a commit that referenced this pull request Oct 20, 2024

Revert "zzre: Allocation-less colliders (#371)"

0731adc

This reverts commit 4a36bd8.

Helco restored the allocation-less-collisions branch October 20, 2024 09:14

Helco mentioned this pull request Oct 20, 2024

Reduce per-frame memory allocations #368

Open
