Make delegates immutable #99200

Open · wants to merge 27 commits into base: main
Conversation

MichalPetryka (Contributor)

First attempt at making delegates immutable in CoreCLR so that they can be allocated on the NonGC heap.

I've checked it locally with a simple app under corerun, using a delegate from an unloadable ALC, and it did not crash, assert, or unload the ALC from under the delegate. However, I couldn't find any runtime tests that verify delegates from unloadable ALCs work, so the CI coverage might be missing.

One small point of concern is that this might make delegate equality checks slower, since they rely on comparing the methods in the last "slow path" check, which, AFAIR, is always hit when comparing different delegates.

Contributes to #85014.

cc @jkotas

MichalPetryka (Contributor, Author) commented Mar 4, 2024

I'm not sure what's up with the failures here; tests that fail on CI seem to pass on my machine.
EDIT: I was testing with R2R disabled locally, since VS kept complaining about being unable to load PDBs for it.

(Review thread on src/coreclr/vm/object.cpp; outdated, resolved.)
MichalPetryka marked this pull request as ready for review on March 4, 2024 at 23:28.
AndyAyersMS (Member)

/azp run runtime-coreclr gcstress0x3-gcstress0xc


Azure Pipelines successfully started running 1 pipeline(s).

jkotas (Member) commented Mar 5, 2024

Could you please collect some perf numbers to give us an idea about the improvements and regressions in the affected areas? We may want to do some optimizations to mitigate the regressions.

jkotas (Member) commented Mar 5, 2024

> One small point of concern is that this might make delegate equality checks slower since they rely on checking the methods in the last "slow path" check, which is however always hit for different delegates AFAIR.

The existing code tries to compare MethodInfos as a cheap fast path. Most delegates do not have a cached MethodInfo, so this fast path is hit rarely; but it is very cheap, so it is still worth it.

This cheap fast path is not cheap anymore with this change. It may be best to delete the fast path that tries to compare the MethodInfos and potentially optimize Delegate_InternalEqualMethodHandles instead.

MichalPetryka (Contributor, Author) commented Mar 5, 2024

> Could you please collect some perf numbers to give us an idea about the improvements and regressions in the affected areas? We may want to do some optimizations to mitigate the regressions.

I think the things that need benchmarking here are the equality checks and maybe the GC impact of collectible delegates being stored in the CWT; the rest shouldn't be performance-sensitive enough to matter, I think. I'm not fully sure what the proper way to benchmark the latter would be.

> This cheap fast path is not cheap anymore with this change. It may be best to delete the fast path that is trying to compare the MethodInfos and potentially optimize Delegate_InternalEqualMethodHandles instead.

I am going to benchmark the impact of the equality change tomorrow; if the impact isn't big, potential optimizations can be done later.

jkotas (Member) commented Mar 5, 2024

> I'm not fully sure what'd be the proper way for benchmarking the latter.

Write a small program that loads an assembly as collectible and calls a method in it. The method in the collectible assembly can create delegates in a loop. (If you would like to do it under BenchmarkDotNet, that works too, but it is probably more work.)
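A minimal, self-contained sketch of such a program (my own illustration, not from the thread; it uses a RunAndCollect dynamic assembly in place of loading a collectible assembly from disk, so the emitted method lives in a collectible LoaderAllocator without needing an external file):

```csharp
using System;
using System.Diagnostics;
using System.Reflection;
using System.Reflection.Emit;

// Build a collectible dynamic assembly (RunAndCollect) so the emitted
// method lives in a collectible LoaderAllocator, then create delegates
// to it in a loop and time the creation path.
var asm = AssemblyBuilder.DefineDynamicAssembly(
    new AssemblyName("CollectibleBench"), AssemblyBuilderAccess.RunAndCollect);
var type = asm.DefineDynamicModule("M").DefineType("T", TypeAttributes.Public);
var method = type.DefineMethod("Answer",
    MethodAttributes.Public | MethodAttributes.Static, typeof(int), Type.EmptyTypes);
var il = method.GetILGenerator();
il.Emit(OpCodes.Ldc_I4, 42);
il.Emit(OpCodes.Ret);
var mi = type.CreateType()!.GetMethod("Answer")!;

const int N = 1_000_000;
var sw = Stopwatch.StartNew();
for (int i = 0; i < N; i++)
{
    // Delegate creation over a collectible method is the affected path;
    // compare main vs. PR builds under corerun.
    var d = mi.CreateDelegate<Func<int>>();
    GC.KeepAlive(d);
}
sw.Stop();
Console.WriteLine($"{N} delegate creations: {sw.Elapsed.TotalMilliseconds:F1} ms");
Console.WriteLine(mi.CreateDelegate<Func<int>>()()); // prints 42
```

Running the same binary against a main build and a PR build of the runtime would give the collectible-delegate creation delta being asked about here.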

jkotas (Member) commented Mar 6, 2024

We should be able to minimize the perf hit for collectible assemblies by piggybacking on the _invocationList field in MulticastDelegate. The field is already used to keep things alive in some cases; it may contain a LoaderAllocator: https://github.com/dotnet/runtime/pull/99200/files#diff-1686e89b85836d3c27bf44e0d986c29cec82fba34b66bc76f719d5b1e22cfac9R32.

jkotas (Member) commented Mar 6, 2024

And if correctness for collectible assemblies can be taken care of by reusing the existing delegate field, the ConditionalWeakTable becomes just a throughput optimization for repeated Delegate.Method calls. It may be fine to take a throughput regression for this case. I think it is rare for Delegate.Method to be called repeatedly, and when it is called, it is going to be used together with other reflection that is generally slow, so it does not need to be super fast.
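For reference, the caching shape under discussion can be sketched like this (a hypothetical stand-in, not the runtime's actual code; here the "slow resolve" is played by Delegate.Method itself):

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;

Func<int> f = "hello".GetHashCode;
Console.WriteLine(MethodInfoCache.GetMethod(f).Name); // prints GetHashCode

// Hypothetical cache, not the runtime's implementation: the table
// associates a MethodInfo with each delegate without keeping the
// delegate (or its collectible LoaderAllocator) alive; the entry dies
// with the delegate.
static class MethodInfoCache
{
    private static readonly ConditionalWeakTable<Delegate, MethodInfo> s_cache = new();

    public static MethodInfo GetMethod(Delegate d)
        // GetValue runs the factory at most once per live key, so
        // repeated lookups for the same delegate hit the cache.
        => s_cache.GetValue(d, static del => del.Method);
}
```

The point above is that this table is only a throughput win for repeated lookups; correctness (keeping the LoaderAllocator alive) would come from the delegate field instead.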

MichalPetryka (Contributor, Author)

>> One small point of concern is that this might make delegate equality checks slower since they rely on checking the methods in the last "slow path" check, which is however always hit for different delegates AFAIR.

> The existing code tries to compare MethodInfos as a cheap fast path. Most delegates do not have cached MethodInfo, so this fast path is hit rarely - but it is very cheap, so it is still worth it.

> This cheap fast path is not cheap anymore with this change. It may be best to delete the fast path that is trying to compare the MethodInfos and potentially optimize Delegate_InternalEqualMethodHandles instead.

I've finally gotten around to benchmarking the equality here, and the perf regression is noticeable:


```
BenchmarkDotNet v0.13.12, Windows 10 (10.0.19045.4170/22H2/2022Update)
AMD Ryzen 9 7900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 8.0.200
  [Host]     : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-XDNOLE : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-ITZOXA : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  .NET 8.0   : .NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Runtime=.NET 8.0
```

| Method | Job | Toolchain | Mean | Error | StdDev | Ratio | Code Size | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|
| TwoCached | Job-XDNOLE | main | 4.439 ns | 0.0050 ns | 0.0039 ns | 0.27 | 523 B | - | NA |
| LeftCached | Job-XDNOLE | main | 16.540 ns | 0.1013 ns | 0.0898 ns | 0.99 | 523 B | - | NA |
| RightCached | Job-XDNOLE | main | 16.529 ns | 0.1024 ns | 0.0855 ns | 0.99 | 523 B | - | NA |
| TwoUncached | Job-XDNOLE | main | 16.682 ns | 0.0575 ns | 0.0510 ns | 1.00 | 523 B | - | NA |
| TwoCached | Job-ITZOXA | pr | 13.864 ns | 0.0261 ns | 0.0203 ns | 0.83 | 523 B | - | NA |
| LeftCached | Job-ITZOXA | pr | 18.714 ns | 0.0916 ns | 0.0812 ns | 1.12 | 523 B | - | NA |
| RightCached | Job-ITZOXA | pr | 23.233 ns | 0.0723 ns | 0.0677 ns | 1.39 | 523 B | - | NA |
| TwoUncached | Job-ITZOXA | pr | 21.247 ns | 0.0837 ns | 0.0783 ns | 1.27 | 523 B | - | NA |
| TwoCached | .NET 8.0 | Default | 5.162 ns | 0.0643 ns | 0.0537 ns | 0.31 | 565 B | - | NA |
| LeftCached | .NET 8.0 | Default | 17.194 ns | 0.0628 ns | 0.0557 ns | 1.03 | 565 B | - | NA |
| RightCached | .NET 8.0 | Default | 17.534 ns | 0.0705 ns | 0.0588 ns | 1.05 | 565 B | - | NA |
| TwoUncached | .NET 8.0 | Default | 17.410 ns | 0.0725 ns | 0.0678 ns | 1.04 | 565 B | - | NA |

MichalPetryka (Contributor, Author) commented Mar 23, 2024

@jkotas Would it be possible to repurpose _methodPtrAux to store the VM MethodDesc* and just make Equals compare that? IIRC it's already fetched on delegate creation anyway, and the pointer seems to be used only for equality today?

MichalPetryka (Contributor, Author) commented Mar 24, 2024

> @jkotas Would it be possible to repurpose _methodPtrAux to store the VM MethodDesc* and just make equals compare that? IIRC it's fetched already on delegate creation anyway and the pointer seems to be only used for equality today?

Hmm, the code says `object? _target; // Initialized by VM as needed; null if static delegate`, but apparently this is wrong: the target is the delegate itself for static methods. For my idea to work this would need to be fixed, but today setting it to null crashes the runtime for some reason.
EDIT: it seems static delegates work completely differently than I thought (and than this comment suggests). I assumed the runtime creates a new stub for every static method passed to the delegate constructor, but it actually uses a single shared stub per delegate type that calls the pointer in _methodPtrAux. That makes my idea impossible to execute.

jkotas (Member) commented Mar 24, 2024

> it seems to use a single shared stub per delegate type that calls the pointer in _methodPtrAux

Right.

This table has the basic delegate kinds. In addition to that, there are multicast delegates and marshalled interop delegates.

The delegate fields and implementation are split between Delegate and MulticastDelegate for historic reasons. If you would like to explore playing tricks with overloading the fields, I think it would help to move all fields to Delegate in a separate prep-work PR. Similar mechanical refactoring was done for native AOT in #97959.

MichalPetryka (Contributor, Author)

> If you would like to explore playing tricks with overloading the fields, I think it would help to move all fields to Delegate in a separate prep-work PR.

One issue with doing that is that it'd cause the runtime to reorder fields, putting _invocationList before the pointer fields; we'd need some other trick like fixed layout to prevent this.

jkotas (Member) commented Mar 25, 2024

> One issue with doing that is that it'd cause the runtime to reorder fields and put the invocationList before the pointer fields, we'd need to use some other tricks like fixed layout to prevent this.

Yes, that's fine. It would be a temporary measure until something forces us to take an R2R version reset.

MichalPetryka (Contributor, Author)

@jkotas I recently had the idea to cache the MethodDesc* for equality checks instead, which is what I pushed here today; is this approach sound enough?
It doesn't get us any size savings, but it should improve non-reflection usage while only adding the CWT cost for reflection, and it lets us put instances on the FOH.

```
@@ -130,12 +139,7 @@ public override bool Equals([NotNullWhen(true)] object? obj)
// fall through method handle check
```

MichalPetryka (Contributor, Author):
Could any of the checks above be removed now that the "slow" path is not that slow?

```
@@ -1725,8 +1729,10 @@ extern "C" void QCALLTYPE Delegate_Construct(QCall::ObjectHandleOnStack _this, Q
    if (COMDelegate::NeedsWrapperDelegate(pMeth))
        refThis = COMDelegate::CreateWrapperDelegate(refThis, pMeth);

    refThis->SetMethodDesc(pMethOrig);
```

MichalPetryka (Contributor, Author):

I'm not 100% positive I use the correct descs when setting the field from the VM in places like these.

Member:
Is this trying to be a perf fix or correctness fix?

Delegate_Construct is a rare slow path. I do not think we need to bother optimizing it.

MichalPetryka (Contributor, Author):

> Is this trying to be a perf fix or correctness fix?

> Delegate_Construct is a rare slow path. I do not think we need to bother optimizing it.

It was supposed to be a perf improvement, but per the discussion with @AndyAyersMS, if we could guarantee the desc is always reliably set by the VM when a delegate is created and never lazily initialized, delegate GDV could then be improved by switching to it. I feel like guaranteeing that should be left to a future PR, though, since it needs somebody with more VM/reflection knowledge to check all the possible paths.

```diff
 if (pMeth->GetLoaderAllocator()->IsCollectible())
-    refThis->SetMethodBase(pMeth->GetLoaderAllocator()->GetExposedObject());
+    refThis->SetInvocationList(pMeth->GetLoaderAllocator()->GetExposedObject());
```

MichalPetryka (Contributor, Author):
InvocationList seems to have already been used for LoaderAllocators in specific cases, so I just moved all of them there. I'm not 100% sure whether there might be a case where the VM puts one there while the field is already occupied.

jkotas (Member) commented Oct 17, 2024

> This cheap fast path is not cheap anymore with this change. It may be best to delete the fast path that is trying to compare the MethodInfos and potentially optimize Delegate_InternalEqualMethodHandles instead.

Did you try to implement and measure this variant?

This change is about perf trade-offs. We should be talking numbers: micro-benchmark numbers for the affected methods, and also how often the affected methods are called in the real world. For example, it would be useful to collect some numbers on how often Delegate equality or Delegate.Method is called in real-world apps.

MichalPetryka (Contributor, Author)

> Did you try to implement and measure this variant?

This idea came from thinking about that; in fact, I just wanted to partially inline Delegate_InternalEqualMethodHandles into managed code. Moving the handle fetching didn't seem trivial, so I just cached it in a field.

> For example, it would be useful to collect some numbers for how often is Delegate equality or Delegate MethodInfo called in real-world apps.

I assume using delegates as dictionary keys would be the most popular usage for equality.
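As a minimal illustration of that usage pattern (a hypothetical example, not from the PR), a dictionary keyed by delegates hits Delegate.GetHashCode and Delegate.Equals on every lookup, so it exercises exactly the paths this change touches:

```csharp
using System;
using System.Collections.Generic;

// Dictionary lookups keyed by delegates go through Delegate.GetHashCode
// and Delegate.Equals, the methods affected by this change.
var handlers = new Dictionary<Func<int>, string>
{
    [One] = "one",
    [Two] = "two",
};

// A delegate created later over the same static method compares equal,
// so the lookup finds the existing entry.
Func<int> again = One;
Console.WriteLine(handlers[again]); // prints one

static int One() => 1;
static int Two() => 2;
```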

> This change is about perf trade-offs.

That is actually why I like the current variant: it makes the least difference to the numbers, only slightly regressing .Method while keeping size unchanged and probably even making equality slightly faster, while still unlocking placement of delegates on the FOH. I feel like any other simplifications could wait for later, while this would unlock implementing #85014.

> We should be talking numbers. Micro-benchmark numbers for affected methods and also how often are the affected methods called in real-world.

So, by "affected methods" you'd be referring to improvements in Equals and GetHashCode speed and regressions in .Method? AFAIR all other APIs should remain unaffected.

MichalPetryka (Contributor, Author) commented Oct 18, 2024

> We should be talking numbers. Micro-benchmark numbers for affected methods and also how often are the affected methods called in real-world.

I've benchmarked the APIs; I had to massage the code a bit to make the JIT happy, though.
Equals ("cached" means Method was called):


```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.5011/22H2/2022Update)
AMD Ryzen 9 7900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 9.0.100-rc.2.24474.11
  [Host]     : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-JVFCER : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-XPKMUE : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
```

| Method | Job | Toolchain | Mean | Error | StdDev | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| StaticUncachedUncached | Job-JVFCER | PR | 4.316 ns | 0.1073 ns | 0.2788 ns | 0.14 | 0.01 | 1,161 B | - | NA |
| StaticUncachedUncached | Job-XPKMUE | Main | 31.299 ns | 0.3014 ns | 0.2672 ns | 1.00 | 0.01 | 1,230 B | - | NA |
| StaticCachedCached | Job-JVFCER | PR | 3.979 ns | 0.1002 ns | 0.2221 ns | 0.74 | 0.04 | 1,161 B | - | NA |
| StaticCachedCached | Job-XPKMUE | Main | 5.379 ns | 0.1277 ns | 0.1255 ns | 1.00 | 0.03 | 1,233 B | - | NA |
| InstanceUncachedUncached | Job-JVFCER | PR | 4.892 ns | 0.1166 ns | 0.1296 ns | 0.14 | 0.00 | 1,151 B | - | NA |
| InstanceUncachedUncached | Job-XPKMUE | Main | 36.004 ns | 0.7288 ns | 0.7798 ns | 1.00 | 0.03 | 1,212 B | - | NA |
| InstanceCachedCached | Job-JVFCER | PR | 5.057 ns | 0.1318 ns | 0.3759 ns | 0.92 | 0.07 | 1,151 B | - | NA |
| InstanceCachedCached | Job-XPKMUE | Main | 5.522 ns | 0.1119 ns | 0.1569 ns | 1.00 | 0.04 | 1,225 B | - | NA |
| StaticUncachedCached | Job-JVFCER | PR | 3.628 ns | 0.0808 ns | 0.0716 ns | 0.12 | 0.00 | 1,161 B | - | NA |
| StaticUncachedCached | Job-XPKMUE | Main | 31.057 ns | 0.3800 ns | 0.3554 ns | 1.00 | 0.02 | 1,230 B | - | NA |
| StaticCachedUncached | Job-JVFCER | PR | 3.569 ns | 0.0495 ns | 0.0463 ns | 0.11 | 0.00 | 1,161 B | - | NA |
| StaticCachedUncached | Job-XPKMUE | Main | 31.897 ns | 0.1509 ns | 0.1338 ns | 1.00 | 0.01 | 1,254 B | - | NA |
| InstanceUncachedCached | Job-JVFCER | PR | 4.276 ns | 0.1043 ns | 0.1958 ns | 0.13 | 0.01 | 1,151 B | - | NA |
| InstanceUncachedCached | Job-XPKMUE | Main | 32.738 ns | 0.3142 ns | 0.2624 ns | 1.00 | 0.01 | 1,212 B | - | NA |
| InstanceCachedUncached | Job-JVFCER | PR | 4.654 ns | 0.1103 ns | 0.2490 ns | 0.14 | 0.01 | 1,151 B | - | NA |
| InstanceCachedUncached | Job-XPKMUE | Main | 34.013 ns | 0.6977 ns | 0.9315 ns | 1.00 | 0.04 | 1,237 B | - | NA |

GetHashCode:


```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.5011/22H2/2022Update)
AMD Ryzen 9 7900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 9.0.100-rc.2.24474.11
  [Host]     : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-JVFCER : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-XPKMUE : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
```

| Method | Job | Toolchain | Mean | Error | StdDev | Ratio | Code Size | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Static | Job-JVFCER | PR | 2.198 ns | 0.0043 ns | 0.0041 ns | 0.50 | 672 B | - | NA |
| Static | Job-XPKMUE | Main | 4.372 ns | 0.0058 ns | 0.0051 ns | 1.00 | 860 B | - | NA |
| Instance | Job-JVFCER | PR | 3.534 ns | 0.0140 ns | 0.0131 ns | 0.59 | 808 B | - | NA |
| Instance | Job-XPKMUE | Main | 5.997 ns | 0.0078 ns | 0.0073 ns | 1.00 | 883 B | - | NA |

Method:


```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.5011/22H2/2022Update)
AMD Ryzen 9 7900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 9.0.100-rc.2.24474.11
  [Host]     : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-JVFCER : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-XPKMUE : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
```

| Method | Job | Toolchain | Mean | Error | StdDev | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| Static | Job-JVFCER | PR | 9.305 ns | 0.0154 ns | 0.0144 ns | 5.55 | 0.02 | 353 B | - | NA |
| Static | Job-XPKMUE | Main | 1.677 ns | 0.0078 ns | 0.0069 ns | 1.00 | 0.01 | 1,556 B | - | NA |
| Instance | Job-JVFCER | PR | 9.317 ns | 0.0189 ns | 0.0177 ns | 5.54 | 0.03 | 353 B | - | NA |
| Instance | Job-XPKMUE | Main | 1.682 ns | 0.0098 ns | 0.0091 ns | 1.00 | 0.01 | 1,556 B | - | NA |

MichalPetryka (Contributor, Author)

I'm not sure if the System.Text.Json failures here are related; I saw them before on Windows x86 Debug and now on macOS x64 Debug, and I couldn't find an issue for them either.

MichalPetryka (Contributor, Author)

For context, my only motivation here is unblocking #85014, since I already have non-PGO delegate inlining mostly working locally. It currently only handles static readonly delegates and needs this change to work for lambdas, plus #108606 and #108579 to work better with delegates in general. (I also don't have the NativeAOT VM implementation for my new JIT-EE API finished yet.)
