Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SortedDictionary<BigInteger, MyObject> System.AccessViolationException #108763

Open
swtrse opened this issue Oct 10, 2024 · 50 comments
Open

SortedDictionary<BigInteger, MyObject> System.AccessViolationException #108763

swtrse opened this issue Oct 10, 2024 · 50 comments
Labels
area-VM-coreclr regression-from-last-release tenet-reliability Reliability/stability related issue (stress, load problems, etc.)
Milestone

Comments

@swtrse
Copy link

swtrse commented Oct 10, 2024

Description

HI,

I ran into a problem were using SortedDictionary<BigInteger, MyObject> ran into a System.AccessViolationException every single time.
The error occured during an BenchmarkDotNet run.
The error does not occur in debug mode. As far as I found out the error only appeared in Release mode.
Also, the error disappears if you want to analyze the behavior with dotMemory.

Reproduction Steps

MyObject is a sealed generic class bit I think it does not matter.

public sealed class Test
{
    private readonly SortedDictionary<BigInteger, MyObject> _limitSortedList;

    .......

    private MyObject? GetNextNode(BigInteger target)
    {
	if (_limitSortedList.TryGetValue(target, out Swtrse2Node<TKey, TOrder>? node)) { return node; }

	var low = 0;
	int high = _limitSortedList.Count - 1;
	List<BigInteger> list = _limitSortedList.Keys.ToList();   //<---- Exception

	while (low <= high)
	{
		int mid = (low + high) / 2;
		int comparison = _limitSortedList.Comparer.Compare(list[mid], target);

		if (comparison < 0) { low = mid + 1; }
		else { high = mid - 1; }
	}

	return low < _limitSortedList.Count ? _limitSortedList[list[low]] : null;
    }
}

Expected behavior

I expect no AccessViolationException when calling Framework functions

Actual behavior

Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at System.Collections.Generic.List1[[System.Numerics.BigInteger, System.Runtime.Numerics, Version=9.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a]]..ctor(System.Collections.Generic.IEnumerable1<System.Numerics.BigInteger>)

Regression?

No response

Known Workarounds

No response

Configuration

// BenchmarkDotNet v0.14.0
// Runtime=.NET 9.0.0 (9.0.24.43107), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
// GC=Concurrent Workstation
// HardwareIntrinsics=AVX-512F+CD+BW+DQ+VL+VBMI,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT VectorSize=256

Other information

No response

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Oct 10, 2024
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

@EgorBo
Copy link
Member

EgorBo commented Oct 10, 2024

Please attach a self-contained repro or a crash dump, it's not possible to diagnose the issue from just the given snippet. It could be caused by a semi-valid unsafe/interop in your application (stack traces could be confusing in such cases) or a thread-safety issue (SortedDictionary is not thread safe). Also, is this Linux or Windows? BDN has a known issue on Linux + DisassemblyDiagnoser.

@swtrse
Copy link
Author

swtrse commented Oct 11, 2024

Problem is. I tried using dotMemory to analyse the error. As soon as the memory profiler is attached the error is gone.
I see no option to generate a crashdump in any way.

I run on Windows 11.
I do not use any unsafe code.

@jkotas
Copy link
Member

jkotas commented Oct 11, 2024

I see no option to generate a crashdump in any way.

You can enable crashdumps by following https://learn.microsoft.com/en-us/windows/win32/wer/collecting-user-mode-dumps (use FullDump - it has the complete info)

@swtrse
Copy link
Author

swtrse commented Oct 11, 2024

Thx here is the dump https://we.tl/t-zWlR3MhCbz. Sorry I had to use wetransfer ,github is blocking the upload.

And I strongly believe the Error is within SortedDictionary<>.Keys in combination with ToList() or ToArray()
Justification:

  1. If I change the code to a version without calling ToList() or ToArray() the error is gone
  2. If I change SortedDictionary<> to SortedList<> the error is gone
  3. The error happens instantly after starting the application. I suspect on one of the first calls.

@jkotas
Copy link
Member

jkotas commented Oct 11, 2024

Details of the crash - bad GC pointer:

System_Collections!System.Collections.Generic.SortedSet<System.Collections.Generic.KeyValuePair<System.Numerics.BigInteger,System.__Canon>>.InOrderTreeWalk+0x135:
00007ffe`b79fdbb5 ff4514          inc     dword ptr [rbp+14h] ss:0000013a`1638abfc=00002213

The crash is on .NET 9 RC1. We fixed several bugs that can lead to intermittent crashes like this one for .NET 9 RC2. Could you please upgrade to .NET 9 RC2 and see whether it is going to fix the crash?

Thx here is the dump https://we.tl/t-zWlR3MhCbz.

Next time, you can create an issue at https://developercommunity.visualstudio.com/, attach the dump to it and link it from here. developercommunity does not have uploads limits like github and it has better privacy controls.

@swtrse
Copy link
Author

swtrse commented Oct 15, 2024

The crash still remains with .Net 9 RC2.

@jkotas
Copy link
Member

jkotas commented Oct 15, 2024

Could you please upload a .NET 9 RC2 crash dump?

@swtrse
Copy link
Author

swtrse commented Oct 16, 2024

@jkotas
Copy link
Member

jkotas commented Oct 16, 2024

Thanks - the RC2 crash is exactly same as the RC1 crash.

There is weird discrepancy between the state of the registers and the faulting address in EXCEPTION_RECORD in both dumps.

00007ffe`b79fdbb5 ff4514          inc     dword ptr [rbp+14h] ss:0000013a`1638abfc=00002213

vs.

0:024> dx -r1 ((coreclr!_EXCEPTION_RECORD *)0x38c4efc4f0)
((coreclr!_EXCEPTION_RECORD *)0x38c4efc4f0)                 : 0x38c4efc4f0 [Type: _EXCEPTION_RECORD *]
    [+0x000] ExceptionCode    : 0xc0000005 [Type: unsigned long]
    [+0x004] ExceptionFlags   : 0x1 [Type: unsigned long]
    [+0x008] ExceptionRecord  : 0x0 [Type: _EXCEPTION_RECORD *]
    [+0x010] ExceptionAddress : 0x7ffc351acd62 [Type: void *]
    [+0x018] NumberParameters : 0x2 [Type: unsigned long]
    [+0x020] ExceptionInformation [Type: unsigned __int64 [15]]
0:024> dx -r1 (*((coreclr!unsigned __int64 (*)[15])0x38c4efc510))
(*((coreclr!unsigned __int64 (*)[15])0x38c4efc510))                 [Type: unsigned __int64 [15]]
    [0]              : 0x1 [Type: unsigned __int64]
    [1]              : 0x24eba52b1a6 [Type: unsigned __int64] <- This should be same as [rbp+14h] ss:0000013a`1638abfc

This suggests that hardware somehow accessed a different address than what the registers are pointing at.

Are you able to reproduce this crash on more than one machine? One possible explanation of these symptoms is a hardware defect.

@jkotas
Copy link
Member

jkotas commented Oct 16, 2024

Also, could you please try whether it reproduces with disabled tiered compilation (e.g. DOTNET_TieredCompilation=0 environment variable)?

The faulting method has 4 OptimizedTier1OSR variants of the code that is unusual. It may be related.

  IL Addr:            00007fff8c6cc624
     CodeAddr:           00007ffeb79e5ca0  (OptimizedTier1OSR)
     NativeCodeVersion:  0000013A11EB5F00
     CodeAddr:           00007ffeb79e5fa0  (OptimizedTier1OSR)
     NativeCodeVersion:  0000013A11EB5C40
     CodeAddr:           00007ffeb79fcb40  (QuickJitted + Instrumented)
     NativeCodeVersion:  0000013A0078C9D0
     CodeAddr:           00007ffeb79fd220  (OptimizedTier1OSR)
     NativeCodeVersion:  0000013A11EB6180
     CodeAddr:           00007ffeb79fda80  (OptimizedTier1OSR)
     NativeCodeVersion:  0000013A11EB5AC0
     CodeAddr:           0000000000000000  (OptimizedTier1)
     NativeCodeVersion:  0000013A00794FC0
     CodeAddr:           00007ffeb79e3e50  (QuickJitted)
     NativeCodeVersion:  0000000000000000

@swtrse
Copy link
Author

swtrse commented Oct 16, 2024

Also, could you please try whether it reproduces with disabled tiered compilation (e.g. DOTNET_TieredCompilation=0 environment variable)?

I tried that with no effect, The error is still the same

@swtrse
Copy link
Author

swtrse commented Oct 17, 2024

This suggests that hardware somehow accessed a different address than what the registers are pointing at.

Are you able to reproduce this crash on more than one machine? One possible explanation of these symptoms is a hardware defect.

Thats highly unlikely. The system despite using windows 11 as an OS runs rock solid for weeks without reboot. A Hardware defect in this kind of area should have a negative impact on system stability in general, but I do not see it.

The error did not appear on a second machine I tried. But the other machine is a complexly different CPU generation. Maybe it is something CPU or chipset specific.

I did backport the solution to .net 8 as soon as the project runs with .net8 the error is gone. Even when Benchmark.net running with .net 9 environment

// Benchmark Process Environment Information:
// BenchmarkDotNet v0.14.0
// Runtime=.NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
// GC=Concurrent Workstation
// HardwareIntrinsics=AVX-512F+CD+BW+DQ+VL+VBMI,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT VectorSize=256
// Job: Job-NHHCHO(Runtime=.NET 9.0, Toolchain=InProcessEmitToolchain, InvocationCount=1, IterationCount=3, LaunchCount=1, UnrollFactor=1, WarmupCount=3)

I can provide the whole solution if that is of any help.

@jkotas
Copy link
Member

jkotas commented Oct 17, 2024

Ok, it is a good data point that it does not repro on the other machine. Here is another thing to try: Could you please try to add <PropertyGroup> <CETCompat>false</CETCompat> </PropertyGroup> to your .csproj file and check whether it makes the problem to go away? (CET is a hardware security feature that we have enabled in .NET 9 by default when it is available - https://learn.microsoft.com/en-us/dotnet/core/compatibility/interop/9.0/cet-support has more details.)

I can provide the whole solution if that is of any help.

Yes, that would help too.

@swtrse
Copy link
Author

swtrse commented Oct 17, 2024

Ok, it is a good data point that it does not repro on the other machine. Here is another thing to try: Could you please try to add <PropertyGroup> <CETCompat>false</CETCompat> </PropertyGroup> to your .csproj file and check whether it makes the problem to go away? (CET is a hardware security feature that we have enabled in .NET 9 by default when it is available - https://learn.microsoft.com/en-us/dotnet/core/compatibility/interop/9.0/cet-support has more details.)

That did it!
With this setting the error is gone.
I only did set it in the console app that's running the benchmarks, and the error is no more. I was prepared to add that switch to all libraries in the solution, but it was not necessary.

Is this enough information or still need the whole solution?

@jkotas jkotas added area-VM-coreclr regression-from-last-release tenet-reliability Reliability/stability related issue (stress, load problems, etc.) and removed area-System.Collections labels Oct 17, 2024
Copy link
Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

@jkotas
Copy link
Member

jkotas commented Oct 17, 2024

The sequence that leads to the crash (#108763 (comment) is the full dump to look at):

  • SortedSet code called CastHelpers.StelemRef
  • CastHelpers.StelemRef tail-called WriteBarrier
  • We hit some sort of bad crash inside WriteBarrier
  • We unwound from the write-barrier and patched the exception context that makes it look like the crash is in SortedSet code

I suspect that the bad crash inside WriteBarrier is somehow caused by CET-aware thread suspension interleaving with the above sequence. The crash does not repro when CET is disabled.

@janvorli @VSadov Does it ring any bells? Could you please take it from here?

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Oct 17, 2024
@mangod9 mangod9 added this to the 10.0.0 milestone Oct 17, 2024
@janvorli
Copy link
Member

What I don't understand though is that even if we hit whatever issue in the write barrier, then if we adjusted the context back to the SortedSet, we should have invoked the exception handling code, but the dump seems to indicate that it didn't happen.
I think that the fact that the exception location is not the one in the write barrier but instead at its caller is just a side effect of us unwinding write barrier in the exception context we get from the OS and watson seeing that modified context.

I can provide the whole solution if that is of any help.

@swtrse that would be great, it would enable me to debug why the managed exception handling didn't kick in. Running !analyze on the dump shows the crash was due to an attempt to write to address 0000017a27a803f. Since that's out of the null reference detection area, that's likely causing the AV.

@jkotas
Copy link
Member

jkotas commented Oct 17, 2024

we should have invoked the exception handling code

We have invoked some exception handling code. Thread.m_ExceptionState.m_pCurrentTracker.m_ptrs has the information about the crash.

Running !analyze on the dump shows the crash was due to an attempt to write to address 0000017a27a803f.

I think you meant 0x0000017a27a803f1. It is address of RegionToGeneration map in the write barrier. I do not see how we ended up writing to that address.

CORINFO_HELP_ASSIGN_REF:
00007ffe`b6da0010 488911          mov     qword ptr [rcx],rdx
00007ffe`b6da0013 4c8bc1          mov     r8,rcx
00007ffe`b6da0016 48b8f103a8277a010000 mov rax,17A27A803F1h <----
00007ffe`b6da0020 48c1e916        shr     rcx,16h
00007ffe`b6da0024 803c0100        cmp     byte ptr [rcx+rax],0
00007ffe`b6da0028 7504            jne     00007ffe`b6da002e
...

@janvorli
Copy link
Member

Ah, right. We have called EEPolicy::HandleFatalError in FailFastIfCorruptingStateException and so Watson took the context passed to the HandleFatalError as the one to show as the top frame.

@janvorli
Copy link
Member

I think you meant 0x0000017a27a803f1.

Right, incomplete cut&paste

@swtrse
Copy link
Author

swtrse commented Oct 17, 2024

@janvorli
Copy link
Member

Thank you! I'll investigate it.

@janvorli
Copy link
Member

@swtrse could you please also share the command line you use to run the problematic benchmark?

@swtrse
Copy link
Author

swtrse commented Oct 18, 2024

@swtrse could you please also share the command line you use to run the problematic benchmark?
Just the exe without parameters compiled in release mode. The error happens in my case as soon as the first benchmark starts.

Cobra.Exoda.Matching\Cobra.Exoda.Matching.BenchmarkRunner\bin\Release\net9.0>Cobra.Exoda.Matching.BenchmarkRunner.exe

You may have to remove the <CETCompat>false</CETCompat> entry from the project. I do not remember if I edited the file before or after I uploaded the project.

@janvorli
Copy link
Member

When running under WinDbg, I can see that the crash really occurs when accessing the address in rax. And the current Rip is in the middle of a machine instruction of a write barrier.

CORINFO_HELP_ASSIGN_REF:
00007ffe`ffca0010 488911          mov     qword ptr [rcx],rdx
00007ffe`ffca0013 4c8bc1          mov     r8,rcx
00007ffe`ffca0016 48b86d3e9a88eb010000 mov rax,1EB889A3E6Dh
00007ffe`ffca0020 48c1e916        shr     rcx,16h
00007ffe`ffca0024 803c0100        cmp     byte ptr [rcx+rax],0
00007ffe`ffca0028 7504            jne     00007ffe`ffca002e
00007ffe`ffca002a f3c3            rep ret
00007ffe`ffca002c 6690            nop
00007ffe`ffca002e 49b90000c074ab010000 mov r9,1AB74C00000h
00007ffe`ffca0038 493bd1          cmp     rdx,r9
00007ffe`ffca003b 7301            jae     00007ffe`ffca003e
00007ffe`ffca003d c3              ret
00007ffe`ffca003e 49b900008074eb010000 mov r9,1EB74800000h
^^^ When the crash occurs, the Rip is at 00007ffe`ffca0046. This happens in all runs. Just the address is obviously different in the bits above bit 11 due to the fact it is a dynamically allocated page. 
00007ffe`ffca0048 493bd1          cmp     rdx,r9
00007ffe`ffca004b 7202            jb      00007ffe`ffca004f

Here is my theory on what happens:

  • The special user mode APC we use to suspend the thread occurs at the crashing location, but in a different flavor of write barrier
  • While in our APC handler code, the GC write barrier is switched to another flavor - the one above (WRITE_BARRIER_BIT_REGIONS64)
  • The APC handler returns to the original location, but the code of the barrier is now different, so we end up in the middle of an instruction.

I still need to verify this assumption though.

@janvorli
Copy link
Member

And I don't know how CET would influence that since we should be using the APC both with and without CET enabled.

@EgorBo
Copy link
Member

EgorBo commented Oct 18, 2024

the GC write barrier is switched to another flavor - the one above

I thought that write-barriers aren't interruptible and GC only changes write-barriers on pauses?

@janvorli
Copy link
Member

Looks like my theory was wrong. The APC handler never changes the write barrier for the case when we are in the write barrier. So I don't know what's causing the execution to go to a middle of an instruction yet. I will need to investigate that further.

@janvorli
Copy link
Member

Ok, I've found the real culprit. We hijack return address of the write barrier using the special hijack address for CET. Then the write barrier returns and that kicks in the vectored exception handler, as you can see below:

This is the write barrier code:

CORINFO_HELP_ASSIGN_REF:
00007ffe`94920010 488911          mov     qword ptr [rcx],rdx
00007ffe`94920013 488bc1          mov     rax,rcx
00007ffe`94920016 49b840dc29dba0020000 mov r8,2A0DB29DC40h
00007ffe`94920020 48c1e80c        shr     rax,0Ch
00007ffe`94920024 4903c0          add     rax,r8
00007ffe`94920027 4c8bc1          mov     r8,rcx
00007ffe`9492002a 48c1e916        shr     rcx,16h
00007ffe`9492002e 803800          cmp     byte ptr [rax],0
00007ffe`94920031 7503            jne     00007ffe`94920036
00007ffe`94920033 c600ff          mov     byte ptr [rax],0FFh
00007ffe`94920036 48b87b682f05a1020000 mov rax,2A1052F687Bh
00007ffe`94920040 803c0100        cmp     byte ptr [rcx+rax],0
00007ffe`94920044 7508            jne     00007ffe`9492004e
00007ffe`94920046 f3c3            rep ret
^^^ This is the ret whose return address is hijacked, hence it triggers the vectored exception handler
...

Stack when the vectored exception handler kicks in and waits for GC completion:

 # Child-SP          RetAddr               Call Site
00 000000db`c3c7cae8 00007ff8`08a89cee     ntdll!ZwWaitForSingleObject+0x14 [minkernel\ntdll\daytona\objfre\amd64\usrstubs.asm @ 251] 
01 000000db`c3c7caf0 00007ffe`f461ac40     KERNELBASE!WaitForSingleObjectEx+0x8e [minkernel\kernelbase\synch.c @ 1328] 
02 (Inline Function) --------`--------     coreclr!GCEvent::Impl::Wait+0xf [C:\git\runtime\src\coreclr\gc\windows\gcenv.windows.cpp @ 1381] 
03 (Inline Function) --------`--------     coreclr!GCEvent::Wait+0x16 [C:\git\runtime\src\coreclr\gc\windows\gcenv.windows.cpp @ 1431] 
04 000000db`c3c7cb90 00007ffe`f45d08b6     coreclr!WKS::GCHeap::WaitUntilGCComplete+0x30 [C:\git\runtime\src\coreclr\gc\gcee.cpp @ 265] 
05 000000db`c3c7cbc0 00007ffe`f459df91     coreclr!Thread::RareDisablePreemptiveGC+0x1d6 [C:\git\runtime\src\coreclr\vm\threadsuspend.cpp @ 2211] 
06 (Inline Function) --------`--------     coreclr!Thread::DisablePreemptiveGC+0x19 [C:\git\runtime\src\coreclr\vm\threads.h @ 1297] 
07 (Inline Function) --------`--------     coreclr!Thread::PulseGCMode+0x27 [C:\git\runtime\src\coreclr\vm\threadsuspend.cpp @ 2405] 
08 000000db`c3c7cc50 00007ffe`f460ee18     coreclr!CommonTripThread+0x59 [C:\git\runtime\src\coreclr\vm\threads.cpp @ 6953] 
09 000000db`c3c7cc80 00007ffe`f460ec9d     coreclr!CLRVectoredExceptionHandler+0x138 [C:\git\runtime\src\coreclr\vm\excep.cpp @ 6546] 
0a 000000db`c3c7cd00 00007ff8`0b627f7a     coreclr!CLRVectoredExceptionHandlerShim+0xdd [C:\git\runtime\src\coreclr\vm\excep.cpp @ 7275] 
0b 000000db`c3c7cd50 00007ff8`0b5ce632     ntdll!RtlpCallVectoredHandlers+0x112 [minkernel\ntdll\vectxcpt.c @ 204] 
0c (Inline Function) --------`--------     ntdll!RtlCallVectoredExceptionHandlers+0xe [minkernel\ntdll\vectxcpt.c @ 358] 
0d 000000db`c3c7cdf0 00007ff8`0b65416e     ntdll!RtlDispatchException+0x62 [minkernel\ntos\rtl\amd64\exdsptch.c @ 406] 
0e 000000db`c3c7d040 00007ffe`94920046     ntdll!KiUserExceptionDispatch+0x2e [minkernel\ntos\rtl\amd64\trampoln.asm @ 755] 
0f 000000db`c3c7d768 00007ffe`955c16a2     0x00007ffe`94920046 <<< This is the write barrier code

In the CLRVectoredExceptionHandler, we move the interruptedContext to the caller of the write barrier, as you can see here:

        // Change the IP to be at the original return site, as if we have returned to the caller.
        // That IP is an interruptible safe point, so we can suspend right there.
        uintptr_t origIp = interruptedContext->Rip; << 0x00007ffe`94920046
        interruptedContext->Rip = (uintptr_t)pThread->GetHijackedReturnAddress(); << 00007ffe`955c16a2

        FrameWithCookie<ResumableFrame> frame(pExceptionInfo->ContextRecord);
        frame.Push(pThread);
        CommonTripThread();
        frame.Pop(pThread);

And then the thread waits for the GC to complete.

On another thread, the one that runs the GC, the GC barrier is changed from WRITE_BARRIER_WRITE_WATCH_BIT_REGIONS64 to WRITE_BARRIER_BIT_REGIONS64.
It stays there even after we return from the CLRVectoredExceptionHandler and we crash, because the 0x00007ffe`94920046 is now in the middle of an instruction and looks as if it was

00007ffe`94920046 0000            add     byte ptr [rax],al ds:000002a1`052f687b=??

@janvorli
Copy link
Member

The thing that's not clear to me yet is how come we end up hijacking return address of a write barrier. We should only attempt to hijack return address of a managed method.

@VSadov
Copy link
Member

VSadov commented Oct 18, 2024

The thing that's not clear to me yet is how come we end up hijacking return address of a write barrier. We should only attempt to hijack return address of a managed method.

Yes, this is the disturbing part. There are many ways in which this could lead to a failure.
GC barriers are classified as no-GC helpers and callsites are not safepoints. Even trying to unwind could be risky.

Besides, we generally consider native code as not signal-safe, so that is the first thing that handler checks - "are we in managed code?".

@janvorli
Copy link
Member

I guess I have an answer for this. The key is that the write barrier is tail called from a managed method in this case (the CastHelpers.StelemRef). What happens is that we hijack the return address while we are in the CastHelpers.StelemRef, but then it tail calls into the barrier and the return from the barrier hits the vectored exception handler.

@VSadov
Copy link
Member

VSadov commented Oct 18, 2024

Hijacking the caller should be ok, I think, even if it then tailcalls the barrier.

@janvorli
Copy link
Member

janvorli commented Oct 18, 2024

The problem is that vectored exception handler returns to the location where the "ret" was located in the barrier and then the crash occurs, because there is no longer a ret. I guess that's how Windows implement the special hijacking - it just returns from the vectored exception handler. Since the return address should be fixed now, it would just re-run the ret instruction and return back to the caller. But if the return instruction is gone, it just ends up continuing execution of whatever is there.

@VSadov
Copy link
Member

VSadov commented Oct 18, 2024

Would we have to suspend runtime in order to swap the barrier code though?

@EgorBo
Copy link
Member

EgorBo commented Oct 18, 2024

The key is that the write barrier is tail called from a managed method

is it possible? I thought we don't do tail calls for helper calls.. cc @jakobbotsch

@VSadov
Copy link
Member

VSadov commented Oct 18, 2024

And I don't know how CET would influence that since we should be using the APC both with and without CET enabled.

One way how CET can mess up hijacking is that Windows will try to unhijack as well. It will use the top of the shadow stack. This did cause problems before when our stashed hijack address (which captures return address from the actual stack) and shadow stack were not in sync.

@janvorli
Copy link
Member

Now it is actually clear, as the return address hijacking using the special address is only used when CET is enabled.

@VSadov
Copy link
Member

VSadov commented Oct 18, 2024

The key is that the write barrier is tail called from a managed method

is it possible? I thought we don't do tail calls for helper calls.. cc @jakobbotsch

That is not a JIT-inserted call. StelemRef calls the barrier directly - as an FCALL

        [MethodImpl(MethodImplOptions.InternalCall)]
        private static extern void WriteBarrier(ref object? dst, object? obj);

Can these be tail-called?
(I think that still would be ok, just wonder if this is the reason why StelemRef is special here)

@VSadov
Copy link
Member

VSadov commented Oct 18, 2024

Would we have to suspend runtime in order to swap the barrier code though?

Figured that.
The runtime will indeed be suspended before we resume from the handler. There will be an opportunity for the barrier code to change.

@VSadov
Copy link
Member

VSadov commented Oct 18, 2024

An obvious way to fix this is to forbid tailcalling of WriteBarrier. That might regress perf of array element writes.

Another thought is:

When returning from the hijack handler, we perform the "undo the pop to hijacked call site" - that is specifically to hit the ret that is in this case problematic.

        if (areShadowStacksEnabled)
        {
            // Undo the "pop", so that the ret could now succeed.
            interruptedContext->Rsp = interruptedContext->Rsp - 8;
            interruptedContext->Rip = origIp;
        }

I do not recall all the reasons why we do this.
If it is just to match the shadow stack and let both stacks pop naturally, perhaps we could pop the shadow stack instead? This would ensure that we resume to a safe point.

@janvorli
Copy link
Member

I think we need to hit the ret in order to pop from the shadow stack too.
As for fixing the issue, wouldn't just returning from the vectored exception handler if we find the address is not managed code sufficient? That seems like the simplest thing that should work.

@EgorBo
Copy link
Member

EgorBo commented Oct 18, 2024

Can a write barrier be a source of an exception? e.g. nullref

Can these be tail-called?

Ah, I guess so then. It's not a helper call indeed in this case.

@janvorli
Copy link
Member

Can a write barrier be a source of an exception? e.g. nullref

Yes, it can. But we detect that in the EH code and pretend that the exception occurred in the caller of the barrier.

@janvorli
Copy link
Member

I've verified that the fix I have suggested above (return from VEH when we find the IP is not in managed code - we can actually explicitly check if it is in any helper or just write barrier) works with the repro app.

@VSadov
Copy link
Member

VSadov commented Oct 19, 2024

I think we need to hit the ret in order to pop from the shadow stack too.

That is what I mean - instead of re-popping the context and relying on ret to pop both stacks, we could just pop the shadow stack (i.e. incsspq)

That would not require probing the return address for managedness or being in range of barriers.

@VSadov
Copy link
Member

VSadov commented Oct 19, 2024

As for fixing the issue, wouldn't just returning from the vectored exception handler if we find the address is not managed code sufficient? That seems like the simplest thing that should work.

As in - unhijack current thread and return VEH_CONTINUE_EXECUTION ?

I think that would work. Perhaps could check just for the write barrier as that is the only thing that can get its code replaced. Other cases where we tailcall FCALLs should be ok.

(and the "incsspq" solution might not work if OS pushes some stuff on SSP before calling our handler)

@janvorli
Copy link
Member

We basically don't need to do anything, just check the IP to be in a write barrier and return the VEH_CONTINUE_EXECUTION if it was, right after the _ASSERTE in the code below:

if (areShadowStacksEnabled)
{
// OS should have fixed the SP value to the same as we`ve stashed for the hijacked thread
_ASSERTE(*(size_t *)interruptedContext->Rsp == (uintptr_t)pThread->GetHijackedReturnAddress());
// When the CET is enabled, the interruption happens on the ret instruction in the calee.
// We need to "pop" rsp to the caller, as if the ret has consumed it.
interruptedContext->Rsp += 8;
}

So it looks like this:

        if (areShadowStacksEnabled)
        {
            // OS should have fixed the SP value to the same as we`ve stashed for the hijacked thread
            _ASSERTE(*(size_t *)interruptedContext->Rsp == (uintptr_t)pThread->GetHijackedReturnAddress());

            if (IsIPInWriteBarrierCodeCopy(interruptedContext->Rip))
            {
                return VEH_CONTINUE_EXECUTION;
            }

            // When the CET is enabled, the interruption happens on the ret instruction in the calee.
            // We need to "pop" rsp to the caller, as if the ret has consumed it.
            interruptedContext->Rsp += 8;
        }

@VSadov
Copy link
Member

VSadov commented Oct 19, 2024

We need to clear the hijacked state before allowing the thread to “escape” above it.
It is permitted for a thread to clear its hijack, as long as it is in coop mode, but not permitted to simply ignore it. On AOT a thread seen in “escaped” state may cause asserts, not sure about core.
The reason is that removing the hijack later could become dangerous, if the thread ends up below the hijack again, but with different stack.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-VM-coreclr regression-from-last-release tenet-reliability Reliability/stability related issue (stress, load problems, etc.)
Projects
None yet
Development

No branches or pull requests

6 participants