SortedDictionary<BigInteger, MyObject> System.AccessViolationException #108763

swtrse · 2024-10-10T20:18:31Z

Description

HI,

I ran into a problem were using SortedDictionary<BigInteger, MyObject> ran into a System.AccessViolationException every single time.
The error occured during an BenchmarkDotNet run.
The error does not occur in debug mode. As far as I found out the error only appeared in Release mode.
Also, the error disappears if you want to analyze the behavior with dotMemory.

Reproduction Steps

MyObject is a sealed generic class bit I think it does not matter.

public sealed class Test
{
    private readonly SortedDictionary<BigInteger, MyObject> _limitSortedList;

    .......

    private MyObject? GetNextNode(BigInteger target)
    {
	if (_limitSortedList.TryGetValue(target, out Swtrse2Node<TKey, TOrder>? node)) { return node; }

	var low = 0;
	int high = _limitSortedList.Count - 1;
	List<BigInteger> list = _limitSortedList.Keys.ToList();   //<---- Exception

	while (low <= high)
	{
		int mid = (low + high) / 2;
		int comparison = _limitSortedList.Comparer.Compare(list[mid], target);

		if (comparison < 0) { low = mid + 1; }
		else { high = mid - 1; }
	}

	return low < _limitSortedList.Count ? _limitSortedList[list[low]] : null;
    }
}

Expected behavior

I expect no AccessViolationException when calling Framework functions

Actual behavior

Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at System.Collections.Generic.List1[[System.Numerics.BigInteger, System.Runtime.Numerics, Version=9.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a]]..ctor(System.Collections.Generic.IEnumerable1<System.Numerics.BigInteger>)

Regression?

No response

Known Workarounds

No response

Configuration

// BenchmarkDotNet v0.14.0
// Runtime=.NET 9.0.0 (9.0.24.43107), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
// GC=Concurrent Workstation
// HardwareIntrinsics=AVX-512F+CD+BW+DQ+VL+VBMI,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT VectorSize=256

Other information

No response

The text was updated successfully, but these errors were encountered:

dotnet-policy-service · 2024-10-10T20:18:54Z

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

EgorBo · 2024-10-10T20:55:41Z

Please attach a self-contained repro or a crash dump, it's not possible to diagnose the issue from just the given snippet. It could be caused by a semi-valid unsafe/interop in your application (stack traces could be confusing in such cases) or a thread-safety issue (SortedDictionary is not thread safe). Also, is this Linux or Windows? BDN has a known issue on Linux + DisassemblyDiagnoser.

swtrse · 2024-10-11T00:00:58Z

Problem is. I tried using dotMemory to analyse the error. As soon as the memory profiler is attached the error is gone.
I see no option to generate a crashdump in any way.

I run on Windows 11.
I do not use any unsafe code.

jkotas · 2024-10-11T04:52:10Z

I see no option to generate a crashdump in any way.

You can enable crashdumps by following https://learn.microsoft.com/en-us/windows/win32/wer/collecting-user-mode-dumps (use FullDump - it has the complete info)

swtrse · 2024-10-11T07:40:24Z

Thx here is the dump https://we.tl/t-zWlR3MhCbz. Sorry I had to use wetransfer ,github is blocking the upload.

And I strongly believe the Error is within SortedDictionary<>.Keys in combination with ToList() or ToArray()
Justification:

If I change the code to a version without calling ToList() or ToArray() the error is gone
If I change SortedDictionary<> to SortedList<> the error is gone
The error happens instantly after starting the application. I suspect on one of the first calls.

jkotas · 2024-10-11T13:08:15Z

Details of the crash - bad GC pointer:

System_Collections!System.Collections.Generic.SortedSet<System.Collections.Generic.KeyValuePair<System.Numerics.BigInteger,System.__Canon>>.InOrderTreeWalk+0x135:
00007ffe`b79fdbb5 ff4514          inc     dword ptr [rbp+14h] ss:0000013a`1638abfc=00002213

The crash is on .NET 9 RC1. We fixed several bugs that can lead to intermittent crashes like this one for .NET 9 RC2. Could you please upgrade to .NET 9 RC2 and see whether it is going to fix the crash?

Thx here is the dump https://we.tl/t-zWlR3MhCbz.

Next time, you can create an issue at https://developercommunity.visualstudio.com/, attach the dump to it and link it from here. developercommunity does not have uploads limits like github and it has better privacy controls.

swtrse · 2024-10-15T19:57:31Z

The crash still remains with .Net 9 RC2.

jkotas · 2024-10-15T22:25:22Z

Could you please upload a .NET 9 RC2 crash dump?

swtrse · 2024-10-16T06:38:04Z

I uploaded the crashdump here:
https://developercommunity.visualstudio.com/t/SortedDictionary-SystemAccessViolationE/10769841

jkotas · 2024-10-16T09:06:34Z

Thanks - the RC2 crash is exactly same as the RC1 crash.

There is weird discrepancy between the state of the registers and the faulting address in EXCEPTION_RECORD in both dumps.

00007ffe`b79fdbb5 ff4514          inc     dword ptr [rbp+14h] ss:0000013a`1638abfc=00002213

vs.

0:024> dx -r1 ((coreclr!_EXCEPTION_RECORD *)0x38c4efc4f0)
((coreclr!_EXCEPTION_RECORD *)0x38c4efc4f0)                 : 0x38c4efc4f0 [Type: _EXCEPTION_RECORD *]
    [+0x000] ExceptionCode    : 0xc0000005 [Type: unsigned long]
    [+0x004] ExceptionFlags   : 0x1 [Type: unsigned long]
    [+0x008] ExceptionRecord  : 0x0 [Type: _EXCEPTION_RECORD *]
    [+0x010] ExceptionAddress : 0x7ffc351acd62 [Type: void *]
    [+0x018] NumberParameters : 0x2 [Type: unsigned long]
    [+0x020] ExceptionInformation [Type: unsigned __int64 [15]]
0:024> dx -r1 (*((coreclr!unsigned __int64 (*)[15])0x38c4efc510))
(*((coreclr!unsigned __int64 (*)[15])0x38c4efc510))                 [Type: unsigned __int64 [15]]
    [0]              : 0x1 [Type: unsigned __int64]
    [1]              : 0x24eba52b1a6 [Type: unsigned __int64] <- This should be same as [rbp+14h] ss:0000013a`1638abfc

This suggests that hardware somehow accessed a different address than what the registers are pointing at.

Are you able to reproduce this crash on more than one machine? One possible explanation of these symptoms is a hardware defect.

jkotas · 2024-10-16T09:11:34Z

Also, could you please try whether it reproduces with disabled tiered compilation (e.g. DOTNET_TieredCompilation=0 environment variable)?

The faulting method has 4 OptimizedTier1OSR variants of the code that is unusual. It may be related.

  IL Addr:            00007fff8c6cc624
     CodeAddr:           00007ffeb79e5ca0  (OptimizedTier1OSR)
     NativeCodeVersion:  0000013A11EB5F00
     CodeAddr:           00007ffeb79e5fa0  (OptimizedTier1OSR)
     NativeCodeVersion:  0000013A11EB5C40
     CodeAddr:           00007ffeb79fcb40  (QuickJitted + Instrumented)
     NativeCodeVersion:  0000013A0078C9D0
     CodeAddr:           00007ffeb79fd220  (OptimizedTier1OSR)
     NativeCodeVersion:  0000013A11EB6180
     CodeAddr:           00007ffeb79fda80  (OptimizedTier1OSR)
     NativeCodeVersion:  0000013A11EB5AC0
     CodeAddr:           0000000000000000  (OptimizedTier1)
     NativeCodeVersion:  0000013A00794FC0
     CodeAddr:           00007ffeb79e3e50  (QuickJitted)
     NativeCodeVersion:  0000000000000000

swtrse · 2024-10-16T18:52:01Z

Also, could you please try whether it reproduces with disabled tiered compilation (e.g. DOTNET_TieredCompilation=0 environment variable)?

I tried that with no effect, The error is still the same

swtrse · 2024-10-17T08:49:48Z

This suggests that hardware somehow accessed a different address than what the registers are pointing at.

Are you able to reproduce this crash on more than one machine? One possible explanation of these symptoms is a hardware defect.

Thats highly unlikely. The system despite using windows 11 as an OS runs rock solid for weeks without reboot. A Hardware defect in this kind of area should have a negative impact on system stability in general, but I do not see it.

The error did not appear on a second machine I tried. But the other machine is a complexly different CPU generation. Maybe it is something CPU or chipset specific.

I did backport the solution to .net 8 as soon as the project runs with .net8 the error is gone. Even when Benchmark.net running with .net 9 environment

// Benchmark Process Environment Information:
// BenchmarkDotNet v0.14.0
// Runtime=.NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
// GC=Concurrent Workstation
// HardwareIntrinsics=AVX-512F+CD+BW+DQ+VL+VBMI,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT VectorSize=256
// Job: Job-NHHCHO(Runtime=.NET 9.0, Toolchain=InProcessEmitToolchain, InvocationCount=1, IterationCount=3, LaunchCount=1, UnrollFactor=1, WarmupCount=3)

I can provide the whole solution if that is of any help.

jkotas · 2024-10-17T10:46:02Z

Ok, it is a good data point that it does not repro on the other machine. Here is another thing to try: Could you please try to add <PropertyGroup> <CETCompat>false</CETCompat> </PropertyGroup> to your .csproj file and check whether it makes the problem to go away? (CET is a hardware security feature that we have enabled in .NET 9 by default when it is available - https://learn.microsoft.com/en-us/dotnet/core/compatibility/interop/9.0/cet-support has more details.)

I can provide the whole solution if that is of any help.

Yes, that would help too.

swtrse · 2024-10-17T11:11:55Z

Ok, it is a good data point that it does not repro on the other machine. Here is another thing to try: Could you please try to add <PropertyGroup> <CETCompat>false</CETCompat> </PropertyGroup> to your .csproj file and check whether it makes the problem to go away? (CET is a hardware security feature that we have enabled in .NET 9 by default when it is available - https://learn.microsoft.com/en-us/dotnet/core/compatibility/interop/9.0/cet-support has more details.)

That did it!
With this setting the error is gone.
I only did set it in the console app that's running the benchmarks, and the error is no more. I was prepared to add that switch to all libraries in the solution, but it was not necessary.

Is this enough information or still need the whole solution?

dotnet-policy-service · 2024-10-17T12:27:18Z

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

jkotas · 2024-10-17T12:32:49Z

The sequence that leads to the crash (#108763 (comment) is the full dump to look at):

SortedSet code called CastHelpers.StelemRef
CastHelpers.StelemRef tail-called WriteBarrier
We hit some sort of bad crash inside WriteBarrier
We unwound from the write-barrier and patched the exception context that makes it look like the crash is in SortedSet code

I suspect that the bad crash inside WriteBarrier is somehow caused by CET-aware thread suspension interleaving with the above sequence. The crash does not repro when CET is disabled.

@janvorli @VSadov Does it ring any bells? Could you please take it from here?

janvorli · 2024-10-17T16:12:32Z

What I don't understand though is that even if we hit whatever issue in the write barrier, then if we adjusted the context back to the SortedSet, we should have invoked the exception handling code, but the dump seems to indicate that it didn't happen.
I think that the fact that the exception location is not the one in the write barrier but instead at its caller is just a side effect of us unwinding write barrier in the exception context we get from the OS and watson seeing that modified context.

I can provide the whole solution if that is of any help.

@swtrse that would be great, it would enable me to debug why the managed exception handling didn't kick in. Running !analyze on the dump shows the crash was due to an attempt to write to address 0000017a27a803f. Since that's out of the null reference detection area, that's likely causing the AV.

jkotas · 2024-10-17T16:26:25Z

we should have invoked the exception handling code

We have invoked some exception handling code. Thread.m_ExceptionState.m_pCurrentTracker.m_ptrs has the information about the crash.

Running !analyze on the dump shows the crash was due to an attempt to write to address 0000017a27a803f.

I think you meant 0x0000017a27a803f1. It is address of RegionToGeneration map in the write barrier. I do not see how we ended up writing to that address.

CORINFO_HELP_ASSIGN_REF:
00007ffe`b6da0010 488911          mov     qword ptr [rcx],rdx
00007ffe`b6da0013 4c8bc1          mov     r8,rcx
00007ffe`b6da0016 48b8f103a8277a010000 mov rax,17A27A803F1h <----
00007ffe`b6da0020 48c1e916        shr     rcx,16h
00007ffe`b6da0024 803c0100        cmp     byte ptr [rcx+rax],0
00007ffe`b6da0028 7504            jne     00007ffe`b6da002e
...

janvorli · 2024-10-17T17:02:36Z

Ah, right. We have called EEPolicy::HandleFatalError in FailFastIfCorruptingStateException and so Watson took the context passed to the HandleFatalError as the one to show as the top frame.

janvorli · 2024-10-17T17:03:06Z

I think you meant 0x0000017a27a803f1.

Right, incomplete cut&paste

swtrse · 2024-10-17T22:42:05Z

I did upload my whole sourcecode to
https://developercommunity.visualstudio.com/t/SortedDictionary-SystemAccessViolationE/10769841#T-N10771528

janvorli · 2024-10-18T08:17:54Z

Thank you! I'll investigate it.

janvorli · 2024-10-18T12:33:13Z

@swtrse could you please also share the command line you use to run the problematic benchmark?

swtrse · 2024-10-18T13:37:40Z

@swtrse could you please also share the command line you use to run the problematic benchmark?
Just the exe without parameters compiled in release mode. The error happens in my case as soon as the first benchmark starts.

Cobra.Exoda.Matching\Cobra.Exoda.Matching.BenchmarkRunner\bin\Release\net9.0>Cobra.Exoda.Matching.BenchmarkRunner.exe

You may have to remove the <CETCompat>false</CETCompat> entry from the project. I do not remember if I edited the file before or after I uploaded the project.

janvorli · 2024-10-18T18:12:22Z

When running under WinDbg, I can see that the crash really occurs when accessing the address in rax. And the current Rip is in the middle of a machine instruction of a write barrier.

CORINFO_HELP_ASSIGN_REF:
00007ffe`ffca0010 488911          mov     qword ptr [rcx],rdx
00007ffe`ffca0013 4c8bc1          mov     r8,rcx
00007ffe`ffca0016 48b86d3e9a88eb010000 mov rax,1EB889A3E6Dh
00007ffe`ffca0020 48c1e916        shr     rcx,16h
00007ffe`ffca0024 803c0100        cmp     byte ptr [rcx+rax],0
00007ffe`ffca0028 7504            jne     00007ffe`ffca002e
00007ffe`ffca002a f3c3            rep ret
00007ffe`ffca002c 6690            nop
00007ffe`ffca002e 49b90000c074ab010000 mov r9,1AB74C00000h
00007ffe`ffca0038 493bd1          cmp     rdx,r9
00007ffe`ffca003b 7301            jae     00007ffe`ffca003e
00007ffe`ffca003d c3              ret
00007ffe`ffca003e 49b900008074eb010000 mov r9,1EB74800000h
^^^ When the crash occurs, the Rip is at 00007ffe`ffca0046. This happens in all runs. Just the address is obviously different in the bits above bit 11 due to the fact it is a dynamically allocated page. 
00007ffe`ffca0048 493bd1          cmp     rdx,r9
00007ffe`ffca004b 7202            jb      00007ffe`ffca004f

Here is my theory on what happens:

The special user mode APC we use to suspend the thread occurs at the crashing location, but in a different flavor of write barrier
While in our APC handler code, the GC write barrier is switched to another flavor - the one above (WRITE_BARRIER_BIT_REGIONS64)
The APC handler returns to the original location, but the code of the barrier is now different, so we end up in the middle of an instruction.

I still need to verify this assumption though.

janvorli · 2024-10-18T18:15:51Z

And I don't know how CET would influence that since we should be using the APC both with and without CET enabled.

EgorBo · 2024-10-18T18:20:52Z

the GC write barrier is switched to another flavor - the one above

I thought that write-barriers aren't interruptible and GC only changes write-barriers on pauses?

janvorli · 2024-10-18T20:05:05Z

Looks like my theory was wrong. The APC handler never changes the write barrier for the case when we are in the write barrier. So I don't know what's causing the execution to go to a middle of an instruction yet. I will need to investigate that further.

janvorli · 2024-10-18T21:18:21Z

Ok, I've found the real culprit. We hijack return address of the write barrier using the special hijack address for CET. Then the write barrier returns and that kicks in the vectored exception handler, as you can see below:

This is the write barrier code:

CORINFO_HELP_ASSIGN_REF:
00007ffe`94920010 488911          mov     qword ptr [rcx],rdx
00007ffe`94920013 488bc1          mov     rax,rcx
00007ffe`94920016 49b840dc29dba0020000 mov r8,2A0DB29DC40h
00007ffe`94920020 48c1e80c        shr     rax,0Ch
00007ffe`94920024 4903c0          add     rax,r8
00007ffe`94920027 4c8bc1          mov     r8,rcx
00007ffe`9492002a 48c1e916        shr     rcx,16h
00007ffe`9492002e 803800          cmp     byte ptr [rax],0
00007ffe`94920031 7503            jne     00007ffe`94920036
00007ffe`94920033 c600ff          mov     byte ptr [rax],0FFh
00007ffe`94920036 48b87b682f05a1020000 mov rax,2A1052F687Bh
00007ffe`94920040 803c0100        cmp     byte ptr [rcx+rax],0
00007ffe`94920044 7508            jne     00007ffe`9492004e
00007ffe`94920046 f3c3            rep ret
^^^ This is the ret whose return address is hijacked, hence it triggers the vectored exception handler
...

Stack when the vectored exception handler kicks in and waits for GC completion:

 # Child-SP          RetAddr               Call Site
00 000000db`c3c7cae8 00007ff8`08a89cee     ntdll!ZwWaitForSingleObject+0x14 [minkernel\ntdll\daytona\objfre\amd64\usrstubs.asm @ 251] 
01 000000db`c3c7caf0 00007ffe`f461ac40     KERNELBASE!WaitForSingleObjectEx+0x8e [minkernel\kernelbase\synch.c @ 1328] 
02 (Inline Function) --------`--------     coreclr!GCEvent::Impl::Wait+0xf [C:\git\runtime\src\coreclr\gc\windows\gcenv.windows.cpp @ 1381] 
03 (Inline Function) --------`--------     coreclr!GCEvent::Wait+0x16 [C:\git\runtime\src\coreclr\gc\windows\gcenv.windows.cpp @ 1431] 
04 000000db`c3c7cb90 00007ffe`f45d08b6     coreclr!WKS::GCHeap::WaitUntilGCComplete+0x30 [C:\git\runtime\src\coreclr\gc\gcee.cpp @ 265] 
05 000000db`c3c7cbc0 00007ffe`f459df91     coreclr!Thread::RareDisablePreemptiveGC+0x1d6 [C:\git\runtime\src\coreclr\vm\threadsuspend.cpp @ 2211] 
06 (Inline Function) --------`--------     coreclr!Thread::DisablePreemptiveGC+0x19 [C:\git\runtime\src\coreclr\vm\threads.h @ 1297] 
07 (Inline Function) --------`--------     coreclr!Thread::PulseGCMode+0x27 [C:\git\runtime\src\coreclr\vm\threadsuspend.cpp @ 2405] 
08 000000db`c3c7cc50 00007ffe`f460ee18     coreclr!CommonTripThread+0x59 [C:\git\runtime\src\coreclr\vm\threads.cpp @ 6953] 
09 000000db`c3c7cc80 00007ffe`f460ec9d     coreclr!CLRVectoredExceptionHandler+0x138 [C:\git\runtime\src\coreclr\vm\excep.cpp @ 6546] 
0a 000000db`c3c7cd00 00007ff8`0b627f7a     coreclr!CLRVectoredExceptionHandlerShim+0xdd [C:\git\runtime\src\coreclr\vm\excep.cpp @ 7275] 
0b 000000db`c3c7cd50 00007ff8`0b5ce632     ntdll!RtlpCallVectoredHandlers+0x112 [minkernel\ntdll\vectxcpt.c @ 204] 
0c (Inline Function) --------`--------     ntdll!RtlCallVectoredExceptionHandlers+0xe [minkernel\ntdll\vectxcpt.c @ 358] 
0d 000000db`c3c7cdf0 00007ff8`0b65416e     ntdll!RtlDispatchException+0x62 [minkernel\ntos\rtl\amd64\exdsptch.c @ 406] 
0e 000000db`c3c7d040 00007ffe`94920046     ntdll!KiUserExceptionDispatch+0x2e [minkernel\ntos\rtl\amd64\trampoln.asm @ 755] 
0f 000000db`c3c7d768 00007ffe`955c16a2     0x00007ffe`94920046 <<< This is the write barrier code

In the CLRVectoredExceptionHandler, we move the interruptedContext to the caller of the write barrier, as you can see here:

        // Change the IP to be at the original return site, as if we have returned to the caller.
        // That IP is an interruptible safe point, so we can suspend right there.
        uintptr_t origIp = interruptedContext->Rip; << 0x00007ffe`94920046
        interruptedContext->Rip = (uintptr_t)pThread->GetHijackedReturnAddress(); << 00007ffe`955c16a2

        FrameWithCookie<ResumableFrame> frame(pExceptionInfo->ContextRecord);
        frame.Push(pThread);
        CommonTripThread();
        frame.Pop(pThread);

And then the thread waits for the GC to complete.

On another thread, the one that runs the GC, the GC barrier is changed from WRITE_BARRIER_WRITE_WATCH_BIT_REGIONS64 to WRITE_BARRIER_BIT_REGIONS64.
It stays there even after we return from the CLRVectoredExceptionHandler and we crash, because the 0x00007ffe`94920046 is now in the middle of an instruction and looks as if it was

00007ffe`94920046 0000            add     byte ptr [rax],al ds:000002a1`052f687b=??

janvorli · 2024-10-18T21:34:08Z

The thing that's not clear to me yet is how come we end up hijacking return address of a write barrier. We should only attempt to hijack return address of a managed method.

VSadov · 2024-10-18T21:51:57Z

The thing that's not clear to me yet is how come we end up hijacking return address of a write barrier. We should only attempt to hijack return address of a managed method.

Yes, this is the disturbing part. There are many ways in which this could lead to a failure.
GC barriers are classified as no-GC helpers and callsites are not safepoints. Even trying to unwind could be risky.

Besides, we generally consider native code as not signal-safe, so that is the first thing that handler checks - "are we in managed code?".

janvorli · 2024-10-18T22:04:50Z

I guess I have an answer for this. The key is that the write barrier is tail called from a managed method in this case (the CastHelpers.StelemRef). What happens is that we hijack the return address while we are in the CastHelpers.StelemRef, but then it tail calls into the barrier and the return from the barrier hits the vectored exception handler.

VSadov · 2024-10-18T22:07:51Z

Hijacking the caller should be ok, I think, even if it then tailcalls the barrier.

janvorli · 2024-10-18T22:12:17Z

The problem is that vectored exception handler returns to the location where the "ret" was located in the barrier and then the crash occurs, because there is no longer a ret. I guess that's how Windows implement the special hijacking - it just returns from the vectored exception handler. Since the return address should be fixed now, it would just re-run the ret instruction and return back to the caller. But if the return instruction is gone, it just ends up continuing execution of whatever is there.

VSadov · 2024-10-18T22:16:00Z

Would we have to suspend runtime in order to swap the barrier code though?

EgorBo · 2024-10-18T22:17:58Z

The key is that the write barrier is tail called from a managed method

is it possible? I thought we don't do tail calls for helper calls.. cc @jakobbotsch

VSadov · 2024-10-18T22:20:22Z

And I don't know how CET would influence that since we should be using the APC both with and without CET enabled.

One way how CET can mess up hijacking is that Windows will try to unhijack as well. It will use the top of the shadow stack. This did cause problems before when our stashed hijack address (which captures return address from the actual stack) and shadow stack were not in sync.

janvorli · 2024-10-18T22:25:46Z

Now it is actually clear, as the return address hijacking using the special address is only used when CET is enabled.

VSadov · 2024-10-18T22:29:26Z

The key is that the write barrier is tail called from a managed method

is it possible? I thought we don't do tail calls for helper calls.. cc @jakobbotsch

That is not a JIT-inserted call. StelemRef calls the barrier directly - as an FCALL

        [MethodImpl(MethodImplOptions.InternalCall)]
        private static extern void WriteBarrier(ref object? dst, object? obj);

Can these be tail-called?
(I think that still would be ok, just wonder if this is the reason why StelemRef is special here)

VSadov · 2024-10-18T22:42:21Z

Would we have to suspend runtime in order to swap the barrier code though?

Figured that.
The runtime will indeed be suspended before we resume from the handler. There will be an opportunity for the barrier code to change.

VSadov · 2024-10-18T22:56:05Z

An obvious way to fix this is to forbid tailcalling of WriteBarrier. That might regress perf of array element writes.

Another thought is:

When returning from the hijack handler, we perform the "undo the pop to hijacked call site" - that is specifically to hit the ret that is in this case problematic.

        if (areShadowStacksEnabled)
        {
            // Undo the "pop", so that the ret could now succeed.
            interruptedContext->Rsp = interruptedContext->Rsp - 8;
            interruptedContext->Rip = origIp;
        }

I do not recall all the reasons why we do this.
If it is just to match the shadow stack and let both stacks pop naturally, perhaps we could pop the shadow stack instead? This would ensure that we resume to a safe point.

janvorli · 2024-10-18T23:27:48Z

I think we need to hit the ret in order to pop from the shadow stack too.
As for fixing the issue, wouldn't just returning from the vectored exception handler if we find the address is not managed code sufficient? That seems like the simplest thing that should work.

EgorBo · 2024-10-18T23:41:12Z

Can a write barrier be a source of an exception? e.g. nullref

Can these be tail-called?

Ah, I guess so then. It's not a helper call indeed in this case.

janvorli · 2024-10-18T23:42:12Z

Can a write barrier be a source of an exception? e.g. nullref

Yes, it can. But we detect that in the EH code and pretend that the exception occurred in the caller of the barrier.

janvorli · 2024-10-18T23:53:06Z

I've verified that the fix I have suggested above (return from VEH when we find the IP is not in managed code - we can actually explicitly check if it is in any helper or just write barrier) works with the repro app.

VSadov · 2024-10-19T00:09:09Z

I think we need to hit the ret in order to pop from the shadow stack too.

That is what I mean - instead of re-popping the context and relying on ret to pop both stacks, we could just pop the shadow stack (i.e. incsspq)

That would not require probing the return address for managedness or being in range of barriers.

VSadov · 2024-10-19T00:32:07Z

As for fixing the issue, wouldn't just returning from the vectored exception handler if we find the address is not managed code sufficient? That seems like the simplest thing that should work.

As in - unhijack current thread and return VEH_CONTINUE_EXECUTION ?

I think that would work. Perhaps could check just for the write barrier as that is the only thing that can get its code replaced. Other cases where we tailcall FCALLs should be ok.

(and the "incsspq" solution might not work if OS pushes some stuff on SSP before calling our handler)

janvorli · 2024-10-19T09:08:17Z

We basically don't need to do anything, just check the IP to be in a write barrier and return the VEH_CONTINUE_EXECUTION if it was, right after the _ASSERTE in the code below:

runtime/src/coreclr/vm/excep.cpp

Lines 6525 to 6533 in d6ca550

    
           if (areShadowStacksEnabled) 
        
           { 
        
               // OS should have fixed the SP value to the same as we`ve stashed for the hijacked thread 
        
               _ASSERTE(*(size_t *)interruptedContext->Rsp == (uintptr_t)pThread->GetHijackedReturnAddress()); 
        
               // When the CET is enabled, the interruption happens on the ret instruction in the calee. 
        
               // We need to "pop" rsp to the caller, as if the ret has consumed it. 
        
               interruptedContext->Rsp += 8; 
        
           }

So it looks like this:

        if (areShadowStacksEnabled)
        {
            // OS should have fixed the SP value to the same as we`ve stashed for the hijacked thread
            _ASSERTE(*(size_t *)interruptedContext->Rsp == (uintptr_t)pThread->GetHijackedReturnAddress());

            if (IsIPInWriteBarrierCodeCopy(interruptedContext->Rip))
            {
                return VEH_CONTINUE_EXECUTION;
            }

            // When the CET is enabled, the interruption happens on the ret instruction in the calee.
            // We need to "pop" rsp to the caller, as if the ret has consumed it.
            interruptedContext->Rsp += 8;
        }

VSadov · 2024-10-19T13:18:38Z

We need to clear the hijacked state before allowing the thread to “escape” above it.
It is permitted for a thread to clear its hijack, as long as it is in coop mode, but not permitted to simply ignore it. On AOT a thread seen in “escaped” state may cause asserts, not sure about core.
The reason is that removing the hijack later could become dangerous, if the thread ends up below the hijack again, but with different stack.

dotnet-issue-labeler bot added the area-System.Collections label Oct 10, 2024

dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Oct 10, 2024

jkotas added area-VM-coreclr regression-from-last-release tenet-reliability Reliability/stability related issue (stress, load problems, etc.) and removed area-System.Collections labels Oct 17, 2024

mangod9 removed the untriaged New issue has not been triaged by the area owner label Oct 17, 2024

mangod9 added this to the 10.0.0 milestone Oct 17, 2024

SortedDictionary<BigInteger, MyObject> System.AccessViolationException #108763

SortedDictionary<BigInteger, MyObject> System.AccessViolationException #108763

Comments

swtrse commented Oct 10, 2024 • edited Loading

Description

Reproduction Steps

Expected behavior

Actual behavior

Regression?

Known Workarounds

Configuration

Other information

dotnet-policy-service bot commented Oct 10, 2024

EgorBo commented Oct 10, 2024 • edited Loading

swtrse commented Oct 11, 2024

jkotas commented Oct 11, 2024

swtrse commented Oct 11, 2024 • edited Loading

jkotas commented Oct 11, 2024

swtrse commented Oct 15, 2024

jkotas commented Oct 15, 2024

swtrse commented Oct 16, 2024

jkotas commented Oct 16, 2024

jkotas commented Oct 16, 2024

swtrse commented Oct 16, 2024 • edited Loading

swtrse commented Oct 17, 2024

jkotas commented Oct 17, 2024

swtrse commented Oct 17, 2024

dotnet-policy-service bot commented Oct 17, 2024

jkotas commented Oct 17, 2024

janvorli commented Oct 17, 2024

jkotas commented Oct 17, 2024

janvorli commented Oct 17, 2024

janvorli commented Oct 17, 2024

swtrse commented Oct 17, 2024

janvorli commented Oct 18, 2024

janvorli commented Oct 18, 2024

swtrse commented Oct 18, 2024 • edited Loading

janvorli commented Oct 18, 2024

janvorli commented Oct 18, 2024

EgorBo commented Oct 18, 2024

janvorli commented Oct 18, 2024

janvorli commented Oct 18, 2024

janvorli commented Oct 18, 2024

VSadov commented Oct 18, 2024

janvorli commented Oct 18, 2024

VSadov commented Oct 18, 2024

janvorli commented Oct 18, 2024 • edited Loading

VSadov commented Oct 18, 2024

EgorBo commented Oct 18, 2024

VSadov commented Oct 18, 2024 • edited Loading

janvorli commented Oct 18, 2024

VSadov commented Oct 18, 2024

VSadov commented Oct 18, 2024

VSadov commented Oct 18, 2024 • edited Loading

janvorli commented Oct 18, 2024

EgorBo commented Oct 18, 2024

janvorli commented Oct 18, 2024

janvorli commented Oct 18, 2024

VSadov commented Oct 19, 2024 • edited Loading

VSadov commented Oct 19, 2024

janvorli commented Oct 19, 2024

VSadov commented Oct 19, 2024 • edited Loading

swtrse commented Oct 10, 2024 •

edited

Loading

EgorBo commented Oct 10, 2024 •

edited

Loading

swtrse commented Oct 11, 2024 •

edited

Loading

swtrse commented Oct 16, 2024 •

edited

Loading

swtrse commented Oct 18, 2024 •

edited

Loading

janvorli commented Oct 18, 2024 •

edited

Loading

VSadov commented Oct 18, 2024 •

edited

Loading

VSadov commented Oct 18, 2024 •

edited

Loading

VSadov commented Oct 19, 2024 •

edited

Loading

VSadov commented Oct 19, 2024 •

edited

Loading